A proposed feature for Exim on high reliability systems
=======================================================

$Cambridge: hermes/doc/misc/hr-exim.txt,v 1.7 2004/01/05 17:29:18 fanf2 Exp $

There are two reliability problems with the new Hermes system as it
currently stands:

(1) Messages that have been delivered to a Cyrus server but not yet
    replicated are vulnerable to failure of that Cyrus server.

(2) Messages on the queue on ppsw are vulnerable to failure of that
    ppsw system.

This proposal deals with these problems by adding an "early delivery"
feature to Exim. The basic idea is to be able to deliver a message
before the final response to the sending host, instead of after as is
usually the case. This was originally described to me as a performance
hack for Sendmail that would allow it to reduce disk activity when
running as a relay; however, it allows you to do more interesting
things, as I will describe.

There are two parts to the feature:

(1) An ACL condition, early_delivery, which can be used in a DATA ACL
    to cause Exim to do a delivery attempt on the message there and
    then. There should probably be a way of turning early delivery
    failures into defers, either as an option to the condition or by
    using other ACL features.

(2) A router precondition, also called early_delivery, which allows
    you to handle early deliveries differently. The condition would
    be tested early, like the verify and test preconditions. Like
    verify and verify_only it should probably allow routers to be
    used only for early deliveries, or never, or for both.

Note that the transport of an early delivery must have a batch_max
larger than the number of recipients, because it must succeed or fail
in the same way for all of the recipients. Exim should probably fail
an early delivery if the recipients are routed differently, or are
routed to an inadequate transport.

Another tricky aspect of this feature is that an early delivery
happens before the local_scan() function and the system filter are
run.
This implies it might be difficult to implement without violating the
principle of least surprise.

Note also that an early delivery might or might not be significant,
depending on whether the message's addresses were routed unseen or
not. This means it can be used for data replication in our
high-reliability scenario, or it can simply replicate the Sendmail
performance hack that inspired the idea, or it could be used for a
kind of LMTP-lite.

How we could use it for improving reliability:

    begin acl

    check_smtp_data:

      # if the message came from another ppsw machine, immediately
      # deliver it locally, and be honest about the success
      require hosts = ppsw
              set acl_m3 = 1
              early_delivery

      # if the message came from elsewhere, immediately deliver it
      # to another ppsw machine, and tell the other machine to
      # defer if something went wrong
      defer  !hosts = ppsw
              set acl_m3 = 2
             !early_delivery

      # having done that, accept the message for normal delivery
      accept

    begin routers

    # this router does early delivery for messages from other ppsw
    # machines, in which case we do a special local delivery
    replication_receiver:
      driver = accept
      early_delivery_only
      condition = ${if eq{$acl_m3}{1} {yes} {no}}
      transport = replication_archive

    # this router does early delivery for messages from elsewhere,
    # in which case a copy is delivered to another ppsw machine
    replication_sender:
      driver = manualroute
      early_delivery_only
      condition = ${if eq{$acl_m3}{2} {yes} {no}}
      route_data = ${lookup{$primary_hostname}cdb{DB/replication.cdb}}
      unseen
      transport = smtp

    # other routers

    begin transports

    # this transport creates a temporary archive of messages from
    # other ppsw machines for recovery in case of failure
    replication_archive:
      driver = appendfile
      batch_max = 1000
      file = /var/spool/exim.repl/${sender_host_name}/${substr_0_5:$message_id}/${message_id}
      use_bsmtp

The above ensures that there is a copy of every message on disk on a
second machine (possibly in a second machine room!) before it is
accepted.
These copies can be used to recover data in the event of a ppsw or
Cyrus machine failing.

We still need some extra software to clear out the replication
archive after a certain period of time. This could be done on the
basis of a log scraper on each ppsw machine that looks out for message
completions, and communicates them to an rmrmd (replica message
removal daemon) on the other machines. The log should include the
smtp_confirmation from the replication deliveries so that the scraper
knows the message-ID on the remote machine (which determines the file
to be removed). There needs to be some delay before messages are
removed, in order to allow for the Cyrus replication engine to do its
stuff -- perhaps there should be some end-to-end communication to
ensure that the Cyrus replication has happened before the rmrmd
removes the copy from ppsw. David suggests the delay might be as long
as a couple of days.

Other possibilities:

The SMTP receiver of replicated messages could just leave them on a
special spool using queue_only mode, which is managed by the various
Exim queue-handling options.

The Cyrus duplicate suppression mechanism can deal with copies of a
message arriving via different routes (such as both the primary and
the replica). This could make it less important to remove messages
from the replica -- they can be delivered instead. However this
doesn't work if the destination of the message is not Hermes (e.g.
departmental or outgoing email).

The configuration sketch above suggests that ppsw machines might be
paired, though it would work fine if the topology was less rigid
(e.g. replicate a message to another ppsw machine selected at random),
which would also make it easier to deal with changes to ppsw.
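
As a rough sketch of the less rigid topology, the replication_sender
router shown earlier might be adapted as follows. This assumes
(it is not part of the configuration sketched earlier) that each
entry in DB/replication.cdb is changed to hold a colon-separated list
of several other ppsw machines rather than a single partner. The
existing manualroute option hosts_randomize should then shuffle that
list each time it is used, so the replica is offered to the peers in
a random order. early_delivery_only is still the option proposed in
this document, not an existing one.

    replication_sender:
      driver = manualroute
      early_delivery_only
      condition = ${if eq{$acl_m3}{2} {yes} {no}}
      # assumed: the lookup now yields a list such as
      # "ppsw2 : ppsw3 : ppsw4" (hypothetical host names)
      # instead of one fixed partner
      route_data = ${lookup{$primary_hostname}cdb{DB/replication.cdb}}
      # try the listed hosts in a random order for each message
      hosts_randomize
      unseen
      transport = smtp

With this arrangement, adding or retiring a ppsw machine would
probably only need the cdb file to be rebuilt, though the log scraper
would then also need to record which machine accepted each replica.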