A proposed feature for Exim on high reliability systems
=======================================================

$Cambridge: hermes/doc/misc/hr-exim.txt,v 1.7 2004/01/05 17:29:18 fanf2 Exp $


There are two reliability problems with the new Hermes system as it
currently stands:

(1) Messages that have been delivered to a Cyrus server but not yet
replicated are vulnerable to failure of that Cyrus server.

(2) Messages on the queue on ppsw are vulnerable to failure of that
ppsw system.

This proposal deals with these problems by adding an "early delivery"
feature to Exim. The basic idea is to be able to deliver a message
before the final response to the sending host, instead of after as is
usually the case. This was originally described to me as a performance
hack for Sendmail that would allow it to reduce disk activity when
running as a relay. However, it also allows you to do more interesting
things, as I will describe.


There are two parts to the feature:

(1) An ACL condition, early_delivery, which can be used in a DATA ACL
to cause Exim to do a delivery attempt on the message there and then.
There should probably be a way of turning early delivery failures into
defers, either as an option to the condition or by using other ACL
features.

(2) A router precondition, also called early_delivery, which allows you
to handle early deliveries differently. The condition would be tested
early, like the verify and test preconditions. Like verify and
verify_only it should probably allow routers to be used only for early
deliveries, or never, or for both.

Note that the transport of an early delivery must have a batch_max
at least as large as the number of recipients, because the delivery
must succeed or fail in the same way for all of the recipients. Exim
should probably fail an early delivery if the recipients are routed
differently, or are routed to an inadequate transport.
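
The check described above can be sketched outside Exim as follows. This
is only an illustration of the decision logic, not Exim code: the
Routed record and early_delivery_problem() are hypothetical names, and
the sketch assumes that batch_max is the maximum number of addresses
handled in a single delivery, so a value equal to the recipient count
is sufficient.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Routed:
    """Hypothetical summary of how one recipient was routed."""
    address: str
    transport: str
    batch_max: int          # batch_max of the chosen transport
    destination: str = ""   # e.g. remote host; empty for a local delivery


def early_delivery_problem(recipients: Sequence[Routed]) -> Optional[str]:
    """Return None if one batch can cover every recipient, else a reason."""
    if not recipients:
        return "no recipients"
    first = recipients[0]
    for r in recipients[1:]:
        # All recipients must share one transport and one destination.
        if (r.transport, r.destination) != (first.transport, first.destination):
            return f"{r.address} is routed differently from {first.address}"
    if first.batch_max < len(recipients):
        return (f"batch_max={first.batch_max} cannot cover "
                f"{len(recipients)} recipients")
    return None
```

A non-None result corresponds to the case where the early delivery
should be failed (or deferred) as a whole.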

Another tricky aspect of this feature is that an early delivery
happens before the local_scan() function and the system filter are
run. This implies it might be difficult to implement without violating
the principle of least surprise.

Note also that an early delivery might or might not be significant,
depending on whether the message's addresses were routed unseen or
not. This means it can be used for data replication in our
high-reliability scenario, or it can simply replicate the Sendmail
performance hack that inspired the idea, or it could be used for a
kind of LMTP-lite.


How we could use it for improving reliability:

	begin acl

	check_smtp_data:

	  # if the message came from another ppsw machine, immediately
	  # deliver it locally, and be honest about the success
	  require  hosts = ppsw
		   set acl_m3 = 1
	           early_delivery

	  # if the message came from elsewhere, immediately deliver it
	  # to another ppsw machine, and tell the sending host to
	  # defer if something went wrong
	  defer   !hosts = ppsw
		   set acl_m3 = 2
	          !early_delivery

	  # having done that, accept the message for normal delivery
	  accept

	begin routers

	# this router does early delivery for messages from other ppsw
	# machines, in which case we do a special local delivery
	replication_receiver:
	  driver              = accept
	  early_delivery_only
	  condition           = ${if eq{$acl_m3}{1} {yes} {no}}
	  transport           = replication_archive

	# this router does early delivery for messages from elsewhere,
	# in which case a copy is delivered to another ppsw machine
	replication_sender:
	  driver              = manualroute
	  early_delivery_only
	  condition           = ${if eq{$acl_m3}{2} {yes} {no}}
	  route_data          = ${lookup{$primary_hostname}cdb{DB/replication.cdb}}
	  unseen
	  transport           = smtp

	# other routers

	begin transports

	# this transport creates a temporary archive of messages from
	# other ppsw machines for recovery in case of failure
	replication_archive:
	  driver    = appendfile
	  batch_max = 1000
	  file      = /var/spool/exim.repl/${sender_host_name}/${substr_0_5:$message_id}/${message_id}
	  use_bsmtp

The above ensures that there is a copy of every message on disk on a
second machine (possibly in a second machine room!) before it is
accepted. These copies can be used to recover data in the event of a
ppsw or Cyrus machine failing.
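
For recovery or clean-up tools it may help to compute the archive
location outside Exim. The following sketch mirrors the "file"
expansion of the replication_archive transport; archive_path() is a
hypothetical helper, and the host name and message ID in the usage
note are made-up examples.

```python
from pathlib import PurePosixPath

ARCHIVE_ROOT = PurePosixPath("/var/spool/exim.repl")  # from the transport


def archive_path(sender_host_name: str, message_id: str) -> PurePosixPath:
    """Mirror the transport's file expansion: root/host/id[:5]/id."""
    for part in (sender_host_name, message_id):
        # Refuse anything that could escape the archive directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    # ${substr_0_5:$message_id} is the first five characters of the ID.
    return ARCHIVE_ROOT / sender_host_name / message_id[:5] / message_id
```

For example, archive_path("ppsw1.example.org", "1AdXyz-0001Ab-00")
names a file under /var/spool/exim.repl/ppsw1.example.org/1AdXy/.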

We still need some extra software to clear out the replication archive
after a certain period of time. This could be done on the basis of a
log scraper on each ppsw machine that looks out for message
completions, and communicates them to an rmrmd (replica message
removal daemon) on the other machines. The log should include the
smtp_confirmation from the replication deliveries so that the scraper
knows the message ID assigned by the remote machine (which determines
the file to be removed).
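
A minimal sketch of the scraper's parsing step is shown below. It
assumes the usual Exim main log layout (date, time, message ID, then
"=>" or "->" for a delivery or "Completed"), that the smtp_confirmation
log selector is on so delivery lines carry C="...", and that the remote
Exim answers DATA with "250 OK id=<message ID>". The router name
replication_sender comes from the configuration above; completions() is
a hypothetical name, and how the result reaches the rmrmd is left open.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Exim message IDs of the form XXXXXX-YYYYYY-ZZ (assumed format).
ID = r"[0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}"
DELIVERY = re.compile(
    rf"^\S+ \S+ (?P<local>{ID}) [=-]> .*\bR=replication_sender\b"
    rf'.*\bH=(?P<host>\S+).*\bC="250 OK id=(?P<remote>{ID})"'
)
COMPLETED = re.compile(rf"^\S+ \S+ (?P<local>{ID}) Completed$")


def completions(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (replica host, remote message ID) for each completed message."""
    replicas: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        m = DELIVERY.match(line)
        if m:
            # Remember where the replica went and what ID it has there.
            replicas[m["local"]] = (m["host"], m["remote"])
            continue
        m = COMPLETED.match(line)
        if m and m["local"] in replicas:
            yield replicas.pop(m["local"])
```

Each yielded pair tells the scraper which machine holds the replica and
which message ID names the file there.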

There needs to be some delay before messages are removed, in order to
allow for the Cyrus replication engine to do its stuff -- perhaps
there should be some end-to-end communication to ensure that the Cyrus
replication has happened before the rmrmd removes the copy from ppsw.
David suggests the delay might be as long as a couple of days.
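
The delayed removal could be modelled as a queue ordered by due time.
This is a sketch only: RemovalQueue is a hypothetical name, the default
of two days is just the figure suggested above, and the state is held
in memory, which a real daemon would presumably need to persist. It
does not attempt the end-to-end check with Cyrus.

```python
import heapq
from typing import List, Tuple

DEFAULT_DELAY = 2 * 24 * 60 * 60  # seconds; "a couple of days" (assumption)


class RemovalQueue:
    """Hold completed messages until the removal delay has elapsed."""

    def __init__(self, delay: float = DEFAULT_DELAY) -> None:
        self.delay = delay
        self._heap: List[Tuple[float, str]] = []

    def completed(self, archive_file: str, when: float) -> None:
        """Record that the message in archive_file completed at time `when`."""
        heapq.heappush(self._heap, (when + self.delay, archive_file))

    def due(self, now: float) -> List[str]:
        """Pop and return every archive file whose delay has expired."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```

The daemon would call completed() for each notification it receives
and periodically unlink whatever due() returns.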


Other possibilities:

The SMTP receiver of replicated messages could just leave them on a
special spool using queue_only mode, which is managed by the various
Exim queue-handling options.

The Cyrus duplicate suppression mechanism can deal with copies of a
message arriving via different routes (such as both the primary and
the replica). This could make it less important to remove messages
from the replica -- they can be delivered instead. However, this
doesn't work if the destination of the message is not Hermes (e.g.
departmental or outgoing email).

The configuration sketch above suggests that ppsw machines might be
paired, though it would work fine if the topology were less rigid (e.g.
replicate a message to another ppsw machine selected at random), which
would also make it easier to deal with changes to ppsw.
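
The random selection could look like the sketch below. pick_replica()
and the peer list are hypothetical; in the configuration above the
choice would be made by whatever route_data returns.

```python
import random
from typing import Optional, Sequence


def pick_replica(peers: Sequence[str], self_name: str,
                 rng: Optional[random.Random] = None) -> str:
    """Choose a replication target at random from the other ppsw machines."""
    candidates = [p for p in peers if p != self_name]
    if not candidates:
        raise ValueError("no other ppsw machine available")
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)
```

Adding or removing a ppsw machine then only means changing the peer
list, rather than re-pairing machines.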