Rate limiting for Exim on ppswitch

Originally written May 2005; minor additions January 2009.

One of the security weaknesses of Cambridge's email infrastructure is a lack of mechanisms for detecting and preventing email floods going out via ppswitch. We already have a number of ways of making such floods less likely, but I expect that these will become inadequate as virus and spammer attacks become more advanced. I'm proposing to implement a rate limiting facility in Exim which will allow us to plug this hole.

After considering existing defences and possible attacks, I'll discuss the range of rate limiting policies we could implement. These are just some initial suggestions and are very much open to discussion. After that I'll describe the maths I'm intending to use to measure sending rates, unless anyone knows of a better model. Finally I'll sketch out how this will fit in to Exim's ACL configuration language.

Email flooding defences and attacks

This section reviews our current defences against outgoing email floods, and possible attacks to which we are still vulnerable.

The attacks to which we are vulnerable are as follows:

Rate limiting will give us a means to detect and prevent floods as they occur. It isn't a complete solution to the blocking of outgoing spam; it will not detect spam that is trickled out, and it may be difficult for us to block floods sent via a legitimate departmental email server before going out via ppswitch.

Suggested rate-limiting policies

The general goal will be to avoid causing inconvenience to legitimate email, including legitimate bulk email. Therefore we will probably start off with a period of measurement to work out what level to set the limit at, so that it is greater than all reasonable usage. We will have to be quite diligent about announcing this well in advance so that our users can comment on the plans and be reassured that this is not intended to restrict the amount of email they can send.

There are two ways of sending email out via ppswitch and I'll consider the policies for them separately.

smtp.hermes.cam.ac.uk

Message submission from user agents goes via smtp.hermes.cam.ac.uk. By this time next year this should be entirely using secure authenticated submission. I expect that in the process of preventing insecure message submission we will discover a number of machines that are using smtp.hermes as a smart host, and sending email not directly related to a particular user or from MUA software. They will have to be reconfigured to use ppsw.cam.ac.uk, in which case the policies in the next subsection will apply.

The rate limit for smtp.hermes will be per-user, and the same for the whole user-base. This corresponds to smtp.hermes's criterion for providing service: an authenticated user is authorized to send email.

Users that need to send bulk email from their MUAs should use either a proper @lists mailing list to do the bulk explosion, or they should send the email from a dedicated computer via ppsw.cam.ac.uk. However if rate limiting is per-message then users are less likely to encounter problems with large CC: or BCC: lists, so these "shoulds" can be recommendations rather than technical restrictions.

ppsw.cam.ac.uk

Outgoing email other than from MUAs is sent via ppsw.cam. This includes email from other email systems as well as web servers, cron output, etc. The rate limit here will be per client IP address, which again corresponds to ppsw.cam's criterion for providing service: a host on the CUDN is authorized to send email.

We will probably need to have at least two rate limits for ppsw.cam: a large limit for known bulk senders such as email servers, and a small limit for everything else. We will have to maintain a table of bulk-sending hosts, which will probably be very similar to the existing special_routes table for incoming email.

It might be more appropriate to limit the recipient rate on ppsw.cam rather than the message rate. This corresponds more closely to the harm caused by a flood, i.e. the time wasted by recipients. The client software used with ppsw.cam is probably better able to deal with slow-down feedback (such as deliberate delays or 4xx temporary errors) than the MUA software used with smtp.hermes. However, some MUAs are also used with ppsw.cam, so this generalization might not be sufficiently correct.

Code

Here is a patch to implement the necessary feature in Exim. The rest of this document was written as part of the design process to develop the patch.

Configuration

Rate limiting will be configured using an Exim ACL condition, which will look like:

ratelimit = m / p / options / key

If the client's average sending rate is greater than m messages per time period p then the condition is true, otherwise it is false.

The parameter p is the smoothing time constant, in the form of an Exim time interval, e.g. 8h for eight hours, and m is the maximum number of messages in a burst. By increasing both m and p but keeping m divided by p constant, you can allow a client to send more messages in a burst without changing its overall sending rate limit. Conversely, if m and p are both small then messages must be sent at an even rate.

The key is used to look up the client's persistent state (the data needed to calculate its average rate) in a database maintained by Exim in its spool directory, alongside the retry database etc. For example, you can limit the sending rate of each authenticated user by setting the key to $authenticated_id. The default key is $sender_host_address.

There are two groups of options.

The per_conn option means that the client's rate is recomputed once per SMTP connection. The per_mail option means that the client's rate is recomputed once per message. The per_cmd option means that the rate is recomputed every time the ACL is processed. The alias per_rcpt is provided for use in the RCPT ACL to make it clear that the effect is to limit the rate at which recipients are accepted. In this case the rate limiting engine will see a message with many recipients as a large high-speed burst. If none of the per_* options is present the default is per_mail.

If a client's average rate is greater than the maximum, the rate limiting engine can react in two possible ways, depending on whether the strict or the leaky option is present. This is independent of the other counter-measures (e.g. rejecting the message) that may be implemented by the rest of the configuration. The default mode is leaky, which avoids the situation where an over-aggressive retry rate prevents a sender from getting any email through.

The strict option means that the client's recorded rate is updated as usual. The effect of this is that Exim measures the client's average rate of attempts to send email, which can be much higher than the maximum. If the client is over the limit it will be subjected to counter-measures until it slows down below the maximum rate.

The leaky option means that the client's recorded rate is not updated. The effect of this is that Exim measures the client's average rate of successfully sent email, which cannot be greater than the maximum. If the client is over the limit it will suffer some counter-measures, but it will still be able to send email at the configured maximum rate, whatever the rate of its attempts.
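
To make the distinction concrete, here is a minimal Python sketch of the update decision (the function and variable names are mine, not Exim's; the real implementation keeps the rate in Exim's spool database):

    def record_rate(r_now, max_rate, strict):
        """Decide whether a newly computed rate should be written back to the
        sender's database record: a sketch of the strict/leaky distinction."""
        over_limit = r_now > max_rate
        # strict: always store the new rate, so we measure the rate of attempts.
        # leaky: store it only while the client is under the limit, so the
        #        stored rate measures successfully sent email and can never
        #        climb above the maximum.
        store = strict or not over_limit
        return over_limit, store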

As a side-effect the ratelimit condition will set the expansion variables $sender_rate containing the client's computed rate in messages per period p, $sender_rate_limit containing the value of m, and $sender_rate_period containing the value of p.

Exim's existing ACL facilities are used to define what counter-measures are taken when the rate limit is exceeded. This might be anything from logging a warning (e.g. while measuring existing sending rates in order to define our policy), through time delays to slow down fast senders, up to rejecting the message. For example,

	warn
	  ratelimit = ${lookup {$sender_host_address} \
	                cdb {DB/ratelimits.cdb} \
	                {$value} {RATELIMIT} } / strict

	warn
	  ratelimit = 4 / 3600 / per_rcpt
	  delay     = ${eval: $sender_rate - $sender_rate_limit }

	defer
	  ratelimit = 4 / 3600 / key=$authenticated_id

Mathematical model

The software requires a way of computing a client's sending rate reasonably efficiently, i.e. with a minimum of persistent state which must be read and updated for each message (or recipient). In addition to that we would like to be able to set the maximum rate relatively low over long periods of time but still allow high rates for short periods, because email sending activity is very bursty. For example, 100 messages per day is a lot but it is only one every 15 minutes.

An exponentially-weighted moving average (EWMA) seems to fit these requirements very well. It calculates a smoothed rate which accommodates bursty behaviour. Its state (per sender) is the time of the previous message and the smoothed rate at that time. EWMA is often used by packet switching systems, e.g. in TCP to estimate the round-trip time and in various bandwidth shaping algorithms.

The usual formula for an exponentially weighted moving average is:

r_now = (1 - a) * r_inst + a * r_prev

r_now is the current smoothed rate; r_inst is the instantaneous rate at this point in time; r_prev is the smoothed rate computed at the previous sample; and a is the smoothing parameter, with larger values (up to 1) being less sensitive to bursts.

In most situations this formula is applied to data sampled at regular intervals. However our sampling points occur when messages are sent, so the intervals between samples vary. Therefore we need to adjust a according to the time since the previous sample. Let t_now be the current time and t_prev be the time of the previous sample. Then the interval i is given by:

i = t_now - t_prev

If s is assumed to be the smoothing factor for samples at a regular interval of 1 second, then the adjusted smoothing factor for an interval of i is:

a = s^i

However it's more convenient for the user to set the smoothing factor using a time constant: the time c in the recent past during which messages contribute significantly to the computed rate. If we say that

s^c = 1 / e

where e is the base of the natural logarithms, then this means that messages arriving more than c seconds ago contribute less than 1/e to the smoothed rate. I arbitrarily picked e as the significance factor to avoid using the pow() function (though this choice has a useful consequence described below): it means we can simply compute s from c with

s = exp(-1 / c)

which means that the adjusted smoothing factor for an interval of i is given by

a = s^i = exp(-1 / c)^i = exp(-i / c)

If i = c then this produces a = exp(-1) = 1 / e as we wanted.

How do we compute r_inst, the instantaneous sending rate? We can guess based on the time since the previous message, i, which gives us a sending frequency of 1 / i messages per second. However it's more convenient for the user to think of rates as messages per some longer time interval, for which we might as well use c. This gives us a rate of c / i messages every c seconds.

So to summarize, the user chooses a smoothing interval c, where longer intervals allow greater burst rates, and for each message (or recipient) the smoothed rate is computed by the following equations:

i = t_now - t_prev
r_inst = c / i
a = exp(-i / c)
r_now = (1 - a) * r_inst + a * r_prev
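
As an illustration of these equations (not the eventual C code in the patch), here is a Python sketch of one rate update; the in-memory dictionary stands in for the per-key database, and all the names are mine. Zero intervals are ignored here; they are discussed under "Very fast senders" below.

    import math
    import time

    state = {}   # key -> (t_prev, r_prev); stands in for Exim's spool database

    def update_rate(key, c, t_now=None):
        """Recompute the smoothed sending rate for 'key' using the
        exponentially-weighted moving average with time constant c."""
        if t_now is None:
            t_now = time.time()
        if key not in state:
            # First message from this sender: no history, so report a rate of 0.
            state[key] = (t_now, 0.0)
            return 0.0
        t_prev, r_prev = state[key]
        i = t_now - t_prev            # interval since the previous message
        r_inst = c / i                # instantaneous rate, messages per c seconds
        a = math.exp(-i / c)          # interval-adjusted smoothing factor
        r_now = (1 - a) * r_inst + a * r_prev
        state[key] = (t_now, r_now)
        return r_now

For example, calling update_rate('192.0.2.1', 3600) once per message would track that client's rate in messages per hour.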

Permitted burst sizes

How does the measured rate change when the sending rate changes, e.g. during a burst? This is useful to know for the purpose of choosing the smoothing interval c so that a sensible amount of bursty behaviour from clients is allowed.

We'll look at a simplified scenario where the client is sending a burst of messages at a constant rate r_k. We will name the client's smoothed rate at the start of the burst r_0, which is the same as r_prev for the first message of the burst. Because the sending rate is constant, i is fixed for each message in the burst, so we can treat a as a constant parameter in this analysis. It will also simplify the algebra to define an abbreviation k for the first term in the equation for r_now:

k = (1 - a) * r_k

After the first few messages of the burst, we get

r_1 = (1 - a) * r_k + a * r_0
    = a * r_0 + k
r_2 = a * r_1 + k = a^2 * r_0 + a * k + k
r_3 = a * r_2 + k = a^3 * r_0 + a^2 * k + a * k + k

which in general is

r_n = a^n * r_0 + k * (a^0 + a^1 + ... + a^(n-1))

Using the standard formula for the sum of a geometric progression we get

r_n = a^n * r_0 + k * (1 - a^n) / (1 - a)

and when we plug in the unabbreviated k it simplifies to

r_n = a^n * r_0 + r_k * (1 - a^n) = r_k + a^n * (r_0 - r_k)   where
r_k = c / i
a = exp(-i / c)

This is the calculated sending rate after receiving n messages at intervals of i seconds with a smoothing constant of c, following a period whose computed rate was r_0. When n is large, the exponential tends to zero so the smoothed rate tends to the incoming rate r_k as you would expect.
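
As a quick numerical cross-check (a throwaway sketch with arbitrarily chosen values), iterating the original recurrence n times gives the same answer as the closed form:

    import math

    c, i, r0, n = 3600.0, 60.0, 0.0, 50   # arbitrary illustrative values
    rk = c / i
    a = math.exp(-i / c)

    # Iterate r_now = (1 - a) * r_inst + a * r_prev for n messages...
    r = r0
    for _ in range(n):
        r = (1 - a) * rk + a * r

    # ...and compare with the closed form r_n = r_k + a^n * (r_0 - r_k).
    closed = rk + a**n * (r0 - rk)
    print(r, closed)   # the two agree to floating-point accuracy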

By re-arranging the formula we can work out the number of messages n that can be sent in a burst before r_n reaches the maximum rate m.

(m - r_k) / (r_0 - r_k) = a^n = exp(-n * i / c)
-n * i / c = log( (m - r_k) / (r_0 - r_k) )
n = -r_k * log( (m - r_k) / (r_0 - r_k) )
  = r_k * log( (r_0 - r_k) / (m - r_k) )

So, for example, if we start with r_0 = 0, then this formula gives us the following table. I've set m / c roughly constant for each column, so that the nominal permitted rate is roughly the same. However, larger smoothing time constants c allow bigger bursts, and the burst size happens to be roughly equal to m. This shows that the configuration parameters have a very intuitive meaning, which is a great boon.

Note that the equality between the maximum rate and the maximum burst size is a side-effect of the choice of e as the significance factor - it turned out not to be arbitrary after all.

               c = 86400         c = 18000         c = 3600          c = 900
               m = 100           m = 20            m = 4             m = 1
    i (secs)   r_k        n      r_k        n      r_k       n      r_k      n
    0.001      86400000   100    18000000   20     3600000   4      900000   1
    1          86400      100    18000      20.01  3600      4.002  900      1.001
    10         8640       101    1800       20.1   360       4.02   90       1.01
    60         1440       104    300        20.7   60        4.14   15       1.03
    300        288        123    60         24.3   12        4.87   3        1.21
    600        144        171    30         33.0   6         6.59   1.5      1.65
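
The table can be reproduced directly from the formula for n; for example (a small Python sketch, not part of the proposed patch):

    import math

    def burst_size(c, m, i, r0=0.0):
        """Messages that can be sent at intervals of i seconds, starting from a
        smoothed rate of r0, before the smoothed rate reaches the maximum m."""
        rk = c / i
        return rk * math.log((r0 - rk) / (m - rk))

    print(burst_size(86400, 100, 300))   # about 123, as in the c = 86400 column
    print(burst_size(3600, 4, 600))      # about 6.59, as in the c = 3600 column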

Very fast senders

The above table includes a line for messages arriving at an interval of 1ms. This illustrates that the burst size limit is strictly applied to fast senders. An interval of 1ms is unlikely for whole messages, but very small intervals can occur if rate limiting is applied to individual RCPT commands, which can all arrive in a single packet.

The implementation will have to prevent attempts to measure the rate of RCPT commands with an interval of 0. A suitable response is to wait for one tick of the system clock (typically 10ms or 1ms) then re-compute the rate. The instantaneous rate will still easily be high enough for the burst size limit to work.
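
A sketch of that guard in Python (the 1ms sleep is an assumption about clock resolution, not part of the design):

    import time

    def interval_since(t_prev):
        """If the clock has not advanced since the previous command, wait for a
        tick so the interval is small but non-zero, then measure it."""
        t_now = time.time()
        while t_now <= t_prev:
            time.sleep(0.001)        # assumed ~1ms tick; real resolution varies
            t_now = time.time()
        return t_now - t_prev, t_now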

Scale factors

In the formulae above, the value c performs two functions: it is the exponential smoothing time constant, used to calculate a; and it is a scale factor for the units in which rates are measured, so they are not always per second. We set these to the same value so that users have fewer numbers to worry about and so that they don't have to deal with inconveniently small numbers of messages per second.

Other scale factors can be included in the formula without affecting its behaviour. In the formula for r_n above, the scale factor is part of the value of k, which is outside the sum of the geometric progression.

This can be used to scale the rate according to some measure of the size of the message, e.g. its count of bytes or recipients.
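
For instance, a per-recipient or per-byte rate can be obtained by multiplying the instantaneous rate by the relevant count (a sketch; the parameter names are mine):

    import math

    def scaled_update(r_prev, i, c, count=1):
        """One rate update where the event carries 'count' units
        (e.g. recipients or bytes) rather than a single message."""
        r_inst = count * c / i           # count units per c seconds
        a = math.exp(-i / c)
        return (1 - a) * r_inst + a * r_prev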

Very slow senders

The model described above calculates an accurate rate for very slow senders, but this turns out not to be particularly useful in practice, especially when scale factors are used as described in the previous section. The problem occurs because the first message of a spam run contributes to a rate averaged over a long period of inactivity, so the result is close to zero. This is bad if the message has a large count of bytes or recipients, because it delays the detection of a spam run.

An alternative model is to regard isolated events as counting only towards the smoothing period in which they fall, ignoring previous inactivity. This is equivalent to deciding that it doesn't make sense to say the rate is less than one, except that if a high-speed burst occurs it can take more than one period to forget about it.

When the rate calculation is scaled by some count and produces a value less than the count, the alternative model sets the rate equal to the count. This produces almost the same result as performing count rate updates one at a time in quick succession, since each update at a very short interval increases the measured rate by roughly one.
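
In terms of the scaled_update() sketch above, the alternative model adds one clamp; for example, after a day of silence a 50-recipient message with c = 3600 would otherwise register a rate of only about 2:

    def alternative_rate(r_prev, i, c, count=1):
        """Alternative model for isolated events: compute the scaled rate as
        before, but never report less than the event's own count, so a large
        first message of a run is noticed immediately."""
        return max(scaled_update(r_prev, i, c, count), float(count))

    print(alternative_rate(0.0, 86400, 3600, count=50))   # 50, not about 2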

Time to recover

A client that exceeds the maximum rate, i.e. r_now > m, must reduce its future sending rate r_fut so that it satisfies the following equivalent inequalities. I don't currently know how to solve them analytically, so I'm not sure which form is the most useful.

m > (1 - a) * r_fut + a * r_now
m - r_fut > a * (r_now - r_fut)
(m - r_fut) / (r_now - r_fut) > a = exp(-1 / r_fut)
1 > r_fut * log( (r_now - r_fut) / (m - r_fut) )

Note that the final inequality has the same form as the equation for n in the previous section. The meaning is that one message isn't enough to break the maximum rate.
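
Lacking an analytic solution, a rough numerical sketch can find the largest steady rate r_fut that satisfies the last inequality by bisection (the function names and example numbers are mine):

    import math

    def recovers(r_fut, r_now, m):
        """True if one more update at a steady rate of r_fut takes the
        recorded rate from r_now (over the limit) to below the maximum m."""
        return 1 > r_fut * math.log((r_now - r_fut) / (m - r_fut))

    def max_recovery_rate(r_now, m, steps=60):
        """Bisect for the largest r_fut in (0, m) for which recovers() holds."""
        lo, hi = 0.0, m
        for _ in range(steps):
            mid = (lo + hi) / 2
            if recovers(mid, r_now, m):
                lo = mid
            else:
                hi = mid
        return lo

    print(max_recovery_rate(r_now=200.0, m=100.0))   # roughly 1.4 for these numbers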


Tony Finch <dot@dotat.at> <fanf@exim.org> <fanf2@cam.ac.uk>
$Cambridge: hermes/doc/antiforgery/ratelimit.html,v 1.18 2009/01/20 00:01:38 fanf2 Exp $