One of the security weaknesses of Cambridge's email infrastructure is a lack of mechanisms for detecting and preventing email floods going out via ppswitch. We already have a number of ways of making such floods less likely, but I expect that these will become inadequate as virus and spammer attacks become more advanced. I'm proposing to implement a rate limiting facility in Exim which will allow us to plug this hole.

After considering existing defences and possible attacks, I'll discuss the range of rate limiting policies we could implement. These are just some initial suggestions and are very much open to discussion. After that I'll describe the maths I'm intending to use to measure sending rates, unless anyone knows of a better model. Finally I'll sketch out how this will fit in to Exim's ACL configuration language.

This section reviews our current defences against outgoing email floods, and possible attacks to which we are still vulnerable.

**Anti-virus filtering**

If the flood consists of viruses then the anti-virus filter will stop it from escaping.

**Ingress-only MX**

A relatively new technique, used by spammers to escape from networks (like Cambridge) that block port 25, is to look up the network's MX and attempt to use it as an outgoing relay. This does not work in Cambridge because our MX only accepts email to Cambridge email addresses.

Before this defence was implemented, this technique was used to send spam from Cambridge via an open proxy in a college then via ppswitch. We did not notice this for some time owing to a lack of monitoring for floods.

The attacks to which we are vulnerable are as follows:

**Exploitable web forms**

If a spammer discovers an insecure `formmail.pl` (or similar) then this may be used to spam via ppswitch.

**Stolen user credentials**

A virus may steal a user's Hermes login credentials from their MUA settings and use them to send further copies of the virus or spam using authenticated message submission.

There are supposed to be viruses which do this already, but we do not have direct experience of them. This is probably because of our fairly thorough defences against email viruses.

**Smart host discovery**

The name of our smart host (ppsw.cam.ac.uk) is fairly obscure, which makes it unlikely that any but the most determined spammer will be able to find it out in order to spam via open proxies on our networks. There are many open proxies out there which are much easier to exploit.

However, I am planning to publish CSA records in the DNS which will make it possible for spammers to automatically discover candidate host names for use as email smart hosts.

Rate limiting will give us a means to detect and prevent floods as they occur. It isn't a complete solution to the blocking of outgoing spam; it will not detect spam that is trickled out, and it may be difficult for us to block floods sent via a legitimate departmental email server before going out via ppswitch.

The general goal will be to avoid causing inconvenience to legitimate email, including legitimate bulk email. Therefore we will probably start off with a period of measurement to work out what level to set the limit at, so that it is greater than all reasonable usage. We will have to be quite diligent about announcing this well in advance so that our users can comment on the plans and be reassured that this is not intended to restrict the amount of email they can send.

There are two ways of sending email out via ppswitch and I'll consider the policies for them separately.

Message submission from user agents goes via smtp.hermes.cam.ac.uk. By this time next year this should be entirely using secure authenticated submission. I expect that in the process of preventing insecure message submission we will discover a number of machines that are using smtp.hermes as a smart host, and sending email not directly related to a particular user or from MUA software. They will have to be reconfigured to use ppsw.cam.ac.uk, in which case the policies in the next subsection will apply.

The rate limit for smtp.hermes will be per-user, and the same for the whole user-base. This corresponds to smtp.hermes's criterion for providing service: an authenticated user is authorized to send email.

Users who need to send bulk email from their MUAs should either use a proper @lists mailing list to do the bulk explosion, or send the email from a dedicated computer via ppsw.cam.ac.uk. However, if rate limiting is per-message then users are less likely to encounter problems with large CC: or BCC: lists, so these "shoulds" can be recommendations rather than technical restrictions.

Outgoing email other than from MUAs is sent via ppsw.cam. This includes email from other email systems as well as web servers, cron output, etc. The rate limit here will be per client IP address, which again corresponds to ppsw.cam's criterion for providing service: a host on the CUDN is authorized to send email.

We will probably need to have at least two rate limits for ppsw.cam: a large limit for known bulk senders such as email servers, and a small limit for everything else. We will have to maintain a table of bulk-sending hosts, which will probably be very similar to the existing special_routes table for incoming email.

It might be more appropriate to limit the recipient rate on
ppsw.cam rather than the message rate. This corresponds more closely
to the harm caused by a flood, i.e. the time wasted by recipients. The
client software used with ppsw.cam is probably better able to deal
with slow-down feedback (such as deliberate delays or 4xx temporary
errors) than the MUA software used with smtp.hermes. However MUAs
*are* used with ppsw.cam, so this generalization might not be
sufficiently correct.

Here is a patch to implement the necessary feature in Exim. The rest of this document was written as part of the design process to develop the patch.

Rate limiting will be configured using an Exim ACL condition, which will look like:

    ratelimit = m / p / options / key

If the average client sending rate is greater than *m*
messages per time period *p* then the condition is true,
otherwise it is false.

The parameter *p* is the smoothing time constant, in the form of an Exim time interval, e.g. `8h` for eight hours, and *m* is the maximum number of messages in a burst. By increasing both *m* and *p* but keeping *m* divided by *p* constant, you can allow a client to send more messages in a burst without changing its overall sending rate limit. Conversely, if *m* and *p* are both small then messages must be sent at an even rate.

The *key* is used to look up the client's persistent state, used to calculate its average rate, in a database maintained by Exim in its spool directory alongside the retry database etc. For example, you can limit the sending rate of each authenticated user by setting the key to `$authenticated_id`. The default key is `$sender_host_address`.

There are two groups of options.

The `per_conn` option means that the client's rate is recomputed once per SMTP connection. The `per_mail` option means that the client's rate is recomputed once per message. The `per_cmd` option means that the rate is recomputed every time the ACL is processed. The alias `per_rcpt` is provided for use in the `RCPT` ACL to make it clear that the effect is to limit the rate at which recipients are accepted. In this case the rate limiting engine will see a message with many recipients as a large high-speed burst. If none of the `per_*` options is present the default is `per_mail`.

If a client's average rate is greater than the maximum, the rate limiting engine can react in two possible ways, depending on the presence of the `strict` and `leaky` options. This is independent of the other counter-measures (e.g. rejecting the message) that may be implemented by the rest of the configuration. The default mode is `leaky`, which avoids an over-aggressive retry rate preventing a sender from getting any email through.

The `strict` option means that the client's recorded rate is updated as usual. The effect of this is that Exim measures the client's average rate of attempts to send email, which can be much higher than the maximum. If the client is over the limit it will be subjected to counter-measures until it slows down below the maximum rate.

The `leaky` option means that the client's recorded rate is not updated while it is over the limit. The effect of this is that Exim measures the client's average rate of successfully sent email, which cannot be greater than the maximum. If the client is over the limit it will suffer some counter-measures, but it will still be able to send email at the configured maximum rate, whatever the rate of its attempts.
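The difference between the two modes can be sketched in Python. The function, its arguments, and the dictionary standing in for Exim's hints database are a hypothetical model of the strict/leaky distinction, not Exim's actual code:

```python
def check_ratelimit(db, key, max_rate, new_rate, strict):
    """Return True if this client is over the limit.  `db` stands in
    for Exim's hints database, mapping keys to recorded rates; this is
    an illustrative model, not real Exim code."""
    over = new_rate > max_rate
    if strict or not over:
        # strict mode always records the new rate, so it measures the
        # rate of *attempts*; leaky mode records it only while under
        # the limit, so it measures *successfully sent* email.
        db[key] = new_rate
    return over
```

In leaky mode an over-limit attempt leaves the recorded rate untouched, so a client that backs off is immediately back under the limit; in strict mode the recorded rate keeps climbing with every attempt.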

As a side-effect the `ratelimit` condition will set the expansion variables `$sender_rate` containing the client's computed rate in messages per period *p*, `$sender_rate_limit` containing the value of *m*, and `$sender_rate_period` containing the value of *p*.

Exim's existing ACL facilities are used to define what counter-measures are taken when the rate limit is exceeded. This might be anything from logging a warning (e.g. while measuring existing sending rates in order to define our policy), through time delays to slow down fast senders, up to rejecting the message. For example,

    warn  ratelimit = ${lookup {$sender_host_address} \
                              cdb {DB/ratelimits.cdb} \
                              {$value} {RATELIMIT} } / strict

    warn  ratelimit = 4 / 3600 / per_rcpt
          delay = ${eval: $sender_rate - $sender_rate_limit }

    defer ratelimit = 4 / 3600 / key=$authenticated_id

The software requires a way of computing a client's sending rate reasonably efficiently, i.e. with a minimum of persistent state which must be read and updated for each message (or recipient). In addition to that we would like to be able to set the maximum rate relatively low over long periods of time but still allow high rates for short periods, because email sending activity is very bursty. For example, 100 messages per day is a lot but it is only one every 15 minutes.

An exponentially-weighted moving average seems to fit these requirements very well. It calculates a smoothed rate which accommodates bursty behaviour. Its state (per sender) is the time of the previous message and the smoothed rate at that time. EWMA is often used by packet switching systems, e.g. in TCP to estimate the round-trip time and in various bandwidth shaping algorithms.

The usual formula for an exponentially weighted moving average is:

r_now = (1 - a) * r_inst + a * r_prev

*r_now* is the current smoothed rate;
*r_inst* is the instantaneous rate at this point in time;
*r_prev* is the smoothed rate computed at the previous sample;
and *a* is the smoothing parameter,
with larger values (up to 1) being less sensitive to bursts.

In most situations this formula is applied to data sampled at
regular intervals. However our sampling points occur when messages are
sent, which has variable intervals. Therefore we need to adjust
*a* according to the time since the previous sample. Let
*t_now* be the current time and *t_prev* be the time of
the previous sample. Then the interval *i* is given by:

i = t_now - t_prev

If *s* is assumed to be the smoothing factor for samples at
a regular interval of 1 second, then the adjusted smoothing factor for
an interval of *i* is:

a = s^i

However it's more convenient for the user to set the smoothing
factor using a time constant: the time *c* in the recent past
during which messages contribute significantly to the computed rate.
If we say that

s^c = 1 / e

where *e* is the base of the natural logarithms, then this means that messages arriving more than *c* seconds ago contribute less than *1/e* to the smoothed rate. I arbitrarily picked *e* as the significance factor to avoid using the `pow()` function (though this choice has a useful consequence described below): it means we can simply compute *s* from *c* with

s = exp(-1 / c)

which means that the adjusted smoothing factor for an interval of *i* is given by

a = s^i = exp(-1 / c)^i = exp(-i / c)

If *i* = *c* then this produces
*a* = exp(-1) = 1 / *e*
as we wanted.

How do we compute *r_inst*, the instantaneous sending rate?
We can guess based on the time since the previous message, *i*,
which gives us a sending frequency of 1 / *i* messages per
second. However it's more convenient for the user to think of rates as
messages per some longer time interval, for which we might as well use
*c*. This gives us a rate of *c / i* messages every
*c* seconds.

So to summarize, the user chooses a smoothing interval *c*,
where longer intervals allow greater burst rates, and for each message
(or recipient) the smoothed rate is computed by the following
equations:

i = t_now - t_prev

r_inst = c / i

a = exp(-i / c)

r_now = (1 - a) * r_inst + a * r_prev
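These four equations can be sketched as a single update function in Python. This is an illustrative sketch, not Exim's implementation; the function name, calling convention, and the 10ms clamp for zero intervals are assumed choices of mine:

```python
import math

def update_rate(r_prev, t_prev, t_now, c):
    """One update of the smoothed sending rate, following the four
    equations above.  Rates are in messages per c seconds."""
    i = t_now - t_prev
    # Guard against a zero interval (e.g. several RCPT commands
    # arriving in one packet) by clamping to a notional clock tick.
    i = max(i, 0.01)
    r_inst = c / i              # instantaneous rate implied by this gap
    a = math.exp(-i / c)        # smoothing factor adjusted for the gap
    return (1 - a) * r_inst + a * r_prev
```

A client sending steadily at one message every *c* seconds converges on a measured rate of 1, as expected.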

How does the measured rate change when the sending rate changes,
e.g. during a burst? This is useful to know for the purpose of
choosing the smoothing interval *c* so that a sensible amount
of bursty behaviour from clients is allowed.

We'll look at a simplified scenario where the client is sending a burst of messages at a constant rate *r_k*. We will name the client's smoothed rate at the start of the burst *r_0*, and for brevity define

k = (1 - a) * r_k

After the first few messages of the burst, we get

r_1 = (1 - a) * r_k + a * r_0 = a * r_0 + k

r_2 = a * r_1 + k = a^2 * r_0 + a * k + k

r_3 = a * r_2 + k = a^3 * r_0 + a^2 * k + a * k + k

which in general is

r_n = a^n * r_0 + k * Σ_{x=0}^{n-1} a^x

Using the standard formula for the sum of a geometric progression we get

r_n = a^n * r_0 + k * (1 - a^n) / (1 - a)

and when we plug in the unabbreviated *k* it simplifies to

r_n = a^n * r_0 + r_k * (1 - a^n) = r_k + a^n * (r_0 - r_k)

where

r_k = c / i

a = exp(-i / c)

This is the calculated sending rate after receiving *n* messages at intervals of *i* seconds with a smoothing constant of *c*, following a period whose computed rate was *r_0*. When *n* is large, *a^n* tends to zero and the calculated rate tends to *r_k*, as expected.

By re-arranging the formula we can work out the number of messages *n* that can be sent in a burst before *r_n* reaches the maximum rate *m*:

(m - r_k) / (r_0 - r_k) = a^n = exp(-n * i / c)

-n * i / c = log( (m - r_k) / (r_0 - r_k) )

n = -r_k * log( (m - r_k) / (r_0 - r_k) ) = r_k * log( (r_0 - r_k) / (m - r_k) )
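The formula for *n* can be checked numerically; this small Python helper (an illustrative aid, not part of the proposed patch) reproduces the entries in the table below:

```python
import math

def burst_size(c, m, i, r0=0.0):
    """Number of messages that can be sent at intervals of i seconds,
    starting from smoothed rate r0, before the rate reaches the limit
    m, given smoothing constant c.  Assumes a fast sender, i.e.
    r_k = c/i > m, so the logarithm's argument is positive."""
    r_k = c / i
    return r_k * math.log((r0 - r_k) / (m - r_k))
```

For example `burst_size(3600, 4, 60)` gives about 4.14, matching the *c* = 3600, *m* = 4 column for *i* = 60.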

So, for example, if we start with *r_0* = 0, then this formula gives us the following table. I've set the limits so that each pair of *c* and *m* allows roughly the same long-term average rate (about 100 messages per day) with different smoothing periods.

Note that the equality between the maximum rate and the maximum burst size (for fast senders, *n* ≈ *m*) is a side-effect of the choice of *e* as the significance factor - it turned out not to be arbitrary after all.

| *i* | *r_k* (c = 86400) | *n* (m = 100) | *r_k* (c = 18000) | *n* (m = 20) | *r_k* (c = 3600) | *n* (m = 4) | *r_k* (c = 900) | *n* (m = 1) |
|---|---|---|---|---|---|---|---|---|
| 0.001 | 86400000 | 100 | 18000000 | 20 | 3600000 | 4 | 900000 | 1 |
| 1 | 86400 | 100 | 18000 | 20.01 | 3600 | 4.002 | 900 | 1.001 |
| 10 | 8640 | 101 | 1800 | 20.1 | 360 | 4.02 | 90 | 1.01 |
| 60 | 1440 | 104 | 300 | 20.7 | 60 | 4.14 | 15 | 1.03 |
| 300 | 288 | 123 | 60 | 24.3 | 12 | 4.87 | 3 | 1.21 |
| 600 | 144 | 171 | 30 | 33.0 | 6 | 6.59 | 1.5 | 1.65 |

The above table includes a line for messages arriving at an interval of 1ms. This illustrates that the burst size limit is strictly applied to fast senders. An interval of 1ms is unlikely for whole messages, but very small intervals can occur if rate limiting is applied to individual RCPT commands, which can all arrive in a single packet.

The implementation will have to prevent attempts to measure the rate of RCPT commands with an interval of 0. A suitable response is to wait for one tick of the system clock (typically 10ms or 1ms) then re-compute the rate. The instantaneous rate will still easily be high enough for the burst size limit to work.

In the formulae above, the value *c* performs two functions: it is the exponential smoothing time constant, used to calculate *a*; and it is a scale factor for the units in which rates are measured, so they are not always per second. We set these to the same value, so that users have fewer numbers to worry about and so that they don't have to deal with inconveniently small numbers of messages per second.

Other scale factors can be included in the formula without affecting its behaviour. In the formula for *r_n* above, the scale factor is part of the value of *r_k*. This can be used to scale the rate according to some measurement of the size of the message, e.g. the count of bytes or recipients.

The model described above calculates an accurate rate for very slow senders, but this turns out not to be particularly useful in practice, especially when scale factors are used as described in the previous section. The problem occurs because the first message of a spam run contributes to a rate averaged over a long period of inactivity, so the result is close to zero. This is bad if the message has a large count of bytes or recipients, because it delays the detection of a spam run.

An alternative model is to regard isolated events as counting only towards the smoothing period in which they fall, ignoring previous inactivity. This is equivalent to deciding that it doesn't make sense to say the rate is less than one, except that if a high-speed burst occurs it can take more than one period to forget about it.

When the rate calculation is scaled by some count and produces a
value less than the count, the alternative model sets the rate equal
to the count. This produces almost the same result as performing
*count* rate updates one at a time in quick succession, since
very short intervals increase the measured rate by one.
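The adjustment can be sketched as follows. The function and its floor rule are my reading of the alternative model, under the scale-factor assumptions above, not Exim's code:

```python
import math

def scaled_rate(r_prev, i, c, count):
    """Smoothed rate where each event is weighted by `count` (bytes,
    recipients, ...), with the isolated-event floor described above."""
    r_inst = count * c / i          # scaled instantaneous rate
    a = math.exp(-i / c)
    r_now = (1 - a) * r_inst + a * r_prev
    # Never report less than this event's own count, however long the
    # preceding period of inactivity.
    return max(r_now, count)
```

After a week of silence with *c* = 3600, a message with 50 recipients would otherwise score a rate of about 0.3; the floor reports 50 immediately, so counter-measures are not delayed.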

A client that exceeds the maximum rate, i.e. *r_now > m*, must ensure that its future sending rate *r_fut* is below the maximum *m*, i.e. it must satisfy one of the following. I don't currently know how to solve them analytically so I'm not sure which is the most useful.

m > (1 - a) * r_fut + a * r_now

m - r_fut > a * (r_now - r_fut)

(m - r_fut) / (r_now - r_fut) > a = exp(-1 / r_fut)

1 > r_fut * log( (r_now - r_fut) / (m - r_fut) )

Note that the final inequality has the same form as the equation
for *n* in the previous section. The meaning is that one
message isn't enough to break the maximum rate.
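Although an analytic solution is elusive, the boundary rate can at least be found numerically. This bisection sketch is my own addition: the left-hand side of the final inequality grows monotonically from 0 (as *r_fut* tends to 0) without bound (as *r_fut* tends to *m*), so for *r_now > m* there is exactly one root:

```python
import math

def future_rate_bound(m, r_now, tol=1e-9):
    """Numerically solve 1 = r_fut * log((r_now - r_fut) / (m - r_fut))
    for r_fut by bisection on (0, m), assuming r_now > m.  My own
    numeric sketch; the document leaves the inequality unsolved."""
    def f(r):
        return r * math.log((r_now - r) / (m - r))
    lo, hi = tol, m - tol
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Any future rate below `future_rate_bound(m, r_now)` satisfies the inequality; anything above it does not.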

Tony Finch <dot@dotat.at> <fanf@exim.org> <fanf2@cam.ac.uk>

$Cambridge: hermes/doc/antiforgery/ratelimit.html,v 1.18 2009/01/20 00:01:38 fanf2 Exp $