One of the security weaknesses of Cambridge's email infrastructure is a lack of mechanisms for detecting and preventing email floods going out via ppswitch. We already have a number of ways of making such floods less likely, but I expect that these will become inadequate as virus and spammer attacks become more advanced. I'm proposing to implement a rate limiting facility in Exim which will allow us to plug this hole.
After considering existing defences and possible attacks, I'll discuss the range of rate limiting policies we could implement. These are just some initial suggestions and are very much open to discussion. After that I'll describe the maths I'm intending to use to measure sending rates, unless anyone knows of a better model. Finally I'll sketch out how this will fit in to Exim's ACL configuration language.
This section reviews our current defences against outgoing email floods, and possible attacks to which we are still vulnerable.
If the flood consists of viruses then the anti-virus filter will stop it from escaping.
A relatively new technique, used by spammers to escape from networks (like Cambridge) that block port 25, is to look up the network's MX and attempt to use it as an outgoing relay. This does not work in Cambridge because our MX only accepts email to Cambridge email addresses.
Before this defence was implemented, this technique was used to send spam from Cambridge via an open proxy in a college then via ppswitch. We did not notice this for some time owing to a lack of monitoring for floods.
The attacks to which we are vulnerable are as follows:
If a spammer discovers an insecure formmail.pl
(or
similar) then this may be used to spam via ppswitch.
A virus may steal a user's Hermes login credentials from their MUA settings and use them to send further copies of the virus or spam using authenticated message submission.
There are supposed to be viruses which do this already, but we do not have direct experience of them. This is probably because of our fairly thorough defences against email viruses.
The name of our smart host (ppsw.cam.ac.uk) is fairly obscure, which makes it unlikely that any but the most determined spammer will be able to find it out in order to spam via open proxies on our networks. There are many open proxies out there which are much easier to exploit.
However, I am planning to publish CSA records in the DNS which will make it possible for spammers to automatically discover candidate host names for use as email smart hosts.
Rate limiting will give us a means to detect and prevent floods as they occur. It isn't a complete solution to the blocking of outgoing spam; it will not detect spam that is trickled out, and it may be difficult for us to block floods sent via a legitimate departmental email server before going out via ppswitch.
The general goal will be to avoid causing inconvenience to legitimate email, including legitimate bulk email. Therefore we will probably start off with a period of measurement to work out what level to set the limit at, so that it is greater than all reasonable usage. We will have to be quite diligent about announcing this well in advance so that our users can comment on the plans and be reassured that this is not intended to restrict the amount of email they can send.
There are two ways of sending email out via ppswitch and I'll consider the policies for them separately.
Message submission from user agents goes via smtp.hermes.cam.ac.uk. By this time next year this should be entirely using secure authenticated submission. I expect that in the process of preventing insecure message submission we will discover a number of machines that are using smtp.hermes as a smart host, and sending email not directly related to a particular user or from MUA software. They will have to be reconfigured to use ppsw.cam.ac.uk, in which case the policies in the next subsection will apply.
The rate limit for smtp.hermes will be per-user, and the same for the whole user-base. This corresponds to smtp.hermes's criterion for providing service: an authenticated user is authorized to send email.
Users that need to send bulk email from their MUAs should use either a proper @lists mailing list to do the bulk explosion, or they should send the email from a dedicated computer via ppsw.cam.ac.uk. However if rate limiting is per-message then users are less likely to encounter problems with large CC: or BCC: lists, so these "shoulds" can be recommendations rather than technical restrictions.
Outgoing email other than from MUAs is sent via ppsw.cam. This includes email from other email systems as well as web servers, cron output, etc. The rate limit here will be per client IP address, which again corresponds to ppsw.cam's criterion for providing service: a host on the CUDN is authorized to send email.
We will probably need to have at least two rate limits for ppsw.cam: a large limit for known bulk senders such as email servers, and a small limit for everything else. We will have to maintain a table of bulk-sending hosts, which will probably be very similar to the existing special_routes table for incoming email.
It might be more appropriate to limit the recipient rate on ppsw.cam rather than the message rate. This corresponds more closely to the harm caused by a flood, i.e. the time wasted by recipients. The client software used with ppsw.cam is probably better able to deal with slow-down feedback (such as deliberate delays or 4xx temporary errors) than the MUA software used with smtp.hermes. However MUAs are used with ppsw.cam, so this generalization might not be sufficiently correct.
Here is a patch to implement the necessary feature in Exim. The rest of this document was written as part of the design process to develop the patch.
Rate limiting will be configured using an Exim ACL condition, which will look like:
ratelimit = m / p / options / key
If the average client sending rate is greater than m messages per time period p then the condition is true, otherwise it is false.
The parameter p is the smoothing time constant, in the
form of an Exim time interval, e.g. 8h
for eight hours,
and m is the maximum number of messages in a burst. By
increasing both m and p but keeping m
divided by p constant, you can allow a client to send more
messages in a burst without changing its overall sending rate limit.
Conversely, if m and p are both small then messages
must be sent at an even rate.
The key is used to look up the client's persistent state,
used to calcluate its average rate, in a database maintained by Exim
in its spool directory alongside the retry database etc. For example,
you can limit the sending rate of each authenticated user by setting
the key to $authenticated_id
. The default key is
$sender_host_address
.
There are two groups of options.
The per_conn
option means that the client's rate is
recomputed once per SMTP connection. The per_mail
option
means that the client's rate is recomputed once per message. The
per_cmd
option means that the rate is recomputed every
time the ACL is processed. The alias per_rcpt
is provided
for use in the RCPT
ACL to make it clear that the effect
is to limit the rate at which recipients are accepted. In this case
the rate limiting engine will see a message with many recipients as a
large high-speed burst. If none of the per_*
options is
present the default is per_mail
.
If a client's average rate is greater than the maximum, the rate
limiting engine can react in two possible ways, depending on the
presence of the strict
and leaky
options.
This is independent of the other counter-measures (e.g. rejecting the
message) that may be implemented by the rest of the configuration. The
default mode is leaky
, which avoids an over-aggressive
retry rate preventing a sender from getting any email through.
The strict
option means that the client's recorded
rate is updated as usual. The effect of this is that Exim measures the
client's average rate of attempts to send email, which can be much
higher than the maximum. If the client is over the limit it will be
subjected to counter-measures until they slow down below the maximum
rate.
The leaky
option means that the client's recorded rate
is not updated. The effect of this is that Exim measures the client's
average rate of successfully sent email, which cannot be greater than
the maximum. If the client is over the limit it will suffer some
counter-measures, but it will still be able to send email at the
configured maximum rate, whatever the rate of its attempts.
As a side-effect the ratelimit
condition will set the
expansion variables $sender_rate
containing the client's
computed rate in messages per period p,
$sender_rate_limit
containing the value of m,
and $sender_rate_period
containing the value of
p.
Exim's existing ACL facilities are used to define what counter-measures are taken when the rate limit is exceeded. This might be anything from logging a warning (e.g. while measuring existing sending rates in order to define our policy), through time delays to slow down fast senders, up to rejecting the message. For example,
warn ratelimit = ${lookup {$sender_host_address} \ cdb {DB/ratelimits.cdb} \ {$value} {RATELIMIT} } / strict warn ratelimit = 4 / 3600 / per_rcpt delay = ${eval: $sender_rate - $sender_rate_limit } defer ratelimit = 4 / 3600 / key=$authenticated_id
The software requires a way of computing a client's sending rate reasonably efficiently, i.e. with a minimum of persistent state which must be read and updated for each message (or recipient). In addition to that we would like to be able to set the maximum rate relatively low over long periods of time but still allow high rates for short periods, because email sending activity is very bursty. For example, 100 messages per day is a lot but it is only one every 15 minutes.
An exponentially-weighted moving average seems to fit these requirements very well. It calculates a smoothed rate which accommodates bursty behaviour. Its state (per sender) is the time of the previous message and the smoothed rate at that time. EWMA is often used by packet switching systems, e.g. in TCP to estimate the round-trip time and in various bandwidth shaping algorithms.
The usual formula for an exponentially weighted moving average is:
r_now = (1 - a) * r_inst + a * r_prev
r_now is the current smoothed rate; r_inst is the instantaneous rate at this point in time; r_prev is the smoothed rate computed at the previous sample; and a is the smoothing parameter, with larger values (up to 1) being less sensitive to bursts.
In most situations this formula is applied to data sampled at regular intervals. However our sampling points occur when messages are sent, which has variable intervals. Therefore we need to adjust a according to the time since the previous sample. Let t_now be the current time and t_prev be the time of the previous sample. Then the interval i is given by:
i = t_now - t_prev
If s is assumed to be the smoothing factor for samples at a regular interval of 1 second, then the adjusted smoothing factor for an interval of i is:
a = si
However it's more convenient for the user to set the smoothing factor using a time constant: the time c in the recent past during which messages contribute significantly to the computed rate. If we say that
sc = 1 / e
where e is the base of the natural logarithms, then this
means that messages arriving more than c seconds ago
contribute less than 1/e to the smoothed rate. I arbitrarily
picked e as the significance factor to avoid using the
pow()
function (though this chioce has a useful
consequence described below): it means we can simply compute
s from c with
s = exp(-1 / c)
which means that the adjusted smoothing factor for an interval of i is given by
a = si = exp(-1 / c)i = exp(-i / c)
If i = c then this produces a = exp(-1) = 1 / e as we wanted.
How do we compute r_inst, the instantaneous sending rate? We can guess based on the time since the previous message, i, which gives us a sending frequency of 1 / i messages per second. However it's more convenient for the user to think of rates as messages per some longer time interval, for which we might as well use c. This gives us a rate of c / i messages every c seconds.
So to summarize, the user chooses a smoothing interval c, where longer intervals allow greater burst rates, and for each message (or recipient) the smoothed rate is computed by the following equations:
i = t_now - t_prev
r_inst = c / i
a = exp(-i / c)
r_now = (1 - a) * r_inst + a * r_prev
How does the measured rate change when the sending rate changes, e.g. during a burst? This is useful to know for the purpose of choosing the smoothing interval c so that a sensible amount of bursty behaviour from clients is allowed.
We'll look at a simplified scenario where the client is sending a burst of messages at a constant rate rk. We will name the client's smoothed rate at the start of the burst r0, which is the same as r_prev for the first message of the burst. Because the sending rate is constant, i is fixed for each message in the burst, so we can treat a as a constant parameter in this analysis. It will also simplify the algebra to define an abbreviation k for the first term in the equation for r_now:
k = (1 - a) * rk
After the first few messages of the burst, we get
r1 = (1 - a) * rk + a * r0 = a r0 + k r2 = a r1 + k = a2 r0 + a k + k r3 = a r2 + k = a3 r0 + a2 k + a k + k
which in general is
rn = an r0 + k
n-1 Σ x = 0 ax
Using the standard formula for the sum of a geometric progression we get
rn = an r0 + k
1 - an 1 - a
and when we plug in the unabbreviated k it simplifies to
rn = an r0 + rk (1 - an) = rk + an (r0 - rk) where
rk = c / i a = exp(-i / c)
This is the calculated sending rate after receiving n messages at intervals of i seconds with a smoothing constant of c, following a period whose computed rate was r0. When n is large, the exponential tends to zero so the smoothed rate tends to the incoming rate rk as you would expect.
By re-arranging the formula we can work out the number of messages n that can be sent in a burst before rn reaches the maximum rate m.
m - rk r0 - rk = an = exp(-n i / c)
- n i / c = log
m - rk r0 - rk
n = - rk log
m - rk r0 - rk = rk log
r0 - rk m - rk
So, for example, if we start with r0 = 0, then this formula gives us the following table. I've set m / c roughly constant for each column, so that the nominal permitted rate is roughly the same. However, larger smoothing time constants c allow bigger bursts, and the burst size happens to be roughly equal to m. This shows that the configuration parameters have a very intuitive meaning, which is a great boon.
Note that the the equality between the maximum rate and the maximum burst size is a side-effect of the choice of e as the significance factor - it turned out not to be arbitrary after all.
c = 86400, m = 100 | c = 18000, m = 20 | c = 3600, m = 4 | c = 900, m = 1 | |||||
i | rk | n | rk | n | rk | n | rk | n |
---|---|---|---|---|---|---|---|---|
0.001 | 86400000 | 100 | 18000000 | 20 | 3600000 | 4 | 900000 | 1 |
1 | 86400 | 100 | 18000 | 20.01 | 3600 | 4.002 | 900 | 1.001 |
10 | 8640 | 101 | 1800 | 20.1 | 360 | 4.02 | 90 | 1.01 |
60 | 1440 | 104 | 300 | 20.7 | 60 | 4.14 | 15 | 1.03 |
300 | 288 | 123 | 60 | 24.3 | 12 | 4.87 | 3 | 1.21 |
600 | 144 | 171 | 30 | 33.0 | 6 | 6.59 | 1.5 | 1.65 |
The above table includes a line for messages arriving at an interval of 1ms. This illustrates that the burst size limit is strictly applied to fast senders. An interval of 1ms is unlikely for whole messages, but very small intervals can occur if rate limiting is applied to individual RCPT commands, which can all arrive in a single packet.
The implementation will have to prevent attempts to measure the rate of RCPT commands with an interval of 0. A suitable response is to wait for one tick of the system clock (typically 10ms or 1ms) then re-compute the rate. The instantaneous rate will still easily be high enough for the burst size limit to work.
In the formulae above, the value c performs two functions: it is the exponential smoothing time constant, used to calculate a; and it is a scale factor for the units in which rates are measured, so they are not always per second. We set these to the same, so that users have fewer numbers to worry about and so that they don't have do deal with inconveniently small numbers of messages per second.
Other scale factors can be included in the formula without affecting its behaviour. In the formula for rn above, the scale factor is part of the value of k which is outside the sum of the geometric progression.
This can be used to scale the rate according to some measurement of size of the message, e.g. the count of bytes or recipients.
The model described above calculates an accurate rate for very slow senders, but this turns out not to be particularly useful in practice especially when scale factors are used as described in the previous section. The problem occurs because the first message of a spam run contributes to a rate averaged over a long period of inactivity, so the result is close to zero. This is bad if it has a large count of bytes or recipients because it delays the detection of a spam run.
An alternative model is to regard isolated events as counting only towards the smoothing period in which they fall, ignoring previous inactivity. This is equivalent to deciding that it doesn't make sense to say the rate is less than one, except that if a high-speed burst occurs it can take more than one period to forget about it.
When the rate calculation is scaled by some count and produces a value less than the count, the alternative model sets the rate equal to the count. This produces almost the same result as performing count rate updates one at a time in quick succession, since very short intervals increase the measured rate by one.
A client that exceeds the maximum rate, i.e. r_now > m must ensure that its future sending rate r_fut is below the maximum m, i.e. it must satisfy one of the following. I don't currently know how to solve them analytically so I'm not sure which is the most useful.
m > (1 - a) * r_fut + a * r_now m - r_fut > a * (r_now - r_fut)
m - r_fut r_now - r_fut > a = exp(-1 / r_fut) 1 > r_fut * log
r_now - r_fut m - r_fut
Note that the final inequality has the same form as the equation for n in the previous section. The meaning is that one message isn't enough to break the maximum rate.
$Cambridge: hermes/doc/antiforgery/ratelimit.html,v 1.18 2009/01/20 00:01:38 fanf2 Exp $