Rate limiting demo


This is an animated demo of some rate measurement computations. Click on the graph to trigger an event, or use the tickybox to turn on automatic event generation. The graph shows the average rate of events. The red line is according to the formula used in Exim's current ratelimit code, which is designed to react quickly to changes in rate while retaining some memory of the past in minimal space. The blue line is a smoothed average of the red line, which might be suitable for automatically adaptive rate limiting.

You can adjust the periods over which the rate is averaged using the text boxes. The "decay interval" is for the red line and the "increase interval" is for the blue line. The vertical scale is adjusted relative to the decay smoothing period.

The graph is drawn with the HTML canvas element, so if you see no animation you may need to upgrade your browser. It should work with Safari and Firefox. The source of this page includes the Javascript code that draws the graph.





[Interactive demo: a graph canvas with controls and live readouts showing the smoothing periods (decay and increase), season length and count, and per-event values for the interval, instantaneous rate, smoothed rate, standard deviation, and seasonal averages.]

The maths

Models that don't work

There are a couple of models which I considered when developing Exim's ratelimit code, but which turned out to be unsuitable.

• The leaky bucket models a network buffer in which packets may arrive at random intervals and are forwarded at a fixed maximum rate. It's commonly used in networking code. Advantage: it uses a fixed amount of space and time to store and update a sender's rate. Disadvantage: the computed value tends to zero when the input rate is below the maximum and saturates when it is above, so it doesn't measure the actual rate.

• The moving average is the mean of the last N samples. It's commonly used in statistical applications such as forecasting. Advantage: The computed value is the average rate. Disadvantage: Requires space and time proportional to N.

The exponentially-weighted moving average, aka exponential smoothing, has both advantages and neither disadvantage, so it is the model I chose.
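For concreteness, here are minimal sketches of both rejected models, in the same Javascript as the demo code (the function names and shapes are illustrative, not from any real implementation):

	// Leaky bucket: the level drains at maxRate and gains 1 per event,
	// so it tends to zero below maxRate and grows without bound above
	// it, rather than tracking the actual rate.
	function leakyBucket(bucket, now, maxRate) {
	    var drained = (now - bucket.time) * maxRate;
	    bucket.level = Math.max(0, bucket.level - drained) + 1;
	    bucket.time = now;
	    return bucket.level;
	}

	// Moving average: the mean rate over the last N events, which
	// costs O(N) space for the stored timestamps.
	function movingAverage(times, now, N) {
	    times.push(now);
	    if (times.length > N) times.shift();
	    if (times.length < 2) return 0;
	    return (times.length - 1) / (now - times[0]);
	}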

Exponential smoothing

Exponential smoothing is usually defined for use with a data set which has samples at a fixed interval, dt, and is parametrized with a smoothing factor, 0 <= a <= 1. The smoothed data is derived using the formula:

	smooth_i = (1 - a) * data_i + a * smooth_(i-1)

The result is smoother when a is larger. We can define the smoothing period as the time the model takes to mostly forget old behaviour, allowing us to specify the smoothing parameter in a more human-friendly manner, in terms of the sample interval and the smoothing period.

	p = n * dt, where a^n = e^(-1)
	a = e^(-1/n) = e^(-dt/p)

Aside: The exact value of "most" is fairly arbitrary: it would be reasonable to choose 50% or 90% (base 2 or base 10) instead of 63% (base e).
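As an illustration, here is a minimal Javascript sketch (not the page's actual demo code) that smooths a fixed-interval series, deriving the smoothing factor from a human-friendly period as above:

	// Smooth fixed-interval samples; dt is the sample interval and
	// p the smoothing period, both in the same units of time.
	function smooth(data, dt, p) {
	    var a = Math.exp(-dt / p);
	    var out = [data[0]];
	    for (var i = 1; i < data.length; i++)
	        out[i] = (1 - a) * data[i] + a * out[i - 1];
	    return out;
	}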

It's fairly simple to generalize exponential smoothing to data sampled without a fixed interval, by substituting each sample's own interval into the above formula. In the following, t_i is the timestamp of sample i.

	a = e^(-dt_i/p), where dt_i = t_i - t_(i-1)
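The same update, written as a single step with per-sample intervals (again an illustrative sketch):

	// One smoothing step for irregularly-timed samples.
	function smoothStep(prev, prevTime, sample, time, p) {
	    var a = Math.exp(-(time - prevTime) / p);
	    return (1 - a) * sample + a * prev;
	}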

Measuring the average rate

We want to measure the average rate of events that occur at random intervals. We define the instantaneous rate of an event as follows. In practice we are always dealing with the current event, so dt is the interval since the previous event.

	r_inst = 1 / dt, where dt = t_this - t_prev

This gives the rate in events per second. As before, it seems convenient to use more human-friendly units such as events per hour or events per day. In fact Exim re-uses the smoothing period as the scale factor for measuring rates, so if you set a rate limit of 10 messages per hour then the averaging period is one hour. The instantaneous rate is then

	r_inst = p / dt
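For example, with p = 1 hour and an event arriving 6 minutes (360 seconds) after its predecessor, r_inst = 3600 / 360 = 10 events per hour.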

We use exponential smoothing to derive the average rate from the instantaneous rate and the previous calculated average.

	r_ave := (1 - a) * r_inst + a * r_ave
	where a = e^(-dt/p)

Aside: It turns out that the first term of this formula can be approximated as 1. This is usually a slight over-estimate, which makes the average rate converge to a value about 0.5 above the true rate.
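Putting the last two formulas together, here is a minimal sketch of the whole rate measurement (illustrative Javascript; Exim's real implementation is in C and differs in detail):

	// Update the measured average rate for one event.
	// state holds {rate, time}; p is the period in seconds.
	function updateRate(state, now, p) {
	    var dt = Math.max(now - state.time, 1e-9); // guard dt = 0
	    var a = Math.exp(-dt / p);
	    var rinst = p / dt; // instantaneous rate, events per period
	    // (1 - a) * rinst may be approximated as 1, as noted above
	    state.rate = (1 - a) * rinst + a * state.rate;
	    state.time = now;
	    return state.rate;
	}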

Properties of this recurrence

We can derive some useful properties of this recurrence if we assume that events occur at a constant rate. It then becomes a geometric progression, which we can sum (as detailed elsewhere) to obtain the rate it converges to from an initial value r_0 after n events occurring at a constant rate r_k.

	r_n = r_k + a^n * (r_0 - r_k)
	where a = e^(-dt/p) and r_k = p / dt

In the limit, a^n = 0, so r_n = r_k.

When events cease, the average rate decays exponentially. The time for it to decay from r_n to r_0 (for example, the time to recover from an excessively fast burst r_n to below the maximum rate r_0) is simply:

	p * ln(r_n / r_0)
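For example, with p = 1 hour, a sender that pushed its measured rate up to 100 per hour against a limit of 10 per hour takes 1 hour * ln(100 / 10) ≈ 2.3 hours to drop back below the limit.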

The number of events it takes to move from r_0 to r_n is useful to know. For example, we can calculate the maximum size of an over-rate burst by setting r_0 = 0 and r_n to the maximum permitted average rate. Solving the convergence formula for n (using ln a = -dt/p = -1/r_k) gives:

	n = r_k * ln((r_0 - r_k) / (r_n - r_k))

When r_k is very large, the value tends towards n = r_n. For example, if the maximum rate is 10 messages per hour, the maximum fast burst size is 10 messages. This is surprising but very handy, because it means that a maximum rate setting is also a maximum burst size setting.
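A quick simulation bears this out (a sketch reusing the updateRate function above; the 10-per-hour limit is from the example):

	// Events one second apart (far above the limit), starting from
	// a zero rate, with p = 3600s and a limit of 10 per hour.
	var state = { rate: 0, time: 0 };
	var n = 0;
	while (state.rate <= 10) {
	    n++;
	    updateRate(state, n, 3600);
	}
	// n comes out at 11: a burst of about r_n events reaches the limit.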

A qualitative summary is that the average rate computed this way reacts quite quickly to sudden bursts of events. This is a valuable property when we want to use the measurement to detect sudden bursts of events!

Limitations

While this model is very good at what it does, it does not help with some common problems.

• How should I set a rate limit? At the moment this is a manual operation. The postmaster is expected to set the limit based on an estimate of reasonable behaviour backed up with samples of real behaviour.

• How do I handle differences in behaviour of different senders? A flat rate limit has to be generous enough to accommodate busy senders but not so lax that bad behaviour by light users goes undetected. Manual rate limit setting doesn't scale to large user bases with highly variable usage patterns.

• How do I avoid overnight spam runs? Internet services have a strong diurnal load variation. A rate limit that makes sense for working hours is several times too large for the small hours.

Any attempt to answer these questions must currently be done with external tools. Say, for example, that you want to automatically derive a sender's rate limit from its long-term behaviour. You might be tempted to measure its rate twice, using short and long smoothing periods. Unfortunately, a sudden burst will immediately affect the "long-term" average as well as the short-term one, so it isn't the firm benchmark it was intended to be.

In mathematical terms, this is because the average rate increases by nearly 1 for each event, so the sender has fairly direct control over its rate of change. XXX: I think the rate of change of the smoothed rate is related to the difference between it and the average rate; I need to see if I can formalize this. How much control does this give the sender?
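To see the problem concretely, here is a sketch (again reusing updateRate; the periods and burst size are arbitrary) of a burst hitting both a short-period and a long-period measurement:

	// Measure one burst with a 10-minute and a 1-day period.
	var shortTerm = { rate: 0, time: 0 };
	var longTerm = { rate: 0, time: 0 };
	for (var t = 1; t <= 20; t++) { // 20 events, 1 second apart
	    updateRate(shortTerm, t, 600);
	    updateRate(longTerm, t, 86400);
	}
	// Both rates jump to nearly 20 (in events per their own period):
	// the burst moves the "long-term" average just as far.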

 


Tony Finch <dot@dotat.at> <fanf@exim.org> <fanf2@cam.ac.uk>
$Cambridge: hermes/doc/antiforgery/ratelimit-demo.html,v 1.11 2009/01/19 22:24:44 fanf2 Exp $