Previous News
Next News
Issue last updated September 2,
1998.
I continue here a discussion of various talks presented at the Princeton meeting. Here, we'll examine caching. This is a fairly classical topic in the context of disk accesses that's also fairly hot nowadays as processors have built-in caches, as do browsers or web servers, and many other pieces of software.
The Merriam-Webster dictionary defines a cache, a word that comes from French, as a hiding place, and better for us as ``a computer memory with very short access time used for storage of frequently used instructions or data''. Even if you've never thought of caches, you're in fact using one (almost!) every day: think of your desktop to which you have immediate access, as opposed to your file cabinets that store huge amounts of data, but with slow access.
Initially, caches were meant to accelerate access to data
from disks or slow memory. Thus, they have been investigated early
by specialists of performance evaluation.
In the usual framework, you have a slow device of a large size m
that you may refer to quite often. The idea is to maintain
a fast but much smaller (and costlier) device of some size
in order to store the data that you anticipate will be most useful.
The potential
usefulness of a cache is based the fact that probabilities
of accesses are quite uneven.
For instance, you're much more likely to access data
on Analysis of Algorithms
than on
antidisestablishmentarianism
,
where the scores of references under Altavista are,
as of today at least, of 11,212 to 443 (!).
What also matters is the expected locality of references,
something that's a bit harder to capture by mathematical modelling.
(Has anybody even tried?)
Another issue is the cache management policy.
The most common choice is LRU, which stands for
Least Recently Used, meaning that the c items
that were most recently accessed are in the cache.
Other strategies are based on counters
or on FIFO (first-come-first-serve)
--you kill the oldest guys even if they're still useful.
On the analytic side,
early studies have concentrated on the Independence Reference Model,
where a long sequence is built of individual requests that are
independent in the usual probabilistic sense. Thus, the parameters
of the model are just a finite probability distribution
,and c, the cache size.
At the IFIP Congress in 1971, W. C. King proposed
expressions for the probability D of ``page faults''
(the probability of a non-hit of the cache) of various policies
under this model. The typical formula (for LRU),

When probabilities are equal, it is
a good combinatorial exercise for students to analyse
what goes on. An essential rôle is then played by Stirling
subset numbers (a.k.a numbers of the second kind).
defined by the fact that
is the number of ways of partitioning
into
k nonempty subsets. Generating functions involve
powers of exponentials since
![]()
Such product forms can be extended to nonuniform distributions pi. This is done in a paper with Gardy and Thimonier that is based on a description of cache fault events by shuffles of regular languages. Integral representations for the long-run cache fault probability D result, for instance,
![]()


A formula like (1) is also reminiscent of Poisson laws, which suggests alternative probabilistic approaches. This thread has ben pursued by Dobrow and Fill. See, for instance, ``The move-to-front rule for self-organizing lists with Markov dependent requests'' available under Bob Dobrow's publications.
The talk of Predrag Jelenkovic at Princeton revisits these
questions from an asymptotic perspective. Given that
the request probabilities have some soft tail,
how does the cache fault probability decrease under
larger cache sizes?
Both LRU caching and the closely related Move-To-Front rule
of self-organizing search are treated.
What is needed is an estimate of tail probabilities.
First, product formulæ of a shape similar to (1) give
the Laplace transform of the distribution of search costs (MTF)
and page faults. The explicit character of the integral
lends itself to asymptotic analysis.
The case of a generalized Zipf law,
with
,with
is solved. See
Jelenkovic's paper
for a nice
discussion of the literature and for the technical developments.
For instance, the probability distribution of
a reference R (i.e.,
in earlier notations)
and of the search cost C for MTF satisfy the elegant relation
![]()
Another interesting aspect
is treated by Knuth in the paper referred to above.
What happens when a cache is maintained
``clairvoyantly'', i.e., with perfect knowledge of the future?
In that case, a rule stated by Belady at IBM in 1966
states that the optimal choice at each stage consists in
throwing away from the cache the element whose
reference is farthest in the future.
Under the uniform distribution, Knuth proves that
this rule avoids faults with probability about
,which is surprisingly effective.
I don't think anybody has revisited Belady's rule in the case
of nonuniform reference probabilities.
This is likely to be a good problem to work on.
I remember, once at a lecture, somebody being criticized sharply for the totally unrealistic assumption of knowledge of the future. However, there are many contexts where this is precisely achievable: a program that produces postscript with fonts descriptions knows the size of a laser printer's memory (the ``cache'') and is free to do some look-ahead to decide when to include a font description; a compiler certainly knows the size of its target processor's cache beforehand and it can similarly do arbitrary look-aheads in order to decide what it should do.
Caching and adaptive list searching are intimately related, as the latter can be viewed as the special case where c=m, that is, `cache size = memory size'. In that situation, the cache becomes a list data structure the aim is to at optimize search times. The optimum arrangement of a sequential list is by order of decreasing probabilities, if probabilities are known in advance and requests are independent. (We usually give this as an exercise to our undergrads, don't we?).
In general, access probabilities are unknown a priori. Move-To-Front is then a way to approximate the optimum by moving an element to the front as soon as it is accessed. Another approach consists in keeping reference counts and keeping the elements in decreasing order of their counters. Sure enough, by some law of large numbers, the counters will eventually stabilize to give the right ordering, thereby ensuring optimality. How far should one go? That is the question.
Haddas Shachnai
has a page full of nice preprints that deal with various aspects
of resource allocation in distributed environments.
A paper
of Haddas Shachnai and Micha Hofri entitled
``The List Update Problem: Improved Bounds for the Counter
Scheme'' deals specifically with this problem.
There, it is shown for example that the Counter Scheme
is within
of the optimum after
time about
.The estimate holds for arbitrary probabilities.
Amongst other things on Haddas's site, you may find
a related
discussion
of tail probability bounds for sums of
independent random variables, where names like Chernoff,
Hoeffding, and Bennett are familiar.