BALLS AND URNS, ETC.
Philippe Flajolet
(Version of December 14, 1996)
Balls and urns models are basic in combinatorics, statistics, analysis of algorithms, and statistical physics. These models are nicely decomposable and their basic properties can be explored using tools developed for the automatic manipulation of combinatorial models, like Combstruct .
As is well-known there are four types of models, depending on whether balls and urns are taken to be distinguishable or not.
The four basic models
We consider the placement of balls into urns in all possible ways. For definiteness, we examine only the situation of nonempty urns, so that the number of possible configurations of a fixed size (i.e., a fixed number of balls) is always finite. If the balls are distinguishable, we may assume them to be numbered consecutively by integers 1, 2, ..., ; in this case, we are dealing with labelled structures, and balls are labelled atoms. If the balls are indistinguishable, then we simply regard them as anonymous unlabelled atoms (generically called Z , by a global convention of Combstruct). If the urns are distinguishable, we may view them as arranged in a row, so that we are dealing with a Sequence construction; otherwise, we have a Set construction. (The Set construction of Combstruct means a multiset, that is to say a set where repetitions are allowed.)
Balls are not ordered within an urn, so that an urn is a priori a Set of balls. This gives rise to four different models:
DBDU: distinguishable balls and distinguishable urns; we are dealing with Sequences of Sets, in a labelled universe;
DBIU: distinguishable balls and indistinguishable urns; we are dealing with Sets of Sets, in a labelled universe;
IBDU: indistinguishable balls and distinguishable urns; we are dealing with Sequences of Sets, in an unlabelled universe;
IBIU: indistinguishable balls and indistinguishable urns; we are dealing with Sets of Sets, in an unlabelled universe.
In combstruct, this is expressed by four different, but similar looking, specifications:
> with(combstruct);
> DBDU:=[S,{S=Sequence(U),U=Set(Z,card>=1)},labelled]:
> DBIU:=[S,{S=Set(U),U=Set(Z,card>=1)},labelled]:
> IBDU:=[S,{S=Sequence(U),U=Set(Z,card>=1)},unlabelled]:
> IBIU:=[S,{S=Set(U),U=Set(Z,card>=1)},unlabelled]:
> for spec in DBDU,DBIU,IBDU,IBIU do draw(spec,size=10) od;
The corresponding counting sequences satisfy natural domination conditions that one can summarize by the informal inequality: "Distinguishable>Indistinguishable"
> for spec in DBDU,DBIU,IBDU,IBIU do seq(count(spec,size=j),j=1..12) od;
In the sequel, it is convenient to represent objects by a more concise notation. We thus introduce "reduction" procedures for labelled and unlabelled objects:
>
lreduce:=proc(e) eval(subs({Set=proc() {args} end, Sequence=proc() [args] end},e)) end:
ureduce:=proc(e) eval(subs({Set=proc() {[args]} end, Sequence=proc() [args] end},e)) end:
Since the set construction "{}" in Maple does not keep multisets, an unlabelled (multi)set will be represented as "{[...]}".
>
for spec in DBDU,DBIU do lreduce(draw(spec,size=25)) od;
for spec in IBDU,IBIU do ureduce(draw(spec,size=25)) od;
On such simulations, we see that there tends to be fewer urns in models of type IU, but more filled ones.
Distinguishable balls (labelled structures)
Distinguishable urns
In this model, we deal with distinguishable balls (labelled atoms) that go in all possible way into distinguishable urns corresponding to the specification:
> DBDU:=[S,{S=Sequence(U),U=Set(Z,card>=1)},labelled]:
Combinatorially, this model is the same as of Surjections from to an initial segment of the integers. It is the one that leads to larger cardinality counts.
> for j to 3 do j=map(lreduce,allstructs(DBDU,size=j)) od;
> seq(count(DBDU,size=j),j=0..30);
Such tables are quite useful for checking various combinatorial conjectures. Here, we may verify that these numbers are the sequence
M2952
of the
Encyclopedia of Integer Sequences
by Sloane and Plouffe, where they are known as the numbers of preferential arrangements of
things.
The counting problem is solved automatically by
combstruct[gfeqns]
,
combstruct[gfseries]
(a series alternative to
combstruct[count]
) and
combstruct[gfsolve]
:
> gfeqns(op(2..3,DBDU),z);
> Order:=12: gfseries(op(2..3,DBDU),z);
> gfsolve(op(2..3,DBDU),z);
In particular, we have found the exponential generating function (EGF) explicitly:
> S_z:=subs(",S(z)); series(S_z,z=0,7);
The EGF is singular with a pole at . This immediately gives an approximate expression for the coefficients:
> series(S_z,z=log(2),3);
> S_n_asympt:=1/2*n!*log(2)^(-n-1);
As usual with meromorphic functions, the approximation is extremely good:
> for j from 0 by 5 to 30 do j,evalf(count(DBDU,size=j)/subs(n=j,S_n_asympt),30); od;
This type of analysis can be easily generalized to determine for instance the expected number of urns in a random surjection. Such analyses may then be used to validate an a priori statistical model by comparing theoretical predictions against empirical data.
Indistinguishable urns (set partitions)
We are now dealing with indistinguishable urns. Equivalently, we consider the way elements (the labels 1, ..., ) may be grouped into equivalence classes in all possible ways.
> DBIU:=[S,{S=Set(U),U=Set(Z,card>=1)},labelled]:
> for j to 4 do j=map(lreduce,allstructs(DBIU,size=j)) od;
> seq(count(DBIU,size=j),j=0..30);
Such tables are quite useful for checking various combinatorial conjectures. Here, we may verify that these numbers are the sequence M1484 of the Encyclopedia of Integer Sequences by Sloane and Plouffe. They are the well-known Bell numbers of combinatorial theory that also appear as moments of the Poisson distribution, in the calculus of finite differences, etc.
We automatically obtain the exponential generating function as
> gfsolve(op(2..3,DBIU),z);
> P_z:=subs(",S(z));
> series(P_z,z=0,8);
> Order:=8: gfseries(op(2..3,DBIU),z);
By expanding and truncating, we obtain excellent approximations (this is in fact a version of a formula found by Dobinski in 1877):
>
P_n_asympt:=exp(-1)*Sum(k^n/k!,k=0..2*n);
for j by 3 to 20 do j,evalf(count(DBIU,size=j)/subs(n=j,P_n_asympt),30); od;
Indistinguishable balls (unlabelled structures)
Distinguishable urns (integer compositions)
We start from the specification
> IBDU:=[S,{S=Sequence(U),U=Sequence(Z,card>=1)},unlabelled]:
In this particular case, as balls are indistinguishable, we may as well consider urns as Sequence of atoms. The reason for doing this is a simpler form of generating functions (as we do not have to go unnecessarily through Polya operators) as well as faster computations. We can check that this new version is equivalent to the earlier one, namely
> IBDU_0:=[S,{S=Sequence(U),U=Set(Z,card>=1)},unlabelled]:
> seq(count(IBDU,size=j),j=0..20); seq(count(IBDU_0,size=j),j=0..20);
Of course, here we recognize the powers of two: the result is combinatorially obvious since
a partition can be obtained by inserting arbitrary cuts in the integer interval 1, ..., . We can also check this with combstruct[gfsolve]
> gfsolve(op(2..3,IBDU),z);
> SS_z:=subs(",S(z));
> series(SS_z,z=0,10);
Indistinguishable urns (integer partitions)
We start from the specification
> IBIU:=[S,{S=Set(U),U=Sequence(Z,card>=1)},unlabelled]:
Combinatorially, we are specifying integer partitions that describe the occupancy profile of urns.
> for j to 6 do j=map(ureduce,allstructs(IBIU,size=j)) od;
Naturally, since we are dealing with sets of summands (the order does not count), we may as well regard these objects as an increasing sequence of summands that sum to the size , or equivalently as "staircases" with size being the area below the staircase.
> preduce:=proc(e) sort(eval(subs({Set=proc() [args] end, Sequence=proc() nargs end},e))) end:
> rand_part:=draw(IBIU,size=100): ureduce(rand_part); preduce(rand_part);
There are much fewer structures than in previous models:
> seq(count(IBIU,size=j),j=0..30);
The random generation process is nontrivial as one must generate objects up to certain symmetries. The first time, counting tables are set up on the fly, so that random generation takes a few seconds for size .
> for i from 0 by 20 to 100 do i,preduce(draw(IBIU,size=i)); od;
Next, random generation becomes faster:
> for i to 10 do preduce(draw(IBIU,size=100)); od;
In this particular case, the random generation procedure that is automatically built by Combstruct coincides with a method especially designed by Wilf for integer partitions. By design, Combstruct accepts in full generality arbitrary compositions of Set and Cycle constructions (in addition to Union, Product, Sequence, etc).
The generating functions are now more complicated since they involve Polya operators.
> gfeqns(op(2..3,IBIU),z);
> gfsolve(op(2..3,IBIU),z);
Such Polya operators are however known to combstruct[gfseries]
> gfseries(op(2..3,IBIU),z);
though they are not otherwise "known" to Maple
> subs("",S(z)); series(",z=0);
Error, (in series/exp) unable to compute series
In such cases, one has to resort either to simplification by hand (not always possible) or to the literature. Here, it is very well known that the generating function of integer partitions is
> PP_z:=Product(1/(1-z^k),k=1..infinity);
> Order:=12: series(subs(infinity=Order+2,PP_z),z=0);
Constrained models
Number of urns in surjections
The approach developed so far may be tuned to analyse a variety parameters. We explore here the way Combstruct may serve to analyse the number of urns as well as related situations with bounded urn capacity. We focus on counts and building numerical tables. Naturally, random generation and exhaustive listing are possible from any of these specifications.
We deal here with a fixed number of urns in the model DBDU that corresponds to surjections. Combinatorial specifications may actually be computed in Maple, then used by Combstruct.
The following procedure computes the specifications with urns.
> surj:=[S,{S=Sequence(U,card=r),U=Set(Z,card>=1)}, labelled]: subs(r=5,surj);
In passing, this illustrates the use of cardinality modifiers for Sequence, Set, and Cycle constructions.
The following counts imply the first few values of the probability distribution of the number of urns in a random unconstrained surjection:
> for i to 5 do seq(count(subs(r=i,surj),size=m),m=0..10) od;
and we may build tables of probability distributions automatically:
> for i to 7 do seq(evalf(count(subs(r=i,surj),size=m)/count(DBDU,size=m),4),m=0..10) od;
Finally, we are led to rediscover the corresponding generating functions. Usually, this is done via recurrence computations, and what we obtain here is equivalent to the EGF of Stirling second kind (partition) numbers :
> for i to 5 do gfsolve(op(2,subs(r=i,surj)),labelled,z) od;
This suggests a pattern involving Pascal's triangle:
> Surj_z:=proc(k) local j; add(binomial(k,j)*(-1)^(k-j)*exp(z)^j,j=0..k) end:
> Surj_z(6); gfsolve(op(2,subs(r=6,surj)),labelled,z);
For coefficients finally, we have by straight expansion
.
> Surj_nk:=proc(n,k) local j; `if`(n<>0,add(binomial(k,j)*(-1)^(k-j)*j^n,j=0..k),1) end:
The formula matches the values obtained by combstruct[gfseries]
> Order:=16: gfseries(op(2..3,subs(r=7,surj)),z);
> seq(Surj_nk(n,7)/n!,n=0..15);
Number of parts in integer compositions
This is the model IBDU.
> comp_r:=[S,{S=Sequence(U,card=r),U=Sequence(Z,card>=1)},unlabelled]:
> for i to 5 do seq(count(subs(r=i,comp_r),size=m),m=0..15) od;
> for i to 5 do subs(gfsolve(op(2,subs(r=i,comp_r)),labelled,z),S(z)) od;
The pattern is obvious and this corresponds to an explicit (and well-known!) binomial formula for the coefficients.
Number of parts in integer partitions
This is the model IBIU.
> part_r:=[S,{S=Set(U,card=r),U=Sequence(Z,card>=1)}, unlabelled]:
We can build tables, like before
> for i to 5 do seq(count(subs(r=i,part_r),size=m),m=0..15) od;
The generating functions are found by combstruct[gfsolve] to be rational:
> for i to 5 do gfsolve(op(2,subs(r=i,part_r)),unlabelled,z) od;
The form is then easily inferred from the factored representations, as one recognizes cyclotomic polynomials.
> S5_z:=subs(",S(z));
> factor(S5_z); series(",z=0,20);
> z^5/mul(1-z^i,i=1..5); factor("); series(",z=0,20);
Bounded capacity in the DBDU model (hashing)
We now consider configurations where the number of urns is a fixed number , so that one can relax the constraint that urns need to be nonempty. We are thus dealing with the collection of the functions of to , not necessarily surjections. Such specifications also describe words (size being length) on an alphabet of cardinality and this may be used to model hashing sequences in a table of length . The set of all possible sequences ( fixed) is specified by
> hash_r:=[S,{S=Prod(U$r),U=Set(Z)},labelled]: subs(r=10,hash_r);
Here, we use a hack, with Prod instead of "Sequence" with fixed cardinality, since the current version of Combstruct does not accept Sequence applied to an argument that leads to an Epsilon structure, even in the case of a bounded cardinality modifier.
We change the reduction procedure to take this new construct into account.
>
lreduce:=proc(e) eval(subs({Set=proc() {args} end, Prod=proc() [args] end},e)) end:
ureduce:=proc(e) eval(subs({Set=proc() {[args]} end, Sequence=proc() [args] end,Prod=proc() [args] end},e)) end:
The number of objects of size is, as predicted, .
> seq(count(subs(r=10,hash_r),size=j),j=0..10);
Next, we consider urns with a bounded maximum capacity :
> hash_br:=[S,{S=Prod(U$r),U=Set(Z,card<=b)}, labelled]: subs({r=5,b=3},hash_br);
Such specifications also describe words (size being length) on an alphabet of cardinality , given that no letter is used more than times. This models hashing sequences in a table of length when pages have capacity .
> lreduce(draw(subs(r=10,hash_r),size=30));
> lreduce(draw(subs({r=10,b=4},hash_br),size=30));
The following command automatically creates a table of maximum urn occupancy: the -th entry in the -th line is the probability that balls thrown into 10 urns fit into urns of capacity .
> for i 1 to 6 do seq(evalf(count(subs({r=10,b=i},hash_br),size=j)/count(subs(r=10,hash_r),size=j),4),j=0..20); od;
Syntax error, unexpected number
Naturally, Combstruct automatically recognizes "impossible" configurations:
> draw(subs({r=10,b=3},hash_br),size=31);
Error, (in combstruct/drawgrammar) there is no structure of this size
Bounded capacity in the IBDU model (bosons)
This is a model of integer compositions. (A more sophisticated example that is related to submarine detection is treated below.) We consider here a fixed number of urns and this is exactly the so-called "Bose-Einstein" model of statistical physics, where the corresponding objects are called bosons.
Mark Kobrak, a chemist at the University of Chicago wrote to us (December 13, 1996):
The basic problem is this: I am interested in simulating a problem in laser spectroscopy. A molecule has n modes, and I need to generate every possible combination which places up to m quanta in each mode. As I go through each one, I will analyze its energy.
This is exactly like a problem where, given n jars, you may put up to m marbles in each jar. The marbles are indistinguishable. I know how to count the number of combinations, but I need a good way to program a computer to go through all these combinations.
We let again denote the number of urns and the bucket capacity, i.e., the maximum number of urns that may fit into any given urn. We have the model of integer compositions:
> boson_br:=[S,{S=Prod(U$r),U=Sequence(Z,card<=b)}, unlabelled]: subs({r=5,b=3},boson_br);
For instance, here are all the 95 possible ways of putting marbles ( ) into jars, each jar being of maximum capacity .
> seq(count(subs({r=5,b=2},boson_br),size=j),j=1..4);
> for j to 4 do j=map(ureduce,allstructs(subs({r=5,b=2},boson_br),size=j)) od;
Stack polyominoes
Polyominoes are familiar objects of combinatorial mathematics. A stack polyomino or stack is a piling up of nonempty integer intervals, each interval being included in the previous one. The number of intervals comprising a stack are called the height; the total length of the intervals is called the size. For instance, the collection
> [1,12],[3,8],[4,7],[4,7],[6,7];
is a stack of height 5 and size
> (12-1)+(8-3)+(7-4)+(7-4)+(7-6);
We may assume stacks to be left justified, starting at 1. Stacks of height exactly are specified by ( in the example)
> stack:=[st,{st=Prod(left,right),left=Set(U,card=r),right=Set(U,card<r),U=Sequence(Z,card>=1)},unlabelled]: subs(r=5,stack);
Combinatorially, we view a stack as an ascending staircase (the left part) of height followed by a descending staircase (the right part) of height at most . Staircases of fixed height are defined as partitions into bounded summands.
The distribution of height in stacks is for small height:
> for i to 6 do seq(count(subs(r=i,stack),size=m),m=0..20) od;
The total number of stacks is:
> for i to 6 do for m from 0 to 10 do stnum[i,m]:=count(subs(r=i,stack),size=m) od od: for m to 6 do m,convert([seq(stnum[i,m],i=1..6)],`+`) od;
This is M1102 of the Encyclopedia of Integer Sequences . The table given there is incomplete. However, from an earlier computation of generating functions, we have a formula for the generating function of all stacks:
> Q:=Product(1-z^k,k=1..n); Stack_z:=1+Sum(z^r/subs(n=r,Q)/subs(n=r-1,Q),r=1..infinity);
> Order:=30: series(value(subs(infinity=Order+3,Stack_z)),z=0);
A problem in submarine detection
Problem 68-16 that appeared in the 1968 volume of SIAM Review reads as follows:
Problem 68-16. A combinatorial Problem, by Melda Hayes (Ocean Technology Inc.).
In how many ways can identical balls be distributed in boxes in a row such that each pair of adjacent boxes contains at least 4 balls? This problem arose in some work on submarine detection.
The problem is thus relative to integer compositions with constrained summands. The constraints are reminiscent of maximum capacity problems but they concern successive summands. For pedagogical reasons, we decompose the solution in two phases:
a problem of counting words over a 5-letter alphabet that translates the succession contraint;
the original problem.
The word counting problem
Consider an alphabet comprising letters
> A=seq(a.j,j=0..4);
There , for represents symbolically a summand of size ; the quantity represents a summand of cardinality at least 4. The first grammar specifies words over the alphabet such that the sum of indices of any two consecutive letters is at least 4.
>
grammar1:=[Q4,{
seq(Q.i=Union(Epsilon,seq(Prod(a.j,Q.j),j=4-i..4)),i=0..4),
seq(a.j=Z,j=0..4)
},unlabelled];
The grammar above is just a translation of the finite automaton that recognizes the language of symbolic constraints. We can solve the counting problem easily.
> seq(count(grammar1,size=j),j=0..20);
> gfsolve(op(2..3,grammar1),z);
The generating function of all words has been found to be
> Q_z:=factor(subs(",Q4(z)));
> series(Q_z,z=0,20);
The asymptotics results from locating the dominant pole:
> Digits:=30: fsolve(denom(Q_z),z);
> rho:=fsolve(denom(Q_z),z,0..1);
> c:=-subs(z=rho,numer(Q_z)/diff(denom(Q_z),z))/rho;
> Q_n_asympt:=proc(n) round(c*rho^(-n)) end:
> count(grammar1,size=30); Q_n_asympt(30); evalf("/"");
The integer composition problem
For the original problem, we only need to interpret for as meaning a summand of value , that is to say repeated times, and as a summand of value at least 4.
>
grammar2:=[Q4,{
seq(Q.i=Union(Epsilon,seq(Prod(a.j,Q.j),j=4-i..4)),i=0..4),
seq(a.j=Sequence(Z,card=j),j=0..3),a4=Sequence(Z,card>=4)
},unlabelled];
> seq(count(grammar2,size=j),j=0..30);
We then get the generating function as before.
> gr2_sys:=gfsolve(op(2..3,grammar2),z);
> QQ_z:=factor(subs(",Q4(z)));
> series(QQ_z,z=0,31);
> rho:=fsolve(denom(QQ_z),z,0..1);
> QQ_n_asympt:=proc(n) round(subs(z=rho,-1/diff(denom(QQ_z),z)/rho*numer(QQ_z))*rho^(-n)) end: QQ_n_asympt(n);
> count(grammar2,size=40); QQ_n_asympt(40); evalf(""/");
This solves the original problem. A detailed solution involving reduction of infinite matrices was submitted by D. R. Breach, University of Toronto.
Fixing the number of boxes
The published solution by D. R. Breach also provided detailed tables for number of balls (size) and number of boxes . Like before, we could generate specifications for each given value of . However, it is possible to solve simultaneously all such problems by means of bivariate generating functions. Roughly, Combstruct makes it possible to insert marks (in the form of Epsilons that do not modify the counting results).
The specification generates objects where all atoms are anonymously labelled .
> allstructs(grammar1,size=2);
If we wish to keep track of additional informations, we can just take products with suitable structures of size 0 (Epsilons). This is a general programming technique of Combstruct. Here we use b.j to mark an occurrence of an a.j.
>
grammar1bis:=[Q4,{
seq(Q.i=Union(Epsilon,seq(Prod(a.j,Q.j),j=4-i..4)),i=0..4),
seq(a.j=Prod(Z,b.j),j=0..4),
seq(b.j=Epsilon,j=0..4)
},unlabelled];
> subs({Prod=``,Epsilon=NULL},allstructs(grammar1bis,size=2));
Likewise, we may insert a generic marker f or each summand in the compositions described by :
>
grammar2bis:=[Q4,{
seq(Q.i=Union(Epsilon,seq(Prod(a.j,Q.j),j=4-i..4)),i=0..4),
seq(a.j=Prod(b,Sequence(Z,card=j)),j=0..3),a4=Prod(b,Sequence(Z,card>=4)),
b=Epsilon
},unlabelled];
The functions combstruct[gfeqns] and combstruct[gfsolve] respect such marks and allow for the possibility of assigning auxiliary variables for such marks. Thus, one can determine multivariate generating functions. Here is the variable associated to mark .
> gfeqns(op(2..3,grammar2bis),z,[[u,b]]);
Solving for , we find:
> QQ_zu:=(-u^2*z^2-2*u^2*z^3+u^4*z^6+u+1-u^3*z^4+u*z)*(-1+z)/(u*z^2-1)/(u^4*z^8-u^2*z^5-2*u^2*z^4-u*z^3-z+1);
We then automatically obtained the main table in the solution published by SIAM Review . We can even detect errors: for instance, in the last line we must have 1261 (instead of 1211) and 4756 (instead of 4762)!
> map(series,series(QQ_zu,z=0,21),u,infinity);
We have also easy access to any subproblem defined by fixing the number of boxes:
> map(normal,series(QQ_zu,u=0,8));