A PROBLEM IN STATISTICAL CLASSIFICATION THEORY

Philippe Flajolet

(Version of January 14, 1997)

This problem discussed here is at the origin of the whole Combstruct package. On October 8, 1992, Bernard Van Cutsem, a statistician at the University of Grenoble wrote to us:

In classification theory, we make use of hierarchical classification trees. I would need to generate at random such classification trees according to the uniform law. The elements to be classified may be taken as distinguished integers say from 1 to [Maple Math] . Do you know of an algorithm for doing this?

This led to a cooperation involving Paul Zimmermann, Bernard Van Cutsem, and Philippe Flajolet, out of which the general theory and the algorithms of Combstruct evolved, see Theoretical Computer Science , vol. 132, pp. 1-35. A first implementation was designed by Paul Zimmermann in 1993, under the name Gaia ( Maple Technical Newsletter , 1994 (1), pp. 38-46).

Van Cutsem's original question was motivated by the following problem: Classification programmes in statistics build classification trees, usually proceeding by successive aggregations of closest neighbours amongst existing classes. How can we measure the way a classification carries useful information and not just "random noise"? Certainly, "good" classification trees should exhibit characteristics that depart significantly from random ones. Hence the need to simulate and analyse parameters of random classification trees.

Statistics and classification theory

Specification

We start by loading the combstruct package.

> with(combstruct);

[Maple Math] [Maple Math]

A classification is either: 1) an atom; 2) a set of classification trees of degree at least 2. Atoms are distiguishable, hence we are in a labelled universe. Note that the Set construction translates a pure graph-theoretic structure with no ordering between descendents of a node.

> hier:=[H,{H=Union(Z,Set(H,card>1))},labelled]:

The original problem of Van Cutsem is solved by single commands like

> draw(hier,size=10);

[Maple Math] [Maple Math]

We may adopt a more concise representation format:

> lreduce:=proc(e) eval(subs({Set=proc() {args} end, Sequence=proc() [args] end},e)) end:

> lreduce(draw(hier,size=20));

[Maple Math] [Maple Math] [Maple Math]

Random generation takes only a few seconds while counting tables (that serve to determine splitting probabilities) are set up on the fly.

> for j from 20 by 20 to 100 do j,lreduce(draw(hier,size=j)) od;

[Maple Math] [Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

The number of objects of size [Maple Math] grows fast

> seq(count(hier,size=j),j=0..40);

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

This appears to be sequence M3613 of the Encyclopedia of Integer Sequences and it corresponds to "Schroeder's fourth problem". When the count is not too large, we can do exhaustive listings. This is made possible by Combstruct that is able to build canonical forms and generate elements under unique standard forms.

> for j to 4 do map(lreduce,allstructs(hier,size=j)) od;

[Maple Math]

[Maple Math]

[Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

Asymptotic analysis

We get generating function equations by combstruct[gfeqns]

> gfeqns(op(2..3,hier),z);

[Maple Math]

And combstruct[gfsolve] attempts different strategies to solve the system

> gfsolve(op(2..3,hier),z);

[Maple Math]

The solution involves Lambert's W function that is known to Maple: by definition, this is the solution of

[Maple Math] .

> H_z:=subs(",H(z));

[Maple Math]

Objects being labelled, this is an exponential generating function (EGF).

> H_ztayl:=series(H_z,z=0,20);

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

As usual, we also obtain the corresponding ordinary generating functions by a Laplace transform applied to the series expansion

> series(subs(w=1/w,w*inttrans[laplace](H_ztayl,z,w)),w,20);

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

The result is then directly comparable to the counting coefficients:

> seq(count(hier,size=j),j=1..18);

[Maple Math] [Maple Math] [Maple Math] [Maple Math]

In order to analyse the number of hierarchies, we must find the dominant singularity of their generating function. A plot detects a vertical slope near 0.4

> plot(H_z,z=0..1);

Here is a cute way to get the singularity "automatically". We express that the function ceases to be differentiable at its singularity.

> diff(H_z,z);

[Maple Math]

> rho:=solve(denom(")=0); evalf(rho,30);

[Maple Math]

[Maple Math]

Next, we know that the singular expansion determines the asymptotic form of coefficients. Thus, we look at

> H_s:=subs(z=rho*(1-Delta^2),H_z);

[Maple Math] [Maple Math]

> H_sing:=map(simplify,series(H_s,Delta=0,5));Delta=sqrt(``(1-z/rho));

[Maple Math] [Maple Math]

[Maple Math]

With this, we can get an asymptotic expansion for coefficients to any order, which is an interesting fact per se. Here is the first one:

> H_n_asympt:=n!*asympt(coeff(H_sing,Delta,1)*rho^(-n)*subs({cos(Pi*n)=1,O=0},simplify(asympt(binomial(1/2,n),n,2))),n);
evalf(",20);

[Maple Math]

[Maple Math]

And for n=50, we get

> round(evalf(subs(n=50,H_n_asympt),20)); count(hier,size=50); evalf(""/");

[Maple Math] [Maple Math]

[Maple Math] [Maple Math]

[Maple Math]

The error is of about 1% for [Maple Math] . A complete asymptotic expansion can be obtained by this method, by taking successive singular terms into account.

The number of classification stages

We examine the number of internal nodes in a classification. This corresponds to the number of classes actually created. The idea is to make use of "marks" in the form of Epsilon structures that have size 0 (and thus do not affect the combinatorial model).

> hier2:=[H,{H=Union(Z,Prod(class,Set(H,card>1))),class=Epsilon},labelled]:

Such marks do not affect the combinatorial model:

> seq(count(hier,size=j),j=0..11);

[Maple Math] [Maple Math]

> seq(count(hier2,size=j),j=0..11);

[Maple Math] [Maple Math]

(We change the formatting procedure to take Prod into account.)

> lreduce:=proc(e) eval(subs({Set=proc() {args} end, Sequence=proc() [args] end, Prod=``},e)) end:

Here is a random object with "class" marking classification nodes:

> lreduce(draw(hier2,size=20));

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

The system determined by equations over bivariate generating functions can be solved by Maple:

> gfeqns(op(2..3,hier2),z,[[u,class]]);

[Maple Math] [Maple Math]

> H_zu:=solve(H=z+u*(exp(H)-1-H),H);

[Maple Math] [Maple Math]

One gets averages by differentiation:

> H1_z:=subs(u=1,diff(H_zu,u));

[Maple Math] [Maple Math] [Maple Math]

Numerically, the mean number of nodes in a random classification tree of size [Maple Math] , when divided by [Maple Math] , is for [Maple Math] ,

> Digits:=5: evalf(series(H1_z,z=0,22)): seq(coeff(",z,j)/count(hier,size=j)/j*j!,j=1..20);

[Maple Math] [Maple Math] [Maple Math]

This suggests that a random classification may have about [Maple Math] classification stages.

We can in fact analyse this rigorously, using the asymptotic method already employed for counts.

> H1_s:=subs(z=rho*(1-Delta^2),H1_z):
H1_sing:=map(simplify,series(H1_s,Delta=0,5));
H1_n_asympt:=n!*asympt(coeff(H1_sing,Delta,-1)*rho^(-n)*subs({cos(Pi*n)=1,O=0},simplify(asympt(binomial(-1/2,n),n,2))),n);

[Maple Math] [Maple Math]

[Maple Math]

> C_classif:=asympt(H1_n_asympt/H_n_asympt,n,1); evalf(",20);

[Maple Math]

[Maple Math]

Thus, we have obtained (easily!) a new Theorem . In a random classification tree, the number of classification stages (internal nodes) is asymptotic to

[Maple Math] .

Degrees in random classification trees

The corresponding generating functions are now outside of the range of implicit functions that Maple knows about. Thus, a separate mathematical analysis is needed. However, an empirical analysis based on small sizes is already quite informative. The following code builds a specification where nodes of degree [Maple Math] are marked. The principle is the obvious set-theoretic equation

[Maple Math] .

The code uses combstruct[gfeqns] to generate the system of equations for each degree that is then expanded. In passing, it prints the corresponding generating function:

> deg_hier:=proc(k) local j,spec,n,dHH;
spec:=[H,{
H=Union(Z,Union(Set(H,card>k)),Prod(classif,Set(H,card=k)),seq(Set(H,card=j),j=2..k-1)),
classif=Epsilon},labelled];
dHH:=subs(u=1,diff(RootOf(subs({Z(z,u)=z,classif(z,u)=u,H(z,u)=H},
H(z,u)=subs(gfeqns(op(2..3,spec),z,[[u,classif]]),H(z,u))),H),u));
print(dHH);
seq(evalf(coeff(series(subs(u=1,dHH),z,27),z,n)/count(spec,size=n)*n!/n,5),n=1..25)
end:

> deg_hier(2);

[Maple Math]

[Maple Math] [Maple Math] [Maple Math]

> deg_hier(3);

[Maple Math]

[Maple Math] [Maple Math] [Maple Math]

> deg_hier(4);

[Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math]

> deg_hier(5);

[Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math]

Thus a random classification on [Maple Math] elements seems to have on average about

about [Maple Math] binary nodes;

about [Maple Math] ternary nodes;

about [Maple Math] quaternary nodes.

These results are consistent with the proved result that the total number of internal nodes is on average [Maple Math] . The simple pattern revealed by this computation suggests a formal proof (by singularity analysis) that the distribution of degrees in fact obeys a modified Poisson law. The following theorem, first found while developing this worksheet, appears to be new:

Theorem . The probability that a random internal node in a random hierarchy of size [Maple Math] has degree [Maple Math] satisfies asymptotically a truncated Poisson law

> tau:=sqrt(2*log(2)-1);
S:=expand(sum(exp(-tau)*tau^(k-1)/(k-1)!,k=2..infinity));
Pr(deg=k)=normal(1/S*exp(-tau)*tau^(k-1)/(k-1)!);

[Maple Math]

[Maple Math]

[Maple Math]

Equivalently, the mean number of nodes of degree [Maple Math] is asymptotic to

> C_classif/S*exp(-tau)*tau^(k-1)/(k-1)!;

[Maple Math]

Numerically, this evaluates to

> evalf([seq(",k=2..10)]);

[Maple Math] [Maple Math] [Maple Math]

These figures are consistent with what was found on sizes near 20. They show that nodes of degree 5 and higher have negligible chances of occurring.

Alternative models

Unlabelled hierarchies

A number of related models can be similarly analyzed. We examine here:

Unlabelled hierarchies: these represent the types of trees when one considers the elements to be classified as "indistinguishable". What we obtain is then reminiscent of chemical molecules (with an unrealistic element that would be capable of an arbitray valency).

Planar hierarchies, where one distinguishes the order between decsendents of classification node.


For unlabelled, hierarchies, we just need to change the qualifier of specifications to "unlabelled".

> hier4:=[H,{H=Union(Z,Set(H,card>1))},unlabelled]:

> ureduce:=proc(e) eval(subs({Set=proc() {[args]} end,Prod=proc() ``(args) end, Sequence=proc() [args] end},e)) end:

> ureduce(draw(hier4,size=20));

[Maple Math] [Maple Math]

Notice that internally, the setting up of counting tables is more complex as it involves a fragment of Polya's theory. The counting results grow much more slowly, since we distinguish fewer configurations.

> seq(count(hier4,size=j),j=0..30);

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

> for j to 6 do j,map(ureduce,allstructs(hier4,size=j)) od;

[Maple Math]

[Maple Math]

[Maple Math]

[Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

Planar hierarchies

We only need to change Set into Sequence to get the right classification:

> hier5:=[H,{H=Union(Z,Sequence(H,card>1))},unlabelled]:

> ureduce:=proc(e) eval(subs({Set=proc() {[args]} end,Prod=proc() ``(args) end, Sequence=proc() [args] end},e)) end:

> ureduce(draw(hier5,size=50));

[Maple Math] [Maple Math] [Maple Math] [Maple Math]

The counting sequence is

> seq(count(hier5,size=j),j=0..30);

[Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math] [Maple Math]

This is found as Sequence M2898 in the Encyclopedia of Integer Sequences by Sloane and Plouffe and is known as Schroeder's second sequence.This sequence has a dignified history and Stanley noticed recently that the element [Maple Math] already appears in Plutarch's [AD50- AD120 (!)] biographical notes on Hipparchus.

> gfsolve(op(2..3,hier5),z);

[Maple Math]

> H5_z:=subs(",H(z));

[Maple Math]

> series(H5_z,z=0,11);

[Maple Math] [Maple Math]

Here is finally one quick way to obtain a simple recurrence for these numbers: first guess the recurrence, then check your guess. This, and many alternatives are encapsulated in the Gfun package.

> with(gfun):
listtorec([seq(count(hier5,size=j),j=0..30)],u(n));

[Maple Math] [Maple Math] [Maple Math]

> rectodiffeq(op(1,"),u(n),Y(z));

[Maple Math] [Maple Math]

> dsolve(",Y(z));

[Maple Math]

Conclusion

Various models of random classification trees can be analysed both theoretically and empirically. Random generation is easy and the experiments lead to new conjectures (like the degree distribution) and even theorems (like the analysis of the number of classification stages). Returning to statistics, some properties of random trees appear to be present accross all models: for instance nodes of even moderately large degrees, [Maple Math] , are highly infrequent, and branching is predominantly binary. General observations of this type may be used to help distinguish classification trees without informational content ("random" trees) from meaningful ones.