MORE EXAMPLES OF MARKING COMBSTRUCT GRAMMARS

Marni Mishna July 1997

This worksheet continues to explore the possibilities with combstruct[mark]. There is an introductory worksheet that explains the idea of marking and contains some basic examples which are built upon here.

> with(combstruct);

Lowest Common Ancestor in Binary Trees

Given any two nodes in a tree there is a shortest distance between them through a nearest common ancestor. Determining the average distance between two points (similarly, the average distance to the nearest common ancestor) is a fundamental problem in combinatorics with applications in combinatorial optimization, compiling and parrallelism. We can determine this value for binary trees by considering a method similar to determining pathlength.

Average Distance Between external Nodes

Let us ease into the topic with a specialisation of the problem: nearest common ancestor among external nodes.

First consider the case where the nearest common ancestor is the root. We can graft the solution into the leaf of a larger tree to represent the more general case.

We wish to describe a bijection that counts all distances between the left children of the root and the right. We will do this by first considering a tree as stretched out, thus the product of a sequence of trees, and a node (the root). Based on our restriction, the two end trees of the sequence will be trees consisting of only a leaf. The root can occur anywhere along the sequence and to mark its place we use a sequence with one tree marked. We will say that the root appears after this node. (We will add separately the case where it occurs before). We will use a bivariate generating function to do the counting of the length of the sequence (and hence the distance between the two leaves).

> Bintree:={T=Union(L, Prod(N, T, T)), L=Epsilon, N=Atom,
Chain=Prod(A, Sequence(B), A), A=Prod(len,L), B= Prod(len, N,Union(Prod(Epsilon,T),Prod(T,Epsilon))), len=Epsilon}:

We mark the point in the sequence where the lowest common ancestor goes:

> B1:= mark(Bintree, unlabelled, [B,m1], 1):

We mark the leaves of the trees for grafting purposes:

> B2:= mark(B1, unlabelled, [L[0],m2], 1):

Next, we create the bijection. It is a product of the tree to graft into (with a leaf marked telling us where), the ancestor node, and the sequence with either a spot marked for the ancestor node or none (indicating it is at the beginning of the sequence) :

> BCount:= B2 union
{AncTree=Prod(T[0][1], N[0][0], Union(Chain[1][0], Chain[0][0]))}:

If we count the number of AncTrees of size n it will be the number of ways to choose two external nodes in all binary trees of size n.

> count([AncTree, BCount, unlabelled], size =3);

To count the distance we can sum up all of the occurrences of " len " in all of the AncTrees. We do this by determining the bivariate generating function.

> BGF:=gfsolve(BCount, unlabelled, x, [[u, len[0][0]]]):

> AT:=subs(BGF,AncTree(u, x));

To give each u the corresponding weight, we differentiate and set u= 1.

> CountLength:=eval(subs(u=1, diff(AT, u)));

This is the generating function for the sum of all shortest paths between leaves. We can find the coefficients with the series expansion, and in fact we can determine the closed form for the coefficients, using gfun.

> series(CountLength, x, 10);

> with(gfun):
T1:= expand(rsolve(diffeqtorec(holexprtodiffeq(CountLength, y(x)), y(x), u(n)), u(n)));

To find the average sum per tree we divide by the number of trees of size n . This is represented in the grammar as T[0][0]. To find the average distance we also need to divide by the number of pairs of leaves. A tree with n nodes has n+ 1 leaves and thus this is .

> Total:=rsolve(diffeqtorec(holexprtodiffeq(subs(BGF,T[0][0](u, x)), y(x)), y(x), u(n)), u(n));

> simplify(T1/Total/((n+1)*n/2),GAMMA);

Here are the values for a few small trees:

> seq(evalf(T1/Total/((n+1)*n/2)), n=1..10);

The distance to the nearest common ancestor is half of this value since the trees have a symmetric construction.

Asymptotically we have:

> map(combine, asympt(T1/Total/((n+1)*n/2), n), exp);

Average Distance Between any two Nodes

To extend the problem to nearest common ancestor between any two nodes we need to allow the end bits of our sequence to be more than just leaves, but in fact trees. This is attained with a simple substitution. We allow A to be a tree. This omits the case when one of the nodes is the ancestor. This case is omitted since it is not possible for a node to be both an ancestor of a second node and a leaf of the same tree, and this is precisely what is needed. To account for this, we add the possibility of a second chain, much like the first except it is terminated on the top end by a node (not a tree). To avoid over counting by one we do not add a tag on the end tree.

> Bintree:={T=Union(L, ST), ST= Prod(N, T, T), L=Epsilon, N=Atom,
Chain=Prod(A, Sequence(B), A), A=Prod(len,T), B= Prod(len, N,Union(Prod(Epsilon,T),Prod(T,Epsilon))), Chain2=Prod(T, Sequence(B, card>0)), len=Epsilon}:

We mark the point in the sequence where the node goes:

> B1:= mark(Bintree, unlabelled, [B,m1], 1):

We mark the leaves of the trees for grafting purposes:

> B2:= mark(B1, unlabelled, [L[0],m2], 1):

Next, we create the bijection. It is a product of the tree to graft into (with a leaf marked telling us where), the ancestor node, the sequence with either a spot marked for the ancestor or none (indicating it is at the beginning of the sequence). Or it is the chain that represents the case when one of the nodes is the ancestor.

> BCount:= B2 union
{AncTree=Union(Prod(T[0][1], N[0][0], Union(Chain[1][0], Chain[0][0])),PL),
PL=Prod(T[0][1], Chain2[0][0])}:

If we count the number of AncTrees of size n it will be the number of ways to choose two nodes or leaves from a binary tree of size n .

> count([AncTree, BCount, unlabelled], size = 2);

As before, to count the distance we can sum up all of the occurrences of " len " in all of the AncTrees. We can solve the bivariate generating function.

> BGF:=gfsolve(BCount, unlabelled, 'x', [['u', len[0][0]]]):

> AT:=subs(BGF,AncTree(u, x));

> CountLength:=eval(subs(u=1, diff(AT, u)));

> series(CountLength, x, 10);

Again, we can find a closed form for the coefficients using gfun:

> T1:=rsolve(diffeqtorec(holexprtodiffeq(CountLength, y(x)), y(x), u(n)), u(n));

To find the the average, we again need to divide by the number of trees of size n and the number of ways of choosing two nodes. In this case, since we are including internal nodes there are a total of 2n+1 nodes, and thus there are n(2n+1) ways of choosing two.

> Total:=rsolve(diffeqtorec(holexprtodiffeq(subs(BGF,T[0][0](u, x)), y(x)), y(x), u(n)), u(n)):

> AVG:= simplify(T1/Total/n/(2*n+1), symbolic);

> seq(evalf(AVG), n=1..10);

Notice that these values are smaller than when leaves alone were considered. We can also examine the asymptotic behaviour:

> map( simplify, asympt(AVG, n), symbolic);

Non-Crossing Configurations

To better understand the capabilities for determining mean and variance, we now consider the enumeration of planar configurations defined on vertices of a convex n -gon. We will look at six basic non-crossing configurations: trees and forests, graphs and connected graphs, dissections and partitions. We will use marking directly and indirectly via the commands combstruct[mean] and combstruct[variance] to determine properties of these figures. The simpler constructions offer more information, however we can determine specific moments and variances for sub-objects when the bivariate generating function is not available in closed form.

Trees and Forests

Trees

First we consider non-crossing trees and non-crossing forests. A tree is a sequence of butterflies attached to a root, where a butterfly is an ordered pair of trees whose roots have been merged into a single node. This merge step may duplicate the root, so a butterfly is expressed as the product of two forests and a root, where a forest is a sequence of butterflies. We are interested in the average number of leaves in such a tree.

> nctree:={T=Union(L, Prod(Z, Sequence(B, card>0))), F=Sequence(B, card>0), B=Union(Prod(F,Z,F), Prod(F, Z), Prod(Z, F), L), L=Atom}:

> marked_trees:= mark(nctree, labelled, [L, m], 2):

We can use the moment command to give us specific moments. The first moment is the average.

> evalf(moment([T, nctree, labelled], size =25, L, 1)/25);

Notice that this value is equal to the { number trees of size n with one leaf marked }/{ number of trees of size n}.

> evalf(count([T[1], marked_trees, labelled] ,size=25)/count([T[0], marked_trees,labelled], size=25))/25;

We can determine the general form of the coefficients of the generating functions of the marked productions and hence determine a closed form for the mean percentage of leaves in a tree of size n .

> TT:=gfsolve(marked_trees, labelled, z):

> T1:=rsolve(diffeqtorec(holexprtodiffeq(subs(TT, T[1](z)), y(z)), y(z), u(n)) minus {u(1) = 1, u(0)=0}, u(n));

> T0:=rsolve(diffeqtorec(holexprtodiffeq(subs(TT, T[0](z)), y(z)), y(z), u(n)) minus{u(0)=0}, u(n));

> T2:=rsolve(diffeqtorec(holexprtodiffeq(subs(TT, T[2](z)), y(z)), y(z), u(n)) minus{u(2)=0, u(1)=0, u(0)=0}, u(n));

The mean number of leaves in the tree is determined:

> simplify(T1/T0/n, GAMMA);

> asympt(", n);

Notice that 4/9 is near the value we had determined for the value at n=25.

We can also determine a closed form for the variance:

> evalf(variance([T, tbf, labelled], size =25, L)/25);

> var:=simplify(((2*T2+T1)/T0-(T1/T0)^2)/n, GAMMA);

> asympt(var, n);

The asymptotic forms tell us that both the mean and the variance tend to a constant.

> evalf(28/243);

This is similar to the value found at n=25 as well.

Forests

A non-crossing forest is a collection of non-crossing trees. We can obtain a forest by substituting in for each vertex of a tree a vertex and a forest, (possibly empty). We are interested in the average number of components in a forest on n nodes. Thus, we must distinguish the occurrence of a new, non-empty forest, marked by the non-terminal comp .

> fo:={B=Prod(Sequence(B),V,Sequence(B)),
V=Prod(Z, F),
F=Union(Epsilon, comp), comp=Prod(V, Sequence(B))}:

First, let us determine the variance and moments for a large class of forests:

> evalf(moment([F, fo, labelled], size =30, comp, 1))/30;

> evalf(variance([F, fo, labelled], size =30, comp))/30;

We have less success determining the general form with marking in this instance:

> mf:=mark(fo, labelled, [comp, m], 1):

> FF:=gfsolve(mf, labelled, z):

> F1:=rsolve(diffeqtorec(holexprtodiffeq(subs(FF, V[1](z)), y(z)), y(z), u(n)), u(n));

However, we do have the generating functions at our disposal.

Connected and General Graphs

Connected graphs

We now alleviate the constraint of tree and consider more general graphs. We build the graphs in a similar way, with a few added precautions to avoid double counting of nodes. Consider the set of children { ,.., } attached to the root . This set divides the graph into several parts. We have the graphs built on { , ..., } and { , .., } which are connected themselves by hypothesis, and between any two adjacent children and there either one connected graph, or two neighbouring connected graphs.

Any graph built from a sequence of successive vertices of the circle is a system of arches since the edges of the graphs, or arcs, are not allowed to cross. The endpoints of these arches are shared by the graphs built to the left and to the right of a given , so that we shall say that the size of an arch built on n points has size n- 1. We are interested here in Elementary Arches , that is arches that always contain an arch between the first and last points. General arches are easily obtained by sequencing EAs .

From this, we have the following combstruct specification. In particular, the 5 arguments of a congraph entity are as follows:

1. First Z: the root of the graph

2. Second Z: the leftmost point of the first EA

3 and 4. the two EA s built on { , ..., } and { , .., }

5. The sequence of EA s found between two consecutive children of . The first term corresponds to one EA ,

and the second one to two EA s. For the latter case, the Z in the Prod stands for the lefmost point of the second EA .

The component of interest here is edges. We are interested in the average number of edges in a connected graph. Every arch is an edge, plus the original edges to the children. Thus we distinguish these as EA and conseq .

> ar:={EA = Union(Sequence(EA, card >= 2),
Prod(Z, Sequence(EA), Sequence(EA))
),
congraph = Prod(Z,Z,Sequence(EA), Sequence(EA),Sequence(conseq)
),
conseq = Union(Sequence(EA,card>=1), Prod(Z,Sequence(EA),Sequence(EA)))}:

> marked_edges:= mark(ar, labelled, [[conseq, EA], m], 2):

> ME:=gfsolve(marked_edges, labelled, z):

Unfortunately rsolve cannot solve the recurrence relation which gfun gives it. We can examine the generating function though:

> simplify(subs(ME, congraph[1](z)),symbolic);

> P0:=rsolve(diffeqtorec(holexprtodiffeq(subs(ME, congraph[0](z)), y(z)), y(z),u(n)), u(n));

We also have moments for any size class available to us:

> evalf(variance([C, ar, labelled],size =30, [F, EA])/30);

> evalf(moment([C, ar, labelled],size =50, [F, EA],1)/50);

General Graphs

We create general graphs from connected graphs the same way we obtained forests from trees. A general graph is obtained from a connected one by the substitution Z -> Prod(Z, G). So that we just have to rewrite the previous grammar by adding a new symbol which makes this substitution.
However the decomposition of a connected graph misses a configuration for general graphs: the one where vertex
has no child. We are interested in two parameters in this instance: the number of edges and the number of components.

> br:={EA = Union(Sequence(EA, card >= 2),
Prod(V, Sequence(EA), Sequence(EA))
),
V=Union(Z,Comp), Comp=Prod(Z, G),
G=Union(Z,
Comp,
Prod(V,V,Sequence(EA), Sequence(EA),Sequence(F))),
F=Union(Sequence(EA,card>=1), Prod(V,Sequence(EA),Sequence(EA)))
}:

> marked_edges:= mark(br, labelled, [G, m], 2):

We attempt to solve the generating function:

> GE:= gfsolve(marked_edges, labelled, z):

> subs(GE, G[1](z));

Unfortunately, again rsolve is unable to determine an expression for the coefficients. However, we can solve for specific instances:

> evalf(variance([G, br, labelled],size =30, [F, EA])/30);
evalf(moment([G, br, labelled],size =30, [F, EA],1)/30);

Next, we handle components. Each occurrence of comp marks a new component.

> marked_comps:= mark(br, labelled, [Comp,m], 2):

> GC:= gfsolve(marked_comps, labelled, z):

> subs(GC, G[1](z));

Again, the closed form for coefficients is beyond the capabilities of rsolve. So, we look at specific examples:

> evalf(variance([G, br, labelled],size =30, Comp)/30);
evalf(moment([G, br, labelled],size =30, Comp,1)/30);

Dissections and Partitions

Dissections

A dissection of a convex polygon ={ , ..., } is a partition of the polygon into polygonal regions by means of non-intersecting diagonals. Our representation is the same as non-crossing graphs whose connected components are points (polygons on one point) , edges (polygons on two points) and cycles (all polygons). We are curious to determine the average number of parts in a dissection.

> dissG:={Di=Union(Z, P), P=Sequence(Di, card >= 2)}:

> marked_dp:= mark(dissG, labelled, [[Z, P], m], 2):
MD:=gfsolve(marked_dp, labelled, z);

A more careful look at this failure shows that it is actually due to a weakness in series. This can be overcome by increasing Order:

> Order:=20:

> MD:=gfsolve(marked_dp, labelled, z):

Unfortunately the recurrence set up by gfun is unsolveable using rsolve , so we cannot attain a closed form for the coefficients. However, we do have the generating function:

> simplify(subs(MD, Di[1](z)), power);

> evalf(variance([Di, dissG, labelled], size =75, [P, Z])/75);

> evalf(moment([Di, dissG, labelled], size =75, P, 1)/75);

Partitions

A non crossing partition of size n is a partition of [ n ]={1, 2, ..., n } such that if a < b < c < d and a block contains a and c , then no block contains b and d . Thus, if we consider each partition block as a polygon, partitions are the "forest" version of dissections. Given a partition block, possibly empty, one gets a bigger partition by substituting to each vertex the product between
Z and another partition. We are interested in the average number of partition blocks in a partition.

> partG:={P=Sequence(V, card>0), V = Prod(Z,Union(P, Epsilon))}:

> marked_parts:= mark(partG, labelled, [P, m], 2):
MP:=gfsolve(marked_parts, labelled, z):

We try to solve for the coefficients explicitly:

> P1:=rsolve(diffeqtorec(holexprtodiffeq(subs(MP, P[1](z)), y(z)), y(z),u(n)) minus{u(0)=0}, u(n));
P0:=rsolve(diffeqtorec(holexprtodiffeq(subs(MP, P[0](z)), y(z)), y(z),u(n)) minus{u(0)=0}, u(n));
P2:=rsolve(diffeqtorec(holexprtodiffeq(subs(MP, P[2](z)), y(z)), y(z), u(n)) minus{u(2)=0, u(1)=0, u(0)=0}, u(n));

> simplify(P1/P0/n, GAMMA);
asympt(", n);

The mean tends to the constant 1/2.

> var:=simplify(((2*P2+P1)/P0-(P1/P0)^2)/n, GAMMA);
asympt(var, n);

The variance tends to the constant 1/8.