Enumerating alcohols and other classes of chemical moleculs,
an example of Polya theory
Frederic Chyzak
(Version of January 13, 1997. Closing note added on February 25, 2008)
Alkanes are a simple class of chemical compounds. They are generically described by the chemical formula . First examples for small are methane ( ), ethane ( ), propane ( ), butane ( ), a.s.o. For a given however, there exist several different isomers , i.e., different structures of bonds between atoms. In chemistry, there is much interest in knowing the number, or better yet the list, of such isomers. Alcohols are obtained from alkanes by replacing a hydrogen atom by an group. It follows that they are isomorphic to carbon chains with a distinguished node, or again to alkyl radicals , which are alkanes with a missing hydrogen atom. If we disregard geometrical constraints (i.e., if we consider structural isomers only, and not conformational isomers), this leads to a pure graph-theoretical problem: how many rooted trees are there with n internal nodes, where each internal node has degree 4?
In this session, we thus consider rooted trees, so that we count and enumerate alkyls, with generic formula . The combinatorics also corresponds to simple alcohols , organo-metalic compounds , and any other monosubstituted alkanes. We next treat the cases of disubstituted and trisubstituted alkanes. We develop the study of our models using the package Combstruct .
> with(combstruct);
Enumerations of such classes of chemical compounds are part of Polya theory. We refer to the book by G. Polya and R. C. Read [ Combinatorial Enumeration of Groups, Graphs, and Chemical Compounds , (1987), Springer-Verlag] for more extensive results.
Monosubstituted alkanes,
In this section, we study monosubstituted alkanes, i.e., rooted trees, first without any constraint, next according to the height .
General alkyls
Definition
An alkyl radical can be viewed as a carbon atom linked to at most 3 alkyl radicals. Thus, we only take into account hydrogen atoms implicitly. There is no loss of information, since hydrogen atoms can always be recovered from the carbon skeleton. This yields the class equation , which we map into the following grammar:
>
gramm_Alkyl:=Alkyl=Prod(Carbon,Set(Alkyl,card<=3)),Carbon=Atom:
specs_Alkyl:=[Alkyl,{gramm_Alkyl},unlabelled]:
Note that since the Set construct denotes multisets, i.e., sets with repetitions, a carbon atom of an alkyl is allowed to be bound to two copies of the same subtree (but the order of the subtrees does not matter).
Define the size of an alkyl as the number of carbon atoms it contains. We compute the number of alkyls of a given size using combstruct[count] .
> seq(count(specs_Alkyl,size=i),i=0..50);
This series appears as the entry M1146 ("quartic planted trees with nodes") in the book by N. J. A. Sloane and S. Plouffe [ The Encyclopedia of Integer Sequences , (1995), Academic Press].
Here is an example of an alkyl with 6 carbon atoms, obtained by the command combstruct[draw] .
> alk:=draw(specs_Alkyl,size=6);
The following procedure rewrites an alkyl into a more readable way.
> nice:=proc(alk) eval(subs({Epsilon=NULL,Carbon=C,Prod=proc() global H; [args] end,Set=proc() args end},alk)) end:
> nice(alk);
The following procedure computes the size of a given alkyl.
>
size:=proc(alk) option remember; 1+convert(map(size,op(2,alk)),`+`) end:
size(Prod(Carbon,Epsilon)):=1:
The following procedure computes the height of a given alkyl.
>
height:=proc(alk) option remember; 1+max(op(map(height,op(2,alk)))) end:
height(Prod(Carbon,Epsilon)):=1:
Here is an alkyl with 50 carbon atoms, its nice representation and height.
> alk:=draw(specs_Alkyl,size=50):
> nice(alk);
> height(alk);
Empirical study
Drawing
By drawing several random structures, we can study probabilistic properties of alkyls. For instance, the following is a probabilistic estimate of their height on average:
> for i to 10 do ho[i]:=height(draw(specs_Alkyl,size=50)) od;
> add(ho[i],i=1..10)/10.;
In the same way, we get a probabilistic estimate of their standard deviation:
> sqrt(add((ho[i]-")^2,i=1..10)/10);
Exhaustive enumeration
The command combstruct[draw] permits us to draw one structure at random. We can also generate all alkyls of a given size, using combstruct[allstructs] , so as to compute the mean of a particular parameter exactly, or to count all those with a particular property. For instance, the height of trees cannot be represented in the class of combinatorial structures when using Combstruct . For instance, by computing all alkyls of size 5, we get the distribution of height for these alkyls (in their nice representation).
> allstructs(specs_Alkyl,size=5):
> map(nice,");
> sort(map(height,""));
Here we count 4 alkyls of size 5 and height 3, 3 alkyls of size 5 and height 4, and 1 alkyl of size 5 and height 5.
By the same method, we get the exact mean and standard deviation of the height for small sizes.
>
esd:=proc(n) local i,as,mean;
as:=map(height,allstructs(specs_Alkyl,size=n));
mean:=evalf(convert(as,`+`)/nops(as));
nops(as),mean,evalf(sqrt(add((i-mean)^2,i=as))/nops(as))
end:
> for i from 2 to 6 do i=esd(i) od;
We could go up to in less than 2 minutes.
Alkyls according to their height
Grammar
We define the class to be the class of alkyls of height at most . An alkyl of height at most can be viewed as a carbon atom linked to at most three alkyls of height at most , according to the equation .
>
gramm_ltd_height:=proc(n) option remember;
Alkyl_height[n]=Prod(Carbon,Set(Alkyl_height[n-1],card<=3)),gramm_ltd_height(n-1)
end:
gramm_ltd_height(1):=Alkyl_height[1]=Prod(Carbon,Epsilon),Carbon=Atom:
specs_ltd_height:=proc(n) option remember;
[Alkyl_height[n],{gramm_ltd_height(n)},unlabelled]
end:
The following procedure rewrites an alkyl into a more readable way.
> nice:=proc(alk) eval(subs({Epsilon=NULL,Carbon=C,Prod=proc() global H; [args] end,Set=proc() args end},alk)) end:
The following procedures compute the size and height of a given alkyl.
>
size:=proc(alk) option remember; 1+convert(map(size,op(2,alk)),`+`) end:
size(Prod(Carbon,Epsilon)):=1:
>
height:=proc(alk) option remember; 1+max(op(map(height,op(2,alk)))) end:
height(Prod(Carbon,Epsilon)):=1:
For instance, we compute the height of a random alkyl of size 10 and height at most 5.
> alk:=draw(specs_ltd_height(5),size=10):
> nice(alk);
> size(alk),height(alk);
In this section, we proceed to compute a table of the number of alkyls according to their size and height. The first method is by generating all structures. Next, we use generating functions to extend the table.
Generating all structures
The following procedure remembers all the alkyls of a given size and with bounded height.
> list_all_st:=proc(d,s) option remember; allstructs(specs_ltd_height(d),size=s) end:
An alkyl of height has a size at most . Therefore, to produce all alkyls with height at most , we need to produce all alkyls with size up to .
> s[max]:=(3^h[max]-1)/2;
To begin with, we enumerate all alkyls with height at most 3.
> h[max]:=3;
> for i from 1 to s[max] do i,nops(list_all_st(h[max],i)),map(nice,list_all_st(h[max],i)) od;
In this way, we have obtained the truncation of the bivariate generating function of alkyls with size marked by and height by .
> enum_BGF:=map(series,series(convert(map(proc(s,z,u) z^size(s)*u^height(s) end,map(op,[seq(list_all_st(h[max],i),i=1..s[max])]),z,u),`+`),z,infinity),u,infinity);
Generating functions
combstruct[gfeqns] returns a system of functional equations satisfied by the generating functions of related combinatorial structures. In the case of the alkyls with maximum height above, we get the following triangular system.
> gfeqns(op(2..3,specs_ltd_height(4)),z);
> gfsol:=gfsolve(op(2..3,specs_ltd_height(4)),z);
In particular, we have obtained a truncation of the bivariate generating function of all alkyls (i.e., with no constraint on height). In this series, marks the height. It extends the previous truncation .
> BGF:=map(series,series(eval(subs(Alkyl_height[0]=0,gfsol,add(u^h*(Alkyl_height[h]-Alkyl_height[h-1])(z),h=1..4))),z,infinity),u,infinity);
This is made explicit on the following normalized difference: each entry starts with a term in , denoting alkyls with height at least 4.
> map(series,series(BGF-enum_BGF,z,infinity),u,infinity);
Table of the number of alkyls according to size and height
Calculations with respect to different heights are much more efficient than the method of exhaustive enumeration. This makes it possible for us to set up the table of the number of alkyls according to size and height in a few minutes:
> h[max]:=5;
> gfsol:=gfsolve(op(2..3,specs_ltd_height(h[max])),z);
> BGF:=map(series,series(eval(subs(Alkyl_height[0]=0,gfsol,add(u^hh*(Alkyl_height[hh]-Alkyl_height[hh-1])(z),hh=1..h[max]))),z,infinity),u,infinity);
In the following table, the entry at row and column is the number of alkyls of size and height :
> matrix([[` `,seq(`height = `.hh,hh=1..h[max])],seq([`size = `.ss,seq(coeff(coeff(BGF,z,ss),u,hh),hh=1..h[max])],ss=1..s[max])]);
A (huge) table for could be computed in less than 10 minutes.
Disubstituted alkanes,
Enumerating disubstituted alkanes is equivalent to enumerating monosubstituted alkyls . The latter can generically be viewed as a carbon atom linked to one monosubstituted alkyl and at least 2 nonsubstituted alkyls. This yields the class equation .
> gramm_S1_Alkyl:=S1_Alkyl[X]=Union(Prod(Carbon,S1_Alkyl[X],Set(Alkyl,card<=2)),Prod(Prod(Carbon,X),Set(Alkyl,card<=2))),X=Epsilon:
> specs_S1_Alkyl:=[S1_Alkyl[X],{gramm_S1_Alkyl,gramm_Alkyl},unlabelled]:
> seq(count(specs_S1_Alkyl,size=i),i=0..50);
This series appears as the entry M1418 ("paraffins with carbon atoms") in the book by N. J. A. Sloane and S. Plouffe [ The Encyclopedia of Integer Sequences , (1995), Academic Press].
Of course, there are more monosubstituted alkyls than nonsubstituted ones. We give the ratios number of monosubstituted alkyls/number of alkyls for small sizes:
> seq([i=evalf(count(specs_S1_Alkyl,size=i)/count(specs_Alkyl,size=i))],i=1..50);
Here is an example of a monosubstituted alkyl with 6 carbon atoms, obtained by the command combstruct[draw] .
> alk:=draw(specs_S1_Alkyl,size=6);
The following procedure rewrites a monosubstituted alkyl into a more readable way.
> nice:=proc(alk) subs([C,X]=CX,eval(subs({Epsilon=NULL,Carbon=C,Prod=proc() [args] end,Set=proc() args end},alk))) end:
> nice(alk);
The following procedures compute the size and height of a given monosubstituted alkyl.
>
size:=proc(alk) option remember; 1+convert(map(op,map2(map,size,[op(2..-1,alk)])),`+`) end:
size(Carbon):=1:
size(X):=0:
size(Epsilon):=0:
>
height:=proc(alk) option remember; `if`(nops(alk)=2,1+max(op(map(height,op(2,alk)))),1+max(height(op(2,alk)),op(map(height,op(3,alk))))) end:
height(Carbon):=1:
height(X):=0:
height(Epsilon):=0:
Here is a monosubstituted alkyl with 50 carbon atoms, its nice representation and height.
> alk:=draw(specs_S1_Alkyl,size=50):
> nice(alk);
> height(alk);
Trisubstituted alkanes,
Enumerating trisubstituted alkanes is equivalent to enumerating disubstituted alkyls . In this section, we assume , and to be distinct. The grammar is more involved than in the disubstituted case: we have to distinguish several cases, according to which of and go into subtrees, and into which subtrees. The corresponding class equation is .
> gramm_S2_Alkyl:=S2_Alkyl[X,Y]=Union(Prod(Carbon,S2_Alkyl[X,Y],Set(Alkyl,card<=2)),Prod(Carbon,Union(S1_Alkyl[X],X),Union(S1_Alkyl[Y],Y),Set(Alkyl,card<=1))):
> specs_S2_Alkyl:=[S2_Alkyl[X,Y],{gramm_S2_Alkyl,gramm_S1_Alkyl,op(subs(X=Y,[gramm_S1_Alkyl])),gramm_Alkyl},unlabelled]:
> seq(count(specs_S2_Alkyl,size=i),i=0..50);
This series extends the entry M3466 ("paraffins with carbon atoms") in the book by N. J. A. Sloane and S. Plouffe [ The Encyclopedia of Integer Sequences , (1995), Academic Press].
Here is an example of a disubstituted alkyl with 6 carbon atoms, obtained by the command combstruct[draw] .
> alk:=draw(specs_S2_Alkyl,size=6);
The following procedure rewrites a disubstituted alkyl into a more readable way.
> nice:=proc(alk) subs({[C,X]=CX,[C,Y]=CY,[C,X,Y]=CXY},eval(subs({Epsilon=NULL,Carbon=C,Prod=proc() [args] end,Set=proc() args end},alk))) end:
> nice(alk);
The following procedures compute the size and height of a given disubstituted alkyl.
>
size:=proc(alk) option remember; `if`(nops(alk)=2,1+convert(map(size,op(2,alk)),`+`),1+convert(map(size,[op(2..-2,alk)]),`+`)+convert(map(size,op(-1,alk)),`+`)) end:
size(Carbon):=1:
size(X):=0:
size(Y):=0:
size(Epsilon):=0:
>
height:=proc(alk) option remember; `if`(nops(alk)=2,1+max(op(map(height,op(2,alk)))),1+max(op(map(height,[op(2..-2,alk)])),op(map(height,op(-1,alk))))) end:
height(Carbon):=1:
height(X):=0:
height(Y):=0:
height(Epsilon):=0:
Here is a disubstituted alkyl with 50 carbon atoms, its nice representation and height.
> alk:=draw(specs_S2_Alkyl,size=50):
> nice(alk);
> height(alk);
Trisubstituted alkanes,
Enumerating trisubstituted alkanes is equivalent to enumerating disubstituted alkyls . In this section, we assume and to be distinct. The class equation is .
> gramm_S2b_Alkyl:=S2_Alkyl[X,X]=Union(Prod(Carbon,S2_Alkyl[X,X],Set(Alkyl,card<=2)),Prod(Carbon,Union(Prod(S1_Alkyl[X],S1_Alkyl[X]),Prod(S1_Alkyl[X],X),Prod(X,X)),Set(Alkyl,card<=1))):
> specs_S2b_Alkyl:=[S2_Alkyl[X,X],{gramm_S2b_Alkyl,gramm_S1_Alkyl,gramm_Alkyl},unlabelled]:
> seq(count(specs_S2b_Alkyl,size=i),i=0..50);
This series does not appear in the book by N. J. A. Sloane and S. Plouffe [ The Encyclopedia of Integer Sequences , (1995), Academic Press], nor do its first four differences. Comparing to the entry M2838 ("tertiary alcohols with carbon atoms") in this book, we check that there are always more compounds than compounds (for ).
Here is an example of a disubstituted alkyl with 6 carbon atoms, obtained by the command combstruct[draw] .
> alk:=draw(specs_S2b_Alkyl,size=6);
The following procedure rewrites a disubstituted alkyl into a more readable way.
> nice:=proc(alk) subs({[C,X]=CX,[C,X,X]=CX[2]},eval(subs({Epsilon=NULL,Carbon=C,Prod=proc() [args] end,Set=proc() args end},alk))) end:
> nice(alk);
Here are the 9 disubstituted compounds for :
> map(nice,allstructs(specs_S2b_Alkyl,size=3));
Conclusion: multiply substituted alkyls
In the previous sections, we have enumerated the substituted compounds , , and . We could in principle enumerate the class of compounds obtained after substituting hydrogen atoms by atoms, hydrogen atoms by atoms, ..., hydrogen atoms by atoms, and one hydrogen atom by (so as to plant the trees). Doing so would require to define the class for each , ..., , and for each , ..., , to write a recursion involving partitions into 4 parts of the multiset { ( times), ..., ( times)}. When the 's are given, those partitions can be computed by a call to combstruct[allstructs] . It follows that we would describe and generate the grammar for multiply substituted alkyls in terms of the grammar for partitions into 4 parts!
Note added on February 25, 2008: The four integer sequences mentioned here are entries A000598, A000635, A000640, and A135142 in Sloane's On-Line Encyclopedia of Integer Sequences.