Lattice protein model
I implemented a version of the 5 × 5 lattice protein model put forward by Goldstein and coworkers [21–23, 32]. In this model, proteins are sequences of n = 25 residues that fold into a maximally compact structure on a two-dimensional grid of 5 × 5 lattice points. There are 1081 distinct possible conformations in this model, and the partition function can be evaluated exactly by summing over the contact energies of all distinct conformations.
The contact energy of a conformation i is
where
is the contact energy between amino acids
at location j and
at location k in the sequence, and
is 1 if the two amino acids are in contact in conformation k, and 0 otherwise. The partition function is
where the sum runs over all 1081 conformations. A sequence folds into conformation f if the contact energy for that conformation is lower than the contact energies of all other formations, E
f
<E
i
for all i ≠ f, and if the free energy of folding, which is defined as
is smaller than some cutoff ΔGcut. Throughout this study, I used kT = 0.6 and ΔGcut = 0. The contact energies
where taken from Table VI in Ref. [33].
Sequence evolution
I simulated the evolution of populations of DNA sequences in discrete, non-overlapping generations. Population size is denoted by Ne. The fitness of a sequence was 1 if the DNA sequence translated into a peptide sequence that could fold into a chosen target structure, and had a free energy of folding smaller than Gcut. Otherwise, the fitness of the sequence was 0. All sequences had length L = 75. In each successive generation, sequences with fitness 1 were randomly chosen to reproduce, until the new generation had Ne members. At reproduction, the sequences were mutated, with an average of μ base pair substitutions per sequence. I let each population evolve for several thousand generations, and kept track of the full genealogic information of all sequences in the population. In order to measure the molecular clock of fixed mutations only, I studied the pattern of base substitutions in a window of τ generations along the line of descent backwards in time, starting from the most recent common ancestor of the final population.
I varied the parameters Ne (10, 33, 100, 330, 1000, 3300), μ (0.0075, 0.075, 0.75), and τ (500, 1000). For each set of parameters, I carried out 500 replicates (each with a different, randomly chosen target structure), to obtain a distribution for the number of synonymous and non-synonymous substitutions Sd and Nd. Since there was some variation in the number of synonymous and non-synonymous sites across different target structures (on the order of approximately ± 5% variation from the mean), I then applied a correction factor to Sd and Nd to bring them into comparable units: I calculated the corrected number of synonymous substitutions
as
Here, S is the mean number of synonymous sites for the given replicate, and (S) is the average of S over all 500 replicates. Likewise, I calculated
(Indices of dispersion calculated without this correction factor are slightly larger than the ones reported here, because the variation in S and N creates additional variance in Sd and Nd). Similar correction factors have been used in sequence analysis [7], and are generally referred to as lineage adjustments. They control for differences among lineages that are primarily related to the expected number of substitutions in a lineage, and thus should not enter the index of dispersion.
To obtain an estimate for mean and standard error of the index of dispersion, I subdivided the 500 results into 10 blocks of 50 each, and calculated mean and variance of the number of substitutions for each block. The ratio of variance to mean for a given set of substitutions (synonymous or non-synonymous) in a block is the index of dispersion for this data set. I then calculated mean and standard error for the index of dispersion from the individual results of the 10 blocks.
The total CPU time needed to carry out all simulations was several months on a small cluster of Pentium II 500 MHz machines.
Calculation of synonymous and non-synonymous substitutions and sites
I calculated the number of synonymous and non-synonymous sites S and N and the number of synonymous and non-synonymous substitutions Sd and Nd according to the method proposed by Nei and Gojobori [34]. In short, under this method the number of synonymous sites s
i
of a codon i is the fraction of possible substitutions to that codon that leave the residue unchanged. The number of non-synonymous sites n
i
for the same codon is n
i
= 3 - s
i
. For the complete sequence, S and N are calculated as
and
where i runs over all codons in the sequence. The number of synonymous or non-synonymous substitutions sd,ior nd,ibetween two codons is the average number of such substitutions, where the average is taken over all paths that lead from one codon to the other. The total number of synonymous or non-synonymous substitutions between two sequences is the sum over all individual constributions,
and
(again, i runs over all codons in the sequence).
To calculate the number of synonymous or non-synonymous substitutions along the line of descent, I simply summed up all synonymous or non-synonymous substitutions that occurred from generation to generation. Because the full evolutionary history was known, a correction for multiple mutations such as the Jukes-Cantor correction [35] was not necessary. I also averaged the number of synonymous and non-synonymous sites over all sequences along the line of descent, to get the mean number of synonymous and non-synonymous sites for the given evolutionary trajectory.