DS-2019-06: Considerations in Evolutionary Biochemistry

DS-2019-06: der Gulik, Peter T.S. van (2019) Considerations in Evolutionary Biochemistry. Doctoral thesis, University of Amsterdam.

Computational methods offer a powerful way to investigate difficult problems in evolutionary biochemistry. A clear example of how computational methods can provide new and thorough knowledge in this area is the history of the investigation of regularities in the structure of the genetic code. Upon visual inspection of the table that gives the translation rules from nucleic acid to protein, several investigators noted that similar codons often encode similar amino acids. For example, Carl Woese noted in the mid-1960s that codons with C at the middle position encode, without exception, amino acids that are neither very large nor very hydrophobic, but certainly not very hydrophilic. Francis Crick raised a critical objection to this kind of observation: all twenty canonical amino acids resemble each other, and it is very difficult for the human mind not to perceive patterns, even in a random arrangement. Computational methods offered a way out of this disagreement. By producing a set of random redistributions of the codon assignments, combined with a quantitative scale of amino acid properties (developed by Carl Woese and co-workers) and a function reflecting the error robustness of the genetic code and of its variants-by-random-redistribution, it became possible to show that Carl Woese was right in recognizing the pattern. Alff-Steinberger published this approach as early as 1969, but the result was only generally accepted in the 1990s, after the work of Hurst and co-workers.
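The approach described above can be sketched in a few lines of code. This is a minimal illustration, not the procedure used in the literature: the polar requirement values are approximate figures from memory of the Woese scale, the error function is a mean-squared difference over single-base substitutions, and the randomization permutes the twenty amino acids over their synonymous codon sets while keeping the stop codons fixed.

```python
import random

# Approximate polar requirement values (after Woese and co-workers); illustrative only.
PR = {"F": 5.0, "L": 4.9, "I": 4.9, "M": 5.3, "V": 5.6, "S": 7.5, "P": 6.6,
      "T": 6.6, "A": 7.0, "Y": 5.4, "H": 8.4, "Q": 8.6, "N": 10.0, "K": 10.1,
      "D": 13.0, "E": 12.5, "C": 4.8, "W": 5.2, "R": 9.1, "G": 7.9}

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
CODE_STR = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, CODE_STR))

def error_value(code):
    """Mean squared change in polar requirement over all single-base
    substitutions between sense codons (substitutions to/from stops skipped)."""
    total, n = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                aa2 = code[codon[:pos] + b + codon[pos + 1:]]
                if aa2 == "*":
                    continue
                total += (PR[aa] - PR[aa2]) ** 2
                n += 1
    return total / n

def random_variant(rng):
    """Random redistribution: permute the 20 amino acids over the synonymous
    codon sets, keeping the block structure and the stop codons fixed."""
    aas = sorted(set(CODE_STR) - {"*"})
    shuffled = aas[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(aas, shuffled))
    return {c: (aa if aa == "*" else mapping[aa]) for c, aa in STANDARD.items()}

rng = random.Random(1)
standard = error_value(STANDARD)
randoms = [error_value(random_variant(rng)) for _ in range(500)]
print(f"standard code error value: {standard:.2f}")
print(f"mean over 500 random codes: {sum(randoms) / len(randoms):.2f}")
print(f"random codes better than the standard code: {sum(e < standard for e in randoms)}")
```

Even this crude version reproduces the qualitative finding: the standard code's error value lies far below the average of the randomized codes.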

One of the areas in which we applied computational methods to problems in evolutionary biochemistry was the enigma of the primordial peptides. What was the function of the first coded peptides? Which sequence fragments in protein-coding genes are the very oldest in all protein-coding information? We addressed these questions by assuming that ancient biological systems used a smaller repertoire of amino acids in their proteins. Specifically, we assumed that at a certain stage of life, proteins consisted of just four amino acids: valine, alanine, aspartic acid and glycine. We then searched the PDB (Protein Data Bank) for stretches of protein consisting of just these four amino acids, with one position allowed as an exception so as not to exclude later adaptation of old motifs entirely. Interestingly, we found types of proteins that are fundamental to life: polymerases, mutases and kinases. The mutases and kinases play roles in glycolysis, a central pathway in biochemistry. The sequence “Alanine-Aspartic acid-Phenylalanine-Aspartic acid-Glycine-Aspartic acid” in RNA polymerase is the active site of the enzyme that produces mRNA in all living cells. We conjectured that our procedure indeed pinpointed sequence fossils in existing proteins. Furthermore, we concluded that protein stretches containing glycine and aspartic acid, and manipulating magnesium dications, may have been among the very first coded peptide sequences. Perhaps peptides like “Aspartic acid-Glycine-Aspartic acid” were originally produced by a prebiotic environment, and perhaps their concentrations were among the first aspects of the environment that life managed to bring under control.
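The core of such a search is a simple sliding-window scan. The sketch below is illustrative only: the window length, the toy sequence, and the helper name are assumptions, while the real study scanned sequences from the PDB. The toy sequence contains the RNA-polymerase-like motif ADFDGD (Ala-Asp-Phe-Asp-Gly-Asp) mentioned above, in which phenylalanine is the single allowed exception.

```python
# Find windows in a protein sequence that consist of the "old" amino acids
# V, A, D, G, allowing at most one exception per window.
OLD = set("VADG")

def old_stretches(seq, length=6, max_exceptions=1):
    """Yield (start, fragment) for every window of the given length in which
    at most `max_exceptions` residues fall outside the V/A/D/G repertoire."""
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        if sum(aa not in OLD for aa in window) <= max_exceptions:
            yield i, window

# Toy sequence (an assumption for illustration) containing the motif ADFDGD.
example = "MKTWQADFDGDLLPRS"
hits = list(old_stretches(example))
print(hits)  # → [(5, 'ADFDGD')]
```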

Another area in which we applied computational methods to problems in evolutionary biochemistry was the riddle of the structure of the genetic code. As pointed out above, Hurst and co-workers used a function for the error robustness of the genetic code to show that similar codons in the genetic code encode, in general, similar amino acids. Carl Woese’s polar requirement was used to quantify the similarity of different amino acids, and it was shown that the error robustness resulting from this distribution of assignments resides mainly in the first and third positions of the codon. We were fascinated by this work and decided to refine some of its mathematical aspects. First of all, we wanted to characterise the global optimum of the error function in the space defined by the randomization procedure. In the field, a value found by Goldman using a heuristic search procedure was used as if it were the global minimum. This value was the lowest value known to exist in this space, but it was not known to be the global optimum. From a mathematical viewpoint, such a situation is very unsatisfactory. We searched for the optimum and proved that the value reported by Goldman is indeed the global minimum. During our work in this area, we noticed that we always obtained a smooth distribution of values in our histograms, whereas the published histograms were characterised by spikes, which were thought to result from the combination of the discrete, clumped distribution of amino acid polar requirement and the patterns of codon blocks in the first and third bases. Because we did not obtain these spikes, we concluded that this line of reasoning had to be false. We tried to determine what the cause of the spikes must have been, and concluded that they are an artifact resulting from the combination of rounding errors in the data and in the bin borders of the histograms.
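The binning mechanism behind such spikes can be shown with a toy example (the numbers here are assumptions for illustration, not the thesis data): when the data sit on a discrete grid, such as values rounded to one decimal, and the bin width does not align with that grid, bin counts alternate sharply even though the underlying values are perfectly evenly spaced.

```python
import numpy as np

# 100 evenly spaced values on a 0.1 grid, mimicking data rounded to one decimal.
data = np.round(np.arange(0.0, 10.0, 0.1), 1)
# Bin borders deliberately misaligned with the 0.1 grid (width 0.07).
edges = np.arange(0.0, 10.1, 0.07)
counts, _ = np.histogram(data, bins=edges)
# Each 0.07-wide bin holds either one grid point or none, so the histogram of a
# perfectly uniform sample looks spiky.
print("min count:", counts.min(), "max count:", counts.max())  # → 0 and 1
```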
Another facet of the work with which we were not completely happy was the procedure by which random variant codes were generated. By swapping the amino acid assignments, a space was created in this field (a space which we decided to call ‘Space 0’) which did not contain variant codes that are in fact known to exist in remote corners of biology (and, in the case of mitochondria, even corners of biology which, in a certain sense, are not remote at all). By devising a new procedure to generate random code variants, we enlarged the code space to successively include known sense-to-sense reassignments, known stop-to-sense reassignments, and known unused codons. The resulting spaces were called ‘Space 1’, ‘Space 2’, and ‘Space 3’, and we also devised a ‘Space 4’ containing hypothetical precursor codes with fewer than twenty amino acids, as well as experimental synthetic codes with amino acids that researchers selected for co-translational incorporation into proteins. We could perform calculations with Space 1 and Space 2, and found that the basics of the relationship between the genetic code and the average code did not change, despite the considerable enlargement of the space. A further refinement we contributed to the field was a critical examination of the conclusions drawn from calculations such as those described above. In particular, the use of the concept “Frozen Accident” was examined. We also showed that the tendency to conclude, from the low error value of the genetic code compared with the average code, that very large numbers of codes must have been screened by natural selection to arrive at the genetic code as we know it, is not the only way to interpret this low value. Scenarios of code development do not differ so much in showing that error robustness due to codon assignments is present in the genetic code, but rather in the way they propose this error robustness was built up.
The refinements we thus contributed to the field are described in detail in the third chapter.
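The idea of enlarging the code space beyond Space 0 can be sketched as follows. This is an assumption-laden illustration, not the thesis's procedure: only two reassignments that genuinely occur in vertebrate mitochondria are included, namely AUA (Ile to Met, a sense-to-sense reassignment) and UGA (Stop to Trp, a stop-to-sense reassignment); the actual catalogue of known variants is larger.

```python
import random

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
CODE_STR = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip([a + b + c for a in BASES for b in BASES for c in BASES], CODE_STR))

# Two reassignments that occur in vertebrate mitochondria (illustrative subset).
KNOWN_REASSIGNMENTS = {"AUA": "M",   # Ile -> Met  (sense-to-sense, 'Space 1')
                       "UGA": "W"}   # Stop -> Trp (stop-to-sense, 'Space 2')

def enlarged_variant(rng):
    """Sketch: apply a random subset of the known reassignments, then permute
    the 20 amino acids over the synonymous sets as in 'Space 0'."""
    code = dict(STANDARD)
    for codon, aa in KNOWN_REASSIGNMENTS.items():
        if rng.random() < 0.5:
            code[codon] = aa
    aas = sorted(set(CODE_STR) - {"*"})
    shuffled = aas[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(aas, shuffled))
    return {c: (aa if aa == "*" else mapping[aa]) for c, aa in code.items()}

variant = enlarged_variant(random.Random(7))
print(sum(variant[c] != STANDARD[c] for c in STANDARD), "codon assignments changed")
```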
In the fourth chapter, a result is reported concerning another kind of error robustness in the genetic code. Not only are similar amino acids often encoded by similar codons; identical amino acids are nearly always (serine being the exception) encoded by similar codons. For example, all arginine codons share G at the middle position. While contemplating this kind of error robustness, we suddenly discovered a stunning fact. Unmodified anticodons are known to pair in a limited set of ways with several codons. An anticodon starting with guanine pairs with both codons ending in a pyrimidine, and an anticodon starting with cytosine pairs only with a codon ending in guanine. The implication of these most basic wobble rules is that a set of tRNAs without anticodon modifications is able to transfer all twenty amino acids of the canonical genetic code. No complex modification apparatus is needed in early biochemistry, which is exactly what one would expect if the system evolved from a simple origin. This observation suggests that certain codons were originally not in use, because the simple system was not able to recognize them unambiguously. Negative selection would keep these codons at an extremely low level of presence in protein-coding sequences. To be precise: the fourth chapter lays down the exact argument for the conjecture that UUA, UAA, UGA, CAA, AUA, AAA, AGA, and GAA were unused codons at a stage of code development in which all twenty amino acids were already part of the amino acid repertoire.
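The combinatorial core of this list can be reproduced with a short sketch, under one simplifying assumption (ours, not necessarily the exact formulation of the fourth chapter): a codon ending in A can only be read unambiguously by an unmodified anticodon when all four codons of its family box carry the same meaning, since reading the A-ending codon requires a wobble base that also reaches the other codons of the box. The eight NNA codons in split boxes then drop out, and they are exactly the eight codons named above.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
CODE_STR = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip([a + b + c for a in BASES for b in BASES for c in BASES], CODE_STR))

unused = []
for a in BASES:
    for b in BASES:
        meanings = {CODE[a + b + c] for c in BASES}   # meanings within one box
        if len(meanings) > 1:   # split box: position 3 must be discriminated
            unused.append(a + b + "A")
print(sorted(unused))
# → ['AAA', 'AGA', 'AUA', 'CAA', 'GAA', 'UAA', 'UGA', 'UUA']
```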
The work on the genetic code is further elaborated in the fifth chapter. While the third chapter contributed refinements of an existing approach and the fourth chapter highlighted a missed regularity, the fifth chapter integrates different aspects that are all considered important in the evolution of the genetic code into a single mathematical procedure. The main line of reasoning is that if certain assignments are determined by stereochemical interactions between triplet and amino acid (as indicated by experimental work in the field), these assignments should not be allowed to vary in the randomization procedure that provides code variants. Another aspect that should be integrated into the model is the concept of a gradual growth of the repertoire, starting with valine, alanine, aspartic acid and glycine, and gradually developing towards a twenty-amino-acid code. This can be done by using a randomization procedure developed by Freeland and Hurst. Taken together, the different aspects provide a realistic model of the space available to early life for probing code variations. In this (limited!) space, the standard genetic code is optimal. During this work, the similarity of amino acids in molecular structure was also investigated. Using the procedure of the third chapter to study the position dependence of error robustness, we found that with the Molecular Structure Matrix developed in this work as input data, error robustness resides in the first and second positions of the codon (in contrast to the first and third positions found for polar requirement). We suggest that this regularity derives from a gradual expansion of the amino acid repertoire from simple to complex amino acids, combined with a gradual expansion of the codon repertoire from codons starting with purines (first guanine, later adenine) to codons starting with pyrimidines (codons starting with uracil being the last to be added to the set).
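A restricted randomization of the kind described above can be sketched as follows. Which assignments are stereochemically fixed is a modelling choice; holding the four "early" amino acids valine, alanine, aspartic acid and glycine in place, as below, is an assumption made purely for illustration, not the fifth chapter's actual constraint set.

```python
import random

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
CODE_STR = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip([a + b + c for a in BASES for b in BASES for c in BASES], CODE_STR))

# Illustrative choice of assignments held fixed during randomization.
FIXED = set("VADG")

def restricted_variant(rng):
    """Permute only the sixteen non-fixed amino acids over their synonymous
    sets; the fixed amino acids keep their standard assignments."""
    free = sorted((set(CODE_STR) - {"*"}) - FIXED)
    shuffled = free[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(free, shuffled))
    mapping.update({aa: aa for aa in FIXED})
    return {c: (aa if aa == "*" else mapping[aa]) for c, aa in STANDARD.items()}

variant = restricted_variant(random.Random(3))
# Every codon of a fixed amino acid keeps its standard assignment.
print(all(variant[c] == aa for c, aa in STANDARD.items() if aa in FIXED))  # → True
```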

The last chapter deals with another problem in evolutionary biochemistry that can be investigated with computational methods. The expression of the mitochondrial genetic material of the sleeping-sickness parasite Trypanosoma brucei is very complex. Information necessary to ensure that many uridine nucleosides are present in the right places in mRNA is, in fact, scattered over the mitochondrial genome. Many different suggestions about the evolutionary background of this complex organization have been brought forward in the scientific literature; one of these is that this organization provides protection against loss of information due to intense intraspecific competition in combination with a complex life cycle. The sixth chapter presents our efforts to give this concept a mathematical foundation.

Item Type: Thesis (Doctoral)
Report Nr: DS-2019-06
Series Name: ILLC Dissertation (DS) Series
Year: 2019
Subjects: Computation
Depositing User: Dr Marco Vervoort
Date Deposited: 14 Jun 2022 15:17
Last Modified: 14 Jun 2022 15:17
URI: https://eprints.illc.uva.nl/id/eprint/2168
