Algorithms in Bioinformatics - Paul A. Gagniuc - Page 73

2.8 Genes vs. Proteins in the Tree of Life

Across different organisms, the number of proteins in the proteome may be smaller than, equal to (hardly ever), or larger than the number of genes in the genome. In eukaryotic species in particular, one gene may encode more than one protein via a process known as alternative splicing. Note that RNA-splicing mechanisms are discussed in detail in Chapter 8. A comparison between the average number of genes and the average number of proteins is shown in Table 2.6. Based on the values in this table, rough estimates can be made of the frequency of alternative splicing in different kingdoms of life. A general equation can be formulated by assuming a “one gene–one protein” correspondence. If equality between the number of proteins and the number of genes corresponds to 100%, everything above this threshold is a surplus that can be attributed to alternative splicing and protein splicing. Thus, unity (a value of 1, or 100 to work directly in percent) is divided by the average number of genes and the result is multiplied by the average number of proteins. To find the average protein surplus (S), unity is subtracted from this result, which applies only when the proteome is larger than the genome, as follows:

S = (1 / genes) × proteins − 1

Table 2.6 Genes vs. proteins in the tree of life.

Eukaryotes	Size (Mb)	Genes	Proteins	GC%
Animals	1493.6	27 075.8	39 140.1	41.0
Fungi	18.6	7707.5	6951.2	42.5
Plants	940.8	39 140.1	45 405.0	38.7
Protists	22.7	5915.1	5628.1	35.6
Other	45.6	7546.3	7354.8	43.4
Prokaryotes	Size (Mb)	Genes	Proteins	GC%
Bacteria & Archaea	4.0	3829.0	3598.4	49.5

The table compares the average genome size, the average GC content (GC%), the average number of genes, and the average number of resulting proteins. Genome size is given in megabases (Mb). A DNA fragment of 1 million nucleotides (1 000 000 b) is 1000 kilobases (1000 kb), 1 megabase (1 Mb), or 0.001 gigabases (0.001 Gb) long. For instance, an average genome size of 1493.6 Mb is 1.4936 Gb (∼1.5 Gb).
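The unit relationships in the table note can be checked with a few lines of Python. This is only a minimal sketch: the function name `convert_mb` and the dictionary keys are illustrative choices, not part of the original text.

```python
# Number of bases in each unit, as stated in the note under Table 2.6.
BASES_PER_KB = 1_000
BASES_PER_MB = 1_000_000
BASES_PER_GB = 1_000_000_000


def convert_mb(size_mb):
    """Express a DNA length given in megabases (Mb) in b, kb, Mb and Gb."""
    bases = size_mb * BASES_PER_MB
    return {
        "b": bases,
        "kb": bases / BASES_PER_KB,
        "Mb": size_mb,
        "Gb": bases / BASES_PER_GB,
    }


# Average animal genome size taken from Table 2.6.
for unit, value in convert_mb(1493.6).items():
    print(f"{value:,.4f} {unit}")
```

Calling `convert_mb(1.0)` reproduces the equivalences quoted in the note (1 000 000 b, 1000 kb, 1 Mb, 0.001 Gb).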

The average animal proteome is ∼45% more diverse than the average number of animal genes. By the same formula, the average plant proteome is ∼16% more diverse than the average number of known genes. The average fungal, protist, and prokaryote proteomes are moderately undersized. As before, unity (a value of 1) is divided by the average number of genes and the result is multiplied by the average number of proteins. However, since in this case the proteome is smaller than the genome, the result is subtracted from 1 (unity), as follows:

S = 1 − (1 / genes) × proteins

where S in this case is the part of the proteome that should exist under a “one gene–one protein” correspondence but is absent. Thus, the average fungal proteome contains ∼10% fewer types of proteins than the average number of genes. This suggests that about 10% of fungal genes could encode functional RNAs. The situation is similar for protists, in which about 5% of genes could encode functional RNAs and the remaining 95% encode proteins. In the case of prokaryotes, about 93% of archaeal and bacterial genes encode proteins and the remaining 7% could encode functional RNAs.
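The two cases described above (surplus when the proteome exceeds the genome, deficit otherwise) can be sketched in Python using the averages from Table 2.6. The function name `proteome_vs_genome` and the labels "surplus" and "deficit" are assumptions made for this illustration; only the arithmetic follows the text.

```python
# Average (genes, proteins) per group, copied from Table 2.6.
TABLE_2_6 = {
    "Animals": (27075.8, 39140.1),
    "Fungi": (7707.5, 6951.2),
    "Plants": (39140.1, 45405.0),
    "Protists": (5915.1, 5628.1),
    "Other": (7546.3, 7354.8),
    "Bacteria & Archaea": (3829.0, 3598.4),
}


def proteome_vs_genome(genes, proteins):
    """Return (label, S), with S expressed as a fraction of unity.

    Unity is divided by the number of genes and multiplied by the number
    of proteins. Above 1, S is the part exceeding unity (surplus);
    otherwise S is the part missing from unity (deficit).
    """
    ratio = (1.0 / genes) * proteins
    if ratio > 1.0:
        return "surplus", ratio - 1.0
    return "deficit", 1.0 - ratio


for group, (genes, proteins) in TABLE_2_6.items():
    label, s = proteome_vs_genome(genes, proteins)
    print(f"{group:20s} {label:8s} {100 * s:5.1f}%")
```

With these inputs, animals and plants come out as surpluses of roughly 45% and 16%, and fungi and protists as deficits of roughly 10% and 5%, in line with the text. For prokaryotes the table averages give a deficit of about 6%, slightly below the roughly 7% quoted above.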

Although these estimates are informative, an undersized proteome does not rule out alternative splicing or protein splicing in any of these kingdoms. Animals and plants show the most diverse proteomes, well above the average number of genes (Table 2.6). Individual species may show a particularly high proteome diversity compared to these averages. For instance, in plants, Triticum durum (macaroni wheat) contains ∼63k genes and a proteome of ∼190k proteins. Following the same reasoning as above, the proteome of T. durum is ∼197% more diverse than the number of genes. In animals, a significant difference can also be found. Current NCBI data show that the human genome contains ∼60k genes (the list of annotated features includes protein-coding genes, noncoding genes, and pseudogenes) and a proteome of ∼120k proteins (H. sapiens GRCh38.p13). The proteome of H. sapiens is thus ∼95% more diverse than the number of genes.

Note, however, that any figures given for the human genome and proteome can become outdated over time. In the literature, the number of genes and proteins reported for H. sapiens varies depending on different conventions and/or advances in bioinformatics [234–236]. Why is there so much uncertainty about the number of genes or proteins? All genes are predicted by bioinformatic means. Many predictions are then verified by aligning sequenced mRNAs against a reference genome. However, many genes are expressed only under special conditions, over certain periods of time, or only once in a lifetime. Thus, their mRNAs may not be detected and sequenced to confirm the bioinformatic predictions. In addition, many genes may overlap, and gene promoters can often show bidirectional activity [237–239]. It stands to reason that such elusive genes are difficult to locate with certainty, and other genes will prove difficult to detect in the future. Moreover, many results derived from large-scale experiments (e.g. genome studies) fall directly under the umbrella of chaos theory: small changes in the initial parameters of different algorithms can lead to huge variations in the final predictions. This has already been evident over time in the case of the human genome [234–236].
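Returning to the species-level estimates for T. durum and H. sapiens, the same surplus calculation can be applied to the rounded counts quoted in the text. This is a rough sketch: the helper name `protein_surplus` is hypothetical, and the inputs are the approximate figures given above (∼63k/∼190k and ∼60k/∼120k), not exact annotation counts.

```python
def protein_surplus(genes, proteins):
    """Surplus S as a fraction of unity: (1 / genes) * proteins - 1."""
    return (1.0 / genes) * proteins - 1.0


# Rounded (genes, proteins) counts as quoted in the text; assumed, not exact.
species = {
    "Triticum durum": (63_000, 190_000),
    "Homo sapiens": (60_000, 120_000),
}

for name, (genes, proteins) in species.items():
    print(f"{name}: {100 * protein_surplus(genes, proteins):.0f}% surplus")
```

The rounded inputs give values close to, but not identical with, the ∼197% and ∼95% quoted in the text, which presumably rest on the exact counts. The gap itself illustrates how sensitive such estimates are to the gene and protein numbers used.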
