Algorithms in Bioinformatics - Paul A. Gagniuc

2.3.3 Computations on the Average Genome Size


A series of computations shows the average genome size observed for each division of the tree of life, as well as the average size of viral genomes and the average DNA length of plasmids (Figure 2.1 and Table 2.1). These values were calculated from raw data extracted from the file transfer protocol (FTP) site of the National Center for Biotechnology Information (NCBI). The NCBI section for Genome Information by Organism contains general data on each branch of the tree of life: eukaryotes (13k), prokaryotes (265k), viruses (41k), plasmids (23k), and organelles (17k). Together, these categories amount to ∼359k DNA/RNA sequences at different assembly levels, of which the 341k sequence samples with a “complete” assembly level were used to calculate the averages presented here. Filters were used to obtain a clean data set: only the “complete chromosome” and “complete genome” levels were considered for these calculations.
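The filtering and averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual pipeline: the record layout, the assembly-level labels, and the function name `summarize` are assumptions, the toy values are invented rather than taken from NCBI, and the sample standard deviation is used only because the text does not say which SD formula was applied.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy records: (group, assembly level, size in bases, GC%).
# The values are invented for illustration; they are NOT NCBI data.
RECORDS = [
    ("prokaryotes", "Complete Genome", 4_000_000, 50.0),
    ("prokaryotes", "Complete Genome", 2_000_000, 40.0),
    ("prokaryotes", "Contig", 9_000_000, 70.0),       # dropped by the filter
    ("viruses", "Complete Genome", 30_000, 45.0),
    ("viruses", "Scaffold", 80_000, 60.0),            # dropped by the filter
    ("viruses", "Complete Genome", 50_000, 35.0),
]

# Assumed labels for the "complete" assembly levels mentioned in the text.
COMPLETE_LEVELS = {"complete genome", "complete chromosome"}


def summarize(records):
    """Return per-group sample count, mean and SD of size (Mb) and GC%."""
    sizes_mb = defaultdict(list)
    gc_values = defaultdict(list)
    for group, level, size_b, gc in records:
        if level.lower() not in COMPLETE_LEVELS:
            continue  # keep only complete assemblies
        sizes_mb[group].append(size_b / 1_000_000)  # bases -> megabases
        gc_values[group].append(gc)

    summary = {}
    for group, sizes in sizes_mb.items():
        gcs = gc_values[group]
        summary[group] = {
            "samples": len(sizes),
            "size_av": mean(sizes),
            # stdev needs at least two samples; report None otherwise
            "size_sd": stdev(sizes) if len(sizes) > 1 else None,
            "gc_av": mean(gcs),
            "gc_sd": stdev(gcs) if len(gcs) > 1 else None,
        }
    return summary


if __name__ == "__main__":
    for group, stats in summarize(RECORDS).items():
        print(group, stats)
```

Filtering before averaging matters here: in the toy data, the incomplete 9 Mb contig would otherwise pull the prokaryote mean upward.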

Moreover, the maximum values presented in the main text were extracted from these data and checked against the literature. The files containing the raw data can be found in the additional materials online. Important note: the number of samples shown in the last row of Table 2.1 can be misleading. Table 2.1 shows 252k prokaryote samples, whereas the cataloged prokaryotes in Table 1.1 total 12k species. The discrepancy arises because, in the NCBI database, prokaryotes have more than one reference or representative genome per species. According to NCBI filters, around 3.2k of the prokaryote genomes are representative.
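The counts quoted in this section can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated here (the per-category totals, the last row of Table 2.1, and the 12k species figure); the variable names are illustrative.

```python
# Records per category, in thousands, as quoted in the text.
all_records_k = {
    "eukaryotes": 13,
    "prokaryotes": 265,
    "viruses": 41,
    "plasmids": 23,
    "organelles": 17,
}

# "Complete" samples per category, from the last row of Table 2.1.
complete_samples = {
    "eukaryotes": 12_039,
    "prokaryotes": 252_029,
    "plasmids": 21_801,
    "organelles": 16_388,
    "viruses": 38_431,
}

total_records_k = sum(all_records_k.values())    # the ~359k in the text
total_complete = sum(complete_samples.values())  # the ~341k in the text

# Prokaryote samples versus cataloged prokaryote species (12k in the text).
cataloged_prokaryote_species = 12_000
genomes_per_species = (
    complete_samples["prokaryotes"] / cataloged_prokaryote_species
)

print(f"all records: ~{total_records_k}k")
print(f"complete samples: {total_complete} (~{round(total_complete / 1000)}k)")
print(f"prokaryote samples per cataloged species: ~{genomes_per_species:.0f}")
```

Because the 12k species figure is rounded, the ratio of roughly 21 samples per cataloged species is only indicative, but it shows why sample counts should not be read as species counts.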


Figure 2.1 The average genome size. (a) The proportion of known species in each kingdom of life. (b) The tree of life with data on the main kingdoms of life. Each kingdom is labeled with the average genome size and the average GC% content. (c) The average organellar genome for a number of organelles investigated to date. Here, the organelles are sorted by GC%. (d) A comparison between mitochondria and chloroplasts. (e) A comparison between plasmids from bacteria, archaea, and eukaryotes. For each chart (c–e), the left axis indicates the GC content (%) and the right axis indicates the average genome size in megabase pairs (written here as Mb instead of Mbp, for brevity).

Table 2.1 The average genome size in the tree of life.

                           Eukaryotes  Prokaryotes  Plasmids  Organelles  Viruses
Genome size average (Mb)
  AV                       433.92      3.74         0.11      0.07        0.04
  SD                       ±1160.87    ±1.81        ±0.23     ±0.39       ±0.43
Average GC% content (%)
  AV                       41.92       48.72        45.91     36.05       45.34
  SD                       ±10.90      ±11.87       ±11.32    ±7.92       ±9.27
Samples
  Total                    12 039      252 029      21 801    16 388      38 431

The table shows the average genome size and the average GC% content in eukaryotes, prokaryotes, plasmids, organelles, and viruses (eukaryotic and prokaryotic). Note that a smaller standard deviation (SD) indicates that more of the data are clustered about the mean, whereas a larger SD indicates that the data are more spread out (larger variation in the data). DNA length is expressed in megabases (Mb). For instance, a DNA fragment of 1 million nucleotides (1 000 000 b) is 1 megabase (1 Mb), or 1000 kilobases (1000 kb), in length. The last row (Samples) indicates how many sequenced genomes were used for these computations.
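To make the units and the SD remark concrete, the following minimal sketch converts a length in bases to kb and Mb and expresses each genome-size SD in Table 2.1 relative to its mean. The helper names are hypothetical, and the SD-to-mean ratio (coefficient of variation) is used here only as one convenient way to compare spread across groups with very different means; it is not a statistic reported in the table.

```python
def bases_to_kb(length_b):
    """Convert a length in bases to kilobases (1 kb = 1000 b)."""
    return length_b / 1_000


def bases_to_mb(length_b):
    """Convert a length in bases to megabases (1 Mb = 1 000 000 b)."""
    return length_b / 1_000_000


# Genome size (AV, SD) in Mb, copied from Table 2.1.
GENOME_SIZE_MB = {
    "eukaryotes": (433.92, 1160.87),
    "prokaryotes": (3.74, 1.81),
    "plasmids": (0.11, 0.23),
    "organelles": (0.07, 0.39),
    "viruses": (0.04, 0.43),
}


def relative_spread(av, sd):
    """Return SD divided by the mean (coefficient of variation)."""
    return sd / av


if __name__ == "__main__":
    fragment_b = 1_000_000
    print(f"{fragment_b} b = {bases_to_kb(fragment_b)} kb "
          f"= {bases_to_mb(fragment_b)} Mb")
    for group, (av, sd) in GENOME_SIZE_MB.items():
        print(f"{group}: SD/AV = {relative_spread(av, sd):.2f}")
```

With the table's values, the eukaryote SD is more than twice the mean, whereas the prokaryote SD is roughly half the mean, so eukaryote genome sizes are far more dispersed relative to their average.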
