4.1.1 Genome Size
The total DNA of a cell is referred to as its genome. Genome sizes of major organismal groups are shown schematically in Figure 4.1. When the minimal genome size of each group is examined (i.e. only the left end of each bar), an increase in size can be seen that runs largely parallel to the level of organization. Bacteria and fungi, with their simple structures, have smaller genomes than structurally complex multicellular organisms. It is presumed that genomes were enlarged mainly through genome duplications. Protostomia and the deuterostome ancestors of the vertebrates (see Chapter 6) generally contain only one copy of a gene, whereas several copies of a gene are often found in the genomes of chordates. It is therefore supposed that chordate genomes have doubled at least two or three times (1‐2‐4 rule). The first genome duplication in chordate evolution took place before the Cambrian explosion, whereas the second doubling occurred in the early Devonian period. In the evolution of fish, a further doubling occurred in the late Devonian period, giving up to eight copies of the original deuterostome gene (1‐2‐4‐8 hypothesis). This took place after the Actinopterygii and Sarcopterygii had already diverged. The Sarcopterygii include the famous coelacanths and the lungfishes. All land vertebrates (amphibians, reptiles, birds, and mammals) have apparently descended from this group. Within the eukaryotes, maximum genome size is only weakly related to the developmental level: many plants and amphibians have genomes of up to 10¹¹ bases, one to two orders of magnitude larger than the human genome. Many genome duplications must evidently have taken place in these groups.
Figure 4.1 Number of nucleotides in the haploid genomes of important groups of organisms.
When the human genome is considered, it is obvious that a massive amount of information is present. If the DNA in an individual human cell were stretched out, it would be about 2 m long. With around 10¹³ cells in our body, the total length of DNA in all cells is 2 × 10¹⁰ km. This distance would span the route from the earth to the sun and back many times over!
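The arithmetic behind this estimate can be checked with a short sketch. The DNA length per cell and the cell count are taken from the text; the mean earth-sun distance of about 1.5 × 10⁸ km is an added assumption, not a value given in this section.

```python
# Rough check of the total DNA length in a human body.
DNA_PER_CELL_M = 2.0    # stretched-out DNA per cell in metres (from the text)
CELLS_IN_BODY = 1e13    # approximate number of cells (from the text)
EARTH_SUN_KM = 1.5e8    # assumed mean earth-sun distance in kilometres

total_km = DNA_PER_CELL_M * CELLS_IN_BODY / 1000.0   # metres -> kilometres
round_trips = total_km / (2 * EARTH_SUN_KM)          # there and back again

print(f"Total DNA length: {total_km:.1e} km")
print(f"Earth-sun round trips: {round_trips:.0f}")
```

Under these assumptions the total comes to 2 × 10¹⁰ km, which corresponds to somewhat more than 60 round trips between the earth and the sun.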
Of the 3.2 billion bases (3.2 × 10⁹) that are present in the human haploid genome, about 25% of the DNA defines genes, but only 1.5% of the DNA codes directly for proteins (Table 4.2 and Figure 4.2). The rest of the DNA is made up of RNA genes and noncoding sequences, which either serve no function or have a function that is still unknown. In recent years, microRNAs that are important for gene regulation have been found to be encoded in this “functionless” DNA (see Chapters 3 and 21).
Table 4.2 Relation between genome size and the number of genes of a few selected species whose genomes have been sequenced.
Organisms | Genome size (bp) a) | Approximate number of genes b) |
---|---|---|
Archaea | | |
Archaeoglobus fulgidus | 2.18 × 10⁶ | 2405 |
Methanothermobacter thermautotrophicus | 1.75 × 10⁶ | 1866 |
Pyrococcus furiosus | 1.91 × 10⁶ | 2057 |
Sulfolobus acidocaldarius | 2.99 × 10⁶ | 2221 |
Bacteria | | |
Clostridium tetani | 2.8 × 10⁶ | 2373 |
Escherichia coli | 4.67 × 10⁶ | 4288 |
Haemophilus influenzae | 1.83 × 10⁶ | 1702 |
Mycoplasma genitalium | 0.58 × 10⁶ | 476 |
Rhodospirillum rubrum | 4.35 × 10⁶ | 3791 |
Fungi | | |
Aspergillus fumigatus | 2.9 × 10⁷ | 9920 |
Saccharomyces cerevisiae | 1.3 × 10⁷ | 6600 |
Candida glabrata | 1.4 × 10⁷ | 5180 |
Sporozoa | | |
Plasmodium falciparum (causes malaria) | 2.3 × 10⁷ | 5300 |
Plants | | |
Arabidopsis thaliana | 2.2 × 10⁸ | 29 000 |
Animals | | |
Caenorhabditis elegans (nematode) | 1.3 × 10⁸ | 21 000 |
Drosophila melanogaster (fruit fly) | 2.0 × 10⁸ | 32 000 |
Danio rerio (zebra fish) | 1.4 × 10⁹ | 21 000 |
Mus musculus (mouse) | 2.8 × 10⁹ | 30 000 |
Homo sapiens (human) | 3.2 × 10⁹ | 30 000 |
a) Haploid genome.
b) Including protein‐coding and noncoding RNA genes.
Source: www.ebi.ac.uk/genomes.
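The relation between genome size and gene number in Table 4.2 becomes easier to compare when it is expressed as gene density. The following is a minimal sketch that uses a few rows of the table; the helper name `genes_per_mb` is introduced here for illustration only.

```python
# Gene density (genes per million base pairs) for selected rows of Table 4.2.
genomes = {
    # species: (haploid genome size in bp, approximate number of genes)
    "Mycoplasma genitalium": (0.58e6, 476),
    "Escherichia coli": (4.67e6, 4288),
    "Saccharomyces cerevisiae": (1.3e7, 6600),
    "Arabidopsis thaliana": (2.2e8, 29000),
    "Homo sapiens": (3.2e9, 30000),
}


def genes_per_mb(size_bp, n_genes):
    """Return the number of genes per 10^6 bp of haploid genome."""
    return n_genes / (size_bp / 1e6)


for name, (size_bp, n_genes) in genomes.items():
    print(f"{name:26s} {genes_per_mb(size_bp, n_genes):8.1f} genes/Mb")
```

With the tabulated values, the two bacteria and yeast come out at roughly 500 to 900 genes per million base pairs, whereas the human genome yields fewer than 10. This is in line with the small coding fraction of large eukaryotic genomes.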
Figure 4.2 Composition of eukaryotic genomes and the fraction of the entire human genome contributed by a few DNA elements.
Possibly the largest part of the genome (over 50% in higher eukaryotes) is not transcribed and, according to our present knowledge, is partially functionless. Important elements of this fraction are pseudogenes and repetitive DNA sequences (Table 4.3 and Figure 4.2).
Table 4.3 A few characteristics of the human genome.
Parameter | Human genome |
---|---|
Genome size | 3.2 × 10⁹ bp |
Number of protein‐coding genes | 21 000 |
Longest gene | 2.4 × 10⁶ bp |
Mean gene size | 27 000 bp |
Smallest number of exons/gene | 1 |
Highest number of exons/gene | 178 |
Mean number of exons/gene | 10.4 |
Largest exon | 17 106 bp |
Mean exon size | 145 bp |
Number of pseudogenes | >20 000 |
Number of noncoding RNA genes | >9000 |
Percentage of protein‐coding sequences | 1.5% |
Percentage DNA in rRNA, functional DNA | 3.5% |
Percentage in repetitive DNA elements | ∼50% |
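The mean values in Table 4.3 allow a rough plausibility check of the coding fraction. The sketch below simply multiplies averages and treats the tabulated gene size as a mean. It ignores untranslated regions, overlapping genes, and RNA genes, so it should be read as an order-of-magnitude estimate rather than a derivation of the percentages in the table.

```python
# Back-of-the-envelope estimate from the mean values in Table 4.3.
GENOME_BP = 3.2e9
PROTEIN_CODING_GENES = 21_000
MEAN_GENE_BP = 27_000
MEAN_EXONS_PER_GENE = 10.4
MEAN_EXON_BP = 145

# Average exonic length of one gene, and its share of the gene's length
exonic_bp_per_gene = MEAN_EXONS_PER_GENE * MEAN_EXON_BP
exon_share_of_gene = exonic_bp_per_gene / MEAN_GENE_BP

# Shares of the whole genome taken up by exons and by entire genes
exon_share_of_genome = PROTEIN_CODING_GENES * exonic_bp_per_gene / GENOME_BP
gene_share_of_genome = PROTEIN_CODING_GENES * MEAN_GENE_BP / GENOME_BP

print(f"Exonic bp per gene:     {exonic_bp_per_gene:.0f}")
print(f"Exon share of a gene:   {exon_share_of_gene:.1%}")
print(f"Exon share of genome:   {exon_share_of_genome:.1%}")
print(f"Gene share of genome:   {gene_share_of_genome:.1%}")
```

The estimate gives about 1500 bp of exon sequence per gene, which is only a few percent of an average gene. Summed over all protein‐coding genes, exons account for about 1% of the genome and entire genes for somewhat under 20%, the same order of magnitude as the 1.5% and 25% quoted in the text.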
It is usually (but not always) the case that new functional genes develop through the duplication of genes in the course of evolution. New genes can also be generated by combining domains or partial gene sequences. Horizontal gene transfer (which happened when bacteria became mitochondria) also helped to enlarge the eukaryotic genomes. In contrast, pseudogenes, which are nontranslatable copies of genes, show frameshifts, nonsense mutations, deletions, and insertions (see Section 4.1.4). Pseudogenes do not have any further function today. They can be divided into two groups: the first arose from gene duplication and the second from retroposons. In the second case, the genes were transcribed and processed and, following reverse transcription into DNA, were inserted at a new location in the genome. These retropseudogenes usually have no introns but frequently carry poly(A) tails, and unlike the pseudogenes that arose by duplication, they do not lie in the vicinity of the original gene from which they arose. Surprisingly, nature can afford to reproduce this junk DNA in every generation, even though replication is an energy‐consuming process. Perhaps these DNA sections that appear to be useless today will become functional in a later evolutionary phase as molecular replacement parts.
When the duplicated DNA sequence lies beside the original gene, it is termed a tandem repeat. Tandem repeats are the starting point for further DNA amplifications, induced by unequal crossing‐over. Repetitive DNA is quantitatively important and can be divided into middle repetitive DNA (transposons and retroelements) and highly repetitive DNA. The latter class comprises short nucleotide sequences that are present in chromosomes in great numbers and in tandem arrangement. It is further subdivided into telomere, satellite, minisatellite, and microsatellite DNA.
When the DNA of eukaryotes is separated by cesium chloride gradient centrifugation, two bands are often observed, the smaller of which contains satellite DNA. This satellite DNA is especially rich in repetitive sequences and is preferentially located in the region of the centromeres. In insects and other arthropods, satellite DNA is very homogeneous, meaning that its sequence elements are highly conserved. In vertebrates, satellite DNA contains up to 1000 repetitions of a sequence unit that is significantly longer (over 200 bp) and more variable; subelements such as GA5TGA can often be found in these units. Owing to unequal crossing‐over, the variability of satellite DNA is about 10 times higher than that of genes with only a low copy number. The distribution and organization of the repetitive DNA elements in the centromere region are chromosome and species specific. It is assumed that the repetitive DNA in the centromere region is responsible for the recognition of homologous chromosomes and for their alignment next to each other during meiosis.
In the satellite DNA of both plants and animals, elements are also found that are repeated 5–50 times, each unit being 15–100 bp long. These sequence elements can be traced back to an original sequence that was varied through point mutations. This repetitive DNA, with arrays of about 500–5000 nucleotides in length, is significantly shorter than the satellite DNA described above and is termed minisatellite DNA or variable number tandem repeats (VNTRs). It exhibits great length variability at every locus. Because unequal crossing‐over changes the number of repeats and thus the length of the array, the mutation rate is very high and can amount to 5% per gamete. Minisatellite DNA is therefore regarded as a hot spot of meiotic recombination. It is especially suitable for the identification of individuals and has also been used to clarify paternity and to assess homozygosity in a population. Many VNTR loci have dozens of alleles each, which are codominantly inherited. This characteristic was exploited in DNA fingerprinting. The probability that two unrelated individuals have the same DNA fingerprint is less than 1 in 10 million. Presently, DNA fingerprinting is based on short tandem repeat (STR) and single nucleotide polymorphism (SNP) analyses.
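How several variable, codominantly inherited loci combine to give such small match probabilities can be sketched with the product rule from population genetics. This is an illustration, not the method described in the text: it assumes Hardy‐Weinberg proportions and statistically independent loci, and the allele frequencies in `profile` are hypothetical values chosen for demonstration.

```python
from math import prod


def genotype_frequency(p, q=None):
    """Expected genotype frequency under Hardy-Weinberg proportions.

    One allele frequency (homozygote): p^2.
    Two allele frequencies (heterozygote): 2pq.
    """
    return p * p if q is None else 2 * p * q


# Hypothetical allele frequencies at four independent VNTR loci.
# A one-element tuple stands for a homozygous locus.
profile = [(0.05, 0.10), (0.08, 0.04), (0.10,), (0.06, 0.12)]

# Product rule: multiply the genotype frequencies of all loci
match_probability = prod(genotype_frequency(*alleles) for alleles in profile)

print(f"Random match probability: {match_probability:.1e}")
print(f"Roughly 1 in {1 / match_probability:,.0f}")
```

With alleles at frequencies of a few percent, as expected for loci with dozens of alleles, these four hypothetical loci already push the match probability below 1 in 10 million.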
In addition, still shorter repeats occur in animal and plant genomes. These consist of a basic unit of two (sometimes as many as five) nucleotides, such as (GC)n or (CA)n, which is repeated up to 100 times. About 30 000 loci of these elements, termed microsatellites or short tandem repeats (STRs), are found in humans. They are of great importance for the identification of tissues and individuals, for paternity and population studies, and for genome mapping. In forensic medicine and criminal investigations, STR analysis is the method of choice for solving sexual crimes or murders. The alleles can be amplified by polymerase chain reaction (PCR) (see Chapter 13). Microsatellite PCR is currently the method of choice for many forensic, biotechnological, and biological investigations because it requires only minute amounts of DNA. The variability of microsatellite DNA is strongly increased by unequal crossing‐over during meiosis and by slippage of the DNA polymerase, so that the short sequence elements can be mutated, duplicated, or deleted. As an alternative to STR analyses, SNP analyses have become available for a number of organisms and often provide a more detailed picture of the genetic background.
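The structure of a microsatellite, a unit of two to five nucleotides repeated in tandem, can be located in a sequence with a simple pattern search. The following is a minimal sketch using Python's `re` module. The function name, the threshold of five repeats, and the example sequence are assumptions made for illustration and do not represent an established STR‐typing procedure.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=5, min_repeats=5):
    """Return (start, unit, repeat_count) for tandem repeats of short units."""
    # ([ACGT]{2,5}?) captures the shortest possible unit;
    # \1{n,} requires at least n further copies directly behind it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical test sequence containing a (CA)7 and a (GC)6 repeat
example = "TTGG" + "CA" * 7 + "GGT" + "GC" * 6 + "TTA"
print(find_microsatellites(example))
```

For the example sequence, the function reports `(4, 'CA', 7)` and `(21, 'GC', 6)`. The sketch does not distinguish true dinucleotide repeats from long runs of a single base, which a real analysis would have to filter out.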
In addition, DNA sections about 500 bases long are found dispersed in animal and plant genomes. These short interspersed elements (SINEs), together with the long interspersed elements (LINEs) of 1000–5000 nucleotides, appear in high copy numbers, although not as tandem repeats (Figure 4.2). The DNA elements Alu (which is recognized by the restriction enzyme AluI), Kpn, and poly(CA) are also counted among the SINEs. These elements make up about 20% of the human genome. It is presumed that these elements, which are also called mobile genetic elements or retrotransposons, arise through reverse transcription. From an evolutionary point of view, transposons (with long terminal repeats [LTRs] or inverted repeats [IRs]), retrotransposons, and retroposons (transposons without LTRs) could be considered examples of active egoistic genes (selfish DNA), which promote only their own replication. On the other hand, these mobile elements generate genetic variability (through increased exon shuffling or enhancer shuffling) that can also have positive effects in the long run. In regions containing Alu sequences, chromosomes exhibit increased rates of rearrangement. When Alu elements jump into active genes, the genes are in most cases inactivated; conversely, silent genes can be activated, because the transposed elements can function as enhancers. Ultimately, this makes new characteristics available for selection. Sexual isolation and speciation can be promoted through this mechanism.
The relative percentage of nonrepetitive DNA is 100% in bacteria and decreases in the more highly developed eukaryotes: 70% in Drosophila, around 55% in mammals, and 33% in plants.
The percentage of repetitive DNA increases correspondingly. Assisted by unequal crossing‐over, the percentage of repetitive DNA in eukaryotic genomes will probably increase further in the course of future evolution. As explained above, the function of about 50% of the genome remains unknown. Whether repetitive DNA really is functionless or egoistic DNA, as is often speculated, will have to be determined by future research.