Molecular Biotechnology - Bernard R. Glick
2 Fundamental Technologies
Preparation of DNA for Cloning
Insertion of Target DNA into a Plasmid Vector
Transformation and Selection of Cloned DNA in a Bacterial Host
Genome Engineering Using CRISPR Technology
Assembling Oligonucleotides into Genes
Sequencing Using Reversible Chain Terminators
Sequencing by Single Molecule Synthesis
Preparation of Genomic DNA Sequencing Libraries
High-Throughput Next-Generation Sequencing Strategies
Molecular Cloning
Molecular biotechnology uses a variety of techniques for isolating genes and transferring them from one organism to another. At the root of these technologies is the ability to join a sequence of deoxyribonucleic acid (DNA) of interest to a vector that can then be introduced into a suitable host. This process is known as recombinant DNA technology or molecular cloning. A vast number of variations on this basic process have been devised to produce products that are important for medicine, food production, environmental remediation, and industrial processes.
Preparation of DNA for Cloning
In theory, DNA from any organism can be cloned. The target DNA may be obtained directly from genomic DNA, derived from messenger ribonucleic acid (mRNA), subcloned from previously cloned DNA, or synthesized in vitro. The target DNA may contain the complete coding sequence for a protein, a part of the protein coding sequence, a random fragment of genomic DNA, or a segment of DNA that contains regulatory elements that control expression of a gene. Prior to cloning, both the source DNA that contains the target sequence and the cloning vector must be cut into discrete fragments, predictably and reproducibly, so that they can be joined (ligated) together to form a stable molecule. Bacterial enzymes known as type II restriction endonucleases, or (more commonly) restriction enzymes, are used for this purpose (see the Milestone box below). These enzymes recognize and cut double-stranded DNA molecules at specific base pair sequences and are produced naturally by bacteria to cleave foreign DNA, such as that of infecting bacterial viruses (bacteriophages). A bacterium that produces a specific restriction endonuclease also has a corresponding system to modify the sequence recognized by the restriction endonuclease in its own DNA to protect it from being degraded. Often, methylation of a cytosine within a restriction endonuclease binding site prevents the enzyme from cutting at these sites.
Milestone: Cleavage of DNA by RI Restriction Endonuclease Generates Cohesive Ends
Recombinant DNA technology requires a vector to carry cloned DNA, the specific joining of vector and cloned (insert) DNA molecules to form a vector–insert DNA construct, the introduction of the vector–insert DNA construct into a host cell, and the identification of host cells that acquire the cloned DNA. The discovery of type II restriction endonucleases facilitated cloning genes into vectors. In 1968, M. Meselson and R. Yuan (Nature 217:1110–1114) showed that the capability of a strain of E. coli to prevent (restrict) the development of a bacterial virus (bacteriophage) was due to a host cell enzyme that cleaved the DNA of the infecting bacteriophage. The study done by Mertz and Davis (Proc. Natl. Acad. Sci. USA. 69:3370–3374, 1972) established that the RI restriction endonuclease from E. coli, which is now called EcoRI, cut DNA at a specific site and produced complementary extensions. Briefly, they showed that after circular DNA was linearized by treatment with EcoRI, some of the molecules formed hydrogen-bonded circular DNA molecules which were converted to covalently closed circular DNA molecules by treating the sample with a DNA ligase. The extensions of all of the cut DNA molecules were the same and were estimated to be four to six nucleotides long, with the recognition site being six nucleotide pairs. Mertz and Davis concluded that “… any two DNA molecules with RI sites can be ‘recombined’ at their restriction sites by the sequential action of RI endonuclease and DNA ligase to generate hybrid DNA molecules.” The discovery that EcoRI created cohesive ends was one of the most important contributions to the development of recombinant DNA technology because it provided, according to Mertz and Davis, a “simple way … to generate specifically oriented recombinant DNA molecules in vitro.”
A large number of restriction endonucleases from different bacteria are available to facilitate cloning. The sequence and length of the recognition site in the DNA vary among the different enzymes and can be four or more nucleotide pairs (Table 2.1). One of the first restriction endonucleases to be characterized was from the bacterium Escherichia coli, designated EcoRI. The name of a restriction endonuclease indicates the genus (capitalized letter), species (first two letters in lowercase), and occasionally the strain or serotype (e.g., R in EcoRI) of the source bacterium, as well as the order of characterization of different restriction endonucleases from the same bacterium (Roman numerals). Like most restriction endonucleases, EcoRI is a homodimeric protein (made up of two identical polypeptides) that recognizes and binds to a specific, palindromic DNA sequence (Fig. 2.1A). In DNA, a palindrome is a sequence of nucleotides that is identical on each of the two strands when both are read in the same polarity, i.e., 5′ to 3′. The EcoRI recognition sequence consists of six base pairs (bp) and is cut between the guanine and adenine residues on each strand (Fig. 2.1A). Specifically, it cleaves the bond between the oxygen attached to the 3′ carbon of the sugar of one nucleotide and the phosphate group attached to the 5′ carbon of the sugar of the adjacent nucleotide. The symmetrical staggered cleavage of DNA by EcoRI produces two single-stranded, complementary ends, each with extensions of four nucleotides, often referred to as sticky ends. Each single-stranded extension terminates with a 5′ phosphate group, and the 3′ hydroxyl group of the opposite strand is recessed (Fig. 2.1A). Some other restriction endonucleases, such as PstI, leave 3′ hydroxyl extensions with recessed 5′ phosphate ends (Fig. 2.1B), while others, such as SmaI, cut the backbone of both strands within a recognition site to produce blunt-ended DNA molecules (Fig. 2.1C).
Table 2.1 Recognition sequences of some restriction endonucleases
Figure 2.1 Type II restriction endonucleases bind to and cut within a specific DNA sequence. (A) EcoRI makes a staggered cut in the DNA strands producing single-stranded, complementary ends (sticky ends) with a 5′ phosphate group extension; (B) PstI also makes a staggered cut in both strands but produces sticky ends with a 3′ hydroxyl group extension; (C) cleavage of DNA with SmaI produces blunt ends. Arrows show the sites of cleavage in the DNA backbone. S, deoxyribose sugar; P, phosphate group; OH, hydroxyl group. The restriction endonuclease recognition site is shaded.
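The palindrome test and the three kinds of ends described above can be illustrated with a short Python sketch. It is a simplified model, not a description of how the enzymes work: only the top strand is stored, the DNA is treated as linear, and the recognition sequences and cut positions used in the example (EcoRI, PstI, SmaI) are taken from commonly published enzyme data rather than from this text, since Table 2.1 is not reproduced here. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic(site: str) -> bool:
    """A site is palindromic when both strands read the same 5'->3'."""
    return site == reverse_complement(site)


def overhang(site: str, cut_offset: int):
    """Classify the end left by a symmetric cut in a palindromic site.

    cut_offset = number of site bases to the left of the top-strand cut.
    By symmetry the bottom strand is cut at len(site) - cut_offset.
    """
    bottom_cut = len(site) - cut_offset
    if cut_offset < bottom_cut:
        return "5' extension", site[cut_offset:bottom_cut]
    if cut_offset > bottom_cut:
        return "3' extension", site[bottom_cut:cut_offset]
    return "blunt", ""


def digest(top_strand: str, site: str, cut_offset: int):
    """Cut a linear top strand at every occurrence of the site."""
    fragments, start = [], 0
    pos = top_strand.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(top_strand[start:cut])
        start = cut
        pos = top_strand.find(site, pos + 1)
    fragments.append(top_strand[start:])
    return fragments


# Assumed recognition data (top strand, cut offset); not from the text.
ENZYMES = {"EcoRI": ("GAATTC", 1), "PstI": ("CTGCAG", 5), "SmaI": ("CCCGGG", 3)}

for name, (site, offset) in ENZYMES.items():
    print(name, is_palindromic(site), overhang(site, offset))
print(digest("AAGAATTCGG", *ENZYMES["EcoRI"]))
```

In this model EcoRI yields a four-nucleotide 5′ extension, PstI a 3′ extension, and SmaI a blunt end, matching the three cases in Fig. 2.1.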
Restriction endonucleases isolated from different bacteria may recognize and cut DNA at the same site (Fig. 2.2A). These enzymes are known as isoschizomers. Some recognize and bind to the same sequence of DNA but cleave at different positions (neoschizomers), producing different single-stranded extensions (Fig. 2.2B). Other restriction endonucleases (isocaudomers) produce the same nucleotide extensions but have different recognition sites (Fig. 2.2C). In some cases, a restriction endonuclease will cleave a sequence only if one of the nucleotides in the recognition site is methylated, while in other cases a restriction endonuclease (type IIS) binds to a specific recognition site but cuts outside of that site, a fixed number of nucleotides away from one or both ends. In the latter case, any particular sequence of nucleotides may be present between the binding sequence and the cut sites. These characteristics of restriction endonucleases are considered when designing a cloning experiment.
Figure 2.2 Restriction endonucleases have been isolated from many different bacteria. (A) Isoschizomers such as BspEI from Bacillus sp. and AccIII from Acinetobacter calcoaceticus bind the same DNA sequence and cut at the same sites; (B) neoschizomers such as NarI from Nocardia argentinensis and SfoI from Serratia fonticola bind the same DNA sequence but cut at different sites; (C) isocaudomers such as NcoI from Nocardia corallina and PagI from Pseudomonas alcaligenes bind different DNA sequences but produce the same sticky ends. Bases in the restriction enzyme recognition sequence are shown. Arrows show the sites of cleavage in the DNA backbone.
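The three relationships between enzyme pairs can be expressed as a small classification rule. The Python sketch below is illustrative only: the `Enzyme` record and `classify` function are hypothetical names, and the recognition sequences and cut positions for the six enzymes named in Fig. 2.2 are assumed from commonly published enzyme data, because the figure itself is not reproduced in this text.

```python
from typing import NamedTuple


class Enzyme(NamedTuple):
    name: str
    site: str        # recognition sequence, top strand 5'->3'
    cut_offset: int  # site bases to the left of the top-strand cut


def extension(enzyme: Enzyme):
    """Return (kind, sequence) of the single-stranded extension."""
    bottom_cut = len(enzyme.site) - enzyme.cut_offset
    if enzyme.cut_offset < bottom_cut:
        return "5'", enzyme.site[enzyme.cut_offset:bottom_cut]
    if enzyme.cut_offset > bottom_cut:
        return "3'", enzyme.site[bottom_cut:enzyme.cut_offset]
    return "blunt", ""


def classify(a: Enzyme, b: Enzyme) -> str:
    """Name the relationship between two restriction endonucleases."""
    if a.site == b.site:
        same_cut = a.cut_offset == b.cut_offset
        return "isoschizomers" if same_cut else "neoschizomers"
    kind, seq = extension(a)
    if kind != "blunt" and (kind, seq) == extension(b):
        return "isocaudomers"
    return "unrelated"


# Assumed enzyme data for the pairs named in Fig. 2.2.
pairs = [
    (Enzyme("BspEI", "TCCGGA", 1), Enzyme("AccIII", "TCCGGA", 1)),
    (Enzyme("NarI", "GGCGCC", 2), Enzyme("SfoI", "GGCGCC", 3)),
    (Enzyme("NcoI", "CCATGG", 1), Enzyme("PagI", "TCATGA", 1)),
]
for a, b in pairs:
    print(a.name, b.name, classify(a, b))
```

The sketch covers only enzymes that cut symmetrically within a palindromic site; type IIS enzymes, which cut outside the recognition site, would need a different representation.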
Many other enzymes may be used to prepare DNA for cloning. In addition to restriction endonucleases, nucleases that degrade single-stranded extensions, such as S1 nuclease and mung bean nuclease, are used to generate blunt ends for cloning (Fig. 2.3A). This is useful when the recognition sequences for restriction enzymes that produce complementary sticky ends are not available on both the vector and target DNA molecules. Blunt ends can also be produced by extending 3′ recessed ends using a DNA polymerase such as Klenow polymerase derived from E. coli DNA polymerase I (Fig. 2.3B). Phosphatases such as calf intestinal alkaline phosphatase cleave the 5′ phosphate groups from restriction enzyme-digested DNA (Fig. 2.3C). A 5′ phosphate group is required for formation of a phosphodiester bond between nucleotides, and therefore, its removal prevents recircularization (self-ligation) of vector DNA. On the other hand, kinases add phosphate groups to the ends of DNA molecules. Among other activities, T4 polynucleotide kinase catalyzes the transfer of the terminal (γ) phosphate from a nucleoside triphosphate to the 5′ hydroxyl group of a polynucleotide (Fig. 2.3D). This enzyme is employed to prepare chemically synthesized DNA for cloning, as such DNA molecules are often missing a 5′ phosphate group required for ligation to vector DNA.
Figure 2.3 Some other enzymes used to prepare DNA for cloning. (A) Mung bean nuclease degrades single-stranded 5′ and 3′ extensions to generate blunt ends; (B) Klenow polymerase extends 3′ recessed ends to generate blunt ends; (C) calf intestinal alkaline phosphatase removes the 5′ phosphate group from the ends of linear DNA molecules; (D) T4 polynucleotide kinase catalyzes the addition of a 5′ phosphate group to the ends of linear DNA fragments. Dotted lines indicate that only one end of the linear DNA molecule is shown.
Insertion of Target DNA into a Plasmid Vector
When two different DNA molecules are digested with the same restriction endonuclease, the same sticky ends are produced in both molecules. After the two molecules are mixed together, new DNA combinations can be formed as a result of complementary base-pairing between the extended regions (Fig. 2.4). The enzyme DNA ligase, usually from the E. coli bacteriophage T4, is used to re-form the phosphodiester bond between the 3′ hydroxyl group and the 5′ phosphate group at the ends of DNA strands that are already held together by the hydrogen bonds between the complementary bases of the extensions (Fig. 2.4). DNA ligase also joins blunt ends, although this is generally much less efficient and typically requires a much greater amount of DNA ligase.
Figure 2.4 Ligation of two different DNA fragments after digestion of both with restriction endonuclease BamHI. Complementary nucleotides in the single-stranded extensions form hydrogen bonds. T4 DNA ligase catalyzes the formation of phosphodiester bonds by joining 5′ phosphate and 3′ hydroxyl groups at nicks in the backbone of the double-stranded DNA.
Ligation of restriction enzyme-digested DNA provides a means to stably insert target DNA into a vector for introduction and propagation in a suitable host cell. Many different vectors have been developed to act as carriers for target DNA. Most are derived from natural gene carriers, such as genomes of viruses that infect eukaryotic or prokaryotic cells and integrate into the host genome, or plasmids that are found in bacterial or fungal cells. Others are synthetically constructed artificial chromosomes designed for delivery of large pieces of target DNA (>100 kilobase pairs [kb]) into bacterial, yeast, or mammalian host cells. Many different vectors that carry sequences required for specific functions, for example, for expression of foreign DNA in a host cell, are described throughout this book. Here, vectors based on bacterial plasmids are used to illustrate the basic features of a cloning vector.
Plasmids are small, usually circular, double-stranded DNA molecules that are found naturally in many bacteria. They can range in size from less than 1 kb to more than 500 kb and are maintained as extrachromosomal entities that replicate independently of the bacterial chromosome. While they are not usually essential for bacterial cell survival under laboratory conditions, plasmids often carry genes that are advantageous under particular conditions. For example, they may carry genes that encode resistance to antibiotics or heavy metals, genes for the degradation of unusual organic compounds, or genes required for toxin production. Each plasmid has a sequence that functions as an origin (initiation site) of DNA replication which is required for it to replicate in a host cell. Some plasmids carry information for their own transfer from one cell to another.
The number of copies of a plasmid that are present in a host cell is controlled by factors that regulate plasmid replication and are characteristic of that plasmid. High-copy-number plasmids are present in 10 to more than 100 copies per cell. Low-copy-number plasmids are maintained in 1 to 4 copies per cell. When two or more different plasmids cannot coexist in the same host cell because they use the same mechanism of replication, they are said to belong to the same plasmid incompatibility group. However, plasmids from different incompatibility groups can be maintained together in the same cell. This coexistence is independent of the copy numbers of the individual plasmids. Some microorganisms have been found to contain as many as 8 to 10 different plasmids. In these instances, each plasmid can carry out different functions and have its own unique copy number, and each belongs to a different incompatibility group. Some plasmids can replicate in only one (or very few) host species because they require very specific proteins for their replication as determined by their origin of replication (often denoted as oriV or origin of vegetative replication to distinguish it from the oriC or origin of chromosomal replication). These are generally referred to as narrow-host-range plasmids. On the other hand, broad-host-range plasmids have less specific origins of replication and can replicate in a number of different bacterial species. The copy number, incompatibility group, and host range of a plasmid are considered when choosing a suitable vector for a molecular cloning experiment.
As autonomous, self-replicating genetic elements, plasmids are useful vectors for carrying cloned DNA. However, naturally occurring plasmids often lack several important features that are required for a good cloning vector. These include a choice of unique (single) restriction endonuclease recognition sites into which the target DNA can be inserted and one or more selectable genetic markers for identifying recipient cells that carry the cloning vector–insert DNA construct. Most of the plasmids that are currently used as cloning vectors have been genetically modified to include these features.
An example of a commonly used plasmid cloning vector is pUC19 (the lowercase p denotes a plasmid), which is derived from a natural E. coli plasmid. The plasmid pUC19 is 2,686 bp long, contains an origin of replication that enables it to replicate in E. coli, and has a high copy number, which is useful when a large number of copies of the target DNA or its encoded protein are required (Fig. 2.5A). It has been genetically engineered to possess a multiple-cloning site (also known as a polylinker), a short (54-bp) DNA sequence that contains many unique restriction endonuclease sites (Fig. 2.5B). A DNA sequence from the lactose operon of E. coli has also been added that includes a segment of the β-galactosidase gene (lacZ′) under the control of the lac promoter and a lacI gene that produces a repressor protein that regulates the expression of the lacZ′ gene from the lac promoter (Fig. 2.5A). The multiple-cloning site has been inserted within the β-galactosidase gene in a manner that does not disrupt the function of the β-galactosidase enzyme when it is expressed (Fig. 2.5B). In addition, pUC19 carries the bla gene (Ampr gene) encoding β-lactamase, which renders the cell resistant to ampicillin and can therefore be used as a selectable marker to identify cells that carry the vector.
Figure 2.5 Plasmid cloning vector pUC19. (A) The plasmid contains an origin of replication for propagation in E. coli, an ampicillin resistance gene (Ampr) for selection of cells carrying the plasmid, and a multiple-cloning site for insertion of cloned DNA. (B) The multiple-cloning site (nucleotides in uppercase letters) containing several unique restriction endonuclease recognition sites (indicated by horizontal lines) was inserted into the lacZ′ gene (nucleotides in lowercase letters) in a manner that does not disrupt the production of a functional β-galactosidase (LacZα fragment). The first 26 amino acids of the protein are shown. Expression of the lacZ′ gene is controlled by the LacI repressor encoded by the lacI gene on the plasmid. The size of the plasmid is 2,686 bp.
To clone a gene of interest into pUC19, the vector is cut with a restriction endonuclease that has a unique recognition site within the multiple-cloning site (Fig. 2.6). The source DNA carrying the target gene is digested with the same restriction enzyme, which cuts at sites flanking the sequence of interest. The resulting linear molecules, which have the same sticky ends, are mixed together and then treated with T4 DNA ligase. A number of different ligated combinations are produced by this reaction, including the original circular plasmid DNA (Fig. 2.6). To reduce the amount of this unwanted ligation product, prior to ligation, the cleaved plasmid DNA is treated with the enzyme alkaline phosphatase to remove the 5′ phosphate groups from the linearized plasmid DNA (Fig. 2.3C). T4 DNA ligase cannot join the ends of the dephosphorylated linear plasmid DNA. However, the target DNA is not treated with alkaline phosphatase and therefore provides phosphate groups to form two phosphodiester bonds with the alkaline phosphatase-treated vector DNA (Fig. 2.6). Following treatment with T4 DNA ligase, the two phosphodiester bonds are sufficient to hold the circularized molecules together, despite the presence of two nicks (Fig. 2.6). After introduction into a host bacterium, these nicks are sealed by the host cell DNA ligase system.
Figure 2.6 Cloning target DNA into pUC19. The restriction endonuclease BamHI cleaves pUC19 at a unique sequence in the multiple-cloning site (MCS) and at sequences flanking the target DNA. The cleaved vector is treated with alkaline phosphatase to remove 5′ phosphate groups to prevent vector recircularization. Digested target DNA and pUC19 are mixed to join the two molecules via complementary single-stranded extensions and treated with T4 DNA ligase to form a phosphodiester bond between the joined molecules. Several ligation products are possible. In addition to pUC19 carrying inserted target DNA, undesirable circularized target DNA molecules and recircularized pUC19 that escaped treatment with alkaline phosphatase are produced.
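The effect of alkaline phosphatase treatment on the possible ligation products can be worked through with a small bookkeeping model. The Python sketch below is a simplification under stated assumptions: all ends are taken to be compatible sticky ends from the same enzyme, each junction between two ends has one nick per strand, and a nick is sealed only if the 5′ end at that nick carries a phosphate. The `Fragment` class and `circularize` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    has_5p_phosphate: bool  # False after alkaline phosphatase treatment


def junction_bonds(left: Fragment, right: Fragment) -> int:
    """Phosphodiester bonds ligase can form at one annealed junction.

    The top-strand nick needs the 5' phosphate of the right fragment;
    the bottom-strand nick needs the 5' phosphate of the left fragment.
    """
    return int(left.has_5p_phosphate) + int(right.has_5p_phosphate)


def circularize(fragments):
    """Join fragments head to tail into a circle and count bonds and nicks."""
    n = len(fragments)
    bonds = [junction_bonds(fragments[i], fragments[(i + 1) % n]) for i in range(n)]
    return {
        "bonds": sum(bonds),
        "nicks": 2 * n - sum(bonds),
        # The circle holds together only if every junction has a bond.
        "closed": all(b >= 1 for b in bonds),
    }


vector = Fragment("pUC19 (phosphatase treated)", has_5p_phosphate=False)
insert = Fragment("target DNA", has_5p_phosphate=True)

print("vector alone:   ", circularize([vector]))
print("vector + insert:", circularize([vector, insert]))
print("insert alone:   ", circularize([insert]))
```

Under these assumptions the dephosphorylated vector cannot recircularize on its own, the vector–insert circle is held by two bonds with two remaining nicks, and the target DNA can still circularize by itself, which is consistent with the products listed in Fig. 2.6.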
Transformation and Selection of Cloned DNA in a Bacterial Host
After ligation, the next step in a cloning experiment is to introduce the vector–target DNA construct into a suitable host cell. A wide range of prokaryotic and eukaryotic cells can be used as cloning hosts; however, routine cloning procedures are often carried out using a well-studied bacterial host, usually E. coli. The process of taking up DNA into a bacterial cell is called transformation, and a cell that is capable of taking up DNA is said to be competent. Competence occurs naturally in many bacteria, usually when cells are stressed in high-density populations or in nutrient-poor environments, and enables bacteria to acquire new sequences that may enhance survival. Although competence and transformation are not intrinsic properties of E. coli, competence can be induced by various treatments.
One method to induce the uptake of plasmid DNA by a bacterial host such as E. coli is to treat mid-log-phase cells with ice-cold calcium chloride (CaCl2) and then expose them for two minutes to a high temperature (42°C). This treatment creates transient openings in the cell wall that enable DNA molecules to enter the cytoplasm. Alternatively, uptake of free DNA can be induced by subjecting bacteria to a high-voltage electric field in a procedure known as electroporation. The experimental protocols for electroporation are different for various bacterial species. For E. coli, the cells (∼50 microliters [μL]) and DNA are placed in a chamber fitted with electrodes (Fig. 2.7A), and a single pulse of approximately 2.5 kilovolts (with settings of about 25 microfarads and 200 ohms) is administered for about 4.6 milliseconds (ms). Although the precise mechanism of DNA uptake during electroporation is not known, it has been deduced that transient pores are formed in the cell wall as a result of the electroshock and that, after contact with the lipid bilayer of the cell membrane, the DNA is taken into the cell (Fig. 2.7B). Generally, transformation is an inefficient process, and therefore, most of the cells will not have acquired a plasmid; at best, about 1 cell in 1,000 E. coli host cells is transformed. The integrity of the introduced DNA constructs is more likely to be maintained in host cells that are unable to carry out exchanges between DNA molecules because the gene encoding the recombination enzyme RecA has been deleted from the host chromosome.
Figure 2.7 Electroporation. (A) Electroporation cuvette with a cell suspension between two electrodes. (B) (1) Cells (yellow) and DNA (red) in suspension in an electroporation cuvette prior to the administration of high-voltage electric field (HVEF) pulses. (2) HVEF pulses induce transient openings in the cells (dashed lines) that allow entry of DNA into the cells. (3) After HVEF pulsing, some cells acquire exogenous DNA.
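The electroporation settings quoted above (about 25 microfarads, 200 ohms, 2.5 kilovolts) can be related to the pulse duration with a short calculation. The Python sketch below assumes that the pulse is an exponentially decaying capacitor discharge, which the text does not state; under that assumption the nominal time constant is the product of resistance and capacitance. The function names are hypothetical.

```python
import math


def time_constant_ms(resistance_ohm: float, capacitance_uF: float) -> float:
    """Nominal RC time constant in ms (ohm x microfarad = microseconds)."""
    return resistance_ohm * capacitance_uF / 1000.0


def voltage_at(t_ms: float, v0_kV: float, tau_ms: float) -> float:
    """Voltage remaining after t_ms, assuming exponential decay."""
    return v0_kV * math.exp(-t_ms / tau_ms)


tau = time_constant_ms(resistance_ohm=200, capacitance_uF=25)
print(f"nominal time constant: {tau:.1f} ms")
print(f"voltage after one time constant: {voltage_at(tau, 2.5, tau):.2f} kV")
```

The nominal value from these settings is close to the pulse length of about 4.6 ms given in the text. The small difference may reflect the conductance of the sample itself, although that explanation is an assumption here and not something the text states.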
Cells transformed with vectors that carry a gene encoding resistance to an antibiotic can be selected by plating on medium containing the antibiotic. For example, cells carrying the plasmid vector pUC19, which contains the bla gene encoding β-lactamase, can be selected on medium containing ampicillin (Fig. 2.8A). Nontransformed cells or cells transformed with circularized target DNA cannot grow in the presence of ampicillin. However, cells transformed with the pUC19–target DNA construct and cells transformed with recircularized pUC19 that escaped dephosphorylation by alkaline phosphatase are both resistant to ampicillin. To differentiate cells carrying the desired vector–target DNA construct from those carrying the recircularized plasmid, loss of β-galactosidase activity that results from insertion of target DNA into the lacZ′ gene is determined. Recall that the multiple-cloning site in pUC19 lies within the lacZ′ gene (Fig. 2.5). An E. coli host is used that can synthesize the part of β-galactosidase (LacZω fragment) that combines with the product of the lacZ′ gene (LacZα fragment) encoded on pUC19 to form a functional enzyme. When cells carrying recircularized pUC19 are grown in the presence of isopropyl-β-D-thiogalactopyranoside (IPTG), which is an inducer of the lac operon, the protein product of the lacI gene (the LacI repressor) is prevented from binding to the promoter–operator region of the lacZ′ gene, so the lacZ′ gene in the plasmid is transcribed and translated (Fig. 2.8B). The LacZα fragment combines with a host LacZω fragment to form an active hybrid β-galactosidase. If the substrate 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (X-Gal) is present in the medium, it is hydrolyzed by the hybrid β-galactosidase to form a blue product (5,5′-dibromo-4,4′-dichloro-indigo). Under these conditions, colonies containing recircularized pUC19 appear blue (Fig. 2.8A). 
In contrast, host cells that carry a plasmid-cloned DNA construct produce white colonies on the same medium. The reason for this is that target DNA inserted into a restriction endonuclease site within the multiple-cloning site usually disrupts the correct sequence of DNA codons (reading frame) of the lacZ′ gene and prevents the production of a functional LacZα fragment, so no active hybrid β-galactosidase is produced (Fig. 2.8B). In the absence of β-galactosidase activity, the X-Gal in the medium is not converted to the blue compound, so these colonies remain white (Fig. 2.8A). The white (positive) colonies subsequently must be confirmed to carry a specific target DNA sequence.
Figure 2.8 (A) Strategy for selecting host cells that have been transformed with pUC19 carrying cloned target DNA. E. coli host cells transformed with the products of pUC19–target DNA ligation are selected on medium containing ampicillin, X-Gal, and IPTG. Nontransformed cells and cells transformed with circularized target DNA do not have a gene conferring resistance to ampicillin and therefore do not grow on medium containing ampicillin (recircularized target DNA also does not carry an origin of replication, and therefore, the plasmids are not propagated in host cells even in the absence of ampicillin). However, E. coli cells transformed with recircularized pUC19 or pUC19 carrying cloned target DNA are resistant to ampicillin and therefore form colonies on the selection medium. The two types of ampicillin-resistant transformants are differentiated by the production of functional β-galactosidase. (B) IPTG in the medium induces expression of the lacZ′ gene by binding to the LacI repressor and preventing LacI from binding to the lacO operator sequence. This results in production of the β-galactosidase LacZα fragment in cells transformed with recircularized pUC19. The LacZα fragment combines with the LacZω fragment of β-galactosidase encoded in the host E. coli chromosome to form a functional hybrid β-galactosidase. β-Galactosidase cleaves X-Gal, producing a blue product, and the colonies on the plate appear blue. Insertion of target DNA into the multiple-cloning site of pUC19 alters the reading frame of the lacZ′ gene, thereby preventing production of a functional β-galactosidase in cells transformed with this construct. Colonies that appear white on medium containing X-Gal and IPTG carry the cloned target gene (A).
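The selection and screening logic summarized in Fig. 2.8 can be written as a small decision function. This Python sketch is a simplified model of the outcomes described above, not a laboratory protocol: the function name and parameters are hypothetical, the host is assumed to supply the LacZω fragment, and an insert is assumed always to abolish LacZα function (the text says it usually does).

```python
def colony_phenotype(has_puc19: bool, has_insert: bool,
                     ampicillin: bool = True, xgal: bool = True,
                     iptg: bool = True) -> str:
    """Predict what a cell forms on the selection medium."""
    # Ampicillin resistance (bla) is carried only by the plasmid.
    if ampicillin and not has_puc19:
        return "no growth"
    # A functional LacZ-alpha fragment needs an intact lacZ' gene
    # and IPTG to relieve repression by LacI.
    lacz_alpha = has_puc19 and not has_insert and iptg
    # Active hybrid beta-galactosidase turns X-Gal into a blue product.
    if lacz_alpha and xgal:
        return "blue"
    return "white"


cases = {
    "nontransformed cell": (False, False),
    "circularized target DNA only": (False, True),
    "recircularized pUC19": (True, False),
    "pUC19 with cloned target DNA": (True, True),
}
for label, (plasmid, insert) in cases.items():
    print(f"{label}: {colony_phenotype(plasmid, insert)}")
```

Because the model treats every insert as disruptive, it does not capture inserts that happen to preserve the lacZ′ reading frame, which is one reason white colonies must still be confirmed to carry the target sequence.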
A number of selection systems have been devised to identify cells carrying vectors into which target DNA has been successfully inserted. In addition to ampicillin, other antibiotics such as tetracycline, kanamycin, and streptomycin are used as selective agents for various cloning vectors. Some vectors carry a gene that encodes a toxin that kills the cell (Table 2.2). The toxin gene is under the control of a regulatable promoter, such as the promoter for the lacZ′ gene that is activated only when the inducer IPTG is supplied in the culture medium. Insertion of a target DNA fragment into the multiple-cloning site prevents the production of a functional toxin protein in the presence of the inducer. Only cells that carry a vector with the target DNA survive under these conditions.
Table 2.2 Some toxin genes used to select for successful insertion of target DNA into a vector
In addition to E. coli, other bacteria, such as Bacillus subtilis, are often the final host cells. For many applications, cloning vectors that function in E. coli may be provided with a second origin of replication that enables the plasmid to replicate in the alternative host cell. With these shuttle cloning vectors, the initial cloning steps are generally conducted using E. coli before the final construct is introduced into a different host cell. In addition, a number of plasmid vectors have been constructed with a single broad-host-range origin of DNA replication instead of a narrow-host-range origin of replication. These vectors can be used with a variety of microorganisms.
Broad-host-range vectors can be transferred among different bacterial hosts by exploiting a natural system for transmitting plasmids known as conjugation. There are two basic genetic requirements for transfer of a plasmid by conjugation: (i) a specific origin-of-transfer (oriT) sequence on the plasmid that is recognized by proteins that initiate plasmid transfer and (ii) several genes encoding the proteins that mediate plasmid transfer. The genes encoding the transfer proteins may be present on the transferred plasmid (Fig. 2.9A), in the genome of the plasmid donor cell, or supplied on a helper plasmid (Fig. 2.9B). Some of these proteins form a pilus that extends from the donor cell and, following contact with a recipient cell, retracts to bring the two cells into close contact. A specific endonuclease cleaves one of the two strands of the plasmid DNA at the oriT, and as the DNA is unwound, the displaced single-stranded DNA is transferred into the recipient cell through a conjugation pore made up of proteins encoded by the transfer genes. A complementary strand is synthesized in both the donor and recipient cells, resulting in a copy of the plasmid in both cells.
Figure 2.9 Plasmid transfer by biparental (A) or triparental (B) conjugation. A plasmid carrying a cloned gene, an origin of transfer (oriT), and transfer genes is transferred by biparental mating from a donor cell to a recipient cell (A). Proteins encoded by the transfer genes mediate contact between donor and recipient cells, initiate plasmid transfer by nicking one of the DNA strands at the oriT, and form a pore through which the nicked strand is transferred from the donor cell to recipient cell. In a cloning experiment, the donor is often a strain of E. coli that does not grow on minimal medium, allowing selection of recipient cells that grow on minimal medium. Acquisition of the plasmid by recipient cells is determined by resistance to an antibiotic, such as ampicillin in this example. If the plasmid carrying the cloned gene does not possess genes for plasmid transfer, these can be supplied by a helper cell (B). In triparental mating, the helper plasmid is first transferred to the donor cell, where the proteins that mediate transfer of the plasmid carrying the cloned gene are expressed. Although the plasmid carrying the cloned gene does not possess transfer genes, it must have an oriT in order to be transferred. Neither the helper nor donor cells grow on minimal medium, and therefore, the recipient cells can be selected on minimal medium containing an antibiotic such as ampicillin. To ensure that the helper plasmid was not transferred to the recipient cell, sensitivity to kanamycin is determined.
Cloning Eukaryotic Genes
Bacteria lack the molecular machinery to excise the introns from RNA that is transcribed from eukaryotic genes. Therefore, before a eukaryotic sequence is cloned for the purpose of producing the encoded protein in a bacterial host, the intron sequences must be removed. Functional eukaryotic mRNA does not contain introns because they have been removed by the eukaryotic cell’s splicing machinery. Purified mRNA molecules are used as a starting point for cloning eukaryotic genes but must be converted to double-stranded DNA before they are inserted into a vector that provides bacterial sequences for transcription and translation.
Purified mRNA can be obtained from eukaryotic cells by exploiting the tract of up to 200 adenosine monophosphates (polyadenylic acid [poly(A)] tail) that is added to the 3′ end of each mRNA before it is exported from the nucleus (Fig. 2.10). The poly(A) tail provides the means for separating the mRNA fraction of a tissue from the more abundant ribosomal RNA (rRNA) and transfer RNA (tRNA). Short chains of 15 thymidine monophosphates (oligodeoxythymidylic acid [oligo(dT)]) are attached to cellulose beads, and the oligo(dT)–cellulose beads are packed into a column. Total RNA extracted from eukaryotic cells or tissues is passed through the oligo(dT)–cellulose column, and the poly(A) tails of the mRNA molecules bind by base-pairing to the oligo(dT) chains. The tRNA and rRNA molecules, which lack poly(A) tails, pass through the column. The mRNA is removed (eluted) from the column by treatment with a buffer that breaks the A:T hydrogen bonds.
Figure 2.10 Schematic representation of oligo(dT)-cellulose separation of polyadenylated mRNA from total cellular RNA.
To convert mRNA to double-stranded DNA for cloning, the enzyme reverse transcriptase, encoded by certain RNA viruses (retroviruses), is used to catalyze the synthesis of complementary DNA (cDNA) from an RNA template. If the sequence of the target mRNA is known, a short (∼20 nucleotides), single-stranded DNA molecule known as an oligonucleotide primer that is complementary to a sequence at the 3′ end of the target mRNA is synthesized (Fig. 2.11A). The primer is added to a sample of purified mRNA that is extracted from eukaryotic cells known to produce the mRNA of interest. This sample of course contains all of the different mRNAs that are produced by the cell; however, the primer will specifically base-pair with its complementary sequence on the target mRNA. Not only is the primer important for targeting a specific mRNA, but also it provides an available 3′ hydroxyl group to prime the synthesis of the first cDNA strand. In the presence of the four deoxyribonucleotides, reverse transcriptase incorporates a complementary nucleotide into the growing DNA strand as determined by the sequence of the template mRNA strand. To generate a double-stranded DNA molecule, the RNA:DNA (heteroduplex) molecules are treated with RNase H, which nicks the mRNA strands, thereby providing free 3′ hydroxyl groups for initiation of DNA synthesis by DNA polymerase I. As the synthesis of the second DNA strand progresses from the 3′ ends of the nicked mRNA fragments, the 5′ exonuclease activity of DNA polymerase I removes the ribonucleotides of the mRNA. After synthesis of the second DNA strand is completed, the ends of the cDNA molecules are blunted (end repaired and polished) with T4 DNA polymerase, which removes 3′ extensions and fills in from 3′ recessed ends. The double-stranded cDNA carrying only the exon sequences encoding the eukaryotic protein can be cloned directly into a suitable vector by blunt-end ligation. 
Alternatively, chemically synthesized short double-stranded DNA adaptors that contain a restriction endonuclease recognition sequence can be ligated to the ends of the cDNA molecules, and then digested with the restriction endonuclease prior to insertion into a vector via sticky-end ligation.
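The priming step described above can be illustrated with a small sketch. It only models base-pairing: the first cDNA strand is the reverse complement of the mRNA, starting from wherever the primer anneals. The function name and the toy sequences are hypothetical, and real reverse transcription involves enzyme processivity and secondary structure that are ignored here.

```python
# Watson-Crick partners for an RNA template copied into DNA.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str, primer: str) -> str:
    """Return the first cDNA strand (5'->3') primed by `primer`.

    mrna   : RNA sequence written 5'->3' (A, C, G, U)
    primer : DNA oligonucleotide written 5'->3' (A, C, G, T)
    """
    # Full-length DNA complement of the mRNA, written 5'->3'.
    full_cdna = "".join(RNA_TO_DNA[base] for base in reversed(mrna))
    # The primer anneals only if it matches part of that complement.
    start = full_cdna.find(primer)
    if start == -1:
        raise ValueError("primer is not complementary to the mRNA")
    # Synthesis extends from the primer's 3' hydroxyl toward the mRNA 5' end.
    return full_cdna[start:]


toy_mrna = "AUGGCCUUUAAAAAA"  # hypothetical message with a short poly(A) tail
oligo_dt_product = first_strand_cdna(toy_mrna, "TTTTTT")      # oligo(dT) primer
gene_specific_product = first_strand_cdna(toy_mrna, "AAAGG")  # internal primer
```

With the oligo(dT) primer the whole message is copied, whereas the gene-specific primer yields only the region between its annealing site and the 5′ end of the mRNA.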
Figure 2.11 Synthesis of double-stranded cDNA using gene-specific primers (A) or oligo(dT) primers (B). A short oligonucleotide primer is added to a mixture of purified mRNA and anneals to a complementary sequence on the mRNA. Reverse transcriptase catalyzes the synthesis of a DNA strand from the primer using the mRNA as a template. To synthesize the second strand of DNA, the mRNA is nicked by RNase H, which creates initiation sites for E. coli DNA polymerase I. The 5′ exonuclease activity of DNA polymerase I removes RNA sequences that are encountered as DNA synthesis proceeds. The ends of the cDNA are blunted using T4 DNA polymerase prior to cloning.
When the sequence of the target mRNA intended for cloning is not known or when several target mRNAs in a single sample are of interest, cDNA can be generated from all of the mRNAs using an oligo(dT) primer rather than a gene-specific primer (Fig. 2.11B). The mixture of cDNAs, ideally representing all of the mRNAs produced by the cell, is cloned into a vector to create a cDNA library that can be screened for the target sequence(s) (described below).
Recombinational Cloning
Recombinational cloning is a rapid and versatile system for cloning sequences without restriction endonuclease and ligation reactions. It is particularly useful when a large number of DNA fragments are to be cloned into one type of vector, for example, to introduce protein coding sequences into an expression vector for the production and purification of thousands of different proteins in parallel to facilitate the creation of a proteomic microarray (described later in this chapter). One method, known as Gateway cloning technology, exploits the mechanism used by bacteriophage λ to integrate viral DNA into the host bacterial genome during infection. Bacteriophage λ integrates into the E. coli chromosome at a specific sequence (25 bp) in the bacterial genome known as the attachment bacteria (attB) site. The bacteriophage genome has a corresponding attachment phage (attP) sequence (243 bp) that can recombine with the bacterial attB sequence with the help of the bacteriophage λ recombination protein integrase and an E. coli-encoded protein called integration host factor (Fig. 2.12A). Recombination between the attP and attB sequences results in insertion of the phage genome into the bacterial genome to create a prophage with attachment sites attL (100 bp) and attR (168 bp) at the left and right ends of the integrated bacteriophage λ DNA, respectively. For subsequent excision of the bacteriophage λ DNA from the bacterial chromosome, recombination between the attL and attR sites is mediated by integration host factor, integrase, and bacteriophage λ excisionase (Fig. 2.12B). The recombination events occur at precise locations without either the loss or gain of nucleotides.
Figure 2.12 Integration (A) and excision (B) of bacteriophage λ into and from the E. coli genome via recombination between attachment (att) sites in the bacterial and bacteriophage DNA.
For recombinational cloning, a modified attB sequence is added to each end of the target DNA. The attB sequences are modified so that they will only recombine with specific attP sequences. For example, attB1 recombines only with attP1, and attB2 recombines with attP2. The target DNA with flanking attB1 and attB2 sequences is mixed with a vector (donor vector) that has attP1 and attP2 sites flanking a toxin gene that will be used for negative selection following transformation into a host cell (Fig. 2.13A). Integrase and integration host factor are added to the mixture of DNA molecules to catalyze in vitro recombination between the attB1 and attP1 sites and between the attB2 and attP2 sites. As a consequence of the two recombination events, the toxin gene sequence between the attP1 and attP2 sites on the donor vector is replaced by the target gene. The recombination events create new attachment sites flanking the target gene sequence (designated attL1 and attL2), and the plasmid with the attL1-target gene-attL2 sequence is referred to as an entry clone. The mixture of original and recombinant DNA molecules is transformed into E. coli, and cells that are transformed with donor vectors that have not undergone recombination retain the toxin gene and therefore do not survive. Host cells carrying the entry clone are positively selected by the presence of a selectable marker.
Figure 2.13 Recombinational cloning. (A) Recombination (thin vertical lines) between a target gene with flanking attachment sites (attB1 and attB2) and a donor vector with attP1 and attP2 sites on either side of a toxin gene results in an entry clone where the target gene is flanked by attL1 and attL2 sites. The selectable marker (SM1) enables selection of cells transformed with an entry clone. The protein encoded by the toxin gene kills cells transformed with nonrecombined donor vectors. The origin of replication of the donor vector is not shown. (B) Recombination between the entry clone with flanking attL1 and attL2 sites and a destination vector with attR1 and attR2 sites results in an expression clone with attB1 and attB2 sites flanking the target gene. The selectable marker (SM2) enables selection of transformed cells with an expression clone. The second plasmid, designated as a by-product, has the toxin gene flanked by attP1 and attP2 sites. Cells with an intact destination vector that did not undergo recombination or that retain the by-product plasmid are killed by the toxin. Transformed cells with an entry clone, which lacks the SM2 selectable marker, are selected against. The origins of replication and the sequences for expression of the target gene are not shown.
The advantage of this procedure is the ability to easily transfer the target gene to a variety of vectors that have been developed for different purposes. For example, to produce high levels of the protein encoded on the cloned gene, the target DNA can be transferred to a destination vector that carries a promoter and other expression signals. An entry clone is mixed with a destination vector that has attR1 and attR2 sites flanking a toxin gene (Fig. 2.13B). In the presence of integration host factor, integrase, and bacteriophage λ excisionase, the attL1 and attL2 sites on the entry clone recombine with the attR1 and attR2 sites, respectively, on the destination vector. This results in the replacement of the toxin gene on the destination vector with the target gene from the entry clone, and the resultant plasmid is designated an expression clone. The reaction mixture is transformed into E. coli, and a selectable marker is used to isolate transformed cells that carry an expression clone. Cells that carry an intact destination vector or the exchanged entry plasmid (known as a by-product plasmid) will not survive, because these carry the toxin gene. Destination vectors are available for maintenance and expression of the target gene in various host cells such as E. coli and yeast, insect, and mammalian cells.
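The bookkeeping of attachment sites across the two recombination reactions can be followed with a small sketch. It tracks only site names and the sequence they flank, not DNA sequences or selection. The class and function names are hypothetical, and the attR sites assigned to the by-product of the first reaction are an assumption based on the general behavior of the λ system; the text above names only the attL sites of the entry clone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cassette:
    """A sequence flanked by two attachment sites (names only)."""
    left_site: str
    payload: str
    right_site: str


# (site on target, site on vector) -> (site kept with target, site on by-product)
SITE_PRODUCTS = {
    ("attB", "attP"): ("attL", "attR"),  # target DNA + donor vector -> entry clone
    ("attL", "attR"): ("attB", "attP"),  # entry clone + destination vector -> expression clone
}


def split_site(site: str):
    """Split 'attB1' into its family ('attB') and specificity number ('1')."""
    return site[:4], site[4:]


def recombine(target: Cassette, vector: Cassette):
    """Swap the payloads of two cassettes if their att sites are compatible."""
    kept_sites, released_sites = [], []
    pairs = ((target.left_site, vector.left_site),
             (target.right_site, vector.right_site))
    for t_site, v_site in pairs:
        t_kind, t_num = split_site(t_site)
        v_kind, v_num = split_site(v_site)
        # Sites recombine only within a matching family pair and number.
        if t_num != v_num or (t_kind, v_kind) not in SITE_PRODUCTS:
            raise ValueError(f"{t_site} does not recombine with {v_site}")
        kept, released = SITE_PRODUCTS[(t_kind, v_kind)]
        kept_sites.append(kept + t_num)
        released_sites.append(released + t_num)
    product = Cassette(kept_sites[0], target.payload, kept_sites[1])
    byproduct = Cassette(released_sites[0], vector.payload, released_sites[1])
    return product, byproduct


target_dna = Cassette("attB1", "target gene", "attB2")
donor_vector = Cassette("attP1", "toxin gene", "attP2")
entry_clone, _ = recombine(target_dna, donor_vector)

destination_vector = Cassette("attR1", "toxin gene", "attR2")
expression_clone, byproduct = recombine(entry_clone, destination_vector)
```

In this sketch the target gene ends up flanked by attB1 and attB2 again in the expression clone, and the toxin gene leaves on a by-product carrying attP1 and attP2, which matches the outcome described for Fig. 2.13B.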
Genomic Libraries
A genomic library is a collection of DNA fragments, each cloned into a vector, that represents the entire genomic DNA, or cDNA derived from the total mRNA, in a sample. For example, the genomic library may contain fragments of the entire genome extracted from cells in a pure culture of bacteria or from tissue from a plant or animal. A genomic library can also contain the genomes of all of the organisms present in a complex sample such as from the microbial community on human tissue. Such libraries are known as metagenomic libraries. Whole-genome libraries may be used to identify genes that contain specific sequences, encode particular functions, or interact with other molecules.
To create a genomic library, the DNA extracted from the cells (cell cultures or tissues) of a source organism (or a community of organisms for a metagenomic library) is first digested with a restriction endonuclease. Often a restriction endonuclease that recognizes a sequence of four nucleotides, such as Sau3AI, is used. Although four-cutters will theoretically cleave the DNA approximately once in every 256 bp, the reaction conditions are set to give a partial, not a complete, digestion to generate fragments of all possible sizes (Fig. 2.14). This is achieved by using either a low concentration of restriction endonuclease or shortened incubation times, and usually some optimization of these parameters is required to determine the conditions that yield fragments of suitable size. The range of fragment sizes depends on the goal of the experiment. For example, for genome sequencing, large (100- to 200-kb) fragments are often desirable. To identify genes that encode a particular enzymatic function, that is, genes that are expressed to produce proteins in the size range of an average protein, smaller (∼3- to 40-kb) fragments are cloned.
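The figure of one site per 256 bp follows from 4⁴ = 256, assuming the four nucleotides are equally frequent and independent. The sketch below computes that expectation and mimics a partial digestion by cutting each recognition site with a chosen probability. The function names, the toy sequence, and the cut probability are illustrative assumptions; Sau3AI is taken to cut immediately 5′ of its GATC recognition sequence.

```python
import random


def expected_site_spacing(site_length: int) -> int:
    """Average distance between sites for a recognition sequence of this
    length, assuming equal and independent base frequencies."""
    return 4 ** site_length


def partial_digest(seq, site="GATC", cut_probability=0.2, rng=None):
    """Cut `seq` just 5' of each `site` with the given probability.

    cut_probability=1.0 models a complete digestion; lower values leave
    some sites uncut, giving longer, overlapping-in-principle fragments.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        # Skip a site at the very start so no empty fragment is produced.
        if pos > start and rng.random() < cut_probability:
            fragments.append(seq[start:pos])
            start = pos
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


toy_dna = "AAGATCTTTTGATCCC"  # hypothetical sequence with two GATC sites
complete = partial_digest(toy_dna, cut_probability=1.0)
undigested = partial_digest(toy_dna, cut_probability=0.0)
```

Lowering `cut_probability` plays the role of reducing the enzyme concentration or the incubation time: fewer sites are cleaved and the average fragment length grows.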
Figure 2.14 Construction of a genomic DNA library. Genomic DNA extracted from cells or tissues is partially digested with a restriction endonuclease. Conditions are set so that the enzyme does not cleave at all possible sites. This generates overlapping DNA fragments of various lengths that are cloned into a vector.
The number of clones in a genomic library depends on the size of the genome of the organism, the average size of the insert in the vector, and the average number of times each sequence is represented in the library (sequence coverage). To ensure that the entire genome, or most of it, is contained within the clones of a library, the sum of the inserted DNA in all of the clones of the library should be at least three times the amount of DNA in the genome. For example, the size of the E. coli genome is approximately 4.6 × 10⁶ bp; if inserts of an average size of 1,000 bp are desired, then 13,800 clones are required for threefold coverage (i.e., 3[(4.6 × 10⁶)/10³]). For the human genome, which contains 3.3 × 10⁹ bp, about 88,000 clones with an average insert size of 150,000 bp are required for fourfold coverage (i.e., 4[(3.3 × 10⁹)/(15 × 10⁴)]). Statistically, the number of clones required for a comprehensive genomic library can be estimated from the relationship N = ln(1 − P)/ln(1 − f), where N is the number of clones, P is the probability of finding a specific gene, and f is the ratio of the length of the average insert to the size of the entire genome. On this basis, about 700,000 clones are required for a 99% chance of discovering a particular sequence in a human genomic library with an average insert size of 20 kb.
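Both estimates in this paragraph are easy to reproduce. The sketch below evaluates the coverage expression and the relationship N = ln(1 − P)/ln(1 − f); the function names are hypothetical, and the genome sizes are the ones quoted in the text.

```python
import math


def clones_for_coverage(genome_bp: int, insert_bp: int, coverage: float) -> int:
    """Clones needed so that total insert DNA equals `coverage` genomes."""
    return math.ceil(coverage * genome_bp / insert_bp)


def clones_for_probability(genome_bp: int, insert_bp: int, probability: float) -> int:
    """N = ln(1 - P) / ln(1 - f), with f = insert size / genome size."""
    f = insert_bp / genome_bp
    # log1p(-x) computes ln(1 - x) accurately when x is very small.
    return math.ceil(math.log1p(-probability) / math.log1p(-f))


ecoli_clones = clones_for_coverage(4_600_000, 1_000, 3)
human_clones = clones_for_coverage(3_300_000_000, 150_000, 4)
human_99 = clones_for_probability(3_300_000_000, 20_000, 0.99)
```

Evaluated this way, threefold coverage of E. coli needs 13,800 clones and fourfold coverage of the human genome with 150-kb inserts needs 88,000. The statistical estimate for 20-kb inserts at P = 0.99 comes out near 760,000 for a 3.3 × 10⁹ bp genome (about 690,000 if the genome is rounded to 3 × 10⁹ bp), the same order of magnitude as the figure quoted above.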
Several strategies can be used to identify target DNA in a genomic library. Genomic or metagenomic libraries can be screened to identify members of the library that carry a gene encoding a particular protein function. In functional complementation, the host cell does not have the protein activity of interest, in some cases because the host gene encoding the protein carries a mutation that abolishes the activity of the protein. A DNA library is constructed that carries fragments of genomic DNA from an organism that has the desired protein activity. Host cells with the genetic deficiency are transformed with plasmids of the DNA library, and transformed cells that have restored normal protein function are selected. The genomic DNA that is used to prepare the library can be from a variety of donor organisms, such as the wild-type strain of the host bacterium that carries a functional copy of the gene encoding the protein, a different organism that can be either another prokaryote or perhaps a eukaryote, or uncultured organisms that are present in an environmental sample.
Many genes encoding enzymes that catalyze specific reactions have been isolated from a variety of organisms by plating the cells of a genomic library on medium supplemented with a specific substrate. For efficient screening of thousands of clones, colonies that carry a cloned gene encoding a functional catabolic enzyme must be readily identifiable, often by production of a colored product or a zone of substrate clearance around the colony. Alternatively, genes may be identified that confer desired properties on host cells. In one example, genes encoding proteins that confer resistance to salt were isolated from a metagenomic library of DNA from microbes present in a hypersaline environment (Fig. 2.15). A library of 192,000 clones with an average insert size of 3 kb representing 1.2 × 10⁹ bp of metagenomic DNA was constructed in an osmosensitive strain of E. coli from DNA extracted from bacteria collected from the saline soil around the roots of the halophyte Arthrocnemum macrostachyum. The library was screened by plating the transformed E. coli cells on media containing 3% sodium chloride, which is normally lethal for the host E. coli strain. Eight different genes that conferred salt resistance were identified. Based on sequence similarity to known genes, these are predicted to encode enzymes involved in nucleic acid structure and repair, an outer membrane protein, a glycerol permease, a proton pump, and two hypothetical proteins.
Figure 2.15 Isolation of genes involved in salt resistance from a metagenomic DNA library of halophilic (thrive in high salt concentrations) bacteria. Genomic DNA extracted from bacteria found in the soil around the roots of a plant (rhizosphere) growing in a hypersaline soil was fragmented and cloned into a plasmid to generate a metagenomic library. E. coli host cells carrying the cloned DNA were plated on a solid medium containing 3% sodium chloride. The E. coli strain normally does not grow on 3% sodium chloride. The cloned DNA fragments from colonies that developed on the plates were sequenced to identify genes that confer resistance to salt. Mirete et al., Front. Microbiol. 6:1121, 2015.
The presence of particular proteins produced by a genomic library can also be detected using an immunological assay. Rather than screening for the function of a protein, the library is screened using an antibody that specifically binds to the protein encoded by a target gene. The colonies are arrayed on a solid medium, transferred to a matrix, and then lysed to release the cellular proteins (Fig. 2.16). An antibody (primary antibody) is applied that specifically binds to the target protein on the matrix. Following the interaction of the primary antibody with the target protein, any unbound antibody is washed away, and the matrix is treated with a second antibody (secondary antibody) that is specific for the primary antibody. The secondary antibody is attached to an enzyme, such as alkaline phosphatase, that converts a colorless substrate to a colored or light-emitting (chemiluminescent) product that can readily identify positive interactions.
Figure 2.16 Screening of a genomic DNA library using an immunological assay. Transformed cells are plated onto solid agar medium under conditions that permit transformed but not nontransformed cells to grow. (1) From the discrete colonies formed on this master plate, a sample from each colony is transferred to a solid matrix such as a nylon membrane. (2) The cells on the matrix are lysed, and their proteins are bound to the matrix. (3) The matrix is treated with a primary antibody that binds only to the target protein. (4) Unbound primary antibody is washed away, and the matrix is treated with a secondary antibody that binds only to the primary antibody. (5) Any unbound secondary antibody is washed away, and a colorimetric (or chemiluminescent) reaction is carried out. The reaction can occur only if the secondary antibody, which is attached to an enzyme (E) that performs the reaction, is present. (6) A colony on the master plate that corresponds to a positive response on the matrix is identified. Cells from the positive colony on the master plate are subcultured because they may carry the plasmid–insert DNA construct that encodes the protein that binds the primary antibody.
Genome Engineering Using CRISPR Technology
Recently, researchers have designed strategies to insert, replace, or disrupt sequences at targeted sites in intact genomes in vivo. The method is based on a prokaryotic system that protects bacteria against invasion by foreign DNA such as bacteriophage genomic DNA and plasmids. This is a type of bacterial adaptive immune system that consists of genomic clustered regularly interspaced short palindromic repeats (CRISPR) containing fragments of foreign DNA molecules that the bacterium was previously exposed to and CRISPR-associated (Cas) proteins, including an endonuclease that cleaves homologous foreign DNA upon subsequent exposures. The CRISPR-Cas system has been adapted to introduce or replace genes in the genomes of a variety of organisms, both prokaryotes and eukaryotes, and also to edit genomes, that is, remove or alter targeted nucleotides (chapters 6 and 12).
In the natural bacterial CRISPR-Cas systems, short sequences (“protospacers”) from an invading DNA molecule are incorporated as “spacers” between repeat sequences in the CRISPR locus of the bacterial genome (Fig. 2.17A). Thus, the CRISPR locus contains an array of spacers separated by repeat sequences that are a record of past foreign DNA invasions from which the bacterium survived. When the bacterium is subsequently invaded by a virus or plasmid whose DNA contains a sequence that is homologous to a spacer sequence, the spacer DNA is transcribed, producing a CRISPR RNA (crRNA) molecule that binds to and guides the Cas endonuclease complex to the target sequence in the invading DNA, which is cleaved (Fig. 2.17B). Recognition of the target sequence on the invading DNA requires that it is adjacent to a short specific sequence known as a protospacer adjacent motif (PAM). For example, the Streptococcus pyogenes endonuclease Cas9 recognizes a target sequence that is complementary to a crRNA only if it is immediately upstream of the motif NGG (where N is any nucleotide). PAMs are also important for selection of protospacers during spacer acquisition. The PAM requirement prevents cleavage of the bacterium’s own genome at sequences that are complementary to the crRNA, including the site in the CRISPR array from which the crRNA was transcribed, which lacks a PAM.
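The PAM requirement for S. pyogenes Cas9 can be expressed as a simple sequence scan: a candidate target is a stretch of nucleotides followed immediately by NGG. The sketch below searches one strand only and uses a 20-nucleotide target length as an assumed default; the function name and the toy sequence are hypothetical, and a practical design tool would also scan the complementary strand and check for off-target matches.

```python
import re


def find_cas9_targets(dna: str, guide_len: int = 20):
    """Return (start, target, pam) for each site followed by an NGG PAM.

    Only the strand given in `dna` (written 5'->3') is examined.
    """
    # A lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


toy_genome = "TT" + "ACGT" * 5 + "TGG" + "AA"  # hypothetical 27-nt sequence
targets = find_cas9_targets(toy_genome)
```

For the toy sequence the scan reports a single candidate starting at position 2, because only one GG dinucleotide is present and it lies directly after a full-length target.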
Figure 2.17 Bacterial CRISPR-Cas system for protection against invading bacteriophage. (A) Fragments of bacteriophage DNA (protospacer) are incorporated into the host bacterial genome as spacers between repeat sequences (gray) in the CRISPR array. (B) On subsequent invasion, the spacer DNA is transcribed to produce CRISPR RNA (crRNA) that guides an endonuclease (Cas) to a sequence in the invading DNA that is homologous to the spacer sequence and is adjacent to a protospacer adjacent motif (PAM). The viral genome is cleaved. Adapted by permission from Macmillan Publishers Ltd. from Yosef and Qimron, Nature 519:166–167, 2015.
Because of its relative simplicity compared to systems in other bacteria, the CRISPR-Cas system from S. pyogenes has been adapted for use as a genome engineering tool. In the natural S. pyogenes system, two RNA molecules, crRNA and transactivating crRNA (tracrRNA), form a crRNA:tracrRNA hybrid that directs the Cas9 endonuclease to the target site. For ease of use in genome engineering, the two RNAs are combined into a single guide RNA (sgRNA) that is 80 to 100 nucleotides long. The sgRNA is designed to include a 20-nucleotide sequence that is complementary to the target site (which is located adjacent to a PAM), and the fused crRNA:tracrRNA sequence that forms a stem loop structure involved in endonuclease binding (Fig. 2.18A). Following binding to the target sequence, the endonuclease makes a double-stranded break in the target DNA (Fig. 2.18B). This damage activates the cellular systems for DNA repair either by homologous recombination, in which DNA sequences with sufficient similarity are exchanged, or nonhomologous end joining, in which sequences are deleted or inserted. The repair systems can be harnessed to disrupt, insert, or replace a DNA sequence at a targeted site.
Figure 2.18 CRISPR-Cas system for genome editing. (A) An 80- to 100- nucleotide long single guide RNA (sgRNA) is constructed that contains a 20-nucleotide guide sequence (orange) that is complementary to the target site. The secondary structure, stabilized by intramolecular base-pairing between regions of the fused crRNA and tracrRNA sequences, is required for binding to the Cas9 endonuclease. (B) The sgRNA guides Cas9 to the target sequence (blue) in the genome. Target recognition requires an adjacent PAM sequence (red) NGG and complementarity between the guide sequence and the target sequence. Cas9 makes a double-stranded break in the target DNA (arrows) which is repaired by homologous recombination or nonhomologous end joining. The repair systems generate deletions and insertions at the target site.
Insertion of a DNA sequence (donor sequence) into a target site requires introduction of the sequences for the sgRNA, the Cas9 endonuclease, and the donor DNA into a recipient cell. The sgRNA and Cas9 coding sequences may be introduced on a vector (Fig. 2.19A), or the sgRNA and Cas9 mRNA may be directly injected along with the donor DNA. When the genes are introduced on a vector, the promoters that drive expression of the sgRNA and endonuclease, and the coding sequence (e.g., codon usage) for the endonuclease, are optimized for expression in the chosen host. The donor DNA sequence is flanked by sequences that are homologous to the target genomic site for insertion by homologous recombination (Fig. 2.19B). The vector is introduced into the recipient cell and, following expression of the sgRNA and the endonuclease, the recipient cell genome is cleaved at the target site. Activation of recombinases that mediate DNA repair results in recombination between homologous sequences on the vector and in the recipient genome, and thereby, insertion of the donor DNA into the genome at the target site (Fig. 2.19B).
Figure 2.19 Vector for production of sgRNA and Cas9 in host cells (A). The gene encoding sgRNA contains a 20-nucleotide sequence (hatched region) that is complementary to the target site in the host genome. Promoters (arrows) for the sgRNA and Cas9 genes, and codon usage for Cas9, must be suitable for expression in host cells. An origin of replication (ori) and a selectable marker (e.g., bla encoding β-lactamase, which confers resistance to the antibiotic ampicillin) are included for initial vector construction in E. coli. The vector and donor DNA are introduced into a recipient cell. Following expression, the sgRNA guides the Cas9 endonuclease to the target sequence in the recipient cell chromosome and the endonuclease makes a double-stranded break in the target DNA. (B) The donor DNA sequence (green) is flanked by regions that are homologous to the target site (grey) for insertion by homologous recombination. Therefore, activation of recombinases that mediate DNA repair results in recombination between homologous sequences on the vector and in the recipient chromosome, and thereby, insertion of the donor DNA into the genome at the target site.
Polymerase Chain Reaction
The polymerase chain reaction (PCR) is a simple, efficient procedure for synthesizing large quantities of a specific DNA sequence in vitro (see Milestone box on page 36). The reaction exploits the mechanism used by living cells to accurately replicate a DNA template. PCR can be used to produce millions of copies from a single template molecule and to detect a specific sequence in a complex mixture of DNA.
Milestone: Specific Enzymatic Amplification of DNA In Vitro: The Polymerase Chain Reaction
PCR, which is the invention of Kary Mullis (U.S. patent 4,683,202), has had a tremendous impact on many research areas, including molecular biotechnology. The power of the method is in its simplicity, sensitivity, and specificity. It utilizes a mechanism similar to that used by our cells to accurately replicate a DNA template, it can detect and produce millions of copies from a single template molecule in a few hours, and, under appropriate conditions, it can amplify a specific sequence in a complex mixture of DNA molecules even when other similar sequences are present.
PCR was a unique idea that did not replace any existing technology. In the early 1980s, Kary Mullis was trying to solve the problem of using synthetic oligonucleotides to detect single nucleotide mutations in sequences that were present in low concentration. He needed a method to increase the concentration of the target sequence. He reasoned that if he mixed heat denatured DNA with two oligonucleotides that bound to opposite strands of the DNA at an arbitrary distance from each other and added some DNA polymerase and deoxynucleoside triphosphates, the polymerase would add the deoxynucleoside triphosphates to the hybridized oligonucleotides. The reaction did not yield the expected products. Mullis then heated the reaction products to separate the extended oligonucleotides from the template DNA and then repeated the process with fresh polymerase, hypothesizing that after each cycle the number of molecules carrying the specific sequence between the primers would double. Despite the skepticism of his colleague, Mullis proved that his reasoning was correct, albeit the hard way. By manually cycling the reaction through temperatures required to denature the DNA and anneal and extend the oligonucleotides, each time adding a fresh aliquot of a DNA polymerase isolated from E. coli, he was able to synthesize unprecedented amounts of target DNA (Mullis et al., Cold Spring Harbor Symp. Quant. Biol. 51:263–273, 1986). Thermostable DNA polymerases that obviate the need to add fresh polymerase after each denaturation step and automated cycling have since made PCR a routine and indispensable laboratory procedure.
The capability of generating large amounts of DNA by amplification from segments of cloned or genomic DNA has facilitated the cloning of DNA versions of rare mRNA molecules, screening gene libraries, diagnostic testing for gene mutations, sequencing of genomes, and a myriad of other applications. In fact, the first study using PCR described a diagnostic test for sickle-cell anemia (Saiki et al., Science 230:1350–1354, 1985). Mullis received the Nobel Prize in Chemistry for his work on PCR in 1993.
Amplification of DNA Using PCR
The essential components for PCR amplification are (i) a template sequence in a DNA sample that is targeted for amplification and is from 100 to 3,000 bp in length (larger regions can also be amplified, but with reduced efficiency); (ii) two synthetic oligonucleotide primers (∼20 nucleotides each) that are complementary to regions on opposite strands that flank the target DNA sequence and that, after annealing to the sample DNA, have their 3′ hydroxyl ends oriented toward each other; (iii) a thermostable DNA polymerase that remains active after repeated heating to 95°C or higher and copies the DNA template with high fidelity; (iv) the four deoxyribonucleotides; and (v) a reaction buffer that provides optimal pH and osmotic conditions, and cofactors (e.g., magnesium) required for DNA polymerase activity.
Replication of a specific DNA sequence by PCR requires three successive steps as outlined below. Amplification is achieved by repeating the three-step cycle 25 to 40 times. All steps in a PCR cycle are carried out in an automated block heater that is programmed to change temperatures after a specified period of time.
1. Denaturation. The first step in a PCR is the thermal denaturation of the double-stranded DNA template to separate the strands. This is achieved by raising the temperature of the reaction mixture to 95°C. The reaction mixture comprises the sample DNA that contains the target DNA to be amplified, a vast molar excess of the two oligonucleotide primers, a thermostable DNA polymerase (e.g., Taq DNA polymerase, isolated from the bacterium Thermus aquaticus), the four deoxyribonucleotides, and the reaction buffer.
2. Annealing. For the second step, the mixture is slowly cooled. During this step, the primers base-pair, or anneal, with their complementary sequences in the DNA template. The temperature at which this step of the reaction is performed is determined by the nucleotide sequence of the primers, which form hydrogen bonds with complementary nucleotides in the target DNA. Typical annealing temperatures are in the range of 45 to 68°C, although optimization is often required to achieve the desired outcome, that is, a product consisting of fragments of target DNA sequence only.
3. Extension. In the third step, the temperature is raised to ∼70°C, which is optimal for the catalytic activity of Taq DNA polymerase. DNA synthesis is initiated at the 3′ hydroxyl end of each annealed primer, and nucleotides are added to extend the complementary strand using the sample DNA as a template.
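Because the annealing temperature depends on primer sequence, a first estimate is often made from base composition. The sketch below uses the Wallace rule (2°C for each A or T, 4°C for each G or C), a common rule of thumb for short oligonucleotides that is not taken from this chapter, and subtracts an assumed 5°C offset to suggest a starting annealing temperature. The function names, the offset, and the example primers are hypothetical, and the result is only a starting point for the optimization mentioned above.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C): 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def cycle_program(primers, cycles=30, offset=5):
    """Three-step cycle with the annealing step set below the lower primer Tm."""
    anneal = min(wallace_tm(p) for p in primers) - offset
    steps = [
        ("denaturation", 95),  # temperature given in the text
        ("annealing", anneal),  # estimated from primer composition
        ("extension", 70),     # approximate optimum for Taq DNA polymerase
    ]
    return {"cycles": cycles, "steps": steps}


forward = "ATGCATGCATGCATGCATGC"  # hypothetical 20-nt primer, 50% G+C
reverse = "GGCCATGCATGCATGCATGC"  # hypothetical 20-nt primer, 60% G+C
program = cycle_program([forward, reverse])
```

The program anneals at the temperature suggested by the primer with the lower estimated melting temperature, so that both primers can bind during the annealing step.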
To understand how the PCR protocol succeeds in amplifying a discrete segment of DNA, it is important to keep in mind the location of each primer annealing site and its complementary sequence within the strands that are synthesized during each cycle. During the extension phase of the first cycle, the newly synthesized DNA from each primer is extended beyond the endpoint of the sequence that is complementary to the second primer. These new strands form “long templates” that are used in the second cycle (Fig. 2.20).
Figure 2.20 PCR. During a PCR cycle, the template DNA is denatured by heating and then slowly cooled to enable two primers (P1 and P2) to anneal to complementary (black) bases flanking the target DNA. The temperature is raised to about 70°C, and in the presence of the four deoxyribonucleotides, Taq DNA polymerase catalyzes the synthesis of a DNA strand extending from the 3′ hydroxyl end of each primer. In the first PCR cycle, DNA synthesis continues past the region of the template DNA strand that is complementary to the other primer sequence. The products of this reaction are two long strands of DNA that serve as templates for DNA synthesis during the second PCR cycle. In the second cycle, the primers hybridize to complementary regions in both the original strands and the long template strands, and DNA synthesis produces more long DNA strands from the original strands and short strands from the long template strands. A short template strand has a primer sequence at one end and the sequence complementary to the other primer at its other end. During the third PCR cycle, the primers hybridize to complementary regions of original, long template, and short template strands, and DNA synthesis produces long strands from the original strands and short strands from both long and short templates. By the end of the 30th PCR cycle, the products (amplicons) consist predominantly of short double-stranded DNA molecules that carry the target DNA sequence delineated by the primer sequences. Note that in the figure, newly synthesized strands are differentiated from template strands by a terminal arrow.
During the second cycle, the original sample DNA strands and the new strands synthesized in the first cycle (long templates) are denatured and then hybridized with the primers. The large molar excess of primers in the reaction mixture ensures that they will hybridize to the template DNA before complementary template strands have the chance to reanneal to each other. A second round of synthesis produces long templates from the original strands as well as some DNA strands that have a primer sequence at one end and a sequence complementary to the other primer at the other end (“short templates”) from the long templates (Fig. 2.20).
During the third cycle, short templates, long templates, and original strands all hybridize with the primers and are replicated (Fig. 2.20). In subsequent cycles, the short templates accumulate, and by the 30th cycle, these molecules, which are the desired PCR product, are about a million times more abundant than either the original or long template strands.
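The strand bookkeeping described above can be followed with a short calculation. The sketch below is only an idealized model: it assumes that every strand is copied once per cycle and that the reaction starts from a single double-stranded molecule (two original strands). The function name strand_counts is illustrative and does not come from any library.

```python
def strand_counts(cycles, original_strands=2):
    """Track original, long-template, and short-template strands per cycle.

    Idealized model: every strand present is copied exactly once per cycle.
    """
    original, long_t, short_t = original_strands, 0, 0
    history = []
    for _ in range(cycles):
        new_long = original            # copying an original strand gives a long template
        new_short = long_t + short_t   # copying a long or short template gives a short strand
        long_t += new_long
        short_t += new_short
        history.append((original, long_t, short_t))
    return history


counts = strand_counts(30)
for cycle in (1, 2, 3, 30):
    original, long_t, short_t = counts[cycle - 1]
    print(f"cycle {cycle}: original={original} long={long_t} short={short_t}")
```

In this model the long templates increase only linearly (two per cycle), whereas the short templates roughly double in each cycle, which is why they dominate the final product.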
The specificity, sensitivity, and simplicity of PCR have rendered it a powerful technique that is central to many applications in molecular biotechnology, as illustrated throughout this book. For example, it is used to obtain large amounts of insert DNA for cloning, to detect specific mutations that cause genetic disease, to confirm biological relatives, to identify individuals suspected of committing a crime, and to diagnose infectious diseases (see chapter 4). Specific viral, bacterial, or fungal pathogens can be detected in samples from infected patients containing complex microbial communities by utilizing PCR primers that anneal to a sequence that is uniquely present in the genome of the pathogen. This technique is often powerful enough to discriminate among very similar strains of the same species of pathogenic microorganisms, which can assist in epidemiological investigations.
Cloning PCR Products
PCR is commonly used to amplify target DNA for cloning into a vector. To facilitate the cloning process, restriction enzyme recognition sites are added to the 5′ end of each of the primers that are complementary to sequences that flank the target sequence in a genome (Fig. 2.21). This is especially useful when suitable restriction sites are not available in the regions flanking the target DNA. Although the end of the primer containing the restriction enzyme recognition site lacks complementarity and therefore does not anneal to the target sequence, it does not interfere with DNA synthesis. Base-pairing between the 20 or so complementary nucleotides at the 3′ end of the primer and the template molecule is sufficiently stable for primer extension by DNA polymerase. At the end of the first cycle of PCR, the noncomplementary regions of the primer remain single stranded in the otherwise double-stranded DNA product. However, after the extension step of the second cycle, the newly synthesized complementary strand extends to the 5′ end of the primer sequence on the template strand and therefore the PCR product contains a double-stranded restriction enzyme recognition site at one end (Fig. 2.21). Subsequent cycles yield DNA products with double-stranded restriction enzyme sites at both ends that can be cleaved to generate sticky ends for insertion into a vector. Alternatively, PCR products can be cloned using the single deoxyadenosine monophosphate (dAMP) that is added to the 3′ ends by Taq DNA polymerase, which lacks the proofreading activity of many DNA polymerases to correct mispaired bases. A variety of linearized vectors have been constructed that possess a single complementary 3′ deoxythymidine monophosphate (dTMP) overhang to facilitate cloning without using restriction enzymes (Fig. 2.22).
Figure 2.21 Addition of restriction enzyme recognition sites to PCR-amplified target DNA to facilitate cloning. Each of the two oligonucleotide primers (P1 and P2) has a sequence of approximately 20 nucleotides in the 3′ end that is complementary to a region flanking the target DNA (shown in black). The sequence at the 5′ end of each primer consists of a restriction endonuclease recognition site (shown in green) that does not base-pair with the template DNA during the annealing steps of the first and second PCR cycles. However, during the second cycle, the long DNA strands produced in the first cycle serve as templates for synthesis of short DNA strands (indicated by a terminal arrow) that include the restriction endonuclease recognition sequences at both ends. DNA synthesis during the third and subsequent PCR cycles produces double-stranded DNA molecules that carry the target DNA sequence flanked by restriction endonuclease recognition sequences. These linear PCR products can be cleaved with the restriction endonucleases to produce sticky ends for ligation with a vector. Note that not all of the DNA produced during each PCR cycle is shown.
Figure 2.22 Cloning of PCR products without using restriction endonucleases. Taq DNA polymerase adds a single dAMP (A) to the ends of PCR-amplified DNA molecules. These extensions can base-pair with complementary single dTMP (T) overhangs on a specially constructed linearized cloning vector. Ligation with T4 DNA ligase results in insertion of the PCR product into the vector.
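The design of primers that carry restriction enzyme recognition sites at their 5′ ends can be sketched as follows. This is a minimal illustration rather than a complete primer design tool: the template sequence is hypothetical, the annealing regions are simply taken from the two ends of the region to be amplified, and the function names are invented for this example. The recognition sequences used are those of EcoRI (GAATTC) and BamHI (GGATCC).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def tailed_primers(template, start, end, site_fwd, site_rev, anneal_len=20):
    """Build forward and reverse primers (5'->3') for template[start:end].

    The 3' part of each primer anneals to the template; the 5' part is a
    restriction site that does not base-pair in the first cycle.
    """
    fwd_anneal = template[start:start + anneal_len]
    rev_anneal = reverse_complement(template[end - anneal_len:end])
    return site_fwd + fwd_anneal, site_rev + rev_anneal


def expected_amplicon(template, start, end, site_fwd, site_rev):
    """Top strand of the final product, with a restriction site at each end."""
    return site_fwd + template[start:end] + reverse_complement(site_rev)


# Hypothetical template, used only for illustration.
template = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAAT"
ECORI, BAMHI = "GAATTC", "GGATCC"

fwd, rev = tailed_primers(template, 0, len(template), ECORI, BAMHI)
product = expected_amplicon(template, 0, len(template), ECORI, BAMHI)
print("forward primer:", fwd)
print("reverse primer:", rev)
print("amplicon      :", product)
```

Because both recognition sequences are palindromic, the site reads the same on either strand of the double-stranded product.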
Quantitative PCR
PCR protocols have been developed to quantify the number of target DNA molecules, or RNA molecules after conversion to cDNA, present in a sample. Quantitative PCR is based on the principle that under optimal conditions, the number of DNA molecules doubles after each cycle. Typically, the amount of DNA present after each PCR cycle is measured in real time as the amount of fluorescence emitted by a fluorescent dye bound to the double-stranded DNA product. Thus, the fluorescence intensity increases in proportion to the concentration of double-stranded DNA (Fig. 2.23).
Figure 2.23 The fluorescent dye SYBR green does not bind to single-stranded DNA (A), binds to double-stranded DNA as it is synthesized (B), and is bound to the double-stranded amplified DNA (C). Only the dye-bound DNA fluoresces.
A real-time PCR occurs in four phases (Fig. 2.24A). In the first, or linear, phase (generally about 10 to 15 cycles), fluorescence emission at each cycle has not yet risen above the background level and therefore cannot be quantified accurately. In the second, or early exponential, phase, a sufficient amount of double-stranded DNA has been produced to increase the amount of fluorescence above a threshold level that is significantly higher than the background. The cycle at which this occurs is known as the threshold cycle (CT). The CT value is inversely correlated with the amount of target DNA in the original sample. During the third, exponential phase, the amount of fluorescence continues to double as the DNA products of the reaction double in each cycle under ideal conditions. However, in the final, plateau phase, the reaction components become limited and measurements of the fluorescence intensity are no longer useful.
Figure 2.24 (A) Plot of normalized fluorescence (ΔRn) versus cycle number in a real-time PCR experiment. Four phases of PCR are shown: (1) a linear phase, where fluorescence emission is not yet above background level; (2) an early exponential phase, where the fluorescence intensity becomes significantly higher than the background (the cycle at which this occurs is generally known as CT); (3) an exponential phase, where the amount of product doubles in each cycle; and (4) a plateau phase, where reaction components are limited and amplification slows down. (B) Plot of CT versus the starting amount of a target nucleotide sequence. Fluorescence detection is linear over several orders of magnitude.
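The inverse relationship between CT and the starting amount of target DNA can be illustrated with a simple model. The sketch below assumes ideal doubling in every cycle and uses the number of product copies as a stand-in for fluorescence intensity; the threshold value and the function name threshold_cycle are arbitrary choices made for this example.

```python
def threshold_cycle(start_copies, threshold_copies, efficiency=1.0, max_cycles=40):
    """Return the first cycle at which the product reaches the threshold.

    efficiency=1.0 means perfect doubling; returns None if the threshold
    is not reached within max_cycles.
    """
    copies = float(start_copies)
    for cycle in range(1, max_cycles + 1):
        copies *= 1.0 + efficiency
        if copies >= threshold_copies:
            return cycle
    return None


# Hypothetical starting amounts and detection threshold.
for start in (100, 1_000, 10_000):
    print(start, "starting copies -> CT =", threshold_cycle(start, 1e9))
```

Samples with more starting copies cross the threshold at an earlier cycle, so a lower CT indicates a larger initial amount of target.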
To quantify the amount of target DNA in a test sample, a standard curve is first generated by serially diluting a control sample that contains a known number of copies of the target DNA. Assuming that all dilutions are amplified with equal efficiency, the CT value for each dilution is plotted against the known starting amount of DNA (Fig. 2.24B). The number of copies of the target DNA in a test sample can then be determined by obtaining the CT value for the test sample and reading the corresponding starting amount from the standard curve. Since the amount of DNA doubles with each cycle during the exponential phase, a sample that has four times the number of starting copies of the target sequence compared to another sample would require two fewer cycles of amplification to generate the same number of product strands. Often, a melt curve is generated to assess the specificity of the products, which denature at a characteristic temperature that is determined by their nucleotide sequence.
Among its many applications, quantitative real-time PCR has been used to monitor microorganisms that cause a range of infectious diseases. For example, it has been used to quantify Salmonella enterica contamination in food samples. Food samples (chicken and mung beans were tested) were rinsed with water or a saline solution, and the liquid was filtered to collect the bacterial cells. The bacterial cells were removed from the filter membrane, lysed, and subjected to real-time PCR. The entire procedure took only approximately 3 hours and was able to detect and quantify as few as 700 S. enterica cells per 100 ml of liquid.
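The standard curve calculation can be sketched as a straight-line fit of CT against the logarithm of the starting copy number. The dilution series and CT values below are hypothetical and were generated by assuming perfect doubling; real data would scatter around the line. The function names are illustrative only.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of CT = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Read the starting copy number for a measured CT off the fitted line."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical tenfold dilution series with idealized CT values.
standards = [1e2, 1e3, 1e4, 1e5, 1e6]
standard_cts = [38 - math.log2(c) for c in standards]

slope, intercept = fit_standard_curve(standards, standard_cts)
unknown_ct = 38 - math.log2(5000)   # CT of a hypothetical test sample
print("slope:", round(slope, 3))
print("estimated copies:", round(copies_from_ct(unknown_ct, slope, intercept)))
```

With perfect doubling, each tenfold dilution shifts CT by about 3.3 cycles, and a fourfold difference in starting copies corresponds to a difference of two cycles.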
Chemical Synthesis of DNA
The ability to chemically synthesize DNA with a specific sequence of nucleotides easily, inexpensively, and rapidly is essential for many of the methodologies of molecular biotechnology. Chemically synthesized, single-stranded DNA oligonucleotides (10−100 nucleotides) are used for amplifying specific DNA sequences by PCR, introducing mutations into cloned genes, sequencing DNA, and synthesizing whole genes and chromosomes (see box on next page).
box Synthetic Genomes
Chemically synthesized oligonucleotides have been assembled not only into genes but also into whole genomes. The first genome to be produced synthetically was the cDNA encoding the small (7,500 bp) single-stranded RNA genome of poliovirus (Cello et al., Science. 297:1016−18, 2002). The poliovirus genome sequence was known and facilitated the design of 70 nucleotide-long, single-stranded oligonucleotides. Overlapping complementarity at the termini of neighboring oligonucleotides enabled their assembly into 400 to 600 bp fragments. The fragments were ligated into three larger segments that were subsequently digested with a restriction endonuclease and cloned in the correct order and orientation into a plasmid. Expression of the full-length cDNA from a suitable promoter in HeLa cells resulted in production of viral RNA and proteins that were assembled into infectious poliovirus particles.
Construction of a much larger synthetic bacterial genome presented a greater challenge (Gibson et al., Science. 319:1215–20, 2008). The genome of the bacterium Mycoplasma genitalium was chosen because it has the smallest genome (a single chromosome of 580,076 bp) currently known for a free-living bacterium. The chromosome was initially produced in 101 segments of about 5 to 7 kb that were each assembled from synthetic oligonucleotides. The sequence of each segment overlapped its neighbor by approximately 80 bp and therefore, following brief treatment with an exonuclease to generate sticky ends, four neighboring segments were joined in vitro by complementary base-pairing at their termini and enzymatic repair of gaps. These 24-kb fragments were cloned into a bacterial artificial chromosome, which can carry large DNA inserts, and propagated in E. coli. This in vitro recombination method was repeated to assemble the fragments into successively larger pieces. Large segments carrying half- and full-genome sequences could not be cloned in E. coli and therefore quarter- and then half-genome fragments were finally assembled into a 582,970 bp sequence in the yeast Saccharomyces cerevisiae by in vivo recombination between overlapping homologous sequences.
These results demonstrated that a small bacterial genome can be constructed entirely from synthetic oligonucleotides. The next step was to show that a genome produced “from scratch” can direct the survival and growth of a living bacterium. The 1,077,947 bp genome of Mycoplasma mycoides was synthesized in a manner similar to that used to construct the M. genitalium genome (Gibson et al., Science 329:52–56, 2010); M. mycoides was chosen because it grows at a faster rate than the extremely slow growing M. genitalium. Briefly, overlapping synthetic oligonucleotides were assembled into 1,080 bp fragments. These fragments were recombined into 10-kb and then into 100-kb segments, and finally into a full-length genome by in vivo homologous recombination in yeast. The intact M. mycoides genome was extracted from yeast and transplanted into a related species Mycoplasma capricolum, replacing the recipient cell’s chromosome. Remarkably, the synthetic genome was self-replicating and controlled the functions of a living bacterium. The bacterium was able to grow logarithmically and exhibited cellular and colony morphologies similar to M. mycoides. “Watermark” sequences were inserted in four regions to enable differentiation of the synthetic and natural genomes.
A major motivation for producing a synthetic bacterium is to understand the minimal genetic requirements for life, that is, to identify the minimum set of essential genes that can support survival and reproduction of a cell. Because it takes less energy and fewer resources to maintain and propagate a small genome, more resources can be directed to the synthesis of high yields of useful products from cloned genes.
Synthesis of Oligonucleotides
Currently, the phosphoramidite method is the procedure of choice for chemical DNA synthesis. Solid-phase synthesis, in which the growing DNA strand is attached to a solid support, is used so that all the reactions can be conducted in one reaction vessel, the reagents from one reaction step can be readily washed away before the reagents for the next step are added, and the reagents can be used in excess in an attempt to drive the reactions to completion.
The chemical synthesis of DNA is a multistep process (Fig. 2.25). It does not follow the biological direction of DNA synthesis; rather, during the chemical process, each incoming nucleotide is coupled to the 5′ hydroxyl terminus of the growing chain. Before their introduction into the reaction column, the amino groups of the nucleotides’ nitrogenous bases adenine, guanine, and cytosine are derivatized by the addition of benzoyl, isobutyryl, and benzoyl groups, respectively, to prevent undesirable side reactions during polymerization. Thymine is not treated because it lacks an amino group. The initial nucleoside (base and sugar only), which will be the 3′-terminal nucleotide of the final synthesized strand, is attached to a spacer molecule by its 3′ hydroxyl terminus and the spacer molecule is covalently attached to an inert support, which is often a controlled pore glass (CPG) bead (a glass bead with uniformly sized pores) (Fig. 2.26). A dimethoxytrityl (DMT) group is attached to the 5′ end of the first nucleoside to prevent the 5′ hydroxyl group from reacting nonspecifically before the addition of the second nucleotide. Each nucleotide that is added to the growing chain has a 5′ DMT protective group and also a diisopropylamine group attached to a 3′ phosphite group that is protected by a β-cyanoethyl (CH2CH2CN) group (Fig. 2.27). This molecular assembly is called a phosphoramidite.
Figure 2.25 Flowchart for the chemical synthesis of DNA oligonucleotides. After n coupling reactions (cycles), a single-stranded piece of DNA with n + 1 nucleotides is produced.
Figure 2.26 Starting complex for the chemical synthesis of a DNA strand. The initial nucleoside has a DMT group attached to the 5′ hydroxyl group of the deoxyribose moiety and a spacer molecule attached to the hydroxyl group of the 3′ carbon of the deoxyribose. The spacer unit is attached to a solid support, which is usually a CPG bead.
Figure 2.27 Structure of a phosphoramidite. Phosphoramidites are available for each of the four bases (A, C, G, and T) that are used for the chemical synthesis of a DNA strand. A diisopropylamine group is attached to the 3′ phosphite group of the nucleoside. A β-cyanoethyl (CH2CH2CN) group protects the 3′ phosphite group, and a DMT group is bound to the 5′ hydroxyl group of the deoxyribose sugar.
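Because chemical synthesis proceeds from the 3′ end toward the 5′ end, the order in which phosphoramidites are delivered is the reverse of the sequence as it is normally written. The sketch below works out that order and lists the base-protecting groups named in the text; the function name synthesis_plan and the example sequence are hypothetical.

```python
# Amino-protecting groups for each base, as described in the text.
PROTECTING_GROUP = {"A": "benzoyl", "G": "isobutyryl", "C": "benzoyl", "T": None}


def synthesis_plan(sequence_5_to_3):
    """Return the support-bound nucleoside and the phosphoramidite addition order.

    The 3'-terminal nucleoside is attached to the support; each later base
    is coupled to the 5' hydroxyl of the growing chain.
    """
    seq = sequence_5_to_3.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, and T")
    support_bound = seq[-1]
    additions = list(reversed(seq[:-1]))
    return support_bound, additions


support_bound, additions = synthesis_plan("ATGC")
print("bound to support (3' end):", support_bound)
for step, base in enumerate(additions, start=1):
    group = PROTECTING_GROUP[base] or "none"
    print(f"coupling {step}: add {base} (base protection: {group})")
```

For an oligonucleotide of n + 1 nucleotides, the plan contains n coupling reactions, in agreement with Fig. 2.25.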
After the first nucleoside is bound to the CPG beads, the cycle begins. First, the reaction column is washed extensively with an anhydrous reagent (e.g., acetonitrile) to remove water and any nucleophiles that may be present. The column is flushed with argon to purge the acetonitrile. Next, the 5′ DMT group is removed from the attached nucleoside by treatment with trichloroacetic acid (TCA) to yield a reactive 5′ hydroxyl group (Fig. 2.28). After this detritylation step, the reaction column is washed with acetonitrile to remove the TCA and then with argon to remove the acetonitrile. The machine is programmed to introduce the next prescribed base (phosphoramidite) and tetrazole simultaneously for the activation and coupling steps. The tetrazole activates the phosphoramidite so that its 3′ phosphite forms a covalent bond with the 5′ hydroxyl group of the initial nucleoside (Fig. 2.29). Unincorporated phosphoramidite and tetrazole are removed by flushing the column with argon.
Figure 2.28 Detritylation. The 5′ DMT group is removed by treatment with TCA. In this example, the detritylation of the first nucleoside is depicted.
Figure 2.29 Activation and coupling. The activation of a phosphoramidite enables its 3′ phosphite group to attach to the 5′ hydroxyl group of the bound detritylated nucleoside.
Not all of the support-bound nucleosides are linked to a phosphoramidite during the first coupling reaction, and therefore, the unlinked residues must be prevented from linking to the next nucleotide during the following cycle. To do this, acetic anhydride and dimethylaminopyridine are added to acetylate the unreacted 5′ hydroxyl groups (Fig. 2.30). If this capping step is not carried out, then, after a number of cycles, the growing chains will differ in both length and nucleotide sequence.
Figure 2.30 Capping. The available 5′ hydroxyl groups of unreacted detritylated nucleosides are acetylated to prevent them from participating in the coupling reaction of the next cycle.
At this stage of the process, the linkage between the nucleotides is in the form of a phosphite triester bond, which is unstable and prone to breakage in the presence of either acid or base. Therefore, the phosphite triester is oxidized with an iodine mixture to form the more stable pentavalent phosphate triester (Fig. 2.31). After this oxidation step and a subsequent wash of the reaction column, the cycle of detritylation, phosphoramidite activation, coupling, capping, and oxidation is repeated (Fig. 2.25). This cycling continues with each successive phosphoramidite until the last programmed residue has been added to the growing chain. When the final cycle is completed, the newly synthesized DNA strands are bound to the CPG beads; each phosphate triester contains a β-cyanoethyl group; every guanine, cytosine, and adenine carries its amino-protecting group; and the 5′ terminus of the last nucleotide has a DMT group.
Figure 2.31 Oxidation. The phosphite triester internucleotide linkage is oxidized to the pentavalent phosphate triester. This reaction stabilizes the internucleotide linkage and makes it less susceptible to cleavage under either acidic or basic conditions.
The β-cyanoethyl groups are removed by a chemical treatment in the reaction column. The DNA strands are then cleaved from the spacer molecule leaving a 3′ hydroxyl terminus. The DNA is eluted from the reaction column, and, in succession, the benzoyl and isobutyryl groups are stripped away and the DNA is detritylated. The 5′ terminus of the DNA strand is phosphorylated either enzymatically with T4 polynucleotide kinase or by a chemical procedure. Phosphorylation can also be carried out after detritylation while the oligonucleotide is still bound to the support.
To achieve a reasonable overall yield of an oligonucleotide, the coupling efficiency should be greater than 98% at each step. The coupling efficiency of each cycle is determined by spectrophotometrically monitoring released trityl groups. If, for example, the efficiency is 99% at each cycle during the production of a 20-unit oligonucleotide (20-mer), which entails 19 coupling reactions since the first base is bound to the spacer and is not involved in a coupling step, then 83% (i.e., 0.99^19 × 100) of the product will be 20 nucleotides long. If a 60-mer is synthesized with 99% efficiency at each cycle, then about 55% of the final product will contain all of the 60 nucleotides. With an average coupling efficiency consistently less than 98%, the yield of full-length oligonucleotides diminishes as a function of the required number of cycles (Table 2.3). The coupling efficiency for most commercial DNA synthesizers averages about 99.5% for each step. However, depending on the length and stringency of the end use of an oligonucleotide, it may be necessary to purify the final product using either reverse-phase high-pressure liquid chromatography or gel electrophoresis. These methods separate the longer target oligonucleotides from the shorter “failure” sequences.
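The yield arithmetic above follows from raising the coupling efficiency to the number of coupling reactions, which is one less than the length of the oligonucleotide. The following sketch reproduces the two worked examples; the function name full_length_yield is illustrative, and the model assumes the same efficiency at every step.

```python
def full_length_yield(length, coupling_efficiency):
    """Fraction of chains that reach full length.

    An oligonucleotide of `length` nucleotides needs length - 1 couplings,
    because the first nucleoside is already attached to the support.
    """
    couplings = length - 1
    return coupling_efficiency ** couplings


for length in (20, 60):
    percent = 100 * full_length_yield(length, 0.99)
    print(f"{length}-mer at 99% per step: {percent:.0f}% full length")
```

The same function can be called with other efficiencies and lengths to see how quickly the full-length fraction falls as either the efficiency drops or the chain grows.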
Table 2.3 Overall yields of chemically synthesized oligonucleotides with different coupling efficiencies
Assembling Oligonucleotides into Genes
Oligonucleotides are the key components for assembling genes. There are a number of applications for synthetic genes including large-scale production of proteins, testing protein function after changing specific codons, and creating nucleotide sequences that encode proteins with novel properties. The production of short fragments of double-stranded DNA (less than 100 bp) is technically straightforward and can be accomplished by synthesizing two complementary oligonucleotides and then annealing them. For the production of longer DNA molecules such as entire genes, special strategies must be devised because the coupling efficiency of each cycle during chemical DNA synthesis is never 100%. For example, if a gene contains 1,000 bp and the average coupling efficiency is 99.5%, then the proportion of full-length single DNA strands after the last cycle is less than 1% (0.995^999 × 100 ≈ 0.7%). To overcome this problem, synthetic genes are produced from oligonucleotides that are enzymatically assembled into larger double-stranded DNA molecules.
One method for building a synthetic gene utilizes a set of overlapping oligonucleotides that are about 60 nucleotides in length with approximately 20-base overlaps (Fig. 2.32). After complementary 3′ and 5′ extensions are annealed, large gaps remain, but the base-paired regions are both long enough and stable enough to hold the structure together. After all the oligonucleotides are combined, the gaps are filled by enzymatic DNA synthesis with DNA polymerase I (usually from E. coli). This enzyme uses the 3′ hydroxyl groups as replication initiation points and the single-stranded regions as templates. After the enzymatic synthesis is completed, the nicks are sealed with T4 DNA ligase. For larger genes (≥1,000 bp), smaller sections of the gene are first assembled into units of about 500 bases in length and then these are combined with other 500-base units. In turn, these larger kilobase segments are joined together until the entire sequence is completed. Computer programs are available both commercially and freely on the Internet which make it easier to determine the best set of oligonucleotides and overlaps for gene construction as well as allowing the user to select a particular codon usage, change any codon, and designate restriction endonuclease sites at specific locations. Finally, it is absolutely essential that a chemically synthesized gene have the correct sequence of nucleotides. Consequently, small synthetic genes are sequenced directly and, for larger genes, the sequences of each of the 500-base building blocks are determined before assembly.
Figure 2.32 Assembly and in vitro enzymatic DNA synthesis of a gene. Individual oligonucleotides are synthesized chemically and then hybridized. The sequences of the oligonucleotides are designed to enable them to form a stable molecule with base-paired regions separated by single-stranded regions (gaps). The gaps are filled in by in vitro enzymatic DNA synthesis. The nicks are sealed with T4 DNA ligase.
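One way to picture the overlapping-oligonucleotide design is to tile a gene with 60-nucleotide oligonucleotides taken alternately from the two strands, so that each shares a 20-base complementary region with its neighbor. The sketch below is a simplified illustration of that layout, not the algorithm used by any particular design program; the gene sequence is hypothetical, and the function names are invented for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def design_oligos(gene, oligo_len=60, overlap=20):
    """Tile `gene` (top strand, 5'->3') with alternating-strand oligonucleotides.

    Returns tuples of (start, end, strand, sequence written 5'->3').
    Requires 0 < overlap < oligo_len.
    """
    step = oligo_len - overlap
    oligos = []
    start, index = 0, 0
    while True:
        end = min(start + oligo_len, len(gene))
        segment = gene[start:end]
        strand = "top" if index % 2 == 0 else "bottom"
        seq = segment if strand == "top" else reverse_complement(segment)
        oligos.append((start, end, strand, seq))
        if end == len(gene):
            break
        start += step
        index += 1
    return oligos


# Hypothetical 180-bp sequence, used only for illustration.
gene = ("ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCC" * 5)[:180]
oligos = design_oligos(gene)
for start, end, strand, seq in oligos:
    print(f"{start:>3}-{end:<3} {strand:<6} {seq}")
```

In this layout, the unpaired middle of each oligonucleotide serves as the template for filling the gap on the opposite strand.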
Gene Synthesis by PCR
The assembly of a gene by PCR is faster and more economical than filling in overlapping oligonucleotides using DNA polymerase and then sealing the nicks with T4 DNA ligase. One PCR-based protocol for gene construction starts with two overlapping oligonucleotides (A and B), usually about 50 nucleotides long, that represent sequences from the center of the gene (Fig. 2.33). After annealing, these oligonucleotides have recessed 3′ hydroxyl groups that provide a starting point for DNA synthesis during the elongation phase of a PCR cycle. The product is a double-stranded DNA molecule. The PCR cycle (denaturation, oligonucleotide annealing, and extension) is repeated 20 times to maximize the amount of product that is formed. Next, two additional oligonucleotides (C and D) are added to the mixture. Oligonucleotide C overlaps at its 3′ end with the 5′ end of oligonucleotide A and represents the nucleotide sequence of the gene immediately upstream of the oligonucleotide A sequence. Oligonucleotide D overlaps at its 3′ end with the 5′ end of oligonucleotide B and represents the nucleotide sequence of the gene immediately downstream of the oligonucleotide B sequence. After 20 PCR cycles, a double-stranded DNA with a specific sequence order (CABD) is produced.
Figure 2.33 Gene synthesis by PCR. Overlapping oligonucleotides (A and B) are filled in from the recessed 3′ hydroxyl ends during DNA synthesis. Oligonucleotides (C and D) that are complementary to the ends of the product of the first PCR cycle are added to a sample, overlapping molecules are formed after denaturation and renaturation, and the recessed ends are filled in during DNA synthesis. Next, oligonucleotides (E and F) that overlap the ends of the second-cycle PCR product are added to a sample, and a third PCR cycle is initiated. The final PCR product is a double-stranded DNA molecule with a specified sequence of nucleotides. The pairs of letters with or without a prime (e.g., A′ and A) represent complementary oligonucleotides. Each oligonucleotide corresponds to a sequence from a particular DNA strand.
Thereafter, pairs of oligonucleotides are added, one of the pair overlapping the upstream sequence of the DNA molecule formed in the previous round and the other overlapping the downstream sequence, and subjected to 20 PCR cycles for each pair added until the entire gene is formed. Synthesis of a gene with 1,000 bp can be carried out in one day. As with other methods for assembling genes, the last pair of oligonucleotides (i.e., the 5′ and 3′ ends of the gene) can be made with supplementary sequences outside the coding region that facilitate the cloning of the gene into a vector and, at the 5′ end, with sequences that enable the gene to be expressed in a host cell.
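The number of rounds needed to build a gene outward from its center can be estimated with simple arithmetic. The sketch below assumes 50-nucleotide oligonucleotides and, as an assumption not stated in the text for this method, a 20-base overlap between each new oligonucleotide and the existing product. The function name assembly_rounds is illustrative.

```python
import math


def assembly_rounds(gene_len, oligo_len=50, overlap=20):
    """Estimate the rounds of oligonucleotide-pair addition needed.

    Round 1 joins the central pair (A and B); each later round adds one
    oligonucleotide to each end of the growing product.
    """
    core = 2 * oligo_len - overlap        # length after the central pair is filled in
    gain = 2 * (oligo_len - overlap)      # bases added per later round (both ends)
    extra_rounds = max(0, math.ceil((gene_len - core) / gain))
    return 1 + extra_rounds


CYCLES_PER_ROUND = 20   # from the protocol described above
rounds = assembly_rounds(1000)
print("rounds:", rounds, "total PCR cycles:", rounds * CYCLES_PER_ROUND)
```

Longer oligonucleotides or shorter overlaps reduce the number of rounds, at the cost of lower synthesis yield or less stable annealing.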
DNA Sequencing Technologies
Determination of the nucleotide composition and order in a gene or genome is a foundational technique in molecular biotechnology. Cloned or PCR-amplified genes and entire genomes are routinely sequenced. DNA sequences can often reveal something about the function of the protein encoded in a gene, for example, from predicted cofactor binding sites, transmembrane domains, receptor recognition sites, or DNA-binding regions. The nucleotide sequences in noncoding regions that do not encode a protein or RNA molecule may provide information about the regulation of a gene. Comparison of gene sequences among individuals can reveal mutations that contribute to phenotypic differences. For example, identification of nucleotide differences (polymorphisms) in a gene in individuals with a particular disease, but not in healthy individuals, may be used to predict disease susceptibility. Comparison of gene sequences among different organisms can lead to the development of hypotheses about the evolutionary relationships among organisms.
For more than three decades, the dideoxynucleotide procedure developed by the English biochemist Frederick Sanger (see Milestone box on page 53) has been used for DNA sequencing. This includes sequencing of DNA fragments containing one to a few genes and also the entire genomes from many different organisms, including the human genome. However, the interest in sequencing large numbers of DNA molecules in less time and at a lower cost has driven the recent development of new sequencing technologies that can process thousands to millions of sequences concurrently. Many different sequencing technologies have been developed. In general, all of these methods involve (i) enzymatic addition of nucleotides to a primer based on complementarity to a template DNA fragment and (ii) detection and identification of the nucleotide(s) added. Most employ DNA polymerase to catalyze the addition of single nucleotides (sequencing by synthesis), although ligase may also be used to add a short, complementary oligonucleotide (sequencing by ligation). The techniques differ in the method by which the addition is detected.
milestone DNA Sequencing with Chain-Terminating Inhibitors
New techniques are the lifeblood of science. They enable researchers to acquire information that was previously inaccessible and that, in turn, generates insights that stimulate new research and lead to new discoveries. For molecular biotechnology, DNA sequencing is a powerful procedure that has become a laboratory mainstay. The most definitive form of molecular characterization of a gene or genome is its sequence. Among other things, the coding content of a gene, potential primer sequences for a PCR, and the presence of mutations can be determined by DNA sequencing.
Sequencing by enzymatic DNA synthesis with chain elongation inhibitors is a relatively simple, accurate, and reliable method developed by Sanger et al. (Proc. Natl. Acad. Sci. USA 74:5463–5467, 1977). At the time the Sanger (dideoxy) method was published, most DNA sequencing was carried out by the base-specific chemical cleavage method devised by A. M. Maxam and W. Gilbert (Proc. Natl. Acad. Sci. USA 74:560–564, 1977). Before the development of these techniques, nucleic acid sequencing was more or less limited to RNA molecules. The sequencing of a DNA molecule required transcribing a DNA fragment into RNA with RNA polymerase and then sequencing the RNA product. In general, RNA sequencing entailed treating a radiolabeled RNA molecule with different ribonucleases, chromatographically separating the digestion products, redigesting the separated products, hydrolyzing the products of the second digestion with alkali, chromatographically separating the hydrolysis products, determining the sequence of the oligonucleotides, and constructing the sequence based on overlapping stretches of nucleotides. This approach was time-consuming and tedious. With the advent of the dideoxy method, it became obsolete. The Sanger method superseded the Maxam and Gilbert sequencing procedure when the M13 bacteriophage cloning system was developed, which provided single-stranded DNA templates required for sequencing. The M13 system was no longer required following introduction of PCR-based cycle sequencing, which generates single-stranded DNA templates during a DNA denaturation step. Sanger and Gilbert received the Nobel Prize in Chemistry in 1980 for their work.
Dideoxynucleotide Procedure
The dideoxynucleotide procedure for DNA sequencing is based on the principle that during DNA synthesis, addition of a nucleotide triphosphate requires a free hydroxyl group on the 3′ carbon of the sugar of the last nucleotide of the growing DNA strand (Fig. 2.34A). However, if a synthetic dideoxynucleotide that lacks a hydroxyl group at the 3′ carbon of the sugar moiety is incorporated at the end of the growing chain, DNA synthesis stops because a phosphodiester bond cannot be formed with the next incoming nucleotide (Fig. 2.34B). The termination of DNA synthesis is the defining feature of the dideoxynucleotide DNA sequencing method.
Figure 2.34 Incorporation of a dideoxynucleotide terminates DNA synthesis. (A) Addition of an incoming deoxyribonucleoside triphosphate (dNTP) requires a hydroxyl group on the 3′ carbon of the last nucleotide of a growing DNA strand. (B) DNA synthesis stops if a synthetic dideoxyribonucleotide that lacks a 3′ hydroxyl group is incorporated at the end of the growing chain because a phosphodiester bond cannot be formed with the next incoming nucleotide.
In a dideoxynucleotide DNA sequencing procedure, a synthetic oligonucleotide primer (∼17 to 24 nucleotides) anneals to a predetermined site on the strand of the DNA to be sequenced (Fig. 2.35A). The oligonucleotide primer defines the beginning of the region to be sequenced and provides a 3′ hydroxyl group for the initiation of DNA synthesis. The reaction tube contains a mixture of the four deoxyribonucleotides (deoxyadenosine triphosphate [dATP], deoxycytidine triphosphate [dCTP], deoxyguanosine triphosphate [dGTP], and deoxythymidine triphosphate [dTTP]) and four dideoxynucleotides (dideoxyadenosine triphosphate [ddATP], ddCTP, ddGTP, and ddTTP). Each dideoxynucleotide is labeled with a different fluorescent dye. The concentration of the dideoxynucleotides is optimized to ensure that during DNA synthesis a modified DNA polymerase incorporates a dideoxynucleotide into the mixture of growing DNA strands at every possible position. Thus, the products of the reaction are DNA molecules of all possible lengths, each of which includes the primer sequence at its 5′ end and a fluorescently labeled dideoxynucleotide at the 3′ terminus (Fig. 2.35B).
Figure 2.35 Dideoxynucleotide method for DNA sequencing. An oligonucleotide primer binds to a complementary sequence adjacent to the region to be sequenced in a single-stranded DNA template (A). As DNA synthesis proceeds from the primer, dideoxynucleotides are randomly added to the growing DNA strands, thereby terminating strand extension. This results in DNA molecules of all possible lengths that have a fluorescently labeled dideoxynucleotide at the 3′ end (B). DNA molecules of different sizes are separated by capillary electrophoresis, and as each molecule passes by a laser, a fluorescent signal that corresponds with one of the four dideoxynucleotides is recorded. The successive fluorescent signals are represented as a sequencing chromatogram (colored peaks) (C).
PCR-based cycle sequencing is performed to minimize the amount of template DNA required for sequencing. Multiple cycles of denaturation, primer annealing, and primer extension produce large amounts of dideoxynucleotide-terminated fragments. These are applied to a polymer in a long capillary tube that enables separation of DNA fragments that differ in size by a single nucleotide. As each successive fluorescently labeled fragment moves through the polymeric matrix in an electric field and passes by a laser, the fluorescent dye is excited. Each of the four different fluorescent dyes emits a characteristic wavelength of light that represents a particular nucleotide, and the order of the fluorescent signals corresponds to the sequence of nucleotides (Fig. 2.35C). Generally, automated systems that employ this sequencing technology can determine with high accuracy about 500 to 600 bases per run (the read length, or read).
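The logic of chain termination and size separation can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function name `sanger_read`, the termination probability `dd_fraction`, and the number of simulated molecules are hypothetical choices, the primer is omitted, and the input is the strand being synthesized (the complement of the template) rather than the template itself.

```python
import random


def sanger_read(new_strand, dd_fraction=0.1, n_molecules=20000, seed=1):
    """Toy model of dideoxy sequencing.

    new_strand: the strand synthesized from the primer, written 5' to 3'.
    dd_fraction: assumed chance that a ddNTP (not a dNTP) is added at a position.
    Returns the sequence read from the terminal dyes, shortest fragment first.
    """
    rng = random.Random(seed)
    dye_by_length = {}
    for _ in range(n_molecules):
        # Extend one molecule until a dideoxynucleotide happens to be incorporated.
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < dd_fraction:
                # Synthesis stops; the fragment carries the dye of its last base.
                dye_by_length[length] = base
                break
    # Capillary electrophoresis: fragments pass the laser in order of size.
    return "".join(dye_by_length[length] for length in sorted(dye_by_length))


if __name__ == "__main__":
    print(sanger_read("ATGGCTAACGTTCCGATAGC"))
```

With enough simulated molecules, every fragment length is represented, so the order of terminal dyes reproduces the synthesized sequence. If `dd_fraction` is set too high, long fragments become rare, which mirrors the need to optimize the dideoxynucleotide concentration.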
Pyrosequencing
Pyrosequencing was the first of the next-generation sequencing technologies to be made commercially available. The basis of the technique is the detection of pyrophosphate that is released during DNA synthesis. When a DNA strand is extended by DNA polymerase, the α-phosphate attached to the 5′ carbon of the sugar of an incoming deoxynucleoside triphosphate forms a phosphodiester bond with the 3′ hydroxyl group of the last nucleotide of the growing strand. The terminal β- and γ-phosphates of the added nucleotide are cleaved off as a unit known as pyrophosphate (Fig. 2.36A). The release of pyrophosphate correlates with the incorporation of a specific nucleotide in the growing DNA strand.
Figure 2.36 Pyrosequencing is based on the detection of pyrophosphate that is released during DNA synthesis. (A) A phosphodiester bond forms between the 3′ hydroxyl group of the deoxyribose sugar of the last incorporated nucleotide and the α-phosphate of the incoming nucleotide (blue arrow). The bond between the α- and β-phosphates is cleaved (green arrow), and pyrophosphate is released (black arrow). (B) An adaptor sequence is added to the 3′ end of the DNA sequencing template that provides a binding site for a sequencing primer. One nucleotide (deoxyribonucleoside triphosphate [dNTP]) is added at a time. If the dNTP is added by DNA polymerase to the end of the growing DNA strand, pyrophosphate (PPi) is released and detected indirectly by the synthesis of ATP. ATP is required for light generation by luciferase. The DNA sequence is determined by correlating light emission with incorporation of a particular dNTP.
To determine the sequence of a DNA fragment by pyrosequencing, a short DNA adaptor that serves as a binding site for a sequencing primer is first added to the end of the DNA template (Fig. 2.36B). Following annealing of the sequencing primer to the complementary adaptor sequence, one deoxynucleotide is introduced at a time in the presence of DNA polymerase. Pyrophosphate is released only when the complementary nucleotide is incorporated at the end of the growing strand. Nucleotides that are not complementary to the template strand are not incorporated, and no pyrophosphate is formed.
The pyrophosphate released following incorporation of a nucleotide is detected indirectly after enzymatic synthesis of ATP (Fig. 2.36B). Pyrophosphate combines with adenosine-5′-phosphosulfate in the presence of the enzyme ATP sulfurylase to form ATP. In turn, ATP drives the conversion of luciferin to oxyluciferin by the enzyme luciferase, a reaction that generates light. Detection of light after each cycle of nucleotide addition and enzymatic reactions indicates the incorporation of a complementary nucleotide. The amount of light generated after the addition of a particular nucleotide is proportional to the number of nucleotides that are incorporated in the growing strand, and therefore sequences containing tracts of up to eight identical nucleotides in a row can be determined. Because the natural nucleotide dATP can participate in the luciferase reaction, dATP is replaced with deoxyadenosine α-thiotriphosphate, which can be incorporated into the growing DNA strand by DNA polymerase but is not a substrate for luciferase. Repeated cycles of nucleotide addition, pyrophosphate release, and light detection enable determination of sequences of 300 to 500 nucleotides per run.
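The relationship between nucleotide dispensation and light output can be sketched in a few lines. The following is an idealized model, not a description of any instrument: the function name `pyrosequence`, the dispensation order "ACGT", and the use of whole-number light units are assumptions, and the input is the strand being synthesized rather than the template.

```python
def pyrosequence(new_strand, dispensation_order="ACGT", max_flows=100):
    """Idealized pyrosequencing run.

    One nucleotide is dispensed per flow. The light signal is taken to be
    exactly proportional to the number of identical nucleotides incorporated.
    Returns (flows, decoded), where flows is a list of (nucleotide, light_units).
    """
    position = 0
    flows = []
    for flow in range(max_flows):
        if position >= len(new_strand):
            break
        nucleotide = dispensation_order[flow % len(dispensation_order)]
        incorporated = 0
        # Polymerase keeps adding this nucleotide while it matches the template.
        while position < len(new_strand) and new_strand[position] == nucleotide:
            incorporated += 1
            position += 1
        # No incorporation means no pyrophosphate, no ATP, and no light.
        flows.append((nucleotide, incorporated))
    decoded = "".join(nucleotide * light for nucleotide, light in flows)
    return flows, decoded


if __name__ == "__main__":
    flows, decoded = pyrosequence("AACGGGT")
    for nucleotide, light in flows:
        print(nucleotide, light)
    print(decoded)
```

In this model a run of three G residues gives a signal of 3 in a single flow. In practice the proportionality becomes hard to resolve for long homopolymer tracts, which is why the text gives a limit of about eight identical nucleotides.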
Sequencing Using Reversible Chain Terminators
For pyrosequencing, each of the four nucleotides must be added sequentially in separate cycles. The sequence of a DNA fragment could be determined more rapidly if all the nucleotides were added together in each cycle. However, the reaction must be controlled to ensure that only a single nucleotide is incorporated during each cycle, and it must be possible to distinguish each of the four nucleotides. Synthetic nucleotides known as reversible chain terminators have been designed to meet these criteria and form the basis of some of the next-generation sequencing-by-synthesis technologies.
Reversible chain terminators are deoxynucleoside triphosphates with two important modifications: (i) a chemical blocking group is added to the 3′ carbon of the sugar moiety to prevent addition of more than one nucleotide during each round of sequencing and (ii) a different fluorescent dye is added to each of the four nucleotides to enable identification of the incorporated nucleotide (Fig. 2.37A). The fluorophore is added at a position that does not interfere with either base-pairing or phosphodiester bond formation. Similar to the case with other sequencing-by-synthesis methods, DNA polymerase is employed to catalyze the addition of the modified nucleotides to an oligonucleotide primer as specified by the DNA template sequence (Fig. 2.37B). After recording fluorescent emissions, the fluorescent dye and the 3′ blocking group are removed. The blocking group is removed in a manner that restores the 3′ hydroxyl group of the sugar to enable subsequent addition of another nucleotide in the next cycle. Cycles of nucleotide addition to the growing DNA strand by DNA polymerase, acquisition of fluorescence data, and chemical cleavage of the blocking and dye groups are repeated to generate short read lengths (i.e., 50 to 100 nucleotides per run).
Figure 2.37 Sequencing using reversible chain terminators. (A) Reversible chain terminators are modified nucleotides that have a removable blocking group on the oxygen of the 3′ position of the deoxyribose sugar to prevent addition of more than one nucleotide per sequencing cycle. To enable identification, a different fluorescent dye is attached to each of the four nucleotides via a cleavable linker. Shown is the fluorescent dye attached to adenine. (B) An adaptor sequence is added to the 3′ end of the DNA sequencing template that provides a binding site for a sequencing primer. All four modified nucleotides are added in a single cycle, and a modified DNA polymerase extends the growing DNA chain by one nucleotide per cycle. Fluorescence is detected, and then the dye and the 3′ blocking group are cleaved before the next cycle. Removal of the blocking group restores the 3′ hydroxyl group for addition of the next nucleotide.
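The cycle of incorporation, detection, and cleavage can be modeled as a small state machine. This is a minimal sketch: the class `GrowingStrand`, its methods, and the function `run_cycles` are hypothetical names, the template is written 3′ to 5′ so that it is read left to right, and incorporation is assumed to be error free.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


class GrowingStrand:
    """Primer extension with 3'-blocked, dye-labeled nucleotides (toy model)."""

    def __init__(self, template_3_to_5):
        self.template = template_3_to_5
        self.bases = []
        self.blocked = False  # True while the 3' blocking group is present

    def incorporate(self):
        """Add one terminator if the 3' end is free; return its identity (dye)."""
        if self.blocked or len(self.bases) >= len(self.template):
            return None
        base = COMPLEMENT[self.template[len(self.bases)]]
        self.bases.append(base)
        self.blocked = True  # the blocking group prevents a second addition
        return base

    def cleave(self):
        """Remove the dye and blocking group, restoring the 3' hydroxyl group."""
        self.blocked = False


def run_cycles(template_3_to_5, n_cycles):
    strand = GrowingStrand(template_3_to_5)
    read = []
    for _ in range(n_cycles):
        signal = strand.incorporate()  # all four nucleotides offered together
        if signal is None:
            break
        read.append(signal)            # fluorescence recorded
        strand.cleave()                # prepare for the next cycle
    return "".join(read)


if __name__ == "__main__":
    print(run_cycles("TTGCA", 3))
```

Unlike the pyrosequencing model, a run of identical nucleotides takes one cycle per nucleotide here, because the blocking group limits each cycle to a single addition.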
Sequencing by Single Molecule Synthesis
To generate sufficiently high levels of a fluorescent or light signal for detection of nucleotide addition, the sequencing methods described above require large amounts of template DNA. A DNA amplification step is often required, which increases template preparation time and can introduce mutations that are interpreted as nucleotide variations. Recently, sequencing technologies have been developed to circumvent the amplification step. In one approach, a single molecule of DNA polymerase is immobilized on a solid support (on the bottom of a nanoscale well) and captures a single DNA molecule that is bound to a primer (Fig. 2.38A). During the sequence acquisition stage, DNA polymerase extends the primer in a template dependent fashion and a signal corresponding to nucleotide addition is measured in a narrow volume at the bottom of the well (Fig. 2.38B).
Figure 2.38 Real-time single-molecule sequencing. One molecule of DNA polymerase (orange shape) is attached to the bottom of a nanoscale well. A single-stranded DNA molecule (grey strand) bound to a primer (blue strand) is captured in the active site of the polymerase (A). Each of the four different nucleoside triphosphates is attached to a different fluorophore (colored stars) at the terminal phosphate, which is released during template-dependent nucleotide incorporation into the growing DNA strand. Fluorescence emission from a zeptoliter (10⁻²¹ liter) volume at the bottom of the well is detected by a laser before the cleaved pyrophosphate with attached fluorophore diffuses away (B).
The nucleotide added during the extension phase is detected in real time, as it is incorporated. For real-time sequencing, the nucleotides do not carry a blocking group on the 3′ hydroxyl group and therefore DNA synthesis is continuous. A different fluorescent tag is attached to the terminal phosphate of each nucleoside triphosphate, in a manner that does not interfere with the activity of the DNA polymerase. With each nucleotide addition to the growing DNA chain, pyrophosphate is cleaved and with it the fluorescent tag. Tag cleavage therefore corresponds to nucleotide addition. The laser used to measure fluorescence is narrowly focused on the immobilized DNA polymerase and therefore records a pulse of fluorescence only in the brief time (tens of milliseconds) when the tagged nucleotide is held in the enzyme’s active site (Fig. 2.38B). Following formation of a phosphodiester bond, the fluorescent tag cleaved from the nucleotide rapidly diffuses out of the range of the detector. Translocation of the DNA template positions DNA polymerase to accept the next nucleotide into the active site. Long sequence reads (greater than 10 kbp on average) can be generated rapidly by this method; however, accuracy is generally lower than other methods due to the short time interval between nucleotide additions, dissociation of a nucleotide before a phosphodiester bond forms, and simultaneous measurement of fluorescence from more than one nucleotide.
Sequencing Whole Genomes
Just as the sequence of a gene can provide information about the function of the encoded protein, the sequence of an entire genome can contribute to our understanding of the nature of an organism. Thousands of whole genomes have now been sequenced, from organisms of all domains of life. Initially, the sequenced genomes were relatively small, limited by the early sequencing technologies. The first DNA genome to be sequenced was from the E. coli bacteriophage ΦX174 (5,375 bp) in 1977, while the first sequenced genome from a cellular organism was that of the bacterium Haemophilus influenzae (1.8 Mbp) in 1995. Within 2 years, the sequence of the larger E. coli genome (4.6 Mbp) was reported, and the sequence of the human genome (3,000 Mbp), the first vertebrate genome, was completed in 2003.
Most of these first genome sequences were generated using a shotgun cloning approach. In this strategy, a clone library of randomly generated, overlapping genomic DNA fragments is constructed in a bacterial host. The plasmids are isolated, and then the cloned inserts are sequenced using the dideoxynucleotide method. Using this approach, the first human genome was sequenced in 13 years at a cost of $2.7 billion. The aspiration to acquire genome sequences faster and at a much lower cost has driven the development of new genome sequencing strategies. Today, many large-scale sequencing projects have been completed and many more are under way, motivated by compelling biological questions. Some will contribute to our understanding of the microorganisms that cause infectious diseases and to the development of new techniques for their detection and treatment. Others are aimed at helping us to understand what it means to be human and how we evolved. Understanding the nucleotide polymorphisms among individuals with and without a specific disease will help us to determine the genetic basis of disease.
Generally, DNA sequencing projects fall into two categories: de novo genome sequencing and resequencing. Sequencing the genome of an organism that has not previously been sequenced is de novo genome sequencing, whereas resequencing involves comparing a newly determined sequence with a known reference sequence. A large-scale sequencing project typically entails (i) preparing a library of template DNA fragments, (ii) amplifying the DNA fragments, which increases the detection signal from nucleotide addition during the sequencing reaction, (iii) sequencing the template DNA using one of the sequencing techniques described above, and (iv) assembling the sequences generated from the fragments in the order in which they are found in the original genome. Sequencing massive amounts of DNA required not only the development of new technologies for nucleotide sequence determination but also new methods to reduce the time for preparation and processing of large libraries of sequencing templates. High-throughput next-generation sequencing approaches have circumvented the cloning steps of the shotgun sequencing strategy by attaching, amplifying, and sequencing the genomic DNA fragments directly on a solid support. Single molecule sequencing eliminates the need for an amplification step, further reducing the time for template preparation. In both cases, all of the templates are sequenced at the same time. The term used to describe this is massive parallelization.
Preparation of Genomic DNA Sequencing Libraries
Although the shotgun cloning strategy has been used successfully to obtain the sequences of many whole genomes, preparation of clone libraries in bacterial cells is costly and time-consuming for routine sequencing of the large amounts of genomic DNA that are required for many research and clinical applications. To reduce the time and cost of large-scale sequencing, high-throughput next-generation sequencing strategies have been developed that use cell-free methods to generate a library of genomic DNA fragments. First, purified genomic DNA is fragmented either mechanically by sonication (applying high-frequency sound energy) or nebulization (forcing DNA through a small hole using compressed air), or by enzymatic digestion. Physical fragmentation tends to leave extended single-stranded ends that must be blunted (end repaired or polished) by filling in 3′ recessed ends with DNA polymerase in the presence of the four deoxyribonucleotides (Fig. 2.39A) and removing protruding 3′ ends with an exonuclease (Fig. 2.39B). Next, different oligonucleotide adaptors are ligated to each end of the polished genomic fragments. To facilitate ligation, the 5′ ends of the DNA fragments are phosphorylated with T4 polynucleotide kinase (Fig. 2.39C). The 3′ ends may be adenylated (A-tailed) by enzymatic addition of a single deoxyadenosine monophosphate to facilitate ligation of adaptors that have a single complementary 3′ deoxythymidine monophosphate overhang (Fig. 2.39D). The adaptors have sequences that anneal to PCR primers for amplification of the genomic sequence and to sequencing primers that prime the sequencing reaction (Fig. 2.40). In addition, adaptors may contain an index (barcode) sequence for tagging a genomic library. A barcode is a short (usually 8–12 nucleotides), unique nucleotide sequence that is used to identify and sort sequence reads generated from a genomic library when multiple libraries are pooled prior to sequencing (multiplexing).
Generally, genomic DNA fragments are size selected by removing fragments above and below a certain size (typically 200–500 bp) to facilitate assembly of the genome sequence (described below). Following size selection, the libraries may be amplified by PCR to enrich for genomic DNA fragments that have adaptors ligated to both ends and to increase the amount of template for sequencing.
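Sorting pooled reads by barcode (demultiplexing) can be sketched as follows. This example makes several assumptions for illustration: the function name `demultiplex` and the library names are hypothetical, the barcode is taken to be the first 8 nucleotides of each read, and only exact barcode matches are accepted. Sequencing platforms differ in where the index is read and in how mismatches are tolerated.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_library, barcode_len=8):
    """Sort pooled reads into libraries using an exact-match barcode prefix.

    Reads whose barcode is not recognized are placed in 'undetermined'.
    The barcode is trimmed from each read before it is stored.
    """
    binned = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        library = barcode_to_library.get(barcode, "undetermined")
        binned[library].append(insert)
    return dict(binned)


if __name__ == "__main__":
    barcodes = {"ACGTACGT": "soil", "GGCATTAC": "seawater"}  # hypothetical libraries
    pooled = [
        "ACGTACGT" + "TTGACC",
        "GGCATTAC" + "CCATGA",
        "ACGTACGT" + "GATTACA",
        "NNNNNNNN" + "AAAA",
    ]
    for library, inserts in demultiplex(pooled, barcodes).items():
        print(library, inserts)
```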
Figure 2.39 Preparation of genomic DNA fragments for ligation of adaptors. Ends of frayed DNA are repaired (end repaired) using DNA polymerase to fill in from recessed 3′ ends (A) and a 3′ exonuclease to degrade 3′ extensions (B). Note: Fragments with different combinations of extensions and recessed ends are not shown here. In all of these cases, the polymerase and/or 3′ exonuclease activities produce blunt-end DNA molecules. T4 polynucleotide kinase phosphorylates the 5′ ends of the blunt-end fragments (C). A single deoxyadenosine monophosphate may be added to the 3′ ends of blunted fragments (A-tailed) by DNA polymerase to facilitate ligation of adaptors that have a single complementary 3′ deoxythymidine monophosphate overhang (D).
Figure 2.40 Features of an adaptor used for preparation and sequencing of genomic DNA fragments. A single 3′ deoxythymidine monophosphate overhang facilitates ligation of the adaptor to the ends of A-tailed genomic DNA fragments (Fig. 2.39D). The adaptor has a sequence that anneals to an oligonucleotide primer which captures and amplifies the genomic sequence on a solid support, a sequence that anneals to a sequencing primer that primes the sequencing reaction, and a unique barcode sequence that is used to tag a genomic library when multiple libraries are combined for sequencing (multiplexing).
High-Throughput Next-Generation Sequencing
Most of the current commercially available, high-throughput next-generation sequencing strategies use PCR to generate clusters containing millions of copies of each DNA sequencing template. In one strategy, single-stranded sequencing templates (denatured library fragments) are captured by hybridization via the adaptor sequence that is complementary to oligonucleotides covalently bound to a solid surface, such as a glass slide (Fig. 2.41A). The oligonucleotides also act as primers for DNA polymerase to synthesize the strand complementary to the captured template strand. The resulting double-stranded DNA is denatured and the original template is washed away. The remaining strand is anchored to the solid support via the bound oligonucleotide primer at one end and at the other end carries the adaptor sequence that is complementary to an oligonucleotide primer that is adjacent on the glass slide, to which it hybridizes (Fig. 2.41B). Following extension of the second primer (bridge PCR) and denaturation of the double-stranded product, two single-stranded DNA molecules are anchored to the solid support (Fig. 2.41C). The process is repeated many times to generate clusters of about a thousand copies of each sequencing template (Fig. 2.41D).
Figure 2.41 Generation of clusters of sequencing templates. Denatured genomic DNA library fragments are captured on a glass slide by annealing to a bound oligonucleotide via a complementary adaptor sequence (A). The oligonucleotide primes the synthesis of the complementary strand. The resulting double-stranded DNA is denatured and the original library fragment is washed away. The strand that remains anchored to the glass slide at one end binds to an adjacent oligonucleotide primer by the adaptor sequence at the other end (B). The complementary strand is synthesized by extension of the second primer in a process known as bridge amplification. Denaturation of the double-stranded product results in two single-stranded DNA molecules that are bound to the glass slide (C). The process is repeated many times to generate clusters of about a thousand copies of each sequencing template (D).
The nucleotide sequence of each template may be acquired by addition of a sequencing primer that is complementary to an adaptor sequence, DNA polymerase, and four differentially labeled fluorescent reversible chain terminators (Fig. 2.37). A single sequencing cycle consists of addition of these reagents to each cluster containing the PCR-amplified copies of a fragment of genomic DNA and then capturing the fluorescent signal generated by addition of a single nucleotide (as described above). The spectrum of fluorescence corresponds to the nucleotide added, and the cycle is repeated to determine the sequence of 50 to 150 nucleotides from one or both ends of each template. This process occurs simultaneously for hundreds of millions of clusters anchored to the solid support.
The accuracy of the sequence is assessed using a Phred quality score (Q), which indicates the probability (P) that a base is identified incorrectly, as described in the following equation: Q = −10 log₁₀ P. For example, a Q score of 30 (Q30) indicates that there is a 1 in 1,000 chance that a base is called incorrectly, or that the base call accuracy is 99.9%.
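The Phred relationship is easy to compute in both directions. The short sketch below uses hypothetical function names and simply applies the equation given above and its inverse, P = 10^(−Q/10).

```python
import math


def phred_from_error_probability(p):
    """Q = -10 * log10(P), where P is the probability of an incorrect base call."""
    return -10 * math.log10(p)


def error_probability_from_phred(q):
    """Inverse relationship: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = error_probability_from_phred(q)
        print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Each increase of 10 in the Q score corresponds to a 10-fold decrease in the probability of an incorrect base call.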
Genome Sequence Assembly
A genome sequence can be assembled by aligning the sequences of DNA reads with sequences from a previously determined and highly related (reference) genome. For example, reads from resequenced human genomes, that is, genomes from different individuals, are mapped to a reference human genome. Alternatively, when a reference sequence is not available, the reads can be assembled de novo using a computer program that aligns the matching ends of different reads. The process of generating successive overlapping sequences produces long, contiguous stretches of nucleotides called contigs. The presence of repetitive sequences in a genome can result in erroneous matching of overlapping sequences. This problem can be overcome by using the sequences from both ends of a DNA fragment (paired end reads), which are a known distance apart (when genomic DNA fragments are size selected prior to sequencing), to order and orient the reads and to assemble the contigs into larger scaffolds (Fig. 2.42). Many overlapping reads are required to ensure that the nucleotide sequence is accurate and assembled correctly. Each nucleotide site in a genome is generally sequenced many times from different fragments. The extent of sequencing redundancy, called coverage or depth of coverage, varies from 10 to more than 100, depending on the error rate of the sequencing method, the read length (shorter reads require greater coverage), the complexity of the genome, the assembly method, and the goal of the sequencing project. The assembly process generates a draft sequence; however, small gaps may remain between contigs. Although a draft sequence is sufficient for many purposes, for example, in resequencing projects that map a sequence onto a reference genome, in some cases it is preferable to close the gaps to complete the genome sequence. For de novo sequencing of genomes from organisms that lack a reference genome, gap closure is desirable. 
The gaps can be closed by PCR amplification of high-molecular-weight genomic DNA across each gap, followed by sequencing of the amplification product, or by obtaining short sequences from primers designed to anneal to sequences adjacent to a gap. Sequencing of additional libraries containing fragments of different sizes may be required to complete the overall sequence.
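The average depth of coverage discussed above is commonly estimated as the total number of sequenced nucleotides divided by the genome size. The sketch below uses hypothetical function names and assumes reads of uniform length that are distributed evenly across the genome, which real data only approximate.

```python
import math


def depth_of_coverage(n_reads, read_length, genome_size):
    """Average number of times each nucleotide site is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Example: a 4.6-Mbp genome (the size given earlier for E. coli), 150-nucleotide reads.
    print(reads_needed(target_coverage=50, read_length=150, genome_size=4_600_000))
```

Because coverage is an average, some regions will be sequenced fewer times than the target, which is one reason projects often aim for higher redundancy than a minimal estimate suggests.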
Figure 2.42 Genome sequence assembly. Sequence data generated from both ends of a DNA fragment are known as paired ends (paired ends are shown in blue for each fragment, and the distance between them is represented by a thin, black line). A large number of reads are generated and assembled into longer contiguous sequences (contigs) using a computer program that matches overlapping sequences. Paired ends help to determine the order and orientation of contigs as they are assembled into scaffolds. Shown is a scaffold consisting of three contigs.
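The idea of building contigs by matching overlapping ends can be illustrated with a greedy toy assembler. This is a sketch for very small inputs, not a representation of how production assemblers work: the function names and the `min_overlap` threshold are assumptions, reads are assumed to be error free and from the same strand, reads contained entirely within other reads are not handled specially, and paired-end information is ignored.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining overlaps; what is left are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

A repetitive sequence longer than the reads would give several equally good overlaps in this scheme, which illustrates why repeats cause erroneous matches and why paired ends are needed to order and orient contigs.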
Sequencing Metagenomes
For more than 100 years, the identification of microorganisms and characterization of their biological functions has required cultivating each strain in the laboratory. In the 1990s, with the emergence of techniques to extract DNA directly from environmental samples such as soil and seawater, researchers began to examine the sequence diversity of bacteria using the universal 16S ribosomal RNA gene as a taxonomic marker. These studies revealed that less than 1% of all bacterial species could be cultured, and therefore, novel genes that might be of considerable interest for basic and applied research were inaccessible using methods that depended on growth of bacteria in the laboratory. Considering the wealth of biotechnologically important genes and proteins that had been obtained from the relatively few culturable microorganisms, the possibility of harvesting useful genes from the much greater number of unculturable microorganisms was exciting, if not daunting. With the development of high-throughput next-generation sequencing and algorithms for assembling genome sequences, it has become possible to access the genomes of uncultured organisms from complex environmental and clinical samples. The study of the collective genomes in these samples is known as metagenomics.
The primary objective of a metagenomic project is to construct a comprehensive DNA library from all the microorganisms of a particular ecosystem or location (Fig. 2.43). The entire library is sequenced using a massively parallel approach and assembled into contigs as described above with the aim of determining the sequence of as many different genomes as possible and identifying both novel gene sequences and those that are similar (homologous) to known gene sequences. For example, a massive study that included 50 ocean samples from locations in the North Atlantic through the Panama Canal to the South Pacific yielded 6.3 billion bp of sequence. Analysis of the assembled and nonassembled sequences indicated that there might be as many as 400 new bacterial species among the samples with about 1 × 10⁶ genes that lack significant sequence similarity with any known gene. The analysis also revealed sequences encoding potentially novel forms of many proteins including proteins for repair of ultraviolet light-induced DNA damage and RuBisCO (ribulose bisphosphate carboxylase), an enzyme that is important for carbon fixation.
Figure 2.43 Construction of metagenomic libraries. Bacteria and/or viruses in samples from various environments or tissues are concentrated before extracting and then fragmenting the DNA. Libraries containing the DNA fragments are sequenced or screened for novel genes.
Genomics
Genome sequence determination is only a first step in understanding an organism. The next steps require identification of the features encoded in a sequence and investigations of the biological functions of the encoded RNA, proteins, and regulatory elements that determine the physiology and ecology of the organism. The area of research that generates, analyzes, and manages the massive amounts of information about genome sequences is known as genomics.
Sequence data are deposited and stored in databases that can be searched using computer algorithms to retrieve sequence information (data mining or bioinformatics). Public databases such as GenBank (National Center for Biotechnology Information, Bethesda, MD), the European Molecular Biology Laboratory Nucleotide Sequence Database, and the DNA Data Bank of Japan receive sequence data from individual researchers and from large sequencing facilities and share the data as part of the International Nucleotide Sequence Database Collaboration. Sequences can be retrieved from these databases via the Internet. Many specialized databases also exist, for example, for storing genome sequences from individual organisms, protein coding sequences, regulatory sequences, sequences associated with human genetic diseases, gene expression data, protein structures, protein-protein interactions, and many other types of data.
One of the first analyses to be conducted on a new genome sequence is the identification of descriptive features, a process known as annotation. Some annotations are protein coding sequences (open reading frames), sequences that encode functional RNA molecules (e.g., rRNA and tRNA), regulatory elements, and repetitive sequences. Annotation relies on algorithms that identify features based on conserved sequence elements such as translation start and stop codons, intron-exon boundaries, promoters, transcription factor-binding sites, and known genes (Fig. 2.44). It is important to note that annotations are often predictions of sequence function based on homology to sequences of known functions. In many cases, the function of the sequence remains to be verified through experimentation.
Figure 2.44 Genome annotation utilizes conserved sequence features. Predicting protein coding sequences (open reading frames) in prokaryotes (A) and eukaryotes (B) requires identification of sequences that correspond to potential translation start (ATG or, more rarely, GTG or TTG) and stop (TAA, shown; also TAG or TGA) codons in mRNA. The number of nucleotides between the start and stop codons must be a multiple of three (i.e., triplet codons) and must be a reasonable size to encode a protein. In prokaryotes, a conserved ribosome-binding site (RBS) is often present 4 to 8 nucleotides upstream of the start codon (A). Prokaryotic transcription regulatory sequences such as an RNA polymerase recognition (promoter) sequence and binding sites for regulatory proteins can often be predicted based on similarity to known consensus sequences. Transcription termination sequences are not as readily identifiable but are often GC-rich regions downstream of a predicted translation stop codon. In eukaryotes, protein coding genes typically have several intron sequences in primary RNA that are delineated by GU and AG and contain a pyrimidine-rich tract. Introns are spliced from the primary transcript to produce mRNA (B). Transcription regulatory elements such as the TATA and CAAT boxes that are present in the promoters of many eukaryotic protein coding genes can sometimes be predicted. Sequences that are important for regulation of transcription are often difficult to predict in eukaryotic genome sequences; for example, enhancer elements can be thousands of nucleotides upstream and/or downstream from the coding sequence that they regulate.
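The rules for predicting open reading frames in a prokaryotic sequence (a start codon, an in-frame stop codon, and a reasonable length) can be sketched as a simple scan. This is a minimal illustration, not an annotation tool: the function name `find_orfs` and the `min_codons` threshold are assumptions, only the three reading frames of the given strand are examined, and ribosome-binding sites, promoters, and introns are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30, start_codons=("ATG",)):
    """Find open reading frames on the given strand.

    Returns a list of (start, end, frame) with 0-based start and exclusive end;
    the end includes the stop codon. min_codons counts the start codon and all
    following codons up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one triplet codon at a time.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in start_codons:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    sequence = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
    print(find_orfs(sequence, min_codons=3))
```

Passing `start_codons=("ATG", "GTG", "TTG")` would also allow the rarer start codons mentioned in the caption. A fuller sketch would scan the reverse complement as well, because genes can be encoded on either strand.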
Comparison of a genome sequence to other genome sequences can reveal interesting and important sequence features. Comparisons among closely related genomes may reveal polymorphisms and mutations based on sequence differences. Association of specific polymorphisms with diseases can be used to predict, diagnose, and treat human diseases. Traditionally, cancer genetic research has investigated specific genes that were hypothesized to play a role in tumorigenesis based on their known cellular functions, for example, genes encoding transcription factors that control expression of cell division genes. Although important, this gives an incomplete view of the genetic basis for cancers. Sequencing of tumor genomes and comparing the sequences to those of normal cells have revealed point mutations, copy number mutations, and structural rearrangements associated with specific cancers. For instance, comparison of the genome sequences from acute myeloid leukemia tumor cells and normal skin cells from the same patient revealed eight previously unidentified mutations in protein coding sequences that are associated with the disease. Comparison of the genomes of bacterial pathogens with those from closely related nonpathogens has led to the identification of virulence genes. Unique sequences can be used for pathogen detection, and genes encoding proteins that are unique to a pathogen are potential targets for antimicrobial drugs and vaccine development.
Genome comparisons among distantly related organisms enable scientists to make predictions about evolutionary relationships. For example, the Genome 10K Project aims to sequence and analyze the genomes of 10,000 vertebrate species, roughly 1 per genus. Comparison of these sequences will contribute to our understanding of the genetic changes that led to the diversity in morphology, physiology, and behavior in this group of animals.
Another goal of genomic analysis is to understand the function of sequence features. Gene function can sometimes be inferred by the pattern of transcription. Transcriptomics is the study of gene transcription profiles either qualitatively, to determine which genes are expressed, or quantitatively, to measure changes in the levels of transcription of genes. Proteomics is the study of the entire protein populations of various cell types and tissues and the numerous interactions among proteins. Some proteins, particularly enzymes, are involved in biochemical pathways that produce metabolites for various cellular processes. Metabolomics aims to characterize metabolic pathways by studying the metabolite profiles of cells. All of these “-omic” subdisciplines of genomics use a genome-wide approach to study the function of biological molecules in cells, tissues, or organisms, at different developmental stages, or under different physiological or environmental conditions.
Transcriptomics
Transcriptomics (gene expression profiling) aims to measure the levels of transcription of genes on a whole-genome basis under a given set of conditions. Transcription may be assessed as a function of medical conditions, as a consequence of mutations, in response to natural or toxic agents, in different cells or tissues, or at different times during biological processes such as cell division or development of an organism. Often, the goal of gene expression studies is to identify the genes that are up- or downregulated in response to a change in a particular condition. Two major experimental approaches for measuring RNA transcript levels on a whole-genome basis are DNA microarray analysis and high-throughput next-generation RNA sequencing.
DNA Microarrays
A DNA microarray (DNA chip or gene chip) experiment consists of hybridizing a nucleic acid sample (target) derived from the mRNAs of a cell or tissue to single-stranded DNA sequences (probes) that are arrayed on a solid platform. Depending on the purpose of the experiment, the probes on a microarray may represent an entire genome, a single chromosome, selected genomic regions, or selected coding regions from one or several different organisms. Some DNA microarrays contain sets of oligonucleotides as probes, usually representing thousands of different genes, that are synthesized directly on a solid surface. Thousands of copies of an oligonucleotide with the same specific nucleotide sequence are synthesized in a predefined position on the array surface (probe cell). The probes are typically 20 to 70 nucleotides, although longer probes can also be used, and several probes with different sequences for each gene are usually present on the microarray to minimize errors. Probes are designed to be specific for their target sequences, to avoid hybridization with nontarget sequences, and to have similar melting (annealing) temperatures so that all target sequences can bind to their complementary probe sequence under the same conditions. A complete whole-genome oligonucleotide array may contain more than 500,000 probes representing as many as 30,000 genes.
For most gene expression profiling experiments that utilize microarrays, mRNA is extracted from cells or tissues and used as a template to synthesize cDNA using reverse transcriptase. Usually, mRNA is extracted from two or more sources for which expression profiles are compared, for example, from diseased versus normal tissue, or from cells grown under different conditions (Fig. 2.45A). The cDNA from each source is labeled with a different fluorophore by incorporating fluorescently labeled nucleotides during cDNA synthesis. For example, a green-emitting fluorescent dye (Cy3) may be used for the normal (reference) sample and a red-emitting fluorescent dye (Cy5) for the test sample. After labeling, the cDNA samples are mixed and hybridized to the same microarray (Fig. 2.45A). Replicate samples are independently prepared under the same conditions and hybridized to different microarrays. A laser scanner determines the intensities of Cy5 and Cy3 for each probe cell on a microarray. The ratio of red (Cy5) to green (Cy3) fluorescence intensity of a probe cell indicates the relative expression levels of the represented gene in the two samples (Fig. 2.45B). To avoid variation due to inherent and sequence-specific differences in labeling efficiencies between Cy3 and Cy5, the dye assignments for the reference and test samples are often swapped (reverse labeling) and the samples are hybridized to another microarray. Alternatively, for some microarray platforms, the target sequences from reference and test samples are labeled with the same fluorescent dye and are hybridized to different microarrays.
Methods to calibrate the data among microarrays in an experiment include using the fluorescence intensity of a gene that is not differentially expressed among different conditions as a reference point (i.e., a housekeeping gene), including spiked control sequences that are sufficiently different from the target sequences and therefore bind only to a corresponding control probe cell, and adjusting the total fluorescence intensities of all genes on each microarray to similar values under the assumption that a relatively small number of genes are expected to change among samples.
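The last calibration method, adjusting total fluorescence intensities to similar values, can be sketched as follows. This is a simplified illustration that assumes each microarray is represented as a list of background-corrected probe intensities in the same gene order; the function name `scale_to_common_total` is hypothetical, and analysis packages generally use more elaborate schemes.

```python
def scale_to_common_total(arrays):
    """Rescale each array so that its summed intensity equals the mean total.

    arrays is a list of lists; each inner list holds the intensities measured
    on one microarray. The approach assumes that only a small fraction of
    genes changes between samples, so overall intensity should be similar.
    """
    totals = [sum(intensities) for intensities in arrays]
    target = sum(totals) / len(totals)
    return [
        [value * target / total for value in intensities]
        for intensities, total in zip(arrays, totals)
    ]


# Two hypothetical arrays; the second was scanned at half the overall brightness.
raw = [[100.0, 200.0, 700.0], [50.0, 100.0, 350.0]]
for scaled in scale_to_common_total(raw):
    print(scaled)
```

After scaling, a gene that shows the same relative abundance on both arrays has the same intensity value on each, so remaining differences are more likely to reflect biology than scanning or labeling variation.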
Figure 2.45 Gene expression profiling with a DNA microarray. (A) mRNA is extracted from two samples (sample 1 and sample 2), and during reverse transcription, the first cDNA strands are labeled with the fluorescent dyes Cy3 and Cy5, respectively. The cDNA samples are mixed and hybridized to an ordered array of either gene sequences or gene-specific oligonucleotides. After the hybridization reaction, each probe cell is scanned for both fluorescent dyes and the separate emissions are recorded. Probe cells that produce only a green or red emission represent genes that are transcribed only in sample 1 or 2, respectively; yellow emissions indicate genes that are active in both samples; and the absence of emissions (black) represents genes that are not transcribed in either sample. (B) Fluorescence image of a DNA microarray hybridized with Cy3- and Cy5-labeled cDNA. Reproduced with permission from http://biotech.biology.arizona.edu/Resources/DNA_analysis.html. Courtesy of N. Anderson, University of Arizona.
Genes whose expression changes in response to a particular biological condition are identified by comparing the fluorescence intensities for each gene, averaged among replicates, under two different conditions. The raw fluorescence emission data for each gene are converted to a ratio of test to reference intensity, commonly expressed as fold change and often reported on a logarithmic scale. On such a scale, positive values represent greater expression of the gene in the test sample than in the reference sample, and negative values indicate a lower level of expression in the test sample relative to the reference sample. The data are often organized into clusters of genes whose expression patterns are similar under different conditions or over a period of time (Fig. 2.46). This facilitates predictions of gene products that may function together in a pathway.
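As a small worked example of converting intensities to a fold change, the sketch below averages replicate intensities and reports the base-2 logarithm of the test-to-reference ratio. The log2 convention is an assumption adopted here because it makes up- and downregulation symmetric around zero; the intensity values are invented for illustration.

```python
import math
from statistics import mean


def log2_fold_change(test_replicates, reference_replicates):
    """Return log2(mean test intensity / mean reference intensity) for one gene."""
    return math.log2(mean(test_replicates) / mean(reference_replicates))


# Hypothetical normalized intensities for one gene on three replicate arrays.
test = [400, 420, 380]
reference = [100, 110, 90]

print(log2_fold_change(test, reference))   # positive: higher in the test sample
print(log2_fold_change(reference, test))   # negative: lower in the test sample
```

A fourfold increase and a fourfold decrease give values of equal magnitude and opposite sign, which is convenient when genes are later clustered by expression pattern.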
Figure 2.46 Gene expression profile of cirrhotic liver tissue. Columns 1 to 7 and 8 to 15 are expression data from liver samples from patients with ethanol- and hepatitis C virus-induced cirrhosis of the liver, respectively. Each patient’s sample was compared to normal liver tissue. A total of 2,965 genes were differentially expressed. The asterisks denote patients with severe cirrhosis of the liver. Adapted from Figure 1 in Lederer, S. L., et al., Virol J. 3:98, 2006.
The gene expression profile in Fig. 2.46 determined by microarray analysis clearly shows that different genes are transcribed in patients with cirrhosis of the liver compared to normal individuals, and in patients with ethanol-induced cirrhosis compared to those with cirrhosis induced by the hepatitis C virus. Moreover, the genes that are turned on during advanced ethanol-induced liver damage differ from those that are turned on in less severe ethanol-induced cirrhosis (Fig. 2.46). No such distinction is evident among individuals with different severities of virus-induced cirrhosis (Fig. 2.46). In addition, information about the transcription of genes that contribute to a particular pathway or cellular activity can be extracted from a gene expression profile. For example, genes that are transcribed during lymphocyte proliferation and activation are highly expressed in virus-induced liver cirrhosis and to a much lesser extent in ethanol-associated cirrhotic samples (Fig. 2.47).
Figure 2.47 Gene expression profile of lymphocyte-specific genes from cirrhotic liver tissue. Columns 1 to 7 and 8 to 15 are expression data from liver samples from the patients described in Fig. 2.46 with ethanol- and hepatitis C virus-induced cirrhosis of the liver, respectively. Each patient’s sample was compared to normal liver tissue. The cluster consists of about 70 genes. The asterisks denote patients with severe cirrhosis of the liver. Adapted from Figure 2B in Lederer, S. L., et al., Virol J. 3:98, 2006.
RNA Sequencing
Similar to microarrays, RNA sequencing is used to detect and quantify the complete set of gene transcripts produced by cells under a given set of conditions. In addition, RNA sequencing can delineate the beginning and end of genes, reveal posttranscriptional modifications such as variations in intron splicing that lead to variant proteins, and identify differences in the nucleotide sequence of a gene among samples. In contrast to microarray analysis, this approach does not require prior knowledge of the genome sequence, avoids high background due to nonspecific hybridization, and can accurately quantify highly expressed genes (i.e., probe saturation is not a concern as it is for DNA microarrays). Traditionally, RNA sequencing approaches required generating cDNA libraries from isolated RNA and sequencing the cloned inserts, or the end(s) of the cloned inserts (expressed sequence tags), using the dideoxynucleotide method. New developments in sequencing technologies circumvent the requirement for preparation of a clone library and enable high-throughput sequencing of cDNA.
For high-throughput RNA sequencing, total RNA is isolated and converted to cDNA using reverse transcriptase and a mixture of oligonucleotide primers composed of six random bases (random hexamers) that bind to multiple sites on all of the template RNA molecules (Fig. 2.48A). Because rRNA makes up a large fraction (>80%) of the total cellular RNA and levels are not expected to change significantly under different conditions, these molecules are often removed prior to cDNA synthesis by hybridization to complementary oligonucleotides that are covalently linked to magnetic beads for removal. Long RNA molecules are fragmented to pieces of about 200 bp by physical (e.g., nebulization), chemical (e.g., metal ion hydrolysis), or enzymatic (e.g., controlled RNase digestion) methods either before cDNA synthesis (RNA fragmentation) or after cDNA synthesis (cDNA fragmentation).
Figure 2.48 High-throughput RNA sequencing. (A) Total RNA is extracted from a sample and rRNA may be removed. The RNA is fragmented and then converted to cDNA using reverse transcriptase. Adaptors are added to the ends of the cDNA to provide binding sites for sequencing primers. High-throughput next-generation sequencing technologies are used to determine the sequences at the ends of the cDNA molecules (paired end reads). The sequence reads are aligned to a reference genome or assembled into contigs using the overlapping sequences. Shown is the alignment of paired end reads to a gene containing one intron. (B) RNA expression levels are determined by counting the reads that correspond to a gene. Adapted with permission from Wang et al., Nat Rev Genet. 10:57–63, 2009.
The cDNA fragments are ligated at one or both ends to an adaptor that serves as a binding site for a sequencing primer (Fig. 2.48A). High-throughput next-generation sequencing technologies are employed to sequence the cDNA fragments. The sequence reads are assembled in a manner similar to that for genomic DNA, which is by aligning the reads to a reference genome or by aligning overlapping sequences to generate contigs for de novo assembly when a reference genome is not available. The reads are expected to align uniformly across the transcript (Fig. 2.48A). Gene expression levels are determined by counting the reads that correspond to each nucleotide position in a gene and averaging these across the length of the transcript (Fig. 2.48B). Expression levels are typically normalized between samples by scaling to the total number of reads per sample (e.g., reads/kilobase pair/million reads). Appropriate coverage (i.e., the number of cDNA fragments sequenced) is more difficult to determine for RNA sequencing than for genome sequencing because the total complexity of the transcriptome is not known before the experiment. In general, larger genomes and genomes that have more RNA splicing variants have greater transcriptome complexity and therefore require greater coverage. Also, accurate measurement of transcripts from genes with low expression levels requires sequencing of a greater number of transcripts. Quantification may be confounded by the high GC content of some cDNA fragments which have a higher melting temperature and therefore are inefficiently sequenced, by overrepresentation of cDNA fragments from the 5′ end of transcripts due to the use of random hexamers, and by reads that map to more than one site in a genome due to the presence of repeated sequences. However, because each transcript is represented by many different reads, these biases are expected to have minimal effects on quantification of a transcript.
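The reads per kilobase pair per million reads normalization mentioned above can be written out as a short calculation. The sketch assumes that reads have already been aligned and counted per gene; the function name `rpkm` and the example numbers are illustrative only.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase pair of transcript per million mapped reads.

    Dividing by transcript length corrects for long transcripts yielding more
    fragments; dividing by the total read count corrects for differences in
    sequencing depth between samples.
    """
    reads_per_kb = read_count / (transcript_length_bp / 1_000)
    return reads_per_kb / (total_mapped_reads / 1_000_000)


# A hypothetical 2,000-bp transcript with 500 reads in a sample of 10 million reads.
print(rpkm(500, 2_000, 10_000_000))
```

The same gene sequenced twice as deeply (1,000 reads out of 20 million) gives the same value, which is what allows expression levels to be compared between samples.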
Proteomics
Proteins are the molecular machines of cells. They catalyze biochemical reactions, monitor the internal and external environments of the cell and mediate responses to perturbations, and make up the structural components of cells. Some proteins are present at more or less the same levels in all cells of a multicellular individual or a population of unicellular organisms under most conditions, for example, proteins that make up ribosomes or the cytoskeleton. The levels of other proteins differ among cells according to the cells’ functions or change in response to developmental or environmental cues. Thus, analysis of the proteins that are present under particular biological conditions can provide insight into the activities of a cell or tissue.
Proteomics is the comprehensive study of all the proteins of a cell, tissue, body fluid, or organism from a variety of perspectives, including structure, function, expression profiling, and protein−protein interactions. There are several advantages to studying the protein complement (proteome) of cells or tissues compared to other genomic approaches. Although analysis of genomic sequences can often identify protein coding sequences, in many cases the function of a protein, and the posttranslational modifications that influence protein activity and cellular localization, cannot be predicted from the sequence. On the other hand, it may be possible to infer a protein’s function by determining the conditions under which it is expressed and active. While expression profiles of protein coding sequences can be determined using transcriptomics, mRNA levels do not always correlate with protein levels and do not indicate the presence of active proteins, and interactions between proteins cannot be assessed by these methods. Generally, mRNA is turned over rapidly, and therefore, transcriptomics measures actively transcribed genes, whereas proteomics monitors relatively more stable proteins. From a practical standpoint, proteomics can be used to identify proteins associated with a clinical disorder (protein biomarkers), especially in the early stages of disease development, that can aid in disease diagnosis or provide targets for treatment of disease.
Identification of Proteins
A cell produces a large number of different proteins that must first be separated in order to identify individual components of the proteome. To reduce the complexity, proteins are sometimes extracted from particular subcellular locations such as the cell membrane, nucleus, Golgi apparatus, endosomes, or mitochondria. Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) is an effective method to separate proteins in a population (Fig. 2.49A). Proteins in a sample are first separated on the basis of their net charge by electrophoresis through an immobilized pH gradient in one dimension (the first dimension) (Fig. 2.49A). Some amino acids in a polypeptide have side chains with ionizable groups that contribute to the net charge of a protein; the degree of ionization (protonation) is influenced by the pH of the solution. In a gel to which an electric current is applied, proteins migrate through a pH gradient until they reach a specific pH (the isoelectric point) where the overall charge of the protein is zero and they no longer move. A particular position in the pH gradient may be occupied by two or more proteins that have the same isoelectric point. However, the proteins often have different molecular weights and can be further separated according to their molecular mass by electrophoresis at right angles to the first dimension (the second dimension) through a sodium dodecyl sulfate (SDS)-polyacrylamide gel (Fig. 2.49B). The separated proteins form an array of spots in the gel that is visualized using Coomassie blue, silver, or fluorescent protein stains.
Figure 2.49 2D PAGE for separation of proteins. (A) First dimension. Isoelectric focusing is performed to first separate proteins in a mixture on the basis of their net charge. The protein mixture is applied to a pH gradient gel. When an electric current is applied, proteins will migrate either toward the anode (+) or cathode (–) depending on their net charge. As proteins move through the pH gradient, they will gain or lose protons until they reach a point in the gel where their net charge is zero. The pH in this position of the gel is known as the isoelectric point and is characteristic of a given protein. At that point, a protein no longer moves in the electric field. (B) Second dimension. Several proteins in a sample may have the same isoelectric point and therefore migrate to the same position in the gel in the first dimension. Therefore, proteins are further separated on the basis of differences in their molecular weights (MW) by electrophoresis, at a right angle to the first dimension, through a sodium dodecyl sulfate-polyacrylamide gel.
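The relationship between pH, ionizable groups, and the isoelectric point can be made concrete with a rough estimate computed from an amino acid sequence. The sketch below sums the fractional charge of each ionizable group and searches for the pH at which the total is zero. The pKa values are approximate, illustrative assumptions (published scales differ by several tenths of a pH unit), and effects of protein folding and posttranslational modification are ignored.

```python
# Approximate pKa values, assumed for illustration only.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5, "N_term": 8.6}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_term": 3.6}


def net_charge(sequence, ph):
    """Estimate the net charge of a polypeptide at a given pH."""
    counts = {aa: sequence.count(aa) for aa in "KRHDECY"}
    counts["N_term"] = 1
    counts["C_term"] = 1
    # Fraction of each basic group that is protonated (charge +1).
    positive = sum(counts[g] / (1 + 10 ** (ph - pka)) for g, pka in PKA_POSITIVE.items())
    # Fraction of each acidic group that is deprotonated (charge -1).
    negative = sum(counts[g] / (1 + 10 ** (pka - ph)) for g, pka in PKA_NEGATIVE.items())
    return positive - negative


def isoelectric_point(sequence, tolerance=0.001):
    """Find the pH of zero net charge by bisection between pH 0 and pH 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, so the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2


print(round(isoelectric_point("DDDDG"), 2))  # acidic peptide, low pI
print(round(isoelectric_point("KKKKG"), 2))  # basic peptide, high pI
```

Bisection is sufficient here because the estimated net charge decreases steadily as pH rises, so it crosses zero only once.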
Depending on the size of the two-dimensional polyacrylamide gel and the abundance of individual proteins, approximately 2,000 different proteins can be resolved. The pattern of spots is captured by densitometric scanning of the gel. Databases have been established with images of two-dimensional polyacrylamide gels from some different cell types, and software is available for detecting spots, matching patterns between gels, and quantifying the protein content of the spots. Proteins with either low or high molecular weights, those with highly acidic or basic isoelectric points (such as ribosomal proteins and histones), those that are found in cellular membranes, and those that are present in small amounts are not readily resolved by 2D PAGE.
After separation, individual proteins are excised from the gel and the identity of the protein is determined, usually by mass spectrometry (MS). A mass spectrometer detects the masses of the ionized form of a molecule. For identification, the protein is first fragmented into peptides by digestion with a protease, such as trypsin, that cleaves on the carboxyl side of lysine or arginine residues (Fig. 2.50). The peptides are ionized and separated according to their mass-to-charge (m/z) ratio, and then the abundance and m/z ratios of the ions are measured. Several mass spectrometers are available that differ in the type of sample analyzed, the mode of ionization of the sample, the method for generating the electromagnetic field that separates and sorts the ions, and the method of detecting the different masses. Peptide masses are usually determined by matrix-assisted laser desorption ionization–time of flight (MALDI-TOF) MS. To determine the m/z value of each peptide fragment generated from an excised protein by MALDI-TOF MS, the peptides are ionized by mixing them with a matrix consisting of an organic acid and then using a laser to promote ionization. The ions are accelerated through a tube by a high voltage, and the time required to reach the ion detector depends on their m/z ratio, with ions of lower m/z reaching the detector first.
Figure 2.50 Peptide mass fingerprinting. A spot containing an unknown protein that was separated by 2D PAGE is excised from the gel and treated with trypsin. Purified tryptic peptides are separated by MALDI-TOF MS. The set of peptide masses from the unknown protein is used to search a database that contains the masses of tryptic peptides for every known sequenced protein and the best match is determined. The trypsin cleavage sites of known proteins are determined from the amino acid sequence and, consequently, the masses of the tryptic peptides are easy to calculate. Only some of the tryptic peptide masses for the unknown protein are listed in this example.
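The dependence of flight time on mass can be illustrated with an idealized calculation in which every ion gains the same kinetic energy per unit charge during acceleration and then drifts through a field-free tube. The 20-kV accelerating voltage and 1-m tube length are assumed values chosen for illustration, not the specifications of any particular instrument.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mass_da, charge=1, voltage=20_000.0, tube_length_m=1.0):
    """Idealized drift time in seconds for an ion of the given mass and charge.

    The ion acquires kinetic energy z*e*V during acceleration, so its velocity
    is sqrt(2*z*e*V / m) and the flight time scales with sqrt(m / z).
    """
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage  # joules
    velocity = math.sqrt(2 * kinetic_energy / (mass_da * DALTON))
    return tube_length_m / velocity


for mass in (1_000, 2_000, 4_000):
    print(mass, flight_time(mass))
```

Because flight time grows with the square root of m/z, a fourfold heavier singly charged peptide takes twice as long to reach the detector.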
To facilitate protein identification, computer algorithms have been developed for processing large amounts of MS data. Databases have been established that contain the masses of tryptic peptides for all known proteins. The databases are searched to identify a protein whose peptide masses match the values of the peptide masses of an unknown protein that were determined by MALDI-TOF MS (Fig. 2.50). This type of analysis is called peptide mass fingerprinting.
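The logic of peptide mass fingerprinting can be sketched in a few lines. The example digests each database sequence in silico, calculates monoisotopic peptide masses, and counts how many observed masses fall within a tolerance of a theoretical mass. Several simplifications are assumed: trypsin is taken to cleave after every lysine or arginine with no missed cleavages, masses are for neutral unmodified peptides, and the scoring is a simple match count. The two-entry database and the names `protA` and `protB` are hypothetical.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Split a sequence after each K or R (simplified trypsin rule)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral peptide: residues plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def best_match(observed_masses, database, tolerance=0.5):
    """Return the database entry whose tryptic peptides match the most masses."""
    def score(sequence):
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        return sum(
            any(abs(m - t) <= tolerance for t in theoretical)
            for m in observed_masses
        )
    return max(database, key=lambda name: score(database[name]))


database = {"protA": "GAKSVR", "protB": "WWKFFR"}  # toy sequences
observed = [peptide_mass("GAK"), peptide_mass("SVR")]
print(tryptic_digest("GAKSVR"), best_match(observed, database))
```

Search programs used in practice apply statistical scoring and allow for missed cleavages and modifications, but the underlying comparison of observed and calculated peptide masses is the same.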
Protein Expression Profiling
Several methods have been developed to quantitatively compare the proteomes among samples. Two-dimensional differential in-gel electrophoresis is very similar to 2D PAGE; however, rather than separating proteins from different samples on individual gels and then comparing the maps of separated protein, proteins from two different samples are differentially labeled and then separated on the same two-dimensional polyacrylamide gel (Fig. 2.51). Typically, proteins from each sample are labeled with different fluorescent dyes (e.g., Cy3 and Cy5, which have higher sensitivity than many other protein stains); the labeled samples are mixed and then run together in the same gel, which overcomes the variability between separate gel runs. The two dyes carry the same mass and charge, and therefore, a protein labeled with Cy3 migrates to the same position as the identical protein labeled with Cy5. The Cy3 and Cy5 protein patterns are visualized separately by fluorescent excitation. The images are compared, and any differences are recorded. In addition, the ratio of Cy3 to Cy5 fluorescence for each spot is determined to detect proteins that are either up- or downregulated. Unknown proteins are identified by MS.
Figure 2.51 Protein expression profiling using 2D differential in-gel electrophoresis. The proteins of two different proteomes are labeled with fluorescent dyes Cy3 and Cy5, respectively. The labeled proteins from the two samples are combined and separated by 2D PAGE. The gel is scanned for each fluorescent dye, and the relative levels of the two dyes in each protein spot are recorded. Each spot with an unknown protein is excised for identification by MS.
Another powerful technique for comparing protein populations among samples utilizes protein microarrays. Protein microarrays are similar to DNA microarrays; however, rather than arrays of oligonucleotides, protein microarrays consist of large numbers of proteins immobilized in a known position on a surface such as a glass slide in a manner that preserves the structure and function of the proteins. The proteins arrayed on the surface can be antibodies specific for a set of proteins in an organism, purified proteins that were expressed from a DNA or cDNA library, short synthetic peptides, or multiprotein samples from cell lysates or tissue specimens. The arrayed proteins are probed with samples that contain molecules that interact with the proteins. For example, the interacting molecules can be other proteins to detect protein−protein interactions, nucleic acid sequences to identify proteins that regulate gene expression by binding to DNA or RNA, substrates for specific enzymes, or small protein-binding compounds such as lipids or drugs.
Microarrays consisting of immobilized antibodies are used to detect and quantify proteins present in a complex sample. Antibodies directed against more than 1,800 human proteins have been isolated, characterized, and validated, and subsets of these that detect specific groups of proteins such as cell signaling proteins can be arrayed. To compare protein profiles in two different samples, for example, in normal and diseased tissues, proteins extracted from the two samples are labeled with two different fluorescent dyes (e.g., Cy3 and Cy5) and then applied to one antibody microarray (Fig. 2.52). Proteins present in the samples bind to their cognate antibodies, and after washing to remove unbound proteins, the antibody-bound proteins are detected with a fluorescence scanner. Interpretation of the fluorescent signals that represent the relative levels of specific proteins in the two samples on a protein microarray is very similar to analysis of a DNA microarray.
Figure 2.52 Protein expression profiling with an antibody microarray. Proteins extracted from two different samples are labeled with fluorescent dyes Cy3 and Cy5, respectively. The labeled proteins are mixed and incubated with an array of antibodies immobilized on a solid support. Proteins bound to their cognate antibodies are detected by measuring fluorescence, and the relative levels of specific proteins in each sample are determined.
To increase the sensitivity of the assay and therefore the detection of low-abundance proteins, or to detect a specific subpopulation of proteins, a “sandwich”-style assay is often employed (Fig. 2.53). In this case, unlabeled proteins in a sample are bound to an antibody microarray, and then a second, labeled antibody is applied. This approach has been used to determine whether particular posttranslational protein modifications such as phosphorylation of tyrosine or glycosylation are associated with specific diseases. Serum proteins are first captured by immobilized antibodies on a microarray. Then, an antiphosphotyrosine antibody is applied that binds only to tyrosine phosphorylated proteins (Fig. 2.53A). The antiphosphotyrosine antibody is tagged, for example, with a biotin molecule, and fluorescently labeled streptavidin, which binds specifically to biotin, is added to detect the phosphorylated protein. In a similar manner, glycosylated proteins can be detected with lectins (Fig. 2.53B). Lectins are carbohydrate-binding proteins, many of which are obtained from plants, that bind to specific carbohydrate moieties on the surface of proteins or cell membranes, and many different lectins with affinities for different glycosyl groups (glycans) are available.
Figure 2.53 Detection of post-translational modifications with antibody microarrays. (A) Detection of tyrosine phosphorylation. An antibody microarray (1) is incubated with a protein sample (2). Biotinylated antiphosphotyrosine antibody is added (3) and, for visualization, a streptavidin-fluorescent dye conjugate attaches to the biotin of the antiphosphotyrosine antibody (4). (B) Detection of glycan groups. An antibody microarray (1) is incubated with a protein sample (2). A biotinylated molecule (e.g., lectin) that binds to a specific glycan is added (3) and, for visualization, a streptavidin-fluorescent dye conjugate attaches to the biotin of the lectin (4).
In another type of microarray, purified proteins representing as many proteins of a proteome under study as possible are arrayed on a solid support and then probed with antibodies in serum samples collected from healthy (control) and diseased individuals. The purpose of these studies is to discover whether individuals produce antibodies that correlate with particular diseases or biological processes. For example, the differential expression of antibodies in serum samples from individuals with and without Alzheimer disease was tested using a microarray consisting of more than 9,000 unique human proteins (Fig. 2.54). After incubation of the serum samples with the protein microarray, bound antibodies were detected using a fluorescently labeled secondary antibody that interacts specifically with human antibodies. The screen resulted in the identification of 10 autoantibodies (i.e., directed against an individual’s own protein) that may be used as biomarkers to diagnose Alzheimer disease. Protein microarrays can also be used to identify proteins that interact with therapeutic drugs or other small molecules (Fig. 2.55). This can aid in determining the mechanism of action of a drug, for assessing responsiveness among various forms of a target protein (e.g., variants produced by different individuals), and for predicting undesirable side effects.
Figure 2.54 Identification of disease biomarkers with a human protein microarray. Serum samples are collected from diseased and healthy individuals and incubated with microarrays of purified human proteins. Serum autoantibodies bind to specific proteins on the microarray and are detected by applying a fluorescently labeled secondary antibody directed against human antibodies. Autoantibodies present in the serum from diseased individuals but not in serum from healthy individuals are potential biomarkers that can be used in diagnosis of the disease.
Figure 2.55 Protein microarrays to detect protein-drug interactions. Therapeutic drugs or other small molecules tagged with a fluorescent dye are applied to purified proteins arrayed on a solid support.
Protein−Protein Interactions
Proteins typically function as complexes composed of different interacting protein subunits. Important cellular processes such as DNA replication, energy metabolism, and signal transduction are carried out by large multiprotein complexes. Thousands of protein-protein interactions occur in a cell. Some of these are short-lived, while others form stable multicomponent complexes that may interact with other complexes. Determining the functional interconnections among the members of a proteome is not an easy task. Several strategies have been developed to examine protein interactions, including protein microarrays, two-hybrid systems, and tandem affinity purification methods.
The two-hybrid method that was originally devised for studying the yeast proteome has been used extensively to determine pairwise protein-protein interactions in both eukaryotes and prokaryotes. The underlying principle of this assay is that the physical connection between two proteins reconstitutes an active transcription factor that initiates the expression of a reporter gene. The transcription factors employed for this purpose have two domains. One domain (DNA-binding domain) binds to a specific DNA site, and the other domain (activation domain) activates transcription (Fig. 2.56A). The two domains are not required to be part of the same protein to function as an effective transcription factor. However, the activation domain alone will not bind to RNA polymerase to activate transcription. Connection with the DNA-binding domain is necessary to place the activation domain in the correct orientation and location to initiate transcription of the reporter gene by RNA polymerase.
Figure 2.56 Two-hybrid assay for detecting pairwise protein interactions. (A) The DNA binding domain of a transcription factor binds to a specific sequence in the regulatory region of a gene which orients and localizes the activation domain that is required for the initiation of transcription of the gene by RNA polymerase. (B) The coding sequences for the DNA binding domain and the activation domain are fused to DNA X and DNA Y, respectively, and both constructs (hybrid genes) are introduced into a cell. After translation, the DNA binding domain-protein X fusion protein binds to the regulatory sequence of a reporter gene. However, protein Y (prey) does not interact with protein X (bait) and the reporter gene is not transcribed because the activation domain does not, on its own, associate with RNA polymerase. (C) The coding sequence for the activation domain is fused to the DNA for protein Z (DNA Z) and transformed into a cell containing the DNA binding domain-DNA X fusion construct. The proteins encoded by the hybrid genes interact and the activation domain is properly oriented to initiate transcription of the reporter gene demonstrating a specific protein-protein interaction.
For a two-hybrid assay, the coding sequences of the DNA-binding and activation domains of a specific transcription factor are cloned into separate vectors (Fig. 2.56). Often, the Gal4 transcriptional factor from Saccharomyces cerevisiae or the bacterial LexA transcription factor is used. A DNA (or cDNA generated from a eukaryotic mRNA) sequence that is cloned in frame with the DNA-binding domain sequence produces a fusion (hybrid) protein and is referred to as the “bait.” This is the target protein for which interacting proteins are to be identified. Another DNA sequence is cloned into another vector in frame with the activation domain coding sequence. A protein attached to the activation domain is called the “prey” and potentially interacts with the bait protein. Host yeast cells are transformed with both bait and prey DNA constructs. After expression of the fusion proteins, if the bait and prey do not interact, then there is no transcription of the reporter gene (Fig. 2.56B). However, if the bait and prey proteins interact, then the DNA-binding and activation domains are also brought together. This enables the activation domain to make contact with RNA polymerase and activate transcription of the reporter gene (Fig. 2.56C). The product of an active reporter gene may produce a colorimetric response or may allow a host cell to proliferate in a specific medium.
For a whole-proteome protein interaction study, two libraries are prepared, each containing thousands of cDNAs generated from total cellular mRNA (or genomic DNA fragments in a study of proteins from a prokaryote). To construct the bait library, cDNAs are cloned into the vector adjacent to the DNA sequence for the DNA-binding domain of the transcription factor Gal4 and then introduced into yeast cells. To construct the prey library, the cDNAs are cloned into the vector containing the sequence for the activation domain, and the constructs are transferred to yeast cells. The libraries are typically screened for bait-prey protein interactions in one of two ways. In one method, a prey library of yeast cells is arrayed on a grid. The prey library is then screened for the production of proteins that interact with a bait protein by introducing individual bait constructs to the arrayed clones by mating (Fig. 2.57A). Alternatively, each yeast clone in a bait library is mated en masse with a mixture of strains in the prey library, and then positive interactions are identified by screening colonies on plates for activation of the reporter gene (Fig. 2.57B). Challenges with using the two-hybrid system for large-scale determination of protein−protein interactions include the inability to clone all possible protein coding genes in frame with the activation and DNA-binding domains, which leads to missed interactions (false negatives), and the detection of interactions that do not normally occur in their natural environments within the original cells and therefore are not biologically relevant (false positives). Nonetheless, this approach has been used to successfully identify interacting proteins in a wide range of organisms from bacteria to humans.
Figure 2.57 Large-scale screens for protein interactions using the yeast two-hybrid system. Two libraries are prepared, one containing genomic DNA fragments fused to the coding sequence for the DNA-binding domain of a transcription factor (bait library) and another containing genomic DNA fragments fused to the activation domain of the transcription factor (prey library). Two methods are commonly used to screen for pairwise protein interactions. (A) Individual yeast strains in the bait library are mated with each yeast strain in an arrayed prey library. Resulting strains in the array that produce bait and prey proteins that interact are detected by assaying for reporter gene activation (activated cells growing in a multiwell plate are indicated in green). (B) Yeast strains in the prey library are mated en masse with individual strains in the bait library. The mixture of strains is screened for reporter gene activity that identifies strains with interacting bait and prey proteins (green).
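The logic of the arrayed screen can be sketched in a few lines of Python. This is a toy model, not a description of any real screening software: the set of true interactions is supplied by hand (in an experiment it is the unknown being measured), and the function names are hypothetical. The proteins X, Y, and Z follow the labels used in Figure 2.56.

```python
def reporter_active(bait, prey, interactions):
    """The reporter is transcribed only when bait and prey interact,
    bringing the DNA-binding and activation domains together."""
    return frozenset((bait, prey)) in interactions


def array_screen(baits, prey_array, interactions):
    """Mate each bait strain with every strain in the arrayed prey library
    and record the prey positions at which the reporter gene is activated."""
    hits = {}
    for bait in baits:
        positives = [prey for prey in prey_array
                     if reporter_active(bait, prey, interactions)]
        if positives:
            hits[bait] = positives
    return hits
```

With `interactions = {frozenset(("X", "Z"))}`, screening bait X against an array holding prey Y and prey Z reports a hit only at Z, which mirrors panels B and C of Figure 2.56. False negatives and false positives, discussed above, are not modeled.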
Instead of studying pairwise protein interactions, the tandem affinity purification tag procedure is designed to capture multiprotein complexes and then identify the components with MS (Fig. 2.58). In this method, a DNA (or cDNA) sequence that encodes the bait protein is fused to a DNA sequence that encodes two small peptides (tags) separated by a protease cleavage site. The peptide tags bind with a high affinity to specific molecules and facilitate purification of the target protein. A “two-tag” system allows two successive rounds of affinity binding to ensure that the target and its associated proteins are free of any nonspecific proteins. Alternatively, a “one-tag” system with a small protein tag that is immunoprecipitated with a specific antibody requires only a single purification step. In a number of trials, the tags did not alter the function of various test proteins.
Figure 2.58 Tandem affinity purification to detect multiprotein complexes. The coding region of a cDNA (cDNA X) is cloned into a vector in frame with two DNA sequences (tag 1 and tag 2), each encoding a short peptide that has a high affinity for a specific matrix. The tagged cDNA construct is introduced into a host cell, where it is transcribed and the mRNA is translated. Other cellular proteins bind to the protein encoded by cDNA X (protein X). The complex consisting of protein X and its interacting proteins (colored shapes) is separated from other cellular proteins by the binding of tag 1 to an affinity matrix which is usually fixed to a column. The protein complex is retained on the column and the noninteracting proteins flow through. The complex is then eluted from the affinity matrix by cleaving off tag 1 with a protease, and a second purification step is carried out with tag 2 and its affinity matrix. The proteins of the complex are separated by one-dimensional PAGE. Single bands are excised from the gel and identified by MS.
A DNA–two-tag construct is introduced into a host cell, where it is expressed and a tagged protein is synthesized (Fig. 2.58). The underlying assumption is that the cellular proteins that normally interact with the native protein in vivo will also combine with the tagged protein. After the cells are lysed, the tagged protein and any interacting proteins are purified using the affinity tags. The proteins of the complex are separated according to their molecular weight by PAGE and identified with MS. Computer programs are available for generating maps of complexes with common proteins, assigning proteins with shared interrelationships to specific cellular activities, and establishing the links between multiprotein complexes.
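One simple way such mapping programs could link complexes is by looking for proteins that co-purify with more than one bait. The sketch below assumes the purification results are already available as a dictionary from each tagged bait to the set of proteins identified by MS; the function name and data layout are hypothetical and do not correspond to any particular published program.

```python
from itertools import combinations


def shared_protein_links(complexes):
    """Link pairs of purified complexes that have proteins in common.

    complexes: dict mapping a bait name to the set of proteins that
    co-purified with it. Returns a dict mapping (bait_a, bait_b) to the
    set of proteins found in both complexes.
    """
    links = {}
    for a, b in combinations(sorted(complexes), 2):
        shared = complexes[a] & complexes[b]
        if shared:
            links[(a, b)] = shared
    return links
```

Each returned pair is an edge in a map of complexes, and the shared proteins are candidates for components that connect two cellular activities.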
Metabolomics
Metabolomics is a technique that provides a snapshot of the small molecules present in a complex biological sample. The metabolites present in cells and cell secretions are influenced by genotype, which determines the metabolic capabilities of an organism, and by environmental conditions such as the availability of nutrients and the presence of toxins or other stressors. Metabolite composition varies depending on the developmental and health status of an organism, and therefore, a comprehensive metabolite profile can identify molecules that reflect a particular physiological state. For example, metabolites present in diseased cells but not in healthy cells are useful biomarkers for diagnosing and monitoring disease. Metabolic profiles can also aid in understanding drug metabolism, which may reduce the efficacy of a treatment, or in understanding drug toxicity, which can help to reduce adverse drug reactions. Metabolomic analysis can be used to determine the catalytic activity of proteins, for example, by quantifying changes in metabolite profiles in response to mutations in enzyme coding genes and to connect metabolic pathways that share common intermediates.
Biological samples for metabolite analysis may be cell or tissue lysates, body fluids such as urine or blood, or cell culture media that contain a great diversity of metabolites. These include building blocks for biosynthesis of cellular components such as amino acids, nucleotides, and lipids. Also present are various substrates, cofactors, regulators, intermediates, and end products of metabolic pathways such as carbohydrates, vitamins, organic acids, amines and alcohols, and inorganic molecules. These molecules have very different properties, and therefore comprehensive detection and quantification using a single method based on chemical characteristics presents a challenge.
Metabolomics employs spectroscopic techniques such as MS and nuclear magnetic resonance (NMR) spectroscopy to identify and quantify the metabolites in complex samples. Often, multiple methods are used in parallel to obtain a comprehensive view of a metabolome. In a manner similar to protein identification described above, MS measures the m/z ratio of charged metabolites. The molecules may be ionized by various methods before separation of different ions in an electromagnetic field. MS is typically coupled with chromatographic techniques that first separate metabolites based on their properties. For example, MS may be coupled with gas chromatography to separate volatile metabolites. Some nonvolatile metabolites, such as amino acids, are chemically modified (derivatized) to increase their volatility. Liquid chromatography separates metabolites dissolved in a liquid solvent based on their characteristic retention times as they move through an immobilized matrix.
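The quantity measured by MS can be illustrated with a short calculation. The sketch assumes one common case, an ion formed by adding protons to a neutral metabolite; other ionization methods produce other ion types, so this is an example rather than a general rule. The function name is hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz_protonated(neutral_mass, charge=1):
    """Return m/z for an ion formed by adding `charge` protons to a
    neutral molecule of monoisotopic mass `neutral_mass` (in daltons)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge
```

Sarcosine and alanine share the formula C3H7NO2 (monoisotopic mass about 89.0477 Da), so `mz_protonated(89.0477)` gives the same value, about 90.055, for both. Isomers like these cannot be told apart by m/z alone, which is one reason MS is coupled with a chromatographic separation step.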
NMR spectroscopy is based on the principle that in an applied magnetic field, molecules (more precisely, atomic nuclei with an odd mass number) absorb and emit electromagnetic energy at a characteristic resonance frequency that is determined by their structure. Thus, the resonance frequencies provide detailed information about the structure of a molecule and enable differentiation among molecules with different structures, even when the difference is very small, such as between structural isomers. In contrast to MS, an initial metabolite separation step is not required, and NMR can measure many different types of molecules in a single analysis. In addition, NMR is not destructive, and in fact, it has been adapted to visualize molecules in living human cells in the diagnostic procedure magnetic resonance imaging (MRI). A drawback of NMR is low sensitivity, which means that it does not detect low-abundance molecules.
An illustration of the application of metabolome analysis is the identification of metabolites that are associated with the progression of prostate cancer to metastatic disease. Researchers compared more than 1,000 metabolites in benign prostate tissue, localized prostate tumors, and metastatic tumors from several tissues using MS combined with liquid and gas chromatography. Sixty metabolites were found in localized prostate and/or metastatic tumors but not in benign prostate tissue, and six of these were significantly higher in the metastatic tumors. The metabolite profile indicated that progression of prostate cancer to metastatic disease was associated with an increase in amino acid metabolism. In particular, levels of sarcosine, a derivative of the amino acid glycine, were much higher in the metastatic tumors than in localized prostate cancer tissue and were not detectable in noncancerous tissue (Fig. 2.59). Moreover, sarcosine levels were higher in the urine of men with prostate tissue biopsies that tested positive for cancer than in that of biopsy-negative controls, and higher in prostate cancer cell lines than in benign cell lines. Benign prostate epithelial cells exposed to sarcosine became motile and were more invasive than cells treated with alanine as a control. From this analysis, sarcosine appears to play a key role in cancer cell invasion and shows promise as a biomarker for progression of prostate cancer and as a target for prevention.
Figure 2.59 Metabolite profiles of benign prostate, localized prostate cancer, and metastatic tumor tissues. The relative levels of a subset of 50 metabolites are shown in each row. Levels of a metabolite in each tissue (columns) were compared to the median metabolite level (black); shades of yellow represent increased levels, and shades of blue indicate decreased levels. Metastatic samples were taken from soft (A), rib or diaphragm (B), or liver (C) tissues. Modified with permission from Macmillan Publishers Ltd. from Sreekumar et al., Nature. 457:910–914, 2009.
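The comparison to the median described in the caption can be sketched as follows. The caption does not state how the differences were scaled; expressing them as log2 ratios is an assumption made here because it is a common convention for heat maps. The function name and the tissue labels in the usage note are hypothetical.

```python
import math
from statistics import median


def relative_to_median(levels):
    """Express one metabolite's level in each sample as a log2 ratio to the
    median level across samples (0 = median, > 0 increased, < 0 decreased).

    levels: dict mapping a sample name to a positive measured level.
    """
    med = median(levels.values())
    return {sample: math.log2(value / med) for sample, value in levels.items()}
```

For levels of 1.0, 2.0, and 8.0 in benign, localized, and metastatic samples, the median is 2.0 and the function returns -1.0, 0.0, and 2.0, which would be drawn as blue, black, and yellow, respectively.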
Summary
Molecular biotechnology comprises a large number of fundamental techniques to identify, isolate, transfer, and express specific genes in a variety of host organisms. The tools for these processes were developed from an understanding of the biochemistry, genetics, and molecular biology of cells, especially prokaryotic cells, and viruses. Molecular cloning is the process of inserting a gene or other DNA sequence isolated from one organism into a vector and introducing it into a host cell. The discovery of restriction endonucleases was essential for this process, as it enabled predictable and reproducible cleavage of both target (insert) and vector DNAs in preparation for joining the two molecules. A restriction endonuclease is a protein that binds to double-stranded DNA at a specific nucleotide sequence and cleaves a phosphodiester bond in each of the DNA strands within the recognition sequence. Digestion of target and vector DNA with the same restriction endonuclease generates compatible single-stranded extensions that can be joined by complementary base-pairing and the activity of the enzyme DNA ligase that catalyzes the formation of phosphodiester bonds. Another cloning method known as recombinational cloning does not utilize restriction endonucleases or DNA ligase for insertion of target DNA into a vector but, rather, exploits a system used by some viruses to integrate into the host genome via recombination at specific attachment sequences.
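The predictability of restriction endonuclease cleavage can be illustrated with a short sketch that cuts a linear DNA sequence at every occurrence of a recognition site. EcoRI, which recognizes GAATTC and cuts after the G, is used as the default; it is an illustrative choice rather than an enzyme named in this summary. Only the top strand is modeled, so the single-stranded extensions are implied rather than shown, and the function name is hypothetical.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut the top strand of a linear DNA molecule at each recognition site.

    cut_offset is the number of bases of the site that stay on the upstream
    fragment (1 for EcoRI, which cuts between G and AATTC).
    """
    dna = dna.upper()
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        fragments.append(dna[start:pos + cut_offset])
        start = pos + cut_offset
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments
```

Digesting `"AAGAATTCTTGAATTCCC"` yields `["AAG", "AATTCTTG", "AATTCCC"]`. Every fragment produced by a cut begins with AATTC on the top strand, reflecting the identical, compatible ends that allow target and vector DNA cut with the same enzyme to be joined.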
Cloned DNA is introduced into host cells, often bacterial cells that are competent to take up exogenous DNA, a process known as transformation. Vectors that carry the target DNA into the host cell are usually derived from natural bacterial plasmids that have been genetically engineered with several unique endonuclease recognition sequences (multiple-cloning sites) to facilitate cloning. A vector can be propagated in a host cell if it possesses a DNA sequence (origin of replication) that enables it to replicate in the host. Transformation is generally inefficient; however, transformed cells may be distinguished from nontransformed cells by testing for the activity of genes that are present on the vector, including genes for resistance to antibiotics or synthesis of colored products.
To clone and express genes that encode eukaryotic proteins in a bacterial host, the introns must first be removed. Purified mRNA, which does not contain introns, is used as a template for the synthesis of cDNA by the enzyme reverse transcriptase. Oligonucleotide primers can be designed to target a specific mRNA for cDNA synthesis or to anneal to the poly(A) tails present on most eukaryotic mRNAs to generate a cDNA library that contains all of the protein coding sequences expressed by a source eukaryote under a given set of conditions. Construction of a genomic DNA library from a prokaryote is more straightforward and entails cleaving the DNA to obtain overlapping fragments for cloning. Libraries are screened by a variety of methods to identify clones with a particular sequence or that produce a target protein.
The CRISPR-Cas system is used to edit intact genomes in vivo. In this system, an sgRNA that contains a 20-nucleotide sequence complementary to the target site in the genome is introduced into the host cell together with the endonuclease Cas9. The sgRNA guides Cas9 to the genomic target site, adjacent to a specific PAM sequence, which is cleaved by the endonuclease. Double-stranded DNA cleavage activates cellular DNA repair systems that result in nucleotide deletions that disrupt the target gene. If specific donor DNA is also introduced into the cell, it may be incorporated into the genome by recombination between the target sequence and homologous sequences flanking the donor DNA. The CRISPR-Cas technology can be used to disrupt, insert, alter the regulation of, or tag a gene in the genomes of microorganisms or multicellular organisms.
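Choosing a target site amounts to finding a 20-nucleotide stretch that lies next to a PAM. The sketch below assumes the PAM of the widely used Streptococcus pyogenes Cas9, NGG located immediately 3′ of the target sequence; the summary above does not specify which PAM is meant, so treat this as an assumption. Only the strand supplied is searched, and the function name is hypothetical.

```python
import re


def find_target_sites(genome, guide_len=20):
    """List candidate Cas9 target sites on the given strand.

    Returns (position, target_sequence, pam) for every stretch of
    `guide_len` nucleotides that is immediately followed by an NGG PAM.
    """
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2))
            for m in pattern.finditer(genome.upper())]
```

A practical design tool would also search the complementary strand and check each candidate against the rest of the genome for similar sequences; both steps are omitted here.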
Amplification, synthesis, and sequencing of DNA are also fundamental tools of molecular biotechnology. PCR is a powerful method for generating millions of copies of a specific sequence of DNA from very small amounts of starting material. Amplification is achieved in 30 or more successive cycles of template DNA denaturation, annealing of two oligonucleotide primers to complementary sequences flanking a target gene in the single-stranded DNA, and DNA synthesis extending from the primer by a thermostable DNA polymerase. Among innumerable applications, PCR can be used to detect or quantify a specific nucleotide sequence in a complex biological sample or to obtain large amounts of a particular DNA sequence either for cloning or for sequencing.
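The scale of amplification follows from the fact that each cycle can at most double the number of target molecules. A minimal sketch of that arithmetic is shown below; the `efficiency` parameter (the fraction of templates copied per cycle) and the function name are introduced here for illustration.

```python
def pcr_copies(initial_templates, cycles, efficiency=1.0):
    """Theoretical number of target copies after a number of PCR cycles.

    efficiency = 1.0 corresponds to perfect doubling in every cycle.
    """
    return initial_templates * (1.0 + efficiency) ** cycles
```

Starting from a single template, `pcr_copies(1, 30)` is 2 to the power 30, or roughly one billion copies. This is an upper bound: in a real reaction, amplification slows in later cycles as primers and nucleotides are consumed.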
Oligonucleotides that are used as primers for PCR are produced in a stepwise method using phosphoramidites. To make double-stranded DNA molecules, two oligonucleotides with complementary sequences are synthesized separately then allowed to anneal. In addition to primers for PCR and DNA sequencing, oligonucleotides are used as adaptors to add specific sequences to the ends of DNA fragments, such as restriction endonuclease recognition sites for cloning, and to synthesize entire genes.
The nucleotide sequence of a gene can reveal useful information about the function, regulation, and evolution of the gene. The sequencing technologies currently used involve (i) addition of nucleotides by DNA polymerase to a primer based on complementarity to a template DNA fragment and (ii) detection and identification of the nucleotide(s) added. The dideoxynucleotide method has been used for several decades to sequence genes and whole genomes. This method relies on the incorporation of a synthetic dideoxynucleotide that lacks a 3′ hydroxyl group into a growing DNA strand, which terminates DNA synthesis. Conditions are optimized so that the dideoxynucleotides are incorporated randomly, producing DNA fragments of different lengths that terminate with one of the four dideoxynucleotides, each tagged with a different fluorescent dye. The fragments are separated according to their size by electrophoresis, and the sequence of fluorescent signals is determined and converted into a nucleotide sequence. Pyrosequencing entails correlating the release of pyrophosphate, which is recorded as the emission of light, with the incorporation of a particular nucleotide into a growing DNA strand. Sequencing using reversible chain terminators also reveals the sequence of a DNA fragment by detecting single-nucleotide extensions; however, in contrast to pyrosequencing, the four nucleotides are added to the reaction together in each cycle, and after the unincorporated nucleotides are washed away, the nucleotide incorporated by DNA polymerase is distinguished by its fluorescent signal. The fluorescent dye and a blocking group that prevents addition of more than one nucleotide during each cycle are chemically cleaved, and the cycle is repeated. In another method, sequences are obtained from single DNA molecules captured by a DNA polymerase attached to the bottom of a very low volume well. Emissions from fluorescently tagged nucleotides held in the polymerase’s active site are detected in real time by a narrowly focused laser, before the cleaved pyrophosphate and attached fluorophore diffuse away.
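The read-out step of the dideoxynucleotide method can be sketched as a sorting problem. In this idealized model, every possible fragment length is present exactly once and each fragment is represented by its length and the dye-labeled base at its end; the function names are hypothetical.

```python
def chain_terminated_fragments(new_strand):
    """Model the products of a dideoxy reaction: one fragment per position,
    recorded as (length, terminal dye-labeled base)."""
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Order fragments by size, as electrophoresis does, and read the
    terminal bases from shortest to longest."""
    return "".join(base for _, base in sorted(fragments))
```

Whatever order the fragments are supplied in, sorting by length recovers the sequence of the newly synthesized strand: `read_sequence(chain_terminated_fragments("GATTACA"))` returns `"GATTACA"`.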
Next-generation sequencing technologies, together with cell-free methods to generate a library of genomic DNA sequencing templates in a dense array on a solid surface, have enabled rapid and inexpensive acquisition of genome sequences. Hundreds of millions of short nucleotide reads can be acquired simultaneously (massive parallelization) and assembled into contigs. Using these approaches, the genome sequences of thousands of organisms from all domains of life, including the metagenomes of entire communities of microorganisms in an environmental sample, have been completed or are in progress. The next steps are to annotate the sequence features and to determine the functions of the genes encoded in the genomes by investigating patterns of transcription (transcriptomics), protein synthesis (proteomics), and small-molecule production (metabolomics) using a variety of techniques such as DNA and protein microarray analysis, RNA sequencing, 2D PAGE, mass spectrometry, and NMR. Comparison of genome sequences can reveal the genetic basis of a disease, the mechanism of pathogenicity of a microbe, or the evolutionary relationships among organisms, while transcript, protein, and metabolite profiles can identify biomarkers for diagnosis and treatment of disease.
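How overlapping reads are joined into a contig can be illustrated with a toy greedy procedure that repeatedly merges the two reads sharing the longest suffix-to-prefix overlap. This is a sketch of the idea only: it assumes error-free reads from a single strand, and genome assemblers use more elaborate graph-based methods. The function names and the minimum overlap are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that is a prefix of `b`
    (0 if shorter than min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, best overlap first, until none overlap."""
    reads = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return reads  # the remaining sequences are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
```

For the three reads `"ATGGCGT"`, `"GCGTACC"`, and `"TACCTGA"`, each neighboring pair overlaps by four nucleotides, and the procedure returns the single contig `"ATGGCGTACCTGA"`.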
REFERENCES
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 456:53−59.
Bondy-Denomy J, Davidson AR. 2014. To acquire or resist: the complex biological effects of CRISPR-Cas systems. Trends Microbiol. 22:218−225.
Brasch MA, Hartley JL, Vidal M. 2004. ORFeome cloning and systems biology: standardized mass production of the parts from the parts-list. Genome Res. 14:2001−9.
Cello J, Paul AV, Wimmer E. 2002. Chemical synthesis of poliovirus cDNA: generation of infectious virus in the absence of natural template. Science. 297:1016−18.
Chen CS, Korobkova E, Chen H, Zhu J, Jian X, Tao SC, He C, Zhu H. 2008. A proteome chip approach reveals new DNA damage recognition activities in Escherichia coli. Nat Methods. 5:69−74.
Cohen SN, Chang ACY, Boyer HW, Helling RB. 1973. Construction of biologically functional bacterial plasmids in vitro. Proc Natl Acad Sci USA. 70:3240–44.
Cong L, Ran FA, Cox D, Lin SL, Barretto R, Habib N, Hsu PD, Wu XB, Jiang WY, Marraffini LA, Zhang F. 2013. Multiplex genome engineering using CRISPR/Cas systems. Science. 339:819−23.
Czar MJ, Anderson JC, Bader JS, Peccoud J. 2008. Gene synthesis demystified. Trends Biotechnol. 27:63−72.
Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science. 323:133−38.
Fodor SP, Rava RP, Huang XC, Pease AC, Holmes CP, Adams CL. 1993. Multiplexed biochemical assays with biological chips. Nature. 364:555−56.
Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, et al. 2005. Protein interaction mapping: a Drosophila case study. Genome Res. 15:376−84.
Gibson DG. 2014. Programming biological operating systems: genome design, assembly and activation. Nat Methods. 11:521−26.
Gibson DG, Benders GA, Andrews-Pfannkoch C, Denisova EA, Baden-Tillson H, Zaveri J, Stockwell TB, Brownley A, Thomas DW, Algire MA, et al. 2008. Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science. 319:1215–20.
Gibson DG, Glass JI, Lartigue C, Noskov VN, Chuang RY, Algire MA, Benders GA, Montague MG, Ma L, Moodie MM, et al. 2010. Creation of a bacterial cell controlled by a chemically synthesized genome. Science. 329:52–6.
Gibson DG, Young L, Chuang R-Y, Venter JC, Hutchison CA, Smith HO. 2009. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat Methods. 6:343−45.
Haab BB. 2006. Applications of antibody array platforms. Curr Opin Biotechnol. 17:415−21.
Handelsman J. 2004. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 68:669−85.
Hudson ME, Pozdnyakova I, Haines K, Mor G, Snyder M. 2007. Identification of differentially expressed proteins in ovarian cancer using high-density protein microarrays. Proc Natl Acad Sci USA. 104:17494−9.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature. 409:860–921.
Itakura K, Rossi JJ, Wallace RB. 1984. Synthesis and use of synthetic oligonucleotides. Annu Rev Biochem. 53:323–56.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 98:4569–74.
Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna JA, Charpentier E. 2012. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 337:816–21.
Ju J, Kim DH, Bi L, Meng Q, Bai X, Li Z, Li X, Marma MS, Shi S, Wu J, et al. 2006. Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proc Natl Acad Sci USA. 103:19635−40.
Köcher T, Superti-Furga G. 2007. Mass spectrometry-based functional proteomics: from molecular machines to protein networks. Nat Methods. 4:807−15.
Lederer SL, Walters KA, Proll S, Paeper B, Robinzon S, Boix L, Fausto N, Bruix J, Katze MG. 2006. Distinct cellular responses differentiating alcohol- and hepatitis C virus-induced liver cirrhosis. Virol J. 3:98.
Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. 2003. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 299:682−6.
Li Y. 2011. The tandem affinity purification technology: an overview. Biotechnol Lett. 33:1487–99.
Mali P, Yang LH, Esvelt KM, Aach J, Guell M, DiCarlo JE, Norville JE, Church GM. 2013. RNA-guided human genome engineering via Cas9. Science. 339:823−6.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 437:376−80.
Martin JA, Wang Z. 2011. Next-generation transcriptome assembly. Nat Rev Genet. 12:671–82.
Maxam AM, Gilbert W. 1977. A new method for sequencing DNA. Proc Natl Acad Sci USA. 74:560–4.
Mertz JE, Davis RW. 1972. Cleavage of DNA by R1 restriction endonuclease generates cohesive ends. Proc Natl Acad Sci USA. 69:3370–4.
Mirete S, Mora-Ruiz MR, Lamprecht-Grandío M, de Figueras CG, Rosselló-Móra R, González-Pastor JE. 2015. Salt resistant genes revealed by functional metagenomics from brines and moderate-salinity rhizosphere within a hypersaline environment. Front Microbiol. 6:1121.
Mullis KB, Faloona FA, Scharf SJ, Saiki RK, Horn GT, Erlich HA. 1986. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harbor Symp Quant Biol. 51:263–73.
Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, et al. 2006. Sequencing and analysis of Neanderthal genomic DNA. Science. 314:1113–8.
Parrish JR, Gulyas KD, Finley RL, Jr. 2006. Yeast two-hybrid contributions to interactome mapping. Curr Opin Biotechnol. 17:387–93.
Roberts RJ. 2005. How restriction enzymes became the workhorses of molecular biology. Proc Natl Acad Sci USA. 102:5905–8.
Rothberg JM, Leamon JH. 2008. The development and impact of 454 sequencing. Nat Biotechnol. 26:1117−24.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, et al. 2007. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 5:e77.
Saiki RK, Gelfand DH, Stoffel S, Scharf S, Higuchi R, Horn GT, Mullis KB, Erlich HA. 1988. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science. 239:487–91.
Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA. 74:5463–7.
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analysis. Nat Rev Genet. 15:121−32.
Sreekumar A, Poisson LM, Rajendiran TM, Khan AP, Cao Q, Yu J, Laxman B, Mehra R, Lonigro RJ, Li Y, et al. 2009. Metabolic profiles delineate potential role for sarcosine in prostate cancer progression. Nature. 457:910–4.
The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature. 526:68–74.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 428:37−43.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. 2001. The sequence of the human genome. Science. 291:1304–51.
Walhout AJ, Temple GF, Brasch MA, Hartley JL, Lorson MA, van den Heuvel S, Vidal M. 2000. GATEWAY recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes. Methods Enzymol. 328:575–92.
Wang Z, Gerstein M, Snyder M. 2009. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10:57–63.
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature. 452:872–7.
Williams RJ. 2003. Restriction endonucleases. Mol Biotechnol. 23:225–43.
Yanisch-Perron C, Vieira J, Messing J. 1985. Improved M13 phage cloning vectors and host strains: nucleotide sequences of the M13mp18 and pUC19 vectors. Gene. 33:103–19.
Review Questions
1. Describe a strategy using restriction endonucleases to clone a bacterial gene into a vector for propagation in E. coli. Assume that the target sequence is known. Describe the selection for E. coli cells that carry the cloned gene. Consider methods to minimize unwanted products.
2. Describe the features that make pUC19 a useful cloning vector.
3. Outline a strategy to clone a eukaryotic gene into a vector for expression in E. coli. Briefly describe the activity of the enzymes used in the process.
4. Describe how a library of open reading frames that represents a proteome is constructed by recombinational cloning.
5. A genomic DNA library of the bacterium Pseudomonas putida was constructed by partially digesting the genomic DNA with Sau3AI and inserting the fragments into pUC19 digested with BamHI. Why were two different restriction enzymes used in this experiment? How is the partial digestion performed, and what is the result? Why was a partial digestion used to construct the library?
6. How is the CRISPR-Cas system used to delete nucleotides at a specific site in a eukaryotic genome?
7. How is the CRISPR-Cas system used to insert a sequence at a specific site in a eukaryotic genome?
8. Outline the steps in a PCR cycle. What component of a PCR determines the specificity of the amplified product?
9. Describe how PCR is used to clone a specific gene.
10. Why is real-time PCR quantitative?
11. If your DNA synthesizer has an average coupling efficiency of 98.5%, what overall synthesis yield would you expect after the synthesis of a 50-mer oligonucleotide?
12. Suggest two different strategies for synthesizing a 0.5-kb gene.
13. What is a dideoxynucleotide? How is it used to determine the sequence of a DNA molecule?
14. Outline the basic features of pyrosequencing.
15. How are incorporated nucleotides recognized after each cycle of sequencing using reversible chain terminators? How does this differ from pyrosequencing?
16. What are some advantages of sequencing DNA by real-time single molecule synthesis?
17. Why are adaptors often ligated to DNA fragments prior to sequencing?
18. Describe how a cluster of thousands of copies of a DNA sequencing template can be generated on a solid support.
19. Define the following terms as they relate to genome sequencing: shotgun cloning, multiplexing, massive parallelization, barcode, Phred quality score, contig, scaffold, paired-end reads, coverage, de novo sequencing, resequencing, reference genome, annotation, metagenome.
20. Outline a DNA microarray experiment. List some applications for this technology.
21. What are some of the advantages of using RNA sequencing rather than DNA microarrays to profile gene expression?
22. How are gene expression levels quantified using high-throughput RNA sequencing?
23. Explain why random hexamers used for RNA sequencing result in overrepresentation of sequences from the 5′ end of a gene.
24. How can 2D PAGE be used to identify proteins that are differentially expressed in two samples?
25. Describe some applications for protein microarrays.
26. What biological information may be provided by a protein microarray assay that is not provided by using a DNA microarray?
27. Describe two methods to determine protein-protein interactions.
28. What is an affinity tag?
29. What biological information may be provided by the tandem affinity purification tag system that is not provided by a two-hybrid assay?
30. Explain how metabolomics may be used to identify biomarkers of disease.
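The arithmetic behind question 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name full_length_yield is hypothetical, and the sketch assumes that the first nucleotide is already attached to the solid support, so that an n-mer requires n − 1 coupling steps. Some treatments count n couplings instead, so that case is shown as well. The fraction of full-length product is the average coupling efficiency raised to the number of couplings.

```python
from typing import Optional


def full_length_yield(coupling_efficiency: float,
                      length: int,
                      couplings: Optional[int] = None) -> float:
    """Fraction of chains that reach full length after stepwise synthesis.

    Assumption (not from the text): the first nucleotide is pre-attached
    to the support, so an n-mer needs n - 1 couplings unless the caller
    passes a different number of couplings.
    """
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must be between 0 and 1")
    if length < 1:
        raise ValueError("length must be at least 1")
    if couplings is None:
        couplings = length - 1
    # Each coupling succeeds independently with the same average efficiency.
    return coupling_efficiency ** couplings


# Question 11: a 50-mer made with 98.5% average coupling efficiency.
yield_49 = full_length_yield(0.985, 50)                # 49 couplings
yield_50 = full_length_yield(0.985, 50, couplings=50)  # 50 couplings
print(f"49 couplings: {yield_49:.1%}")
print(f"50 couplings: {yield_50:.1%}")
```

Under either counting convention, a little less than half of the chains are full length, which illustrates why small changes in coupling efficiency matter for long oligonucleotides.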