Principles of Virology - Jane Flint, S. Jane Flint - Page 101

BOX 2.9 EXPERIMENTS Pathogen de-discovery

High-throughput sequencing of nucleic acids has accelerated the pace of virus discovery, but at a cost: contaminants are much easier to detect.

During a search for the causative agent of seronegative hepatitis (disease not caused by hepatitis A, B, C, D, or E virus) in Chinese patients, a new virus with a single-stranded DNA genome was discovered in sera by high-throughput sequencing. Seventy percent of 90 patient serum samples were positive for viral DNA by PCR, and sera from 45 healthy controls were negative. Furthermore, 84% of patients were positive for antibodies against the virus. Among healthy controls, 78% were antibody positive. The authors concluded that this virus was highly prevalent in some patients with seronegative hepatitis. A second independent laboratory identified the same virus in sera from patients in the United States with non-A-to-E hepatitis, while a third group identified the virus in diarrheal stool samples from Nigeria.

The first clue that something was amiss was the observation that the new virus identified in all three laboratories shared 99% nucleotide and amino acid identity: this degree of similarity would not be expected among viruses recovered from such geographically, temporally, and clinically diverse samples. Another problem was that in the U.S. non-A-to-E hepatitis study, all pools of patient sera were positive for viral sequences. These observations suggested the possibility of viral contamination.
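The identity figure behind this first clue is a simple calculation once two sequences have been aligned. The following is a minimal sketch, assuming the sequences are already aligned to equal length with "-" marking gaps; the function name and the two short fragments are made up for illustration and are not the sequences from the study.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned, non-gap columns at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Hypothetical aligned fragments "recovered" by two laboratories.
lab_a = "ATGGCTAGCTAGGATCCGAT"
lab_b = "ATGGCTAGCTAGGATCCGAA"
print(percent_identity(lab_a, lab_b))  # 19 of 20 columns match
```

Near-identical values across unrelated sample sets are what one would expect from a shared reagent contaminant rather than from independently circulating viruses.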

When nucleic acids were repurified from the U.S. non-A-to-E hepatitis samples using a different method, none were positive for the new virus. The presence of the virus was traced to the use of column-based purification kits manufactured by Qiagen, Inc. (pictured). Nearly the entire viral genome could be detected by deep sequencing of sterile water that was passed through these columns. The nucleic acid purification columns contaminated with the new virus were used to purify nucleic acid from patient samples. These columns, produced by a number of manufacturers, are typically an inch in length and contain a silica gel membrane that binds nucleic acids. The clinical samples are added to the column, which is then centrifuged briefly to remove liquids (hence the name “spin” columns). The nucleic acid adheres to the silica gel membrane. Contaminants are washed away, and the nucleic acids are then released from the silica by the addition of a buffer.

Why were the Qiagen spin columns contaminated with viral DNA? A search of the publicly available environmental metagenomic data sets revealed the presence of sequences highly related to this virus (87 to 99% nucleotide identity). The data sets containing these sequences were obtained from seawater collected off the Pacific coast of North America and coastal regions of Oregon and Chile. The source of contamination could be explained if the silica in the Qiagen spin columns was produced from ocean-dwelling diatoms that were infected with the virus.

In retrospect, it was easy to be fooled into believing that the novel virus might be a human pathogen because its DNA was detected only in sick patients and not in healthy controls. Why antibodies to the virus were detected in samples from both sick and healthy individuals remains to be explained. However, the virus is not likely to be associated with any human illness: when non-Qiagen spin columns were used, the viral sequences were not found in any patient sample.

The lesson to be learned from this story is clear: high-throughput sequencing is a very powerful and sensitive method but must be applied with great care. Every step of the virus discovery process must be carefully controlled, from the water used to the plastic reagents. Most importantly, laboratories carrying out pathogen discovery must share their sequence data, something that took place during this study.
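One way to build such a control into an analysis is to sequence a blank extraction (for example, sterile water passed through the same columns) alongside the samples and to set aside any candidate that also appears in the blank. The sketch below only illustrates that idea: the function name, the 10-fold threshold, and the read counts are hypothetical and do not come from the study.

```python
def partition_by_control(sample_hits, control_hits, min_fold=10.0):
    """Split candidates into (kept, suspect) using a blank-extraction control.

    sample_hits and control_hits map a candidate name to a normalized read
    count (e.g., reads per million). A candidate is treated as suspect when
    the blank control also contains it and the sample count is less than
    min_fold times the control count. The threshold is an assumption.
    """
    kept, suspect = [], []
    for name, count in sample_hits.items():
        blank = control_hits.get(name, 0.0)
        if blank > 0 and count < min_fold * blank:
            suspect.append(name)
        else:
            kept.append(name)
    return kept, suspect


# Made-up counts: the candidate is almost as abundant in water as in serum.
sample = {"candidate_virus": 120.0, "known_pathogen": 300.0}
water = {"candidate_virus": 95.0}
print(partition_by_control(sample, water))
```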

Naccache SN, Greninger AL, Lee D, Coffey LL, Phan T, Rein-Weston A, Aronsohn A, Hackett J, Jr, Delwart EL, Chiu CY. 2013. The perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns. J Virol 87:11966–11977.

Xu B, Zhi N, Hu G, Wan Z, Zheng X, Liu X, Wong S, Kajigaya S, Zhao K, Mao Q, Young NS. 2013. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing. Proc Natl Acad Sci U S A 110:10264–10269.

Computational biology. The generation of nucleotide sequences at an unprecedented rate has spawned a new branch of bioinformatics to develop algorithms for assembling sequence reads into continuous strings and to determine whether they are from a new or previously discovered virus. Storing, analyzing, and sharing massive quantities of data constitute an immense challenge: the number of bases in GenBank, an open-access, annotated collection of all publicly available nucleotide sequences produced and maintained by the National Center for Biotechnology Information, has doubled every 18 months since 1982. As of June 2019, GenBank held 329,835,282,370 bases.

Computational problems must be solved at multiple steps during the process of genome sequencing. The initial problem is that sequence reads are typically short, and there are many of them (hence the term “high throughput”). These short sequences must be overlapped and, if possible, mapped to a genome. Many computer programs have been developed to address this problem. Some carry out alignment of sequence reads to a reference genome, while others perform this process de novo, i.e., in the absence of a reference genome.
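The overlap step of de novo assembly can be illustrated with a toy greedy assembler. This is a minimal sketch, assuming error-free reads from a single strand and a made-up 15-base "genome"; real assemblers must handle sequencing errors, repeats, and both strands, and generally use more sophisticated graph-based methods. The function names and the min_len parameter are chosen here for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three overlapping reads from the made-up sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

Reference-guided approaches skip the all-against-all comparison by placing each read on a known genome, which is faster but works only when a sufficiently similar reference exists.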

When clinical or environmental samples are subjected to high-throughput sequencing for pathogen discovery, it is essential to identify viral sequences in what is typically a mix of host, bacterial, and fungal sequences. This task relies on alignment of sequences to reference viral databases. However, such databases are limited because most of the sequences retrieved in metagenomic studies are unknown (so-called “dark matter”) and therefore cannot be annotated. Consequently, computational pipelines have been designed to analyze high-throughput sequencing data to search for those likely to be of viral origin.
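The sorting of reads against a reference collection, and the "dark matter" problem that follows from it, can be sketched with exact k-mer matching. Note that this swaps in shared k-mers as a simple stand-in for alignment, and that the reference names, sequences, k value, and 0.5 threshold are all hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: dict, k: int) -> dict:
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def classify_read(read: str, index: dict, k: int, min_fraction: float = 0.5) -> str:
    """Assign a read to the reference sharing the most k-mers with it.

    Reads sharing too few k-mers with any reference stay "unclassified",
    which is the fate of sequences absent from the database.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return "unclassified"
    hits = {}
    for km in read_kmers:
        for name in index.get(km, ()):
            hits[name] = hits.get(name, 0) + 1
    if not hits:
        return "unclassified"
    best, count = max(hits.items(), key=lambda kv: kv[1])
    return best if count / len(read_kmers) >= min_fraction else "unclassified"


# Made-up reference sequences standing in for viral and host databases.
references = {
    "virus_X": "ATGGCGTACGTTAGCCGATAGGCTA",
    "host": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
index = build_index(references, k=8)
for read in ["GCGTACGTTAGCCG", "TTTTGGGGCCCC", "ACACACACACACAC"]:
    print(read, classify_read(read, index, k=8))
```

The third read matches nothing in the index, so it cannot be labeled: a small-scale version of why reference-based methods leave most metagenomic sequences unannotated.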

Some computational pipelines are designed to define the abundance and types of viruses in a sample, such as Viral Informatics Resource for Metagenome Exploration (VIROME), the Viral MetaGenome Annotation Pipeline (VMGAP), and Basic Local Alignment Search Tool (BLAST). Other virus discovery programs (MePIC, READSCAN, CaPSID, VirusFinder, and SRSA) rely on nucleotide sequence alignment and will work only for the detection of viruses with high sequence similarity to known viruses. PathSeq, SURPI, VirFind, and VirusHunter identify viruses by amino acid searches, a computationally demanding exercise that is critical for new virus identification. VirusSeeker-Virome (VS-Virome) is a computational pipeline designed for defining both the type and abundance of known and novel viral sequences in metagenomic data sets (Fig. 2.17).
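Why amino acid searches can find viruses that nucleotide alignment misses, and why they cost more, can be seen in a small sketch. It assumes the standard genetic code and uses two made-up 18-base sequences; it is not the method of any pipeline named above. Because the reading frame of a read is unknown, each read must be translated in all six frames before protein comparison, which multiplies the search effort.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> list:
    """Translations of three forward and three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(s[f:]) for s in (dna.upper(), rc) for f in range(3)]


# Hypothetical sequences that differ mostly at synonymous positions.
known = "ATGGCTAAAGGTCTGTCT"
novel = "ATGGCCAAGGGATTAAGC"
nt_matches = sum(x == y for x, y in zip(known, novel))
print(f"nucleotide identity: {nt_matches}/{len(known)}")
print(translate(known), translate(novel))
print(six_frames(novel))
```

In this example barely more than half of the nucleotide positions match, yet the encoded peptides are identical, so a protein-level search would still link the two sequences.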

Genome sequences can provide considerable insight into the evolutionary relationships among viruses. Such information can be used to understand the origin of viruses and how selection pressures change viral genomes and to assist in epidemiological investigations of viral outbreaks. When few viral genome sequences were available, pairwise homologies were often displayed in simple tables. As sequence databases increased in size, tables of multiple alignments were created, but these were still based only on pairwise comparisons. Today, phylogenetic trees are used to illustrate the relationships among numerous viruses or viral proteins (Box 2.10). Not only are such trees important tools for understanding evolutionary relationships, but they may allow conclusions to be drawn about biological functions: examination of a phylogenetic tree may allow determination of how closely or distantly a sequence relates to one of known function. Software programs such as AdaPatch, AntiPatch, and AntigenicTree have been developed to produce phylogenetic trees. However, these approaches do not account for horizontal gene transfer, recombination, or the evolutionary relationships between viruses and their hosts, which will require unconventional computational methods to resolve.
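The step from a table of pairwise comparisons to a tree can be illustrated with UPGMA, a classic distance-based clustering method. The text does not say which algorithms the programs named above use, so this is a generic sketch rather than a description of them; the four labels and their distances are made up, and branch lengths are omitted for brevity.

```python
def upgma(names, dist):
    """Build a tree topology by repeatedly joining the two closest clusters.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the pairwise distance.
    Returns the topology as a Newick-style string without branch lengths.
    """
    clusters = {name: 1 for name in names}  # cluster label -> member count
    d = dict(dist)
    while len(clusters) > 1:
        pair = min(d, key=lambda p: d[p])  # closest pair of clusters
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two parts.
        for c in clusters:
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((merged, c))] = (
                (size_a * dac + size_b * dbc) / (size_a + size_b)
            )
        del d[pair]
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


# Hypothetical pairwise distances (e.g., fraction of differing sites).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.4,
    frozenset(("A", "D")): 0.8,
    frozenset(("B", "D")): 0.8,
    frozenset(("C", "D")): 0.8,
}
print(upgma(names, dist))
```

A strictly bifurcating tree of this kind has no way to represent a sequence with two parents, which is one reason recombination and horizontal gene transfer call for other methods.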

Algorithms have also been written to apply high-throughput sequencing methods to a variety of genome-wide analyses, including detection of single-nucleotide polymorphisms (SNPs), RNA-seq, ChIP-seq, CLIP, and more (see below).
