Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari

1 Introduction to Protein Complex Prediction

Unfortunately, the proteome is much more complicated than the genome.

—Carol Ezzell [Ezzel et al. 2002]

In an early survey, American biochemist Bruce Alberts described large assemblies of proteins as the protein machines of cells [Alberts et al. 1998]. Protein assemblies are composed of highly specialized parts that coordinate to execute almost all of the biochemical, signaling, and functional processes in cells [Alberts et al. 1998]. It is not hard to see why protein assemblies are more advantageous to cells than individual proteins working in an uncoordinated manner. Compare, for example, the speed and elegance of the DNA replication machinery, which simultaneously replicates both strands of the DNA double helix, with what could ensue if each of its individual components (DNA helicases for separating the double-stranded DNA into single strands, DNA polymerases for assembling nucleotides, DNA primase for generating the primers, and the sliding clamp for holding these enzymes onto the DNA) acted in an uncoordinated manner [Alberts et al. 1998]. Although they might seem like individual parts brought together to perform arbitrary functions, protein assemblies can be very specific and enormously complicated. For example, the spliceosome is composed of 5 small nuclear RNAs (snRNAs, which together with their bound proteins form snRNPs or “snurps”) and more than 50 proteins, and is thought to catalyze an ordered sequence of more than 10 RNA rearrangements at a time as it removes an intron from an RNA transcript [Alberts et al. 1998, Baker et al. 1998]. The discovery of this intron-splicing process won Phillip A. Sharp and Richard J. Roberts the 1993 Nobel Prize in Physiology or Medicine (see footnote 1).

Protein assemblies number in the hundreds even in the simplest of eukaryotic cells. For example, more than 400 protein assemblies have been identified in the single-celled eukaryote Saccharomyces cerevisiae (budding yeast) [Pu et al. 2009]. However, our knowledge of these protein assemblies is still fragmentary, as is our conception of how these assemblies work together to constitute the “higher level” functional architecture of cells. A systematic effort to identify and characterize all protein assemblies is therefore crucial for elucidating the functioning of the cellular machinery.

To identify the entire complement of protein assemblies, it is important to first crack the proteome, a concept so novel that the word “proteome” first appeared only around 20 years ago [Wilkins et al. 1996, Bryson 2003, Cox and Mann 2007]. The proteome, as defined in the UniProt Knowledgebase, is the entire complement of proteins expressed or derived from protein-coding genes in an organism [Bairoch and Apweiler 1996, UniProt 2015]. With the introduction of high-throughput experimental (proteomics) techniques including mass spectrometric [Cox and Mann 2007, Aebersold and Mann 2003] and protein quantitative trait locus (QTL) technologies [Foss et al. 2007], mapping of proteins on a large scale has become feasible. Just as genomics techniques (including genome sequencing) were first demonstrated in model organisms, proteome mapping has progressed initially and most rapidly for model prokaryotes including Escherichia coli (a bacterium) and model eukaryotes including Saccharomyces cerevisiae (budding or baker’s yeast), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (a nematode), and Arabidopsis thaliana (a flowering plant). Table 1.1 summarizes the numbers of proteins or protein-coding genes identified from these organisms. Of these, the proportions of protein-coding genes that are essential (genes that are thought to be critical for the survival of the cell or organism; “fitness genes”) range from ∼2% in Drosophila to ∼6.5% in Caenorhabditis and ∼18% in Saccharomyces [Cherry et al. 2012, Chen et al. 2012]. Recent landmark studies using large-scale proteomics [Wilhelm et al. 2014, Kim et al. 2014, Uhlén et al. 2010, Uhlén et al. 2015] on Homo sapiens (human) cells have characterized >17,000 (or >90%) putative protein-coding genes from ≥40 tissues and organs in the human body.
An encyclopedic resource on these proteins covering their levels of expression and abundance in different human tissues is available from the ProteomicsDB (http://www.proteomicsdb.org/) [Wilhelm et al. 2014], The Human Proteome Map (http://humanproteomemap.org/) [Kim et al. 2014], and The Human Protein Atlas (http://www.proteinatlas.org/) [Uhlén et al. 2010, Uhlén et al. 2015] projects. GeneCards (http://www.genecards.org/) [Safran et al. 2002, Safran et al. 2010] aggregates information on human protein-coding genes from >125 Web sources and presents the information in an integrative user-friendly manner. The expression levels of nearly 200 proteins that are essential for driving different human cancers are available from The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics.mdanderson.org/tcpa/_design/basic/index.html) [Li et al. 2013], measured from more than 3,000 tissue samples across 11 cancer types studied as part of The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). Short-hairpin RNA (shRNA)-mediated knockdown [Paddison et al. 2002, Lambeth and Smith 2013], clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9-based gene editing [Sanjana et al. 2014, Baltimore et al. 2015, Shalem et al. 2015], and disruptive mutagenesis [Bökel 2008] screening using MCF-10A (near-normal mammary), MDA-MB-435 (breast cancer), KBM7 (chronic myeloid leukemia), HAP1 (haploid), A375 (melanoma), HCT116 (colorectal cancer), and HUES62 (human embryonic stem) cells have characterized 1,500–1,880 (or 8–10%) “core” protein-coding genes as essential in human cells [Marcotte et al. 2016, Silva et al. 2008, Wang et al. 2014, Hart et al. 2015, Hart et al. 2014, Wang et al. 2015, Blomen et al. 2015].

Table 1.1 Examples of proteome resources for some model and higher-order organisms (as of December 2015), covering also Danio rerio (Zebrafish), Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), Schizosaccharomyces pombe (fission yeast), and Xenopus laevis (African clawed frog)


Comparative analyses of proteomes from different species have revealed interesting insights into the evolution and conservation of proteins. For example, it is estimated that the genomes (proteomes) of human and budding yeast diverged about 1 billion years ago from a common ancestor [Douzery et al. 2014], and that the two still share several thousand genes accounting for more than one-third of the yeast genome [O’Brien et al. 2005, Östlund et al. 2010]. Yeast and human orthologs are highly diverged; the amino-acid sequence similarity between human and yeast proteins ranges from 9% to 92%, with a genome-wide average of 32%. However, sequence similarity reveals only part of the picture [Sun et al. 2016]. Recent studies [Kachroo et al. 2015, Laurent et al. 2015] have reported that 414 (or nearly half of the) essential protein-coding genes in yeast could be “replaced” by human genes, with replaceability depending on gene (protein) assemblies: genes in the same process tend to be similarly replaceable (e.g., sterol biosynthesis) or not replaceable (e.g., DNA replication initiation).

Whether in a lower-order model organism or a higher-order complex organism, a protein has to physically interact with other proteins and biomolecules to remain functional. Estimates in humans suggest that over 80% of proteins do not function alone, but instead interact to function as macromolecular assemblies [Berggård et al. 2007]. This organization of individual proteins into assemblies is tightly regulated in cellular space and time, and is supported by protein conformational changes, posttranslational modifications, and competitive binding [Gibson and Goldberg 2009]. On the basis of stability (area of interaction surface and duration of interaction) and partner specificity, the interactions between proteins are classified as homo- or hetero-oligomeric, obligate or non-obligate, and permanent or transient [Zhang 2009, Nooren and Thornton 2003]. Proteins in obligate interactions cannot exist as stable structures on their own and are frequently bound to their partners upon translation and folding, whereas proteins in non-obligate interactions can exist as stable structures in both bound and unbound states. Obligate interactions are generally permanent or constitutive: once formed, they persist for the entire lifetime of the proteins. Non-obligate interactions may be permanent or transient; in the transient case, the protein interacts with its partners for a brief period and dissociates after that. Depending on the functional, spatial, and temporal context of the interactions, protein assemblies are classified as protein complexes, functional modules, and biochemical (metabolic) and signaling pathways.

Protein complexes are the most basic forms of protein assemblies and constitute fundamental functional units within cells. Complexes are stoichiometrically stable structures formed from physical interactions between proteins coming together at a specific time and place. Complexes are responsible for a wide range of functions within cells including formation of the cytoskeleton, transportation of cargo, metabolism of substrates for the production of energy, replication of DNA, protection and maintenance of the genome, transcription and translation of genes to gene products, maintenance of protein turnover, and protection of cells from internal and external damaging agents. Complexes can be permanent, i.e., once assembled they function for the entire lifetime of cells (e.g., ribosomes), or transient, i.e., they are assembled temporarily to perform a specific function and are disassembled after that (e.g., cell-cycle kinase-substrate complexes formed in a cell-cycle dependent manner).

Functional modules are formed when two or more protein complexes interact with each other, and often with other biomolecules (viz. nucleic acids, sugars, lipids, small molecules, and individual proteins), at a specific time and place to perform a particular function, and dissociate after that. This molecular organization has been termed “protein sociology” [Robinson et al. 2007]. For example, the DNA replication machinery, highlighted earlier, is formed by a tightly coordinated assembly of DNA polymerases, DNA helicase, DNA primase, the sliding clamp, and other complexes within the nucleus to ensure error-free replication of the DNA during cell division.

Pathways are formed when sets of complexes and individual proteins interact via an ordered sequence of interactions to transduce signals (signaling pathways) or metabolize substrates from one form to another (metabolic pathways). For example, the MAPK pathway is composed of a sequence of mitogen-activated protein kinases (MAPKs) that transduce signals from the cell membrane to the nucleus, to induce the transcription of specific genes within the nucleus. Unlike complexes and functional modules, pathways do not require all components to co-localize in time and space.

1.1 From Protein Interactions to Protein Complexes

Physical interactions between proteins are fundamental to the formation of protein complexes. Therefore, mapping the entire complement of protein interactions (the “interactome”) occurring within cells (in vivo) is crucial for identifying and characterizing complexes. However, inferring all interactions occurring during the entire lifetime of cells in an organism is challenging, and this challenge increases multifold as the complexity of the organism increases—e.g., for multicellular organisms made up of multiple cell types.

The development of high-throughput proteomics technologies, including yeast two-hybrid (Y2H) [Fields and Song 1989], co-immunoprecipitation (Co-IP) [Golemis and Adams 2002], and affinity-purification (AP)-based [Rigaut et al. 1999] screens, has revolutionized our ability to interrogate protein interactions on a massive scale, and has enabled global surveys of interactomes from a number of organisms. In particular, up to 70% of the interactions from model organisms including yeast [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002, Gavin et al. 2002, Gavin et al. 2006, Krogan et al. 2006], fly [Guruharsha et al. 2011], and nematode [Butland et al. 2005, Li et al. 2004] have been mapped, and the identification of interactions from higher-order multicellular organisms, including the flowering plant Arabidopsis, the fish Danio (zebrafish), and several mammals (Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), and humans), is well underway; the interactions are cataloged in large public databases [Stark et al. 2011, Rolland et al. 2014].

The earliest and most widely used experimental techniques for capturing binary protein interactions on a high-throughput scale were yeast two-hybrid (Y2H) screens [Fields and Song 1989]. However, datasets of protein interactions inferred from Y2H screens were found to contain significant numbers of spurious interactions [Von Mering et al. 2002, Bader and Hogue 2002, Bader et al. 2004]. This is attributed in part to the nature of the Y2H protocol, in which all potential interactors are tested within the same compartment (the nucleus) even though some of them never meet during their lifetimes because of compartmentalization (different subcellular localizations) within living cells.

Co-immunoprecipitation or affinity-purification (Co-IP/AP) techniques were introduced later and these are more specific in detecting interactions between co-complexed proteins [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007]. In these protocols, cohesive groups or complexes of proteins are “pulled down,” from which the binary interactions between the proteins are individually inferred. However, this indirect inference could lead to over- or under-estimation of protein interactions. In the tandem affinity purification (TAP) procedure [Rigaut et al. 1999, Puig et al. 2001], proteins of interest (“baits”) are TAP-tagged and purified in an affinity column with potential interaction partners (“preys”). The pulled-down complexes are subjected to mass spectrometric (MS) analysis to identify individual components within the complexes. However, although more reliable than Y2H, the TAP/MS procedure can be elaborate and with the inclusion of MS, it can be expensive too. The exhaustiveness of TAP/MS depends on the baits used—there is no way to identify all possible complexes unless all possible baits are tested. Proteins which do not interact directly with the chosen bait but interact with one or more of the preys, might also get pulled down as part of the purified complex. In some cases, these proteins are indeed part of the real complex whereas in other cases these proteins are not (i.e., they are contaminants); therefore multiple purifications are required, possibly with each protein as a bait and as a prey, to identify the correct set of proteins within the complex. The TAP procedure therefore offers two successive affinity purifications so that the chance of retained contaminants reduces significantly. Conversely, a chosen bait might form a real complex with a set of proteins without actually interacting directly with every protein from the set, and therefore some proteins might not get pulled down as part of the purified complex. 
In these cases, multiple baits would need to be tested to assemble the complete complex. Moreover, since some proteins participate in more than one complex, multiple independent purifications are required to identify all hosting complexes for these proteins.

Binary interactions between the proteins in a pulled-down protein complex are inferred using two models: matrix and spoke. In the matrix model, a binary interaction is inferred between every pair of proteins within the complex, whereas in the spoke model interactions are inferred only between the bait and each of its preys. Since not all pairs of proteins within a complex necessarily interact, the matrix model usually overestimates the total number of binary interactions, whereas the spoke model usually underestimates it. Therefore, a balance is usually struck between the two models so that the inferred total is close to the estimated total number of interactions for the species or organism.
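The difference between the two inference models can be illustrated with a minimal Python sketch. The function names and the bait and prey labels below are hypothetical and chosen only for illustration; the sketch assumes a single purification with one bait and a list of preys.

```python
from itertools import combinations


def spoke_interactions(bait, preys):
    """Spoke model: pair the bait with each of its preys only."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_interactions(bait, preys):
    """Matrix model: pair every two distinct proteins in the pulled-down set."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait "A" pulls down preys "B", "C", and "D".
spoke = spoke_interactions("A", ["B", "C", "D"])
matrix = matrix_interactions("A", ["B", "C", "D"])

print(len(spoke))        # 3 pairs: A-B, A-C, A-D
print(len(matrix))       # 6 pairs: every pair among the 4 proteins
print(spoke <= matrix)   # True: each spoke pair is also a matrix pair
```

For a pulldown of n proteins, the spoke model yields n - 1 pairs and the matrix model n(n - 1)/2 pairs, so the gap between the two estimates widens quickly as purified complexes get larger.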

Table 1.2 Numbers of mapped physical interactions between proteins across different model and higher-order organisms

Organism No. of Interactions No. of Proteins
A. thaliana 34,320 9,240
C. elegans 5,783 3,269
D. rerio 188 181
D. melanogaster 36,741 8,071
E. coli 99 104
H. sapiens 230,843 20,006
M. musculus 18,465 8,611
R. norvegicus 4,537 3,328
S. cerevisiae 82,327 6,278
S. pombe 9,492 2,944
X. laevis 532 471

Based on BioGRID version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].

Although they differ in procedure and technology, the experimental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detect mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detect even transient and between-complex interactions. Based on BioGRID version 3.4.130 (November 2015) (http://thebiogrid.org/) [Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ~82,300 in S. cerevisiae and ~230,800 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped.

The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply, PPI network. The PPI network presents a global or “systems” view of the interactome, and provides a mathematical (topological) framework to analyze these interactions. Protein complexes are expected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely connected subsets of proteins separated by less-dense regions in the network [Newman 2004, Newman 2010]. Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Computational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising due to the shortcomings highlighted above in experimental protocols, severely restrict the feasibility of accurately predicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:

1. presence of a large number of spurious (noisy) interactions;

2. relative paucity of interactions between “complexed” proteins; and

3. missing contextual—e.g., temporal and spatial—information about the interactions.

These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:

1. difficulty in detecting sparse complexes;

2. difficulty in detecting small complexes (containing fewer than four proteins) and sub-complexes; and

3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cellular contexts.
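The topological notion of modularity used above (densely connected subsets of proteins) and the difficulty with sparse complexes can be made concrete with a small density calculation. The following is only a sketch: the toy network, the protein labels, and the `density` helper are hypothetical, and density is taken here to mean the fraction of possible protein pairs in a subset that are connected.

```python
def density(proteins, edges):
    """Fraction of possible pairs within `proteins` that are linked in `edges`."""
    members = set(proteins)
    n = len(members)
    if n < 2:
        return 0.0
    # Keep only edges whose two distinct endpoints both lie inside the subset.
    internal = {frozenset(e) for e in edges
                if len(set(e)) == 2 and set(e) <= members}
    return len(internal) / (n * (n - 1) / 2)


# Hypothetical toy PPI network with one dense and one sparse group.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),  # dense group
    ("E", "F"), ("F", "G"), ("G", "H"),                          # sparse chain
    ("D", "E"),                                                  # link between groups
]

print(round(density({"A", "B", "C", "D"}, edges), 2))  # 0.83 (5 of 6 pairs)
print(round(density({"E", "F", "G", "H"}, edges), 2))  # 0.5  (3 of 6 pairs)
```

A method that looks only for dense regions would readily pick out the first group but could miss the second, even if both were real complexes whose interactions were unevenly mapped.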

While the interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between the datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interaction has to be assessed (confidence scoring) and low-confidence interactions have to be first removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):

1. integrating interactions from multiple experiments and stringently assessing the confidence (reliability) of these interactions;

2. constructing a reliable PPI network using only the high-confidence interactions;

3. identifying modular subnetworks from the PPI network to generate a candidate list of protein complexes; and

4. evaluating these candidate complexes against bona fide complexes, and validating and assigning roles for novel complexes.

Figure 1.1 Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after discarding low-confidence (potentially spurious) interactions. (b) Candidate protein complexes are predicted from this PPI network using network-clustering approaches. The quality of the predicted complexes is validated against bona fide complexes, whereas novel complexes are functionally assessed and assigned new roles where possible.
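The four steps above can be sketched end to end in a few lines of Python. This is a deliberately naive illustration under stated assumptions: the scored interactions and the reference complexes are made up, the 0.5 threshold is arbitrary, connected components stand in for a real network-clustering method, and the Jaccard index is just one possible way to compare a candidate with a reference complex.

```python
from collections import defaultdict

# Hypothetical scored interactions: (protein, protein, confidence in [0, 1]).
scored = [
    ("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.7),
    ("C", "D", 0.2),                      # low confidence, possibly spurious
    ("D", "E", 0.9), ("E", "F", 0.6),
]


def build_network(interactions, threshold):
    """Steps 1-2: keep only interactions scoring at least `threshold`."""
    graph = defaultdict(set)
    for u, v, score in interactions:
        if score >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def candidate_complexes(graph):
    """Step 3 (stand-in for clustering): connected components as candidates."""
    seen, candidates = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        candidates.append(component)
    return candidates


def jaccard(a, b):
    """Step 4: overlap between a candidate and a reference complex."""
    return len(a & b) / len(a | b)


network = build_network(scored, threshold=0.5)
predicted = candidate_complexes(network)
reference = [{"A", "B", "C"}, {"D", "E", "F", "G"}]   # hypothetical gold standard

for candidate in predicted:
    best = max(jaccard(candidate, ref) for ref in reference)
    print(sorted(candidate), round(best, 2))
# ['A', 'B', 'C'] 1.0
# ['D', 'E', 'F'] 0.75
```

Lowering the threshold to 0.1 keeps the low-confidence C-D interaction, and the two candidates merge into a single component, which illustrates why confidence scoring and filtering precede clustering.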

As we shall see in the following chapters, several sophisticated approaches have been developed over the years to overcome some of the above-mentioned challenges.

Computational methods have co-evolved with proteomics technologies, and over the last ten years a plethora of computational methods have been developed to predict complexes from PPI networks, which is the subject of this book. In general, computational methods complement experimental approaches in several ways. These methods have helped counter some of the limitations arising in proteomic studies, e.g., by eliminating spurious interactions via interaction scoring, and by enriching true interactions via prediction of missing interactions. The novel interactions and protein complexes predicted from these methods have been added back to proteomics databases, and these have helped to further enhance our resources and knowledge in the field.

1.2 Databases for Protein Complexes

Several high-quality resources for protein complexes have been developed over the years covering both lower-order model and higher-order organisms (summarized in Table 1.3). Together, Aloy [Aloy et al. 2004], CYC2008 [Pu et al. 2009], and MIPS [Mewes et al. 2008] contain over 450 manually curated complexes from S. cerevisiae (budding yeast). CORUM [Ruepp et al. 2008, 2010] contains ∼3,000 mammalian complexes, of which ∼1,970 are protein complexes identified from human cells. The European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI) maintain a database of manually curated protein complexes from 18 different species including C. elegans, H. sapiens, M. musculus, S. cerevisiae, and S. pombe [Meldal et al. 2015].

Havugimana et al. [2012] present a dataset of 622 putative human soluble protein complexes (http://human.med.utoronto.ca/) identified using high-throughput AP/MS pulldown and PPI-clustering approaches. Huttlin et al. [2015] present 352 putative human complexes identified from human embryonic kidney (HEK293T) cells (http://wren.hms.harvard.edu/bioplex/). Wan et al. [2015] present a catalog of conserved metazoan complexes (http://metazoa.med.utoronto.ca/) identified by clustering of high-quality pulldown interactions from C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin). This dataset includes ~300 complexes composed entirely of ancient proteins (evolutionarily conserved from lower-order organisms), and ~500 complexes composed largely of ancient proteins conserved ubiquitously among eukaryotes. Drew et al. [2017] present a comprehensive catalog of >4,600 computationally predicted human protein complexes covering >7,700 proteins and >56,000 interactions, obtained by analyzing data from >9,000 published mass spectrometry experiments. Vinayagam et al. [2013] present COMPLEAT (http://www.flyrnai.org/compleat/), a database of 3,077, 3,636, and 2,173 literature-curated protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae, respectively. Ori et al. [2016] combined mammalian complexes from CORUM and COMPLEAT to generate a dataset of 279 protein complexes from mammals.

Table 1.3 Publicly available databases for protein complexes a


a. No. of complexes as of 2016.

b. COMPLEAT includes protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae. The EMBL-EBI portal includes protein complexes from 18 different species, among which are C. elegans (16 complexes), H. sapiens (441), M. musculus (404), S. cerevisiae (399), and S. pombe (16). CORUM includes mammalian protein complexes, mainly from H. sapiens (64%), M. musculus (house mouse) (15%), and R. norvegicus (Norwegian rat) (12%).

c. Includes mainly conserved complexes among the metazoans C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin), consisting of 344 complexes with entirely ancient proteins and 490 complexes with largely ancient proteins conserved ubiquitously among eukaryotes.

1.3 Organization of the Rest of the Book

The rest of this book is organized as follows. Chapter 2 discusses important concepts underlying PPI networks and presents prerequisites for understanding subsequent chapters. We discuss different high-throughput experimental techniques employed to infer PPIs (including the Y2H and AP/MS techniques mentioned earlier), explaining briefly the biological and biochemical concepts underlying these techniques and highlighting their strengths and weaknesses. We explain computational approaches that denoise (PPI weighting) and integrate data from multiple experiments to construct reliable PPI networks. We also discuss topological properties of PPI networks, theoretical models for PPI networks, and the various databases and software tools that catalog and visualize PPI networks. Chapter 3 forms the core of this book, as it introduces and discusses in depth the algorithmic underpinnings of some of the classical (seminal) computational methods to identify protein complexes from PPI networks. While some of these methods work solely on the topology of the PPI network, others combine additional biological information (e.g., in the form of functional annotations) with PPI network topology to improve their predictions. Chapter 4 presents a comprehensive empirical evaluation of six widely used protein complex prediction methods available in the literature using unweighted and weighted PPI networks from yeast and human. Taking a known human protein complex as an example, we discuss how the methods have fared in recovering this complex from the PPI network. Based on this evaluation, we explain in Chapter 5 the shortcomings of current methods in detecting certain kinds of protein complexes, e.g., protein complexes that are sparse or that overlap with other complexes. Through this, we highlight the open challenges that need to be tackled to improve the coverage and accuracy of protein complex prediction. We discuss some recently proposed methods that attempt to tackle these open challenges and the extent to which these methods have been successful. Chapter 6 is dedicated to an important class of protein complexes that are dynamic in their protein composition and assembly. While some of these protein complexes are temporal in nature (i.e., they assemble at a specific timepoint and dissociate after that), others are structurally variable (e.g., they change their 3D structure and/or composition) depending on the cellular context. It is not possible to detect dynamic complexes solely by analyzing the PPI network; methods that integrate gene or protein expression and 3D structural information are required. These more sophisticated methods are covered here. Chapter 7 discusses methods to identify protein complexes that are conserved between organisms or species; these evolutionarily conserved complexes provide important insights into the conservation of cellular processes through evolution. Finally, in today’s era of systems biology, where biological systems are studied as a complex interplay of multiple (biomolecular) entities, we explain how protein complex prediction methods are playing a crucial role in shaping the field; these applications are covered in Chapter 8. We discuss the application of these methods to predicting dysregulated or dysfunctional protein complexes, identifying rewiring of interactions within complexes, and discovering new disease genes and drug targets. We conclude the book in Chapter 9 by reiterating the diverse applications of protein complex prediction methods and thereby the importance of computational methods in driving this exciting field of research.

1. http://www.nobelprize.org/nobel_prizes/medicine/laureates/1993/
