Читать книгу Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari - Страница 10

Оглавление

2

Constructing Reliable Protein-Protein Interaction (PPI) Networks

No molecule arising naturally (MAN) is an island, entire of itself.

—John Donne (1573–1631), English poet and cleric (modified [Dunn 2010] from original quote, “No man is an island, entire of itself.”)

The identification of PPIs yields insights into functional relationships between proteins. Over the years, a number of different experimental techniques have been developed to infer PPIs. This inference of PPIs is orthogonal, but also complementary, to experiments inferring genetic interactions; both provide lists of candidate interactions and implicate functional relationships between proteins [Morris et al. 2014].

2.1 High-Throughput Experimental Systems to Infer PPIs

Physical interactions between proteins are inferred using different biochemical, biophysical, and genetic techniques (summarized in Table 2.1). Yeast two-hybrid (Y2H; less commonly, YTH) [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002] and protein-fragment complement assays [Michnick 2003, Remy and Michnick 2004, Remy et al. 2007] enable identification of direct binary physical interactions between the proteins, whereas co-immunoprecipitation or affinity purification assays [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007, Dunham et al. 2012] enable pull down of whole protein complexes from which the binary interactions are inferred. Protein-fragment complementation assay (PCA) coupled with biomolecular fluorescence complementation (BIFC) [Grinberg et al. 2004] enables mapping of interaction surfaces of proteins, and is thus a good tool to confirm protein binding. Membrane YTH and mammalian membrane YTH (MaMTH) [Lalonde et al. 2008, Kittanakom et al. 2009, Lalonde et al. 2010, Petschnigg et al. 2014, Yao et al. 2017] enable identification of interactions involving membrane or membrane-bound proteins which are typically difficult to identify using traditional Y2H and AP techniques. Techniques inferring genetic interactions [Brown et al. 2006] enable detection of functional associations or genetic relationships between the proteins (genes), but these associations do not always correspond to physical interactions. Here, we present only an overview of each of the experimental techniques; for a more descriptive survey, the readers are referred to Brückner et al. [2009], Shoemaker and Panchenko [2007], and Snider et al. [2015].

Table 2.1 Experimental techniques for screening protein interactions; these techniques can be employed in a high-throughput manner to screen whole protein libraries for potential interactors


Yeast Two-Hybrid (Y2H) Screening System

Y2H was first described by Fields and Song [1989] and is based on the modularity of binding domains in eukaryotic transcription factors. Eukaryotic transcription factors have at least two distinct domains: (1) the DNA binding domain (BD) which directs binding to a promoter DNA sequence (upstream activating sequence (UAS)); and (2) the transcription activating domain (AD) which activates the transcription of target reporter genes. Splitting the BD and AD domains inactivates transcription, but even indirectly connecting AD and BD can restore transcription resulting in activation of specific reporter genes. Plasmids are engineered to produce a protein product (chimeric or “hybrid”) in which the BD fragment is in-frame fused onto a protein of interest (the bait), while the AD fragment is in-frame fused onto another protein (the prey) (Figure 2.1). The plasmids are then transfected into cells chosen for the screening method, usually from yeast. If the bait and prey proteins interact, the AD and BD domains are indirectly connected, resulting in the activation of reporters within nuclei of cells. Typically, multiple independent yeast colonies are assayed for each combination of plasmids to account for the heterogeneity in protein expression levels and their ability to activate reporter transcription.

This basic Y2H technique has been improved over the years to enable large library screening [Chien et al. 1991, Dufree et al. 1993, Gyuris et al. 1993, Finley and Brent 1994]. Interaction mating is one such protocol that can screen more than one bait against a library of preys, and can save considerable time and materials. In this protocol, the AD- and BD-fused proteins begin in two different haploid yeast strains with opposite mating types. These proteins are brought together by mating, a process in which two haploid cells fuse to form a single diploid cell. The diploids are then tested using conventional reporter activation for possible interactors. Therefore, different bait-expressing strains can be mated with a library of prey-expressing strains, and the resulting diploids can be screened for interactors. It is important to know how many viable diploids have arisen and to determine the false-positive frequency of the detected interactions. True interactors tend to come up in a timeframe specific for each given bait, with false positives clustering at a different timepoint. Multiple yeast colonies are assayed to confirm the interactors.

Figure 2.1 Schematic representation of the yeast two-hybrid protocol to detect interaction between bait and prey proteins. If the bait and prey proteins interact, the DNA binding domain (BD) fused to the bait and the transcription activing domain (AD) fused to the prey are indirectly connected resulting in the activation of the reporter gene. UAS: upstream activating sequence (promoter) of the reporter gene.

Y2H screens have been extensively used to detect protein interactions among yeast proteins, with two of the earliest studies reporting 692 [Uetz et al. 2000] and 841 [Ito et al. 2000] interactions for S. cerevisiae. In the bacteria Helicobacter pylori, one of the first applications of Y2H identified over 1,200 interactions, covering about 47% of the bacterial proteome [Rain et al. 2001]. Applications on fly proceeded on an even greater scale when Giot et al. [2003] identified 10,021 protein interactions involving 4,500 proteins in D. melanogaster. More recently, Vo et al. [2016] used Y2H to map binary interactions in the yeast S. pombe (fission yeast). This network, called FissionNet, consisted of 2,278 interactions covering 4,989 protein-coding genes in S. pombe. The Y2H system has also been applied for humans, with two initial studies [Rual et al. 2005, Stelzl et al. 2005] yielding over 5,000 interactions among human proteins. More recently, Rolland et al. [2014] employed Y2H to characterize nearly 14,000 human interactions.

However, inherent to this type of library screening, the number of detected false-positive interactions is usually high. Among the possible reasons for the generation of false positives is that the experimental compartmentalization (within the nucleus) for bait and prey proteins does not correspond to the natural cellular compartmentalization. Moreover, proteins that are not correctly folded under experimental conditions or are “sticky” may show non-specific interactions. The third source of false positives is the interaction of the preys themselves with reporter proteins, which can turn on the reporter genes. Von Mering et al. [2002] estimated the accuracy of classic Y2H to be less than 10%, with subsequent evaluations suggesting the number of false positives to be between 50% and 70% in large-scale Y2H interaction datasets for yeast [Bader and Hogue 2002, Bader et al. 2004].

Co-Immunoprecipitation/Affinity Purification (AP) Followed by Mass Spectrometry (Co-IP/AP followed by MS)

Complementing the in vivo Y2H screens are the in vitro Co-IP/AP followed by MS screens that identify whole complexes of interacting proteins, from which the binary interactions between proteins can be inferred [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007, Dunham et al. 2012]. The Co-IP/AP followed by MS screens consist of two steps: co-immunoprecipitation/affinity purification and mass spectrometry (Figure 2.2). In the first step, cells are lysed in a radioimmunoprecipitation assay (RIPA) buffer. The RIPA buffer enables efficient cell lysis and protein solubilization while avoiding protein degradation and interference with biological activity of the proteins. A known member of the set of proteins (the protein of interest or bait) is epitope-tagged and is either immunoprecipitated using a specific antibody against the tag or purified using affinity columns recognizing the tag, giving the interacting partners (preys) of the bait. Normally, this purification step is more effective when two consecutive purification steps are used with proteins that are doubly tagged (hence called tandem affinity purification or TAP). This results in an enrichment of native multi-protein complexes containing the bait. The individual components within each such purified complex are then screened by gel electrophoresis and identified using mass spectrometry.

Figure 2.2 Schematic representation of the co-immunoprecipitation/affinity purification followed by mass spectrometry (Co-IP/AP followed by MS) protocol. The protein of interest (bait) is targeted with a specific antibody and pulled down with its interactors in a cell lysate buffer. The individual components of the pulled-down complex are identified using mass spectrometry. These days, liquid chromatography with mass spectrometry (LCMS) instead of running the gel is increasingly being used more frequently for as a combined physical-separation and MS-analysis technique [Pitt 2009].

In one of the first applications of TAP/MS, Ho et al. [2002] expressed 10% of the coding open reading frames from yeast, and the identified interactions connected 25% of the yeast proteome as multi-protein complexes. Subsequently, Gavin et al. [2002], Gavin et al. [2006], and Krogan et al. [2006] purified 1,993 and 2,357 TAP-tagged proteins covering 60% and 72% of the yeast proteome, and identified 7,592 and 7,123 protein interactions from yeast, respectively. One of the first proof-of-concept studies for humans applied AP/MS to characterize interactors using 338 bait proteins that were selected based on their putative involvement in diseases, and the study identified 6,463 interactions between 2,235 proteins [Ewing et al. 2007].

Comparison of Y2H and AP/MS Experimental Techniques

A majority of the interaction data collected so far has come from Y2H screening. For example, approximately half of the data available in databases including IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007] are from Y2H screens [Brückner et al. 2009] (more sources of PPI data are listed in Table 2.2). This could in part be attributed to the inaccessibility of mass spectrometry due to the expensive large equipment that is required. But, in general, Y2H and AP/MS techniques are complementary in the kind of interactors they detect. If a set of proteins form a stable complex, then an AP/MS screen can determine all the proteins within the complex, but may not necessarily confirm every interacting pair (the binary interactions) within the complex. On the other hand, a Y2H screen can detect whether any given two proteins directly interact. While stable interactions between co-complexed proteins can be accurately determined using AP/MS techniques, Y2H techniques are useful for identifying transient interactions between the proteins. However, due to considerable functional cross-talk within cells, Y2H can also report an interaction even when the proteins are not directly connected. In addition, some types of interactions can be missed in Y2H due to inherent limitations in the technique—e.g., interactions involving membrane proteins, or proteins requiring posttranslational modifications to interact—but these limitations may also occur with AP/MS-based approaches [Brückner et al. 2009]. Therefore, only a combination of different approaches that necessarily also includes computational methods (to filter out the incorrectly detected interactions) will eventually lead to a fairly complete characterization of all physiologically relevant interactions in a given cell or organism.

Protein-Fragment Complementation Assay (PCA)

PCA is a relatively new technique which can detect in vivo protein interactions as well as their modulation or spatial and temporal changes [Michnick 2003, Morell et al. 2009, Tarassov et al. 2008]. Similar to Y2H, PCA is based on the principle of splitting a reporter protein into two fragments, each of which cannot function alone [Michnick 2003]. However, unlike Y2H, PCA is based on the formation of a biomolecular complex between the bait and prey, where both are fused to the split domains of the reporter. Importantly, the formation of this complex occurs in competition with alternative endogenous interaction partners present within the cell. The interaction brings the two split fragments in proximity enabling their non-covalent reassembly, folding, and recovery of protein reporter function [Morell et al. 2009]. Typically, the reporter proteins are fluorescent proteins, and the formation of biomolecular complexes is visualized using biomolecular fluorescence complementation (BIFC). BIFC can also be used to map the interaction surfaces of these complexes. This enables investigation of competitive binding between mutually exclusive interaction partners as well as comparison of their intracellular distributions [Grinberg et al. 2004].

PCA can be used as a screening tool to identify potential interaction partners of a specific protein [Remy and Michnick 2004, Remy et al. 2007], or to validate the interactions detected from other techniques such as Y2H [Vo et al. 2016]. In one of the first applications of PCA on a genome-wide in vivo scale, Tarassov et al. [2008] identified 2,770 interactions among 1,124 proteins from S. cerevisiae. Vo et al. [2016] used PCA as an orthogonal assay to reconfirm the interactions detected in S. pombe (from the FissionNet network consisting of 2,278 interactions; discussed earlier). PCA has also been employed to validate interactions between membrane proteins or membrane-associated proteins [Babu et al. 2012, Shoemaker and Panchenko 2007] (discussed next).

Techniques for Inferring Membrane-Protein Interactions

Membrane proteins are attached to or associated with membranes of cells or their organelles, and constitute approximately 30% of the proteomes of organisms [Carpenter et al. 2008, Von Heijne 2007, Byrne and Iwata 2002]. Being non-polar (hydrophobic), membrane proteins are difficult to crystallize using traditional X-ray crystallography compared to soluble proteins, and are the least studied among all proteins using high-throughput proteomics techniques [Carpenter et al. 2008].

Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes, in sending and receiving chemical signals and propagating electrical impulses across membranes, in anchoring enzymes and other proteins to membranes, in controlling membrane lipid composition, and in organizing and maintaining the shape of organelles and the cell itself [Lodish et al. 2000]. In humans, the G-protein-coupled-receptors (GPCRs), which are membrane proteins involved in signal transduction across membranes, alone account for 15% of all membrane proteins; and 30% of all drug targets are GPCRs [Von Heijne 2007]. Due to the key roles of membrane proteins, identifying interactions involving these proteins has important applications especially in drug development.

Membrane protein complexes are notoriously difficult to study using traditional high-throughput techniques [Lalonde et al. 2008]. Intact membrane-protein complexes are difficult to pull down using conventional AP/MS systems. This is due in part to the hydrophobic nature of membrane proteins as well as the ready dissociation of subunit interactions, either between trans-membrane subunits or between trans-membrane and cytoplasmic subunits [Barrera et al. 2008]. Further, membrane protein structure is difficult to study by commonly used high-resolution methods including X-ray crystallography and NMR spectroscopy.

A major avenue by which one can understand membrane proteins and their complexes is by mapping the membrane-protein “subinteractome”—the subset of interactions involving membrane proteins. Conventional Y2H system is confined to the nucleus of the cell thereby excluding the study of membrane proteins. New biochemical techniques have been developed to facilitate the characterization of interactions among membrane proteins. Among these is the split-ubiquitin membrane yeast two-hybrid (MYTH) system [Miller et al. 2005, Kittanakom et al. 2009, Stagljar et al. 1998, Petschnigg et al. 2012]. This system is based on ubiquitin, an evolutionarily conserved 76-amino acid protein that serves as a tag for proteins targeted for degradation by the 26S proteasome. The presence of ubiquitin is recognized by ubiquitin-specific proteases (UBPs) located in the nucleus and cytoplasm of all eukaryotic cells. Ubiquitin can be split and expressed as two halves: the amino-terminal (N) and the carboxyl terminal (C). These two halves have a high affinity for each other in the cell and can reconstitute to form pseudo-ubiquitin that is recognizable by UBPs.

In MYTH, the bait proteins are fused to the C-terminal of a split-ubiquitin, and the prey proteins are fused to the N-terminal. The two halves reconstitute into a pseudo-ubiquitin protein if there is affinity between the bait and prey proteins. This pseudo-ubiquitin is recognized by UBPs, which cleaves after the C-terminus of ubiquitin to release the transcription factor, which then enters the nucleus to activate reporter genes.

Two of the earliest studies using the MYTH screens reported a fair number of interactions among membrane proteins from yeast: 343 interactions among 179 proteins by Lalonde et al. [2010], and 808 interactions among 536 proteins by Miller et al. [2005]. PCA has also been adopted to identify and/or verify membrane-protein interactions. For example, Babu et al. [2012] used PCA to validate and integrate 1,726 yeast membrane-protein interactions obtained from multiple studies, and these encompassed 501 putative membrane protein complexes.

The mammalian version of membrane yeast two-hybrid, MaMTH, is also based on the split-ubiquitin assay and is derived from the MYTH assay. Stagljar and colleagues [Petschnigg et al. 2014, Yao et al. 2017] used MaMTH to probe interactions involving the epidermal growth factor receptor/receptor tyrosine-protein kinase (RTK) ErbB-1 (EGFR/ERBB1), Erb-B2 receptor tyrosine kinase 2 (ERBB2), and other RTKs that localize to the plasma membrane in human cells. When applied to human lung cancer cells, the assay identified 124 interactors for wild-type and mutant EGFR [Petschnigg et al. 2014].

2.2 Data Sources for PPIs

Several public and proprietary databases now catalog protein interactions from both lower-order model and higher-order organisms (summarized in Table 2.2). These databases contain PPI data in an acceptable format required for data deposition, such as IMEx (http://www.imexconsortium.org/submit-your-data) [Orchard et al. 2012]. The Biomolecular Interaction Network Database (BIND) [Bader et al. 2003], now called Biomolecular Object Network Database (BOND), includes experimentally determined protein-protein, protein-small molecule, and protein-nucleic acid interactions. BioGrid [Stark et al. 2011] catalogs physical and genetic interactions inferred from multiple high-throughput experiments. The Database of Interacting Proteins (DIP) [Xenarios et al. 2002] contains experimentally determined protein interactions with a “core” subset of interactions that have passed quality assessment (for example, based on literature verification). The Centre for Cancer Systems Biology (CCSB) Interactome Database at Harvard [Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011] contains yeast, plant, virus, and human interactions. STRING [Von Mering et al. 2003, Szklarczyk et al. 2011] catalogs physical and functional interactions inferred from experimental and computational techniques. MIPS Comprehensive Yeast Genome Database (CYGD) [Güldener et al. 2005] and MIPS Mammalian Protein-Protein Interaction Database (MPPI) [Pagel et al. 2005] catalog protein interactions and also expert-curated protein complexes from yeast and mammals. The Human Protein Reference Database (HPRD) [Peri et al. 2004, Prasad et al. 2009] mainly contains experimentally identified human interactions. IRefIndex [Razick et al. 2008, Turner et al. 2010] and Integrated Interaction Database (IID) [Brown and Jurisica 2005] integrate experimental and computationally predicted interactions for human and several other species.

Table 2.2 Public and proprietary databases for protein-protein, protein-small molecule, and protein-DNA interactions

PPI DatabaseSourceReference
BioGridhttp://thebiogrid.org[Stark et al. 2011]
BINDhttp://bind.ca[Bader et al. 2003]
CCSBhttp://interactome.dfci.harvard.edu/[Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011]
CYGDhttp://mips.helmholtz-muenchen.de/genre/proj/yeast/[Güldener et al. 2005]
DIPhttp://dip.doe-mbi.ucla.edu[Xenarios et al. 2002, Salwinski et al. 2004]
EMBL-EBI IntActhttp://www.ebi.ac.uk/intact/[Hermjakob et al. 2004, Kerrien et al. 2012]
HAPPIhttp://discern.uits.iu.edu:8340/HAPPI/[Chen et al. 2009]
HPRDhttp://www.hprd.org/[Peri et al. 2004, Prasad et al. 2009]
OPHID/IIDhttp://ophid.utoronto.ca/[Brown and Jurisica 2005]
InnateDBhttp://www.innatedb.com/[Lynn et al. 2008]
iRefIndexhttp://irefindex.org/wiki/index.php?title=iRefIndex[Razick et al. 2008, Turner et al. 2010]
MINT/HomoMINThttp://mint.bio.uniroma2.it/mint/[Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007, Persico et al. 2005]
MIPShttp://mips.helmholtz-muenchen.de/proj/ppi/[Mewes et al. 2008]
MPPIhttp://mips.helmholtz-muenchen.de/proj/ppi/[Pagel et al. 2005]
STRINGhttp://string-db.org/[Von Mering et al. 2003, Szklarczyk et al. 2011]

2.3 Topological Properties of PPI Networks

A simple yet effective way to represent interaction data is in the form of an undirected network called a protein-protein interaction network or simply PPI network, given as G = 〈V, E〉, where V is the set of proteins and E is the set of physical interactions between the proteins. Such a network presents a global or “systems” view of the entire set of proteins and their interactions, and provides a topological (mathematical) framework to interrogate the interactions. In the definitions throughout this book, we also use V (G) and E(G) to refer to the set of proteins and interactions of a (sub)network of G. For a protein vV, the set N(v) or Nv includes all immediate neighbors of v, and deg(v) = |N(v)| is the degree of v. These neighbors together with their interactions, Ev = E(v) = {(v, u) : uN(v)} ∪ {(u, w) : uN(v), wN(v) ∩ N(u)}, constitute the local (immediate) neighborhood subnetwork of v.

PPI networks, like most real-world networks, have characteristic topological properties which are distinct from that of random networks. But, to understand this distinction we need to first understand what are random networks. Traditionally, random networks have been described using the Erdös-Rényi (ER) model, in which G(n, p) is a random network with |V| = n nodes and each possible edge connecting pairs of these nodes has probability p of existing [Erdös 1960, Bollobás 1985]. The expected number of edges in the network is , and the expected mean degree is np. Alternatively, a random network is defined as a network chosen uniformly at random from the collection of all possible networks with n nodes and m edges. If p is the probability for the existence of an edge, the probability for each network in this collection is,


where the closer p gets to 1, the more skewed the collection is toward networks with higher number of edges. When p = 1/2, the collection contains all possible networks with equal probability, .

A marked feature of PPI networks that makes them non-random is the extremely broad distribution of degrees of proteins: While a majority of proteins have small number of immediate binding partners, there exist some proteins, referred to as “hubs,” with unusually large number of binding partners. Moreover, the degrees of most proteins are larger than the average node degree of the network. This degree distribution P(k) can be approximated by a power law P(k) ∼ k−γ, where k is the node degree and γ is a small constant. Such networks are called scale-free networks [Barabási 1999, Albert and Barabási 2002].

Proteins within PPI networks exhibit an inherent tendency to cluster, which can be quantified by the local clustering coefficient for these proteins [Watts and Strogatz 1998]. For a selected protein v that is connected to |N(v)| other nodes, if these immediate neighbors were part of a clique (a complete subgraph), there will be interactions between them. The ratio of the actual number of interactions that exist between these nodes and the total number of possible interactions gives the local clustering coefficient CC(v) for the protein v:


The average clustering coefficient of the entire network is therefore given by


This clustering property gives rise to groups (subnetworks) of proteins in the PPI network that exhibit dense interactions within the groups, but sparse or less-dense interactions between the groups. The interactions between the groups occur via a few “central” proteins through which pass most paths in the network. The average shortest path length is given by


where (u, v) is the length of the shortest path between two connected nodes u and v. This average shortest path length of the PPI network is significantly shorter than that of an ER random network; in an ER network, the average shortest path length is proportional to ln n. However, both PPI networks as well as ER networks fall under “small-world” networks, because the average path length between any two nodes is still significantly smaller than the network size: n >> k >> ln n >> 1, where k is the average node degree. The average clustering coefficient of a PPI network is significantly higher than a random network constructed on the same protein set and with the same average shortest path length.

This small-world property can also be quantified using two coefficients: closeness and betweenness [Hormozdiari et al. 2007]. The closeness of v is defined based on its average shortest path length to all other proteins reachable from v (that is, within the same connected component as v) in the network,


where R(G, v) is the set of proteins reachable from v. Closeness is thus the inverse of the average distance of v to all other nodes in the network. The betweenness of the node v measures the extent to which v lies “between” any pair of proteins connected to v in the network. Let Sxy be the number of shortest paths between all pairs x, yR(G, v), and let Sxy(v) be the number of these shortest paths that pass through v. The betweenness Bet(v) of v is defined as:


The distributions of closeness and betweenness coefficients for PPI networks are significantly different from that of random networks [Hormozdiari et al. 2007]. In particular, PPI networks contain central proteins which have high betweenness and these connect and hold together different groups of proteins or regions of the network.

2.4 Theoretical Models for PPI Networks

Studying the topological properties of PPI networks has gained considerable attention in the last several years, and in particular various network models have been proposed to describe PPI networks. As new PPI data became available, these theoretical models have been improved to accurately model the data. These include the earliest models such as the Erdös-Rényi model [Erdös 1960, Bollobás 1985], and more recent ones such as the Barabási-Albert [Barabási 1999, Albert and Barabási 2002], Watts-Strogatz [Watts and Strogatz 1998], and hierarchical [Ravasz and Barabási 2003] network models.

With a fixed probability for each edge, the Erdös-Rényi (ER) networks (discussed above) look to be intuitive and seemingly correct, and were once commonly used to model real-world networks. However, in reality, ER networks are rarely found in real-world examples. Most real-world networks—including road and airline routes, social contacts, webpage links, and also PPI networks—do not have evenly distributed degrees. Moreover, although ER networks have the small-world property, they have almost no clustering effects. For these reasons, using ER networks as null models for comparison against real-world networks including PPI networks is usually inappropriate.

In the Barabási-Albert (BA) model [Barabási 1999, Albert and Barabási 2002], nodes are added one at a time to simulate network growth. The probability of an edge forming between an incoming new node and existing nodes is based on the principle of preferential attachment—that is, existing nodes with higher degrees have a higher likelihood of forming an edge with the incoming node. Hence, the degree distribution follows a power-law distribution (scale-free) P(k) ∼ k−γ and exhibits small-world behavior, but like ER, lacks modular organization (low clustering behavior). The BA model is particularly important as one of the earliest instances of network models proposed as a suitable null model, instead of ER, for comparisons against biological networks [Barabási 1999, Albert and Barabási 2002]. This led to a panoply of works describing the properties of various biological network types including metabolic [Wagner and Fell 2001] and PPI networks [Yook et al. 2004] using BA. At the same time, the deficiencies of the BA model started to become more clear: while the scale-free property is preserved, modularity is not. Thus, better null models are needed.

The Watts-Strogatz (WS) model [Watts and Strogatz 1998] is seldom used for modeling biological networks but demonstrates how easy it is to transition from a non-small world network (e.g., a lattice graph) to small-world network with random rewiring of a few edges. The generation procedure is amazingly simple: From a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p. If p = 0, then the graph is completely regular, and if 1, the graph is completely random/disordered. For the WS model, the intermediate region, where 0 < p < 1 can be selected to examine its intrinsic properties. Aside from potential use for investigating the level of ordered-ness of an observed network, the WS model has seen few applications in biology. Its major contribution toward network biology is that given how easy the small-world property can emerge from minor disruption of a regular lattice graph, it is not a unique defining property. Therefore, the myriad of research papers that describe the biological network under analysis as small world are not reporting particularly useful information.

The BA model is able to capture the scale-free property observed in biological networks but has very low internal clustering. This seemed at odds with the idea that biological networks, or at least PPI networks, are modular in nature. Proteins achieve their functionality by virtue of extensive interconnections with other proteins, forming simple physical interactions, which at higher levels can be envisaged as complexes and functional modules. To better capture the clustering effects in real biological networks, hierarchical models were proposed. Hierarchical network (HN) models [Ravasz and Barabási 2003] are iterative approaches for generating networks that encapsulate both scale-free and highly clustered behavior. For a real network, the average clustering coefficient for the entire network is higher than the BA model and ER model. To construct a HN model, tightly clustered cores with high clustering coefficient are first generated. These are then iteratively connected by selecting random nodes in each core, and having these connect to another. The downside, however, is that developing HN models requires making certain assumptions. For instance, we have to assume that we know the distribution of the sizes and clustering densities of the embedded modules. We also assume that these modules combine in an iterative manner. In biological networks, however, it seems that the distinction boundary between modules is not very sharp with high levels of interconnectivity.

PPI networks from the yeast S. cerevisiae and the bacteria H. pylori resulting from some of the high-throughput studies mentioned earlier [Uetz et al. 2000, Ito et al. 2000, Xenarios et al. 2002] have been shown to have scale-free degree distributions [Pržulj et al. 2004, Jeong et al. 2001, Maslov and Sneppen 2002]. However, the larger D. melanogaster (fruit fly) PPI network has been shown to decay faster than a power law [Giot et al. 2003]. Furthermore, the shortest path distribution and the frequencies of cycles of 3–15 nodes in the fruitfly network differ from those of randomly rewired networks which preserve the same degree distribution as the original PPI network [Giot et al. 2003]. To better capture these frequency distributions of node-cycles, geometric random graph models were proposed. In a geometric random graph model, the nodes correspond to independently and uniformly distributed points in a metric space, and two nodes are linked by an edge if the distance between them is smaller than or equal to some radius r, where the distance is an arbitrary distance norm in the metric space [Penrose 2003]. Pržulj et al. [2004] used geometric random graphs to model PPI networks, by defining the points in 2D, 3D, and 4D Euclidean space, and distance between the points measured using Euclidean distance. Pržulj et al. studied the similarity between geometric random graphs and PPI networks using the distributions for graphlets. A graphlet is a connected network with a small number of nodes, and graphlet frequency is the number of occurrences of a graphlet in a network. The authors defined 29 types of graphlets using 3–5 nodes (Figure 2.3). The relative frequency of a graphlet i from the PPI network G is defined as Ni(G)/T(G), where Ni(G) is the number of graphlets of type i ∈ {1, …, 29}, and is the total number of graphlets of G. The same is defined for a generated geometric random network H. The relative graphlet frequency distance D(G, H) between the two networks G and H is measured as

Figure 2.3 The 29 graphlets of 3–5 nodes defined by Pržulj et al. [2004].


where Fi(G) = − log(Ni(G)/T(G)); the logarithm being used because the graphlet frequencies can differ by several orders of magnitude. The authors generated geometric random networks with the same number of nodes as the proteins in S. cerevisiae and D. melanogaster PPI networks from high-throughput experiments. The authors found that, although the degree distributions of these PPI networks were closer to that of scale-free random networks, other topological parameters matched closely with those of geometric random networks. Specifically, the diameter, local and whole-network clustering coefficients, and the relative graphlet frequencies, computed as above, showed that PPI networks were closer to geometric random networks than scale-free networks. Furthermore, the authors suggest that as the quality and quantity of PPI data improve, the geometric random network may become better suited compared to scale-free and other networks to model PPI networks.

2.5 Visualizing PPI Networks

Visualization concerns an important component of the analysis of PPI networks. Very simply, a PPI network is visualized as dots and lines, where the dots (or other shapes) represent proteins and the lines connecting the dots represent interactions between the proteins. Such a visualization enables quick exploration of the topological properties of PPI networks—for example, counting the neighbors of a selected protein in a PPI network, the number of connected components in the network, or even spotting dense and sparse regions of the network. However, the ease of this exploration and subsequent analysis depends on how effective is the visualization method or tool used to render the PPI network. This rendering of the PPI network concerns the field of graph or network layout, where layout algorithms are used to draw the network—by appropriately positioning the dots and lines—in a 2D space. A good layout should be able to (visually) bring out the topological properties of the PPI network easily, and this has been a subject of research in graph visualization for several years. Here we briefly introduce some of the commonly used algorithms for PPI network layout; for details the readers are referred to the excellent reviews of Morris et al. [2014], Agaptio et al. [2013], and Doncheva et al. [2012].

Random layout arranges the dots (nodes) and lines (edges) in a random manner in the 2D space. The advantage of this algorithm is its simplicity, but on the other hand, it presents a high number of criss-crossing edges, and does not necessarily use the available space optimally, especially for large networks. Circular layout arranges the nodes in succession, one after the other, on a circle. While this algorithm also suffers from criss-crossing of edges, it is widely used to visualize small (sub)networks such as protein complexes and pathways. Tree layout arranges the network as a tree with a hierarchical organization of the nodes. This is obviously more suitable for visualizing trees than networks with cycles. Often, the layout is “ballooned out” by placing the children of each node in the tree on a circle surrounding the node, resulting in several concentric circles. Force-directed layout places the nodes according to a system of forces based on physical concepts in spring mechanics. Typically, the system combines attractive forces between adjacent nodes with repulsive forces between all pairs of nodes to seek an optimal layout in which the overall edge lengths are small while the nodes are well separated.

Figure 2.4 PPI network visualization using the Cytoscape 3.4.0 tool [Shannon et al. 2003, Smoot et al. 2010]. A portion of the human PPI network (1,977 proteins and 5,679 interactions) downloaded from BioGrid [Stark et al. 2011] is visualized here using force-directed layout. Basic statistics—average number of neighbors, network diameter, etc.—are displayed for the network. Proteins (e.g., BRCA1), protein complexes (e.g., eukaryotic initiation factor 4F and nuclear pore complexes), pathways (e.g., Fanconi anaemia pathway), and cellular processes (e.g., DNA-damage repair and chromatin remodeling) are “pulled-out” and highlighted. Cytoscape provides “link-out” to external databases and tools—e.g., KEGG [Kanehisa and Goto 2000]—to enable further analysis.

Once the PPI network is laid out, a good visualization tool should allow at least some basic visual analysis of the network. The following aspects become important here (see Figure 2.4). The ease of navigation through the PPI network to explore individual proteins and interactions is of prime importance. In particular, the tool should be able to load and enable nagivation of even large networks. Next is the provision to annotate the network using internal (e.g., labeling nodes by serial numbers or by their network properties) or external information (see below). The tool should also be able to compute (basic) topological properties of the network—for example, node degree, shortest path lengths, and clustering, closeness, and betweenness coefficients. These statistics help users get at least a preliminary idea of the network. Another valuable feature of a good tool is linkout to external databases, for example to PubMed literature (http://www.ncbi.nlm.nih.gov/pubmed), National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/), UniProt or SwissProt (http://www.uniprot.org/) [Bairoch and Apweiler 1996, UniProt 2015], BioGrid (http://thebiogrid.org/) [Stark et al. 2011], Gene Ontology (GO) (http://www.geneontology.org/) [Ashburner et al. 2000], and Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/pathway.html) [Kanehisa and Goto 2000]. These enable functional annotation of proteins and interactions. Finally, the tool should also possibly support advanced analyses such as clustering of the network, comparison (based on topological characteristics, for example) between networks, and enrichment analysis, e.g., using GO terms. Table 2.3 lists some of the popular tools available for PPI network visualization and (visual) analysis. OMICS Tools (http://omictools.com/network-visualization-category) maintains an exhaustive list of visualization tools for PPI and other biomolecular network analysis.

Table 2.3 Software tools for PPI network visualization and analysis

Visualization ToolSourceReference
Arena3Dhttp://arena3d.org/[Pavlopoulos et al. 2011]
AVIShttp://actin.pharm.mssm.edu/AVIS2/[Seth et al. 2007]
BioLayouthttp://www.biolayout.org/[Theocharidis et al. 2009]
Cytoscapehttp://www.cytoscape.org/download.html[Shannon et al. 2003, Smoot et al. 2010]
Medusahttp://coot.embl.de/medusa/[Hooper and Bork 2005]
NAViGaTORhttp://ophid.utoronto.ca/navigator/download.html[Brown et al. 2009]
ONDEXhttp://www.ondex.org/[Köhler et al. 2006]
Ospreyhttp://biodata.mshri.on.ca/osprey/servlet/Index[Breitkreutz et al. 2003]
Pajekhttp://vlado.fmf.uni-lj.si/pub/networks/pajek/[Vladimir and Andrej 2004]
PIVOThttp://acgt.cs.tau.ac.il/pivot/[Orlev et al. 2004]
ProVizhttp://cbi.labri.fr/eng/proviz.htm[Florian et al. 2005]

2.6 Building High-Confidence PPI Networks

From our discussions on experimental protocols in earlier sections, we know that some protocols—including the AP/MS ones—offer only pulled-down complexes consisting of baits and their preys without specifying the binary interactions between these components. Therefore, binary interactions need to be specifically inferred between the bait and each of its preys within the pulled-down complexes. However, not all preys in a pulled-down complex interact directly with the bait (but, get pulled down due to their interactions with other preys in the complex). Therefore, it is necessary to infer binary interactions not just between the bait and its preys but also between the interacting preys. Yet, care should be taken to avoid inferring spurious (false-positive) interactions between the preys that do not interact. To overcome these uncertainties, often a balance is sought between two kinds of models, spoke and matrix, which are used to transform pulled-down complexes into binary interactions between the proteins [Gavin et al. 2006, Krogan et al. 2006, Spirin and Mirny 2003, Zhang et al. 2008].

The spoke model assumes that the only interactions in the complex are between the bait and its preys, like the spokes of a wheel. This model is useful to reduce the complexity of the data, but misses all (true) prey–prey interactions. On the other hand, the matrix model assumes that every pair of protein within a complex interact. This model can cover all possible true interactions, but can also predict a large number of spurious interactions. An empirical evaluation using 1,993 baits and 2,760 preys from the dataset from Gavin et al. [2006] against 13,384 pairwise protein interactions between proteins within the expert-curated MIPS complexes [Mewes et al. 2006] revealed 80.2% true-negative (missing) interactions and 39% false-positive (spurious) interactions in the spoke model, and 31.2% true-negative interactions but 308.7% false-positive interactions in the matrix model [Zhang et al. 2008]. However, note that many of the missing interactions could be due to the lack of protein coverage in these experiments. A balance is struck between the two models that covers as many true interactions between the baits and preys as possible without allowing too many false interactions [Gavin et al. 2006]; see Figure 2.5.

Gaining Confidence in PPI Networks

Although high-throughput studies have been successful in mapping large fractions of interactomes from multiple organisms, the datasets generated from these studies are not free from errors. High-throughput PPI datasets often contain a considerable number of spurious interactions, while missing a substantial number of true interactions [Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Consequently, a crucial challenge in adopting these datasets for downstream analysis—including protein complex prediction—is in overcoming these challenges.

Figure 2.5 Inferring protein interactions from pull-down protein complexes. Bait–prey relationships from pull-down complexes are assembled using the spokes model, where the bait is connected to each of the preys (A); or using the matrix model, where every bait–prey and prey–prey pair is connected (B). However, these models either miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models using weighting of interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).

Spurious Interactions

Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider the localization (compartmentalization), time, and cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), the chances that two proteins, belonging to two different compartments and are not likely to meet during their lifetimes in live cells, end up testing positive for interaction, is high. Similarly, in vitro TAP pull downs are carried out using cell lysates in an environment where every protein is present in the same “uncompartmentalized soup” [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact under these laboratory conditions, it is not certain that they will ever meet or interact during their life times in live cells. Opportunities are high for “sticky” molecules to function as bridges between proteins causing these proteins to interact promiscuously with partners that never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions—between bait and prey or between preys—can also result in inference of spurious interactions (further discussed below). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.

Missing Interactions and Lack of Concordance Between Datasets

Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of interactions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on post-translational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show bias toward abundant proteins (e.g., soluble proteins) and bias against certain kind of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens being targeted in the nucleus fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations effectively result in a considerable number of missed interactions in interactome datasets.

Welch [2009] summed up the status of interactome maps, based on these above limitations, as “fuzzy,” i.e., error-prone, yet filled with promise.

Estimating Reliabilities of Interactions

The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration ensures that all or most regions of the interactome are sufficiently represented in the PPI network. However, overcoming spurious interactions still remains a challenge, which is further magnified when datasets are integrated. Therefore, estimating the reliabilities of interactions becomes necessary, thereby keeping only the highly reliable interactions while discarding the spurious or less-reliable ones.

Confidence or reliability scoring schemes offer a score (weight) to each interaction in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI network G = 〈V, E, w 〉, the score w(u, v) encodes the confidence for the physical interaction between the two proteins u and v. The scoring function w: V × V → R accounts for the biological variability and technical limitations of the experiments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes.

Table 2.4 Classification of confidence scoring (PPI weighting) schemes for protein interactions

ClassificationScoring SchemeReference
Sampling or counting basedBootstrap sampling[Friedel et al. 2009]
Comparative Proteomic Analysis (ComPASS)Dice coefficientaHypergeometric samplingSignificance Analysis of INTeractions (SAINT)Socio-affinity scoring[Sowa et al. 2009][Zhang et al. 2008][Hart et al. 2007][Choi et al. 2011, Teo et al. 2014][Gavin et al. 2006]
Independent evidence-basedBayesian networks and C4.5 decision treesTopological Clustering Semantic Similarity (TCSS)Purification Enrichment (PE)[Krogan et al. 2006][Jain and Bader 2010][Collins et al. 2007]
Topology-basedCollaborative Filtering (CF)Functional Similarity (FS) Weight aGeometric embeddingIterative Czekanowski-Dice(ICD) distance aPageRank affinity[Luo et al. 2015][Chua et al. 2006][Pržulj et al. 2004, Higham et al. 2008][Liu et al. 2008][Voevodski et al. 2009]

a. Dice coefficient, FS Weight, and Iterative CD scoring schemes can also be considered as independent evidence-based schemes, because if a pair of proteins have several common partners then these proteins most likely perform the same or similar functions and/or are present in the same cellular compartment (a biological evidence).

Sampling or Counting-Based Schemes

These schemes estimate the confidence of protein pairs by measuring the number of times each protein pair is observed to interact across multiple trials against what would be expected by chance given the abundance of each protein in the library. If the protein pairs are coming from the same experiment, the counting is performed across multiple purifications of the experiment. Given multiple PPI datasets, this idea can be extended to score interacting pairs by measuring the number of times each pair is observed across the different datasets against what would be expected from random given the number of times these proteins appear across the datasets. However, if the PPI datasets come from different experiments (e.g., Y2H and TAP/MS-based), which is usually the case, then it is useful to capture the relative reliability of each experimental technique or source of the datasets into this computation. For example, if Y2H is believed to be less reliable than TAP/MS-based techniques, then protein pairs can be assigned lower weights when observed in Y2H datasets, but assigned higher weights when observed in TAP/MS datasets.

In the study by Gavin et al. [2006], a “socio-affinity” scheme based on this counting idea was used to estimate confidence for interactions inferred from pulled-down complexes detected from TAP purifications. The interactions within the pulled-down complexes are inferred as a combination of spoke and matrix-modeled relationships. A socio-affinity index SA(u, v) then quantifies the tendency for two proteins u and v to identify each other when tagged (spoke model, S) and to co-purify when other proteins are tagged (matrix model, M):


where, for the spoke model (S), is the number of times that u retrieves v when u is tagged; is the fraction of purifications when u was bait; is the fraction of all retrieved preys that were v; nbait is the total number of purifications (i.e., using baits); and is the number of preys retrieved with u as bait. These terms are similarly defined for v as bait. For the matrix model (M), is the number of times that u and v are seen in purifications as preys with baits other than u or v; and are as above; and nprey is the number of preys observed with a particular bait (excluding the bait itself).

Friedel et al. [2009] combined the bait–prey relationships detected from the Gavin et al. [2006] and Krogan et al. [2006] experiments, and used a random sampling-based scheme to estimate confidence of interactions. In this approach, a list Φ= (ϕ1, …, ϕn) purifications were generated where each purification ϕi consisted of one bait bi and the preys pi,1, …, pi,m identified for this bait in the purification: ϕi = 〈bi, [pi,1 …, pi,m]〉. From, Φ, l = 1000 bootstrap samples were created by drawing n purifications with replacement. This means that the bootstrap sample Sj(Φ) contains the same number of purifications as Φ and each purification ϕi can be contained in Sj(Φ) once, multiple times, or not at all, with multiple copies being treated as separate purifications. Interaction scores for the protein pairs are then calculated from these l bootstrap samples using socio-affinity scoring as above, where each protein pair is counted for the number of times the pair appeared across randomly sampled sets of interactions against what would be expected for the pair from random based on the abundance of each protein in the two datasets.

Zhang et al. [2008] modeled each purification as a bit vector which lists proteins pulled down as preys against a bait across different experiments. The authors then used the Sørensen-Dice similarity index [Sørensen 1948, Dice 1945] between the vectors to estimate co-purification of preys across experiments, and thus the interaction reliability between proteins. Specifically, the pull-down data is transformed into a binary protein pull-down matrix in which a cell [u, i] in the matrix is 1 if u is pulled down as a prey in the experiment or purification i, and a zero otherwise. For two protein vectors in this matrix, the Sørensen-Dice similarity index, or simply the Dice coefficient, is computed as follows:


where q is the number of the matrix elements (experiments or purifications) that have ones for both proteins u and v; r is the number of elements where u has ones, but v has zeroes; and s is the number of elements where v has ones, but u has zeroes. If u and v indeed interact (directly or as part of a complex), then most likely the two proteins will be frequently co-purified in different experiments. The Dice coefficient therefore estimates the fraction of times u and v are co-purified in order to estimate the interaction reliability between u and v.

Hart et al. [2007] generated a Probabilistic Integrated Co-complex (PICO) network by integrating matrix-modeled relationships from the Gavin et al. [2002], Gavin et al. [2006], Krogan et al. [2006], and Ho et al. [2002] datasets using hypergeometric sampling. Specifically, the significance (p-value) for observing an interaction between the proteins u and v at least k times in the dataset is estimated using the hypergeometric distribution as


where k is the number of times the interaction between u and v is observed, n and m are the total number of interactions for u and v, respectively, and N is the total number of interactions in the entire dataset. The lower the p-value, the lesser is the chance that the observed interaction between u and v is random, and therefore higher is the chance that the interaction is true.

Methods such as Significance Analysis of INTeractome (SAINT) are based on quantitative analysis of mass spectrometry data. SAINT, developed by Choi et al. [2011] and Teo et al. [2014], assigns confidence scores to interactions based on the spectral counts of proteins pulled down in AP/MS experiments. The aim is to convert the spectral count Xij for a prey protein i identified in a purification of bait j into the probability of true interaction between the two proteins, P(True|Xij). For this, the true and false distributions, P(Xij|True) and P(Xij|False), and the prior probability πT of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their “interaction abundance” is proportional to the product XiXj of their spectral counts Xi and Xj. To compute P(Xij|True), the spectral counts Xi and Xj are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same principle is applied to compute P(Xij|False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction P(True|Xij). The interactions are then ranked in decreasing order of their probabilities, and a threshold is used to select the most likely true interactions.

Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a comparative methodology to assign scores to proteins identified within parallel proteomic datasets. It constructs a stats table X[k × m] where each cell X[i, j] = Xi, j is the total spectral count (TSC) for an interactor j (arranged as m rows) in an experiment i (arranged as k columns). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins to be candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally regardless of the number of replicates or their TSCs. Let X̄j be the average TSC for interactor j across all the experiments,


The Z-score is computed as


where σj is the standard deviation for the spectral counts of interactor j across the experiments. The D-score improves the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales Xi, j as


and p is the number of replicate runs in which the interactor is present. A D-score distribution is generated using a simulated random dataset, and a D-score threshold DT is determined below which 95% of this randomized data falls. A normalized D-score is then computed using this threshold as . All interactors with are considered true, whereas those with are less likely to be bona fide interactors. Huttlin et al. [2015] employed ComPASS to identify 23,744 high-confidence interactors for 2,594 baits expressed in human embryonic kidney (HEK293T) cells (“the BioPlex network,” which as of 2016 contains over 50,000 interactions: http://wren.hms.harvard.edu/bioplex/). The readers are referred to the review by Nesvizhskii [2012] for a number of other scoring methods based on analysis of mass spectrometry data.

Evidence-Based Schemes

These schemes use external or independence evidence to estimate confidence of interactions in the PPI dataset. For example, these evidence may include Gene Ontology (GO) annotations [Ashburner et al. 2000, Mi et al. 2013] for protein functions and localization (compartmentalization), and co-complex memberships in validated complexes. Some of the methods are learning-based; for example, given a (training) set of known interacting pairs and GO annotations, the methods learn (train on) the conditional probability distribution for interacting pairs with and without similar GO annotations (e.g., similar functions or localization), and using this learned distribution, the methods estimate the probability of interaction for the proteins in each pair. In Krogan et al.’s study [Krogan et al. 2006], a machine learning approach using Bayesian networks and C4.5-decision trees trained on validated physical interactions and functional evidence—co-occurrence in manually curated complexes from MIPS—was used to estimate confidence for protein pairs in a spoke-modeled experimental dataset. Collins et al. [2007] developed Purification Enrichment (PE) scoring to generate a “Consolidated network” using the matrix-modeled relationships from Gavin et al. and Krogan et al. datasets. The PE scoring is based on a Bayes classifier trained on manually curated co-complexed protein pairs, GO annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles. The Consolidated network was shown to be of high quality, comparable to that of PPIs derived from low-throughput experiments.

In other methods, explicit learning may not be involved, but instead the evidence is directly used to assess the interaction confidence of protein pairs. For example, Resnick’s measure [Resnick 1995] for computing the semantic similarity between annotation terms has been adopted to compute confidence based on GO annotations between the proteins [Xu et al. 2008, Pesquita et al. 2008, Jain and Bader 2010]. Specifically, the semantic similarity between two ontology terms (S, T) having a set A of common ancestors in the GO graph is given as the information content,


where p(A) is the fraction of proteins annotated to term A and all its descendants in the GO graph. Suppose that proteins u and v are annotated to sets of GO terms S and T, respectively. Then the semantic similarity between u and v is defined as the maximum information content (Resnick’s measure) of the set S × T,


The interaction confidence between u and v is then estimated as sim(u, v).

GO graphs tend to be unbalanced with some paths containing more details (depth) compared to others, which stems from the complex biological structure of the GO annotations. However, this creates a bias against terms that do not represent such complex structures, i.e., terms that do not have sufficient depth in the GO graph. To account for this topological imbalance of the GO graph, Jain and Bader [2010] developed Topological Clustering Semantic Similarity (TCSS) which collapses subgraphs that define similar concepts. Terms that are lower down the GO tree have higher information content (i.e., more specific) than the terms at higher levels (i.e., less specific). A cut-off is used to identify subgraphs—subgraph root terms and all their children—with high information content. Since GO terms often have multiple parents, it is likely that this results in overlapping subgraphs. Overlaps between subgraphs are removed in two steps: edge removal by transitive reduction and term duplication. In the edge-removal step, a reduction is performed on the subgraphs: If nodes u and v are connected both via a directed edge uv as well as a directed path uw1, …, wkv, then a transitive reduction is performed to preserve only uv. After this step, if a term still belongs to more than one subgraph, then the term and all its descendents are duplicated across the subgraphs. The similarity between two proteins is measured using this reduced GO graph using Resnick’s similarity between their GO terms, as described above.

Topology-Based Schemes

These schemes analyze the topology of the PPI network—usually in the immediate neighborhood of each protein pair—to estimate interaction confidence for the protein pairs. If the proteins in an interacting pair have many common neighbors in the network, the proteins and their shared neighbors have similar functions and/or are co-localized [Batada et al. 2004], and it is likely that the observed interaction between the proteins is true. This can be understood from the following simple example. Two proteins need to be localized to the same compartment to interact physically. Let us assume that a PPI screen with a false-positive rate of p reports that protein A interacts with C1, …, Cn and D1, …, Dm, and another protein B interacts with C1, …, Cn and E1, …, Em',. Suppose that each of these proteins can be localized to say h subcellular compartments with equal chance. The probability that A and B interact with a Ci in two different places is therefore x = (1 − p) · (1 − p) · (h − 1)/h. Thus, the probability that A and B interact with each C1, …, Cn in two different compartments is xn, and the probability that A and B interact with some Ci in the same compartment is 1 − xn, which monotonically increases with n. That is, the more common partners A and B have (as reported by the screen), the higher the chance they are in the same compartment. Thus, the more common partners the two proteins have in the PPI network, the higher the chance they satisfy the co-localization requirement for interaction. In other words, topology-based schemes based on counting common partners can be viewed as ways to guess whether A and B are likely to be in the same compartment (without needing to know explicitly which compartment); that is, these indices are in fact based on exploiting the biological fact that two proteins must be in the same compartment to physically interact (i.e., a piece of biological information). Therefore, Functional Similarity Weighting (FS Weight) [Chua et al. 2006], Iterative Czekanowski-Dice (CD) weight [Liu et al. 2008], and Dice coefficient [Zhang et al. 2008] can be classified as both topology-based as well as biological evidence-based schemes.

FS Weight, proposed by Chua et al. [2006], is inspired from the graph theory measure Czekanowski-Dice (CD) distance, which is given by


where Nu includes u and all the neighbors of u, similarly for Nv, and NuΔNv = (NuNv) ∪ (NvNu) is the symmetric difference between the sets of neighbors Nu and Nv. CD is a distance (dissimilarity) measure. If Nu = Nv, then CD(u, v) = 0; and if NuNv = ∅, then CD(u, v) = 1.

FS Weight takes inspiration from CD to use common neighbors, and estimates the confidence of the physical interaction between u and v based on their common neighbors. Chua et al. [2006] show that proteins share functions with their strictly indirect neighbors more than with their strictly direct neighbors, and proteins share functions with even higher likelihood with proteins that are simultaneously their direct and indirect neighbors than with proteins that are either strictly their direct or strictly their indirect neighbors. The weight FS(u, v) of the interaction between u and v is estimated as


where λu,v and λv,u are used to penalize protein pairs with very few neighbors, and are given by


where navg is the average number of level-1 neighbors that a protein has in the network. FS Weight thus assigns higher weight to protein pairs with larger number of common neighbors. FS Weight can be used to predict new functional associations between proteins based on the number of neighbors they share; some of these functional associations may be physical interactions.

In Iterative CD, proposed by Liu et al. [2008], the weight for each interaction is computed using the number of neighbors shared between the two interacting proteins, in a manner similar to FS Weight. However, Iterative CD then iteratively corrects these weights for the interactions, such that the weights computed in an iteration uses the weights computed in the previous iteration. By doing so, Iterative CD progressively reinforces the weights of true interactions and dampens the weights of false-positive interactions with each iteration. Iterative CD begins with an unweighted network (all weights set to 1) or a network with weights coming from some prior evidence (e.g., reliabilities of experiments from which the protein pairs were inferred). The weight wk(u, v) for each protein pair (u, v) in the k-th (k > 1) iteration is estimated as


where w1(u, v) = 1 if the interaction (u, v) exists in the original network, w1(u, v) = 0 if the interaction (u, v) does not exist; alternatively, w1(u, v) = r(e)(u,v), which is the reliability of the experiment e used to infer (u, v). The parameters λu and λv penalize protein pairs with very few common neighbors. Liu et al. show that the iterative procedure converges in two iterations for typical PPI networks, but 10–30 iterations may be necessary if the network has high levels of noise. The procedure produces a weight between 0 and 1 for each interaction, and a cut-off (recommended 0.2) is used to filter out low-scoring interactions.

Luo et al. [2015] scored interactions based on collaborative-filtering (CF). This approach is inspired by personalized recommendation in e-commerce where the problem is to identify useful patterns reflecting the connection between users and items from their usage history, and make reliable predictions for possible user-item links based on these patterns [Herlocker et al. 2002]. The CF scheme first computes the similarity sim(u, v) between a pair of interacting proteins u and v in the PPI network using the cosine distance of their adjacency vectors,


where 〈…〉 denotes the inner product between the vectors, and ||.|| denotes the Euclidean norm of the vectors. Therefore, the closer sim(u, v) is to 1, more the common neighbors that u and v share. This cosine distance is then rescaled by incorporating a parameter which, similar to λ in FS Weight and Iterative CD, is used to account for spurious neighbors in the network,

Like in e-commerce where people judge an item based on multiple reviews, preferably from known contacts, the score sim(u, v) from Equation (2.21) is rescaled as:


where (ru,v)d (d being a tunable parameter) is the mean of the scores of interactions with common neighbors, {(u, x) : xN(u) ∩ N(v)} ∪ {(v, x) : xN(u) ∩ N(v)}, in the PPI network.

Voevodski et al. [2009] developed PageRank Affinity, a scoring scheme inspired from Google’s PageRank algorithm [Haveliwala 2003], to score interactions in the PPI network. PageRank Affinity uses random walks to estimate the interconnectedness of protein interactions in the network. A random walk is a Markov process where in each step the walk moves from the current protein to the next with a certain preset probability. PageRank Affinity begins with an unweighted network G (with all weights set to 1) given by the adjacency matrix


The transition probability matrix for the random walks (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one,


where D is the degree matrix, which is a diagonal matrix containing the degrees of the proteins,


Therefore, the transition probabilities are given by the matrix WG, where the transition probabilities pt+1 at step t + 1 are simply pt+1 = ptWG. PageRank Affinity repeatedly simulates random walks beginning from each protein in the network. Let pr(s) be the steady-state probability distribution of a random walk with restart probability α, where the starting vector s gives the probability distribution upon restarting. Then, pr(s) is the unique solution for the linear system given by


The starting vector here is set as follows:


Therefore, pr(su) is the steady-state probability distribution of a random walk that always restarts at u. Let pr(su)[v] represent the steady-state probability value that a protein v has in the distribution vector pr(su). This can be thought of as the probability contribution that u makes to v, and is denoted as pr(uv). This provides the affinity between u and v when the walks always restart at u. The final affinity between u and v is computed as the minimum of pr(uv) and pr(vu).

Pržulj et al. [2004] and Higham et al. [2008] argue that PPI networks are best modeled by geometric random graphs instead of scale-free or small-world graphs, as suggested by earlier works [Watts and Strogatz 1998, Barabási 1999]. A geometric random graph G = 〈V, ϵ〉 with radius ϵ is a graph with node set V of points in a metric space and edge set E = {(u, v) : u, vV ; 0 < ||uv|| ≤ ϵ}, where ||.|| is an arbitrary distance norm in this space. The authors show that a PPI network can be represented as a geometric random graph by embedding the proteins of the PPI network in metric spaces R2, R3, or R4, and finding an ϵ such that any two proteins are connected in the PPI network if and only if the proteins are ϵ-close in the chosen metric space. This embedding can be used to not only weight all interactions, but also predict new interactions between protein pairs and remove spurious interactions from the PPI network: Interactions between proteins that are farther than ϵ-distance away in the metric embedding are pruned off as noise, whereas non-interacting protein pairs that are ϵ-close are predicted to be interacting.

The embedding of the PPI network in a chosen metric space is described briefly as follows. Given a PPI network of N proteins, the interaction weights for protein pairs (i, j) are interpreted as pairwise distances dij, and the task is to find locations in m-dimensional Euclidean space (vectors in Rm) for these proteins so that the pairwise distances are preserved, i.e., for all i, j. In PPI networks with only {0, 1} connectivity information, the length of the shortest path between proteins in the network is used in lieu of the Euclidean distance. Przulj et al. suggest that using the square root of the path length, , where pathij denotes the path length between i and j, is a good function for the purpose.

Conditional probabilities p(dij|interact) and p(dij|noninteract) are learned for distances dij from the embedding. Here, p(dij|interact) is the probability density function describing the distances between pairs of proteins which are known to interact, and p(dij|noninteract) is the probability density function describing the distances between pairs of proteins which do not interact in the dataset (and are known to not interact). Given a distance threshold δ, these probabilities are used to compute the posterior probabilities p(interact|dijδ) and p(noninteract|dijδ) for protein pairs to interact or not interact. For each protein pair (i, j) within the δ-threshold, the weight of the interaction between i and j is then estimated as


A threshold on this estimated weight is applied to remove false-positive interactions from the network.

Combining Confidence Scores

Interactions scored high by all or a majority of confidence-scoring schemes (consensus interactions) are likely to be true interactions. Therefore, a simple majority-voting scheme counts the number of times each interaction is assigned a high score (above the recommended cut-off) by each scheme, and considers only the interactions scored high by a majority of the schemes.

Chua et al. [2009] integrated multiple scoring schemes using a naïve Bayesian approach. For an interaction (u, v) that is assigned scores pi(u, v) by different schemes i (assuming the scores are in the same range [0,1]), the combined score can be computed as 1 − Πi(1 − pi(u, v)). However, even within the same range [0,1], usually different scoring schemes tend to have different distribution of scores they assign to the interactions. Some schemes assign high scores (close to 1) to most or a sizeable fraction of the interactions, whereas other schemes are more conservative and assign low scores to many interactions. To account for the variability in distributions of scores, it is important to consider the relative ranking of interactions within each scoring scheme instead of their absolute scores.

Chua et al. [2009] therefore proposed a ranked-based combination scheme, which works as follows. For each scheme i, the scored interactions are first binned in increasing order of their scores: The first 100 interactions are placed in the first bin, the second 100 interactions in the second bin, and so on. For each bin k in scheme i, a weight p(i, k) is assigned based on the number of interactions from the bin that match known interactions from an independent dataset:


While combining the scores for interaction (u, v), the Bayesian weighting is modified to: 1 − Π(i,k)∈D(u,v)(1 − p(i, k)), where D(u, v) is the list of scheme-bin pairs (i, k) that contain (u, v) across all schemes. This ensures that, irrespective of the scoring distribution of the schemes, if an interaction belongs to reliable bins, it is assigned a high final score.

Yong et al. [2012] present a supervised maximum-likelihood weighting scheme (SWC) to combine PPI datasets and to infer co-complexed protein pairs. The method uses a naïve Bayes maximum-likelihood model to derive the posterior probability that an interaction (u, v) is a co-complexed interaction based on the scores assigned to (u, v) across multiple data sources. These data sources include PPI databases, namely BioGrid [Stark et al. 2011], IntAct [Hermjakob et al. 2004, Kerrien et al. 2012], MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and STRING [Von Mering et al. 2003, Szklarczyk et al. 2011], and evidence from cooccurrence of proteins in the PubMed literature abstracts (http://www.ncbi.nlm.nih.gov/pubmed). The set of features is the set of these data sources, and a feature F has value f if proteins u and v are related by data source F with score f, else f = 0. The features are discretized using minimum-description length supervised discretization [Fayyad and Irani 1993]. Using a reference set of protein complexes, each (u, v) in the training set is given a class label co-complex if both u and v are in the same complex, otherwise it is given the class label non-co-complex. The maximum-likelihood parameters are learned for the two classes,


where Nc is the number of interactions with label co-complex, Nc, F = f is the number of interactions with label co-complex and feature value F = f, and likewise for interactions with label non-co-complex, N¬c, F = f′. After learning the maximum-likelihood model, the score for each interaction is computed as the posterior probability of being a co-complex interaction based on the naïve Bayes assumption of independence of the features:


where Z is the normalizing factor:


Table 2.5 displays some of the publicly available databases that integrate PPI datasets from multiple sources. However, a common problem with integrating multiple scored-datasets is the low agreement between the schemes or experiments used to produce these data. Moreover, most scoring methods favor high abundance proteins and are not effective enough to filter out common contaminants [Pu et al. 2015]. Therefore, better scoring and integrating schemes for PPI datasets are always required.

2.7 Enhancing PPI Networks by Integrating Functional Interactions

In addition to the presence of spurious interactions, another limitation in existing PPI datasets is the lack of coverage for true interactions among certain kinds of proteins (the “sparse zone” [Rolland et al. 2014]). This is in part due to limitations in experimental protocols (e.g., washing away of weakly connected proteins during purification of pull-down complexes in TAP experiments), and in part due to the under-representation of certain groups of proteins in these experiments (e.g., membrane proteins). The paucity of true interactions can considerably affect downstream analysis including protein complex prediction. For example, in an analysis by Srihari and Leong [2012a] using protein complexes from MIPS and CYC2008, it was found that many true complexes are embedded in sparse and disconnected regions of the PPI network, thereby altering their dense connectivity and modularity. As we shall see in a subsequent chapter, many computational methods find it difficult to identify these sparse complexes.

Computational prediction of protein interactions can be a good alternative to experimental protocols for enriching the PPI network with true interactions, and to “densify” regions of the network that are sparsely connected. However, accurate prediction of physical interactions between proteins is a difficult problem in itself, and as several studies have noted [Von Mering et al. 2003, Szklarczyk et al. 2011, Srihari and Leong 2012a] most predicted interactions tend to be “functional associations”—that is, relationships connecting functionally similar pairs of proteins—instead of actual physical interactions between the proteins. Nevertheless, if these functional interactions are successful in “topologically enhancing” the PPI network, these can still aid downstream analysis including protein complex prediction.

Table 2.5 Publicly available databases that integrate PPI datasets from multiple experimental, literature, and computational sources

DatabaseSourceReference
ComPPIhttp://comppi.linkgroup.hu/[Veres et al. 2015]
GeneMANIAhttp://www.genemania.org/[Warde-Farley et al. 2010]
HIPPIEhttp://cbdm.mdc-berlin.de/tools/hippie/[Schaefer et al. 2012]
HitPredicthttp://hintdb.hgc.jp/htp/[Patil et al. 2011]
HumanNethttp://www.functionalnet.org/[Lee et al. 2011]
I2D/OPHIDhttp://ophid.utoronto.ca/ophidv2.204/[Brown and Jurisica 2005, Brown and Jurisca 2007, Kotlyar et al. 2015]
InnateDBhttp://www.innatedb.com/[Lynn et al. 2008]
IntScorehttp://intscore.molgen.mpg.de/[Kamburov et al. 2012]
InWebhttp://www.lagelab.org/resources/[Li et al. 2017]
iRefIndexhttp://irefindex.org/wiki/index.php?title=iRefIndex[Razick et al. 2008, Turner et al. 2010]
MyProteinNethttp://netbio.bgu.ac.il/myproteinnet/[Basha et al. 2015]
MatrixDBhttp://matrixdb.univ-lyon1.fr/[Chautard et al. 2011]
IID/OPHIDhttp://ophid.utoronto.ca/iid/[Brown and Jurisica 2005, Kotlyar et al. 2015, Kotlyar et al. 2016]
PrePPIhttp://bhapp.c2b2.columbia.edu/PrePPI/[Zhang et al. 2012, Zhang et al. 2013]
PSICQUIChttp://psicquic.googlecode.com/[Aranda et al. 2011]
STRINGhttp://string-db.org/[Von Mering et al. 2003, Szklarczyk et al. 2011]
UniHIhttp://www.unihi.org/[Kalathur et al. 2014]

Computational Prediction of Protein Interactions

Although high-throughput techniques produce large amounts of data, the covered fraction of the interactomes from most organisms are far from complete [Cusick et al. 2009, Hart et al. 2006, Huang et al. 2007]. For example, while ∼70% of the interactomes from model organisms including S. cerevisiae have been mapped, these interactomes still lack interactions among membrane proteins [Von Mering et al. 2002, Hart et al. 2006, Huang et al. 2007]. Likewise, estimates show that less than 50% of the interactomes from higher-order organisms including human (∼10%) and other mammals have been mapped [Hart et al. 2006, Stumpf et al. 2008, Vidal 2016]. Computational prediction of interactions could partially compensate for this lack of coverage by predicting interactions between proteins in network regions with low coverage. Here, we only present a brief conceptual overview of computational methods developed for protein interaction prediction; for methodological details and for a comprehensive list of these methods, the readers are referred to excellent surveys by Valencia and Pazos [2002], Obenauer and Yaffe [2004], Zahiri et al. [2013], Ehrenberger et al. [2015], and Keskin et al. [2016].

Gene Neighbors. A commonly used approach to predict protein interactions in prokaryotes is by using co-transcribed or co-regulated sets of genes. It is based on the observation that, in prokaryotes, proteins encoded by genes that are transcribed or regulated as single units—e.g., as operons—are often involved in similar functions and tend to physically interact. Computational methods exist to predict operons in bacterial genomes using intergenic distances [Ermolaeva et al. 2011, Price et al. 2005]. Analysis of gene-order conservation in bacterial and archaeal genomes shows that protein products of 63–75% of operonic genes physically interact [Dandekar et al. 1998]. In eukaryotes, evidence from yeast and worm [Teichmann and Babu 2002, Snel et al. 2004] shows that co-regulated sets of genes encode proteins that are functionally similar and these proteins are highly likely to interact. These studies therefore provide the basis to predict new interactions between proteins using sets of co-transcribed and co-regulated sets of genes [Huynen et al. 2000, Bowers et al. 2004].

Phylogenetic Profiles. Similar phylogenetic profiles between proteins provide strong evidence for protein interactions [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. For a given protein, a phylogenetic profile is constructed as a vector of N elements, where N is the number of genomes (species). The presence or absence of the protein in a genome is indicated as 1 or 0 at the corresponding position in the phylogenetic profile. Phylogenetic profiles of a collection of proteins can be clustered using a bit-distance measure, to generate clusters of proteins that co-evolve. Therefore, proteins appearing in the same cluster are considered to be evolutionarily co-evolving and these proteins are inferred to be functionally related and physically interacting. This inference is based on the hypothesis that interacting sets of non-homologous proteins that co-evolve are under evolutionary pressure to conserve their interactions and to maintain their co-functioning ability [Shoemaker and Panchenko 2007, Sun et al. 2005].

Co-Evolution of Interacting Proteins. Interacting proteins often co-evolve so that changes in one protein in a pair leading to the loss of function or interaction should be compensated by correlated changes in the other protein [Shoemaker and Panchenko 2007]. This co-evolution is reflected by the similarity between the phylogenetic protein trees (or simply, protein trees) of non-homologous interacting protein families. A protein tree represents the evolutionary history of protein families, i.e., proteins or protein families that diverged from a common ancestor. These protein trees reconciled with their species trees have their internal nodes annotated to speciation and duplication events [Vilella et al. 2009]. TreeSoft (http://treesoft.sourceforge.net/treebest.shtml) provides a suite of tools to build and visualize protein trees. The similarity between two protein trees can be computed by aligning the corresponding distance matrices so as to minimize the difference between the matrix elements: the smaller the difference between the matrices, the stronger the co-evolution between the two protein families. Interactions are predicted between proteins corresponding to the aligned columns of the two matrices. The similarity between two protein trees is influenced by the speciation process and, therefore, there is a certain background similarity between any two protein trees, irrespective of whether the proteins interact or not. Statistical approaches exist to correct for these factors (phylogenetic subtraction) [Harvey and Pagel 1991, Harvey et al. 1995]. It is also worth noting that a protein can have multiple partners, and so taking into consideration its co-evolution with all its partners further enhances the accuracy of the interaction prediction [Juan et al. 2008].

Gene Fusion. Gene fusion is a common event in evolution, wherein two or more genes in one species fuse into a single gene in another species. Gene fusion is a result of duplication, translocation, or inversion events that affect coding sequences during the evolution of genomes. Therefore, gene fusions play an important role in determining the gene (and genomic) architecture of species. Gene fusions may occur to optimize co-transcription of genes involved in the fusion: by fusing two or more genes, it may be easier to transcribe these genes as a single entity, thus resulting in a single protein product. Typically, proteins coded by these fused genes in a species carry multiple functional domains, which originate from different proteins (genes) in the ancestor species. Therefore, one may infer interactions between these individual proteins in the ancestor species: it is likely that these proteins are partners in performing a particular function and they interact in the ancestor species, and that gene fusion has occurred in another species to optimize the transcription and to produce a single multidomain protein [Marcotte et al. 1999]. These fused proteins are referred to as chimeric or Rosetta Stone proteins [Marcotte et al. 1999]. The Rosetta Stone approach [Enright and Ouzounis 2001, Suhre 2007] infers protein interactions by detecting fusion events between protein sequences across species. In E. coli, this approach identified 6,809 putative interacting pairs of proteins, wherein both proteins from each pair had significant sequence similarity to a single (fused) protein from at least one other species (genome). The analysis of these interacting pairs revealed that, for more than half of these pairs, both the proteins were functionally related [Marcotte et al. 1999].

PPI Network Topology. The pattern of interactions between proteins in a PPI network says a lot about how proteins interact, and provides a way to predict new interactions. For example, if a pair of proteins have many common neighbors in the PPI network, then most likely the two proteins in the pair and their common neighbors are involved in the same or similar function(s). Therefore, one may infer a direct physical interaction between the two proteins based on the number of neighbors and/or functions the two proteins share. Chua et al. [2006] used FS Weight interaction-scoring approach in this manner to predict interactions between level-2 neighbors (connected via one other protein) in the PPI network. This is based on the observation that level-2 neighbors in the PPI network show the same or similar annotations for functions and/or cellular compartment, and therefore these are more likely to interact compared to random pairs of proteins in the network. These FS-weighted predicted interactions between level-2 neighbors are added back to the PPI network after removing low-weighted interactions. Using the same rationale, one can predict new interactions using other topology-based (common-neighbor counting) schemes including Dice coefficient [Zhang et al. 2008] and Iterative CD [Liu et al. 2008]. Likewise, the geometric embedding model [Pržulj et al. 2004, Higham et al. 2008] can also be used to predict new interactions: Proteins that are ϵ-close in the geometric embedding of the PPI network are more likely to interact compared to random pairs of proteins and proteins that are farther than ϵ-distance away in the embedding.

Functional Features. Interacting proteins are often involved in the same or similar functions. Therefore, if a pair of proteins are annotated with the same or similar functions, one could, with some degree of accuracy, infer a physical interaction between the two proteins. This is often referred to as “guilt by association,” which refers to the principle that genes or proteins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. This inference can be further enhanced by combining other evidence that supports their functional similarity—for example, if the genes coding for the two proteins are located close by on the genome or are transcribed as an operonic unit (for prokaryotes) [Dandekar et al. 1998, Kumar et al. 2002], or the coding genes are co-transcribed or co-expressed [Huynen et al. 2000, Bowers et al. 2004, Jansen et al. 2002], or show similar phylogenetic profiles [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. Proteins within the same protein complex (co-complexed proteins) show a strong tendency to share functions and cellular localization and therefore physically interact. On the other hand, proteins from different cellular compartments most likely do not meet and therefore do not interact in vivo during their lifetimes. Jansen et al. [2003] used interactions between co-complexed proteins from the MIPS protein complex catalog [Mewes et al. 2006] as the positive training set, and non-interacting pairs of proteins as the negative training set, in a Bayesian framework, to predict new interactions in yeast. Blohm et al. [2014] present a dataset, the “Negatome,” of protein pairs that are highly unlikely to interact, which can be used as a negative training set.

The Gene Ontology graph [Ashburner et al. 2000] integrates information on the functional and localization properties of proteins, and therefore provides a way to predict new interactions. For example, the TCSS approach by Jain and Bader [2010] can be used to compute similarity between pairs of proteins using the GO graph, and protein pairs showing high GO-semantic similarity can be predicted to physically interact. Likewise, multiple pieces of experimental and functional information can be combined to predict new interactions. For example, GeneMANIA (http://www.genemania.org/) [Warde-Farley et al. 2010] combines experimentally detected interactions from BioGrid [Stark et al. 2011, Chatr-Aryamontri et al. 2015], pathway annotations from Pathway Commons (http://www.pathwaycommons.org/) [Cerami et al. 2011], and information on evolutionary conservation of interactions from the Interologous Interaction Database (I2D) [Brown and Jurisica 2005], along with GO-based similarity, to predict new interactions (GeneMANIA and I2D are also listed in Table 2.5). The HumanNet [Lee et al. 2011] is a human functional interaction network which includes predicted interactions based on guilt by association for genes involved in human diseases.

Structural Information on Proteins. 3D structures of proteins provide first-hand evidence for protein interaction sites and binding surfaces of proteins. Therefore, by assessing the compatibility between the binding surfaces between two proteins, one can predict whether the two proteins interact or not. For example, Zhang et al. [2012, 2013] analyzed 3D structures of proteins from the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do) [Berman et al. 2000], a database which stores 3D structures for over 600 of the ∼6,000 characterized yeast proteins (∼10%), to predict new interactions between proteins in yeast. However, since the 3D structures are available for only a small fraction of proteins, using this approach for prediction of interactions on a larger scale is not feasible. Zhang et al. proposed to overcome this limitation to some extent by deriving homology models for proteins without available 3D structures. Homology models were derived for an additional ∼3,600 yeast proteins using the ModBase (http://modbase.compbio.ucsf.edu/) [Pieper et al. 2006] and Skybase (http://skybase.c2b2.columbia.edu/pdb60_new/struct_show.php) [Mirkovic et al. 2007] databases. Given a query protein, these databases predict the most likely 3D structure for the protein based on its sequence similarity with templates built from proteins with available 3D structures. The final set of structurally predicted interactions from the Zhang et al. study is available in the PrePPI database (http://bhapp.c2b2.columbia.edu/PrePPI/). Struct2Net (http://groups.csail.mit.edu/cb/struct2net/webserver/) [Singh et al. 2010] uses a structure-threading approach to predict interactions between proteins. Given two protein sequences, Struct2Net “threads” the sequences to known 3D structures from PDB, and then based on the best-matching structures, estimates the interaction between the two proteins. PredictProtein (http://www.predictprotein.org/) [Yachdav et al. 2014] combines structure and GO-based methods to predict new interactions. Wang et al. [2012] curated a 3D-structure resolved dataset of 4,222 high-quality human PPIs enriched for human disease genes by examining relationships between 3,949 genes, 62,663 mutations, and 3,453 associated with human disorders.

Literature Mining. Interactions missed in PPI datasets but have direct or indirect reference in scientific publications can be identified by mining the literature. For example, these references may include abstracts or full-texts of publications maintained in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) by the National Center for Biotechnology Information (NCBI). Text-mining tools based on natural language processing (NLP) and other machine-learning techniques mine for co-occurrence of protein names in these literature sources, and proteins frequently referenced together can be predicted to interact. For example, the PubGene tool, which is a part of COREMINE (http://www.coremine.com/medical/), mines for information on genes and proteins including their co-occurrences in abstracts of publications, their sequence homology, and association with cell processes and diseases. This information can be used to predict interactions between frequently co-occurring proteins. Similarly, iHOP (http://www.ihop-net.org/UniPub/iHOP/) [Hoffman and Valencia 2004] presents a network of genes and proteins that co-occur in the scientific literature.

Limitations of Computationally Predicted Protein Interactions. Despite their promise in enhancing experimentally curated interaction datasets, computational methods have their own limitations. For example, methods that rely on (high-throughput) experimental datasets to predict new interactions—e.g., PPI topology-based methods—carry an inherent bias in their predictions toward proteins already accounted for or enriched in these datasets. Moreover, these predictions are also affected by biological and technical noise (spurious interactions) in experimental datasets. The methods that are based on genomic distances, fusion of genes, and co-transcription of (operonic) genes are applicable only to a selected subset of species (prokaryotes), since most eurkaryotic systems do not exhibit these properties so strongly. Knowledge-based methods—e.g., those based on the GO graph, 3D structures, and literature abstracts—are restricted by the amount and kind of available information, and are again applicable to only a subset of proteins. As a result of these limitations, accurate prediction of physical interactions between proteins is a challenging problem, and most methods in fact predict only functional associations instead of direct physical interactions between proteins. Depending on the application, these functional associations can turn out to be more useful or less useful.

However, despite these limitations, computational techniques are orthogonal to and can effectively complement experimental techniques. Computational predictions can be used to enhance the topologies of PPI networks—for example, by “densifying” sparse network regions or piecing together disconnected proteins or regions of the network—which in turn can improve downstream applications of PPI networks including protein complex prediction.

Computational Prediction of Protein Complexes from Protein Interaction Networks

Подняться наверх