Читать книгу Computational Prediction of Protein Complexes from Protein Interaction Networks - Sriganesh Srihari - Страница 11

Оглавление

3

Computational Methods for Protein Complex Prediction from PPI Networks

All models are wrong, but some models are useful.

—George Box (1919–2013), British statistician (as quoted in Box [1979])

The process of identifying protein complexes from high-throughput interaction datasets involves the following steps [Spirin and Mirny 2003, Srihari and Leong 2012b, Srihari et al. 2015a] (Figure 1.1):

1. integrating high-throughput datasets from multiple sources and assessing the confidence of interactions;

2. constructing a reliable PPI network using only the high-confidence interactions;

3. identifying modular subnetworks from the network to generate a candidate list of complexes; and

4. evaluating the identified complexes against bona fide complexes, and validating and assigning roles to novel complexes.

In this chapter, we review some of the representative methods developed to date for computational prediction of protein complexes from PPI networks.

3.1 Basic Definitions and Terminologies

A protein-protein interaction (PPI) network is modeled as an undirected graph G = 〈V, E〉, where V is the set of proteins, and EV × V is the set of interactions between the proteins. We also use V(G) and E(G) to refer to the set of proteins and interactions of a (sub-)network G, respectively. For a protein vV, N(v) or Nv is the set of immediate neighbors of v, deg(v) = |N(v)| = |Nv| is the degree of v, and E(v) is the set of interactions in the immediate neighborhood of v. The interaction density of G (or its subnetwork) is defined as:


which gives a real number in [0, 1] quantifying the “richness of interactions” within G: 0 for a network without any interactions and 1 for a fully connected network.

The clustering coefficient CC(v) measures the “cliquishness” for the neighborhood of v, and is given by:


If the interactions of the network are scored (weighted), i.e., G = 〈V, E, w〉, then the weighted versions—weighted degree, weighted interaction density, and weighted clustering coefficients—are given as follows:


There are other variants for these definitions proposed in the literature; see Kalna and Higham [2007] and Newman [2010] for a survey.

3.2 Taxonomy of Methods for Protein Complex Prediction

Although at a generic level most methods make the basic assumption that protein complexes are embedded among densely connected subsets of proteins within the PPI network, these methods vary considerably in their algorithmic strategies or in the biological information that they employ to detect complexes. Accordingly, we classify protein complex detection methods into the following two categories [Srihari and Leong 2012b, Srihari et al. 2015a, Li et al. 2010]:

1. methods based solely on network clustering; and

2. methods based on network clustering and additional biological information.

The biological information can be in the form of functional, structural, organizational, or evolutionary evidence on complexes or their constituent proteins. Wang et al. [2010] and Chen et al. [2014] present alternative classifications based on topological (interaction density, betweenness centrality, and clustering coefficient), and dynamical properties of PPI networks capitalized by computational methods, respectively.

We present this classification of methods for protein complex prediction as two snapshots—methodology-based and chronology-based—as shown in Figures 3.1 and 3.2, respectively. The methodology-based classification (Figure 3.1) is based on algorithmic methodologies employed by the methods. At level 1 of the tree, we divide methods into those based solely on network clustering, and those employing additional biological information. At subsequent levels, we further subdivide the methods based on their algorithmic strategies into: (i) methods employing merging or growing of clusters from the network (agglomerative); and (ii) methods employing repeated partitioning of the network (divisive). Agglomerative methods go bottom-up, i.e., by beginning with small “seeds” (e.g., triangles or cliques) and repeatedly adding proteins or merging clusters of proteins based on certain similarity criteria to arrive at predicted set of complexes. On the other hand, divisive methods go top-down, i.e., by repeatedly partitioning the network into multiple dense subnetworks based on certain dissimilarity criteria to arrive at predicted complexes. In the chronology-based classification (Figure 3.2), we bin methods based on the time (year) when these were developed, and for each bin we stack the methods based on the kind of biological information they employ for the prediction.

Figures 3.1 and 3.2 and Table 3.1 show only the methods that are discussed in this chapter. Other methods that employ diverse kinds of biological information and/or algorithmic strategies are covered in the following chapters.

3.3 Methods Based Solely on PPI Network Clustering

Most methods that directly mine for dense subnetworks from the PPI network make use of solely the topology of the network.

Molecular COmplex DEtection (MCODE)

MCODE, proposed by Bader and Hogue [2003], is one of the first computational methods developed for protein complex detection from PPI networks. The MCODE algorithm operates in two steps, vertex weighting and complex prediction, and an optional third step for post-processing of the candidate complexes. In the first step, each protein v in the PPI network G is weighted based on its clustering coefficient (CC). However, instead of using the entire neighborhood of v, MCODE uses the density of the highest k-core in the neighborhood of v, which amplifies the weights for proteins located in densely connected regions of G. A k-core is a subnetwork of proteins such that each protein in this subnetwork has a degree no less than k. A k-core in the neighborhood of v is a subnetwork of v and all its neighbors that have degree no less than k. A k-core in the neighborhood of v is highest or maximal if there exists no (k + 1)-core in that neighborhood. If Ck(v) represents this highest k-core, the weight for v is assigned as:

Computational Prediction of Protein Complexes from Protein Interaction Networks

Подняться наверх