Gaps and Gap Penalties
Gaps are often introduced to improve the alignment between two nucleotide or protein sequences. These gaps compensate for insertions and deletions between the sequences being studied so, in essence, they represent biological events. As such, the number of gaps introduced into a pairwise sequence alignment needs to be kept reasonably small so as not to yield a biologically implausible scenario.
The scoring of gaps in pairwise sequence alignments is different from scoring approaches discussed to this point, as no comparison between characters is possible – one sequence has a residue at some position and the other sequence has nothing. The most widely used method for scoring gaps involves a quantity known as the affine gap penalty. Here, a fixed deduction is made for introducing the gap; an additional deduction is made that is proportional to the length of the gap. The formula for the affine gap penalty is G + Ln, where G is the gap-opening penalty (the cost of creating the gap), L is the gap-extension penalty, and n is the length of the gap, with G > L. This last condition is important: given that the gap-opening penalty is larger than the gap-extension penalty, lengthening existing gaps would be favored over creating new ones. The values of G and L can be adjusted manually in most programs to make the insertion of gaps either more or less permissive, but most methods automatically adjust both G and L to the most appropriate values for the scoring matrix being used.
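As a small illustration of the formula G + Ln, the sketch below computes the affine penalty for a gap of a given length. The function name `affine_gap_penalty` and the default values G = 10 and L = 0.5 are assumptions chosen for the example; they are not taken from the text or from any particular program.

```python
def affine_gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Return the affine penalty G + L*n for a single gap of length n.

    gap_open (G) and gap_extend (L) are illustrative values only,
    chosen so that G > L as the text requires.
    """
    if n < 0:
        raise ValueError("gap length must be non-negative")
    if n == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * n


# One gap of length 4 versus two separate gaps of length 2:
# both remove four positions, but the second pays G twice.
single_gap = affine_gap_penalty(4)
two_gaps = affine_gap_penalty(2) + affine_gap_penalty(2)
print(single_gap, two_gaps)
```

With these assumed values, the single gap costs 12.0 while the two shorter gaps cost 22.0 in total, which is the sense in which G > L favors lengthening an existing gap over opening a new one.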
Figure 3.2 A nucleotide scoring table. The scoring for the four nucleotide bases is shown in the upper left of the figure, with the remaining one-letter codes specifying the IUPAC/IUBMB codes for ambiguities or chemical similarities. Note that the matrix is a mirror image of itself with respect to the diagonal.
The other major type of gap penalty used is a non-affine (or linear) gap penalty. Here, there is no cost for opening the gap; a simple, fixed mismatch penalty is assessed for each position in the gap. It is thought that affine penalties better represent the biology underlying the sequence alignments, as affine gap penalties take into account the fact that most conserved regions are ungapped and that a single mutational event could insert or delete many more than just one residue. In practice, use of the affine gap penalty better enables the detection of more distant homologs.
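The difference between the two schemes can be seen by scoring the same number of gapped positions arranged in two ways: as one long gap, or as several single-position gaps. The following is a minimal sketch; the function names and the penalty values (2 per position for the linear scheme, G = 10 and L = 0.5 for the affine scheme) are assumptions made for illustration.

```python
def linear_gap_penalty(n, per_position=2.0):
    """Linear (non-affine) penalty: a fixed cost per gapped position."""
    return per_position * n


def affine_gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Affine penalty G + L*n for one gap of length n (0 if there is no gap)."""
    return 0.0 if n == 0 else gap_open + gap_extend * n


def total_penalty(gap_lengths, penalty):
    """Sum a penalty function over every gap in an alignment."""
    return sum(penalty(n) for n in gap_lengths)


one_long_gap = [6]           # a single six-residue insertion or deletion
six_short_gaps = [1] * 6     # six independent single-residue gaps

linear_long = total_penalty(one_long_gap, linear_gap_penalty)
linear_short = total_penalty(six_short_gaps, linear_gap_penalty)
affine_long = total_penalty(one_long_gap, affine_gap_penalty)
affine_short = total_penalty(six_short_gaps, affine_gap_penalty)

print("linear:", linear_long, linear_short)
print("affine:", affine_long, affine_short)
```

Under these assumed values the linear scheme cannot tell the two layouts apart (12.0 in both cases), whereas the affine scheme charges 13.0 for the single long gap and 63.0 for the six scattered gaps. This is consistent with the point that a single mutational event may insert or delete many residues at once.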