Population Genetics - Matthew B. Hamilton

Box 2.1 DNA profiling

The loci used for human DNA profiling are a general class of DNA sequence marker known as simple tandem repeat (STR), simple sequence repeat (SSR), or microsatellite loci. These loci feature tandemly repeated DNA sequences of one to six base pairs (bp) and often exhibit many alleles per locus and high levels of heterozygosity. Allelic states are simply the number of repeats present at the locus, which can be determined by electrophoresis of polymerase chain reaction (PCR) amplified DNA fragments. STR loci used in human DNA profiling generally exhibit Hardy–Weinberg expected genotype frequencies; there is evidence that the genotypes are selectively “neutral” (i.e. not affected by natural selection), and the loci meet the other assumptions of Hardy–Weinberg. STR loci are employed widely in population genetic studies and in genetic mapping (see reviews by Goldstein and Pollock 1997; McDonald and Potts 1997).


Figure 2.8 The original data for the DNA profile given in Table 2.2 and Problem Box 2.1 obtained by capillary electrophoresis. The PCR oligonucleotide primers used to amplify each locus are labeled with a molecule that emits blue, green, or yellow light when exposed to laser light. Thus, the DNA fragments for each locus are identified by their label color as well as their size range in base pairs. Panel A shows a simulation of the DNA profile as it would appear on an electrophoretic gel (+ indicates the anode side). Blue, green, and yellow label the 10 DNA profiling loci, shown here in grayscale. The red DNA fragments are size standards with a known molecular weight used to estimate the size in base pairs of the other DNA fragments in the profile. Panel B shows the DNA profile for all loci and the size standard DNA fragments as a graph of color signal intensity by size of DNA fragment in base pairs. Panel C shows a simpler view of trace data for each label color independently with the individual loci labeled above the trace peaks. A few shorter peaks are visible in the yellow, green, and blue traces of Panel C that are not labeled as loci. These artifacts, called “pull up” peaks, are caused by intense signal from a locus labeled with another color (e.g. the yellow and blue peaks in the location of the green labeled amelogenin locus). A full color version of this figure is available on the textbook website.

This is an example of the DNA sequence found at a microsatellite locus. This sequence is the 24.1 allele from the fibrinogen alpha chain gene, or FGA locus (Genbank accession no. AY749636; see Figure 2.8). The integral repeat is the 4 bp sequence CTTT, and most alleles have sequences that differ by some number of full CTTT repeats. However, there are exceptions where alleles have sequences with partial repeats or stutters in the repeat pattern, for example, the TTTCT and CTC sequences embedded in the perfect CTTT repeats. In this case, the 24.1 allele is 1 bp longer than the 24 allele sequence.

GCCCCATAGGTTTTGAACTCACAGATTAAACTGTAACCAAAATAAAATTAGGCATTATTTACAAGCTAGTTT CTTT CTTT CTTT TTTCT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTTT CTC CTTC CTTC CTTT CTTC CTTT CTTT TTTGCTGGCA ATTACAGACAAATCAA

Table 2.4 Expected numbers of each of the three MN blood group genotypes under the null hypothesis of Hardy–Weinberg. Genotype frequencies are based on a sample of 1066 Chukchi individuals, a native people of eastern Siberia (Roychoudhury and Nei 1988).

Frequency of M = (2 × 165 + 562) / (2 × 1066) = 0.4184
Frequency of N = (2 × 339 + 562) / (2 × 1066) = 0.5816

| Genotype | Observed | Expected number of genotypes | Observed − Expected |
|---|---|---|---|
| MM | 165 | 1066 × (0.4184)² = 186.61 | −21.6 |
| MN | 562 | 1066 × 2(0.4184)(0.5816) = 518.80 | 43.2 |
| NN | 339 | 1066 × (0.5816)² = 360.58 | −21.6 |
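The arithmetic behind Table 2.4 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the code keeps the allele frequencies unrounded, so the expected counts can differ from the table in the second decimal place (the table rounds the frequencies to four digits first).

```python
# Observed MN blood group genotype counts from Table 2.4.
observed = {"MM": 165, "MN": 562, "NN": 339}


def allele_frequencies(counts):
    """Estimate the M and N allele frequencies by counting alleles.

    Each MM individual carries two M alleles and each MN individual one,
    out of 2n alleles in a sample of n diploid individuals.
    """
    n = sum(counts.values())
    freq_m = (2 * counts["MM"] + counts["MN"]) / (2 * n)
    return freq_m, 1.0 - freq_m


def hardy_weinberg_expected(counts):
    """Expected genotype counts under Hardy-Weinberg: n*p^2, n*2pq, n*q^2."""
    n = sum(counts.values())
    p, q = allele_frequencies(counts)
    return {"MM": n * p ** 2, "MN": n * 2 * p * q, "NN": n * q ** 2}


p_m, p_n = allele_frequencies(observed)
expected = hardy_weinberg_expected(observed)

print(f"Frequency of M = {p_m:.4f}, frequency of N = {p_n:.4f}")
for genotype, count in observed.items():
    diff = count - expected[genotype]
    print(f"{genotype}: observed {count}, expected {expected[genotype]:.2f}, "
          f"difference {diff:+.1f}")
```

The expected counts always sum to the sample size, which is a quick sanity check on the calculation.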

In more general terms, the expected frequency of an event, p, times the number of trials or samples, n, gives the expected number of events or np. To test the hypothesis that p is the frequency of an event in an actual population, we compare np with $n\hat{p}$, where $\hat{p}$ is the estimate of p obtained from the sample. Close agreement suggests that the parameter and the estimate are the same quantity. But a large disagreement instead suggests that p and $\hat{p}$ are likely to be different probabilities. The chi‐squared (χ²) test is a statistical test commonly used to compare np and $n\hat{p}$. The χ² test provides the probability of obtaining the difference (or more) between the observed ($n\hat{p}$) and expected (np) number of outcomes by chance alone if the null hypothesis is true. As the difference between the observed and expected grows larger, it becomes less probable that the parameter and the parameter estimate are actually the same but differ in a given sample due to chance. The χ² statistic is:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \qquad (2.7)$$

where ∑ (pronounced “sigma”) indicates taking the sum of multiple terms.

The χ² formula makes intuitive sense. The numerator contains the difference between the observed and Hardy–Weinberg expected number of individuals. This difference is squared, like a variance, since we do not care about the direction of the difference but only its magnitude. Then, in the denominator, we divide by the expected number of individuals to make the squared difference relative. For example, a squared difference of 4 is small if the expected number is 100 (it is 4%) but relatively larger if the expected number is 8 (it is 50%). Adding all of these relative squared differences gives the total relative squared deviation observed over all genotypes.

$$\chi^2 = \frac{(-21.6)^2}{186.61} + \frac{(43.2)^2}{518.80} + \frac{(-21.6)^2}{360.58} \qquad (2.8)$$
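As a minimal sketch, the χ² statistic can be evaluated directly from the observed and expected counts in Table 2.4. The function name `chi_squared` is hypothetical, and the expected counts are the rounded values printed in the table.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Genotype order: MM, MN, NN (values taken from Table 2.4).
observed_counts = [165, 562, 339]
expected_counts = [186.61, 518.80, 360.58]

chi2 = chi_squared(observed_counts, expected_counts)
print(f"chi-squared = {chi2:.2f}")
```

Evaluated with these rounded table entries, the sum comes out near 7.39, slightly below the 7.46 quoted in the text. Either value lies between the 0.01 and 0.001 critical values for 1 df in Table 2.5, so the conclusion drawn from the test does not change.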

We need to compare our statistic to values from the χ² distribution. But, first, we need to know how much information, or the degrees of freedom (commonly abbreviated as df), was used to estimate the χ² statistic. In general, degrees of freedom are based on the number of categories of data: df = (no. of classes compared) − (no. of parameters estimated) − 1, where the final 1 is for the χ² test itself. In this case, df = 3 − 1 − 1 = 1 for three genotypes and one estimated allele frequency (with two alleles, the other allele frequency is fixed once the first has been estimated).

Figure 2.9 shows a χ² distribution for one degree of freedom. Small deviations of the observed from the expected are more probable since they leave more area of the distribution to the right of the χ² value. As the χ² value gets larger, the probability that the difference between the observed and expected is due to chance sampling alone decreases (the area under the curve to the right gets smaller). Put another way, as the observed and expected become increasingly different, it becomes less probable that our null hypothesis of Hardy–Weinberg is actually the process that is determining genotype frequencies. Using Table 2.5, we see that a χ² value of 7.46 with 1 df has a probability between 0.01 and 0.001. The conclusion is that a deviation this large or larger would be observed less than 1% of the time in a population that actually had Hardy–Weinberg expected genotype frequencies. Under the null hypothesis, we do not expect this much difference or more from Hardy–Weinberg expectations to occur often.

By convention, we reject chance as the explanation for the differences if the χ² value has a probability of 0.05 or less. In other words, if chance would produce a difference this large in five or fewer trials out of 100, then we reject the hypothesis that the observed and expected patterns are the same. The critical value above which we reject the null hypothesis for a χ² test with 1 df is 3.84, or in notation $\chi^2_{0.05,\,1} = 3.84$. In this case, we can clearly see an excess of heterozygotes and deficits of homozygotes, and employing the χ² test allows us to conclude that Hardy–Weinberg expected genotype frequencies are not present in the population.
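Instead of bracketing the probability with a table, the right-hand tail area for 1 df can be computed exactly. The sketch below relies on the fact that a χ² variable with one degree of freedom is the square of a standard normal variable, so its tail probability equals erfc(√(x/2)). The function and variable names are hypothetical, and the shortcut applies to 1 df only.

```python
import math


def chi2_tail_probability_1df(x):
    """P(chi-squared >= x) for 1 degree of freedom.

    With 1 df the statistic is a squared standard normal deviate, so the
    right-tail area is erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


CRITICAL_VALUE_1DF = 3.8415  # chi-squared critical value for P = 0.05, 1 df

chi2_value = 7.46  # value quoted in the text for the MN blood group data
p_value = chi2_tail_probability_1df(chi2_value)
reject_null = chi2_value > CRITICAL_VALUE_1DF

print(f"P = {p_value:.4f}; reject Hardy-Weinberg null: {reject_null}")
```

The computed probability falls between 0.001 and 0.01, in agreement with the range read from Table 2.5.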


Figure 2.9 A χ² distribution with one degree of freedom. The χ² value for the Hardy–Weinberg test with MN blood group genotypes as well as the critical value to reject the null hypothesis are shown. The area under the curve to the right of the arrow indicates the probability of observing that much or more difference between the observed and expected outcomes.

Table 2.5 χ² values and associated cumulative probabilities in the right‐hand tail of the distribution for one through five degrees of freedom.

| df | P = 0.5 | P = 0.25 | P = 0.10 | P = 0.05 | P = 0.01 | P = 0.001 |
|---|---|---|---|---|---|---|
| 1 | 0.4549 | 1.3233 | 2.7055 | 3.8415 | 6.6349 | 10.8276 |
| 2 | 1.3863 | 2.7726 | 4.6052 | 5.9915 | 9.2103 | 13.8155 |
| 3 | 2.3660 | 4.1083 | 6.2514 | 7.8147 | 11.3449 | 16.2662 |
| 4 | 3.3567 | 5.3853 | 7.7794 | 9.4877 | 13.2767 | 18.4668 |
| 5 | 4.3515 | 6.6257 | 9.2364 | 11.0705 | 15.0863 | 20.5150 |
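As a rough cross-check, the first two rows of Table 2.5 can be recomputed from closed-form tail probabilities: erfc(√(x/2)) for 1 df and exp(−x/2) for 2 df. The sketch below inverts these by bisection. The function names are hypothetical, bisection is just one convenient way to do the inversion, and rows for 3 to 5 df would need a general χ² routine that is not shown here.

```python
import math


def tail_1df(x):
    """Right-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def tail_2df(x):
    """Right-tail probability of a chi-squared variable with 2 df."""
    return math.exp(-x / 2.0)


def critical_value(tail, probability, lo=0.0, hi=100.0, tol=1e-10):
    """Find x with tail(x) == probability by bisection.

    The tail probability decreases as x grows, so if tail(mid) is still
    larger than the target the solution lies above mid.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tail(mid) > probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


probabilities = [0.5, 0.25, 0.10, 0.05, 0.01, 0.001]
for df, tail in ((1, tail_1df), (2, tail_2df)):
    row = [critical_value(tail, p) for p in probabilities]
    print(df, " ".join(f"{value:.4f}" for value in row))
```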