Biostatistics Decoded - A. Gouveia Oliveira - Page 13

1.5 Inferences from Samples

Imagine a swimming pool full of small balls. The color of the balls is the attribute we wish to study, and we know that it can take only one of two possible values: black and white. The problem at hand is to find the proportion of black balls in the population of balls inside the swimming pool. So we take a single ball out of the pool and let us say that the ball happened to be black (Figure 1.8). What can we say about the proportion of black balls in the population?

We could start by saying that it is perfectly possible that the population consists 100% of black balls. We could also say that it is quite plausible that the proportion of black balls is, say, 80%, because then it would be quite natural that, by taking a single ball at random from the pool, we would get a black ball. However, if the proportion of black balls in the population is very small, say less than 5%, we would expect to get a white ball rather than a black ball. In other words, a sample made up of a black ball is not very consistent with the hypothesis of a population with less than 5% black balls. On the other hand, if the proportion of black balls in the population is between 5 and 100%, the result of the sampling is quite plausible. Consequently, we would conclude that the sample was consistent with a proportion of black balls in the swimming pool between 5 and 100%. The inference we would make from that sample would be to estimate, with a high degree of confidence, that the proportion of black balls lies in that range.
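The reasoning above can be written out as a small check. This is only a sketch: the 5% cut-off is the value used in the example, and the names `THRESHOLD`, `prob_single_black`, and `is_consistent` are made up for illustration. For a single random draw, the probability of getting a black ball is simply the proportion of black balls in the pool, so a candidate proportion is treated as consistent with the sample when that probability is not below the cut-off.

```python
# Assumption: a candidate proportion counts as "consistent" with the sample
# when the sample's probability under it is at least 5% (the example's cut-off).
THRESHOLD = 0.05


def prob_single_black(p):
    """Probability that one ball drawn at random is black when the
    proportion of black balls in the pool is p."""
    return p


def is_consistent(p, threshold=THRESHOLD):
    """True when drawing one black ball is plausible under proportion p."""
    return prob_single_black(p) >= threshold


# A few candidate proportions, chosen only for illustration.
candidates = [0.01, 0.03, 0.05, 0.40, 0.80, 1.00]
consistent = [p for p in candidates if is_consistent(p)]
print(consistent)
```

Every candidate from 5% up to 100% passes the check, which is why the conclusion from one ball is correct but very imprecise.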

One might say that this whole thing is nonsense, because such a conclusion is completely worthless. Of course it is, but that is because we did not put much effort into the study. If we wanted a more interesting conclusion, we would have to work harder and collect some more information about the population. That is, we would have to make some more observations to increase the sample size.

Before going into this, think for a moment about the previous study. There are three important things to note. First, this approach to sampling still works in the extreme situation of a sample size of one, while that is not true for the classical approach. Second, the conclusion was correct (remember, it was said that one was very confident that the proportion of black balls in the population was a number between 5 and 100%). The problem with the conclusion, or rather with the study, was that it lacked precision. Third, the inference procedure described here is valid only for random samples of the population; otherwise, the conclusions may be completely wrong. Suppose that the proportion of black balls in the population is minimal but that, because their color attracts our attention, we looked at the balls before taking our sample and so were much more likely to select a flashy black ball than a boring white one. We would then follow the same reasoning as before and reach the same conclusion, but we would be completely wrong because the sample was biased toward the black balls.
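The effect of a biased selection can be illustrated with a short simulation. Everything numeric here is an assumption made for the sketch: a pool with 2% black balls, a black ball being 20 times more likely to be picked than a white one in the biased case, 5000 draws, and drawing with replacement (reasonable for a very large pool). The helper names `draw_sample` and `black_fraction` are hypothetical.

```python
import random


def draw_sample(population, n, black_weight=1.0, seed=0):
    """Draw n balls with replacement. black_weight > 1 makes each black
    ball that many times more likely to be picked than a white one."""
    rng = random.Random(seed)
    weights = [black_weight if ball == "black" else 1.0 for ball in population]
    return rng.choices(population, weights=weights, k=n)


def black_fraction(sample):
    """Proportion of black balls in a sample."""
    return sample.count("black") / len(sample)


# Assumed pool: 2% black balls.
population = ["black"] * 20 + ["white"] * 980

random_sample = draw_sample(population, 5000, black_weight=1.0)
biased_sample = draw_sample(population, 5000, black_weight=20.0)

print("random sampling:", black_fraction(random_sample))
print("biased sampling:", black_fraction(biased_sample))
```

Under these assumptions the random sample stays close to the true 2%, while the biased sample shows a far larger share of black balls, so any inference drawn from it would be badly off.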

Figure 1.8 Inference with binary attributes.

Suppose now that we decide to take a random sample of 60 balls, and that we have 24 black balls and 36 white balls (Figure 1.9). The proportion of black balls in the sample is, therefore, 40%. What can we say about the proportion of black balls in the population? Well, we can say that if the proportion is below, say, 25%, there should not be so many black balls in a sample of 60. Conversely, we can also say that if the proportion is above, say, 55%, there should be more black balls in the sample. Therefore, we would be confident in concluding that the proportion of black balls in the swimming pool must be somewhere between 25 and 55%. This is a more interesting result than the previous one because it has more precision; that is, the range of possibilities is narrower than before. If we need more precision, all we have to do is increase the sample size.
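One way to make "there should not be so many black balls" concrete is sketched below. It is an assumption of this sketch, not a method stated in the text: a candidate proportion is kept when, under the binomial distribution, both the chance of seeing at least as many black balls as observed and the chance of seeing at most that many are no smaller than 2.5%. The function name `consistent_range` and the grid of 1000 candidate proportions are also illustrative choices.

```python
from math import comb


def consistent_range(k, n, alpha=0.05, steps=1000):
    """Scan candidate proportions and keep those for which observing k
    black balls in n random draws is not too extreme in either direction
    (assumed rule: each binomial tail probability >= alpha / 2)."""
    coeffs = [float(comb(n, j)) for j in range(n + 1)]
    kept = []
    for i in range(1, steps):
        p = i / steps
        pmf = [coeffs[j] * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
        lower_tail = sum(pmf[: k + 1])  # P(X <= k) under proportion p
        upper_tail = sum(pmf[k:])       # P(X >= k) under proportion p
        if lower_tail >= alpha / 2 and upper_tail >= alpha / 2:
            kept.append(p)
    return min(kept), max(kept)


low_60, high_60 = consistent_range(24, 60)      # the sample in the text
low_600, high_600 = consistent_range(240, 600)  # same 40%, ten times larger

print("n = 60: ", low_60, high_60)
print("n = 600:", low_600, high_600)
```

With these assumptions the range for 24 black balls out of 60 comes out close to the 25 to 55% quoted above, and the hypothetical sample of 600 with the same 40% of black balls gives a visibly narrower range, which is the gain in precision from a larger sample.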

Let us return to the situation of a sample size of one and suppose that we want to estimate another characteristic of the balls in the population, for example, the average weight. This characteristic, or attribute, has an important difference from the color attribute, because weight can take many different values, not just two.

Let us see if we can apply the same reasoning in the case of attributes taking many different values. To do so, we take a ball at random and measure its weight. Let us say that we get a weight of 60 g. What can we conclude about the average weight in the population? Now the answer is not so simple. If we knew that the balls were all about the same weight, we could say that the average weight in the population should be a value between, say, 50 and 70 g. If it were below or above those limits, it would be unlikely that a ball sampled at random would weigh 60 g.

Figure 1.9 Inference with interval attributes I.

However, if we knew that the balls varied greatly in weight, we would say that the average weight in the population should be a value between, say, 40 and 80 g (Figure 1.10). The problem here, because now we are studying an attribute that may take many values, is that to make inferences about the population we also need information about the amount of variation of that attribute in the population. It thus appears that this approach does not work well in this extreme situation. Or does it?
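The two ranges quoted for a single ball of 60 g can be reproduced with a simple rule of thumb. This is a sketch under explicit assumptions: most balls are taken to weigh within two standard deviations of the population average, and the standard deviations of 5 g (homogeneous balls) and 10 g (heterogeneous balls) are illustrative values chosen to match the ranges in the text. The function `plausible_mean_range` is a hypothetical helper.

```python
def plausible_mean_range(observed, sd, k=2.0):
    """Range of population averages under which a single observation of
    `observed` would lie within k standard deviations of the average.
    Both k = 2 and the sd values used below are assumptions."""
    return observed - k * sd, observed + k * sd


homogeneous = plausible_mean_range(60, sd=5)     # balls of similar weight
heterogeneous = plausible_mean_range(60, sd=10)  # balls varying greatly

print(homogeneous)    # (50.0, 70.0)
print(heterogeneous)  # (40.0, 80.0)
```

The calculation cannot even start without a value for the variation, which is exactly the difficulty with a sample of one.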

Suppose we take a second random observation and now have a sample of two. The second ball weighs 58 g, and so we are compelled to believe that balls in this population are relatively homogeneous regarding weight. In this case, we could say that we were quite confident that the average weight of balls in the population was between 50 and 70 g. If the average weight were under 50 g, it would be unlikely that we would have two balls weighing 58 and 60 g in the sample, and similarly if the average weight were above 70 g. So this approach works properly with a sample size of two, but is this situation extreme? Yes it is, because in this case we need to estimate not one but two characteristics of the population, the average weight and its variation, and to estimate the variation we need at least two observations.
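The need for at least two observations can be seen directly with Python's standard `statistics` module. The sketch uses the sample standard deviation as the measure of variation, which is an assumption here since the text has not yet named a specific measure.

```python
import statistics

weights = [60.0, 58.0]  # the two balls in the example, in grams

mean_weight = statistics.mean(weights)
sd_weight = statistics.stdev(weights)  # sample standard deviation (n - 1)

print("average:", mean_weight)
print("standard deviation:", sd_weight)

# With a single observation the variation cannot be estimated:
# statistics.stdev raises StatisticsError for fewer than two values.
try:
    statistics.stdev([60.0])
    single_ok = True
except statistics.StatisticsError:
    single_ok = False

print("variation estimable from one ball:", single_ok)
```

The two weights give an average of 59 g and a small standard deviation, in line with the impression that the balls are fairly homogeneous, while the single-ball case fails for lack of information.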

In summary, for the modern approach to sampling to be valid, sampling must be random. The representativeness of a sample is primarily determined by the sampling method used, not by the sample size. Sample size determines only the precision of the population estimates obtained with the sample.

Now, if sample size has no relationship to representativeness, does this mean that sample size has no influence at all on the validity of the estimates? No, it does not. Sample size is of importance to validity because large sample sizes offer protection against accidental errors during sample selection and data collection, which might have an impact on our estimates. Examples of such errors are selecting an individual who does not actually belong to the population under study, measurement errors, transcription errors, and missing values.

Figure 1.10 Inference with interval attributes II.

We have eliminated a lot of subjectivity by putting the notion of sample representativeness within a convenient framework. Now we must try to eliminate the remaining subjectivity in two other statements. First, we need to find a way to determine, objectively and reliably, the limits for population proportions and averages that are consistent with the samples. Second, we need to be more specific when we say that we are confident about those limits. Terms like confident, very confident, or quite confident lack objectivity, so it would be very useful if we could express quantitatively our degree of confidence in the estimates. In order to do that, as we have seen, we need a measure of the variation of the values of an attribute.
