Statistics for HCI - Alan Dix

CHAPTER 2

The unexpected wildness of random

How random is the world? We often underestimate just how wild random phenomena are—we expect to see patterns and reasons for what is sometimes entirely arbitrary.

By ‘wild’ here I mean that the behaviour of random phenomena is often far more chaotic than we expect. Perhaps because, barring the weather, so many aspects of life are controlled, we have become used to ‘tame,’ predictable phenomena. Crucially, this may lead to misinterpreting data, especially in graphs, either seeing patterns that are in fact pure randomness, or missing trends hidden by noise.

The mathematics of formal statistics attempts to see through this noise and give a clear view of robust properties of the underlying phenomenon. This chapter may help you see why you need to do this sometimes. However, we are also aiming to develop a ‘gut’ feeling for randomness, which is most important when you are simply eyeballing data, getting that first impression, to help you sort out the spurious from the serious and know when to reach for the formal stats.

2.1 EXPERIMENTS IN RANDOMNESS

Through a story and some exercises, I hope that you will get a better feel for how wild randomness is. We sometimes expect random things to end up close to their average behaviour, but we’ll see that variability is often large.

When you have real data you have a combination of some real effect and random ‘noise.’ However, if you do some coin tossing experiments you can be sure that the coins you are dealing with are (near enough) fair—everything you see will be sheer randomness.

2.1.1 RAINFALL IN GHEISRA

We’ll start with a story:

In the far-off land of Gheisra there lies the Plain of Nali. For 100 miles in each direction it spreads, featureless and flat, no vegetation, no habitation; except, at its very centre, a pavement of 25 tiles of stone, each perfectly level with the others and with the surrounding land.

The origins of this pavement are unknown—whether it was set there by some ancient race for its own purposes, or whether it was there from the beginning of the world.


Figure 2.1: Three days in Gheisra—Which are mere chance and which are an omen?

Rain falls but rarely on that barren plain, but when clouds are seen gathering over the Plain of Nali, the monks of Gheisra journey on pilgrimage to this shrine of the ancients, to watch for the patterns of the raindrops on the tiles. Oftentimes the rain falls by chance, but sometimes the raindrops form patterns, giving omens of events afar off.

Some of the patterns recorded by the monks are shown in Fig. 2.1. All of them at first glance seem quite random, but are they really? Do some have properties or tendencies that are not entirely like random rainfall? Which are mere chance, and which foretell great omens? Before reading on make your choices and record why you made your decision.

Before we reveal the true omens, you might like to know how you fare alongside three- and seven-year-olds.

When very young children are presented with this choice (with an appropriate story for their age) they give very mixed answers, but have a small tendency to think that distributions like Day 1 are real rainfall, whereas those like Day 3 are an omen.

In contrast, once children are older, seven or so, they are more consistent and tend to plump for Day 3 as the random rainfall.

Were you more like the three-year-old, thinking Day 1 was random rainfall, or more like the seven-year-old, thinking Day 1 was an omen and Day 3 random? Or perhaps you were like neither of them and thought Day 2 was true random rainfall.

Let’s see who is right.

Day 1 When you looked at Day 1 you might have seen a slight diagonal tendency with the lower-right corner less dense than the upper-left. Or you may have noted the suspiciously collinear three dots in the second tile on the top row. However, this pattern, the preferred choice of the three-year-old, is in fact the random rainfall—or at least as random as a computer random number generator can manage! In true random phenomena you often do get gaps, dense spots, or apparent patterns, but this is just pure chance.

Day 2 In Day 2 you might have thought it looked a little clumped toward the middle. In fact, this is perfectly right: it is exactly the same tiles as in Day 1, but re-ordered so that the fuller tiles are toward the centre and the part-empty ones toward the edges. This is an omen!

Day 3 Finally, Day 3 is also an omen. This is the seven-year-olds’ preferred choice for random rainfall and also, I have found, the preferred choice of 27-, 37-, and 47-year-olds. However, it is too uniform. The drops on each tile are distributed randomly within it, but there are precisely five drops on each tile. At some point during our early education we ‘learn’ (wrongly!) that random phenomena are uniform. Although this is nearly true when there are very large numbers involved (maybe 12,500 drops rather than 125), with smaller numbers the effects are far more chaotic than one might imagine.
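If you would like to make your own rainfall, the following is a minimal sketch, assuming that every drop is equally likely to land on any of the 25 tiles. The function names and the crude ‘relative spread’ measure are illustrative choices, not part of the original story.

```python
import random
from collections import Counter


def rain_on_tiles(drops=125, tiles=25, seed=None):
    """Drop each raindrop on a uniformly chosen tile; return drops per tile."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(tiles) for _ in range(drops))
    return [counts[t] for t in range(tiles)]


def relative_spread(per_tile):
    """(max - min) / mean: a crude measure of how uneven the tiles are."""
    mean = sum(per_tile) / len(per_tile)
    return (max(per_tile) - min(per_tile)) / mean


if __name__ == "__main__":
    for drops in (125, 12_500):
        per_tile = rain_on_tiles(drops=drops, seed=2)
        print(f"{drops} drops: min={min(per_tile)}, max={max(per_tile)}, "
              f"relative spread={relative_spread(per_tile):.2f}")
```

Try a few different seeds. With 125 drops it is rare for every tile to end up with close to five drops, whereas with 12,500 drops the tiles look far more alike in relative terms.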

2.1.2 TWO-HORSE RACES

Now for a different exercise, and this time you don’t just have to choose, you have to do something.

Find a coin or, even better, 20 coins if you have them. Toss the coins one by one and put the heads into one row and the tails into another. Keep on tossing until one line of coins has ten coins in it … you could even mark a finish line ten coins away from the start (like Fig. 2.2). If you only have one coin you’ll have to toss it lots of times and keep tally.

If you are on your own repeat this several times, but if you are in a group, perhaps a class, do it fewer times and look at each other’s coins as well as your own.

Before you start, think about what you expect to see, and only then do the coin tossing. So what happened? Did you get a clear winner, or were they neck and neck? Is it what you expected to happen?

I had a go and did five races. In one case they were nearly neck-and-neck at 9 heads to 10 tails, but the other four races were all won by heads with some quite large margins: 10 to 7, 10 to 6, 10 to 5, and 10 to 4.

Often people are surprised because they are expecting a near neck-and-neck race every time. As the coins are all fair, they expect approximately equal numbers of heads and tails. However, just like the rainfall in Gheisra, it is very common to have one quite far ahead of the other.

You might think that because the probability of a head is a half, the number of heads will be near enough half. Indeed, this is the case if you average over lots and lots of tosses. However, with just 20 coins in a race, the variability is large.

The probability of getting an outright winner, all heads or all tails, is low, only about 1 in 500. However, the probability of getting a near wipe-out with 1 head and 10 tails or vice versa is around 1 in 100; in a large class where everyone runs several races, someone may well see this.
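For anyone who wants to check figures like these, here is a minimal sketch that works out the exact chance of each final score. It assumes fair coins and that the race stops the moment one row reaches ten; the function name is made up for illustration.

```python
from fractions import Fraction
from math import comb


def loser_score_distribution(target=10):
    """Exact P(losing row ends on k coins) for a fair-coin race to `target`."""
    dist = {}
    for k in range(target):
        # The winner's final coin ends the race, so the earlier
        # target - 1 + k tosses contain exactly k coins for the loser.
        # The factor of 2 is because either heads or tails can win.
        dist[k] = 2 * comb(target - 1 + k, k) * Fraction(1, 2) ** (target + k)
    return dist


if __name__ == "__main__":
    for k, prob in loser_score_distribution().items():
        print(f"10 to {k}: {float(prob):.4f}  (about 1 in {1 / float(prob):.0f})")
```

The printed table shows how much of the probability sits on lopsided finishes rather than on neck-and-neck ones.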


Figure 2.2: Two-horse races—Were yours neck-and-neck or was there a front runner?

2.1.3 LESSONS

I hope these two activities begin to give you some idea of the wild nature of random phenomena. We can see a few general lessons.

First, apparent patterns or differences may just be pure chance. For example, if you had found heads winning by 10 to 2, you might have thought this meant that your coin was in some way biased to heads. Or, you might have thought that the nearly straight line of three drops on Day 1 had to mean something. But random things are so wild that apparently systematic effects sometimes happen by chance.

Second, this wildness may lead to what appear to be ‘bad values.’ If you had got 10 tails and just 1 head, you might have thought “but coins are fair, so I must have done something wrong.” Indeed, famous scientists have fallen for this fallacy!

Mendel’s experiments on the inheritance of pea plant characteristics laid the foundations for modern genetics. However, his results are a little too good. If you cross-pollinate two plants, one of them pure bred to have a recessive characteristic (say R) and the other purely dominant (say D), in the first generation all the progeny show the dominant characteristic, but in fact possess precisely one recessive and one dominant gene (RD). In the second generation, interbreeding two of the first-generation RD plants is expected to give progeny whose observable characteristics are dominant and recessive in the ideal ratio 3:1. In Mendel’s data the ratios are just a little too close to this figure. It seems likely that he rejected ‘bad values,’ assuming he had done something wrong, when in fact they were just the results of chance.

The same thing can happen in physics. In 1909, Robert Millikan and Harvey Fletcher ran an experiment to determine the charge of a single electron. The experiment (also known as the ‘Millikan Oil Drop Experiment’) found that charge came in discrete units and thus showed that each electron has an identical charge. To do this they created charged oil drops and suspended them using an electric field. The relationship between the electrical field needed and the size (and hence weight) of a drop enabled them to calculate the charge on each oil drop. These always came in multiples of a single value, the electron charge. There are always sources of error in any measurements and yet the reported charges are a little too close to multiples of the same number. Again, it looks like ‘bad’ results were ignored as some form of mistake during the setup of the experiment.

2.2 QUICK (AND DIRTY!) TIP

We often deal with survey or count data. This might come in public forms such as opinion poll data preceding an election, or from your own data when you email out a survey, or count kinds of errors in a user study.

So when you find that 27% of the users in your study had a problem, how confident do you feel in using this to estimate the level of prevalence amongst users in general? If you did a bigger study with more users would you be surprised if the figure you got was actually 17%, 37%, or 77%?

You can work out precise numbers for this, but I often use a simple rule of thumb method for doing a quick estimate.

for survey or other count data do square root times two (ish)

We’re going to deal with this by looking at three separate cases.

2.2.1 CASE 1 – SMALL PROPORTIONS

First, consider the case when the number you are dealing with is a comparatively small proportion of the overall sample. For example, assume you want to know about people’s favourite colours. You do a survey of 1000 people and 10% say their favourite colour is blue. How reliable is this figure? If you had done a larger survey, would the answer still be close to 10% or could it be very different?

The simple rule is that the variation is 2x the square root of the number of people who chose blue.

To work this out, first calculate how many people the 10% represents. Given the sample was 1000, this is 100 people. The square root of 100 is 10, so 2x this is 20 people. You can be reasonably confident that the number of people choosing blue in your sample is within +/- 20 of the number you’d expect from the population as a whole. Dividing that +/- 20 people by the 1000 sample, the percentage of people for whom blue is their favourite colour is likely to be within +/- 2% of the measured 10%.

2.2.2 CASE 2 – LARGE MAJORITY

The second case is when you have a large majority who have selected a particular option. For example, let’s say in another survey, this time of 200 people, 85% said green was their favourite colour.

This time you still apply the “2x square root” rule, but instead focus on the smaller number, those who didn’t choose green. The 15% who didn’t choose green is 15% of 200, that is 30 people. The square root of 30 is about 5.5, so the expected variability is about +/-11, or in percentage terms about +/- 5%. That is, the real proportion over the population as a whole could be anywhere between 80% and 90%.

Notice how the variability of the proportion estimate from the sample increases as the sample size gets smaller.

2.2.3 CASE 3 – MIDDLING

Finally, if the numbers are near the middle, just take the square root, but this time multiply by 1.5.

For example, if you took a survey of 2000 people and 50% answered yes to a question, this represents 1000 people. The square root of 1000 is a bit over 30, and 1.5x this is around 50 people, so you expect a variation of about +/- 50 people, or about +/- 2.5%.
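The three cases can be gathered into one small function. This is only a sketch: the name rough_margin is hypothetical, and the band of 30% to 70% used to decide what counts as ‘middling’ is an assumption, since the rule of thumb does not give a precise cut-off.

```python
from math import sqrt


def rough_margin(sample_size, count, middle_band=(0.3, 0.7)):
    """Quick-and-dirty +/- margin (in people) for a survey count.

    middle_band is an assumed cut-off for 'middling' proportions.
    """
    proportion = count / sample_size
    low, high = middle_band
    if low <= proportion <= high:
        # Case 3: near the middle, 1.5 x square root of the count.
        return 1.5 * sqrt(count)
    # Cases 1 and 2: 2 x square root of the smaller group.
    smaller = min(count, sample_size - count)
    return 2 * sqrt(smaller)


if __name__ == "__main__":
    examples = [("blue", 1000, 100), ("green", 200, 170), ("yes", 2000, 1000)]
    for label, n, count in examples:
        margin = rough_margin(n, count)
        print(f"{label}: +/- {margin:.0f} people, "
              f"+/- {100 * margin / n:.1f} percentage points")
```

The three examples are the blue, green, and yes/no surveys from the cases above, so the output can be compared with the hand calculations.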

Opinion polls for elections often have samples of around 2000, so if the parties are within a few points of each other you really have no idea who will win.

2.2.4 WHY DOES THIS WORK?

For those who’d like to understand the detailed stats for this (skip if you don’t!) …

These three cases are simplified forms of the precise mathematical formula for the variance of a Binomial distribution, np(1 - p), where n is the number in the sample and p the true population proportion for the thing you are measuring. When you are dealing with fairly small proportions the (1 - p) term is close to 1, so the whole variance is close to np, that is, the number with the given value. You then take the square root to give the standard deviation. The factor of 2 is because about 95% of measurements fall within 2 standard deviations. The reason this becomes 1.5 in the middle is that you can no longer treat (1 - p) as nearly 1, and for p = 0.5 this makes things smaller by the square root of 0.5, which is about 0.7. Two times 0.7 is (about) one and a half (I did say quick and dirty!).
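To see how rough the rule of thumb is, the sketch below computes two binomial standard deviations directly and prints them next to the quick figures from the three cases. It assumes the sample proportion can stand in for the true p, which is itself an approximation; the function name is illustrative.

```python
from math import sqrt


def two_sd_margin(n, p):
    """Two binomial standard deviations, in people: 2 * sqrt(n p (1 - p))."""
    return 2 * sqrt(n * p * (1 - p))


if __name__ == "__main__":
    # (sample size, proportion, rule-of-thumb margin from the three cases)
    cases = [(1000, 0.10, 20), (200, 0.85, 11), (2000, 0.50, 50)]
    for n, p, quick in cases:
        print(f"n={n}, p={p}: quick +/- {quick}, "
              f"two standard deviations +/- {two_sd_margin(n, p):.1f}")
```

In each case the quick estimate comes out a little larger than the two-standard-deviation figure, which is the safer direction for a rule of thumb to err.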

2.2.5 MORE IMPORTANT THAN THE MATH …

However, for survey data, or indeed any kind of data, these calculations of variability are in the end far less critical than ensuring that the sample really does adequately measure the thing you are after.


Figure 2.3: Monty Hall problem: Should you swap doors? (source: https://en.wikipedia.org/wiki/Monty_Hall_problem#/media/File:Monty_open_door.svg)

Is it fair? Has the way you have selected people made one outcome more likely? For example, if you do an election opinion poll of your Facebook friends, this may not be indicative of the country at large!

For surveys, has there been self-selection? Maybe you asked a representative sample, but who actually answered? Often you get more responses from those who have strong feelings about the issue. For usability of software, this probably means those who have had a problem with it.

Have you phrased the question fairly? For example, people are far more likely to answer “Yes” to a question, so if you ask “do you want to leave?” you might get 60% saying “yes” and 40% saying “no,” but if you asked the question the opposite way, “do you want to stay?”, you might still get 60% saying “yes.”

We will discuss these kinds of issue in greater detail in Chapter 11.

2.3 PROBABILITY CAN BE HARD – FROM GOATS TO DNA

Simple techniques can help, but even mathematicians can get it wrong.

It would be nice if there was a magic bullet to make all of probability and statistics easy. I hope this book will help you make more sense of statistics, but there will always be difficult cases—our brains are just not built for complex probabilities. However, it may help to know that even experts can get it wrong!

We’ll look now at two complex issues in probability that even mathematicians sometimes find hard: the Monty Hall problem and DNA evidence. We’ll also see how a simple technique can help you tune your common sense for this kind of problem. This is not the magic bullet, but it may sometimes help.

2.3.1 THE MONTY HALL PROBLEM

There was a quiz show in the 1960s where the star prize was a car. After battling their way through previous rounds the winning contestant had one final challenge. There were three doors, behind one of which was the prize car, but behind each of the other two was a goat.

The contestant chose a door, but to increase the drama of the moment, the quizmaster did not immediately open the chosen door. Instead, they opened one of the others. The quizmaster, who knew which was the winning door, would always open a door with a goat behind. The contestant was then given the chance to change their mind. Imagine you are the contestant. What do you think you should do?

• Should you stick with the original choice?

• Should you change to the remaining unopened door?

• Or, doesn’t it make any difference?

Although there is a correct answer, there are several apparently compelling arguments in either direction:

One argument is that, as there were originally three closed doors, the chance of the car being behind the door you chose first was 1 in 3, whereas now that there are only two closed doors to choose from, the chance of it being behind the one you didn’t choose originally is 1 in 2, so you should change. However, the astute may have noticed that this is a slightly flawed probabilistic argument, as the probabilities don’t add up to one.

A counter argument is that at the end there are two closed doors, so the chances are even as to which has the car behind it, and hence there is no advantage to changing.

An information theoretic argument is similar—the remaining closed doors hide the car equally before and after the other door has been opened: you have no more knowledge, so why change your mind?

Even mathematicians and statisticians can argue about this, and when they work it out by enumerating the cases, they do not always believe the answer. It is one of those cases where common sense simply does not help … even for a mathematician!

Before revealing the correct answer, let’s have a thought experiment.

2.3.2 TIP: MAKE THE NUMBERS EXTREME

Imagine if instead of three doors there were a million doors. Behind 999,999 doors are goats, but behind the one lucky door there is a car.

I am the quizmaster and ask you to choose a door. Let’s say you choose door number 42. Now I open 999,998 of the remaining doors, being careful to only open doors that hide goats. You are left with two doors, your original choice and the one door I have not opened. Do you want to change your mind?


Figure 2.4: Monty Hall with a million doors?

This time it is pretty obvious that you should change. There was virtually no chance of you having chosen the right door to start with, so it was almost certainly (999,999 out of a million) one of the others—I have helpfully discarded all the rest so the remaining door I didn’t open is almost certainly the correct one.

It is as if, before I opened the 999,998 ‘goat’ doors, I’d asked you, “do you think the car is precisely behind door 42, or any of the others?”

In fact, exactly the same reasoning holds for three doors. In that case there was a 2/3 chance that the car was behind one of the two doors you did not choose, and as the quizmaster I discarded one of those, the one that hid a goat. So it is twice as likely as your original choice that the car is behind the door I did not open. Regarding the information theoretic argument: the act of opening the goat door does add information because the quizmaster knows which door hides the car, and only opens a goat door. However, it still feels a bit like smoke and mirrors with three doors, even though the million-door version is obvious.
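If the argument still feels like smoke and mirrors, you can simply enumerate the cases. The sketch below does this for any number of doors, assuming (as in the story) that the quizmaster knows where the car is and opens every door except your pick and one other, never revealing the car. The function name is made up for illustration.

```python
from fractions import Fraction


def win_probabilities(doors=3):
    """Exact chances of winning by sticking vs. switching."""
    pick = 0  # by symmetry, the contestant's first pick can be fixed
    stick_wins = switch_wins = 0
    for car in range(doors):  # each car position is equally likely
        if car == pick:
            stick_wins += 1   # the one other closed door hides a goat
        else:
            switch_wins += 1  # the host is forced to leave the car's door closed
    return Fraction(stick_wins, doors), Fraction(switch_wins, doors)


if __name__ == "__main__":
    for doors in (3, 1_000_000):
        stick, switch = win_probabilities(doors)
        print(f"{doors} doors: stick {stick}, switch {switch}")
```

The enumeration makes the key point visible: switching only loses in the single case where your first pick happened to be the car.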

Using the extreme case helps tune your common sense, often allowing you to see flaws in mistaken arguments, or work out the true explanation. It is not an infallible heuristic (sometimes arguments do change with scale), but it is often helpful.

2.3.3 DNA EVIDENCE

The Monty Hall problem has always been a bit of fun, albeit disappointing if you were the contestant who got it wrong. However, there are similar kinds of problem where the outcomes are deadly serious. DNA evidence is just such an example. Although each person’s DNA is almost unique, DNA testing is imperfect and has the possibility of error.

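Suppose there has been a murder, and traces of DNA have been found at the scene. The lab DNA matching has a false-match rate of one in 100,000; that is, there is a 1 in 100,000 chance that an unrelated person’s DNA is reported as a match.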

Imagine two scenarios.

Case 1: Shortly prior to the body being found, the victim had been known to have had a violent argument with a friend. The police match the DNA of the friend with DNA found at the murder scene. The friend is arrested and taken to court.

Case 2: The police look up the DNA in the national DNA database and find a positive match. The matched person is arrested and taken to court.

Similar cases have occurred and led to convictions based heavily on the DNA evidence. However, while in case 1 the DNA is strong corroborating evidence, in case 2 it is not. Yet courts, guided by ‘expert’ witnesses, have not understood the distinction and convicted people in situations like case 2. Belatedly, the problem has been recognised and in the UK there have been a number of appeals where longstanding cases have been overturned, sadly not before people have spent considerable periods behind bars for crimes they did not commit. One can only hope that similar evidence has not been crucial in jurisdictions with a death penalty.

If you were the judge or jury in such a case would the difference be obvious to you?

If not, we can use a similar trick to the one we used in the Monty Hall problem. There, we made the numbers a lot bigger; here we will make the numbers less extreme. Instead of a 1 in 100,000 chance of a false DNA match, let’s make it 1 in 100. While this is still useful, though not overwhelming, corroborative evidence in case 1, it is pretty obvious that in case 2, if there are more than a few hundred people in the police database, then you are almost bound to find a match.

It is as if a red Fiat Uno had been spotted outside the victim’s house. If the friend’s car was a red Fiat Uno it would be good additional circumstantial evidence, but simply arresting any red Fiat Uno owner would clearly be silly.

If we return to the original 1 in 100,000 figure for a DNA match, it is the same. If there are more than a few hundred thousand people in the database then you are almost bound to find a match. This might be a way to find people you might investigate by looking for other evidence, indeed that’s the way several cold cases have been solved over recent years, but the DNA evidence would not in itself be strong.
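The database effect is easy to put numbers on. The following is a minimal sketch, assuming each comparison against the database is an independent trial with the same false-match rate (a simplification of real DNA matching); the function names and the database sizes of 300 and 300,000 are illustrative.

```python
def chance_of_false_match(database_size, false_match_rate):
    """P(at least one innocent person in the database matches by chance)."""
    return 1 - (1 - false_match_rate) ** database_size


def expected_false_matches(database_size, false_match_rate):
    """Average number of chance matches when searching the whole database."""
    return database_size * false_match_rate


if __name__ == "__main__":
    scenarios = [
        ("one named suspect, 1 in 100,000", 1, 1 / 100_000),
        ("database of 300, 1 in 100", 300, 1 / 100),
        ("database of 300,000, 1 in 100,000", 300_000, 1 / 100_000),
    ]
    for label, size, rate in scenarios:
        print(f"{label}: P(some false match) = "
              f"{chance_of_false_match(size, rate):.5f}, "
              f"expected false matches = {expected_false_matches(size, rate):.5f}")
```

Testing a single suspect identified by other evidence leaves only a tiny chance of a coincidental match, whereas trawling a large database makes a coincidental match close to inevitable under these assumptions.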

In summary, some diverting puzzles and also some very serious problems involving probability can be very hard to understand. Our common sense is not well tuned to probability. Even trained mathematicians can get confused, which is one of the reasons we turn to formulae and calculations. However, we saw that changing the scale of numbers in a problem can sometimes help your common sense to understand them.
