
CHAPTER 1

Introduction

In this introductory chapter we consider:

• the nature of human cognition, which makes it hard to understand probability, and hence why we need formal statistics;

• whether you need to worry about statistics at all;

• the way statistics operates to offer us insight into the complexities of the world; and

• the different phases in research and software development and where different forms of qualitative and quantitative analysis are appropriate.

1.1 WHY ARE PROBABILITY AND STATISTICS SO HARD?

Do you find probability and statistics hard? If so, don’t worry, it’s not just you; it’s basic human psychology.

We have two systems of thought: (i) subconscious reactions that are based on semiprobabilistic associations, and (ii) conscious thinking that likes to have one model of the world and is really bad at probability. This is why we need to use mathematics and other explicit techniques to help us deal with probabilities. Furthermore, statistics needs both this mathematics of probability and an appreciation of what it means in the real world. Understanding this means you don’t have to feel bad about finding stats hard, and also helps to suggest ways to make it easier.

1.1.1 IN TWO MINDS

Skinner’s famous experiments with pigeons (Fig. 1.1) showed how certain kinds of learning could be studied in terms of associations between stimuli and rewards. If you present a reward enough times with the behaviour you want, the pigeon will learn to do it even when the original reward no longer happens. The learning is semi-probabilistic in the sense that if rewards are more common the learning is faster, or if rewards and penalties both happen at different frequencies, then you get a level of trade-off in the learning. At a cognitive level one can think of strengths of association being built up with rewards strengthening them and penalties inhibiting them.


Figure 1.1: Pigeon about to tap for food (source: https://archive.org/details/controllingbehaviorthroughreinforcement).

This kind of learning is not quite a weighted sum of past experience: for example, negative experiences typically count more than positive ones, and once a pattern is established it takes a lot to shift it. However, it is not so far from a probability estimate. We humans share these subconscious learning processes with other animals. They are powerful and lead to very rapid reactions, but need very large numbers of exposures to similar situations to establish memories.

Of course we are not just our subconscious! In addition, we have conscious thinking and reasoning, which enable us to learn from a single experience. Retrospectively we are able to retrieve a relevant past experience, compare it to what we are encountering now, and work out what to do based on it. This is very powerful, but unlike our more unconscious sea of overlapping memories and associations, our conscious mind is linear and is normally locked into a single model of the world. Because of that single model, this form of thinking is not so good at intuitively grasping probabilities, as is repeatedly evidenced by gambling behaviour and more broadly our assessment of risk.

One experiment used four packs of cards with different penalties and rewards to see how quickly people could assess the difference [5]. The experiment included some patients with prefrontal brain damage, but we’ll just consider the non-patients. The subjects could choose cards from the different packs. Each pack had an initial reward attached to it, but when they turned over a card it might also have a penalty, “sorry, you’ve lost $500.” Some of the packs, those with the higher initial per-card reward, had more penalties, and the other packs had a better balance of rewards. After playing for a while most subjects realised that the packs were different and could tell which were better. The subjects were also wired up to a skin conductivity sensor as used in a lie detector. Well before they were able to say that some of the card packs were worse than the others, they showed a response on the sensor when they were about to turn over a card from the disadvantageous pack—that is, subconsciously they knew it was likely to be a bad card.

Because our conscious mind is not naturally good at dealing with probabilities we need to use the tool of mathematics to enable us to reason explicitly about them. For example, if the subjects in the experiment had kept a tally of good and bad cards, they would have seen, in the numbers, which packs were better.
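To see how explicit tallying side-steps our poor intuition, here is a minimal sketch in Python; the reward and penalty values are invented for illustration and are not those of the original experiment. It simply keeps a running total of winnings and losses for each pack.

```python
import random

# Hypothetical pack parameters (reward per card, chance of a penalty card,
# penalty size) -- illustrative values only, not those from the original study.
packs = {
    "high-reward, risky pack": (100, 0.5, 250),
    "low-reward, safer pack":  (50,  0.1, 250),
}

random.seed(1)
for name, (reward, p_penalty, penalty) in packs.items():
    tally = 0
    for _ in range(100):                    # turn over 100 cards from this pack
        tally += reward
        if random.random() < p_penalty:     # some cards also carry a penalty
            tally -= penalty
    print(f"{name}: net winnings after 100 cards = {tally}")
```

Even this crude count makes the difference between the packs plain in the totals, without relying on intuition at all.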

1.1.2 MATHS AND MORE

Some years ago, when I was first teaching statistics, I remember learning that statistics education was known to be particularly difficult. This is in part because it requires a combination of maths and real-world thinking.

In statistics we use the explicit tallying of data and mathematical reasoning about probabilities to let us do quite complex reasoning from effects (measurements) back to causes (the real world phenomena that are being measured). So you do need to feel reasonably comfortable with this mathematics. However, even if you are a whizz at maths, if you can’t relate this back to understanding about the real world, you are also stuck. It is a bit like the applied maths problems where people get so lost in the maths that they forget the units: “the answer is 42”—but 42 what? 42 degrees, 42 metres, or 42 bananas?

On the whole, those who are good at mathematics are not always good at relating their thinking back to the real world, and those of a more practical disposition are not always best at maths—no wonder statistics is hard!

However, knowing this we can try to make things better.

It is likely that the majority of readers of this book will have a stronger sense of the practical issues, so I will try to explain some of the concepts that are necessary, without getting deep into the mathematics of how they are calculated—leave that to the computer!

1.2 DO YOU NEED STATS AT ALL?

The fact that you have opened this book suggests that you think you should learn something about statistics. However, maybe the majority of your work is qualitative, or you typically do small-scale studies and you wonder if it is sufficient to eyeball the raw data and make a judgement.

Sometimes no statistics are necessary. Perhaps you have performed a small user trial and one user makes an error; you look at the circumstances and think “of course lots of users will have the same problem.” Your judgement is based purely on past experience and professional knowledge.

However, suppose you have performed a survey comparing two alternative systems and asked users which system they prefer. The results are shown in Fig. 1.2. It is clear that System A is far more popular than System B. Or is it?

Notice that the left-hand scale has two notches, but no values. Let’s suppose first that the notches are at 1000 and 2000: the results of surveying 3000 people. This is obviously a clear result. However, if instead the notches were at 1 and 2, representing a survey of 3 users, you might not be so confident in the results. As you eyeball the data, you are performing some informal statistics.


Figure 1.2: User preferences comparing two systems.

What if it were 10 to 20, or 5 to 10? How clear a result would that be? The job of statistics is precisely to help you with judgements such as these.
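As a small illustration of what that informal judgement amounts to, the sketch below asks: if preferences were really 50:50, how likely is a split at least as lopsided as the one observed? It uses the textbook normal approximation to the binomial (only the Python standard library) and applies it to the hypothetical splits mentioned above.

```python
from math import erf, sqrt

def chance_if_even(a, b):
    """Rough chance of seeing at least `a` out of `a + b` people preferring System A,
    if preferences were really 50:50 (normal approximation to the binomial)."""
    n = a + b
    mean, sd = n * 0.5, sqrt(n * 0.25)
    z = (a - 0.5 - mean) / sd            # the 0.5 is a continuity correction
    return 0.5 * (1 - erf(z / sqrt(2)))

for a, b in [(2, 1), (10, 5), (20, 10), (2000, 1000)]:
    print(f"{a} vs {b}: chance of a split this extreme if really 50:50 is about {chance_if_even(a, b):.3g}")
```

A 2-to-1 vote among 3 people tells you almost nothing, 20-to-10 is only suggestive, and 2000-to-1000 leaves essentially no room for doubt; Chapter 6 treats this kind of reasoning (hypothesis testing) properly.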

1.3 THE JOB OF STATISTICS—FROM THE REAL WORLD TO MEASUREMENT AND BACK AGAIN

If you want to use statistics you will need to learn about t-tests and p-values, perhaps Bayesian statistics or Normal distributions, maybe a stats package such as SPSS or R. But why do this at all? What does statistics actually achieve?

Fundamentally, statistics is about trying to learn dependable things about the real world based on measurements of it.

However, what we mean by ‘real’ is itself a little complicated, from the actual users you have tested to the hypothetical idea of a ‘typical user’ of your system.

1.3.1 THE ‘REAL’ WORLD

We’ll start with the real world, but what is that?

the sample First of all, there is the actual data you have: results from an experiment, responses from a survey, or log data from a deployed application. This is the real world. The user you tested at 3 PM on a rainy day in March, after a slightly overfilling lunch, did make precisely 3 errors and finished the task in 17 minutes and 23 seconds. However, while this measured data is real, it is typically not what you wanted to know. Would the same user on a different day, under different conditions, have made the same errors? What about other users?

the population Another idea of ‘real’ is when there is a larger group of people you want to know about, say all the employees in your company, or all users of product A. This larger group is often referred to as the population. What would be the average (and variation in) error rate if all of them sat down and used the software you are testing? Or, as a more concrete kind of measurement, what is their average height? You might take a sample of 20 people and find their average height, but you are using this to make an estimate about your population as a whole.

the ideal However, while this idea of the actual population is very concrete, often the ‘real’ world you are interested in is slightly more nebulous. Consider the current users of product A. You are interested in the error rate not only if they try your new software today, but if they do so multiple times over a period—that is, a sort of ‘typical’ error rate when each uses the software.

Furthermore, it is not so much the actual set of current users (not that you don’t care about them), but rather the typical user, especially for a new piece of software where you have no current users yet. Similarly, when you toss a coin you have an idea of the behaviour of a fair coin, which is not simply the complete collection of every coin in circulation. Even when you have tossed the coin, you can still think about the different ways it could have fallen, somehow reasoning about all possible pasts and presents for an unrepeatable event.

the theoretical Finally, this hypothetical ‘real’ event may be represented mathematically as a theoretical distribution such as the Normal distribution (for heights) or Binomial distribution (for coin tosses).

In practice, you rarely need to voice these things explicitly, but occasionally you do need to think carefully about it. If you have done a series of consistent blood tests you may know something very important about a particular individual, but not patients in general. If you are analysing big data you may know something very precise about your current users, and how they behave given a particular social context and particular algorithms in your system, but not necessarily about potential users and how they may behave if your algorithms and environment change.
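To make the idea of a theoretical ‘real world’ tangible, the short sketch below draws samples from the two distributions mentioned above: a Normal distribution as an idealised model of heights and a Binomial as an idealised model of coin tosses. The parameters (mean 170 cm, standard deviation 9 cm, ten tosses of a fair coin) are purely illustrative.

```python
import random

random.seed(7)

# Normal distribution: an idealised model of heights (illustrative parameters).
heights = [random.gauss(170, 9) for _ in range(5)]
print("five heights drawn from the ideal model:", [round(h, 1) for h in heights])

# Binomial distribution: an idealised model of counting heads in 10 fair tosses.
def heads_in_n_tosses(n=10, p=0.5):
    return sum(random.random() < p for _ in range(n))

print("heads in three runs of 10 tosses:", [heads_in_n_tosses() for _ in range(3)])
```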

1.3.2 THERE AND BACK AGAIN

Once you have clarity about the ‘real’ world that you want to investigate, the job of statistics also becomes more clear. You have taken measurements, often of some sample of people and situations, and you want to use the measurements to understand the real world (Fig. 1.3).

For example, given a sample of heights of 20 randomly chosen people from your organisation, what can you infer about the heights of everyone? Given the error rates of 20 people on an artificial task in a lab, what can you tell about the behaviour of a typical user in their everyday situation? Given the complete past history of ten million users of a website, what does this tell us about their future behaviour or the behaviour of a new user to the site?
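As a minimal sketch of the first of these questions (with invented numbers throughout), the code below measures 20 people drawn at random from a pretend organisation and uses that sample to give a rough interval estimate for the average height of everyone; Chapter 6 covers confidence intervals properly.

```python
import random, statistics

random.seed(42)
# Pretend organisation: the 'real' heights (cm) of all 2000 employees (invented).
population = [random.gauss(170, 9) for _ in range(2000)]

sample = random.sample(population, 20)                  # the 20 people actually measured
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5     # standard error of the mean

# Rough 95% interval using mean +/- 2 standard errors.
print(f"sample mean          = {mean:.1f} cm")
print(f"rough 95% interval   = {mean - 2 * sem:.1f} to {mean + 2 * sem:.1f} cm")
print(f"true population mean = {statistics.mean(population):.1f} cm")
```

Re-running with a different seed gives a different sample and a slightly different interval, which is exactly the kind of uncertainty the next subsection is about.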


Figure 1.3: The job of statistics—moving from data about the real world back to knowledge about the real world.

1.3.3 NOISE AND RANDOMNESS

If all the measurements we had were deterministic, we would not need statistics. For example, an ultrasonic range finder sends a pulse of sound, measures how long it takes to return, then multiplies the time by the speed of sound, divides by two, and gives you a readout: a purely deterministic calculation from the time measured to the distance displayed.

In the case of the sample of 20 people we can measure each of their heights relatively accurately, but maybe even this has some inaccuracy, so each measurement has some ‘noise.’ More critical is that they are randomly chosen from the far larger population of employees. In this and many similar situations, there is a degree of randomness in the measurements on which we base our decision making.

Just as with ‘real,’ ‘random’ is not so straightforward.

Some would argue that everything is pre-determined from its causes, with the possible exception of quantum mechanics, and even then only in some interpretations. However, in reality, when we toss a coin or roll a die, we treat these as probabilistic phenomena.

fundamentally random These are predominantly quantum-level processes such as the decay of radionuclides. They are used for some of the most critical random number generators.

complex processes When we toss a coin, the high speed of the spinning coin, coupled with the airflows around it as it falls, means that its path is so complex that it is effectively random. In the digital world, random number generators are often seeded by measuring a large number of system parameters, each in principle deterministic, but so complex and varied that they are effectively unpredictable.

past random events Imagine you have tossed a coin, and your colleague has taken a quick peek, but you have not yet looked at it. What is the probability it is a head? Instinctively, you would probably say “1 in 2.” Clearly, it is already completely determined, but in your state of knowledge it is still effectively random.

uncontrolled factors As you go round measuring the heights of the people, perhaps tiny air movements subtly affect your ultrasonic height measurement. Or if you subsequently ask the people to perform a website navigation task, perhaps some have better web skills than others, or better spatial ability. Sometimes we can measure such effects, but often we have to treat them as effectively random.

Note that most people would regard the first two of these as ‘really’ random, or we could call them ontologically random—random in their actual state of being. In contrast the latter two are epistemologically random—random in your state of knowledge. In practice, we often treat all these similarly.

A more important, and practically useful, set of distinctions is as follows:

persistence In some cases the random effect is in some way persistent (such as the skill or height of the person), but in other cases it is different for every measurement (like the air movements). This is important as the former may be measurable themselves, or in some circumstances can be cancelled out.

probability With the coin or die, we have an idea of the relative likelihood of each outcome, that is, we can assign probabilities, such as 1/6 for the die rolling a ‘3’. However, some things are fundamentally unknown, such as the trillionth digit of π; all we know is that it is one of the ten digits 0–9.

uniformity For the probabilistic phenomena, some are uniform: the chances of heads and tails are pretty much equal, as are the chances of landing on each of the six faces of a die. However, others are spread unevenly, such as the level of skill or height of a random employee. For the latter, we often need to be able to know or measure the shape of this unevenness (its distribution).
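For unevenly spread quantities like skill or height, one practical way to ‘know or measure the shape of this unevenness’ is simply to bin some measurements and look at them. The sketch below does this with invented skill scores and a crude text histogram.

```python
import random
from collections import Counter

random.seed(3)
# Invented 'web skill' scores (0-99) for 200 employees: most middling, a few experts.
scores = [min(99, max(0, random.gauss(55, 15))) for _ in range(200)]

bins = Counter(int(s // 10) * 10 for s in scores)        # group into bands of ten
for lo in range(0, 100, 10):
    print(f"{lo:2d}-{lo + 9:2d} | {'#' * bins.get(lo, 0)}")
```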

In order for statistics to be useful, the phenomena we deal with need to have some probability attached to them, but this does not need to be uniform, indeed probability distributions (see Chapter 4) capture precisely this non-uniformity. Philosophically, there are many ways we can think about these probabilities:

frequentist This is the most down-to-earth interpretation. When you say the chance of a coin landing heads is 50:50, you mean that if you keep on tossing the coin again and again and again, on average, after many many tosses, the ratio of heads to tails will be about 50:50. In the case of an unrepeatable phenomenon, such as the already tossed coin, this can be interpreted as “if I reset the world and re-ran it lots of times,” though that, of course, is not quite so ‘down to earth.’

idealist Plato saw the actual events of the world as mere reflections of deeper ideals. The toss of the actual coin in some ways is ‘just’ an example of an ideal coin toss. Even if you toss a coin five times in a row, and it happens to come up heads each time, you probably still believe that it is ‘really’ a 50:50 phenomenon.

formalist This is a pragmatic position: it doesn’t matter what probability ‘really’ is, so long as it satisfies the right mathematical rules. In particular, Bayesian statistics encodes beliefs as 0–1 values, which satisfy the rules of probability (sometimes called plausibility or reasonable expectation [14, 40, 41]).

Often frequentist is used to refer to more traditional forms of statistics such as hypothesis testing (see Chapter 6), in contrast to Bayesian statistics (see Chapter 7), because the latter usually adopts the formalist approach, treating probability as belief. However, this is a misnomer as one can have frequentist interpretations of Bayesian methods and one can certainly apply formalism to traditional statistics. Personally, I tend to use frequentist language to explain phenomena, and formalism to do actual calculations … but deep down I am an idealist!
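The frequentist reading is easy to experience for yourself: the sketch below (plain Python, a simulated fair coin) shows the proportion of heads settling toward 0.5 as the number of tosses grows, which is exactly the long-run ratio the interpretation appeals to.

```python
import random

random.seed(0)
heads = 0
for n in range(1, 100_001):
    heads += random.random() < 0.5                 # one toss of a simulated fair coin
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {n:>6,} tosses: proportion of heads = {heads / n:.3f}")
```

Typically the proportion is still several percentage points away from 0.5 after 100 tosses, but within a fraction of a percentage point after 100,000: that is the sense in which the ratio is ‘about 50:50’.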

We will explore and experiment further with randomness in the next chapter, but let us focus for the moment on the goal of working back from measurements to the real world. When the measurements include random effects, it is evident that answering questions about the real world requires a combination of probability and common sense—and that is precisely the job of statistics.

1.4 WHY ARE YOU DOING IT?

Are you doing empirical work because you are an academic addressing a research question, or a practitioner trying to design a better system? Is your work intended to test an existing hypothesis (validation) or to find out what you should be looking for (exploration)? Is it a one-off study, or part of a process (e.g., ‘5 users’ for iterative development)?

These seem like obvious questions, but, in the midst of performing and analysing your study, it is surprisingly easy to lose track of your initial reasons for doing it. Indeed, it is common to read a research paper where the authors have performed evaluations that are more appropriate for user interface development, reporting issues such as wording on menus rather than addressing the principles that prompted their study.

This is partly because there are similarities between academic research and UX practice, with parallels both in the empirical methods used and between the stages of each. Furthermore, your goals may shift—you might be in the midst of work to verify a prior research hypothesis, and then notice an anomaly in the data, which suggests a new phenomenon to study or a potential idea for a product.

We’ll start out by looking at the processes of research and software development separately, and then explore the parallels. Being aware of the key stages of each helps you keep track of why you are doing a study and also how you should approach your work. In each we find stages where different techniques are more or less appropriate: some need no statistics at all and are best served by qualitative methods such as ethnography; for some a ‘gut’ feeling for the numbers is sufficient, but no more; and some require formal statistical analysis.


Figure 1.4: Research—different goals for empirical studies.

1.4.1 EMPIRICAL RESEARCH

There are three main uses of empirical work during research, which often relate to the stages or goals of a research project (Fig. 1.4).

exploration This is principally about identifying the questions you want to ask. Techniques for exploration are often open-ended. They may be qualitative: ethnography, in-depth interviews, or detailed observation of behaviour whether in the lab or in the wild. However, this is also a stage that might involve (relatively) big data, for example, if you have deployed software with logging, or have conducted a large-scale, but open-ended, survey. Data analysis may then be used to uncover patterns, which may suggest research questions. Note, you may not need this as a stage of research if you began with an existing hypothesis, perhaps from previous phases of your own research, questions arising from other published work, or based on your own experiences.

validation This is predominantly about answering questions or verifying hypotheses. This is often the stage that involves the most quantitative work, including experiments and large-scale surveys. This is also the stage that is most often published, especially in terms of statistical results, but that does not mean it is the most important. In order to validate, you must establish what you want to study (exploration) and what it means (explanation).

explanation While the validation phase confirms that an observation is true, or a behaviour is prevalent, this stage is about working out why it is true, and how it happens in detail. Work at this stage often returns to more qualitative or observational methods, but with a tighter focus. However, it may also be more theory based, using existing models, or developing new ones in order to explain a phenomenon. Crucially it is about establishing mechanism, uncovering detailed step-by-step behaviours … a topic we shall return to later.


Figure 1.5: Iterative development process.

Of course these stages may often overlap, and data gathered for one purpose may turn out to be useful for another. For example, work intended for validation or explanation may reveal anomalous behaviours that lead to fresh questions and new hypotheses. However, it is important to know which goal you were intending to address, and, if you change, how and why you are looking at the data differently … and whether this matters.

1.4.2 SOFTWARE DEVELOPMENT

Figure 1.5 shows a typical iterative software development or user experience design cycle. Initial design activity leads to the making of some sort of demonstrable artefact. In the early stages this might be storyboards, or sketches, later wireframes or hi-res prototypes, or in the case of agile development an actual running system. This is then subjected to some form of testing or evaluation.

During this process we are used to two different kinds of evaluation point.

formative evaluation This is about making the system better. It is performed on the design artefacts (sketch, prototype, or experimental system) during the cycles of design–build–test. The form of this varies from expert evaluation to a large-scale user test. The primary purpose of formative evaluation is to uncover usability or experience problems for the next cycle.

summative evaluation This is about checking that the system works and is good enough. It is performed at the end of the software development process on a pre-release product. It may be related to contractual obligations: “95% of users will be able to use the product for purpose X after 20 minutes’ training;” or may be comparative: “the new software outperforms competitor Y on both performance and user satisfaction.” In less formal situations, it may simply be an assessment that enough work has been done based on the cumulative evidence from the formative stages.


Figure 1.6: Parallels between academic research and iterative development.


Figure 1.7: Parallel: exploration—formative evaluation.

In web applications, the boundaries can become a little less clear as changes and testing may happen on the live system as part of perpetual-beta releases or A–B testing.

1.4.3 PARALLELS

Although research and software development have different overall goals, we can see some obvious parallels between the two (Fig. 1.6). There are clear links between explorative research and formative evaluations, and between validation and summative evaluations. However, it is perhaps less immediately clear how explanatory research connects with development.

We will look at each in turn.

Exploration –formative

During the exploration stage of research or during formative evaluation of a product, you are interested in finding any interesting issue (Fig. 1.7). For research this is about something that you may then go on to study in depth and hope to publish papers about. In software development it is about finding usability problems to fix or identifying opportunities for improvements or enhancements. It does not matter whether you have found the most important issue, or the most debilitating bug, so long as you have found sufficient for the next cycle of development.


Figure 1.8: Parallel: validation—summative evaluation.

Statistics are less important at this stage, but may help you establish priorities. If costs or time are short, you may need to decide which of the issues you have uncovered is most interesting to study further, or fix first. In practical usability, the challenge is not usually finding problems, nor even working out how to fix them; it is deciding which are worth fixing.

Validation –summative evaluation

In both validation in research and summative evaluation during development (Fig. 1.8), the focus is much more exhaustive: you want to find all problems and issues (though we hope that few remain during summative evaluation!).

The answers you need are definitive. You are not so much interested in new directions (though that may be an accidental outcome); instead, you are verifying that your precise hypothesis is true, or that the system works as intended. For this you may need statistical tests, whether traditional (p-value) or Bayesian (odds ratio).

You may also want figures: how good is it (e.g., “nine out of ten owners say their cats prefer …”), how prevalent is an issue (e.g., “95% of users successfully use the auto-grow feature”). For this the size of effects is important, so you may be more interested in confidence intervals, or pretty graphs with error bars on them.

As we noted earlier, in practical software development there may not be an explicit summative step, but the decision will be based on the ongoing cycles of formative assessment. This is of course a statistical assessment, however informal; perhaps you just note that the number and severity of problems found has decreased with each iteration. It may also be pragmatic: you’ve run out of time and are simply delivering the best product you have. However, if there is any form of external client, or if the product is likely to be business critical, there should be some form of quality assessment. The decision about whether to use formal statistical methods, eyeballing of graphs and data, or simple expert assessment will depend on many factors including the pragmatics of liability and available time.

Are five users enough?

One of the most well-known (and misunderstood) myths of interaction design is the idea that five users are enough. I lose count of the number of times I have been asked about this, let alone seen variants of it quoted as a justification for study sizes in published papers.

The idea originated in a paper by Nielsen and Landauer [54], 25 years ago. However, that was crucially about formative evaluation during iterative development. I emphasise, it was neither about summative evaluation, nor about sufficient numbers for statistics!

Nielsen and Landauer combined a simple theoretical model based on software bug detection with empirical data from a small number of substantial software projects to establish the optimum number of users to test per iteration.

Their notion of ‘optimum’ was based on cost–benefit analysis: each cycle of development costs a certain amount, each user test costs a certain amount. If you uncover too few user problems in each cycle you end up with many development cycles, which is expensive in terms of developer time. However, if you perform too many user tests you repeatedly find the same problems, thus wasting user-testing effort.

The optimum value depends on the size and complexity of the project, with the number far higher for more complex projects, where redevelopment cycles are more costly; the figure of five was a rough average based on the projects studied at the time. Nowadays, with better tool support, redevelopment cycles are far less expensive than any of the projects in the original study, and there are arguments that the optimal value may now even be just testing one user [50]—especially if it is obvious that the issues uncovered are ones that appear likely to be common. This idea of one-by-one testing has been embedded in the RITE method (Rapid Iterative Testing and Evaluation), which in addition advocates having various stakeholders heavily involved in very rapid cycles of testing and fixing [52, 53].
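To give a feel for the style of that cost–benefit argument, here is a much-simplified sketch, not Nielsen and Landauer’s actual model. The 31% detection rate is the often-quoted average from their data; the costs and the number of problems are invented, and changing them moves the optimum.

```python
# Simplified cost-benefit sketch -- illustrative figures only.
detection_rate = 0.31      # chance a single user test reveals any given problem (quoted average)
cost_per_cycle = 4000      # cost of one redesign/redevelopment cycle (invented)
cost_per_user  = 600       # cost of running one user test (invented)
total_problems = 40        # problems assumed to be present in the design (invented)

def problems_per_unit_cost(n_users):
    """Expected problems found in one cycle, divided by what that cycle costs."""
    found = total_problems * (1 - (1 - detection_rate) ** n_users)
    return found / (cost_per_cycle + n_users * cost_per_user)

for n in range(1, 13):
    print(f"{n:2d} users per cycle: {problems_per_unit_cost(n):.4f} problems per unit cost")
```

With these made-up figures the best value comes at around four or five users per cycle; make redevelopment cycles much cheaper (a smaller cost_per_cycle) and the optimum drifts down toward one or two, echoing the arguments above.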

However, whether 1, 5, or 20 users, there will be more users on the next iteration—this is not about the total number of users tested during development. In particular, at later stages of development, when the most glaring problems have been fixed, it will become more important to ensure you have covered a sufficient range of the target user group.

For more on this see Jakob Nielsen’s more recent and nuanced advice [55] and my own analyses of “Are five users enough?” [20].


Figure 1.9: Parallel: explanation.

Explanation

While validation establishes that a phenomenon occurs (what is true), explanation tries to work out why it happens and how it works (Fig. 1.9): deep understanding.

As noted, this will often involve more qualitative work on small samples of people. However, it is often best when connected with quantitative studies of large samples. For example, you might have a small number of rich in-depth interviews, but match the participants against the demographics of large-scale surveys. If, say, a particular pattern of response is evident in the large study and your in-depth interviewee has a similar response, it is often reasonable to assume that their reasons will be similar to those of the larger sample. Of course, they could just be saying the same thing for completely different reasons, but often common sense or prior knowledge means that the reliability is evident. If you are uncertain of the reliability of the explanation, that could always drive targeted questions in a further round of large-scale surveys.

Similarly, if you have noticed a particular behaviour in logging data from a deployed experimental application, and a user has the same behaviour during a think aloud session or eyetracking session, then it is reasonable to assume that their vocal deliberations and cognitive or perceptual behaviours may be similar to those of the users of the deployed application.

We noted that the parallel with software development was unclear; however, the last example starts to point toward a connection.

During the development process, user testing often reveals many minor problems. It iterates toward a good-enough solution, but rarely makes large-scale changes. Furthermore, at worst, the changes you perform at each cycle may create new problems. This is a common problem with software bugs where code becomes fragile, and with user interfaces, where each change in the interface creates further confusion, and may not even solve the problem that gave rise to it. After a while you may lose track of why each feature is there at all.

Rich understanding of the underlying human processes—perceptual, cognitive, social—can both ensure that ‘bug fixes’ actually solve the problem, and allow more radical, but informed redesign that may make whole rafts of problems simply disappear.

1.5 WHAT’S NEXT

The rest of this book is divided into three parts.

Wild and wide—concerning randomness and distributions. This part will help you get a ‘gut feel’ for random phenomena and some of the ways to describe and understand probabilities. In Chapter 2, we will explore this using a number of coin-tossing experiments, and then look at key concepts: bias, variability, and independence in Chapter 3, and probability distributions in Chapter 4.

Doing it—if not p then what. This part is about being able to make sense of the statistics you see in articles and reports. After exploring the general issue of the job of statistics further in Chapter 5, Chapter 6 covers traditional statistics (hypothesis testing and confidence intervals), and Chapter 7 introduces Bayesian statistics. For each we will consider what they mean and, as importantly, misinterpretations. Chapter 8 describes some of the common issues and problems faced by all these statistical methods, including the dangers of cherry picking and the benefits of simulation and empirical methods made possible by computation. Chapter 9 focuses on the differences between these approaches and gives my own recommendations for the best choice of methods.

Design and interpretation. The last part of this book is focused on the decisions you need to make as you design your own studies and experiments, and interpret the results. Chapter 10 is about increasing the statistical power of your studies, that is making it more likely you will spot real effects. Chapter 11 moves on to when you have results and want to make sense of them and present them to others; however, much of this advice is also relevant when you are reading the work of others. Finally, Chapter 12 reviews the current state of statistics within HCI and recent developments including adoption of new statistical methods and the analysis of big data.
