The Tao of Statistics, by Dana K. Keller

6. Simplifying—Groups and Clusters

You and I are much alike

Over there, they are different

I will be with those like me

For now

Group differences are the cornerstone of much of the social research done today. The reason is that grouping is a convenient, logical, and valid way to reduce the complexity of data. Groups are created from ideas that people have about characteristics or conditions that separate interesting parts of the data. Gender and race/ethnicity are traditional examples, and they are used more often than they are critically reviewed for their relevance to the questions posed.

Clusters are groups that are created by sophisticated mathematics when researchers do not know where, or how, to separate their groups. To form clusters, statistical software needs variables, which someone must choose and code. To that extent, researchers need a fair idea of the characteristics that would distinguish between groups, something more than a random guess. The more researchers know about how groups were formed in the data, the better they can use group information to address interesting questions of the data. If I wanted to find, through clustering, the characteristics that predict whether a diabetic will have a biennial eye examination, I would start with as many demographic and health care access variables as I could find. I would not look for hat size or color preference information.

Both groups and clusters are used to understand inequalities in life. Also important, they can highlight similarities. Frequently, created clusters are used as though they were naturally occurring groups. With empirical evidence forming the foundation for a given cluster, is prior knowledge and recognition of the commonalities among members of that cluster required to consider members a group? Regardless of the answer to that question, retaining a semantic distinction between groups and clusters increases the specificity of results and adds context to the discussion.

The choice between groups and clusters depends on whether a strong hypothesis about the source of group differences exists before the research starts. Researchers have found that, lacking a strong hypothesis, letting the data speak for themselves (i.e., clustering) can provide interesting insights. Such insights, though, still need a substantive context.

Analysis of voting patterns by age bracket is a familiar use of groups. News reporters use groups in this way to show differences in voting preferences for young, middle-aged, and older adults. Clusters, on the other hand, might be formed to find common characteristics among members of a particular voting bloc, say those who voted for an independent candidate. Often, one finds that traditional groupings (e.g., gender) are used when they are not related to the question of interest, a situation that should give one pause.

The high school principal has all manner of information on groups: year in school, sports team membership, demographics, courses, and separate sections for the same courses, to name a few. Differences in students’ grades across these groups might suggest some important inequalities in his school. If vastly different grades, for example, were being awarded for similar work by similar students, an intervention with the teachers might be warranted. The statistics would not suggest which teacher graded too high or which too low. Those are value judgments. As we have seen, statistics do not make value judgments. They are remarkably evenhanded, even though statisticians, because of unconscious biases, might not always be (more on this later).
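The principal's comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the section names and grades are invented, and `group_means` is a hypothetical helper, not something from the book. It shows what the statistics can report (the size of the gap between sections) and what they cannot (which section is graded "correctly").

```python
from statistics import mean

# Invented data: (course section, grade awarded) for similar work by similar students.
records = [
    ("Section A", 88), ("Section A", 92), ("Section A", 90),
    ("Section B", 71), ("Section B", 75), ("Section B", 73),
]

def group_means(rows):
    """Collect the values for each group label and return the mean per group."""
    groups = {}
    for label, value in rows:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}

means = group_means(records)
gap = max(means.values()) - min(means.values())

# The numbers show that a gap exists; they do not say which section is "right".
print(means)  # {'Section A': 90, 'Section B': 73}
print(gap)    # 17
```

A 17-point gap invites a conversation with the teachers. Whether Section A is too generous or Section B too strict remains a value judgment that the code cannot make.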

The director of public health also has access to data that can easily be formed into groups. Nonetheless, she might want to create clusters to address questions about immunization patterns. For example, from epidemiology, she knows that certain religious groups and communities are reluctant to fully immunize their children. That reluctance creates a situation where childhood diseases with small or vanishing incidence in the overall population might suddenly emerge as a threat to public health, especially to adults born before widespread childhood immunization was implemented and expected.

The literature is not as clear as she would like for purposes of instituting policy changes aimed at improving adult immunization rates. She needs to understand local differences in patterns for people with higher immunization rates compared with those who have lower rates. Using cluster analysis to form groups is a viable method for using her data to understand important differences in these rates. Looking at the key variables selected by the computer would help her understand the differences between clusters that might be used to leverage her resources with the lower-scoring clusters.
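To make the idea of letting the computer form the groups concrete, here is a minimal sketch of k-means, one common clustering method. The text does not say which method the director uses, so treating it as k-means is an assumption. The two variables (distance to the nearest clinic in kilometers, and doses received) and all the values are invented for illustration, and starting from the first `k` records is a simplification that real software usually replaces with a smarter starting rule.

```python
from math import dist
from statistics import mean

def k_means(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then re-center."""
    # Simplifying assumption: use the first k points as the starting centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of each variable over the cluster's members.
        new_centroids = [
            tuple(mean(column) for column in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels

# Invented records: (distance to clinic in km, doses received)
people = [(2.0, 6.0), (25.0, 1.0), (4.0, 5.0), (23.0, 0.0), (3.0, 7.0), (27.0, 2.0)]

centroids, labels = k_means(people, k=2)
print(centroids)  # [(3.0, 6.0), (25.0, 1.0)]
print(labels)     # [0, 1, 0, 1, 0, 1]
```

In this toy data the cluster centers differ mostly on distance to the clinic, which is the kind of clue the director would then take to local experts. With real variables measured on very different scales, the variables would normally be standardized first, because the distance calculation otherwise lets the largest-scaled variable dominate.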

Fortunately, the director has very large samples with which she can test large numbers of combinations of variables in computer runs that would have been impossible only a few years ago. In minutes, she can test the most likely 30 or 40 variables. Once she sees the major contributing characteristics for cluster membership, she can look at the socioeconomic and cultural structures of the area, and then work with local expert interventionists in support of public health care policy.

A few words of caution are in order for using groups and clusters. Groups and clusters simplify analyses by reinforcing the notion of differences between people or events. The potential for stereotyping, intentional or not, is high. Appeals to a higher authority, such as the funding source, are not sufficient justification for unreflectively highlighting differences that are not relevant to the question or that are wrong by inference, sometimes subtly so. No method beyond honest reflection exists that will ensure the ethical use and labeling of groups or clusters.

To walk the path of statistical knowledge is to remain aware of unintentional damage that unreflective analyses might cause. These are the latent functions (the unintended consequences, in policy and program evaluation terms) of policy or reporting, be it public or private, large or small. Either formally or not, statistics are often used as evidence to support one perspective at the expense of another. Think about what those perspectives might be and whether you would want to be associated with them.
