Error in data construction

The construction of data of any kind is likely to involve errors from various sources. These might include, for example:

 inappropriate specification of cases;

 biased selection of cases;

 random sampling error;

 poor data capture techniques;

 non-response;

 response error;

 interviewer error;

 measurement error.

The appropriateness of the type or types of case specified for a piece of research is often taken for granted rather than argued and justified. Ideally, not only must cases share sufficient background characteristics for comparisons to be made between them, but also those characteristics must be relevant to the topic under investigation. So, selecting university students or ‘housewives’ to study attitudes of hostility towards allowing female or gay bishops in the Church of England may not be appropriate.

Once the characteristics that define the research population of cases have been specified, the selection of cases to be used in the research may be made in a variety of different ways, but should, as far as possible, avoid the over- or under-representation of types of case. Such bias may arise because, for example, the sampling frame from which cases are selected omits certain kinds of case, or because non-random methods of case selection have been used, for example where interviewers have been asked to select respondents. Any biases in selection procedures will not be reduced by increasing the size of the sample.

Even in carefully conducted random samples, there will always be potential for fluctuations from sample to sample such that values of properties recorded for a particular sample may not reflect the actual values in the population from which the sample was drawn. The probability of getting sampling errors of given sizes can be calculated using statistical inference, which is explained at various points in Chapters 4–6.
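
By way of illustration (a minimal sketch that is not part of the original text; the sample size, the proportion and the use of the normal approximation are invented for the purpose), the standard error of a sample proportion under simple random sampling, and an approximate 95 per cent confidence interval, might be computed as follows:

```python
import math

# Hypothetical survey: 1,000 respondents, 42% said 'Yes' (illustrative figures only)
n = 1000
p = 0.42

# Standard error of a sample proportion under simple random sampling
se = math.sqrt(p * (1 - p) / n)

# Approximate 95% confidence interval using the normal approximation (z = 1.96)
lower = p - 1.96 * se
upper = p + 1.96 * se

print(f"Standard error: {se:.4f}")                  # about 0.0156
print(f"95% interval: {lower:.3f} to {upper:.3f}")  # roughly 0.389 to 0.451
```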

Error in the construction of data can arise as a result of poor data capture techniques, for example in questionnaire design. There are many things that can go wrong both in the design of individual questions and in the overall design of the questionnaire. Some of these problems arise because the researcher has not followed the guidelines for good questionnaire design, in particular the stages for questionnaire construction, for question wording, routeing and sequencing. Any of these problems will result in errors of various kinds and their extent is unlikely to be known. It has been shown many times over that the responses people give to questions are notoriously sensitive to question wording. However, answers are also affected by the response options people are given in fixed choice questions, by whether or not there is a middle category in a rating, by whether or not there is a ‘don’t know’ filter, or by the ordering of the questions, the ordering of the responses or their position on the page. All the researcher can do is to minimize the likelihood of errors arising from poor questionnaire design through design improvements.

A source of error in virtually all survey research is non-response. It is seldom that all individuals who are selected as potential respondents are successfully contacted, and it is seldom that all those contacted agree to co-operate. Non-contacts are unlikely to be representative of the total population of cases. Married women with young children, for example, are more likely to be at home during the day on weekdays than are men, married women without children, or single women. The probability of finding somebody at home is also greater for low-income families and for rural families. Call-backs during the evening or at weekends may minimize this source of bias, but it will never be eliminated.

The contact rate takes the number of eligible cases contacted as a proportion of the total number of eligible cases approached. Interviewers may be compared or monitored in terms of their contact rates. Potential respondents who have been contacted may still refuse co-operation for a whole variety of reasons including inconvenience, the subject matter, fear of a sales pitch or negative reaction to the interviewer. The refusal rate generally takes the number of refusals as a proportion of the number of eligible cases contacted. Once again, refusals are unlikely to be representative, for example they may be higher among women, non-whites, the less educated, the less well off and the elderly. The detection of refusal bias usually relies on checking differences between those who agreed to the initial contact and those who agreed only after later follow-ups on the assumption that these are likely to be more representative of refusals.

Most researchers report a response rate for their study and this will normally combine the ideas of a contact rate and a refusal rate. However, in terms of its actual calculation a bewildering array of alternatives is possible. Normally, it is the number of completed questionnaires divided by the number of individuals approached. Sometimes the number found to be ineligible is excluded from the latter. Yet others will argue that the same applies to non-contacts, terminations and rejects. The result will be dramatically different calculations of the response rate. Whichever of these is reported, however, what is important as far as error in data construction is concerned is the extent to which those not responding – for whatever reason – are in any way systematically different from those who successfully completed. Whether or not this is likely to be the case will depend substantially on whether or not there were call-backs and at what times of the day and days of the week individuals were approached.
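
As a minimal sketch of the arithmetic involved (the fieldwork counts below are invented purely for illustration, and the denominators shown are only some of the alternatives just described), these rates might be computed as follows:

```python
# Illustrative (invented) fieldwork counts
approached = 1200   # individuals approached
ineligible = 80     # found not to meet the eligibility criteria
contacted  = 950    # eligible cases successfully contacted
refusals   = 150    # contacted but declined to take part
completed  = 760    # usable completed questionnaires

eligible_approached = approached - ineligible

contact_rate = contacted / eligible_approached             # eligible contacted / eligible approached
refusal_rate = refusals / contacted                        # refusals / eligible cases contacted
response_rate_all = completed / approached                 # completions / all individuals approached
response_rate_eligible = completed / eligible_approached   # ineligibles excluded from the denominator

print(f"Contact rate:             {contact_rate:.1%}")
print(f"Refusal rate:             {refusal_rate:.1%}")
print(f"Response rate (all):      {response_rate_all:.1%}")
print(f"Response rate (eligible): {response_rate_eligible:.1%}")
```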

Apart from non-contacts, refusals and those found to be ineligible, there will, in addition, usually be item non-response where individuals agree to participate in the survey but refuse to answer certain questions. A refusal to answer is not always easy to distinguish from a ‘don’t know’, but both need to be distinguished from items that are not responded to because they have been routed out as inappropriate for that respondent. All, however, are instances of ‘missing values’, which are considered in Chapter 2.

Researchers faced with unacceptable non-response rates have a number of options:

 simply report the response rate as part of the findings;

 try to reduce the number of non-respondents;

 allow substitution;

 assess the impact of non-response;

 compensate for the problem.

Many researchers choose to report survey results based only on data derived from those responding and simply report the response rate as part of the results. This suggests that the researcher is either unaware of the implications of non-response, believes them to be negligible or has chosen to ignore them. Non-response may not itself be a problem unless the researcher ends up with too few cases to analyse. What is important is whether those not responding are in any significant ways different from those who do.

The number of non-respondents can usually be reduced through improvements in the data collection strategy. This might entail increasing the number of call-backs, using more skilled interviewers or offering some incentive to potential respondents. Increasing the rate of return becomes progressively more difficult, however, as the rate improves, and costs will rise considerably. Allowing substitution can sometimes be a sensible strategy in a sample survey provided the substitutes are selected in the same way as the original sample. This will not reduce bias from non-response, but it is a useful means of maintaining the intended or needed sample size. For censuses, substitution is, of course, not an option.

Assessing the impact of non-response implies an analysis of response rates, contact rates, and so on, plus an investigation of potential differences between respondents and non-respondents, and some model of how these relate to total survey error. There are various ways of checking for non-response bias. Researchers sometimes take late returns in a postal survey as an indication of the kind of people who are non-responders. These are then checked against earlier returns. In an interview survey, supervisors may be sent to refusers to try to obtain some basic information. Interviewers can also be sent to non-responders in a postal survey. Another technique is to compare the demographic characteristics of the sample (age, sex, social class, and so on) with those of the population. If this is known, the comparison is relatively straightforward, although deciding on how ‘similar’ they should be to be acceptable is not clear-cut. If differences are discovered then, again, this can simply be reported along with suitable caveats applied to the results. Alternatively, the researcher may try to compensate for the problem by using a weighted adjustment of responses. A weight is a multiplying factor applied to some or all of the responses given in a survey in order to eliminate or reduce the impact of bias caused by types of case that are over- or under-represented in the sample. Thus if there are too few women aged 20–24 in a sample survey compared with the proportions in this age group known to exist in the population of cases, such that only 50 out of a required 60 are in the achieved sample, the number who, for example, said ‘Yes’ to a question will be multiplied by a weighting that is calculated by dividing the target sample number by the actual sample number, in the example 60/50 or 1.2.
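
A minimal sketch of this weighting calculation, using the figures from the example above (the number of ‘Yes’ answers is invented for illustration), might look like this:

```python
# Weighting for an under-represented group (figures from the example in the text)
target_n = 60    # number of women aged 20-24 required for proportionality
achieved_n = 50  # number actually obtained in the achieved sample

weight = target_n / achieved_n   # 60 / 50 = 1.2

# Illustrative (invented) responses from the 50 women aged 20-24 in the sample
yes_answers = 30
weighted_yes = yes_answers * weight   # counts as 36 in the weighted results

print(f"Weight applied: {weight}")              # 1.2
print(f"Weighted 'Yes' count: {weighted_yes}")  # 36.0
```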

Even if individuals are responding, there may be differences between respondents’ reported answers and actual or ‘true’ values. Response errors arising through dishonesty, forgetfulness, faulty memories, unwillingness or misunderstanding of the questions being asked are notoriously difficult to measure. Research on response error, furthermore, is limited due to the difficulty of obtaining some kind of external validation. In interview surveys, whether face to face or by telephone, interviewers may themselves misunderstand questions or the instructions for filling them in, and may be dishonest, inaccurate, make mistakes or ask questions in a non-standard fashion. Interviewer training, along with field supervision and control, can, to a large extent, reduce the likelihood of such errors, but they will never be entirely eliminated, and there is always the potential for systematic differences between the results obtained by different interviewers.

Errors arising from non-response, erroneous responses or interviewer mistakes are specific to questionnaire survey research. Errors from the inappropriate specification of cases, from biased case selection, from random sampling error or poor data capture techniques may arise in all kinds of research. Errors of different kinds will affect the record of variables or set memberships for each property in various ways and to different extents.

What researchers do in practice is, separately for each property, to focus on likely measurement error – discrepancies between the values recorded and the actual or ‘true’ values. The size of such error is usually unknown since the true value is unknown, but evidence from various sources can be gathered in order to estimate or evaluate the likelihood of such errors. Researchers focus on two aspects of such discrepancies: reliability and validity.

A measure is said to be reliable to the extent that it produces consistent results if repeat measures are taken. We expect bathroom scales to give us the same reading if we step on them several times in quick succession. If we cannot rely on the responses that a questionnaire item produces, then any analysis based on the question will be suspect. For a single-item question or a multi-item question (generating a derived measure) the measures can be retaken at a later date and the responses compared. Such test–retests give an indication of measure stability over time, but there are several key problems with this way of assessing reliability:

 it may not be practical to re-administer a question at a later date;

 it may be difficult to distinguish between real change and lack of reliability;

 the administration of the first test may affect people’s answers the second time around;

 it may be unclear how long to wait between tests;

 it may be unclear what differences between measures count as ‘significant’.

An alternative is to give respondents two different but equivalent measures on the same occasion. The extent to which the two measures covary can then be taken as a measure of reliability. The problem here is that it may be difficult to obtain truly equivalent tests. Even when they are possible, the result may be long and repetitive questionnaires. Another version of this equivalent measures test is the split-half test. This randomly splits the items that make up a measure into two sets. A score for each case is then calculated for each half of the measure. If the measure is reliable, each half should give the same or similar results and across the cases the scores should correlate. The problem with this method is that there are several different ways in which the items can be split, each giving different results.
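
A minimal sketch of the split-half procedure might look like the following (the item scores are randomly generated purely for illustration, so the resulting correlation is not meaningful in itself; with a real multi-item scale the two half-scores would be expected to correlate strongly):

```python
import random
from statistics import correlation  # Python 3.10+

# Illustrative (invented) data: 8 scale items scored 1-5 for each of 20 cases
random.seed(1)
cases = [[random.randint(1, 5) for _ in range(8)] for _ in range(20)]

# Randomly split the 8 items into two halves
items = list(range(8))
random.shuffle(items)
half_a, half_b = items[:4], items[4:]

# Score each case on each half, then correlate the two sets of half-scores
scores_a = [sum(case[i] for i in half_a) for case in cases]
scores_b = [sum(case[i] for i in half_b) for case in cases]

print(f"Split-half correlation: {correlation(scores_a, scores_b):.2f}")
```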

Where the measure taken is a multi-item scale, for example a summated rating scale, it is possible to review the internal consistency of the items. Internal consistency is a matter of the extent to which the items used to measure a concept ‘hang’ together. Ideally, all the items used in the scale should reflect some single underlying dimension; statistically this means that they should correlate one with another (the concept of correlation is taken up in detail in Chapter 5). An increasingly popular measure for establishing internal consistency is a coefficient developed in 1951 by Cronbach that he called alpha. Cronbach’s coefficient alpha takes the average correlation among the items and adjusts for the number of items. Reliable scales are ones with high average correlation and a relatively large number of items. The coefficient varies from zero for no reliability to one for maximum reliability. The result approximates taking all possible split halves, computing the correlation for each split and taking the average. It is therefore a superior measure to taking a single split-half measure. However, there has been some discussion over the interpretation of the results. This discussion is summarized in Box 1.1.
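
A minimal sketch of the version of alpha described here, based on the average inter-item correlation adjusted for the number of items, might look like this (the item scores are invented for illustration):

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

def standardized_alpha(item_scores):
    """Cronbach's alpha from the average inter-item correlation:
    alpha = k * r_bar / (1 + (k - 1) * r_bar)."""
    k = len(item_scores)
    pair_correlations = [correlation(a, b) for a, b in combinations(item_scores, 2)]
    r_bar = sum(pair_correlations) / len(pair_correlations)
    return (k * r_bar) / (1 + (k - 1) * r_bar)

# Illustrative (invented) data: four items, each scored 1-5 across six cases
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [5, 4, 2, 4, 3, 5],
    [3, 5, 3, 4, 2, 4],
]
print(f"Cronbach's alpha: {standardized_alpha(items):.2f}")
```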

A record of a value for a case is said to be valid to the extent that it measures what it is intended to measure: in other words, the measure and the concept must be properly matched. Validity relates to the use that is made of the measure, so stepping on bathroom scales may be a valid measure of weight, but not of overall health. A reliable measure, furthermore, is not necessarily valid. Our bathroom scales may consistently over- or under-estimate our real weight (although they will still measure change). A valid measure, on the other hand, will, of course, be reliable.

There is no conclusive way of establishing validity. Sometimes it is possible to compare the results of a new measure with a more well-established one. The results of a device for measuring blood pressure at home might be compared with the results from a doctor’s more sophisticated equipment. This assumes, of course, that our GP’s results are valid. For many concepts in the social sciences there are few or no well-established measures. In this situation, researchers might focus on what is called ‘content’ or ‘face’ validity. The key to assessing content validity lies in reviewing the procedures that researchers use to develop the instrument. Is there a clear definition of the concept, perhaps relating this to how the concept has been defined in the past? Do the items that have been used to measure the concept adequately cover all the different aspects of the construct? Have the items been pruned and refined so that, for example, items that do not discriminate between respondents or cases are excluded, and any items that overlap to too great an extent with other items are avoided?

Another way of establishing validity is to assess the extent to which the measures produce the kind of results that would be expected on the basis of experience or well-established theories. Thus a measure of alienation might be expected to associate inversely with social class – the lower the class, the higher the alienation. If the two measures do indeed behave in this way, then this is evidence of what is often called construct validity. If they do not, the researcher may be unclear whether either or both measures are faulty or whether the relationship between the two measures is contrary to theoretical expectations. Conversely, the expectation might be that the two measures are unconnected and therefore should not correlate. If it turns out that they are indeed not correlated, then this is sometimes called discriminant validity.
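
A minimal sketch of such a check might simply compute the correlation between the two measures and inspect its sign (the scores below are invented for illustration; in practice they would come from the survey data):

```python
from statistics import correlation  # Python 3.10+

# Illustrative (invented) scores for ten respondents
alienation   = [7, 5, 8, 3, 6, 9, 4, 2, 7, 5]   # higher = more alienated
social_class = [2, 4, 1, 6, 3, 1, 5, 7, 2, 4]   # higher = higher class

r = correlation(alienation, social_class)
print(f"Correlation: {r:.2f}")
# A clearly negative value is consistent with construct validity here;
# a value close to zero would instead point to discriminant validity
# in a case where no association had been expected.
```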

Other more complex methods of measuring construct validity have been developed, such as multitrait–multimethod validity (Campbell and Fiske, 1959), pattern matching (Cook and Campbell, 1979) or factor analysis (see Chapter 6). Evidence of validity has to be argued for and may be gathered in a number of ways. No single way is likely to provide full evidence and even a combination of approaches is unlikely to constitute final proof.
