Methodologies and Challenges in Forensic Linguistic Casework - Various authors - Page 25
EVALUATING THE DATA
One key feature of how we tackled the Starbuck case was a separation of roles between the two analysts, Grant and Grieve (hereafter TG and JG). We decided from the outset to design our analysis in a deliberate attempt to mitigate confirmation bias by controlling the flow of information in the authorship analysis. We split the tasks in the following way:
TG liaised with police, received and evaluated data, decided on comparison sets, and then passed these materials to JG in a particular order. As TG had the principal liaison role and the contextual overview that came with it, he took principal responsibility for writing the report based on JG’s analysis. As primary author of the report, he would therefore have been the more likely of the two to be called to court as an expert witness (although both the prosecution and the defense would have had the right to call either TG or JG).
JG performed the authorship analysis, applying methods at his own discretion to the texts passed to him by TG. The analysis itself was purely his, and he reported his findings to TG at different stages, as will be detailed.
The police provided CFL with considerable data for analysis, at least for a case of forensic disputed authorship. The data set comprised three subsets.
1 The first subset consisted of 82 emails known to have been written by Debbie, totaling approximately 28,000 words. These were sent between August 2006 and April 2010, mostly from the period before she met Jamie, and included those sent during her previous trips abroad.
2 The second subset consisted of 77 emails written by Jamie, totaling approximately 6,000 words. These were sent between January 22, 2009, and October 18, 2012, and were primarily from the period of travel after the wedding. A substantial number of the emails from this period were sent to a personal assistant whom Jamie had employed to help deal with his affairs while he was abroad. The genre of these “business” emails was thus very different from that of the personal emails Debbie had sent to describe her travels. In part, this explains the very clear difference in average email length between the two sets (approximately 370 words per email for Debbie compared with 70 words per email for Jamie).
3 Finally, a third subset consisted of 29 emails of disputed authorship, all sent from Debbie’s account between April 27, 2010, and May 23, 2012. All these emails were sent after the marriage on April 21, 2010—the majority after the couple had supposedly left on their honeymoon, but the first few before departure.
TG took on the initial evaluation of the data to determine whether an authorship analysis was possible in the first instance, and he determined that the data provided was well suited to a forensic authorship analysis for several reasons.
First, it was clearly a closed-set problem—the briefing from the police was that there were only two candidates under consideration as the author of the disputed emails. Given the context of the case, this seemed like a reasonable assumption. The problem was therefore to determine which of the two candidate styles was the closer match for the disputed texts. This binary closed-set problem is perhaps the most straightforward structure for an authorship problem, because the analyst does not need to perform a strong identification as to who was the author of the questioned document(s). They simply need to provide an informed opinion on whose style is more consistent with the style of the questioned document(s) and to demonstrate that at least some of these points of consistency show comparative or pairwise distinctiveness (Grant, 2020).
In such a case, if it is known that the set of possible authors contains the actual author of the questioned document, then the analysis of consistency and distinctiveness can lead to a correct attribution. Crucially, with a problem such as this, it is not the responsibility of the forensic linguist to select the set of possible authors—this is not a linguistic question—but equally crucially, the analyst’s opinion becomes conditional, for example: “If A & B are the only possible authors of the Q texts, then as the Q text is consistent with A’s distinctive features and further from B’s distinctive features, a conclusion can be drawn that A is the more likely author.”
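The conditional, closed-set logic described above can be sketched as a simple comparison of relative feature frequencies. Everything below is illustrative only: the texts, the feature list, and the mean-absolute-difference distance are invented for the sketch and are not the method or data from the case.

```python
from collections import Counter

def relative_freqs(text, features):
    """Relative frequency of each candidate feature per word of text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {f: counts[f] / total for f in features}

def distance(p, q):
    """Mean absolute difference between two frequency profiles."""
    return sum(abs(p[f] - q[f]) for f in p) / len(p)

# Hypothetical known writings and a questioned text (invented data).
known_a = "i will write soon and i hope all is well with you all"
known_b = "regards please action the attached invoice and confirm receipt"
questioned = "hope all is well i will write again soon"

features = ["i", "hope", "will", "please", "the"]

profile_a = relative_freqs(known_a, features)
profile_b = relative_freqs(known_b, features)
profile_q = relative_freqs(questioned, features)

# Under the closed-set assumption (A or B wrote Q), the nearer profile
# is the more likely author -- conditional on that assumption holding.
closer = "A" if distance(profile_q, profile_a) < distance(profile_q, profile_b) else "B"
print(closer)  # → A
```

The conditionality of the opinion lives in the final line: the comparison says only which of the two candidate profiles is closer, not that either candidate actually wrote the questioned text.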
In forensic cases the linguist cannot themselves determine that a problem is a closed-set problem or how big that closed set might be. It is, however, incumbent upon the linguist to ask the police or lawyers questions to understand whether the set is truly closed or whether assumptions of a closed set are unfounded. Once reassured that the closed-set assumption is reasonable, it is possible to accept the provided set of possible authors and work within the limits placed on us by this decision-making. It is, of course, possible to question whether a given problem is a closed-set problem, and the decision that it is indeed a closed set needs to be made consciously and carefully.
On many occasions, the initial response of the consulting forensic linguist has to be to request further investigation by the police, so that they can convincingly demonstrate that the structure of the problem is closed. This interrogation of the problem structure often requires a full understanding of the background of the case, and, if this information is carried forward into the analysis, it can be a source of potential bias. In the Starbuck analysis, TG was responsible for establishing the basis for treating the problem as a closed set and closely questioned the investigating officers around this issue.
As noted, this structure of closed-set authorship attribution is far easier than the alternative—the task of open-set authorship verification, which arises when the forensic linguist is asked to determine if the known author of a set of texts did or did not write a questioned document. In such cases, we can provide an opinion on how consistent the style of the candidate is with the style of the questioned document and how distinctive any shared features are, but in these cases the task is to establish population-level distinctiveness (Grant, 2020), which is especially challenging. How do we decide which consistent features are distinctive? Against what comparison corpus? And how many distinctive features do we need before we find a reasonable match?
We do not have clear answers to these questions. Coulthard (2004) suggests that because no two people have the same linguistic histories, no two people will have exactly the same style. However, this position is hard to demonstrate, and, even if it is true, it is unclear how much data we need to consider before we can distinguish a given author from all other authors. Further to this, there are suggestions (e.g., Wright, 2017) that some authors are more distinctive than others at a population level, and this, too, might need to be accounted for in any particular problem. These issues that we face in the task of authorship verification can be sidestepped when investigating closed-set authorship attributions, as in the Starbuck case.
The second point of evaluation for TG was whether there was sufficient, relevant comparison data for the analysis. It is most important to have a good quantity of known material as all comparative authorship analysis is about describing a pattern of known usage to compare with the usage in the disputed material. Therefore, the analysis depends on the frequency or even the mere occurrence of linguistic forms in the comparison material. If the quantity of known-author data is limited, then we cannot speak with any kind of confidence about a pattern of use and we cannot describe whether the use of a given feature is typical of a given author. In particular, as the amount of data decreases, so does the number of features that could possibly be meaningful. Additionally, when we have more data, we can potentially compare the use of a wide range of features, thereby creating a basis for a more reliable attribution.
The amount of data here is not large compared with many of the historical and hypothetical nonforensic cases considered in stylometry and computational stylistic research, but it is relatively large in our experience for a forensic linguistic investigation. The decision on whether the amount of data is sufficient for the analysis is thus a further entry point for potential bias. In the Starbuck analysis, this decision fell to TG (although it would have been possible for JG to report back that there was insufficient data to proceed).
In terms of relevance of comparison in the Starbuck material, register variation was largely controlled. All the texts under analysis were emails. This homogeneity is of great value in any authorship analysis because we know that language, both in general and at the level of the individual, varies across different communicative contexts, as people use different linguistic structures to better achieve different communicative goals (Biber, 1995). Comparing the authorship of texts written across multiple registers can be a very challenging task, as the register signal will almost always be stronger than the authorship signal. For example, consider how quickly any reader can distinguish different registers of texts. It takes only a few seconds to distinguish an email from a newspaper article, but to determine authorship is clearly much more difficult. Dealing with register variation in authorship analysis is therefore an extremely difficult task, especially because we do not have a strong understanding of how the style of individual authors tends to shift across registers. Indeed, it seems likely that different authors would shift in different ways across registers, making the task even harder. This is an important area for future research in authorship analysis—perhaps the main challenge facing the field and a challenge of real practical importance in the forensic context, as data often comes from different genres.
It is not the case, however, that there was no register variation at all. In particular, the types of emails differed substantially across the three subsets of data, including in terms of topic and audience. Debbie Starbuck’s known emails were mostly to family and friends, many of them narrating her travels from before she met Jamie. Jamie’s known emails were mostly exchanges with his personal assistant about practical matters while he traveled, and the disputed texts were mostly responses to emails from Debbie’s family, giving them updates and assuring them she was well. These differences in communicative context necessarily have linguistic consequences. To take the most superficial example, consider the difference in Debbie’s and Jamie’s average email lengths, which clearly reflects these differences in purpose and audience.
Nevertheless, the registers here were judged to be sufficiently similar that we felt confident looking for consistently and distinctively used authorship patterns that did not seem to be explained simply by register variation, although we kept these differences in mind and adjusted our interpretation of the results accordingly. Similarly, the fact that the data was all from a relatively similar time period was also helpful, as we know people’s language can change over time (Wagner, 2012). On the basis of these considerations, TG judged that the comparison set was sufficiently relevant, that register variation was not a major issue in this case, and thus that the problem was tractable and should be passed to JG. TG’s evaluation of the texts, however, brought up other points worthy of discussion.
One advantage given to the analysis is that the texts were precisely time stamped. Emails as a genre naturally create an ordered series of texts for analysis, and this structure to the data can assist in devising a method and in hypothesis formation and testing. For example, if there is a working hypothesis of an account takeover by a different writer at some point in a series of emails, then this provides an analytic advantage over a situation where an email account might have been hacked and subject to occasional use by a second author.
In the Starbuck case, TG was able to clarify with the police investigator that the hypothesis of an account takeover was indeed central, and thus he was able to take this into account in analysis design. This is an advantage in analysis as it allowed the creation of different sets of texts. The first set was a group of known emails sent from Debbie’s account before any account takeover had occurred. This group included emails up to the last time Debbie had been seen alive and well. The second set was the emails sent after the point at which any account takeover might have occurred. If a style shift was to be found, it was likely to be within this group, with the later emails in the group being stylistically different from those in the known set of emails. This is not to say that each email was not considered individually, but that they were also considered in terms of their position in the time series. This means, for example, that the weight of evidence for any style shift can be considered cumulatively after any identified break in style.
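The partitioning of a timestamped email series into a known set and a potentially disputed set can be sketched as follows. The emails, dates, and cutoff here are invented placeholders; the cutoff simply stands in for the last date on which the account holder was known to have written from the account herself.

```python
from datetime import datetime

# Hypothetical (text, timestamp) pairs standing in for the email series.
emails = [
    ("Having a lovely time here", datetime(2010, 3, 14)),
    ("We leave for the coast tomorrow", datetime(2010, 4, 19)),
    ("All is fine, no need to call", datetime(2010, 5, 2)),
    ("Will be out of contact for a while", datetime(2011, 1, 8)),
]

# Assumed cutoff for illustration: the last "known" date for the account.
cutoff = datetime(2010, 4, 21)

known = [(text, sent) for text, sent in emails if sent <= cutoff]
disputed = [(text, sent) for text, sent in emails if sent > cutoff]

# Keeping the disputed set in timestamp order lets any style shift be
# assessed cumulatively after a candidate break point in the series.
disputed.sort(key=lambda pair: pair[1])
print(len(known), len(disputed))  # → 2 2
```

Because the disputed set retains its ordering, each email can be examined both individually and in terms of its position relative to any candidate break in style.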
A further point in TG’s preliminary evaluation is that each email text was relatively short. At the most basic level, the problem of dealing with short texts is that they do not provide the analyst with as much material as longer texts, from which distinctive and consistent features might be identified. Generally, more evidence is simply better.4 Slightly more technically, the issue is that linguistic observations in less text will give rise to fewer examples of the feature, and this means that generalization into a pattern of use will be less reliable.
For example, imagine trying to predict the bias of a weighted coin: if you flipped it only a few times you would be unlikely to be able to estimate the bias correctly, but if you flipped it a few hundred times you might have a very good estimate. The same thing happens when you measure the relative frequency of a word (i.e., its percentage out of the total words in the text). If one looks at a single, short sentence from a text, the word ‘the’ might occur once in five words, but we would not want to generalize from such an observation that the word occurs once every five words across the entire text. Only after we have seen a sufficient number of tokens or instances of a word can we start to make such estimations. Texts that are fewer than 500 words long are therefore generally seen as being too short for the application of stylometric approaches to authorship analysis (although recently this number has been decreasing; see Grieve et al., 2019), and, often in a forensic context, the entire data set might be smaller than this.
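The coin-flip intuition above can be simulated directly. The 5% rate below is an invented figure for illustration; the point is only that an estimate from a short text is unreliable, while one from a long text settles near the true rate.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def estimated_rate(true_rate, n_tokens):
    """Estimate a feature's relative frequency from n_tokens sampled tokens."""
    hits = sum(random.random() < true_rate for _ in range(n_tokens))
    return hits / n_tokens

# Suppose a word truly makes up 5% of an author's running text
# (an assumed figure, purely for illustration).
true_rate = 0.05

# A 25-token "text" can give a badly misleading estimate; 10,000 tokens
# settle near the true rate, just like the weighted coin.
short_est = estimated_rate(true_rate, 25)
long_est = estimated_rate(true_rate, 10_000)
print(short_est, long_est)
```

The same sampling logic explains why short forensic texts restrict the analyst to fewer, more frequent features: rarer features simply do not occur often enough for their rate of use to be estimated.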
Finally, one last complication with this data was that, although it consisted of emails, the police provided us with access only to screenshots of the texts. Because these were simple images, they could not be automatically analyzed computationally. As a result, we needed to convert these images into text using optical character recognition software, which was a relatively time-consuming process and required thorough checking against the image files to ensure that even minor punctuation features were correctly digitized.
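The checking step can be illustrated with a character-level diff between raw OCR output and a hand-corrected version. This is not the pipeline used in the case, just a sketch with invented text showing why character-level comparison matters: it surfaces exactly the punctuation marks an OCR pass tends to drop or misread, and punctuation habits can themselves be style features.

```python
import difflib

# Hypothetical OCR output and its hand-corrected version: here the OCR
# pass has dropped a comma and a final exclamation mark (invented example).
ocr_text = "Hi mum arrived safely. Weather is great"
corrected = "Hi mum, arrived safely. Weather is great!"

# A character-level diff lists every character added ('+') or lost ('-')
# between the two versions, pinpointing the punctuation discrepancies.
diffs = [d for d in difflib.ndiff(ocr_text, corrected) if d[0] in "+-"]
print(diffs)
```

In practice, each flagged discrepancy would be resolved by returning to the original screenshot, so that the digitized texts faithfully preserve even minor punctuation features before any analysis begins.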
The outcome of TG’s evaluation phase of the analysis was the judgment that this data set as a whole was well suited for analysis. Cases like this, with small, closed sets of authors, sufficient data, and register control, do occur with some regularity, despite claims sometimes made in the stylometry literature in particular (e.g., Luyckx & Daelemans, 2011). Law enforcement agencies can often supply problems of this type, especially now that online language use provides essentially permanent records of linguistic data. Researchers with relatively little forensic experience appear to focus their efforts on more and more challenging problems, but for practical casework these more complex research projects are less relevant. Such academic authorship studies are, of course, important, but many issues around the “easier” sorts of cases have not yet been resolved. By sharing actual investigative linguistic casework with researchers and the public, the forensic linguistic community can help provide a picture of the landscape of actual forensic problems.