
CHAPTER 3

Definitions and Models

Forensic science is the application of scientific methods to collect, preserve, and analyze evidence related to legal cases. Historically, this involved the systematic analysis of (samples of) physical material in order to establish causal relationships among various events, as well as to address issues of provenance and authenticity.1 The rationale behind it—Locard’s exchange principle—is that physical contact between objects inevitably results in the exchange of matter leaving traces that can be analyzed to (partially) reconstruct the event.

With the introduction of digital computing and communication, the same general assumptions were carried over to the cyber world, largely unchallenged. Although a detailed conceptual discussion is outside the intent of this text, we should note that the presence of persistent “digital traces” (broadly defined) is neither inevitable nor a “natural” consequence of the processing and communication of digital information. Such records of cyber interactions are the result of conscious engineering decisions, ones not usually taken specifically for forensic purposes. This is a point we will return to shortly, as we work toward a definition that is more directly applicable to digital forensics.

3.1 THE DAUBERT STANDARD

Any discussion on forensic evidence must inevitably begin with the Daubert standard—a reference to three landmark decisions by the Supreme Court of the United States: Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993); General Electric Co. v. Joiner, 522 U.S. 136 (1997); and Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).

In the words of Goodstein [78]: “The presentation of scientific evidence in a court of law is a kind of shotgun marriage between the two disciplines.… The Daubert decision is an attempt (not the first, of course) to regulate that encounter.”

These cases set a new standard for expert testimony [11], overhauling the previous Frye standard of 1923 (Frye v. United States, 293 F. 1013, D.C. Cir. 1923). In brief, the Supreme Court instructed trial judges to become gatekeepers of expert testimony, and gave four basic criteria to evaluate the admissibility of forensic evidence:

1. The theoretical underpinnings of the methods must yield testable predictions by means of which the theory could be falsified.

2. The methods should preferably be published in a peer-reviewed journal.

3. There should be a known rate of error that can be used in evaluating the results.

4. The methods should be generally accepted within the relevant scientific community.

The court also emphasized that these standards are flexible and that the trial judge has considerable leeway in determining the admissibility of forensic evidence and expert witness testimony. During legal proceedings, special Daubert hearings are often held in which the judge rules on the admissibility of expert witness testimony requested by the two sides.

In other words, scientific evidence becomes forensic only if the court deems it admissible. It is a somewhat paradoxical situation that an evaluation of the scientific merits of a specific method is rendered by a judge, not by scientists. There is no guarantee that the legal decision, especially in the short term, will be in agreement with the ultimate scientific consensus on the subject. The courts have a tendency to be conservative and skeptical with respect to new types of forensic evidence. The admissibility decision also depends on the specific case, the skill of the lawyers on both sides, the communication skills of the expert witnesses, and a host of other factors that have nothing to do with scientific merit.

The focus of this book is on the scientific aspect of the analytical methods and, therefore, we develop a more technical definition of digital forensic science.

3.2 DIGITAL FORENSIC SCIENCE DEFINITIONS

Early applications of digital forensic science emerged out of law enforcement agencies, and were initiated by investigators with some technical background, but no formal training as computer scientists. Through the 1990s, with the introduction and mass adoption of the Internet, the amount of data and the complexity of the systems investigated grew quickly. In response, digital forensic methods developed in an ad hoc, on-demand fashion, with no overarching methodology, or peer-reviewed venues. By the late 1990s, coordinated efforts emerged to formally define and organize the discipline, and to spell out best field practices in search, seizure, storage, and processing of digital evidence [126].

3.2.1 LAW-CENTRIC DEFINITIONS

In 2001, the first Digital Forensic Research Workshop was organized with the recognition that the ad hoc approach to digital evidence needed to be replaced by a systematic, multi-disciplinary effort to firmly establish digital forensic science as a rigorous discipline. The workshop produced an in-depth report outlining a research agenda and provided one of the most frequently cited definitions of digital forensic science [136]:

Digital forensics: The use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation, and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations.

This definition, although primarily stressing the investigation of criminal actions, also includes an anticipatory element, which is typical of the notion of forensics in operational environments. The analysis there is performed primarily to identify the vector of attack and the scope of a security incident; identifying the adversary with any level of certainty is rare, and prosecution is not the typical outcome.

In contrast, the reference definition provided by NIST a few years later [100] is focused entirely on the legal aspects of forensics, and emphasizes the importance of strict chain of custody:

Digital forensics is considered the application of science to the identification, collection, examination, and analysis of data while preserving the integrity of the information and maintaining a strict chain of custody for the data. Data refers to distinct pieces of digital information that have been formatted in a specific way.

Another way to describe these law-centric definitions is that they provide a litmus test for determining whether specific investigative tools and techniques qualify as being forensic. From a legal perspective, this open-ended definition is normal and works well, as the admissibility of all evidence is decided during the legal proceedings.

From the point of view of a technical discussion, however, such definitions are too generic to provide a meaningful starting point. Further, chain of custody issues are primarily of a procedural nature and do not raise any notable technical problems. Since the goal of this book is to consider the technical aspects of digital forensics, it would be prudent to start with a working definition that is more directly related to our subject.

3.2.2 WORKING TECHNICAL DEFINITION

We adopt the working definition first introduced in [154], which directly relates to the formal definition of computing in terms of Turing machines, and is in the spirit of Carrier’s computer history model (Section 3.3.2):

Digital forensics is the process of reconstructing the relevant sequence of events that have led to the currently observable state of a target IT system or (digital) artifacts.

Notes

1. The notion of relevance is inherently case-specific, and a big part of a forensic analyst’s expertise is the ability to identify case-relevant evidence.

2. Frequently, a critical component of the forensic analysis is the causal attribution of event sequence to specific human actors of the system (such as users and administrators).

3. The provenance, reliability, and integrity of the data used as evidence are of primary importance.

We view all efforts to perform system, or artifact, analysis after the fact as a form of forensics. This includes common activities, such as incident response and internal investigations, which almost never result in any legal actions. On balance, only a tiny fraction of forensic analyses make it to the courtroom as formal evidence; this should not constrain us from exploring the full spectrum of techniques for reconstructing the past of digital artifacts.

The benefit of employing a broader view of forensic computing is that it helps us to identify closely related tools and methods that can be adapted and incorporated into forensics.

3.3 MODELS OF FORENSIC ANALYSIS

In this section we discuss three models of forensic analysis; each considers a different aspect of the analysis and uses different methods to describe the process. Garfinkel’s differential analysis approach (Section 3.3.1) formalizes a common logical inference technique (similar, for example, to differential diagnosis in medicine) for the case of computer systems. In this context, differential analysis is an incremental technique to reason about the likely prior state and/or subsequent events of individual artifacts (e.g., a file has been copied).

Carrier’s computer history model (Section 3.3.2) takes a deeper mathematical approach in describing forensics by viewing the computer system under investigation as a finite state machine. Although it has few direct practical implications, it is a conceptually important model for the field. Some background in formal mathematical reasoning is needed to fully appreciate its contribution.

The final model of Pirolli and Card (Section 3.3.3) does not come from the digital forensics literature, but from cognitive studies performed on intelligence analysts. It is included because we believe that the analytical process is very similar and requires the same type of skills. Understanding how analysts perform the cognitive tasks is of critical importance to designing usable tools for the practice. It also helps in understanding and modeling the differences in the level of abstraction at which the three groups of experts—forensic researchers/developers, analysts, and lawyers—operate.

3.3.1 DIFFERENTIAL ANALYSIS

The vast majority of existing forensic techniques can be described as special cases of differential analysis—the comparison of two objects, A and B, in order to identify the differences between them. The ultimate goal is to infer the sequence of events that (likely) have transformed A into B (A precedes B in time). In the context of digital forensics, this fundamental concept has only recently been formalized by Garfinkel et al. [75], and the rest of this section introduces the formal framework they put forward.

Terminology

Historically, differencing tools (such as the venerable diff) have been applied to a wide variety of artifacts, especially text and program code, long before they were employed for forensic use. The following definitions are introduced to formally generalize the process.

Image. A byte stream from any data-carrying device representing the object under analysis. This includes all common evidence sources—disk/filesystem images, memory images, network captures, etc.

Images can be physical, or logical. The former reflect (at least partially) the physical layout of the data on the data store. The latter consist of a collection of self-contained objects (such as files) along with the logical relationships among them, without any reference to their physical storage layout.

Baseline image, A. The image first acquired at time TA.

Final image, B. The last acquired image, taken at time TB.

Intermediary images, In. Zero, or more, images recorded between the baseline and final images; In is the nth image acquired.

Common baseline is a single image that is a common ancestor to multiple final images.

Image delta, B – A, is the set of differences between two images, typically between the baseline image and the final image.

The differencing strategy defines the rules for identifying and reporting the differences between two, or more, images.

Feature, f, is a piece of data that is either directly extracted from the image (file name/size), or is computed from the content (crypto hash).

Feature in image, (A, f). Features are found in images; in this case, feature f is found in image A.

Feature name, NAME (A, f). Every feature may have zero, one, or multiple names. For example, for a file content feature, we could use any of the file names and aliases under which it may be known in the host filesystem.

Feature location, Loc(f), describes the address ranges from which the content of the particular feature can be extracted. The locations may be either physical, or logical, depending on the type of image acquired.

Feature extraction function, F(), performs the extraction/computation of a feature based on its location and content.

Feature set, F(A), consists of the features extracted from an image A, using the extraction function F().

The feature set delta, F(B) – F(A), contains the differences between the feature sets extracted from two images; the delta is not necessarily symmetric.

Transformation sequence, R, consists of the sequence of operations that, when applied to A, produce B. For example, the Unix diff program can generate a patch file that can be used to transform a text file in this fashion. In general, R is not unique and there can be an infinite number of transformations that can turn A into B.
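As a minimal illustration of a transformation sequence, the sketch below uses Python’s standard difflib module to produce a unified diff that describes one way of turning a small text artifact A into B; the variable names are illustrative only, and the point is simply that R is one of many possible transformations.

    import difflib

    a = ["alpha\n", "beta\n", "gamma\n"]   # baseline artifact A, as a list of lines
    b = ["alpha\n", "beta\n", "delta\n"]   # final artifact B

    # One possible transformation sequence R: a unified diff which, applied as a
    # patch, turns A into B. R is not unique -- many other edit sequences would
    # produce the same final artifact.
    patch = list(difflib.unified_diff(a, b, fromfile="A", tofile="B"))
    print("".join(patch))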

Generalized Differential Analysis

As per [75], each feature has three pieces of metadata:

Location: A mandatory attribute describing the address of the feature; each feature must have at least one location associated with it.

Name: A human-readable identifier for the feature; this is an optional attribute.

Timestamp(s) and other metadata: Features may have one, or more, timestamps associated with them, such as times of creation, modification, last access, etc. In many cases, other pieces of metadata (key-value pairs) are also present.
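As a concrete, purely hypothetical illustration, such a feature record could be represented as follows; the class and field names are not taken from [75] and are chosen only to mirror the three kinds of metadata above.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Feature:
        locations: list                                 # mandatory: at least one address range
        names: list = field(default_factory=list)       # optional: zero, one, or more names/aliases
        timestamps: dict = field(default_factory=dict)  # e.g., creation/modification/access times
        metadata: dict = field(default_factory=dict)    # other key-value pairs
        value: Optional[str] = None                     # extracted/computed content, e.g., a crypto hash

    f = Feature(locations=[(4096, 8192)],
                names=["report.docx"],
                timestamps={"modified": "2015-06-01T12:00:00Z"},
                value="<sha256 of content>")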

Given this framework, differential analysis is performed not on the data images A and B, but on their corresponding feature sets, F(A) and F(B). The goal is to identify the operations that transform F(A) into F(B). These operations are termed change primitives, and they seek to explain/reproduce the feature set changes.

In the general case, such changes are not unique as the observation points may fail to reflect the effects of individual operations which are subsequently overridden (e.g., any access to a file will override the value of the last access time attribute). A simple set of change inference rules is defined (Table 3.1) and formalized (Table 3.2) in order to bring consistency to the process. The rules are correct in that they transform F(A) into F(B) but do not necessarily describe the actual operations that took place. This is a fundamental handicap for any differential method; however, in the absence of complete operational history, it is the best that can be accomplished.

If A and B are from the same system and TA < TB, it would appear that all new features in the feature set delta F(B) – F(A) should be timestamped after TA. In other words, if B were to contain features that predate TA, or postdate TB, then this would rightfully be considered an inconsistency. An investigation should detect such anomalies and provide a sound explanation based on knowledge of how the target system operates. There is a range of possible explanations, such as:

Table 3.1: Change detection rules in plain English ([75], Table 1)

If something did not exist and now it does, it was created
If it is in a new location, it was moved
If it did exist before and now it does not, it was deleted
If more copies of it exist, it was copied
If fewer copies of it exist, something got deleted
Aliasing means names can be added or deleted

Table 3.2: Abstract rules for transforming A → B (A into B) based on observed changes to features, f, feature locations Loc (A, f), and feature names NAME (A, f). Note: The RENAME primitive is not strictly needed (as it can be modeled as ADDNAME followed by DELNAME), but it is useful to convey higher-level semantics ([75], Table 2).


Tampering. This is the easiest and most obvious explanation although it is not necessarily the most likely one; common examples include planting of new files with old timestamps, and system clock manipulation.

System operation. The full effects of the underlying operation, even as simple as copying a file, are not always obvious and require careful consideration. For example, the Unix cp command sets the creation time of the new copy to the time of the operation but will keep the original modification time if the -p option is used.

Time tracking errors. It has been shown [127, 167] that operating systems can introduce inconsistencies during normal operation due to rounding and implementation errors. It is worth noting that, in many cases, the accuracy of a recorded timestamp is of little importance to the operation of the system; therefore, perfection should not be assumed blindly.

Tool error is always a possibility; like all software, forensic tools have bugs and these can manifest themselves in unexpected ways.

One important practical concern is how to report the extracted feature changes. Performing a comprehensive differential analysis, for example, of two hard disk snapshots, is likely to produce an enormous number of individual results that can overwhelm the investigator. It is critical that differencing tools provide the means to improve the quality and relevance of the results. This can be accomplished in a number of ways: (a) filtering of irrelevant information; (b) aggregation of the results to highlight both the common and the exceptional cases; (c) progressive disclosure of information, where users can start at the aggregate level and use queries and hierarchies to drill down to the needed level of detail; (d) timelining—providing a chronological ordering of the (relevant) events.
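The following sketch shows, under simplifying assumptions, how rules in the spirit of Table 3.1 could be applied to two feature sets keyed by content hash; it is an illustrative approximation, not the implementation from [75], and all names are hypothetical.

    # Each feature set maps a content hash (the feature value) to the set of
    # names under which that content appears in the corresponding image.
    def infer_changes(fs_a, fs_b):
        """Apply simple, Table 3.1-style rules to two feature sets.

        fs_a, fs_b: dict mapping content hash -> set of names.
        Returns a list of (primitive, detail) tuples.
        """
        changes = []
        for h, names_b in fs_b.items():
            names_a = fs_a.get(h)
            if names_a is None:
                changes.append(("CREATED", sorted(names_b)))           # did not exist, now does
            elif len(names_b) > len(names_a):
                changes.append(("COPIED", sorted(names_b - names_a)))  # more copies exist
            elif len(names_b) < len(names_a):
                changes.append(("COPY DELETED", sorted(names_a - names_b)))
            elif names_a != names_b:
                changes.append(("MOVED/RENAMED", (sorted(names_a), sorted(names_b))))
        for h, names_a in fs_a.items():
            if h not in fs_b:
                changes.append(("DELETED", sorted(names_a)))           # existed before, now does not
        return changes

    # Example: "a.txt" is renamed to "b.txt", and "x.bin" is deleted.
    A = {"h1": {"a.txt"}, "h2": {"x.bin"}}
    B = {"h1": {"b.txt"}}
    print(infer_changes(A, B))

In practice, the feature sets would be produced by a feature extraction function F() applied to the acquired images, and the reported primitives would feed the filtering, aggregation, and timelining steps discussed above.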

3.3.2 COMPUTER HISTORY MODEL

Differential analysis offers a relatively simple view of forensic inference, by focusing on the beginning and end state of the data, and by expressing the difference in terms of a very small set of primitive operations. The computer history model (CHM) [27]—one of Carrier’s important contributions to the field—seeks to offer a more detailed and formal description of the process. The model employs finite state machines to capture the state of the system, as well as its (algorithmic) reaction to outside events. One of the main considerations is the development of an investigative model that avoids human bias by focusing on modeling the computation itself along with strict scientific hypothesis testing. The investigation is defined as a series of yes/no questions (predicates) that are evaluated with respect to the available history of the computation.

Primitive Computer History Model

This assumes that the computer being investigated can be represented as a finite state machine (FSM), which transitions from one state to another in reaction to events. Formally, the FSM is a quintuple (Q, Σ, δ, s0, F), where Q is a finite set of states, Σ is a finite alphabet of event symbols, δ is the transition function δ : Q × Σ → Q, s0 ∊ Q is the starting state of the machine, and F ⊆ Q is the set of final states.
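A toy version of such a machine, written directly from the definition, might look as follows; the states and event symbols are arbitrary placeholders used only to make the quintuple concrete.

    # Toy FSM (Q, Sigma, delta, s0, F) written directly from the definition.
    Q = {"s0", "s1", "s2"}              # finite set of states
    Sigma = {"e1", "e2"}                # finite alphabet of event symbols
    delta = {                           # transition function delta : Q x Sigma -> Q
        ("s0", "e1"): "s1",
        ("s1", "e1"): "s1",
        ("s1", "e2"): "s2",
    }
    s0 = "s0"                           # starting state
    F = {"s2"}                          # set of final states (a subset of Q)

    def run(events, state=s0):
        """Replay a sequence of events and return the resulting state."""
        for e in events:
            state = delta[(state, e)]
        return state

    assert run(["e1", "e2"]) in F       # this particular history ends in a final state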

The primitive history of a system describes the lowest-level state transitions (such as the execution of individual instructions), and consists of the sequence of primitive states and events that occurred.

The primitive state of a system is defined by the discrete values of its primitive, uniquely addressable storage locations. These may include anything from a CPU register to the content of network traffic (which is treated as temporary storage). As an illustration, Figure 3.1 shows an event E1 reading the values from storage locations R3 and R6 and writing to locations R3 and R4.


Figure 3.1: Primitive computer history model example: event E1 is reading the values from storage locations R3 and R6 and writing to locations R3 and R4 [27].

The primitive history is the set T containing the times for which the system has a history. The duration between each time in T, Δt, must be shorter than the fastest state change in the system. The primitive state history is a function hps : T → Q that maps a time t ∊ T to the primitive state that existed at that time. The primitive event history is a function hpe : T → Σ that maps a time t ∊ T to a primitive event in the period (t – Δt, t + Δt).
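A minimal, hypothetical illustration of these history functions as lookup tables (using arbitrary state and event labels) might be:

    # Hypothetical primitive history: T is the set of recorded times, and the
    # two history functions map each time to the observed state and event.
    T = {0, 1, 2, 3}                              # times for which the system has a history
    h_ps = {0: "s0", 1: "s1", 2: "s1", 3: "s2"}   # h_ps : T -> Q
    h_pe = {1: "e1", 2: "e1", 3: "e2"}            # h_pe : T -> Sigma

    # An investigative question posed as a yes/no predicate over the history:
    # "was the system ever in state s2?"
    print(any(h_ps[t] == "s2" for t in T))        # True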

The model described so far is capable of describing a static computer system; in practice, this is insufficient, as a modern computing system is dynamic—it can add resources (such as storage) and capabilities (code) on the fly. Therefore, the computer history model uses a dynamic FSM model with sets and functions to represent the changing system capabilities. Formally, each of the Q, Σ, and δ sets and functions can change for each t ∊ T.

Complex Computer History Model

The primitive model presented is rarely practical on contemporary computer systems executing billions of instructions per second (code reverse engineering would be an exceptional case). Also, there is a mismatch between the level of abstraction of the representation and that of the questions that an investigator would want to ask (e.g., was this file downloaded?). Therefore, the model provides the means to aggregate the state of the system and ask questions at the appropriate level of abstraction.

Complex events are state transitions that cause one or more lower-level complex or primitive events to occur; for example, copying a file triggers a large number of primitive events. Complex storage locations are virtual storage locations created by software; these are the ephemeral and persistent data structures used by software during normal execution. For example, a file is a complex storage location, and its name-value attribute pairs include the file name, several different timestamps, permissions, and content.

Figure 3.2 shows a complex event E1 reading from complex storage locations D1 and D2 and writing a value to D1. At a lower level, E1 is performed using events E1a and E1b, such as CPU, or I/O instructions. The contents of D1 and D2 are stored in locations (D1a, D1b) and (D2a, D2b), respectively.


Figure 3.2: Complex history event examples: event E1 with two complex cause locations and one complex effect location [27].

General Investigation Process

The sequence of queries pursued by the investigator will depend on the specific objectives of the inquiry, as well as the experience and training of the person performing it. The CHM is agnostic with respect to the overall process followed (we will discuss the cognitive perspective in Section 3.3.3) and does not assume a specific sequence of high-level phases. It does, however, postulate that the inquiry follow the general scientific method, which typically consists of four phases: Observation, Hypothesis Formulation, Prediction, and Testing & Searching.

Observation includes the running of appropriate tools to capture and observe aspects of the state of the system that are of interest, such as listing files/processes and rendering the content of files. During Hypothesis Formulation, the investigators use the observed data and combine it with their domain knowledge to formulate hypotheses that can be tested, and potentially falsified, in the history model. In the Prediction phase, the analyst identifies specific evidence that would be consistent, or in contradiction, with the hypothesis. Based on the predictions, experiments are performed in the Testing phase, and the outcomes are used to guide further iterations of the process.

Categories of Forensic Analysis

Based on the outlined framework, the CHM identifies seven categories of analytical techniques.

History duration. The sole technique in this category is operational reconstruction—it uses event reconstruction and temporal data from the storage devices to determine when events occurred and at what points in time the system was active. Primary sources for this analysis include log files, as well as the variety of timestamp attributes kept by the operating system and applications.

Primitive storage system configuration. The techniques in this category define the capabilities of the primitive storage system. These include the names of the storage devices, the number of addresses for each storage device, the domain of each address on each storage device, and when each storage device was connected. Together, these sets and functions define the set of possible states Q of the FSM.

Primitive event system configuration. Methods in this category define the capabilities of the primitive event system; that is, define the names of the event devices connected, the event symbols for each event device, the state change function for each event device, and when each event device was connected. Together, these sets and functions define the set of event symbols Σ and state change function δ. Since primitive events are almost never of direct interest to an investigation, these techniques are not generally performed.

Primitive state and event definition. Methods in this category define the primitive state history (hps) and event history (hpe) functions. There are five types of techniques that can be used to formulate and test this type of hypothesis, and each class has a directional component. Since different approaches can be used to define the same two functions, a hypothesis can be formulated using one technique and tested with another. Overall, these are impractical in real investigations, but are presented below for completeness.

Observation methods use direct observation of an output device to define its state in the inferred history, and are only applicable to output device controllers; they cannot work for internal devices, such as hard disks.

Capabilities techniques employ the primitive system capabilities to formulate and test state and event hypotheses. To formulate a hypothesis, the investigator chooses a possible state or event at random; this is impractical for almost all real systems as the state space is enormous.

Sample data techniques extract samples from observations of similar systems or from previous executions of the system being investigated; the results are metrics on the occurrence of events and states. To build a hypothesis, states and events are chosen based on how likely they are to occur. Testing the hypothesis reveals if there is evidence to support the state or event. Note that this is a conceptual class not used in practice as there are no relevant sample data.

Reconstruction techniques use a known state to formulate and test hypotheses about the event and state that existed immediately prior to the known state. This is not performed in practice, as questions are rarely formulated about primitive events.

Construction methods are the forward-looking techniques that use a known state to formulate and test hypotheses about the next event and state. This is not useful in practice as the typical starting point is an end state; further, any hypothesis about the future state would not be testable.

Complex storage system configuration. Techniques in this category define the complex storage capabilities of the system, and are needed to formulate and test hypotheses about complex states. The techniques define the names of the complex storage types (Dcs), the attribute names for each complex storage type (DATcs), the domain of each attribute (ADOcs), the set of identifiers for the possible instances of each complex storage type (DADcs), the abstraction transformation functions for each complex storage type (ABScs), the materialization transformation functions for each complex storage type (MATcs), and the complex storage types that existed at each time and at each abstraction layer X ∊ L (ccs–X).

Two types of hypotheses are formulated in this category: the first defines the names of the complex storage types and the states at which they existed; the second defines the attributes, domains, and transformation functions for each complex storage type. As discussed earlier, complex storage locations are program data structures. Consequently, enumerating the complex storage types in existence at a particular point in time requires the reconstruction of the state of the computer, so that program state can be analyzed.

Identification of existing programs can be accomplished in one of two ways: program identification—by searching for programs on the system to be subsequently analyzed; and data type observation—by inferring the presence of complex storage types that existed based on the data types that are found. This latter technique may give false positives in that a complex type may have been created elsewhere and transferred to the system under investigation.

Three classes of techniques can be used to define the attributes, domains, and transformation functions for each complex storage type: (a) complex storage specification observation, which uses a specification to define a program’s complex storage types; (b) complex storage reverse engineering, which uses design recovery reverse engineering to define complex storage locations; (c) complex storage program analysis, which uses static, or dynamic, code analysis of the programs to identify the instructions creating, or accessing, the complex storage locations and to infer their structure.

It is both impractical and unnecessary to fully enumerate the data structures used by programs; only a set of the most relevant and most frequently used ones are supported by investigative tools, and the identification process is part of the tool development process.

Complex event system configuration. These methods define the capabilities of the complex event system: the names of the programs that existed on the system (Dce), the names of the abstraction layers (L), the symbols for the complex events in each program (DSYce–X), the state change functions for the complex events (DCGce–X), the abstraction transformation functions (ABSce), the materialization transformation functions (MATce), and the set of programs that existed at each time (cce).

Inferences about events are more difficult than those about storage locations because the latter are both abstracted and materialized and tend to be long-lived owing to backward compatibility; the former are usually designed top-down, and backward compatibility is a much lesser concern.

Three types of hypotheses can be tested in this category: (a) program existence, including the period of their existence; (b) the abstraction layers, event symbols, and state change functions for each program; (c) the materialization and abstraction transformation functions between the layers.

With respect to (a), both program identification and data type reconstruction can be used in the forms already described.

For hypotheses in regard to (b), there are two relevant techniques—complex event specification observation and complex event program analysis. The former uses a specification of the program to determine the complex events that it could cause. The latter works directly with the program to observe the events; depending on the depth of the analysis, this could be as simple as running the program under specific circumstances, or it could be a massive reverse engineering effort, if a (near-)complete picture is needed.

The hypotheses in part (c) concern the rules defining the mappings between higher-level and lower-level events. Identifying these rules is an inherently difficult task, and Carrier proposed only one type of technique with very limited applicability—development tool and process analysis. It analyzes the programming tools and development process to determine how complex events are defined.

Complex state and event definition. This category of techniques defines the complex states that existed (hcs) and the complex events that occurred (hce). It includes eight classes of analysis techniques and each has a directional component (Figure 3.3). Two concern individual states and events, two are forward- and backward-based, and four are upward- and downward-based.

Complex state and event system capabilities methods use the capabilities of the complex system to formulate and test state and event hypotheses based on what is possible. The main utility of this approach is that it can show that another hypothesis is impossible because it is outside of the system’s capabilities.

Complex state and event sample data techniques use sample data from observations of similar systems or from previous executions. The results include metrics on the occurrence of events and states and would show which states and events are most likely. This class of techniques is employed in practice in an ad hoc manner; for example, if a desktop computer is part of the investigation, an analyst would have a hypothesis about what type of content might be present.

Complex state and event reconstruction methods use a state to formulate and test hypotheses about the previous complex event and state. This approach is frequently employed, although the objective is rarely to reconstruct the state immediately preceding a known one, but an earlier one. Common examples include analyzing web browser history, or most-recently-used records, to determine what the user has recently done.


Figure 3.3: The classes of analysis techniques for defining complex states and events have directional components to them [27].

Complex state and event construction techniques use a known state to formulate and test hypotheses about the next event and state. Similarly to the corresponding techniques at the primitive level, complex-level construction techniques are rarely used to define the event and the immediately following state. Instead, they are employed to predict what events may have occurred. For example, the content of a user document, or an installed program, can be the basis for a hypothesis on what other events and states may have occurred afterward.

The final four classes of methods either abstract low-level data and events to higher-level ones, or perform the reverse—materialize higher-level data and events to lower levels. Data abstraction is a bottom-up approach to define complex storage locations (data structures) using lower-level data and data abstraction transformation rules. For example, given a disk volume, we can use knowledge about the filesystem layout to transform the volume into a set of files.

Data materialization is the reverse of data abstraction, transforming higher-level storage locations into lower-level ones using materialization rules, and has limited practical applications.

Event abstraction is the bottom-up approach to define complex events based on a sequence of lower-level events and abstraction rules. It has limited applicability in practice because low-level events tend to be too numerous to log; however, it can be used in the process of analyzing program behavior.

Event materialization techniques are the reverse of event abstraction, where high-level events and materialization rules are used to formulate and test hypotheses about lower-level complex and primitive events. For example, if a user is believed to have performed a certain action, then the presence, or absence, of lower-level traces of their action can confirm, or disprove, the hypothesis.

3.3.3 COGNITIVE TASK MODEL

The differential analysis technique presented in Section 3.3.1 is a basic building block of the investigative process, one that is applied at varying levels of abstraction and to a wide variety of artifacts. However, it does not provide an overall view of how forensic experts actually perform an investigation. This is particularly important in order to build forensic tools that properly support the cognitive processes.

Unfortunately, digital forensics has not been the subject of any serious interest from cognitive scientists, and there have been no coherent efforts to document forensic investigations. Therefore, we adopt the sense-making process originally developed by Pirolli and Card [142] to describe intelligence analysis—a cognitive task that is very similar to forensic analysis. The Pirolli–Card cognitive model is derived from an in-depth cognitive task analysis (CTA), and provides a reasonably detailed view of the different aspects of an intelligence analyst’s work. Although many of the tools are different, forensic and intelligence analysis are very similar in nature—in both cases analysts have to go through a mountain of raw data to identify (relatively few) relevant facts and put them together in a coherent story. The benefits of using this model are that: (a) it provides a fairly accurate description of the investigative process in its own right, and allows us to map the various tools to the different phases of the investigation; (b) it provides a suitable framework for explaining the relationships among the various models developed within the area of digital forensics; and (c) it can seamlessly incorporate information from other sources into the investigation.

The overall process is shown in Figure 3.4. The rectangular boxes represent different stages in the information-processing pipeline, starting with raw data and ending with presentable results. Arrows indicate transformational processes that move information from one box to another. The x axis approximates the overall level of effort required to move information from its raw form to the specific processing stage. The y axis shows the amount of structure (with respect to the investigative process) in the processed information at every stage. Thus, the overall trend is to move the relevant information from the lower left to the upper right corner of the diagram. In reality, the processing can both meander through multiple iterations of local loops and jump over phases (for routine cases handled by an experienced investigator).


Figure 3.4: Notional model of sense-making loop for analysts derived from cognitive task analysis [185, p. 44].

External data sources include all potential evidence sources for the specific investigation, such as disk images, memory snapshots, network captures, as well as reference databases, such as hashes of known files. The shoebox is a subset of all the data that has been identified as potentially relevant, such as all the email communication between two persons of interest. At any given time, the contents of the shoebox can be viewed as the analyst’s approximation of the information content potentially relevant to the case. The evidence file contains only the parts that directly speak to the case, such as specific email exchanges on topics of interest.

The schema contains a more organized version of the evidence, such as a timeline of events, or a graph of relationships, which allows higher-level reasoning over the evidence. A hypothesis is a tentative conclusion that explains the observed evidence in the schema and, by extension, could form the final conclusion. Once the analyst is satisfied that the hypothesis is supported by the evidence, the hypothesis turns into a presentation, which is the final product of the process. The presentation usually takes on the form of an investigator’s report that both speaks to the high-level conclusions relevant to the legal case, and also documents the low-level technical steps based on which the conclusion has been formed.

The overall analytical process is split into two main activity loops: a foraging loop that involves actions taken to find potential sources of information, query them, and filter them for relevance; and a sense-making loop in which the analyst develops—in an iterative fashion—a conceptual model that is supported by the evidence. The information transformation processes in the two loops can be classified into bottom-up (organizing data to build a theory) or top-down (finding data based on a theory). In practice, analysts apply these in an opportunistic fashion with many iterations.

Bottom-up Processes

Bottom-up processes are synthetic—they build higher-level (more abstract) representations of the information from more specific pieces of evidence.

Search and filter: External data sources, hard disks, network traffic, etc., are searched for relevant data based on keywords, time constraints, and other criteria in an effort to eliminate the vast majority of the data that is irrelevant.

Read and extract: Collections in the shoebox are analyzed to extract individual facts and relationships that can support or disprove a theory. The resulting artifacts (e.g., individual email messages) are usually annotated with their relevance to the case.

Schematize: At this step, individual facts and simple implications are organized into a schema that helps identify the significance of, and relationships among, a growing number of facts and events. Timeline analysis is one of the basic tools of the trade; however, any method of organizing and visualizing the facts—graphs, charts, etc.—can greatly speed up the analysis. This is not an easy process to formalize, and most forensic tools do not directly support it. Therefore, the resulting schemas may exist on a piece of paper, on a whiteboard, or only in the mind of the investigator. Since the overall case could be quite complicated, individual schemas may cover only specific aspects of it, such as the sequence of events discovered.

Build case: Out of the analysis of the schemas, the analyst eventually comes up with testable theories that can explain the evidence. A theory is a tentative conclusion and often requires more supporting evidence, as well as testing against alternative explanations.

Tell story: The typical result of a forensic investigation is a final report and, perhaps, an oral presentation in court. The actual presentation may only contain the part of the story that is strongly supported by the digital evidence; weaker points may be established by drawing on evidence from other sources.

Top-down Processes

Top-down processes are analytical—they provide context and direction for the search and organization of less structured data and evidence. Partial, or tentative, conclusions are used to drive the search for supporting and contradictory pieces of evidence.

Re-evaluate: Feedback from clients may necessitate re-evaluations, such as the collection of stronger evidence, or the pursuit of alternative theories.

Search for support: A hypothesis may need more facts to be of interest and, ideally, would be tested against all (reasonably) possible alternative explanations.

Search for evidence: Analysis of theories may require the re-evaluation of evidence to ascertain its significance/provenance, or may trigger the search for more/better evidence.

Search for relations: Pieces of evidence in the evidence file can suggest new searches for facts and relations in the data.

Search for information: The feedback loop from any of the higher levels can ultimately cascade into a search for additional information; this may include new sources, or the reexamination of information that was filtered out during previous passes.

Foraging Loop

It has been observed [138] that analysts tend to start with a high-recall/low-selectivity query, which encompasses a fairly large set of documents—many more than the analyst can afford to read. The original set is then successively modified and narrowed down before the documents are read and analyzed.

The foraging loop is a balancing act between three kinds of processing that an analyst can perform—explore, enrich, and exploit. Exploration effectively expands the shoebox by including larger amounts of data; enrichment shrinks it by providing more specific queries that include fewer objects for consideration; exploitation is the careful reading and analysis of an artifact to extract facts and inferences. Each of these options has varying cost and potential rewards and, according to information foraging theory [141], analysts seek to optimize their cost/benefit trade-off.

Sense-making Loop

Sense-making is a cognitive term and, according to Klein’s [102] widely quoted definition, is the ability to make sense of an ambiguous situation. It is the process of creating situational awareness and understanding to support decision making under uncertainty; it involves the understanding of connections among people, places, and events in order to anticipate their trajectories and act effectively.

There are three main processes involved in the sense-making loop: problem structuring—the creation and exploration of hypotheses; evidentiary reasoning—the employment of evidence to support or disprove hypotheses; and decision making—selecting a course of action from a set of available alternatives.

Data Extraction vs. Analysis vs. Legal Interpretation

Considering the overall process from Figure 3.4, we gain a better understanding of the relationships among the different actors. At present, forensics researchers and tool developers primarily provide the means to extract data from the forensic targets (step 1), and the basic means to search and filter it. Although some data analytics and natural language processing methods (like entity extraction) are starting to appear in dedicated forensic software, these capabilities are still fairly rudimentary in terms of their ability to automate parts of the sense-making loop.

The role of the legal experts is to support the upper right corner of the process in terms of building/disproving legal theories. Thus, the investigator’s task can be described as the translation of highly specific technical facts into a higher-level representation and theory that explains them. The explanation is almost always tied to the sequence of actions of humans involved in the case.

In sum, investigators need not be software engineers but must have enough proficiency to understand the significance of the artifacts extracted from the data sources, and be able to competently read the relevant technical literature (peer-reviewed articles). Similarly, analysts must have a working understanding of the legal landscape and must be able to produce a competent report, and properly present their findings on the witness stand, if necessary.

1. A more detailed definition and discussion of traditional forensics is beyond our scope.
