Domain-Sensitive Temporal Tagging - Jannik Strötgen
CHAPTER 2
The Concept of Time
In the previous chapter, we already implicitly exploited some characteristics of temporal information to explain the motivating examples. Now, we formulate the key characteristics of temporal information in a precise manner (Section 2.1). Then, we highlight the differences between multiple types of temporal expressions occurring in textual documents (Section 2.2) and analyze their possible textual realizations (Section 2.3).
2.1 KEY CHARACTERISTICS OF TEMPORAL INFORMATION
There are three key characteristics of temporal information that make this kind of information highly valuable for many search and exploration tasks. They can be formulated as follows [Alonso et al., 2011].
TEMPORAL INFORMATION IS WELL DEFINED
Given two points in time or two time intervals, the temporal relationship between them can always be determined, for example, as before or equal. In general, the relationship can be assumed to be one of the temporal relations defined by Allen [1983] in the context of temporal reasoning. In addition to the equality relation, there are six basic relations, namely before, meets, overlaps, during, starts, and finishes, each of which has an inverse [Allen, 1983]. In Figure 2.1, these relations are visualized following Allen’s presentation.
Figure 2.1: Temporal information is well defined, so that exactly one of the relations defined by Allen [1983] holds between any two intervals X and Y. Note that every relation except the equality relation has an inverse, so that in total there are 13 possible relations between X and Y.
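The claim that exactly one of the 13 relations holds for any pair of intervals can be checked with a small sketch. The code below is an illustration only, not part of Allen's paper or of this book: the function name `allen_relation` and the integer encoding of intervals are assumptions, and the names of the inverse relations (after, met-by, and so on) follow common convention rather than the text above.

```python
from typing import Tuple

# Assumption: an interval is a pair (start, end) of integers with start < end.
Interval = Tuple[int, int]


def allen_relation(x: Interval, y: Interval) -> str:
    """Return the single Allen relation that holds between intervals x and y."""
    xs, xe = x
    ys, ye = y
    if xs >= xe or ys >= ye:
        raise ValueError("intervals must satisfy start < end")

    if xs == ys and xe == ye:
        return "equals"
    if xe < ys:
        return "before"
    if xe == ys:
        return "meets"
    if xs == ys:  # same start, different end
        return "starts" if xe < ye else "started-by"
    if xe == ye:  # same end, different start
        return "finishes" if xs > ys else "finished-by"
    if ys < xs and xe < ye:
        return "during"
    if xs < ys and ye < xe:
        return "contains"
    if xs < ys < xe < ye:
        return "overlaps"
    # Only the inverses of before, meets, and overlaps remain.
    if ye < xs:
        return "after"
    if ye == xs:
        return "met-by"
    return "overlapped-by"


if __name__ == "__main__":
    print(allen_relation((1, 3), (4, 6)))  # x ends before y starts
    print(allen_relation((2, 3), (1, 5)))  # x lies strictly inside y
```

Because the branches are mutually exclusive and cover every ordering of the four endpoints, every input pair receives exactly one label.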
TEMPORAL INFORMATION CAN BE NORMALIZED
Regardless of the terms used and even of the languages used, two temporal expressions with the same meaning can be normalized to the same value in some standard format. Thus, temporal information can be considered as term- and language-independent. Understanding how temporal expressions can be normalized is one important step toward realizing how temporal information can be exploited in all kinds of application and research scenarios. While we will discuss the details when introducing annotation standards for temporal information in Section 3.1, an example with different temporal expressions carrying the same meaning is shown in Figure 2.2. Note that the expressions are uttered at various reference times (t_ref) and are normalized to the same value on the timeline t.
Figure 2.2: Temporal information can be normalized; the expressions uttered at various times t_ref have the same value in standard format (2015-10-12). Note that explicit expressions such as “October 12, 2015” are normalized independently of when they are stated. The terms “heute” and “hoy” are German and Spanish translations of “today”.
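The idea behind Figure 2.2 can be illustrated with a minimal sketch. It is not a real temporal tagger: the tiny lexicon, the function name `normalize`, and the entries “tomorrow” and “yesterday” are assumptions made for illustration, and only one explicit date pattern is handled.

```python
import re
from datetime import date, timedelta

# Hypothetical mini-lexicon: day offsets relative to the reference time t_ref.
RELATIVE_OFFSETS = {"today": 0, "heute": 0, "hoy": 0, "tomorrow": 1, "yesterday": -1}

MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
    "december": 12,
}


def normalize(expression: str, t_ref: date) -> str:
    """Normalize a date expression to YYYY-MM-DD, using t_ref only if needed."""
    text = expression.strip().lower()
    if text in RELATIVE_OFFSETS:
        # Relative expression: the value depends on when it was uttered.
        return (t_ref + timedelta(days=RELATIVE_OFFSETS[text])).isoformat()
    match = re.fullmatch(r"([a-z]+) (\d{1,2}), (\d{4})", text)
    if match and match.group(1) in MONTHS:
        # Explicit expression: t_ref is ignored.
        month, day, year = match.group(1), int(match.group(2)), int(match.group(3))
        return date(year, MONTHS[month], day).isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


if __name__ == "__main__":
    print(normalize("heute", date(2015, 10, 12)))
    print(normalize("tomorrow", date(2015, 10, 11)))
    print(normalize("October 12, 2015", date(1999, 1, 1)))
```

All three calls yield the same normalized value, although the surface forms, the languages, and the reference times differ.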
TEMPORAL INFORMATION CAN BE ORGANIZED HIERARCHICALLY
Temporal expressions can be of different granularities. For example, they can be of granularity day (e.g., “August 3, 1992”), month (e.g., “August 1992”), or year (e.g., “1992”). Because years consist of months and months consist of days, expressions of one granularity (e.g., day) can be mapped to coarser granularities (e.g., month or year) based on the hierarchy of temporal information. In Figure 2.3, this hierarchy information is shown using the concept of timelines. A timeline is associated with a specific granularity (e.g., t_day, t_month, t_quarter, t_year) so that expressions of the respective granularities can be placed on the timelines as points in time. Note, however, that coarse expressions represent a point on the timeline with the same granularity (e.g., “August 1992” on t_month) but span a time interval on finer granularities (e.g., “August 1992” spans from “August 1, 1992” to “August 31, 1992” on t_day).
Figure 2.3: Temporal information can be organized hierarchically. The blue triangles show how points on coarser timelines (e.g., “1990s” on t_decade) span an interval on finer timelines (e.g., “1990s” spans from “1990” to “1999” on t_year).
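Both directions of the hierarchy, mapping a fine value to a coarser granularity and expanding a coarse value to an interval on t_day, can be sketched in a few lines. The function names `coarsen` and `span_on_days` are hypothetical, and the sketch assumes values written as YYYY, YYYY-MM, or YYYY-MM-DD; quarters and decades are left out for brevity.

```python
import calendar
from datetime import date
from typing import Tuple


def coarsen(day_value: str, granularity: str) -> str:
    """Map a day value (YYYY-MM-DD) to the granularity 'day', 'month', or 'year'."""
    d = date.fromisoformat(day_value)
    if granularity == "day":
        return d.isoformat()
    if granularity == "month":
        return f"{d.year:04d}-{d.month:02d}"
    if granularity == "year":
        return f"{d.year:04d}"
    raise ValueError(f"unknown granularity: {granularity!r}")


def span_on_days(value: str) -> Tuple[str, str]:
    """Return the first and last day covered by a year, month, or day value."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:  # year
        year = parts[0]
        return date(year, 1, 1).isoformat(), date(year, 12, 31).isoformat()
    if len(parts) == 2:  # month
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()
    if len(parts) == 3:  # day: a point on the day timeline
        d = date(*parts).isoformat()
        return d, d
    raise ValueError(f"unsupported value: {value!r}")


if __name__ == "__main__":
    print(coarsen("1992-08-03", "month"))
    print(span_on_days("1992-08"))
```

The second call reproduces the example from the text: “August 1992” is a point on t_month but spans August 1 to August 31, 1992 on t_day.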
2.2 TEMPORAL EXPRESSIONS IN DOCUMENTS
There are different types of temporal expressions according to what kind of temporal information an expression refers to, for example, a point in time or a duration. Note that we use the term point in time to refer to an expression if it can be anchored on a timeline of any granularity although, strictly speaking, expressions of coarse granularities span a time interval on finer granularities (cf. Figure 2.3).
In the context of temporal tagging, it is common practice to distinguish between the following four types of expressions, as specified in the temporal markup language TimeML, which will be detailed in Section 3.1 together with further annotation standards.
• Date expressions: A date expression refers to a point in time of the granularity “day” (e.g., “July 10, 2015”) or any other coarser granularity, for example, “month” (e.g., “July 2015”) or “year” (e.g., “2015”).
• Time expressions: A time expression refers to a point in time of any granularity smaller than “day” such as a part of a day (e.g., “Friday morning”) or time of a day (e.g., “3:30 pm”).
• Duration expressions: A duration expression provides information about the length of an interval. It can refer to intervals of different granularities (e.g., “three hours” or “five years”). In addition to the length of the interval, it might also be possible to specify the point in time when the interval starts or ends. However, the main semantics of a duration expression is about the length of the interval.
• Set expressions: A set expression refers to the periodical aspect of an event, that is, it describes a set of times or dates (e.g., “every Monday”) or a frequency within a time interval (e.g., “twice a week”).
As mentioned above, date expressions, and also (coarse) time expressions, can be considered as time intervals since every such expression consists of smaller temporal units; for example, a single “day” as a point in time consists of hours and could thus be regarded as a duration of the granularity “hour”. However, time and date expressions can be placed on timelines as single points, although the timelines are of different granularities depending on the expressions, as exemplified in Figure 2.3. In contrast, a duration expression cannot be placed on a timeline as a single point, although the point in time when the interval starts or ends might be specified in addition to the length of the interval. Thus, time and date expressions of different granularities are not treated as durations despite the fact that they often have a duration.
2.3 REALIZATIONS OF TEMPORAL EXPRESSIONS
Temporal expressions, in particular those of the types “date” and “time”, can be realized in natural language in several different ways. Besides the fact that the full variety of realizations should be covered and thus extracted by a temporal tagger, a major issue is that depending on the realization, the difficulty in the normalization of date and time expressions varies significantly.
Many different terms have been used in the literature to describe various realizations and characteristics of point expressions, and a brief survey of alternative namings and their descriptions is given below. In this book, we use the four types of realizations described by Strötgen [2015], whose namings are motivated by observations discussed earlier in the literature. The goal of the four types is to cover those characteristics of point expressions that are particularly relevant for temporal tagging. In Table 2.1, the four categories are shown with sample expressions and an explanation of what information is required for their normalization.
• Explicit expressions: Explicit expressions are date and time expressions that carry all the required information for their normalization. Thus, no further knowledge or context information is required; the expressions are fully specified and context-independent. For example, the expressions of the granularity day “March 11, 2013” and of the granularity month “March 2013” can be directly normalized to 2013-03-11 and 2013-03, respectively.
• Implicit expressions: Implicit expressions can be normalized once their implicit temporal semantics is known. Thus, this category is designed specifically for named dates. Examples are holidays that can be directly mapped to a point in time. A simple implicit expression is “Christmas 2013” since Christmas refers to December 25. Thus, the expression can be normalized to 2013-12-25. A more complex example is “Columbus Day 2013” since Columbus Day is scheduled as the second Monday in October. Some calendar calculations have to be performed to normalize the expression to 2013-10-14.
Table 2.1: The four categories of how temporal expressions can be realized, with examples and an overview of the information required for their normalization
• Relative expressions: In contrast to explicit and implicit expressions, relative expressions cannot be normalized without context information. More precisely, a reference time has to be detected to normalize expressions such as “today” and “the following year”. For some relative expressions, the reference time is the point in time when the expression was formulated (e.g., for “today”) while the reference time of other expressions is a point in time mentioned in the context of the expression (e.g., in the statement “in 2000 … in the following year”, 2001 is the normalized value of “the following year” since “2000” is the reference time). In both cases, the reference time is the only required information, because the relation to the reference time is carried by the expressions.
• Underspecified expressions: For the normalization of underspecified expressions, the relation to the reference time is required in addition to the reference time itself. For instance, expressions such as “December” or “December 25” can locally be normalized to XXXX-12 and XXXX-12-25, respectively, that is, without specifying the year. Assuming that the reference time is “November 2013” (2013-11) and the relation to the reference time is “after”, then the two examples can be normalized to 2013-12 and 2013-12-25, respectively.
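The calendar calculation for implicit expressions and the two-step normalization of underspecified expressions can be sketched as follows. This is a hedged illustration, not the procedure of any particular temporal tagger: the names `nth_weekday`, `NAMED_DATES`, `normalize_implicit`, and `resolve_underspecified` are hypothetical, the named-date table contains only the two holidays mentioned above, and the rule that “after” selects the nearest matching month on or after the reference month is an assumption.

```python
from datetime import date, timedelta


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th (1-based) given weekday (Monday=0) of a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7  # days until the first such weekday
    result = first + timedelta(days=offset + 7 * (n - 1))
    if result.month != month:
        raise ValueError("the month has no such weekday occurrence")
    return result


# Hypothetical table of named dates (implicit expressions).
NAMED_DATES = {
    "christmas": lambda year: date(year, 12, 25),
    "columbus day": lambda year: nth_weekday(year, 10, 0, 2),  # 2nd Monday in October
}


def normalize_implicit(name: str, year: int) -> str:
    """Normalize a named date such as 'Columbus Day' for a given year."""
    return NAMED_DATES[name.lower()](year).isoformat()


def resolve_underspecified(local_value: str, ref_year: int, ref_month: int,
                           relation: str) -> str:
    """Fill the XXXX year of 'XXXX-MM' or 'XXXX-MM-DD' from a reference month.

    Assumption: 'after' picks the nearest such month on or after the reference
    month, 'before' the nearest one on or before it.
    """
    if relation not in ("before", "after"):
        raise ValueError(f"unknown relation: {relation!r}")
    if not local_value.startswith("XXXX-") or len(local_value) < 7:
        raise ValueError(f"unsupported value: {local_value!r}")
    month = int(local_value[5:7])
    year = ref_year
    if relation == "after" and month < ref_month:
        year += 1
    if relation == "before" and month > ref_month:
        year -= 1
    return f"{year:04d}{local_value[4:]}"


if __name__ == "__main__":
    print(normalize_implicit("Columbus Day", 2013))
    print(resolve_underspecified("XXXX-12-25", 2013, 11, "after"))
```

With the reference time 2013-11 and the relation “after”, the locally normalized values XXXX-12 and XXXX-12-25 resolve to 2013-12 and 2013-12-25, matching the example in the text. Comparing only at month granularity is a simplification; a day-level reference time would need a finer comparison.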
ALTERNATIVE NAMINGS
As mentioned above, the categorization of temporal expressions referring to points in time has quite a long tradition in the literature. The set of expressions that we call explicit expressions is usually fixed, and only the names used for such expressions differ: explicit [e.g., Alonso et al., 2007, Schilder and Habel, 2001], fully specified [e.g., Pustejovsky et al., 2003a], absolute [e.g., Derczynski, 2013, Jurafsky and Martin, 2008], complete [e.g., Hinrichs, 1986], and independent [e.g., Hinrichs, 1986]. Expressions that we call implicit are less frequently discussed. Grouping the other expressions (i.e., the ones we refer to as relative and underspecified) results in different, partially overlapping sets with multiple names in the literature.
In the following, we present Mazur’s [2012] overview of the terminology used in the literature. For this, the following three example expressions are used:
(i) “tomorrow”,
(ii) “2 days later”, and
(iii) “May 21st”.
While some authors subsume all three types of expressions under one term, e.g., as indexical expressions [e.g., Schilder and Habel, 2001] or relative expressions [e.g., Alonso et al., 2007], they were already separated into three groups by Smith [1978] and Hinrichs [1986]. Expressions such as (i) are frequently referred to as deictic expressions [e.g., Ahn et al., 2005, Busemann et al., 1997, Hinrichs, 1986, Smith, 1978]. Expressions such as (ii) are referred to as anaphoric expressions by some authors [Busemann et al., 1997], while others use the same term to refer to expressions such as (ii) and (iii) [e.g., Ahn et al., 2005]. In our categorization, we follow Busemann et al. [1997] in referring to expressions such as (iii) as underspecified expressions.
Some authors include so-called “vague expressions” as a separate group of point expressions. For instance, Mani and Wilson [2000b] use the term for expressions such as “Monday morning” or season names (e.g., “fall”, “winter”) since their boundaries are fuzzy, that is, there are no exact start and end times. However, we agree with Mazur [2012] that the vagueness of such expressions should not result in a specific type of expression since it “is not the expression that is vague […] [but] the entity referred to that has vague boundaries” [Mazur, 2012].
UNCERTAINTY OF TEMPORAL EXPRESSIONS
Standard date and time expressions are also often used without referring to the full duration of the expression. That is, their actual meaning is uncertain, or more specifically, it is not clear which exact time interval they refer to [Berberich et al., 2010]. For instance, in “he visited Germany in 2010”, it is rather unlikely that the visit lasted the whole year. The exact point or period in 2010 is not known. Thus, all expressions of a coarser granularity than a timestamp could be regarded as fuzzy. As will be described in Chapter 3, according to annotation standards, date and time expressions are typically assigned a single normalized value so that we also refer to them as points in time (with specific granularities). However, as pointed out by Berberich et al. [2010], and as we will also discuss in Section 3.1 when describing annotation standards, for some applications it may be useful to consider every time and date expression as an interval and to assign lower and upper bounds for the start and end times instead of a single value, that is, to take care of the fuzziness issue.
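One possible reading of the interval view is sketched below: a date expression is represented by lower and upper bounds for its start and for its end, and a concrete period is compatible with the expression if it respects these bounds. The class `FuzzyInterval` and its methods are hypothetical names chosen for illustration; they are not taken from Berberich et al. [2010] or from any annotation standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FuzzyInterval:
    """Bounds for the start and the end of the period an expression refers to."""
    begin_lower: date
    begin_upper: date
    end_lower: date
    end_upper: date

    @classmethod
    def from_year(cls, year: int) -> "FuzzyInterval":
        # "in 2010": the period may start and end anywhere within the year.
        first, last = date(year, 1, 1), date(year, 12, 31)
        return cls(first, last, first, last)

    def admits(self, begin: date, end: date) -> bool:
        """Check whether a concrete period [begin, end] fits within the bounds."""
        return (
            begin <= end
            and self.begin_lower <= begin <= self.begin_upper
            and self.end_lower <= end <= self.end_upper
        )


if __name__ == "__main__":
    visit = FuzzyInterval.from_year(2010)
    print(visit.admits(date(2010, 5, 3), date(2010, 5, 17)))    # two weeks in May
    print(visit.admits(date(2009, 12, 30), date(2010, 1, 5)))   # starts in 2009
```

Under this representation, “he visited Germany in 2010” is compatible with a two-week visit in May as well as with a stay covering the whole year, while a single normalized value cannot express that difference.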
Figure 2.4: Different realization types of date expressions in documents.
EXAMPLES OF DATE EXPRESSIONS IN A NEWS ARTICLE
To become familiar with the naming of realization types of date and time expressions, we give some examples in Figure 2.4. In some excerpts of the news article already shown in Figure 1.1, temporal expressions are marked as either explicit, underspecified, or relative. Since the original article contained no implicit temporal expression, we added the last sentence to the example to cover all four realization types of temporal expressions.
As already pointed out above, there are differences in how temporal expressions of the four realization types are to be normalized. Since these differences are one of the key challenges of temporal tagging, we will cover them in detail in Chapter 4. Before that, we will first lay some further foundations (annotation standards and evaluation methods) and present an overview of relevant research competitions as well as existing annotated data sets in the next chapter.
2.4 SUMMARY OF THE CHAPTER
The most important characteristic of temporal information in the context of temporal tagging is that it can be normalized. For applications exploiting normalized temporal information, it is furthermore important that temporal information is well defined and that it can be organized hierarchically. While there are four types of temporal expressions (date, time, duration, and set expressions), several namings of the realizations of date and time expressions have been suggested in the literature. However, in the context of temporal tagging, we suggest distinguishing between explicit, implicit, relative, and underspecified date and time expressions.