Domain-Sensitive Temporal Tagging - Jannik Strötgen
CHAPTER 3
Foundations of Temporal Tagging
In this chapter, we lay the theoretical foundations to fully understand the discipline of temporal tagging and the challenges that approaches to temporal tagging are faced with. For this, we survey annotation standards, evaluation methods, research competitions, and temporally annotated corpora.
3.1 ANNOTATION STANDARDS
As introduced in the previous chapter, there are different types of temporal expressions: date, time, duration, and set expressions. In addition, temporal expressions can carry their meaning explicitly or implicitly, or they can be underspecified or relative to some context information. When addressing the task of temporal tagging, it is necessary that it is well defined: (i) what types of temporal expressions are “markable” [Ferro et al., 2005b] and should thus be annotated; (ii) what extents should be annotated; and (iii) how the semantics of the expressions can be captured by using normalization attributes requiring some values in a standard format. Thus, annotation standards with precise specifications are a prerequisite when dealing with the task of temporal tagging.
Currently, there are two widely used annotation standards for annotating temporal expressions in documents: TIDES TIMEX2 [Ferro et al., 2001, 2005b] and TimeML [Pustejovsky et al., 2003a, 2005, 2010], a specification language for temporal annotation using TIMEX3 tags for temporal expressions. Both standards present guidelines for the annotation of temporal expressions, including how to determine the extents of expressions and their normalizations. In both cases, the normalization is defined according to the ISO 8601 standard for temporal information with some extensions. For instance, a date expression of granularity day is normalized in the format YYYY-MM-DD. Since all widely used annotated corpora (cf. Section 3.4) as well as all state-of-the-art systems (cf. Chapter 5) are based on one of the two above-mentioned standards, we describe the details of both of them in the following.
TIDES TIMEX2
While there have been several TIMEX definitions ranging from extent-only coverage [see, e.g., Chinchor, 1998] to the inclusion of some normalization information [see, e.g., Mani and Wilson, 2000a, Setzer and Gaizauskas, 2000], the TIDES TIMEX2 definitions were the first annotation guidelines defined in sufficient detail to become broadly accepted as a standard. The annotation guidelines are based on the principles that temporal expressions should be tagged “if a human can determine a value for [it]” and that the value “must be based on evidence internal to the document” [Ferro et al., 2001]. Covering extent and normalization information, they address both questions: What is a temporal expression? and What is the meaning of a temporal expression? For the normalization, TIMEX2 tags may contain the following attributes [Ferro et al., 2005b]:
• VAL: a normalized form of the date/time [or duration/set];
• MOD: captures temporal modifiers;
• ANCHOR_VAL: a normalized form of an anchoring date/time [of a duration];
• ANCHOR_DIR: the relative direction between VAL and ANCHOR_VAL; and
• SET: identifies expressions denoting sets of times.
Except for the SET attribute, there is no concrete attribute for the type of temporal expressions in general. Nevertheless, since it can be determined from the VAL attribute whether an expression is a date, a time, or a duration, and the SET attribute identifies set expressions, the classification of temporal expressions into the four types is implicitly covered by TIMEX2 annotations. However, it is rather difficult to use TIMEX2 annotations if only the extraction and classification of temporal expressions is targeted without the full normalization of temporal expressions.
TIMEML WITH TIMEX3 TAGS FOR TEMPORAL EXPRESSIONS
TimeML, which has more recently been formalized to create the ISO standard ISO-TimeML [Pustejovsky et al., 2010], is based on the TIDES standard and was developed to capture further types of temporal information in documents. In contrast to TIDES, which has only one tag for temporal expressions, TimeML contains tags for annotating events, temporal links (i.e., temporal relations), and temporal signals in addition to the TIMEX3 tag for temporal expressions [Pustejovsky et al., 2003a, 2005, 2010]. In the following, we focus on a description of TimeML aspects that are relevant for the task of temporal tagging.
Because TimeML focuses on temporal information in general and not only on temporal expressions, there are significant differences between TIMEX2 and TIMEX3. These differences concern both the attributes and the extents of temporal expressions. For example, events can be part of temporal expressions in TIMEX2 (<TIMEX2>two days after the revolution</TIMEX2>), while they are not part of temporal expressions following TimeML (<TIMEX3>two days</TIMEX3> after the revolution).
In particular, specific types of pre- and post-modifiers of temporal expressions are part of TIMEX2 tags while in TimeML they are outside TIMEX3 tags [Mazur, 2012]. Such constructs are handled using the newly introduced tags for annotating relations between temporal expressions and events. In addition, TIMEX3 tags cannot be nested. However, TIMEX3 tags with no extent are introduced, for example, to deal with unspecified time points, which are sometimes needed to anchor durations. Note that although such abstract tags, that is, annotations without any extent, are described in the TimeML annotation guidelines, they were not used [cf. Mazur, 2012], neither in annotated corpora nor by TIMEX3-compliant temporal taggers, until the Italian temporal tagging challenge EVENTI in 2014 [Caselli et al., 2014]. In addition, abstract tags have been annotated in the MEANTIME corpus released in 2016 [Minard et al., 2016], which was developed in the context of the NewsReader project. Before that, empty TIMEX3 tags had been mostly ignored.
To describe the semantics of temporal expressions, the most important attributes of TIMEX3 tags are:
• type: defines whether the expression is of type date, time, duration, or set;
• value: a normalized form of the expression;
• mod: captures temporal modifiers;
• quant and freq: specify the quantity and frequency of set expressions;
• beginpoint and endpoint: anchor begin and end of a duration; and
• tid: automatically assigned id number.
While the attribute type—with possible values “date”, “time”, “duration”, and “set”—is newly introduced in TIMEX3, the attributes value and mod are similar to the VAL and MOD attributes in TIMEX2. These two attributes already capture a large part of the information of temporal expressions, and for many expressions—in particular for many date and time expressions—the value attribute is the only attribute besides type that is needed for normalization. This is also the reason why in several evaluations of temporal taggers, the value attribute is the focus of interest [see, e.g., UzZaman et al., 2013].
In particular for explicit date and time expressions, forming the value attribute (or the VAL attribute in TIMEX2) is straightforward. For example, the values of the expressions “September 13, 2009” and “Oct 12, 2014 7:00 am” are 2009-09-13 and 2014-10-12T07:00, respectively. For underspecified and relative date and time expressions, setting the value attribute is more challenging, because the information covered by their own extents is not sufficient. Instead, a reference time has to be used along with a temporal function to calculate the content of the value attribute. For instance, in a document published on November 27, 2014 (2014-11-27), the expression “yesterday” can be normalized to 2014-11-26.
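The two cases above can be illustrated with a minimal Python sketch. The function names and the small table of relative expressions are hypothetical and not part of either annotation standard; the sketch also assumes an English locale for month names. Real temporal taggers use far richer rules and reference-time heuristics.

```python
from datetime import date, datetime, timedelta


def normalize_explicit_date(text: str) -> str:
    """Normalize an explicit date such as 'September 13, 2009' to YYYY-MM-DD."""
    return datetime.strptime(text, "%B %d, %Y").date().isoformat()


def normalize_explicit_time(text: str) -> str:
    """Normalize an explicit time such as 'Oct 12, 2014 7:00 am' to YYYY-MM-DDThh:mm."""
    parsed = datetime.strptime(text, "%b %d, %Y %I:%M %p")
    return parsed.strftime("%Y-%m-%dT%H:%M")


# Hypothetical day offsets for a few relative expressions (illustration only).
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_relative_date(text: str, reference: date) -> str:
    """Apply a simple temporal function (a day offset) to the reference time."""
    offset = RELATIVE_OFFSETS[text.lower()]
    return (reference + timedelta(days=offset)).isoformat()


print(normalize_explicit_date("September 13, 2009"))             # 2009-09-13
print(normalize_explicit_time("Oct 12, 2014 7:00 am"))           # 2014-10-12T07:00
print(normalize_relative_date("yesterday", date(2014, 11, 27)))  # 2014-11-26
```

The explicit expressions need only their own text, whereas the relative expression cannot be normalized without the reference date passed in as a second argument.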
Value attributes in TIMEX3 (as VAL attributes in TIMEX2) assigned to duration expressions start with “P” (period), followed by an amount and an abbreviated unit, e.g., the value of “three years” is P3Y. If the unit of the duration is smaller than a day, the value attribute starts with “PT” (period, time), e.g., PT5H for the expression “five hours”. Thus, the value attribute of durations represents the length of the duration. If a duration can be anchored to some point in time, the attribute beginpoint or endpoint can be used to cover this information. Finally, the value attributes of set expressions are often similar to the ones of duration expressions. However, set expressions are additionally assigned at least one of the attributes quant and freq to cover the characteristics of set expressions. For instance, “twice a week” has a value attribute of P1W and a freq attribute of 2X.
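As a rough illustration of these value formats, the following Python sketch builds duration values and the value/freq pair of a simple set expression. The function names and unit tables are assumptions made for this example, and only single-unit durations are handled.

```python
# Units of a day or larger follow "P"; smaller units follow "PT".
DATE_UNITS = {"year": "Y", "month": "M", "week": "W", "day": "D"}
TIME_UNITS = {"hour": "H", "minute": "M", "second": "S"}


def duration_value(amount: int, unit: str) -> str:
    """Return the duration value for an amount and a unit, e.g. (3, 'years') -> 'P3Y'."""
    unit = unit.lower()
    if unit.endswith("s"):
        unit = unit[:-1]  # accept plural forms such as "years"
    if unit in DATE_UNITS:
        return f"P{amount}{DATE_UNITS[unit]}"
    if unit in TIME_UNITS:
        return f"PT{amount}{TIME_UNITS[unit]}"
    raise ValueError(f"unknown unit: {unit}")


def set_attributes(times: int, per_unit: str) -> dict:
    """Return value and freq for expressions of the form '<times> times a <unit>'."""
    return {"value": duration_value(1, per_unit), "freq": f"{times}X"}


print(duration_value(3, "years"))  # P3Y
print(duration_value(5, "hours"))  # PT5H
print(set_attributes(2, "week"))   # {'value': 'P1W', 'freq': '2X'}
```

Note that “M” is ambiguous on its own: it denotes months after “P” and minutes after “PT”, which is why the “T” marker is required for units smaller than a day.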
In contrast to the other attributes, the tid attribute does not contain any normalized information about an expression, but is just an id number that is automatically generated. It can be used to refer from other TimeML objects to a particular TIMEX3 object. Due to the relations between annotated instances within TimeML, for example, a temporal relation between an event and a temporal expression, an id is assigned to all objects in TimeML.
For many temporal expressions, only an identifier, a type, and a value are assigned. In addition, although the different attributes and definitions of extents between TIMEX2 and TIMEX3 are significant, the annotations for many temporal expressions are very similar, and an automated conversion works reasonably well [see Saquete, 2010, Saquete and Pustejovsky, 2011].
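To make the shape of such an annotation concrete, the sketch below serializes a TIMEX3 tag with Python's standard xml.etree.ElementTree module. The helper make_timex3 and the identifiers t1 and t2 are hypothetical; the uppercase type values follow common practice in TimeML-annotated corpora and are an assumption here, not something the text above prescribes.

```python
import xml.etree.ElementTree as ET


def make_timex3(text: str, tid: str, timex_type: str, value: str, **extra: str) -> str:
    """Serialize one TIMEX3 annotation; extra attributes (e.g. freq, mod) are optional."""
    attributes = {"tid": tid, "type": timex_type, "value": value, **extra}
    element = ET.Element("TIMEX3", attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Most expressions need only an identifier, a type, and a value.
date_tag = make_timex3("September 13, 2009", "t1", "DATE", "2009-09-13")
# Set expressions additionally carry quant and/or freq.
set_tag = make_timex3("twice a week", "t2", "SET", "P1W", freq="2X")
print(date_tag)
print(set_tag)
```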
ANNOTATION SPECIFICATIONS FOR LANGUAGES OTHER THAN ENGLISH
While annotation standards have mostly focused on English or have been developed with the assumption of being rather language-independent, more recently, increasing effort has been put into developing language-specific annotation specifications that capture language characteristics. Obviously, most of the language-specific adaptations deal with specifying extents of temporal expressions. For instance, TimeML specifies that determiners are typically included in and prepositions are excluded from the extents of temporal expressions (e.g., in <TIMEX3>the year 2000</TIMEX3>). In other languages, however, contractions are sometimes used with prepositions and determiners (e.g., in German “in dem” can be contracted to “im” and thus the respective German phrase could be annotated either as <TIMEX3>im Jahr 2000</TIMEX3> or as im <TIMEX3>Jahr 2000</TIMEX3>). For such cases, a decision is needed on whether to include both or neither of them in the extents of temporal expressions.
Furthermore, the set of possible normalization values for the temporal expressions’ attributes has to be extended. For instance, while the original TimeML TIMEX3 attribute value has a possible value to specify a quarter of a year using “Q”, e.g., in 2015-Q1 and 2015-Q2 for the first and second quarter of the year 2015, respectively, it does not contain possible values to specify the three four-month periods of a year. This is quite logical: references to quarters of years are frequent in English, whereas references to the three four-month periods are not. In other languages, however, such expressions occur frequently. For instance, in Spanish the phrase <TIMEX3>el primer cuatrimestre</TIMEX3> refers to the first four-month period of a year. Obviously, it should be possible to normalize such expressions accordingly.
Language-specific annotation guidelines and specifications following the English TimeML have been developed for several languages. Often, they have been developed in the context of some research competitions or together with a manually annotated corpus, which will be surveyed in Section 3.3 and Section 3.4, respectively. These efforts resulted in annotation guidelines and specifications, with some of them being very sophisticated, e.g., those for French [Bittar et al., 2011], Spanish [Saurí and Badia, 2012a, Saurí et al., 2010], and Italian (Ita-TimeML) [Caselli, 2010, Caselli et al., 2011]. It is interesting to note that many of the adaptations to the guidelines and specifications do not concern the annotations of temporal expressions but other parts of TimeML.
For Portuguese [Costa and Branco, 2012] and Romanian [Forascu and Tufis, 2012], English TimeML-annotated data was translated and the annotations were aligned. The authors of both works report that modifications to the original TimeML annotations were sometimes necessary due to language differences, but mostly concerned events and temporal relations, that is, not temporal expression annotations. First steps toward TimeML-compliant annotation specifications for further languages have been taken without focusing on temporal expressions, e.g., for Turkish [Seker and Diri, 2010]. For some languages, annotation efforts concentrated on TIMEX3 annotations only, e.g., for Vietnamese and Arabic [Strötgen et al., 2014a], Croatian [Skukan et al., 2014] and Turkish [Küçük and Küçük, 2015]. These, however, did not result in language-specific annotation specifications but have been carried out by following the English annotation guidelines for TIMEX3 annotations as closely as possible.
HANDLING THE UNCERTAINTY OF TEMPORAL EXPRESSIONS
According to both standards, TIDES TIMEX2 and TimeML with TIMEX3 tags, temporal expressions referring to points on timelines of any granularity are associated with a single value attribute. For instance, <TIMEX>the year 2000</TIMEX>, <TIMEX>March 2000</TIMEX>, and <TIMEX>March 11, 2000</TIMEX> are normalized to 2000, 2000-03, and 2000-03-11, respectively. As pointed out by Berberich et al. [2010], such temporal expressions carry some amount of uncertainty if they occur in a specific context. For instance, in the phrase “the FIFA world cup final 1998”, the final took place on a particular day and not during the whole year.
Thus, they suggest handling each date and time expression as a four-tuple with lower bounds (l) and upper bounds (u) for the begin and end times to cover this uncertainty, i.e., as 〈begin_l, begin_u, end_l, end_u〉. For single temporal expressions the lower bounds are identical and the upper bounds are identical (e.g., the four-tuple representation of <TIMEX>May 2000</TIMEX> is 〈2000-05-01, 2000-05-31, 2000-05-01, 2000-05-31〉). For interval expressions, the four values are different, e.g., 〈2000-03-01, 2000-03-31, 2001-05-01, 2001-05-31〉 for <TIMEX>March 2000 to May 2001</TIMEX>.
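The four-tuple representation for month-granularity expressions can be sketched in a few lines of Python. This is only an illustration of the idea attributed to Berberich et al. [2010] above, not their implementation; the function names and the (year, month) input format are assumptions.

```python
import calendar
from datetime import date
from typing import Optional, Tuple

YearMonth = Tuple[int, int]


def month_bounds(year: int, month: int) -> Tuple[date, date]:
    """Return the first and last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def four_tuple(begin: YearMonth, end: Optional[YearMonth] = None) -> Tuple[str, str, str, str]:
    """Return (begin_l, begin_u, end_l, end_u) for a month or a month-to-month interval."""
    if end is None:
        end = begin  # single expression: begin and end bounds coincide
    begin_l, begin_u = month_bounds(*begin)
    end_l, end_u = month_bounds(*end)
    return tuple(d.isoformat() for d in (begin_l, begin_u, end_l, end_u))


print(four_tuple((2000, 5)))             # May 2000
print(four_tuple((2000, 3), (2001, 5)))  # March 2000 to May 2001
```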
When strictly following TimeML, the phrase “March 2000 to May 2001” is to be annotated as two date expressions (<TIMEX>March 2000</TIMEX> and <TIMEX>May 2001</TIMEX>) and a duration expression as an abstract tag with the value attribute covering the length of the interval (1 year and 3 months). In addition, the begin and end of the interval are covered by the beginpoint and endpoint attributes normalized as 2000-03 and 2001-05. However, as pointed out above, these empty TIMEX tags are often ignored and thus the duration information about complex temporal expressions is typically not covered.
It is worth mentioning that such a crisp annotation of temporal expressions using the four-tuple representation is not always possible due to the fuzziness of language. For instance, temporal expressions with modifiers are more difficult to interpret. In such cases, TimeML makes use of the modifier attribute in addition to the value attribute, e.g., <TIMEX>the beginning of 2000</TIMEX> has a value attribute of 2000 and a modifier attribute of START. Thus, the annotation is left fuzzy on purpose. A direct resolution to the four-tuple representation is also difficult. Of course, due to the fuzziness one could assign the same four values as if there was no modifier. However, it is obvious that parts of the year are not part of “the beginning of 2000” and specifying the boundary is difficult, if not impossible. The boundary might also depend on when the expression is uttered. If the time of utterance is March 2000, then it is likely that March is not included in the time referred to as “the beginning of 2000”. In contrast, March might be included if the expression is uttered in 2002. The upper bound of the end time can thus not be determined at all.
SUMMARY
The TIDES TIMEX2 and TimeML annotation standards are widely accepted in the research community. Depending on particular use cases, they are sometimes extended (as by Berberich et al. [2010] in the context of temporal information retrieval) to better cover the requirements of applications. Owing to the large body of research on temporal relation extraction, TimeML is more widely used than TIDES TIMEX2.
Whenever one is faced with the task of temporal tagging, annotation specifications are required so that normalized information can be correctly interpreted. In addition, since almost all works in the area of temporal tagging are following one of the two standards, it is crucial to follow these annotation specifications when developing a temporal tagger. Otherwise, existing manually annotated corpora cannot be used for evaluations and no meaningful comparison to existing approaches is possible.
Based on both standards, several research competitions have been organized, and several corpora have been manually annotated to be used as benchmarks. In the following sections, we survey temporal tagging research competitions and present an overview of existing annotated corpora. As different measures have been used in the research competitions to evaluate temporal tagging performance, we first describe how temporal taggers can be evaluated and what issues have to be taken into consideration.
3.2 EVALUATING TEMPORAL TAGGERS
In general, as for many natural language processing tasks, there are two ways of evaluating the extraction and normalization quality of temporal taggers: extrinsically and intrinsically. In the former case, more complex tasks or applications relying on temporal tagging output are evaluated. Examples are the tasks of temporal information retrieval [Alonso et al., 2011], temporal relation extraction [UzZaman et al., 2013], and (time-related) question answering [Llorens et al., 2015]. Much more common, however, are intrinsic evaluations, which use manually annotated corpora to directly evaluate a temporal tagger’s extraction and normalization quality.
CONFUSION MATRIX
For intrinsic evaluations, temporal tagging is considered as a specific sequential tagging task, and the confusion matrix (also called contingency table or contingency matrix) can be used to describe a system’s output when compared to a gold standard. As shown in Table 3.1, all decisions of a temporal tagger can be grouped with the confusion matrix into one of the following four classes of a binary classification [Manning and Schütze, 2003]:
• true positives (TP): annotated by the system and in the gold standard;
• true negatives (TN): neither annotated by the system nor in the gold standard;
• false positives (FP): annotated by the system but not in the gold standard; and
• false negatives (FN): not annotated by the system but in the gold standard.
Note that because many temporal expressions consist of more than one token, it is also common to distinguish between strict and relaxed matching. Details about the differences will be explained at the end of the section.
Table 3.1: The decisions of a temporal tagger can be categorized using the confusion matrix
System Prediction | Gold Standard: Positive | Gold Standard: Negative |
Positive | TP | FP |
Negative | FN | TN |
PRECISION, RECALL, F1-SCORE
Both tasks of temporal taggers, the extraction and the normalization of temporal expressions, can be evaluated based on the confusion matrix. For the extraction, true positives are all instances that are correctly extracted by the system, while for the normalization, only instances that are correctly extracted and normalized are considered as true positives. Typically, in an evaluation the measures of precision, recall, and F1-score are determined.
Precision is a measure to indicate how many of the expressions extracted by the system are correct (Equation 3.1). If all instances marked as positive by the system are correct, then precision equals 1, and if all instances marked as positive by the system are incorrectly marked, then precision equals 0:
\[ \text{precision} = \frac{TP}{TP + FP} \tag{3.1} \]
In contrast, recall indicates how many of the expressions that should be extracted are correctly extracted by the system (Equation 3.2). Thus, recall equals 0 if none of the instances that should be marked as positive is marked as positive by the system, and recall equals 1 if all instances that should be marked as positive are indeed marked as positive by the system:
\[ \text{recall} = \frac{TP}{TP + FN} \tag{3.2} \]
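The following Python sketch shows how precision and recall could be computed for both subtasks from a gold standard and a system output. It assumes strict (exact-span) matching, represents annotations as hypothetical dictionaries from character spans to normalized values, and treats every system annotation that is not a true positive as a false positive; official evaluation scripts may differ in these details. True negatives are not needed for either measure.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start offset, end offset) of an annotated expression


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def evaluate(gold: Dict[Span, str], system: Dict[Span, str]) -> Dict[str, Tuple[float, float]]:
    """Evaluate extraction and normalization under strict span matching."""
    shared = gold.keys() & system.keys()
    true_positives = {
        "extraction": len(shared),
        # For normalization, the value must be correct as well.
        "normalization": sum(1 for span in shared if gold[span] == system[span]),
    }
    scores = {}
    for task, tp in true_positives.items():
        fp = len(system) - tp  # annotated by the system but not counted as correct
        fn = len(gold) - tp    # in the gold standard but not counted as correct
        scores[task] = precision_recall(tp, fp, fn)
    return scores


gold = {(0, 9): "2014-11-26", (20, 31): "P3Y", (40, 50): "2000-03"}
system = {(0, 9): "2014-11-26", (20, 31): "P3M", (60, 70): "2015"}
print(evaluate(gold, system))
```

In this toy example, two of the three system annotations match gold spans, so extraction precision and recall are both 2/3; only one of them also has the correct value, so both normalization scores drop to 1/3.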