Читать книгу A Companion to Chomsky - Группа авторов - Страница 57
5.4 Type 2 Grammars: Context‐Free Grammars
ОглавлениеA Type 2 or context‐free grammar (CFG) allows the right‐hand side of a rule to be any mixture of nonterminal and terminal symbols. A classic example, which generates the (for ) stringset, is shown in (2). A derivation using this grammar is in (3).
1 (2) start symbol: SSaSbSab
2 (3)SaSbaaSbbaaaSbbbaaaaSbbbbaaaaabbbbb
It is helpful to consider exactly how this CFG manages to do what no FSG can. What made this pattern impossible for an FSG is that the endless collection of prefixes a, aa, aaa, etc. cannot all be “kept separate” by a function that maps each prefix to a subset of a finite number of states. But a CFG still has only a finite number of rules, which specify the behaviour of a finite number of nonterminal symbols; one way or another, a CFG must boil down to a collection of rules that specify finitely many ways in which subexpressions of finitely many different classes can be combined. How can this finiteness be reconciled with the fact that a CFG can generate the pattern?
The answer is that while a FSG is limited to classifying strings according to their role as prefixes (i.e. according to what can come after them), a CFG is able to classify strings according to their role as infixes (i.e. according to what can come around them). The CFG in (2) does not work by specifying what can come after aa, and what can come after aaa, etc.; as we have seen, any finite device that adopts this strategy is doomed. Instead, this grammar works by specifying what can appear around certain strings, and the crucial point is that what can appear around, say, aabb, is the same as what can appear around aaabbb – so while the pattern does not create any interesting equivalences between prefixes, it does create interesting equivalences between infixes. The grammar in (2) takes advantage of this by making a two‐way distinction between (i) strings of the form , which can be combined with any surroundings of the form , and (ii) all other strings. It uses the nonterminal symbol S as a book‐keeping symbol to identify strings belonging to the first class, just as FSGs use states.
To see these concepts in a more familiar form, consider the CFG in (4). A few example derivations are illustrated with tree diagrams in (5). The stringset that this grammar generates also cannot be generated by any FSG: among strings of the form , only those where there is exactly one occurrence of chase for each occurrence of dogs after the first one will be generated (i.e. those of the form ), so there are unboundedly many nonequivalent prefixes of the form . The early rejection of FSGs as models of natural language was based on the claim that a grammar of English would need to generate patterns of essentially this sort (Chomsky, 1956, §2.3).
1 (4) start symbol: SSNP VPNdogsNPNAchasedNPA NAbigNPN RCVchasedRCNP VVchaseVPV NPVPsleep
1 (5)
Subsets of the set of nonterminals characterize classes of strings, grouped according to the ways they can appear as infixes. I will call these sets inside sets: for example, the inside set of the string chase big dogs is , and the inside set of chased dogs is ; see Table 5.5. This partitioning of strings is the CFG's analog of an FSG's partitioning of strings by forward sets, and corresponds to the familiar notion of categorized constituents. A typical motivation for writing a grammar where a single nonterminal can derive, say, both sleep and chase dogs, is to express the idea that they are intersubstitutable infixes, i.e. for any two strings and , if is generated then is also generated. This is the sense in which CFGs carry over earlier ideas from immediate constituent analysis (e.g. Harris, 1946, 1954; Wells, 1947). For more recent discussions of this important notion, see Clark (2013), Clark and Yoshinaka (2016) and Stabler (2019, Section 2).
Table 5.5 Equivalence classes induced by the grammar in (4)
Inside set | Example strings from the corresponding equivalence class |
---|---|
dogs sleep | |
dogs dogs chase sleep | |
chased dogs chase dogs | |
dogs chased dogs | |
dogs | |
big dogs | |
dogs dogs chase | |
dogs dogs dogs chase chase | |
sleep | |
chase dogs | |
chased dogs dogs chase | |
chased big dogs | |
dogs chase | |
dogs chased | |
big dogs chased | |
dogs | |
big | |
chase | |
chased | |
chased dogs | |
dogs dogs | |
sleep chased | |
big dogs chased |
Like forward sets, the inside set of a string (relative to a particular CFG) can be computed recursively from the inside sets of the string's parts. Recall that the forward set of a string , for example, can be computed by considering the states in the forward set of , and asking which of these states have an outgoing transition emitting . The situation for inside sets is similar. Considering only rules of the form “ ” for ease of exposition: to compute the inside set of , we must consider the inside set of and ask whether any of its member nonterminals can be combined with any nonterminal in the inside set of ; but also consider combinations drawing one nonterminal from the inside set of and one from that of , plus combinations drawing one from the inside set of and one from that of . (The inside set of a length‐one string , we can assume, is just the set of nonterminals for which the grammar contains the rule .)15 The upshot is that to work out the inside set of , it suffices to know the inside sets of all its substrings. This is more complex than the situation for forward sets because there are multiple “splits” to consider (as opposed to only splitting into and ); but it is analogous in the important sense that inside sets tell us everything there is to know about how the grammar treats the relevant subexpression.
This freedom to split strings into pieces in arbitrary ways underlies a CFG's ability to create classes of intersubstitutable infixes. For example, consider the first tree in (5): the fact that we can split the last word off dogs dogs chase sleep but then split the first word off dogs dogs chase leads to the end result that the overall string is split into an infix and its surroundings . The nonterminal symbol RC serves as a hinge or pivot that links together these two pieces – dogs chase, which can appear “inside” RC, and , which can appear “outside” it – just as the state B in Figure 5.2, for example, links together someone ran and and ran really quickly.
In an important sense, what distinguishes CFGs from FSGs is not directly an issue of hierarchical tree structure or constituency – the tree in Figure 5.3 is hierarchical, expressing part‐whole relationships among constituents, in at least one sense – but rather, the distinction between being able to make infix‐circumfix splits and being restricted to prefix‐suffix splits. Specifically, the stringsets that can be generated by a CFG but not by any FSG are those for which every CFG is “self‐embedding” (Chomsky, 1959, p. 167), i.e. those where some string can be substituted for a proper infix of itself, such as the way aabb can be substituted for ab in the stringset. The relevant equivalence classes are still classes of strings, and are based on the way substrings fit into larger strings.
What sorts of stringsets cannot be generated by the infix‐based mechanisms of a CFG? The comparison between the stringsets and defined in (6) is instructive (Chomsky and Miller, 1963, pp. 285–287). Some notation: is the reverse of a string , and is the set of all non‐empty strings using the symbols a and b. So is the set of even‐length palindromes, and is the set of strings whose second half simply repeats the first half.
1 (6)
There is a CFG that generates , shown in (7).
1 (7)S aSaS bSbS aaS bb
It may seem at first surprising, then, that no CFG can generate , since it is simply without the reversal. But the notation obscures the way that the CFG in (7) actually works. Importantly, this CFG generates abaaba not by combining the prefix aba with the suffix aba, but rather, by combining the infix baab with the surroundings a__a, using the first rule shown in (7). An infix‐based analysis of , however, is no help.