Читать книгу A Companion to Chomsky - Группа авторов - Страница 56
5.3 Type 3 Grammars: Finite State Grammars
ОглавлениеAs mentioned above, Type 3 grammars were understood from the outset to be a reformulation of finite‐state automata in the setting of rewriting grammars, so I will generally describe them as “finite‐state grammars” (FSGs).
Figure 5.2 shows an example of a FSG in rewrite‐rule format, and also the equivalent finite‐state automaton in a standard graphical representation. A finite‐state automaton works with a specified finite set of states (indicated with circles in the diagram) and a specified set of (terminal) symbols. One of the states is the designated start state (indicated with an unlabeled arrow), and some of the states are designated accepting states (indicated with a double circle). The workings of the automaton are specified as a set of transitions, each associating an ordered pair of states with a symbol, and represented graphically by an arrow.
A finite‐state automaton generates a string if and only if there is a connected sequence of transitions leading from the start state to an accepting state, that emit the symbols of in order. For example, the automaton in Figure 5.2 generates someone really ran, but not someone ran really or someone ran quickly.
To see the connection between Type 3 rewriting grammars and finite‐state automata, we associate states with nonterminal symbols, and in particular associate the start state of an automaton with the starting nonterminal of a grammar. The grammar's rewrite rules correspond to the automaton's transitions. Consider for example the transition from state A to state B emitting someone. If we think of each nonterminal as characterizing all strings that can be used to move the automaton from state to an accepting state, then the rewrite rule “A someone B” says that a string can move the automaton from state A to an accepting state if it's made up of the symbol someone followed by a string that moves from state B to an accepting state. For transitions into an accepting state we include an additional rule with no nonterminal on the right hand side, such as “B ran”.
The string‐rewriting derivation and the tree structure in Figure 5.3 both show that someone ran and ran really quickly is generated by both the rewriting grammar and the automaton in Figure 5.2.
Figure 5.2 A simple finite‐state grammar, represented graphically and as rewrite rules
Figure 5.3 Two equivalent representations of how the grammar in Figure 5.2 generates the string someone ran and ran really quickly
The central idea in understanding the capabilities and limitations of FSGs is what I will call a string's forward set. Adopting the automaton perspective, the forward set of a string is the set of states that the automaton could reach from its initial state by taking some sequence of transitions that produce . For example, if the automaton in Figure 5.2 has, from its initial state, taken some sequence of transitions that produces someone ran, then the only state that it might be in is C; so the forward set of someone ran (for this automaton) is . Similarly, the forward set of someone really is , the forward set of someone ran and is , and the forward set of someone and is the empty set. The automaton generates a string if and only if the forward set of the string contains an accepting state.
Importantly, knowing the forward set of some initial part of a string (i.e. a prefix of the string, in formal terms) provides a head‐start toward calculating the forward set of the entire string. For example, suppose we are told that the forward set (using the grammar in Figure 5.2) of some particular string is . Then, writing for string concatenation, it is straightforward to see that the forward set of the string is , and that the forward set of is . We can be sure of this without knowing anything about except its forward set. The forward set captures everything there is to know about how the automaton treats the prefix ; it identifies what we can think of as the category of that subexpression.
This leads us to the crucial idea of intersubstitutability mentioned in the introduction. Notice that someone ran and someone really ran both have the same forward set, namely . So for any string , the forward set of must be the same as the forward set of – because in each case, all that matters is how the string can “move us forward” from whatever states happen to be in the common forward set of and . Furthermore, the two strings and are either both generated by the grammar, or neither is, since these two strings' common forward set either does or doesn't contain an accepting state.
Table 5.2 Equivalence Classes Induced by the Grammar in Figure 5.2
Forward set | Example strings from the corresponding equivalence class |
---|---|
someone | |
someone really | |
someone really really | |
someone ran and someone | |
someone really ran and someone | |
someone ran | |
someone really ran | |
someone ran and ran | |
someone ran and someone ran | |
someone ran really | |
someone ran really really | |
someone ran and someone ran really | |
someone ran and | |
someone really ran and | |
someone ran and someone ran and | |
someone and | |
someone someone | |
and ran |
Stated generally, what we can conclude when two strings have the same forward set is that the relevant grammar treats them as intersubstitutable prefixes. So forward sets describe the way a grammar partitions prefixes into equivalence classes, collections of prefixes that are all intersubstitutable with each other. Table 5.2 shows the classes into which prefixes are partitioned by the grammar in Figure 5.2.
These intersubstitutability relationships also underlie the way “loops” in the structure of an FSG allow for the generation of infinitely many strings. A consequence of the loops in Figure 5.2 is that, for example, someone really and someone really really have the same forward set (namely ). Since these are therefore intersubstitutable, it follows that they are also both interchangeable with someone really really really – since we can substitute someone really really for the someone really part of itself. Any continuation that is compatible with one of these strings (e.g. the continuation ran) will be compatible with all others from the infinite class as well.
Figures 5.4 and 5.5 show some more abstract examples to sharpen these important points.
The grammar in Figure 5.4 requires that a occurs an odd number of times. Here, “what matters” about a prefix is just whether a occurs an odd or even number of times. The grammar reflects this by grouping all strings with an even number of as together in the equivalence class represented by the forward set ; and similarly, for an odd number, the forward set . See Table 5.3.
Figure 5.4 Two representations of a finite‐state grammar requiring an odd number of as
Figure 5.5 Two representations of a finite‐state grammar requiring that the second‐last symbol is a
Table 5.3 Equivalence classes induced by the grammar in Figure 5.4
Contains an even number of as | FORWARD SET: |
---|---|
EXAMPLE STRINGS: , b, bb, aa, baa, aba, aab, abba | |
Contains an odd number of as | FORWARD SET: |
EXAMPLE STRINGS: a, ab, ba, bab, bba, aaa, aaba, baaa |
The grammar in Figure 5.5 requires that a be the second last symbol in a string. Part of what matters here is whether a was the second last symbol; we also need to track whether the last symbol was a, so that we know where we stand if one more symbol is added. These two yes‐no questions create a four‐way split, represented by the four forward sets shown in Table 5.4. The set of all possible strings is partitioned into these four classes, and knowing which of these four classes a string belongs to provides everything an automaton needs to know in order to enforce appropriate requirements on what follows. For example, any two strings with a in the last position but not the second last position will both have forward set , and will therefore be treated as intersubstitutable.
These examples bring out an important intuition that is likely familiar to linguists: designing a grammar amounts to choosing “what to remember” about intermediate subexpressions, and what can be ignored. It is exactly this move of ignoring distinctions between subexpressions that allows a finitely specified grammar to generate unboundedly many expressions. A device that never allowed itself to “forget,” or ignore, some aspects of an expression's internals, would be one where the applicability of a grammatical rule is dependent on the entire surrounding context, and would simply amount to a finite list of expressions. The automaton in Figure 5.6 is of this uninteresting sort: each state serves to remember an entire prefix, so no two distinct prefixes are “grouped together” by virtue of sharing a forward set, and the automaton simply encodes an arbitrary finite set of strings (namely ). This set exhibits no interesting patterns because the automaton has no interesting structure.
Table 5.4 Equivalence classes induced by the grammar in Figure 5.5
Second‐last symbol is not a | Second‐last symbol is a | |
---|---|---|
Last symbol is not a | FORWARD SET: | FORWARD SET: |
EXAMPLE STRINGS: , b, bb, abb | EXAMPLE STRINGS: ab, aab, bab | |
Last symbol is a | FORWARD SET: | FORWARD SET: |
EXAMPLE STRINGS: a, ab, aba, bba | EXAMPLE STRINGS: aa, aaa, baa |
Figure 5.6 A boring finite state automaton where states stand in a one‐to‐one correspondence with prefixes
Given the central role of this partitioning of strings into classes, a natural question that arises is what sorts of partitionings are possible. How large can the number of classes induced by a finite‐state grammar get? There is no upper limit, but the number of classes must be finite.14 This follows from the fact that the number of states is finite: each equivalence class is identified by a subset of the set of states, and if the set of states is finite then it has only finitely many subsets.
The requirement that there only be finitely many equivalence classes allows us to succinctly identify the limitations of FSGs. Consider, for example, trying to construct an FSG for (for ). First note that the prefix a is not intersubstitutable with aa, since one but not the other will lead to a well‐formed string when followed by bb. Similarly, neither of these prefixes is intersubstitutable with aaa. So the three strings a, aa, and aaa need to be split up into three distinct equivalence classes. In fact, as we go further down the list to aaaa, aaaaa, aaaaaa and so on, we will never find two such strings of as that are intersubstitutable. No finite number of equivalence classes will be enough to track all the relevant distinctions. But for any finite number of states, there are only finitely many different possible forward sets; any FSG will, eventually, assign two different strings of as – say, and – the same forward set, and will therefore incorrectly treat them as intersubstitutable prefixes (e.g. incorrectly generating and in addition to and ).
There is something natural about this idea: a grammar (or a mind) can only contain finitely many rules, and each rule specifies some allowable “use” for finitely many classes of expression, picked out by states or nonterminal symbols. Even if we somehow sliced up the space of all possible expressions into infinitely many equivalence classes, any finite grammar would not contain enough rules to define uses for all of those classes.