Vector representations and encoding models

One difficulty in studying meaning is that “meaning” itself can be challenging to define. If you ask what the word ‘strawberry’ means, we might point at a strawberry. If we know the activity in your visual system that is triggered by looking at a strawberry, then we can point to the similar activity patterns that arise in your visual system when you think of the word ‘strawberry,’ and treat those as another kind of meaning. You might imagine that it is harder to point to just any part of the brain and ask of its current state, “Is this a representation of ‘strawberry’?” But it is not impossible. In this subsection, we will, in as informal a way as possible, introduce the ideas of vector representations of words, and of encoding models for identifying the neural representations of those vectors.

Generally speaking, an encoding model aims to predict how the brain will respond to a stimulus. Encoding models contrast with decoding models, which aim to do the opposite: guess which stimulus caused the brain response. The spectrogram reconstruction method (mentioned in a previous section) is an example of a decoding model (Mesgarani et al., 2008). An encoding model of sound would therefore try to predict the neural response to an audio recording. In a landmark study of semantic encoding, Mitchell et al. (2008) were able to predict fMRI responses to the meanings of concrete nouns, like ‘celery’ and ‘airplane.’ Unlike studies of embodied meaning, the neural responses they predicted were not limited to sensorimotor systems. For instance, they accurately predicted word‐specific neural responses across bilateral occipital and parietal lobes, the fusiform and middle frontal gyri, and sensory cortex; the left inferior frontal gyrus; the medial frontal gyrus and the anterior cingulate (see Figure 3.6 for reference; Mitchell et al., 2008). These encoding results expand the set of regions over which the meaning of a word might be distributed, to include nonsensory systems like the anterior cingulate. An even greater expansion of these semantic regions can be found in more recent work (Huth et al., 2016).

So how does an encoding model work? The model uses linear regression to map from a vector representation of a word to the intensity of a single voxel measured during an fMRI scan (representing the activity in a small volume of brain tissue). This approach can be generalized to fit multiple voxels (representing the whole brain), and the model is trained on a subset of word embeddings and brain scans before being tested on unseen data, in order to evaluate its ability to generalize beyond the words it was trained on. But what do ‘vector representation’ and ‘word embedding’ mean? This field is rather technical and jargon rich, but the key ideas are relatively easy to grasp. Vector representations, or word embeddings, represent each word by a vector, effectively a list of numbers. Similarly, brain states can be quantified by vectors, or lists of numbers, that represent the amount of activity seen in each voxel. Once we have these vectors, using linear regression to identify relationships that map one onto the other is mathematically quite straightforward. So the maths is not difficult, and the brain activity vectors are measurable by experiment, but how do we obtain suitable vector representations for each word that we are interested in? Let us assume a vocabulary of exactly four words:

1 airplane

2 boat

3 celery

4 strawberry

One way to encode each of these as a list of numbers is to simply assign one number to each word: ‘airplane’ = [1], ‘boat’ = [2], ‘celery’ = [3], and ‘strawberry’ = [4]. We have enclosed the numbers in square brackets to mean that these are lists. Note that it is possible to have only one item in a list. A good thing about this encoding of the words, as lists of numbers, is that the resulting lists are short and easy to decode: we only have to look them up in our memory or in a table. But this encoding does not do a very good job of capturing the differences in meanings between the words. For example, ‘airplane’ and ‘boat’ are both manufactured vehicles that you could ride inside, whereas ‘celery’ and ‘strawberry’ are both edible parts of plants. A more involved semantic coding might make use of all of these descriptive features to produce the following representations.

In Table 3.1, a 1 has been placed under the semantic description if the word along the row satisfies it. For example, an airplane is manufactured, so the first number in its list is 1, but ‘celery,’ even if grown by humans, is not manufactured, so the first number in its list is 0. The full list for the word ‘boat’ is [1, 1, 1, 0, 0], which is five numbers long. Is this a good encoding? It is certainly longer than the previous encoding (boat = [2]), and unlike the previous code it no longer distinguishes ‘airplane’ from ‘boat’ (both have identical five‐number codes). Finally, the codes are redundant in the sense that, as far as a linear‐regression model is concerned, representing the word ‘boat’ as [1, 1, 1, 0, 0] is no more expressive than representing it as [1, 0]: across our four‐word vocabulary, the first three features always co‐occur, as do the last two. Still, we might prefer the more verbose listing, since we can interpret the meaning of each number, and we can solve the problem of ‘airplane’ not differing from ‘boat’ by adding another number to the list. That is, if we represented the words with six‐number lists, then ‘airplane’ and ‘boat’ could be distinguished: airplane = [1, 1, 1, 0, 0, 0] and boat = [1, 1, 1, 0, 0, 1]. Now the last number of ‘airplane’ is a 0 and the last number of ‘boat’ is a 1.

Table 3.1 Semantic‐field encodings for four words.

Word         Manufactured   Vehicle   Ride inside   Edible   Plant part
airplane     1              1         1             0        0
boat         1              1         1             0        0
celery       0              0         0             1        1
strawberry   0              0         0             1        1
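
To make this concrete, here is a minimal sketch in Python of the semantic‐field encoding in Table 3.1, including the sixth distinguishing feature mentioned above (the text leaves it unnamed; we hypothetically call it “travels on water”):

```python
# Semantic-field encoding from Table 3.1.
# Feature order: [manufactured, vehicle, ride inside, edible, plant part].
embeddings = {
    "airplane":   [1, 1, 1, 0, 0],
    "boat":       [1, 1, 1, 0, 0],
    "celery":     [0, 0, 0, 1, 1],
    "strawberry": [0, 0, 0, 1, 1],
}

# The five-feature codes collide: 'airplane' and 'boat' are indistinguishable.
print(embeddings["airplane"] == embeddings["boat"])  # True

# Appending a sixth feature (hypothetically, "travels on water") fixes this.
travels_on_water = {"airplane": 0, "boat": 1, "celery": 0, "strawberry": 0}
for word, vector in embeddings.items():
    vector.append(travels_on_water[word])

print(embeddings["airplane"])  # [1, 1, 1, 0, 0, 0]
print(embeddings["boat"])      # [1, 1, 1, 0, 0, 1]
```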

So far, our example may seem tedious and somewhat arbitrary: we had to come up with attributes such as “manufactured” or “edible,” then consider their merit as semantic feature dimensions without any obvious objective criteria. However, there are many ways to search for word embeddings automatically, without needing to dream up a large set of semantic fields. A slightly more complex approach relies on the context words that each of our target words occurs with in a corpus of sentences. Consider a corpus that contains exactly four sentences.

1 The boy rode on the airplane.

2 The boy also rode on the boat.

3 The celery tasted good.

4 The strawberry tasted better.

Our target words are, again, ‘airplane,’ ‘boat,’ ‘celery,’ and ‘strawberry.’ The context words are ‘also,’ ‘better,’ ‘boy,’ ‘good,’ ‘on,’ ‘rode,’ ‘tasted,’ and ‘the’ (ignoring capitalization). If we create a table with the target words in rows and the context words in columns, we can count how many times each context word occurs in a sentence with each target word. This produces a new set of word embeddings (Table 3.2).

Unlike the previous semantic‐field embeddings, which were constructed using our “expert opinions,” these context‐word embeddings were learned from data (a corpus of four sentences). Learning a set of word embeddings from data can be very powerful. Indeed, the procedure can be automated, and even a modest computer can process very large corpora of text to produce embeddings for hundreds of thousands of words in seconds. Another strength of creating word embeddings like these is that the procedure is not limited to concrete nouns, since context words can be found for any target word – whether an abstract noun, a verb, or even a function word. You may be wondering how context words are able to represent meaning, but notice that words with similar meanings are bound to co‐occur with similar context words. For example, an ‘airplane’ and a ‘boat’ are both vehicles that you ride in, so they will both occur quite frequently in sentences with the word ‘rode’; however, one will rarely find sentences that contain both ‘celery’ and ‘rode.’ Compared to ‘airplane’ and ‘boat,’ ‘celery’ is more likely to occur in sentences containing the word ‘tasted.’ As the English phonetician Firth (1957, p. 11) wrote: “You shall know a word by the company it keeps.”

Table 3.2 Context‐word encodings of four words.

Word         also   better   boy   good   on   rode   tasted   the
airplane     0      0        1     0      1    1      0        2
boat         1      0        1     0      1    1      0        2
celery       0      0        0     1      0    0      1        1
strawberry   0      1        0     0      0    0      1        1
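
Because the counting procedure is completely mechanical, it is easy to automate. The following short Python sketch (our illustration, using nothing beyond the standard language) rebuilds Table 3.2 from the four‐sentence corpus:

```python
# Build context-word embeddings (Table 3.2) from the four-sentence corpus.
corpus = [
    "The boy rode on the airplane.",
    "The boy also rode on the boat.",
    "The celery tasted good.",
    "The strawberry tasted better.",
]
targets = ["airplane", "boat", "celery", "strawberry"]

# Tokenize: lowercase everything and strip the final period.
sentences = [s.lower().rstrip(".").split() for s in corpus]

# The context vocabulary is every word that is not a target word.
context = sorted({w for s in sentences for w in s} - set(targets))
print(context)
# ['also', 'better', 'boy', 'good', 'on', 'rode', 'tasted', 'the']

# For each target, count each context word in the sentences containing it.
embeddings = {}
for t in targets:
    rows = [s for s in sentences if t in s]
    embeddings[t] = [sum(s.count(c) for s in rows) for c in context]

print(embeddings["airplane"])  # [0, 0, 1, 0, 1, 1, 0, 2]
print(embeddings["boat"])      # [1, 0, 1, 0, 1, 1, 0, 2]
```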

With a reasonable vector representation for words like these, one can begin to see how it may be possible to predict the brain activation for word meanings (Mitchell et al., 2008). Start with a fairly large set of words and their vector representations, and record the brain activity they evoke. Put aside some of the words (perhaps including the word ‘strawberry’) and use the remainder as a training set to find the best linear equation that maps from word vectors to patterns of brain activation. Finally, use that equation to predict what the brain activation should have been for the words you held back, and test how similar the predicted brain activation is to the activation actually observed, and whether the predicted pattern for ‘strawberry’ is indeed more similar to that of ‘celery’ than to that of ‘boat.’ One similarity measure commonly used for this sort of problem is the cosine similarity, which can be defined for two vectors p and q according to the following formula:

$$\operatorname{cosine}(p, q) = \frac{p \cdot q}{\lVert p \rVert\,\lVert q \rVert} = \frac{\sum_{i=1}^{n} p_i\, q_i}{\sqrt{\sum_{i=1}^{n} p_i^{2}}\;\sqrt{\sum_{i=1}^{n} q_i^{2}}}$$
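
In code, the formula is only a few lines. Here is a small sketch that applies it to the context‐word embeddings of Table 3.2 and reproduces the scores discussed below:

```python
from math import sqrt

def cosine(p, q):
    """Cosine similarity of two equal-length vectors: (p . q) / (|p| |q|)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (sqrt(sum(pi * pi for pi in p)) * sqrt(sum(qi * qi for qi in q)))

# Context-word embeddings from Table 3.2.
airplane   = [0, 0, 1, 0, 1, 1, 0, 2]
boat       = [1, 0, 1, 0, 1, 1, 0, 2]
celery     = [0, 0, 0, 1, 0, 0, 1, 1]
strawberry = [0, 1, 0, 0, 0, 0, 1, 1]

print(round(cosine(airplane, boat), 2))      # 0.94
print(round(cosine(airplane, celery), 2))    # 0.44
print(round(cosine(boat, celery), 2))        # 0.41
print(round(cosine(celery, strawberry), 2))  # 0.67
print(round(cosine(boat, boat), 2))          # 1.0: a vector matches itself
```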
Now, if we plug the context‐word embeddings for each pair of words from our four‐word set into this equation, we end up with the similarity scores shown in Table 3.3. Scores closer to 1 mean more similar, and scores closer to 0 mean more dissimilar. A perfect score of 1 means the two vectors point in exactly the same direction, which is what we see when we compare any word embedding with itself. Note that we have populated only the diagonal and upper triangle of the table, because the lower triangle mirrors the upper one and is therefore redundant.

As expected, the words ‘airplane’ and ‘boat’ received a very high similarity score (0.94), whereas ‘airplane’ and ‘celery,’ for example, received a much lower score (0.44). ‘Celery’ and ‘strawberry,’ by contrast, were relatively similar to each other (0.67). Similarity scores like these are quick and easy to compute, even for very long vectors. Exploring them also helps to build an intuition about how encoding models, such as those of Mitchell et al. (2008), represent the meanings of words, and thus what the brain maps they discover represent. Specifically, Firth’s (1957) idea that the company a word keeps can be used to build up a semantic representation of that word has had a profound impact on the study of semantics, especially in the computational fields of natural language processing and machine learning (including deep learning). Mitchell et al.’s (2008) landmark study bridged natural language processing and neuroscience in a way that continues to provide common ground for both fields. Not only do we expect words that belong to similar semantic domains to co‐occur with similar context words, but if the brain is capable of statistical learning, as many believe, then this is exactly the kind of pattern we should expect to find encoded in neural representations.

Table 3.3 Cosine similarities between four words.

             airplane   boat   celery   strawberry
airplane     1          0.94   0.44     0.44
boat                    1      0.41     0.41
celery                         1        0.67
strawberry                              1
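
Finally, the whole encoding‐model procedure can be sketched in a few lines. To be clear, this is not Mitchell et al.’s (2008) actual pipeline: the word vectors and “brain” responses below are randomly simulated stand‐ins with toy dimensions. But the logic is the one described above: fit a linear map on training words, predict the held‐back words’ responses, and check with cosine similarity that each prediction matches its own word’s response best (a simplified version of their leave‐two‐out test).

```python
# Schematic encoding-model analysis with simulated data (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n_words, n_dims, n_voxels = 60, 8, 500

# Word embeddings and noisy "fMRI" responses generated by a hidden linear map.
X = rng.random((n_words, n_dims))
true_W = rng.standard_normal((n_dims, n_voxels))
Y = X @ true_W + 0.1 * rng.standard_normal((n_words, n_voxels))

# Hold out the last two words; fit the linear encoding model on the rest.
train, test = slice(0, n_words - 2), slice(n_words - 2, n_words)
W, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)

def cosine(p, q):
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

# Each held-out prediction should resemble its own word's observed response
# more than the other held-out word's response.
pred = X[test] @ W
for i in (0, 1):
    print(cosine(pred[i], Y[test][i]) > cosine(pred[i], Y[test][1 - i]))  # True
```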

To summarize, we have only begun to scratch the surface of how linguistic meaning is represented in the brain. But figuring out what the brain is doing when it is interpreting speech is so important, and mysterious, that we have tried to illustrate a few recent innovations in enough detail that the reader may begin to imagine how to go further. Embodied meaning, vector representations, and encoding models are not the only ways to study semantics in the brain. They do, however, benefit from engaging with other areas of neuroscience, touching for example on the homunculus map in the somatosensory cortex (Penfield & Boldrey, 1937). It is less clear, at the moment, how to extend these results from lexical to compositional semantics. A more complete neural understanding of pragmatics will also be needed. Much work remains to be done. Because spoken language combines both sound and meaning, a full account of speech comprehension should explain how meaning is coded by the brain. We hope that readers will feel inspired to contribute the next exciting chapters in this endeavor.
