Historical Overview
Pioneering work on ASR dates to the early 1950s. The first ASR system, developed at Bell Telephone Laboratories by Davis, Biddulph, and Balashek (1952), could recognize isolated digits from 0 to 9 for a single speaker. In 1956, Olson and Belar created a phonetic typewriter that could recognize 10 discrete syllables. It was also speaker dependent and required extensive training.
These early ASR systems used template‐based recognition based on pattern matching that compared the speaker's input with prestored acoustic templates or patterns. Pattern matching operates well at the word level for recognition of phonetically distinct items in small vocabularies but is less effective for larger vocabulary recognition. Another limitation of pattern matching is its inability to match and align input speech signals with prestored acoustic models of different lengths. Therefore, the performance of these ASR systems was lackluster because they used acoustic approaches that only recognized basic units of speech clearly enunciated by a single speaker (Rabiner & Juang, 1993).
An early attempt by Forgie and Forgie (1959) to construct a speaker‐independent recognizer was also the first to use a computer. Later, researchers experimented with time‐normalization techniques (such as dynamic time warping, or DTW) to minimize differences in the speech rates of different talkers and to reliably detect the starts and ends of speech (e.g., Martin, Nelson, & Zadell, 1964; Vintsyuk, 1968). Reddy (1966) attempted to develop a system capable of recognizing continuous speech by dynamically tracking phonemes.
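DTW aligns two utterances that unfold at different rates by finding the lowest‐cost warping path between their feature sequences. The minimal Python sketch below illustrates the idea on two invented one‐dimensional sequences; the absolute‐difference frame cost and all values are illustrative only, not taken from the systems cited above.

```python
# Minimal dynamic time warping (DTW) sketch: aligns two feature
# sequences of different lengths by finding the lowest-cost warping path.
# The sequences and the absolute-difference cost are illustrative only.

def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = cumulative cost of the best alignment of the first
    # i frames of seq_a with the first j frames of seq_b
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])      # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in seq_a
                                 cost[i][j - 1],      # skip a frame in seq_b
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# A slow template and a faster utterance of the "same" contour
template = [1, 2, 3, 4, 4, 3, 2, 1]
utterance = [1, 3, 4, 3, 1]
print(dtw_distance(template, utterance))
```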
Figure 1 A simple four‐state Markov model with transition probabilities
The 1970s were marked by several milestones: a focus on the recognition of continuous speech, the development of large‐vocabulary speech recognizers, and experiments to create truly speaker‐independent systems. During this period, the first commercial ASR system, called VIP‐100, appeared and won a US National Award. This success prompted the Advanced Research Projects Agency (ARPA) of the US Department of Defense to fund the Speech Understanding Research (SUR) project from 1971 to 1976 (Markowitz, 1996). The goal of SUR was to create a system capable of understanding the connected speech of several speakers from a 1,000‐word vocabulary in a low‐noise environment with an error rate of less than 10%. Of the six systems developed, the most viable were Hearsay II, HWIM (Hear What I Mean), and Harpy, the only one that completely achieved SUR's goal (Rodman, 1999). These systems had a profound impact on ASR research and development by demonstrating the benefits of data‐driven statistical models over template‐based approaches and helping move ASR research toward statistical modeling methods such as hidden Markov modeling (HMM). Unlike pattern matching, HMM is based on complex statistical and probabilistic analyses (Peinado & Segura, 2006). In simple terms, an HMM represents a language unit (e.g., a phoneme or word) as a sequence of states with transition probabilities between the states (see Figure 1).
The main strength of an HMM is that it can describe the probability of states and represent their order and variability through algorithms such as Baum‐Welch (for estimating the model parameters) and Viterbi (for finding the most likely state sequence). In other words, HMMs can adequately analyze both the temporal and spectral variations of speech signals, and can recognize and efficiently decode continuous speech input. However, HMMs require extensive training and substantial computational power for model‐parameter storage and likelihood evaluation (Burileanu, 2008).
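To make the state‐sequence idea concrete, the sketch below runs Viterbi decoding over a toy two‐state HMM. All states, observation symbols, and probabilities are invented for illustration; real ASR systems use phoneme‐ or word‐level HMMs whose emission probabilities come from acoustic models.

```python
# Toy Viterbi decoding over a small HMM. States, observations, and all
# probabilities are invented; they are not taken from any real system.
states = ["S1", "S2"]
observations = ["o1", "o2", "o1"]

start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"o1": 0.9, "o2": 0.1},
          "S2": {"o1": 0.2, "o2": 0.8}}

def viterbi(obs):
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            best[t][s] = (prob, prev)
    # Backtrack from the most probable final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, best[t][path[0]][1])
    return path, best[-1][last][0]

print(viterbi(observations))   # most likely state sequence and its probability
```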
Although HMMs became the primary focus of ASR research in the 1980s, this period was also characterized by the reintroduction of artificial neural network (ANN) models, abandoned since the 1960s because of numerous practical problems. Neural networks are loosely modeled on the human neural system. A network consists of interconnected processing elements (units) organized in layers and connected by weights that are learned from the training data (see Figure 2). A typical ANN takes an acoustic input, processes it through the units, and produces an output (i.e., a recognized text). It is the values of these learned weights that determine how the network classifies and recognizes the input.
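As a concrete illustration, the NumPy sketch below passes a single acoustic feature vector through one hidden layer and a softmax output, mirroring the layered structure in Figure 2. The feature dimension, layer sizes, and randomly initialized weights are all illustrative assumptions; in practice the weights are learned from training data (e.g., by backpropagation).

```python
import numpy as np

# Minimal forward pass through a one-hidden-layer network.
# Dimensions, layer sizes, and weights are arbitrary illustrations.
rng = np.random.default_rng(0)

x = rng.normal(size=13)           # one acoustic feature vector (e.g., 13 MFCCs)
W1 = rng.normal(size=(32, 13))    # input -> hidden weights
b1 = np.zeros(32)
W2 = rng.normal(size=(5, 32))     # hidden -> output weights (5 classes)
b2 = np.zeros(5)

h = np.tanh(W1 @ x + b1)          # hidden-layer activations
logits = W2 @ h + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()              # softmax over the output classes
print("predicted class:", int(np.argmax(probs)))
```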
The main advantage of ANNs lay in the classification of static patterns (including noisy acoustic data), which was particularly useful for recognizing isolated speech units. However, pure ANN‐based systems were not effective for continuous speech recognition, so ANNs were often integrated with HMMs in a hybrid approach (Torkkola, 1994).
The use of HMMs and ANNs in the 1980s led to considerable efforts toward constructing systems for large‐vocabulary continuous speech recognition. During this time ASR was introduced into public telephone networks, and portable speech recognizers were offered to the public. Commercialization continued in the 1990s, when ASR was integrated into products ranging from PC‐based dictation systems to air traffic control training systems.
Figure 2 A simple artificial neural network
During the 1990s, ASR research focused on extending speech recognition to large vocabularies for dictation, spontaneous speech recognition, and speech processing in noisy environments. This period was also marked by systematic evaluations of ASR technologies based on word or sentence error rates and by the construction of applications designed to mimic human‐to‐human speech communication by holding a dialogue with a human speaker (e.g., Pegasus and How May I Help You?). Additionally, work on visual speech recognition (i.e., recognition of speech using visual information such as lip position and movements) began and continued after 2000 (Liew & Wang, 2009).
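Word error rate, the standard evaluation metric mentioned above, is the number of substitutions, deletions, and insertions needed to turn the recognizer's hypothesis into the reference transcript, divided by the number of reference words. The short Python sketch below computes it with an ordinary edit‐distance table; the example sentences are invented.

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed with a standard edit-distance table. Example strings are invented.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("may" -> "can") and one deletion ("you"): 2 / 5 = 0.4
print(word_error_rate("how may i help you", "how can i help"))
```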
The 2000s witnessed further progress in ASR, including the development of new algorithms and modeling techniques, advances in noisy speech recognition, and the integration of speech recognition into mobile technologies. Another recent trend is the development of emotion recognition systems that identify emotions and other paralinguistic content using cues such as voice tone, facial expressions, and gestures (Schuller, Batliner, Steidl, & Seppi, 2009; Anagnostopoulos, Iliou, & Giannoukos, 2015). However, one area that has truly revolutionized ASR in recent years is deep learning (Deng & Yu, 2014; Yu & Deng, 2015; Mitra et al., 2017; Zhang et al., 2017).
Deep learning refers to a set of machine learning techniques and models based on nonlinear information processing and the learning of feature representations. One such model is the deep neural network (DNN), which started gaining widespread adoption in ASR systems around 2010 (Deng & Yu, 2014). Unlike HMMs and traditional ANNs, which rely on shallow architectures (i.e., a single hidden layer) and can handle only constrained, context‐dependent input because of their susceptibility to background noise and to mismatches between training and testing conditions (Mitra et al., 2017), DNNs use multiple layers of representation for acoustic modeling, which improves speech recognition performance (Deng & Yu, 2014). Recent studies have shown that DNN‐based ASR systems can significantly increase recognition accuracy (Mohamed, Dahl, & Hinton, 2012; Deng et al., 2013; Yu & Deng, 2015) and reduce the relative error rate by 20–30% or more (Pan, Liu, Wang, Hu, & Jiang, 2012). Deep learning architectures are now used in all major ASR systems.
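The sketch below illustrates the deep architecture idea: a window of acoustic frames is passed through several hidden layers to produce posterior probabilities over HMM states. The window width, layer sizes, and randomly initialized weights are illustrative assumptions only; production systems learn these parameters from large training corpora.

```python
import numpy as np

# Sketch of a DNN acoustic model: several hidden layers map a window of
# acoustic frames to posterior probabilities over HMM states. All sizes
# and the randomly initialized weights are illustrative assumptions.
rng = np.random.default_rng(1)

context_window = rng.normal(size=11 * 40)   # 11 frames x 40 filterbank features
layer_sizes = [11 * 40, 1024, 1024, 1024, 1024, 2000]  # 4 hidden layers, 2000 states

a = context_window
# Hidden layers with ReLU activations
for fan_in, fan_out in zip(layer_sizes[:-2], layer_sizes[1:-1]):
    W = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    a = np.maximum(0.0, W @ a)

# Linear output layer followed by a softmax over HMM states
W_out = rng.normal(scale=1.0 / np.sqrt(layer_sizes[-2]),
                   size=(layer_sizes[-1], layer_sizes[-2]))
logits = W_out @ a
posteriors = np.exp(logits - logits.max())
posteriors /= posteriors.sum()
print("most likely HMM state:", int(posteriors.argmax()))
```

For the figures cited above, a relative error-rate reduction compares the change against the original error rate: for example, a word error rate falling from 25% to 18% is a (25 − 18)/25 = 28% relative reduction.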