The Concise Encyclopedia of Applied Linguistics. Carol A. Chapelle.

Author: Carol A. Chapelle
Publisher: John Wiley & Sons Limited
Subject: Linguistics
ISBN: 9781119147374
focus of ASR research in the 1980s, this period was also characterized by the reintroduction of artificial neural network (ANN) models, which had been abandoned since the 1960s because of numerous practical problems. Neural networks are loosely modeled on the human neural system. A network consists of interconnected processing elements (units) combined in layers with different weights that are determined on the basis of the training data (see Figure 2). A typical ANN takes an acoustic input, processes it through the units, and produces an output (i.e., recognized text). To classify and recognize the input correctly, the network uses the values of the weights.
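The forward pass described above can be sketched in a few lines of Python. The layer sizes, weight values, and input vector here are invented toy values for illustration, not parameters from any real acoustic model:

```python
import math

def forward(layer_weights, inputs):
    """Propagate an input vector through weighted layers,
    applying a sigmoid activation at each unit."""
    activations = inputs
    for weights in layer_weights:  # one weight matrix per layer
        activations = [
            1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, activations))))
            for row in weights
        ]
    return activations

# Toy network: 3 acoustic features -> 2 hidden units -> 2 output classes
hidden = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
output = [[1.0, -1.0], [-1.0, 1.0]]
scores = forward([hidden, output], [0.2, 0.7, 0.1])
label = max(range(len(scores)), key=lambda i: scores[i])  # best-scoring class
```

The final step, picking the highest-scoring output unit, is the classification decision: in a real recognizer each output unit would correspond to a speech unit such as a phoneme.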

      The main advantage of ANNs lay in the classification of static patterns (including noisy acoustic data), which was particularly useful for recognizing isolated speech units. However, pure ANN‐based systems were not effective for continuous speech recognition, so ANNs were often integrated with HMMs in a hybrid approach (Torkkola, 1994).

      The use of HMMs and ANNs in the 1980s led to considerable efforts toward constructing systems for large‐vocabulary continuous speech recognition. During this time ASR was introduced in public telephone networks, and portable speech recognizers were offered to the public. Commercialization continued in the 1990s, when ASR was integrated into products ranging from PC‐based dictation systems to air traffic control training systems.

[Figure 2. A typical artificial neural network (image not reproduced)]

      The 2000s witnessed further progress in ASR, including the development of new algorithms and modeling techniques, advances in noisy speech recognition, and the integration of speech recognition into mobile technologies. Another recent trend is the development of emotion recognition systems that identify emotions and other paralinguistic content from speech using facial expressions, voice tone, and gestures (Schuller, Batliner, Steidl, & Seppi, 2009; Anagnostopoulos, Iliou, & Giannoukos, 2015). However, one area that has truly revolutionized ASR in recent years is deep learning (Deng & Yu, 2014; Yu & Deng, 2015; Mitra et al., 2017; Zhang et al., 2017).

      Deep learning refers to a set of machine learning techniques and models based on nonlinear information processing and the learning of feature representations. One such model is the deep neural network (DNN), which began gaining widespread adoption in ASR systems around 2010 (Deng & Yu, 2014). Unlike HMMs and traditional ANNs, which rely on a shallow architecture (i.e., one hidden layer) and can only handle context‐dependent, constrained input because of their susceptibility to background noise and to differences between training and testing conditions (Mitra et al., 2017), DNNs use multiple layers of representation for acoustic modeling, which improves speech recognition performance (Deng & Yu, 2014). Recent studies have shown that DNN‐based ASR systems can significantly increase recognition accuracy (Mohamed, Dahl, & Hinton, 2012; Deng et al., 2013; Yu & Deng, 2015) and reduce the relative error rate by 20–30% or more (Pan, Liu, Wang, Hu, & Jiang, 2012). Deep learning architectures are now used in all major ASR systems.
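A brief arithmetic note: the 20–30% figure is a relative reduction, so a baseline word error rate of 10% would fall to roughly 7–8%, not to an absolute rate 20–30 points lower. The example error rates here are illustrative, not taken from the cited studies:

```python
def apply_relative_reduction(error_rate, relative_reduction):
    """Absolute error rate after a relative reduction.
    E.g., a 30% relative cut of a 10% WER yields 7% absolute."""
    return error_rate * (1.0 - relative_reduction)

new_wer = apply_relative_reduction(0.10, 0.30)  # 0.10 * 0.70 = 0.07
```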

      Developing an effective ASR system poses a number of challenges. They include speech variability (e.g., intra- and interspeaker variability such as different voices, accents, styles, contexts, and speech rates), recognition units (e.g., words and phrases, syllables, phonemes, diphones, and triphones), language complexity (e.g., vocabulary size and difficulty), ambiguity (e.g., homophones, word boundaries, syntactic and semantic ambiguity), and environmental conditions (e.g., background noise or several people speaking simultaneously).

      ASR has tremendous potential in applied linguistics. In one application area, that of language teaching, Eskenazi (1999) compares the strengths of ASR to effective immersion language learning in developing spoken‐language skills. ASR‐based systems can provide a way for learners of a foreign language to hear large amounts of the foreign language spoken by many different speakers, produce speech in large amounts, and get relevant feedback. In addition, Eskenazi (1999) suggests that using ASR computer‐assisted language learning (CALL) materials allows learners to feel at greater ease and get more consistent assessment of their skills. ASR can also be used for virtual dialogues with native speakers (Harless, Zier, & Duncan, 1999) and for pronunciation training (Dalby & Kewley‐Port, 1999). Most importantly, learners enjoy ASR applications. Study after study indicates that appropriately designed software that includes ASR is a benefit to language learners in terms of practice, motivation, and the feeling that they are actually communicating in the language rather than simply repeating predigested words and sentences.

      The holy grail of a computer recognition system that matches human speech recognition remains out of reach at present. A number of limitations appear consistently in attempts to apply ASR systems to foreign language‐learning contexts. The major limitation occurs because most ASR systems are designed to work with a limited range of native speech patterns. Consequently, most ASR systems do not do well in recognizing non‐native speech, both because of unexpected phone mapping and because of prosody differences. In one now‐dated study, Derwing, Munro, and Carbonaro (2000) tested Dragon Naturally Speaking's ability to identify errors in the speech of very advanced L2 speakers of English. Human listeners were able to successfully transcribe between 95% and 99.7% of the words, and the program's recognition rate was a respectable 90% for native English speakers. In contrast, the system accurately transcribed only around 70% of the words for the non‐native speakers, who were mostly intelligible to human listeners. Despite problems with L2 speech recognition, recent studies have demonstrated that even imperfect commercial recognizers can be helpful in providing feedback on pronunciation (McCrocklin, 2016; Liakin, Cardoso, & Liakina, 2017).
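Recognition rates like those reported above are typically derived from a word-level edit distance between a reference transcript and the recognizer's output. The cited study does not specify its exact scoring procedure, but a standard word error rate computation can be sketched as follows:

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance (substitutions + insertions + deletions)
    between two word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions ("on" -> "in", "mat" -> "hat") out of 6 reference words
wer = word_error_rate("the cat sat on the mat", "the cat sat in the hat")
```

A recognition rate such as the 90% figure above corresponds to one minus an error rate of this kind, aggregated over many utterances.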

      In addition, ASR systems have been built for word recognition rather than assessment and feedback, and thus many commercial recognition systems offer only implicit feedback on pronunciation but not specific mispronunciation detection. However, most language learners require assessment of the specifics of their pronunciation and specific feedback to make progress. Fortunately, these are topics that are consistently being explored in speech sciences (e.g., Duan, Kawahara, Dantsuji, & Zhang, 2017).