The main advantage of ANNs lay in the classification of static patterns (including noisy acoustic data), which was particularly useful for recognizing isolated speech units. However, pure ANN‐based systems were not effective for continuous speech recognition, so ANNs were often integrated with HMMs in a hybrid approach (Torkkola, 1994).
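In a typical hybrid system of this kind, the ANN estimates posterior probabilities of HMM states for each acoustic frame, and these posteriors are divided by the state priors to obtain scaled likelihoods that stand in for the HMM's emission probabilities during decoding. The following sketch shows only that conversion step; the frame posteriors, priors, and three-state inventory are invented for illustration:

```python
import numpy as np

# Hypothetical posteriors P(state | frame) from a trained ANN:
# one row per acoustic frame, one column per HMM state (values invented).
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

# State priors P(state), e.g., estimated from state frequencies in training data.
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods P(frame | state) are proportional to P(state | frame) / P(state);
# an HMM decoder can use these in place of true emission likelihoods.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)
```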
The use of HMMs and ANNs in the 1980s led to considerable efforts toward constructing systems for large‐vocabulary continuous speech recognition. During this time, ASR was introduced into public telephone networks, and portable speech recognizers were offered to the public. Commercialization continued in the 1990s, when ASR was integrated into products ranging from PC‐based dictation systems to air traffic control training systems.
Figure 2. A simple artificial neural network.
During the 1990s, ASR research focused on extending speech recognition to large vocabularies for dictation, spontaneous speech recognition, and speech processing in noisy environments. This period was also marked by systematic evaluations of ASR technologies based on word or sentence error rates, and by the construction of applications designed to mimic human‐to‐human speech communication by holding a dialogue with a human speaker (e.g., Pegasus and How May I Help You?). Additionally, work on visual speech recognition (i.e., recognition of speech using visual information such as lip position and movements) began and continued after 2000 (Liew & Wang, 2009).
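Word error rate (WER), the metric behind these evaluations, is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch of its computation follows; the example strings are invented:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word-sequence prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# The classic "recognize speech" / "wreck a nice beach" confusion:
# 2 substitutions + 2 insertions over 6 reference words -> WER of about 0.67.
print(word_error_rate("please recognize speech in noisy rooms",
                      "please wreck a nice beach in noisy rooms"))
```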
The 2000s witnessed further progress in ASR, including the development of new algorithms and modeling techniques, advances in noisy speech recognition, and the integration of speech recognition into mobile technologies. Another recent trend is the development of emotion recognition systems, which identify emotions and other paralinguistic content in speech using cues such as voice tone, facial expressions, and gestures (Schuller, Batliner, Steidl, & Seppi, 2009; Anagnostopoulos, Iliou, & Giannoukos, 2015). However, the one area that has truly revolutionized ASR in recent years is deep learning (Deng & Yu, 2014; Yu & Deng, 2015; Mitra et al., 2017; Zhang et al., 2017).
Deep learning refers to a set of machine learning techniques and models that are based on nonlinear information processing and the learning of feature representations. One such model is the deep neural network (DNN), which started gaining widespread adoption in ASR systems around 2010 (Deng & Yu, 2014). Unlike HMM‐based systems and traditional ANNs, which rely on shallow architectures (e.g., a single hidden layer) and can handle only context‐dependent, constrained input because of their susceptibility to background noise and to mismatches between training and testing conditions (Mitra et al., 2017), DNNs use multiple layers of representation for acoustic modeling, which improves speech recognition performance (Deng & Yu, 2014). Recent studies have shown that DNN‐based ASR systems can significantly increase recognition accuracy (Mohamed, Dahl, & Hinton, 2012; Deng et al., 2013; Yu & Deng, 2015) and reduce the relative error rate by 20–30% or more (Pan, Liu, Wang, Hu, & Jiang, 2012); a 25% relative reduction means, for example, a word error rate falling from 10% to 7.5%. Deep learning architectures are now used in all major ASR systems.
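To illustrate what "multiple layers of representation" means in practice, the sketch below defines a small feedforward DNN that maps one frame of acoustic features to posterior probabilities over phone‐like classes. All dimensions and weights here are invented and untrained; a real acoustic model would be trained on large speech corpora and feed its outputs to a decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Invented dimensions: 39 MFCC-style features in, three hidden layers,
# 40 phone-like output classes.
layer_sizes = [39, 256, 256, 256, 40]

# Random (untrained) weights and biases stand in for a trained model.
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def dnn_posteriors(features):
    """Forward pass: each hidden layer re-represents the acoustic input."""
    h = features
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    return softmax(h @ weights[-1] + biases[-1])

frame = rng.normal(size=39)            # one invented acoustic feature vector
print(dnn_posteriors(frame).argmax())  # most likely phone class for this frame
```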
Challenges and Applications of ASR
Developing an effective ASR system poses a number of challenges. They include speech variability (e.g., intra‐ and interspeaker variability such as different voices, accents, styles, contexts, and speech rates), recognition units (e.g., words and phrases, syllables, phonemes, diphones, and triphones), language complexity (e.g., vocabulary size and difficulty), ambiguity (e.g., homophones, word boundaries, syntactic and semantic ambiguity), and environmental conditions (e.g., background noise or several people speaking simultaneously).
Despite these challenges, in recent years numerous companies such as Nuance Communications, Google, Microsoft, Apple, and Amazon have developed and released ASR systems and software packages that have applications in computer system interfaces (e.g., voice control of computers, data entry, dictation), education (e.g., toys, games, language translators, language‐learning software), healthcare (e.g., systems for creating various medical reports, aids for blind and visually impaired patients), telecommunications (e.g., phone‐based interactive voice response systems for banking and information services), the military (e.g., voice control of fighter aircraft), and, increasingly, consumer products and services (e.g., car navigation systems, household appliances, and mobile devices). Some of the best‐known ASR software packages include Dragon NaturallySpeaking, Braina, and LumenVox Speech Recognizer, as well as interactive ASR‐supported systems such as Siri, Cortana, and Alexa.
ASR in Applied Linguistics
ASR has tremendous potential in applied linguistics. In one application area, that of language teaching, Eskenazi (1999) likens the strengths of ASR‐based training to those of effective immersion language learning in developing spoken‐language skills. ASR‐based systems can give learners of a foreign language a way to hear large amounts of the language spoken by many different speakers, to produce speech in large amounts, and to receive relevant feedback. In addition, Eskenazi (1999) suggests that using ASR‐supported computer‐assisted language learning (CALL) materials allows learners to feel more at ease and to receive more consistent assessment of their skills. ASR can also be used for virtual dialogues with native speakers (Harless, Zier, & Duncan, 1999) and for pronunciation training (Dalby & Kewley‐Port, 1999). Most importantly, learners enjoy ASR applications. Study after study indicates that appropriately designed software that includes ASR benefits language learners in terms of practice, motivation, and the feeling that they are actually communicating in the language rather than simply repeating predigested words and sentences.
The holy grail of a computer recognition system that matches human speech recognition remains out of reach at present. A number of limitations appear consistently in attempts to apply ASR systems to foreign‐language‐learning contexts. The major limitation arises because most ASR systems are designed to work with a limited range of native speech patterns. Consequently, most ASR systems do not do well in recognizing non‐native speech, both because of unexpected phone mappings and because of differences in prosody. In one now dated study, Derwing, Munro, and Carbonaro (2000) tested Dragon NaturallySpeaking's ability to identify errors in the speech of very advanced L2 speakers of English. Human listeners successfully transcribed between 95% and 99.7% of the words, and the program's recognition rate was a respectable 90% for native English speakers. In contrast, the system accurately transcribed only around 70% of the words of the non‐native speakers, who were nevertheless largely intelligible to human listeners. Despite problems with L2 speech recognition, recent studies have demonstrated that even imperfect commercial recognizers can be helpful in providing feedback on pronunciation (McCrocklin, 2016; Liakin, Cardoso, & Liakina, 2017).
In addition, ASR systems have generally been built for word recognition rather than for assessment and feedback, and thus many commercial recognition systems offer only implicit feedback on pronunciation rather than explicit mispronunciation detection. However, to make progress, most language learners need assessment of specific features of their pronunciation and targeted feedback. Fortunately, these topics are being actively explored in the speech sciences (e.g., Duan, Kawahara, Dantsuji, & Zhang, 2017).
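One well‐studied technique in that literature is goodness‐of‐pronunciation (GOP) scoring, which compares the recognizer's confidence in the phone the learner intended to produce against its confidence in the best competing phone; low scores flag segments for corrective feedback. The sketch below is a minimal, simplified version of that idea; the phone inventory and posterior probabilities are invented, and a real system would derive them by forced alignment against a trained acoustic model:

```python
import math

# Invented per-phone posterior probabilities for one segment of learner speech,
# as might be obtained from forced alignment with an acoustic model.
phone_posteriors = {"iy": 0.15, "ih": 0.70, "eh": 0.15}

def gop_score(target_phone: str, posteriors: dict) -> float:
    """Goodness of pronunciation: log ratio of the target phone's probability
    to the most probable phone's; 0 is the best possible, lower is worse."""
    return math.log(posteriors[target_phone] / max(posteriors.values()))

# The learner aimed at /iy/ (as in "sheep") but produced something closer
# to /ih/ (as in "ship"); the low score flags a likely mispronunciation.
score = gop_score("iy", phone_posteriors)
print(f"GOP = {score:.2f}")  # negative -> flag for feedback if below a threshold
```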
Automatic Rating of Pronunciation
Historically, many studies have examined whether ASR systems can identify pronunciation errors in non‐native speech and give feedback that can help learners and teachers know what areas of foreign‐language pronunciation are most