4 Perceptual Control of Speech
K. G. MUNHALL1, ANJA‐XIAOXING CUI2, ELLEN O’DONOGHUE3, STEVEN LAMONTAGNE1, AND DAVID LUTES1
1 Queen’s University, Canada
2 University of British Columbia, Canada
3 University of Iowa, United States
There is broad agreement that the American socialite Florence Foster Jenkins was a terrible singer. Her voice was frequently off‐key and her vocal range did not match the pieces she performed. The mystery is how she could not have known this. Many – as suggested by her depiction in the eponymous film directed by Stephen Frears – think it likely that she was unaware of how poorly she sang. The American mezzo‐soprano Marilyn Horne offered this explanation: “I would say that she maybe didn’t know. First of all, we can’t hear ourselves as others hear us. We have to go by a series of sensations. We have to feel where it is” (Huizenga, 2016). This story about Jenkins contains many of the key questions about the topic of this chapter, the perceptual control of speech. Like singing, speech is governed by a control system that requires sensory information about the effects of its actions, and the major source of this sensory feedback is the auditory system. However, the speech we hear is not what others hear, and yet we are able to control our speech motor system to produce what others need or expect to hear. For both speech and singing, much is unknown about the auditory‐motor control system that accomplishes this. What role does hearing your voice play in error detection and correction? How does this auditory feedback processing differ from how others hear you? What role does hearing your voice play in learning to speak?
Human spoken language has traditionally been studied by two separate communities (Meyer, Huettig, & Levelt, 2016): those, including the majority of contributors to this volume, who study the perception of speech signals produced by others, and those who study the production of the speech signal itself. It is the latter that is the focus of this chapter. More specifically, the chapter focuses on the processing of the rich sensory input accompanying talking, particularly hearing your own voice. As Marilyn Horne suggests, perceiving this auditory feedback is not the same as hearing others. Airborne speech sound certainly arrives at the speaker’s ear as it does at the ears of others, but for the speaker it is mixed with sound transmitted through the body (e.g. Békésy, 1949). A second difference between hearing yourself and hearing others is neural rather than physical. The generation of action in speech and other movements is accompanied by information about the motor commands that is transmitted from the motor system to other parts of the brain that might need to know about the movement. One consequence of this distribution of copies of motor commands is that the sensory processing of the effects of a movement is different from the processing of externally generated sensory information (see Bridgeman, 2007, for a historical review).
This chapter addresses a number of issues related to the perceptual control of speech production. We first examine the importance of hearing yourself speak through the study of natural and experimental deafening in humans and birds. This work is complemented by recent work involving real‐time manipulations of auditory feedback through rapid signal processing. Next, we review what is known about the neural processing of self‐produced sound. This includes work on corollary discharge or efference copy, as well as studies showing cortical suppression during vocalizing. Finally, we address the topic of vocal learning and the general question of the relationship between speech perception and speech production. A small number of species, including humans, learn their vocal repertoire. It is important to understand the conditions that promote this learning and also to understand why this learning is so rare. Throughout our review, we will return to research on birdsong, the principal animal model of human vocal production. The literature on birdsong provides exciting new research directions, as extensive projects on the genetic and neural underpinnings of vocal learning demonstrate remarkable similarities to human vocal behavior (Pfenning et al., 2014).
Perceptual feedback processing
The study of the perceptual control of spoken language is an investigation of how behavior is monitored by the actor. Feedback can be viewed as a general process wherein online performance is referenced to a target, goal, or intention. When deviations from these targets are detected, these errors are ideally corrected by the speaker. In language, such errors can take numerous forms. A speaker’s meaning might be poorly formulated and misinterpreted by a listener; a single word might be substituted for another or words might be spoken out of order; single syllables or sounds might be dropped, emphasized incorrectly, or mispronounced. Monitoring of such language behavior is often broadly conceptualized according to a perceptual loop (Levelt, 1983). The perceptual loop model was designed to account for perceived errors at various levels of language production, and consists of three phases: (1) self‐interrupting when an error is detected, (2) pausing or introducing “editing” terms (um, uh, like), and (3) fixing the error.
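The target‐referenced feedback logic just described – compare current output to a target, detect a deviation, and correct it – can be summarized as a simple closed loop. The sketch below is a toy illustration under our own assumptions, not a model drawn from the speech‐production literature: the function name `feedback_correct`, the gain and tolerance values, and the pitch numbers are all hypothetical, chosen only to make the compare‐detect‐correct cycle concrete.

```python
# Toy illustration of target-referenced feedback correction.
# All names and values here are hypothetical, not from the speech literature.

def feedback_correct(produced, target, gain=0.5, tolerance=1.0):
    """If the produced value deviates from the target by more than
    `tolerance`, move it back toward the target by `gain` times the
    error; otherwise leave it unchanged (no correction needed)."""
    error = produced - target
    if abs(error) <= tolerance:
        return produced                   # deviation within tolerance
    return produced - gain * error        # partial, not complete, correction

# A speaker aiming for a 200 Hz vocal pitch who drifts to 210 Hz
# corrects partway back toward the target on each successive attempt.
output = 210.0
for _ in range(5):
    output = feedback_correct(output, target=200.0)
```

Note that the correction is partial (a gain below 1) and stops once the deviation falls inside a tolerance band; as the perturbation studies discussed below show, compensation for altered auditory feedback is likewise typically incomplete.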
Two broad features differentiate such high‐level error detection from other forms of target‐based correction, such as those operating in speech production. First, language‐error correction often interrupts the flow of output, whereas compensation in response to auditory speech feedback perturbations typically does not. Second, language‐error correction typically involves conscious awareness, whereas speech feedback processing generally operates without it.
Two bodies of literature – clinical studies of hearing loss and artificial laboratory perturbation studies – shed light on these unique features of speech feedback processing.
Deafness and perturbations of auditory feedback
Loss of hearing has a drastic impact on the acquisition of speech (Borden, 1979). From the first stage of babbling to adult articulation, speech in those who are profoundly hearing impaired shows distinct acoustic and temporal characteristics. Canonical babbling is delayed in its onset and the number of well‐formed syllables is markedly reduced even after clinical intervention through amplification (Oller & Eilers, 1988). Beyond babbling, Osberger and McGarr (1982) have summarized the patterns of speech errors in children who have significant hearing impairments. While the frequencies of errors (and hearing levels) varied between children, there were consistent atypical segmental productions, including sound omissions, anomalous timing, and distortions of phonemes. These phonetic patterns are accompanied by inconsistent interarticulator coordination (McGarr & Harris, 1980). In addition, there are consistent suprasegmental issues in this population, including anomalies of vocal pitch and vocal‐quality control and inadequate intonation contours (Osberger & McGarr, 1982).
These patterns of deficit most likely arise from the effects of deafness on both the perceptual learning of speech in general and the loss of auditory feedback in vocal learning. Data characterizing speech‐production behavior at different ages of deafness onset could shed some light on the extent to which learning to perceive the sound system, as opposed to learning to hear yourself produce sounds, contributes to the reported deficits. However, there are minimal data on humans that provide a window onto the importance of hearing at different stages of vocal learning. Binnie, Daniloff, and Buckingham (1982) provide a case study of a five‐year‐old who abruptly lost hearing. The child showed modest changes immediately after deafness onset but, over the course of a year, the intelligibility of his speech declined due to distortions in segmental production and prosody. Notably, the child rarely deleted sounds and tended to prolong vowels, perhaps to enhance kinesthetic feedback. While this single case study is not strong evidence regarding the developmental role of auditory feedback, it is noteworthy that the speech representations that govern fluent speech are well developed even at this young age. Speech quality does not immediately degrade.
Data on the age of onset of deafness in other postlingually deafened individuals, those who lost their hearing following the acquisition of speech, indicate that those deafened earlier in life deviate more from the general population, in both suprasegmental and segmental characteristics, than those deafened later.