Engineering Acoustics, Malcolm J. Crocker
4.5 Speech Production
Since speech and hearing must be compatible, it is not surprising to find that the speech frequency range corresponds to the most sensitive region of the ear's response (Section 4.3.2) and generally extends from 100 to 10 000 Hz. The general mechanism for speech generation involves the contraction of the chest muscles to force air out of the lungs and up through the vocal tract. This flow of air is modulated by various components of the vocal mechanism (Figure 4.25) to produce the sounds which make up our speech pattern. The modulation first takes place at the larynx, across which are stretched the vocal cords. These are composed of two bands of membranes separated by a slit which can open and close to modulate the flow of air [56]. The modulation frequency depends upon the tension in the muscles attached to the vocal cords, and on the size of the slit in the membranes (about 24 mm for males and 15 mm for females). The sound emitted by the vocal cords is a buzz-like sound corresponding to a sawtooth waveform containing a large number of harmonically related components.
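The harmonic structure of such a source can be illustrated with a short sketch. It assumes an ideal sawtooth wave, whose Fourier components fall off as 1/n, and a hypothetical fundamental frequency of 120 Hz chosen only for illustration; a real glottal waveform is only approximately a sawtooth.

```python
import math


def sawtooth_harmonic_amplitudes(n_harmonics):
    """Relative amplitudes of the first n harmonics of an ideal sawtooth.

    The Fourier series of an ideal sawtooth contains every integer
    multiple of the fundamental, with amplitude proportional to 1/n.
    """
    return [1.0 / n for n in range(1, n_harmonics + 1)]


def harmonic_table(f0_hz, n_harmonics):
    """Return (frequency in Hz, level in dB re the fundamental) per harmonic."""
    rows = []
    amplitudes = sawtooth_harmonic_amplitudes(n_harmonics)
    for n, amplitude in enumerate(amplitudes, start=1):
        rows.append((n * f0_hz, 20.0 * math.log10(amplitude)))
    return rows


if __name__ == "__main__":
    # f0_hz = 120.0 is an assumed, illustrative fundamental frequency.
    for freq, level in harmonic_table(f0_hz=120.0, n_harmonics=8):
        print(f"{freq:7.1f} Hz  {level:6.1f} dB")
```

Under this idealization the levels drop by about 6 dB per doubling of harmonic number, which is why the source is rich in harmonics well above its fundamental.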
Figure 4.25 Sectional view of the head showing the important elements of the voice mechanism.
This sound/air flow is then further modified by its passage through the numerous cavities of the throat, nose, and mouth, many of which can be changed at will by, for example, changing the position of the tongue or the shape of the lips, to produce a large variety of voiced sounds. It is possible to produce some sounds without the use of the vocal cords, and these are known as unvoiced or breath sounds. These are usually caused by turbulent air flow through the upper parts of the vocal tract and especially past the lips, teeth, and tongue. It is in this way that the unvoiced fricative consonants f and s are formed. In some cases, part of the vocal tract can be blocked by constriction and then suddenly released to give unvoiced plosive consonants such as p and k [57].
Generally, vowels have a fairly definite frequency spectrum whereas many unvoiced consonants such as s and f tend to exhibit very broadband characteristics. Furthermore, when several vowels and/or consonants are joined together their individual spectra appear to change somewhat. The time duration of individual speech sounds also tends to vary widely over a range of 20–300 ms.
In the general context of speech, vowels and consonants become woven together to produce not only linguistically organized words, but also sounds which have a distinctive personal characteristic. The vowels usually have greater energy than consonants and give the voice its character. This is probably because vowels have definite frequency spectra with superimposed periodic short-duration peaks. However, it is the consonants which give speech its intelligibility. It is therefore essential in the design of rooms for speech to preserve both the vowel and consonant sounds for all listeners. Consonants are generally transient, short-duration sounds of relatively low energy. Therefore, for speech, it is necessary to have a room with a short reverberation time to avoid blurring of consecutive consonants; we would therefore expect speech intelligibility to decrease with increasing reverberation time. At the same time, in order to produce a speech signal level well above the reverberant sound level (i.e. a high signal-to-noise ratio), we require increased sound absorption in the room, which also lowers the reverberation time. Although this may lead us to think that an anechoic room would be most suitable for speech intelligibility, some sound reflections are required both to boost the level of the direct sound and to give the listener a feeling of volume. Therefore, an optimum reverberation time is established. This is usually under one second for rooms with volumes under 8500 m³.
If the speech power emitted by a male speaker is averaged over a relatively long period (e.g. five seconds), the overall sound power level is found to be about 75 dB. This corresponds to an average sound pressure level of 65 dB at 1 m from the lips of the speaker and directly in front of him. Converting the sound power level to sound power shows that the long-time averaged sound power for men is about 30 μW. The average female voice is found to emit approximately 18 μW.
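The conversion between sound power and sound power level quoted above can be sketched as follows. The reference power of 1 pW is the standard convention and is an assumption here, since this section does not state it; the function names are illustrative only.

```python
import math

# Assumed reference sound power (standard convention), in watts.
W_REF = 1e-12


def power_to_level(power_w):
    """Sound power level in dB re 1 pW for a sound power in watts."""
    return 10.0 * math.log10(power_w / W_REF)


def level_to_power(level_db):
    """Sound power in watts for a sound power level in dB re 1 pW."""
    return W_REF * 10.0 ** (level_db / 10.0)


if __name__ == "__main__":
    print(f"Male voice, 30 uW:   {power_to_level(30e-6):.1f} dB")
    print(f"Female voice, 18 uW: {power_to_level(18e-6):.1f} dB")
    print(f"75 dB corresponds to {level_to_power(75.0) * 1e6:.1f} uW")
```

With these assumptions, 30 μW corresponds to roughly 74.8 dB and 75 dB to roughly 31.6 μW, so the two figures quoted in the text agree to within rounding.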
However, if we average over a very short time (e.g. 1/8 second), we find that the power emitted in some vowel sounds can be 50 μW, while in some softly spoken consonants it is only 0.03 μW. Generally, the human voice has a dynamic range of approximately 30 dB throughout its frequency range [58]. At maximum vocal effort (loud shouting) the sound power from the male voice may reach 3000 μW.
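As a rough check on these figures, the power ratios can be expressed in decibels. The reference sound power $W_0 = 10^{-12}$ W used for the second line is the standard convention and is an assumption here, since the section does not state it.

$$
\Delta L = 10 \log_{10}\!\left(\frac{50\ \mu\mathrm{W}}{0.03\ \mu\mathrm{W}}\right) \approx 32\ \mathrm{dB}
$$

$$
L_W(\text{shout}) = 10 \log_{10}\!\left(\frac{3000 \times 10^{-6}\ \mathrm{W}}{10^{-12}\ \mathrm{W}}\right) \approx 95\ \mathrm{dB}
$$

The first result is consistent with the quoted dynamic range of approximately 30 dB. The second is 20 dB above the level for a long-time average of 30 μW, since 3000 μW is 100 times that power.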
Table 4.3 gives the long-term rms sound pressure levels at 1 m from the average male mouth for normal vocal effort as given by the American National Standards Institute [59] for both one-third-octave and one-octave bands. Although approximately 80% of the energy in speech lies below 600 Hz (including most vowels), it is at the higher frequencies that most consonants have most of their energy. These low-energy transient consonants contribute to the intelligibility perceived. For example, it has been found [60] that if speech is passed through a high-pass filter having a cutoff frequency of 1000 Hz, then 90% of the spoken words can be understood. However, if the same speech is passed through a low-pass filter, then a cutoff frequency of 3000 Hz is required to produce the same percentage word intelligibility. Speech sounds below 200 Hz and above 6000 Hz do not significantly contribute to intelligibility, but they do add to the natural qualities of the voice [57]. Calculation of the intelligibility of speech is discussed in Chapter 6.
Table 4.3 Male voice speech sound pressure levels +12 dB at 1 m from the lips for both one-third-octave and one-octave bands. These levels represent the speech peaks that contribute to intelligibility. The voice peak sound power levels, LW,pk, can be evaluated by adding 10.8 dB to the tabulated values, as shown for the octave bands.
Center frequency, Hz | Lp,pk (one-third-octave), dB | Lp,pk (octave), dB | LW,pk (octave), dB |
---|---|---|---|
200 | 67.0 | | |
250 | 68.0 | 72.5 | 83.3 |
315 | 69.0 | | |
400 | 70.0 | | |
500 | 68.5 | 74.0 | 84.8 |
630 | 66.5 | | |
800 | 65.0 | | |
1000 | 64.0 | 68.0 | 78.8 |
1250 | 62.0 | | |
1600 | 60.5 | | |
2000 | 59.5 | 62.0 | 72.8 |
2500 | 58.0 | | |
3150 | 56.0 | | |
4000 | 53.0 | 57.0 | 67.8 |
5000 | 51.0 | | |
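The relationship between the octave-band columns of Table 4.3 can be sketched in a few lines. The 10.8 dB offset is taken from the table caption; the function names and the energy summation of the bands into one overall level are illustrative additions, not part of the cited standard.

```python
import math

# Offset between peak sound pressure level at 1 m and peak sound
# power level, as given in the Table 4.3 caption.
OFFSET_DB = 10.8

# Octave-band peak sound pressure levels from Table 4.3 (Hz -> dB).
OCTAVE_LP_PK = {250: 72.5, 500: 74.0, 1000: 68.0, 2000: 62.0, 4000: 57.0}


def peak_power_levels(lp_by_band, offset_db=OFFSET_DB):
    """Add the caption offset to each band level to obtain LW,pk."""
    return {f: round(lp + offset_db, 1) for f, lp in lp_by_band.items()}


def energy_sum(levels_db):
    """Combine band levels into one overall level by summing energies."""
    return 10.0 * math.log10(sum(10.0 ** (lv / 10.0) for lv in levels_db))


if __name__ == "__main__":
    for freq, lw in peak_power_levels(OCTAVE_LP_PK).items():
        print(f"{freq:5d} Hz  LW,pk = {lw:.1f} dB")
    overall = energy_sum(OCTAVE_LP_PK.values())
    print(f"Overall peak Lp at 1 m: {overall:.1f} dB")
```

The first function reproduces the LW,pk column. The energy sum of the five octave bands comes to about 77 dB; if the +12 dB in the caption is read as an offset above the long-term rms level, subtracting it gives about 65 dB, in line with the average level at 1 m quoted in the text.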
Since speech is emitted from the mouth, it is not surprising to find that the acoustic radiation from this small aperture set in a larger object (the head) is subject to fairly strong directivity effects. These directivity effects become more marked at high frequencies. Figures 4.26 and 4.27 show the relative A-weighted sound pressure levels for the human voice in the horizontal and vertical planes, respectively. These experiments were conducted by Chu and Warnock on 40 adults, 20 male and 20 female [61]. Directivity effects can become important for audience members seated at the ends of the front rows of an auditorium, since they will receive considerably less of the direct sound at high frequencies. This can considerably reduce the intelligibility of speech.
Figure 4.26 Directivity patterns for the human voice in a horizontal plane.
(Source: From Ref. [61] with permission.)
Figure 4.27 Directivity patterns for the human voice in a vertical plane.
(Source: From Ref. [61] with permission.)