
Human Speech Intelligibility Measurements over VoIP Channels

Laura Fernández Gallardo

Quality and Usability Lab, Telekom Innovation Labs, TU Berlin, Germany, Email: laura.fernandezgallardo@tu-berlin.de

Abstract

VoIP (voice over Internet Protocol) transmissions are able to deliver wideband (50-7,000 Hz) and super-wideband (50-14,000 Hz) voice, which brings manifold advantages over conventional narrowband channels (300-3,400 Hz). It has been found that, while perceived speech quality, speaker recognition, and automatic speech recognition performance are improved, human speech intelligibility can also benefit from the extended bandwidths.

However, it still remains unclear whether human intelligibility can be enhanced with super-wideband and full-band quality with respect to wideband, and whether it is significantly affected by coded-decoded speech. This paper presents an auditory test conducted with 30 participants in which intelligibility was assessed over 27 channels of different bandwidths, codecs, and bit rates.

This test was based on a closed set of vowel-consonant-vowel logatomes with eight alternatives. Furthermore, it is shown that the subjective intelligibility scores can be well predicted by POLQA mean opinion scores and POLQA-based intelligibility objective measures.

Introduction

In recent years, we have witnessed substantial deployment of digital communication channels. Conventional narrowband (NB) telephony took advantage of the lower frequencies of the voice spectrum, promoting the standard 300-3,400 Hz bandwidth of the public switched telephone network (PSTN). A number of voice over internet protocol (VoIP) channels were more recently developed to deliver narrowband and wideband (WB) speech, the latter being the result of expanding the NB bandwidth to 50-7,000 Hz, offering a more natural voice. Also, to satisfy the demand of even higher quality for other signals, such as music, efforts have been made towards extending WB codecs to super-wideband (SWB) and full-band (FB) codecs, operating in the 50-14,000 Hz and 20-20,000 Hz frequency ranges, respectively [1, 2].

It has been shown that widening the telephony spectrum from NB to WB results in about 30% improvement in perceptual speech quality [3]. It was later found in [4] that SWB offers 39% increased quality in comparison to WB and 79% in comparison to NB. The extended bandwidth also allows for improved speaker recognizability. Results from several listening tests suggest that known speakers can be more easily identified in WB compared to NB [5], and advantages of SWB channels over NB and WB have been found for automatic speaker recognition [6] (Section 5.2.1). Automatic speech recognition also benefits from the enhanced communication bandwidths [7, 8].

The mentioned advantages and the fact that many high-frequency sounds are critical for human speech intelligibility [9] can imply an improved intelligibility performance when listening to voices in the extended bandwidths compared to NB [10]. However, only one study exists, to the best of our knowledge, that examined the difference in phoneme intelligibility performance over different bandwidths [11]. That study detected a superior performance in WB (G.722 codec) compared to NB (AMR-NB codec). Other investigations have only considered the effect of NB degradations [12, 13] or of codecs operating on speech sampled at 16 kHz [14]. It still remains to be shown whether the superiority of WB holds for a wider variety of codecs and bitrates, and whether the switch from WB to SWB transmissions brings a further increase in intelligibility performance.

The present contribution reports a listening test to measure human speech intelligibility, where the speech stimuli were degraded through 23 channel distortions involving different bandwidth filtering, coding schemes, and bitrates. The original speech signals and three downsampled versions were also considered as experiment conditions. Our main objective was to assess the differences in intelligibility performance over NB, WB, and SWB, while possibly identifying particularities in the performance delivered by some codecs or bitrates. A listening test employing a closed set of nonsense vowel-consonant-vowel (VCV) logatomes was considered suitable for this investigation. Since consonants are crucial for intelligible speech, we opted for employing monosyllabic stimuli that vary the middle consonant while keeping the enclosing vowel sounds unchanged. Other intelligibility studies also employing nonsense combinations of vowels and consonants examine the effects of noise [15, 16] or of bandwidth filtering [9].

The choice of a logatome-based intelligibility test enabled us to compare the subjective results to objective predictions given by the POLQA intelligibility model (V1490intellV2). This model is a further development of PESQ intelligibility [17] using the latest developments in the objective assessment of speech quality [18, 19].

Our intelligibility results were also compared to objective transmitted speech quality predictions made by the POLQA standard V2.4.1 (objective MOS).

Audio material

Eight different VCV logatomes were chosen for the intelligibility test, varying the middle consonant: "ama", "aba", "afa", "ana", "apa", "asa", "awa", and "ascha". The choice of these logatomes was based on the high phoneme confusions previously found in [11].

DAGA 2017 Kiel


The logatomes were extracted from words in purposely created sentences recorded by 4 German speakers (2m, 2f, age range 25–36 years). The recordings were made with a sampling frequency of 48 kHz in clean conditions. The test stimuli are thus excerpts of natural speech, which can presumably be less carefully articulated than words or logatomes spoken in isolation as in the OLLO logatome speech database [16, 11], but which on the other hand reflect a realistic pronunciation.

The 32 excerpts (4 speakers x 8 logatomes) were transmitted through 23 channel conditions via software simulation. These include uncoded and coded-decoded speech of three telephone bandwidths (NB, WB, and SWB) at different bitrates. In addition, conditions with no distortion applied, i.e. direct speech sampled at 8, 16, 32, or 48 kHz, were examined. The condition names can be seen in Figure 1, which displays the intelligibility test results.

The channel transmissions involved downsampling the speech signal with an anti-aliasing filter to 8, 16, or 32 kHz for the NB, WB, or SWB conditions, respectively.

The speech was then level-equalized 26 dB below the overload of the digital system (-26 dBov), a characteristic level of telephone channels, using the voltmeter algorithm of ITU-T Rec. P.56. For the channel degradations, a bandwidth filter was applied, complying with ITU-T Rec. G.712 for NB, ITU-T Rec. P.341 for WB, and 14KBP for SWB. Finally, codecs were applied at different bitrates for the conditions involving a codec, and the speech was again level-equalized to -26 dBov.
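The level-equalization step can be illustrated with a minimal sketch. It is not the P.56 voltmeter algorithm: P.56 measures the active speech level (pauses excluded), whereas the sketch below uses a plain RMS over the whole signal as a simplifying assumption. The function names and the test tone are hypothetical, and samples are assumed to be floats in [-1.0, 1.0], so that an RMS of 1.0 corresponds to 0 dBov.

```python
import math

def rms_dbov(samples):
    """Plain RMS level in dB relative to overload (assumes floats in [-1.0, 1.0]).

    Simplification: this is NOT the ITU-T P.56 active speech level.
    """
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        raise ValueError("signal is silent; level is undefined")
    return 10.0 * math.log10(mean_square)

def normalize_to_dbov(samples, target_dbov=-26.0):
    """Scale the signal so that its RMS level equals target_dbov."""
    gain_db = target_dbov - rms_dbov(samples)
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]

# Hypothetical input: one second of a 440 Hz tone sampled at 48 kHz.
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 48000.0) for n in range(48000)]
levelled = normalize_to_dbov(tone)
print(round(rms_dbov(levelled), 3))
```

For real speech with pauses, an RMS over the whole file underestimates the active level, which is the reason the P.56 algorithm is the appropriate tool in practice.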

Subjective intelligibility ratings

Intelligibility test setup

The complete set of stimuli presented in the intelligibility test consisted of 4 speakers x 8 VCV logatomes x 27 conditions = 864 segments. The task for the test participants was to choose among the eight logatome alternatives after listening to each stimulus. There was no possibility to listen to the stimuli more than once. Short breaks were included approximately every 15 minutes to avoid listeners' loss of focus. A brief familiarization phase was conducted before the actual test started. In the familiarization, the listeners clicked on each logatome button to hear a sample as many times as they wished.

The test was performed by 30 listeners (15m, 15f), mean age 25 years (range 18–38 years) and with German as mother tongue. The complete test session had a duration of about one hour. It was performed in a 54 m² acoustically treated listening room (room Pinta in the Telefunken building of TU Berlin) using a laptop and Shure SRH240 headphones (diotic listening, frequency range 20–20,000 Hz). Listeners were not allowed to control the speech loudness level.

Intelligibility test results

Figure 1: Human intelligibility accuracy across channel conditions.

The performance of the group of listeners was computed as the percentage of correct answers calculated over all speakers and logatomes for each condition. Figure 1 displays the obtained accuracies sorted from high to low.
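The accuracy computation described above can be sketched as follows. The response tuples and the two condition labels are hypothetical toy data, not results from the test; the sketch only shows pooling answers per condition and sorting the resulting percentages from high to low.

```python
from collections import defaultdict

def accuracy_per_condition(responses):
    """Percentage of correct answers per condition.

    responses: iterable of (condition, presented_logatome, answered_logatome),
    pooled over all listeners, speakers, and logatomes.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, presented, answered in responses:
        total[condition] += 1
        if presented == answered:
            correct[condition] += 1
    return {c: 100.0 * correct[c] / total[c] for c in total}

# Hypothetical toy responses (not data from the paper).
responses = [
    ("NB", "aba", "awa"),
    ("NB", "afa", "afa"),
    ("WB", "aba", "aba"),
    ("WB", "afa", "afa"),
]
scores = accuracy_per_condition(responses)
# Sort conditions from highest to lowest accuracy.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```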

It can be observed that better intelligibility performance is obtained with speech of greater bandwidth and with codecs operating at a higher bitrate. Consistently for each bandwidth, the condition which involves a bandwidth filter and no codec offers better intelligibility accuracy than the rest of channels of the same bandwidth.

The non-parametric Kruskal-Wallis test was applied in order to detect statistically significant differences in performance (p < .05) comparing the channel conditions.
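A minimal sketch of the Kruskal-Wallis H statistic is given below to make the test concrete. The unit of observation is an assumption (per-listener accuracy for each condition; the text does not specify it), the numbers are invented, and the function name is hypothetical. With k groups, H is compared against a chi-square distribution with k − 1 degrees of freedom; for three groups the 5% critical value is 5.991. The chi-square approximation is coarse for samples as small as in this toy example.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: list of samples, each a list of numbers.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                                # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3.0 * (n + 1)
    correction = 1.0 - tie_term / float(n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction

# Hypothetical per-listener accuracies (%) for three conditions.
nb = [81.0, 84.0, 86.0]
wb = [90.0, 92.0, 93.0]
swb = [95.0, 96.0, 97.0]
h = kruskal_wallis_h([nb, wb, swb])
significant = h > 5.991  # chi-square critical value, df = 2, alpha = .05
print(round(h, 3), significant)
```

Pairwise comparisons between individual conditions, as reported in the following paragraphs, would additionally require a post-hoc procedure, which this sketch does not cover.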

With respect to the benefits of the switch from NB to WB, the three WB conditions WB-Speex at 42.2 kbit/s, G.722 at 64 kbit/s, and P.341filter permit a significantly better performance compared to the tested NB codecs at a bit rate of 11 kbit/s or lower. The rest of WB conditions (except for WB-Speex at 3.95 kbit/s) offer significantly better performance than that of AMR-NB at 4.75 kbit/s and of Speex at 2.15 kbit/s. In view of the test results, the significant advantages of the extended bandwidth are only manifested with WB codecs operating at a sufficient bitrate with respect to NB codecs at low bitrates. The performance of the Speex codec is markedly worse than other codecs at similar bitrates in NB and in WB. Other disadvantages of the Speex codec have previously been shown in [20, 8].

Among the conditions tested, the SWB performance is statistically similar to that in WB, except for the WB-Speex at 3.95 kbit/s. This suggests that the relevant frequencies contributing to human intelligibility are already included in the WB bandwidth. Comparing SWB to NB, the worst performing SWB conditions, G.722.1C at 24 and at 48 kbit/s, are only significantly better than NB-Speex at 2.15 kbit/s. The best performing condition, 14KBPfilter, offers significantly higher accuracy than all NB conditions except for G.711 at 64 kbit/s and G.712filter.


High rates of logatome confusions have been found between "aba"-"awa" (in both directions, "w" being better recognized than "b"), and "afa"-"asa" (when "afa" was presented). A remarkable reduction of confusions could be identified with the switch from NB to WB, especially for "awa"-"aba" (when "awa" was presented) and for "afa"-"asa" (when "afa" was presented), as also found in [11]. This decrease of confusions between "afa"-"asa" was expected, as the phones "f" and "s" present similar spectral characteristics in the lower frequencies but spectral peaks at different distinctive locations over 6 kHz [21, 22]. High confusion rates between "aba"-"awa" still prevail for SWB channels and "Direct" conditions with $f_s \geq 16$ kHz. The voiced bilabial stop "b" and the voiced labiodental fricative "w" appear to have spectral similarities [21] with decreasing energy until approximately 3 kHz.
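Directional confusion rates of the kind discussed above can be tallied as in the following sketch. The trial list is invented toy data and the function names are hypothetical; the point is only that the rate for a pair such as "afa"-"asa" depends on which logatome was presented.

```python
from collections import Counter

def confusion_counts(trials):
    """Count (presented, answered) pairs over all trials."""
    return Counter((presented, answered) for presented, answered in trials)

def confusion_rate(counts, presented, answered):
    """Share of presentations of `presented` that were answered as `answered`."""
    n_presented = sum(c for (p, _), c in counts.items() if p == presented)
    if n_presented == 0:
        return 0.0
    return counts[(presented, answered)] / n_presented

# Hypothetical toy trials (not data from the paper).
trials = [
    ("aba", "awa"), ("aba", "aba"),
    ("awa", "awa"), ("awa", "awa"),
    ("afa", "asa"), ("afa", "afa"), ("afa", "afa"), ("afa", "afa"),
]
counts = confusion_counts(trials)
print(confusion_rate(counts, "aba", "awa"))  # "aba" presented, "awa" answered
print(confusion_rate(counts, "awa", "aba"))  # opposite direction
print(confusion_rate(counts, "afa", "asa"))
```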

Figure 2: Subjective intelligibility scores vs. POLQA-intelligibility model estimations and second-order fit. $q_{\mathrm{POLQA\,INTELL}}(x) = 92.1 + 27.2x - 7.6x^2$, $R^2 = 0.870$, $\mathrm{RMSE} = 2.10$.

Subjective and objective intelligibility

The subjective intelligibility scores obtained in the test are compared to the objective scores provided by the POLQA intelligibility model (V1490intellV2) and to the MOS provided by the POLQA standard (V2.4.1). Both models operated with an input speech file containing the eight logatomes concatenated, uttered by one speaker.

The models were applied for each channel condition and the resulting scores were then averaged across the four speakers.

Our results reveal that a second-order curve can be fit to the pairs of subjective vs. objective measures with a remarkably high $R^2$. The fit is slightly better when the subjective intelligibility predictions are made by the POLQA intelligibility model compared to when those are made by the POLQA MOS. This possibility of predicting subjective test results might be useful for network engineers in the communication channel design process, when subjective testing costs are prohibitively high.

Figure 3: Subjective intelligibility scores vs. POLQA MOS estimations and second-order fit. $q_{\mathrm{POLQA\,MOS}}(x) = 92.1 + 26.5x - 9.3x^2$, $R^2 = 0.858$, $\mathrm{RMSE} = 2.20$.

The curves and data points are plotted in Figures 2 and 3, for POLQA intelligibility and for POLQA MOS, respectively. The displayed texts indicate data points detected as potential outliers, with high leverage or large absolute residual value. According to Figure 2, the AMR-NB codec at 4.75 kbit/s offers a greater subjective intelligibility score than predicted by the model, with a higher residual value than other data points, in contrast to the outliers corresponding to the Speex codec. Figure 3 shows that the relation between human intelligibility and quality generally holds for the channel conditions tested.
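The second-order fit and its goodness-of-fit measures can be sketched as below. This is an illustration under stated assumptions, not the fitting procedure used for Figures 2 and 3: the data points are synthetic (generated without noise from the curve reported in Figure 2, on a hypothetical x range), RMSE is taken as the root of the mean squared residual without a degrees-of-freedom adjustment, and all function names are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_quadratic(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*x + c2*x**2."""
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    moments = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    normal_matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(normal_matrix, moments)

def goodness_of_fit(xs, ys, coeffs):
    """Return (R^2, RMSE) of the quadratic given by coeffs."""
    predicted = [coeffs[0] + coeffs[1] * x + coeffs[2] * x * x for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))

# Synthetic, noise-free points on the curve reported in Figure 2
# (hypothetical x values; not the measured data).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [92.1 + 27.2 * x - 7.6 * x * x for x in xs]
coeffs = fit_quadratic(xs, ys)
r2, rmse = goodness_of_fit(xs, ys, coeffs)
print([round(c, 3) for c in coeffs], round(r2, 3), round(rmse, 3))
```

With measured scores in place of the synthetic points, large absolute residuals `y - p` would flag candidate outliers of the kind labelled in the figures.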

Conclusions

A closed-response intelligibility test employing eight VCV logatomes was conducted with 30 listeners. The speech stimuli were degraded by 23 channel conditions involving NB, WB, and SWB, and included undistorted and downsampled speech. The gain in intelligibility accuracy is more salient for the transition from NB to WB than from WB to SWB, which indicates that the frequency components critical for consonant intelligibility are found in the WB bandwidth (50–7,000 Hz). Significant differences (p < .05) have only been found comparing WB conditions at a high bitrate to NB conditions at a low bitrate.

The quadratic correspondences between subjective and objective intelligibility scores have been shown to be high, with $R^2 = 0.870$, RMSE = 2.10 and $R^2 = 0.858$, RMSE = 2.20 when the objective scores were predicted by POLQA intelligibility (V1490intellV2) and by POLQA MOS (V2.4.1), respectively. This result highlights the goodness of these objective measures to estimate subjective intelligibility scores.


Acknowledgements

The author would like to thank Mr. John Beerends for his valuable support and for kindly providing the objective intelligibility scores.

References

[1] M. Tammi, L. Laaksonen, A. Rämö, and H. Toukomaa, “Scalable Superwideband Extension for Wideband Coding,” in ICASSP, 2009, pp. 161–164.

[2] J.-M. Valin, T. B. Terriberry, and G. Maxwell, “A Full Bandwidth Audio Codec with Low Complexity and Very Low Delay,” in European Signal Processing Conference (EUSIPCO), 2009, pp. 1254–1258.

[3] S. Möller, A. Raake, N. Kitawaki, A. Takahashi, and M. Wältermann, “Impairment Factor Framework for Wideband Speech Codecs,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1969–1976, 2006.

[4] M. Wältermann, I. Tucker, A. Raake, and S. Möller, “Extension of the E-Model Towards Super-Wideband Speech Transmission,” in ICASSP, 2010, pp. 4654–4657.

[5] L. Fernández Gallardo, S. Möller, and M. Wagner, “Human Speaker Identification of Known Voices Transmitted Through Different User Interfaces and Transmission Channels,” in ICASSP, 2013, pp. 7775–7779.

[6] L. Fernández Gallardo, Human and Automatic Speaker Recognition over Telecommunication Channels, ser. T-Labs Series in Telecommunication Services. Singapore: Springer-Verlag, 2016.

[7] A. V. Ramana, L. Parayitam, and M. S. Pala, “Investigation of Automatic Speech Recognition Performance and Mean Opinion Scores for Different Standard Speech and Audio Codecs,” IETE Journal of Research, vol. 58, no. 2, pp. 121–129, 2012.

[8] L. Fernández Gallardo, S. Möller, and J. G. Beerends, “Predicting Automatic Speech Recognition Performance over Communication Channels from Instrumental Speech Quality and Intelligibility Scores,” submitted to Interspeech 2017, 2017.

[9] R. P. Lippmann, “Accurate Consonant Perception Without Mid-Frequency Speech Energy,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 66–69, 1996.

[10] J. Rodman, “The Effect of Bandwidth on Speech Intelligibility,” 2003, Polycom, White Paper.

[11] L. Fernández Gallardo and S. Möller, “Phoneme Intelligibility in Narrowband and in Wideband Channels,” in Annual German Congress on Acoustics (DAGA), 2015, pp. 121–124.

[12] M. F. Spiegel, M. J. Altom, M. J. Macchi, and K. L. Wallace, “Comprehensive Assessment of the Telephone Intelligibility of Synthesized and Natural Speech,” Speech Communication, vol. 9, no. 4, pp. 279–291, 1990.

[13] Y. Teng and R. F. Kubichek, “Speech Intelligibility Evaluation of Low Bit Rate Speech Codecs,” in 12th Digital Signal Processing Workshop - 4th Signal Processing Education Workshop, 2006, pp. 251–256.

[14] E. Jokinen, J. Lecomte, N. Schinkel-Bielefeld, and T. Bäckström, “Intelligibility Evaluation of Speech Coding Standards in Severe Background Noise and Packet Loss Conditions,” in ICASSP, 2015, pp. 5152–5156.

[15] V. Hazan and A. Simpson, “The Effect of Cue-Enhancement on Consonant Intelligibility in Noise: Speaker and Listener Effects,” Language and Speech, vol. 43, no. 3, pp. 273–284, 2000.

[16] B. T. Meyer, T. Jürgens, T. Wesker, T. Brand, and B. Kollmeier, “Human Phoneme Recognition Depending on Speech-Intrinsic Variability,” The Journal of the Acoustical Society of America, vol. 128, no. 5, pp. 3126–3141, 2010.

[17] J. G. Beerends, R. A. van Buuren, J. van Vugt, and J. A. Verhave, “Objective Speech Intelligibility Measurement on the Basis of Natural Speech in Combination with Perceptual Modeling,” Journal of the Audio Engineering Society, vol. 57, no. 5, pp. 299–308, 2009.

[18] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, and M. Keyhl, “Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II – Perceptual Model,” Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 385–402, 2013.

[19] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, and M. Keyhl, “Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I – Temporal Alignment,” Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366–384, 2013.

[20] C. Hoene, J. M. Valin, K. Vos, and J. Skoglund, “Summary of Opus Listening Test Results,” Internet Engineering Task Force (IETF), Tech. Rep., 2011.

[21] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice Hall, 1993, ch. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization, pp. 11–37.

[22] A. Jongman, R. Wayland, and S. Wong, “Acoustic Characteristics of English Fricatives,” The Journal of the Acoustical Society of America, vol. 108, no. 3, pp. 1252–1263, 2000.
