
The PAVOQUE corpus as a resource for analysis and synthesis of expressive speech


Academic year: 2022

Ingmar Steiner¹⁻³, Marc Schröder³, Annette Klepp¹,³
steiner@coli.uni-saarland.de

¹ Multimodal Computing & Interaction
² Saarland University
³ German Research Center for Artificial Intelligence

Overview

We announce the release of the PAVOQUE corpus, a single-speaker, multi-style database of German speech, designed for analysis and synthesis of expressive speech. The corpus has previously been used for voice conversion [5] and expressive text-to-speech synthesis [1, 4]. The full corpus data is now being made available to the public under a Creative Commons license. It is hosted at https://github.com/marytts/pavoque-data.

Corpus composition

The bulk of the corpus consists of general-domain sentences (A) automatically extracted from Wikipedia using a greedy algorithm optimizing for phonetic and prosodic coverage [2]; these were spoken in a neutral, "news-reading" style. 375 more of these (B) are common to all styles. A number of domain-specific utterances (C) were spoken as well.

style          A      B      C    total   time
neutral     2639    375    112    3126    321 min
cheerful      25    375    184     584     46 min
depressed     25    375    156     556     55 min
aggressive    25    375    201     601     45 min
poker         25    375    175     575     49 min

Overall, 5442 utterances (8 h 37 min) are available in five different speaking styles.

Speaker and recordings

Stefan Röttig, a male native speaker of German trained as a professional actor and baritone opera singer, was hired to produce the corpus. The recordings were carried out in a sound-proof studio, over multiple sessions, with a sampling rate of 44.1 kHz at 24 bits per sample. (src: www.stefan-roettig.de)

All utterances were automatically transcribed using MaryTTS [3]; the phone-level segmentation was manually verified by phonetically trained research assistants.

Selected statistics

[Figure: three panels showing articulation rate (phones/s), F0 (Hz), and spectral slope (dB), each with the neutral, cheerful, depressed, and aggressive styles.]

References

[1] P. Gebhard, M. Schröder, M. Charfuelan, C. Endres, M. Kipp, S. Pammi, M. Rumpler, and O. Türk. "IDEAS4Games: building expressive virtual characters for computer games". In: 8th International Conference on Intelligent Virtual Agents (IVA). Tokyo, Japan, 2008, pp. 426–440. DOI: 10.1007/978-3-540-85483-8_43.

[2] A. Hunecke. "Optimal Design of a Speech Database for Unit Selection Synthesis". Diploma thesis. Saarbrücken, Germany: Saarland University, 2007.

[3] M. Schröder, M. Charfuelan, S. Pammi, and I. Steiner. "Open source voice creation toolkit for the MARY TTS platform". In: Interspeech. Florence, Italy, 2011, pp. 3253–3256.

[4] I. Steiner, M. Schröder, M. Charfuelan, and A. Klepp. "Symbolic vs. acoustics-based style control for expressive unit selection". In: 7th ISCA Tutorial and Research Workshop on Speech Synthesis (SSW). Kyoto, Japan, 2010, pp. 114–119.

[5] O. Türk and M. Schröder. "Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques". In: IEEE Transactions on Audio, Speech, and Language Processing 18.5 (2010), pp. 965–973. DOI: 10.1109/TASL.2010.2041113.
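
The poster states only that the general-domain sentences were chosen with a greedy algorithm optimizing for phonetic and prosodic coverage [2]; it does not give the procedure. As a rough illustration of the general idea of greedy coverage selection, and not a reconstruction of the method in [2], the sketch below repeatedly picks the candidate sentence that adds the most units not yet covered. The function name `greedy_select`, the unit labels, and the stopping rule are all assumptions made for this example.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], max_sentences: int) -> List[str]:
    """Greedy coverage selection (illustrative sketch only).

    `candidates` maps a sentence ID to the set of units it contains.
    The unit labels are hypothetical; in practice they could be
    diphones or prosodic contexts.
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < max_sentences:
        # Iterate over sorted IDs so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds new coverage
        selected.append(best)
        covered |= gain
        del remaining[best]
    return selected


# Toy candidate pool with made-up unit labels.
toy_pool = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "b-c"},
    "s3": {"d-e", "e-f"},
    "s4": {"c-d"},
}
print(greedy_select(toy_pool, 3))
```

In this toy pool the loop stops after two sentences, because the remaining candidates contribute no new units. A real selection would also need a unit inventory derived from an automatic transcription and some weighting of rare units, neither of which is described on the poster.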
