
1. INTRODUCTION

1.2. Structure of the dissertation

The present dissertation consists of an introductory part and copies of eight articles. The introductory part has seven chapters.

Chapter I introduces the problems and structure of the dissertation and provides a short overview of the publications and the author’s contribution to the co-authored articles. It also explains the terms and concepts related to the temporal structure of speech.

Chapter II gives an overview of speech synthesis strategies, theories of speech timing, and the principles for selecting factors and features in the modelling of speech timing.

Chapter III contains a brief overview of the previous research on the temporal structure of Estonian speech: the treatment of the quantity degrees, micro-prosodic features of the segments (intrinsic durations), pauses and prepausal lengthening.

Chapter IV describes the data used in the present research.

Chapter V is devoted to the statistical methods used for predicting durations. It also gives an overview of the statistical software packages used in this research.

Chapter VI provides a description of the results of a number of modelling experiments, including the prediction of the duration and location of pauses in the speech flow. Significant features are selected for the modelling of segmental durations, and related issues of word prosody are analysed. Various statistical models are described, and the relevance and predictive precision of the models are tested. Finally, a comparison of methods for modelling segmental durations is presented.

Chapter VII contains conclusions and directions for further research.

1.3. Brief overview of the articles and the author’s contribution to the co-authored works

The present dissertation consists of eight scientific articles. The following is a brief overview of the articles and of the author’s contribution to the co-authored works. The co-authors of [P1], [P2] and [P4] have seen and accepted this overview.

[P1] deals with issues related to the modelling of prosody for the Estonian text-to-speech synthesiser: modelling the intonation of questions with the kas-particle, initial notes on how pauses and prepausal lengthening are related to the text structure, and the first modelling of phone durations using regression analysis. The author wrote the analysis of pauses and prepausal lengthening, prepared the modelling data and interpreted the results.


[P2] presents the statistical modelling of segmental durations for speech synthesis, using regression analysis. The author wrote the analysis of pauses and established the link between pauses and the text structure. The author also contributed to preparing the material for the regression model, gathering expert opinions on significant features, and presenting these features in the context of regression analysis.

[P3] concentrates on the analysis of pauses and prepausal lengthening in connected speech and the modelling of pauses and their location in the speech flow. The author wrote the article and carried out the experiments. Jüri Kuusik consulted on the application of logistic regression to the input data.

[P4] deals with the modelling of intonation based on morphological, syntactic and part-of-speech features using the linear regression method, and provides an analysis of pauses and breathing in speech. Pauses are treated as units which mark the boundaries of prosodic groups. The author focused on the theory, the statistical modelling of the fundamental frequency and the related analysis of the speech material. Hille Pajupuu analysed the pauses and breathing in the speech flow and determined sentence stresses. Krista Kerge carried out a syntactic analysis of the sentences and interpreted the generated models.

[P5] contains a longer treatment of pauses in Estonian speech as well as the modelling of pause durations based on classical regression analysis, the classification and regression tree (CART) method and neural networks. Logistic regression was used to predict the location of pauses.

[P6] provides a comparison of various statistical prediction methods (linear regression, CART method and neural networks) in terms of their predictive error, model interpretability, preliminary data processing and other criteria.
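As a hedged illustration of such a comparison (not the code or data of [P6]), the sketch below fits the three model families named above to synthetic pause-duration data and compares their root-mean-square error. The feature set and data are invented for illustration, and scikit-learn is assumed to be available.

```python
# Minimal sketch of a three-way method comparison on synthetic pause durations.
# All features, coefficients and data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 20, n),       # hypothetical: words since sentence start
    rng.integers(0, 10, n),       # hypothetical: words since previous pause
    rng.normal(400.0, 80.0, n),   # hypothetical: previous foot duration (ms)
    rng.integers(0, 2, n),        # hypothetical: punctuation present (0/1)
])
# Synthetic pause durations (ms) with noise, standing in for annotated speech.
y = 150 + 10 * X[:, 1] + 120 * X[:, 3] + rng.normal(0.0, 30.0, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "linear regression": LinearRegression(),
    "CART": DecisionTreeRegressor(max_depth=5, random_state=0),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: RMSE = {rmse:.1f} ms")
```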

[P7] investigates whether the prediction of durations in Estonian, a morphologically rich language, is facilitated not only by morphological information but also by information on parts of speech and syntax.

[P8] focuses on the selection principles of significant features for modelling the temporal structure of speech for text-to-speech synthesis. In addition to the traditional parameters describing the phonetic environment of sounds and their hierarchical position in a clause, morphological, syntactic and lexical features of words, such as word form, part of sentence and part of speech, also play an important role in predicting segmental durations in the Estonian language.

Significant features for predicting the position of pauses in the speech flow were the distance of a word from the beginning of the sentence and from the previous pause, the duration and quantity degree of the previous foot, and the punctuation marks or conjunctions in the text.
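A minimal sketch of how such features could feed a pause-location model, assuming scikit-learn; the feature encodings, synthetic data and coefficients below are invented for illustration and are not those of the dissertation.

```python
# Sketch: logistic regression over the pause-location feature types named above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.integers(0, 25, n),      # distance from sentence start (words)
    rng.integers(0, 12, n),      # distance from the previous pause (words)
    rng.normal(400.0, 80.0, n),  # duration of the previous foot (ms)
    rng.integers(1, 4, n),       # quantity degree of the previous foot (1-3)
    rng.integers(0, 2, n),       # punctuation mark or conjunction (0/1)
])
# Synthetic labels: pauses made likelier by punctuation and by distance from
# the previous pause (a stand-in for hand-annotated speech data).
p = 1.0 / (1.0 + np.exp(-(0.15 * X[:, 1] + 2.0 * X[:, 4] - 2.5)))
y = rng.random(n) < p

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # pause probability at three boundaries
```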

1.4. Terms and concepts used in the dissertation

The main aim of language as a sign system is to ensure the expression of thoughts and the communication and reception of information by means of spoken language or written text. Language is a sign system used in speaking (speech), writing (written language), thinking (inner speech) or other forms of communication. The ability to speak is not innate; it is acquired through human activity. The biological linguistic abilities of humans have provided a basis for acquiring a language system from speech and using the acquired system when speaking (Õim 1976).

Thus it can be said that linguistic communication is the transmission and reception of thoughts via speech signals. Unfortunately, computers are not yet able to think independently. Speech synthesis or, more precisely, text-to-speech (TTS) synthesis is the ability of a device or a computer to translate orthographic text into speech without human intervention.

Phonetics studies the expression of linguistic signs in the form of spoken language. The main unit of phonetics is the phone (or speech sound), the smallest speech segment that can be determined by articulatory and acoustic properties. Yet a phone has a large number of variants in acoustic space, depending on its context in the word and on the speaker. By systematically grouping the phones we can establish the phonological system of a language, the units of which are phonemes (Hint 1998). Thus, the input of speech synthesis is a sequence of text or phonemes which is realised in the output as a sequence of sounds, i.e. synthetic speech. In speech recognition, the process is reversed: by analysing speech waves we try to establish the underlying structure of sounds, i.e. the sequence of phonemes. Kalevi Wiik has compared the relationship between a phoneme and a phone to the situation of a shooter in a shooting range (Wiik 1991): just as a shooter tries to aim at the centre of the target, a speaker tries to hit the same target value of the phoneme /a/ in words such as sada, tanu and pali, but owing to the coarticulatory environment the result, as on the shooting target, is not a sound of exactly the same quality but a cluster of similar sounds. Smaller linguistic units – segmental phonemes – are described both through the qualitative properties of sounds and through a parameter related to the temporal dimension – intrinsic duration.

In the presentation of speech (but also in music), a certain order is vital, appearing in passages of speech longer than individual sounds (phonemes). This order is rendered through changes in the duration, fundamental frequency and intensity of the physical parameters of the sound signal. This is what prosody deals with. Suprasegmental phonemes or prosodemes, which normally accompany several segmental phonemes, can be described by prosodic features formed on the basis of duration, pitch and loudness (or their various combinations) – psycho-acoustic perception parameters derived from physical values. The ability of prosodemes to differentiate meanings is above all based on the distinctive difference of the prosodic properties characterising the whole unit. Depending on the nature of the suprasegmental or prosodic phenomenon, the speech segment constituting a prosodeme may be a syllable, foot, word, word combination or sentence. Prosodic phenomena include, for instance, word stress, phrase stress, contrastive stress (focus), syllable tones (e.g. in Chinese), tonal word accents (e.g. in Swedish), Estonian quantity degrees, and sentence intonation.

The physical parameter duration marks the time spent on pronouncing any speech unit (sound, syllable, foot, word, phrase, sentence, pause, etc.) or a part of it. Duration may depend on the qualitative properties of the given unit (e.g. intrinsic duration dependent on phone quality) or of its neighbours, as well as on their quantity, position in the word and sentence, and many other morphological, syntactic and paralinguistic factors (Eek, Meister 2003). The length of a speech unit is usually perceived as its duration (e.g. as a short or long sound).
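As an illustration only, the factors just listed might be gathered into a feature record such as the following sketch; every field name and the example values are invented, not the feature inventory of the dissertation.

```python
# Sketch of a per-segment feature record for duration modelling.
from dataclasses import dataclass

@dataclass
class SegmentFeatures:
    phone: str                 # segment identity (carries its intrinsic duration)
    left_neighbour: str        # quality of the preceding segment
    right_neighbour: str       # quality of the following segment
    syllable_in_word: int      # position of the syllable within the word
    word_in_sentence: int      # position of the word within the sentence
    quantity_degree: int       # Estonian quantity degree of the foot (1-3)
    part_of_speech: str        # one of the morphological/syntactic factors

# One hypothetical observation: the first /a/ of the word "sada".
example = SegmentFeatures("a", "s", "d", 1, 1, 1, "numeral")
```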

Fundamental frequency (F0) and its variability (i.e. different fundamental frequency contours) are created by the vibration of the vocal cords while articulating voiced sounds, which the listener perceives as pitch or a change in pitch. The F0 movement in a phrase or sentence forms the intonation of that phrase or sentence. The pitch and/or its variability in a syllable characterises syllable tones in a foot, i.e. the tonal word accents. Intensity is an energetic speech wave parameter expressing the atmospheric pressure differences occurring as a result of the interaction of the lungs and vocal cords, as well as the intensity level of the articulation, which the listener perceives as the loudness of the signal.

Stress is a complex hierarchical prosodic phenomenon which, depending on the phonological system of the language, is characterised by various physical parameters (duration, F0, intensity, and also vowel quality). Word stress is, depending on the language, either phonological (e.g. in English and Russian) or non-phonological (e.g. in native Estonian words stress usually functions as a boundary marker). Longer words have several stresses, the strongest of which is called the primary stress; the weaker ones are called secondary stresses. In native Estonian words the primary stress usually falls on the first syllable of the word. A foot consists of a strong (stressed) and a weak (unstressed) syllable. A foot can also have a third syllable if it ends with a short vowel or, at the end of a word, with a short consonant. In monosyllabic Estonian words the weak part of the foot is made up of a so-called virtual syllable which is expressed by word-final lengthening. On a higher level, i.e. in words stressed in a phrase or sentence, the different types of contrastive stress usually fall on the foot carrying the primary stress in the word (Eek, Meister 2004). In Estonian, word stress is expressed by the higher F0 of the stressed syllable of the foot as compared to that of the unstressed syllable (Eek 1987). Contrastive stress is distinguished from word stress by a significantly higher F0 of the stressed syllable of the foot carrying the primary stress (Asu 2004). Alternation of stresses creates the speech rhythm.

Estonian quantity degrees are a prosodic phenomenon. Quantity degrees are independent distinctive prosodic units manifested over a disyllabic metric foot consisting of a stressed and an unstressed syllable. Their distinctness depends on the duration ratios of adjacent phonemes and on differences in F0 contours (and possibly also intensity, vowel-to-consonant transitions, and vowel quality) (Eek, Meister 2004).
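The role of duration ratios can be made concrete with a toy sketch. The thresholds below are purely illustrative assumptions, not values from the dissertation; as the paragraph above notes, actual quantity distinctions also involve F0 contours and further cues.

```python
# Toy classifier: quantity degree from the duration ratio of the stressed (S1)
# to the unstressed (S2) syllable of a disyllabic foot. Thresholds are invented.
def quantity_degree(s1_ms: float, s2_ms: float) -> int:
    ratio = s1_ms / s2_ms
    if ratio < 1.0:    # stressed syllable shorter than unstressed: Q1
        return 1
    elif ratio < 1.8:  # intermediate ratio: Q2
        return 2
    return 3           # duration heavily skewed towards the stressed syllable: Q3

print(quantity_degree(120, 180), quantity_degree(180, 120), quantity_degree(200, 100))
# 1 2 3
```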


2. AN OVERVIEW OF SYNTHESIS STRATEGIES AND MODELS OF THE TEMPORAL STRUCTURE OF SPEECH IN TEXT-TO-SPEECH SYNTHESIS

Text-to-speech synthesisers are built on the analogy of a human reading. Figure 1 presents a simplified scheme of reading out loud and the physiological speech organs involved in the reading process.

A human being acquires reading skills during the first decade of life. Thereafter these skills continue to develop and improve until, once acquired, reading becomes automatic. Looking at reading from the physiological point of view, we can see that it is a very complicated process.

Images of letters are grasped by the sensory neurons of the eyes and transported to the brain in the form of electrical stimuli. In the brain, the information is processed and translated into commands to the motor neurons responsible for activating the lungs, vocal cords and articulation muscles (Holmes 1988). This leads to the production of speech, while the articulation process is constantly monitored and controlled with the help of information coming mostly from the auditory organs.

Figure 1. Schematic data flow diagram illustrating the reading process (after Holmes 1988).

2.1. Synthesis strategies

A computer-implemented TTS system is a simplified model of the physiological reading process (Figure 2).

Like a human reader, a TTS synthesiser contains a natural language processing module which transforms the input text into a symbolic representation carrying the desired intonation and speech rhythm. The digital signal processing module then turns this symbolic information into natural-sounding speech.

The natural language processing module provides the text with a phonetic description and determines the speech prosody. Text processing normally involves several levels of description: phonetics, phonology, morphology, syntax and semantics.

Figure 2. Generalised text-to-speech synthesis model.
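As a schematic sketch of the two-module architecture in Figure 2 (not the design of any actual synthesiser), the modules can be thought of as two functions composed in sequence; all names, representations and placeholder values below are invented assumptions.

```python
# Schematic two-module TTS pipeline: NLP module followed by a DSP module.

def natural_language_processing(text: str) -> list[dict]:
    """Produce a phonetic description plus prosody targets (placeholder)."""
    tokens = text.lower().split()
    # Stand-in "grapheme-to-phoneme" step: one pseudo-phone per letter, with
    # constant duration and F0 targets instead of real prosody models.
    return [{"phones": list(token), "duration_ms": 80, "f0_hz": 120}
            for token in tokens]

def digital_signal_processing(units: list[dict]) -> bytes:
    """Turn the symbolic description into a speech waveform (stub)."""
    # A real DSP module would generate or concatenate speech samples here.
    return b"".join(repr(unit).encode("utf-8") for unit in units)

waveform = digital_signal_processing(natural_language_processing("tere tulemast"))
```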

In the 1960s, speech synthesis techniques divided into two paradigms, which Lingaard called the system method and the signal method (Lingaard 1985). The system method is also called articulatory synthesis; it is based on a physiological model of speech production and the physical description of sound production in the vocal tract. Both methods developed independently, but the fastest practicable results were achieved by signal modelling, thanks to the intrinsic simplicity of the approach. Contrary to the articulatory approach, signal modelling does not even attempt to explain the impact of coarticulation through the kinematics of the speech organs but simply describes the respective acoustic waveforms.

To produce comprehensible and natural output speech, the focus is on modelling sound-to-sound transitions and coarticulation. Speech scientists established long ago that phonetic transitions are just as important for comprehensibility as stationary parts (Liberman 1959). Phonetic transitions can be taken into account in synthesis in two ways: directly, as a list of rules formally describing how phonemes affect each other, or indirectly, by saving phonetic transitions, and thereby coarticulatory effects, into a database of speech segments and using these segments in the synthesis as the final acoustic units instead of phonemes.


The two above-mentioned alternatives have developed into two main TTS system types – rule-based synthesis and chain synthesis. Both have a synthesis philosophy of their own.

Rule-based synthesisers are favoured by phoneticians and phonologists because they can be used to study pronunciation mechanisms. The most widespread is the so-called Klatt synthesiser (Klatt 1980) because, owing to the link between articulatory parameters and the inputs of the Klatt model, the synthesiser can be used in speech physiology research. Unlike rule-based synthesisers, synthesisers based on linking speech units have very little information about the data they operate with; most of the information is contained in the segments which are linked into the chain.

In chain synthesis it is presumed that articulated speech flow is not simply a sequence of phones; rather, speech consists of constantly overlapping transitions from one phone to another. Owing to regressive coarticulation, the preceding segment contains features of the following phone. Diphones⁵ are the most widely used speech units in chain synthesis, as a relatively small number of diphones is needed to synthesise speech from arbitrary text. The Estonian diphone database contains approximately 1,900 diphones. While in common TTS diphone synthesis the speech database contains only one sound-to-sound transition, in corpus-based synthesis the whole corpus constitutes the acoustic basis of the synthesis. Diphones are also used as elementary units in the corpus-based synthesis of variable-length speech units (Clark et al. 2007). Speech unit selection algorithms start their search on the higher levels of the phonological tree (phrase, word, foot), giving preference to longer passages in synthesis.

⁵ Diphones begin in the centre of the stable part of a speech sound and end in the stable part of the next sound.
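Following that definition of a diphone, the sketch below shows how a phoneme string maps onto the diphone sequence a chain synthesiser would retrieve from its database; the notation and the silence marker are illustrative assumptions, not the conventions of the Estonian synthesiser.

```python
# Sketch: phoneme sequence -> diphone sequence for chain (concatenative) synthesis.

def to_diphones(phonemes: list[str]) -> list[str]:
    """Each diphone spans from the middle of one phone to the middle of the next."""
    padded = ["_"] + phonemes + ["_"]  # "_" marks silence at utterance boundaries
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["s", "a", "d", "a"]))
# ['_-s', 's-a', 'a-d', 'd-a', 'a-_']
```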

In modelling the temporal structure of speech, the present dissertation focuses both on TTS chain synthesis based on single diphones (Mihkla, Meister 2002) and on a corpus-based unit selection synthesis system (Mihkla et al. 2007). Because diphones contain the transitions between adjacent phones, it is expedient to treat the segmental durations of sounds and pauses as the elements of the temporal structure of speech.

2.2. Speech timing

Broadly speaking, there are three different types of timing in speech: mora-timed rhythm, which is used to explain, for instance, Japanese; syllable-timed rhythm, which is most characteristic of French and Spanish; and stress-timed rhythm, which is used in the temporal regulation of many Indo-European languages.



In Japanese, mora isochrony has been observed as a temporal constraint controlling vowel duration. A negative correlation has been found to exist between the durations of vowels and their adjacent consonants. The phenomenon by which the temporal compensation of the duration of a vowel is influenced more by the duration of its preceding consonant is regarded as an acoustic realisation of mora-timing. Statistical analysis has shown that such compensation takes place in mora units and not in syllables (Sagisaka 2003). Mora metrics has been successfully applied in Estonian phonology as well: Arvo Eek interpreted intra-foot quantity degrees as a manifestation of mora isochrony, where the quantity degree is determined by the distribution of durations within the foot (Eek, Meister 2004:336–357).

In a syllable-timed language, every syllable is thought to be of roughly the same duration when pronounced, although the actual duration of a syllable depends on the situation and context. Spanish and French are commonly quoted as examples of syllable-timed languages, though there is no consensus in this respect (e.g. Wenk, Wioland 1982). When a speaker repeats the same sentence several times at the same rate of articulation, the durations of adjacent phones display a strong negative correlation, i.e. any variation in the duration of a single phone is compensated by the durations of the adjacent phones. Thus the temporal regulation of articulation must be organised at levels higher than the phoneme, e.g. at the level of the syllable (Huggins 1968). The hypothesis of syllable-timing was applied by Campbell and Isard in the statistical modelling of the interaction between higher and lower levels (Campbell, Isard 1991).
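The compensation effect behind syllable-timing can be illustrated with simulated data; the numbers below are invented and merely show how a negative correlation between adjacent durations arises when a higher-level unit keeps a roughly constant length.

```python
# Illustrative check of durational compensation: if the syllable length is
# roughly constant, adjacent phone durations must correlate negatively.
import numpy as np

rng = np.random.default_rng(2)
syllable = rng.normal(250, 15, 200)  # nearly constant syllable duration (ms)
phone_a = rng.normal(120, 20, 200)   # freely varying first phone
phone_b = syllable - phone_a         # the rest of the syllable compensates

r = np.corrcoef(phone_a, phone_b)[0, 1]
print(f"correlation of adjacent phone durations: {r:.2f}")  # strongly negative
```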

In a stress-timed language, syllables may have different durations, but the mean duration of the stretch between two consecutive stressed syllables is more or less constant. Isochrony has been under careful scrutiny in many languages for a long time, yet there is no consensus about speech timing and its acoustic features. In her extensive study of isochrony and speech rhythm, Ilse Lehiste concluded that the English language lacks direct acoustic correlates of speech rhythm (Lehiste 1977). Thierry Dutoit was probably right in saying that there are no so-called “clean” languages that would completely match one of the above-mentioned rhythm models, and that it is more appropriate to talk about a tendency towards isochrony in languages (Dutoit 1997). In recent studies of the Estonian quantity system, it has been considered appropriate to describe the quantity degrees in the context of foot isochrony (Wiik 1991; Eek, Meister 2003).

At the International Congress of Phonetic Sciences held in Saarbrücken in 2007, an entire session was devoted to speech timing, with scholars of various languages (English, Japanese, Brazilian Portuguese and French) discussing the mechanisms of rhythm. Although there was no complete consensus among the scholars, many of them focused on different aspects of vowel onsets in the temporal structure of speech (Keller, Port 2007). The onset of voicing has provided a key for studying the temporal structure of syllables. Vowel onsets play an important role in making speech synthesis more natural and contain significant parameters for speech perception (Keller 2007). Curiously, the new
