

From the document The Induction of Phonological Structure (pages 123-126)

Automatic Syllabification

5.2 Two unsupervised language-independent approaches

In this section, the two main language-independent approaches to the syllabification of words (Kuryłowicz 1948 and Hooper 1976) are presented.3 The first approach rests on the assumption that word-medial consonant clusters can be broken up by looking at the distribution of word-peripheral clusters. It induces the syllabification of words entirely on the basis of this principle, without any additional knowledge of how syllables are structured. In contrast, the second approach assumes the existence of two universal principles (onset maximization and sonority sequencing, as discussed below) that determine the structure of syllables in a language.

1 Remember that the main purpose of unsupervised morphological analysis is text processing (involving orthographic data).

2 This information is sometimes also partly integrated in other approaches to syllabification (e.g., in the Legality Principle in Bartlett et al. 2009).

3 This section is not intended to be a review of the vast amount of literature on syllable theory, which includes works by Blevins (1972, 1996), Kahn (1976), Vennemann (1972), Selkirk (1982) and Hooper (1976).

Presumably, the oldest observation that has been made of the dependencies between parts of a word goes back to the ancient Greek grammarian Herodian, who remarked with respect to the permissible consonant groups in Ancient Greek that a syllable can only begin with a group of sounds that can also be found (or could exist) at the beginning of a word.4 Later, this idea was assumed to be a more general characteristic of human languages, namely that the permissible word-medial consonant clusters do not differ greatly from those that can be found at the edges of words. To my knowledge, the first to devise rules of syllabification that took this old idea into consideration was Kuryłowicz (1948). The same observation was also reported by other authors for a variety of languages. It was formulated as the tendency that “medial consonant sequences are often composed of a word-final cluster (C2) followed by a word-initial cluster (C1)” (Vogel 1977:10).5 This can be translated into the procedural universal that words can be syllabified by simply placing the syllable boundary for word-medial consonant clusters between the final and initial clusters (. . . C2.C1. . . ). In other words, the study of those consonant clusters that can occur before the first and after the last vowel in a word makes it possible to determine the boundary in word-medial clusters.

However, when applying this method to actual data, two problems might theoretically occur: (i) more than one division is possible because there are several different ways in which the medial cluster can be broken up into word-final and word-initial clusters; (ii) no division is possible because any of the possible boundary placements gives rise to at least one sequence that is forbidden word-initially or word-finally. As to the first problem, O’Connor and Trim (1953:121) remark that the “point of syllable division is often unambiguous” in English because the possible syllable breaks are reduced to only one when considering permissible word-initial and word-final clusters. Although this might be true for English, other languages show a more diverse picture. For those cases where several divisions are possible, various proposals have been made in the literature for how to analyze them.6
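The peripheral-cluster procedure and its two potential problems can be illustrated with a minimal Python sketch. The toy word list, the orthographic vowel set and the function names `edge_clusters` and `legal_splits` are assumptions made for illustration only; they are not taken from any of the cited works.

```python
# Assumption: a toy orthographic vowel inventory and a toy lexicon.
VOWELS = set("aeiou")
WORDS = ["stop", "trip", "past", "pant", "ski", "ask"]

def edge_clusters(words):
    """Collect the consonant clusters before the first and after the last vowel."""
    initials, finals = set(), set()
    for w in words:
        idx = [i for i, ch in enumerate(w) if ch in VOWELS]
        if not idx:
            continue  # words without a vowel give no evidence
        initials.add(w[:idx[0]])        # may be "" for vowel-initial words
        finals.add(w[idx[-1] + 1:])     # may be "" for vowel-final words
    return initials, finals

def legal_splits(cluster, initials, finals):
    """All divisions C2.C1 where C2 is a word-final and C1 a word-initial cluster."""
    return [(cluster[:k], cluster[k:])
            for k in range(len(cluster) + 1)
            if cluster[:k] in finals and cluster[k:] in initials]

initials, finals = edge_clusters(WORDS)

print(legal_splits("ptr", initials, finals))  # [('p', 'tr')]             unique division
print(legal_splits("st", initials, finals))   # [('', 'st'), ('st', '')]  problem (i)
print(legal_splits("ntr", initials, finals))  # []                        problem (ii)
```

With this toy lexicon, the cluster "st" has two admissible divisions and "ntr" has none, which corresponds to problems (i) and (ii) respectively; a real lexicon would of course yield different sets.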

Anderson and Jones (1974) and Jones (1976) propose a method for syllabification which also considers the possible combinations at word-edges. In their account, “medial clusters, thus interpreted as a function of the possible combinations of initials and finals, show a preference for simultaneous membership in both a preceding and a following syllable” (Jones 1976:121). Put differently, sounds in a word may belong to two adjacent syllables at a time. This is generally known under the term “ambisyllabicity”, according to which “a single segment is affiliated with more than one syllable” (Blevins 1996:232).7 Anderson and Jones (1974) and Jones (1976), however, assume a different concept of ambisyllabicity from how it is usually defined. In their account, the medial consonant in the English word piper would belong to both the first and the second syllable because it can occur in word-initial and word-final position.

4 Cited from Kuryłowicz (1948:93): “la syllabe ne peut pas commencer que par un groupe existant à l’initiale du mot ou pouvant y exister.”

5 What Vennemann (1988:32-33) calls the Law of Initials (LOI) and the Law of Finals (LOF), respectively: word-medial syllable heads/codas are the more preferred, the less they differ from possible word-initial/word-final syllable heads/codas of the language system.

6 Vogel (1977) gives an overview of earlier proposals for syllabification which contain a well-described procedure to determine syllable breaks.

7 An analysis involving ambisyllabicity in English is put forth in Kahn (1976) for the distribution

In contrast, O’Connor and Trim (1953:121) suggest that in cases of ambiguous word-medial clusters the “preference for one syllabic division as opposed to another may be explained in terms of the frequency of occurrence of different types of syllable finals and initials.” To this end, they count the number of times a certain syllable type occurs at the beginning and end of a word. For their corpus of the ‘Received’ Southern British dialect of English, they give the following counts of syllable types:

Table 5.1: Number of occurrences of different syllable types in initial and final position. Reproduced from O’Connor and Trim (1953:121).

Type    Initial    Final
CV      421        276
VC      209        277
CC      26         59
VV      10         22
V       12         12

For a sequence of VCV, which can be analyzed in two different ways (viz. V.CV or VC.V), their method would favor the first candidate, as the summed frequencies are 12 + 421 = 433 for V.CV against 277 + 12 = 289 for VC.V. This result is in accordance with the principle of onset maximization mentioned below, which states in such cases that the intervocalic consonant should be attributed to the following syllable. Furthermore, the same method also gives an overwhelming preference for the division VC.CV among the three possibilities V.CCV (12 + 26 = 38), VC.CV (277 + 421 = 698) and VCC.V (59 + 12 = 71).
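The arithmetic above can be reproduced with a short Python sketch. The dictionaries hold only the counts that enter the sums in the text; the pairing of each candidate division with a final and an initial type follows those sums, and the names `INITIAL`, `FINAL`, `score` and `rank` are illustrative assumptions rather than part of O’Connor and Trim's account.

```python
# Counts as used in the sums in the text (O'Connor and Trim 1953:121).
INITIAL = {"V": 12, "CV": 421, "CC": 26}
FINAL = {"V": 12, "VC": 277, "CC": 59}

def score(final_type, initial_type):
    """Sum of the frequency of the syllable final and the following syllable initial."""
    return FINAL[final_type] + INITIAL[initial_type]

# Candidate division -> (type ending the first syllable, type beginning the second).
VCV = {"V.CV": ("V", "CV"), "VC.V": ("VC", "V")}
VCCV = {"V.CCV": ("V", "CC"), "VC.CV": ("VC", "CV"), "VCC.V": ("CC", "V")}

def rank(candidates):
    """Return the preferred division together with all scores."""
    scores = {div: score(*types) for div, types in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

print(rank(VCV))   # ('V.CV', {'V.CV': 433, 'VC.V': 289})
print(rank(VCCV))  # ('VC.CV', {'V.CCV': 38, 'VC.CV': 698, 'VCC.V': 71})
```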

An alternative approach to syllabification which assumes universal constraints on syllable structure but does not make use of word-initial and word-final consonant clusters to break up word-medial groups is discussed in Hooper (1976). Drawing on insights from earlier studies on the structure of syllables, she posits a universal hierarchy of consonantal strength, which is vital in determining the placement of syllable boundaries. Her hierarchy reflects the well-known sonority hierarchy of sounds, where sounds are ranked by their amplitude (i.e., their loudness). On the basis of this hierarchy a universal syllable structure condition can be formulated in a way that “the strength scale values for the various C positions should descend from syllable-initial position inward toward the nucleus and descend from syllable-final position inward toward the nucleus” (Hooper 1976:230).8 In addition, whenever there is a choice in the placement of the syllable boundary because two or more consonants could be attributed to the first or second syllable without violating the above-mentioned condition, Hooper assumes the strongest consonant possible begins the next syllable. This principle is better known as the maximization of syllable onsets (or onset maximization principle, OMP) and is stated as follows: “the onsets of syllables are maximized, in accordance to the principles of basic syllable composition of the language” (Selkirk 1982:345).9 In particular, this means that intervocalic consonants are attributed to the following syllable as in the example given above.

7 (continued) of consonantal allophones.

8 This observation has been covered in a wide range of literature, going back to the early work of Sievers (1881).
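One way in which the syllable structure condition and the onset maximization principle could interact is sketched below in Python. This is not a procedure given by Hooper; the sonority values (the inverse of her strength scale), the symbol inventory and the function name `syllabify` are hypothetical, and the sketch checks rising sonority only within onsets, not falling sonority within codas.

```python
# Assumption: a coarse, hypothetical sonority scale for a toy alphabet.
SONORITY = {
    **dict.fromkeys("ptkbdg", 1),  # stops (strongest consonants)
    **dict.fromkeys("fvsz", 2),    # fricatives
    **dict.fromkeys("mn", 3),      # nasals
    **dict.fromkeys("lr", 4),      # liquids
    **dict.fromkeys("jw", 5),      # glides
    **dict.fromkeys("aeiou", 6),   # vowels (nuclei)
}
VOWEL = 6

def syllabify(word):
    """Place each boundary so that the onset is as long as rising sonority allows."""
    son = [SONORITY[ch] for ch in word]
    nuclei = [i for i, s in enumerate(son) if s == VOWEL]
    cuts = []
    for left, right in zip(nuclei, nuclei[1:]):
        start = right
        # Extend the onset leftwards while sonority keeps rising toward the nucleus.
        while start - 1 > left and son[start - 1] < son[start]:
            start -= 1
        cuts.append(start)
    parts, prev = [], 0
    for c in cuts:
        parts.append(word[prev:c])
        prev = c
    parts.append(word[prev:])
    return ".".join(parts)

print(syllabify("ata"))    # a.ta
print(syllabify("patra"))  # pa.tra
print(syllabify("pasta"))  # pas.ta
```

Because the onset is extended only while sonority strictly rises, its first consonant is the least sonorous (that is, the strongest) one available, which is how the sketch approximates the requirement that the strongest possible consonant begins the next syllable.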

Hooper (1976) does not provide an explicit procedure for determining the syllable boundaries on the basis of her principles. However, later works in computational linguistics have addressed the question of how to integrate these principles in their systems. An early implementation of this approach, in which a connectionist model makes use of a similar method, is presented in Goldsmith and Larson (1990). Their neural network distinguishes between two kinds of sonority: (i) a so-called inherent sonority of a segment, which is organized according to the sonority hierarchy; and (ii) a derived sonority of a segment, which is dependent on the phonological context where it occurs in the word. Whereas the former represents the initial state of the model, the latter sort of sonority is derived from the activation level of the segment in the model after it has been run. The syllabification of the word is then determined by the derived sonority, where the maxima of derived sonority represent the syllable nuclei and the minima the beginning of syllable onsets. Similarly, the principles of the sonority hierarchy and the OMP have been integrated in various other approaches to determine the syllabification of words (e.g., Goldwater and Johnson 2005, Bartlett et al. 2009 and references therein).
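The final read-out step described for the connectionist model, where maxima of derived sonority mark nuclei and minima mark onset beginnings, can be sketched in Python as follows. The network that produces the derived values is not modeled here; the numbers in the example are invented, the function names are assumptions, and plateaus (equal neighboring values) are simply ignored.

```python
def peaks_and_troughs(derived):
    """Indices of local maxima (nuclei) and local minima (onset beginnings)."""
    n = len(derived)
    nuclei, onsets = [], []
    for i, v in enumerate(derived):
        left = derived[i - 1] if i > 0 else None
        right = derived[i + 1] if i < n - 1 else None
        if (left is None or v > left) and (right is None or v > right):
            nuclei.append(i)
        # A word-final segment cannot begin an onset, so a right neighbor is required.
        if right is not None and v < right and (left is None or v < left):
            onsets.append(i)
    return nuclei, onsets

def split_at_minima(segments, derived):
    """Cut the segment string in front of every word-internal minimum."""
    _, onsets = peaks_and_troughs(derived)
    parts, prev = [], 0
    for c in (i for i in onsets if i > 0):
        parts.append(segments[prev:c])
        prev = c
    parts.append(segments[prev:])
    return ".".join(parts)

# Hypothetical derived sonority values for the segments p-a-t-r-a.
values = [1.0, 7.5, 0.5, 3.0, 8.0]
print(peaks_and_troughs(values))         # ([1, 4], [0, 2])
print(split_at_minima("patra", values))  # pa.tra
```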

Bartlett et al. (2009) implement a hybrid approach using sonority and peripheral clusters together with additional language-specific constraints for English. The system requires no training data but determines the syllable breaks according to the sonority sequencing principle. They also report on a categorical approach that relies on word-peripheral clusters. Their Legality approach combines OMP with the so-called Legality Principle, which “constrains the segments that can begin and end syllables to those that appear at the beginning and end of words” (Bartlett et al. 2009:309). In their implementation, all word-initial consonant clusters are collected from the corpus. They determine syllable breaks as the maximal onset that can be found word-initially without testing for the legality of the resulting codas. Their results show that the legality method sometimes outperforms the sonority approach in word accuracy (i.e., the percentage of correctly syllabified words in the test set).
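The Legality approach as summarized above (maximal word-initial onset, no check on the resulting coda) and the word accuracy measure can be sketched in Python. This is an illustration under stated assumptions, not the code of Bartlett et al.: the corpus, the orthographic vowel set, the gold syllabifications and all function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: toy orthographic vowel set

def word_initial_clusters(corpus):
    """Collect every consonant cluster that precedes the first vowel of a word."""
    onsets = set()
    for w in corpus:
        k = 0
        while k < len(w) and w[k] not in VOWELS:
            k += 1
        if k < len(w):          # skip words without any vowel
            onsets.add(w[:k])
    return onsets

def syllabify_legal(word, onsets):
    """Give each non-initial syllable the longest onset attested word-initially."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    cuts = []
    for left, right in zip(nuclei, nuclei[1:]):
        for start in range(left + 1, right + 1):   # longest candidate first
            # The empty onset (start == right) is always accepted as a fallback;
            # the legality of the coda left behind is deliberately not tested.
            if start == right or word[start:right] in onsets:
                cuts.append(start)
                break
    parts, prev = [], 0
    for c in cuts:
        parts.append(word[prev:c])
        prev = c
    parts.append(word[prev:])
    return ".".join(parts)

def word_accuracy(predicted, gold):
    """Share of words whose predicted syllabification matches the gold one exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

onsets = word_initial_clusters(["trap", "stop", "pin", "nap"])
predicted = [syllabify_legal(w, onsets) for w in ["pasta", "pantra"]]
print(predicted)                                       # ['pa.sta', 'pan.tra']
print(word_accuracy(predicted, ["pas.ta", "pan.tra"]))  # 0.5
```

Since "st" is attested word-initially in the toy corpus, the sketch yields "pa.sta", whereas a purely sonority-based division would place the boundary between "s" and "t"; this is the kind of difference on which the two methods can diverge.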

The focus in this chapter is on the presentation of a similar method to the one by O’Connor and Trim (1953) which does not require a preliminary classification of the symbols for the language as is necessary for a sonority approach.
