
4.3 Sukhotin’s algorithm

As mentioned before, Sukhotin (1962, 1973) was the first to offer a totally unsupervised method to discriminate vowels from consonants on the basis of a phonemic transcription. The approach relies on two fundamental assumptions that are grounded in typological insights. First, vowels and consonants in words tend to alternate rather than being grouped together. Second, the most frequent symbol in the corpus is a vowel. The latter assumption is used to initialize the classification step: the most frequent symbol is taken to be the first member of the vowel class, with the rest of the symbols initially all classified as consonants. With the help of the first assumption, the remaining vowels are then classified iteratively by checking which symbol occurs least frequently adjacent to the already detected vowels. The halting condition is reached when there are no more symbols with a positive frequency count after subtracting the counts of the adjacent vowels.

4.3.1 Typological basis

One of the most interesting aspects of Sukhotin’s algorithm for our purposes is the fact that the two underlying assumptions on which the method is based reflect insights from typological knowledge about the universal structure of language. The tendency for languages to alternate vowels and consonants in the speech chain, rather than grouping them together, has been reported in the typological literature at least since Jakobson and Halle (1956 [2002]), who remark that the languages of the world all tend to have CV as their basic syllable structure.12 Languages of course differ as to the number of possible syllable types that they allow; some allow a huge variety of consonant clusters whereas others show a large number of constraints on their phonotactic possibilities.

However, despite these huge differences in the shape of syllables, all languages seem to observe the law that a syllable consisting of a single consonant followed by a vowel is more basic than any other syllable type. Thus, no language has only VC or CVC syllables in its repertoire.13 Evidence for the assumption that CV is the basic syllable type can be drawn from a variety of areas, including the observation that the syllable types of a language always include CV as one of its members, no matter how small the inventory of different syllable types may be.

With regard to the second assumption of the algorithm, I am not aware of any cross-linguistic studies on the token frequencies of phonemes in larger samples of text.

However, there is indirect evidence in support of the observation that the most frequent symbol is a vowel. In his study on consonant-vowel ratios in 563 languages, Maddieson (2008) states that there are always more consonant than vowel types in the languages of his sample. The consonant-vowel ratios range between 1.11 and 29. The lowest value has been reported for the isolate language Andoke [ano], which has only 10 consonants but a comparatively large inventory of vowels (9). The mean value of the ratio across all languages is 4.25. That is, on average there are over four times more consonants than vowels in the phoneme inventory of a language. Provided that it is true that languages always have more consonant than vowel types, it can be argued that the less numerous vowel types have higher token frequencies in order to be able to contribute their share to the make-up of syllables. In the French corpus that Goldsmith and Xanthos (2009) employed in their experiments, however, the most frequent phoneme turned out to be the consonant /ʁ/. In the sample of orthographic texts on which the subsequent results are based, the most frequent symbol is always a vowel. For some of the CELEX lists, however, a consonant turned out to be the most frequent phoneme. Even though the assumption has not been tested cross-linguistically and might be wrong for some languages (or rather for corpora of those languages), there are good reasons for this claim.

12 Trubetzkoy (1939 [1967]: 223) already remarked that the only universally acceptable phoneme combination is the sequence “consonant + vowel”.

13 See Chapter 5 for an apparent counterexample to this claim.

4.3.2 Description of the algorithm

The fact that it rests upon typological insights is not the only aspect of Sukhotin’s algorithm that makes it particularly well suited for the present purposes. It is also conceptually and computationally very simple and can be illustrated with a small toy corpus.14 In what follows, I will give a short demonstration of how the partitioning into two disjoint sets of symbols on the basis of their distribution within words can be computed for a small corpus of seven words. Given a corpus with an inventory of $n$ symbols $S := \{s_1, \ldots, s_n\}$, we construct an $n \times n$ matrix $M$ whose rows represent the first and whose columns represent the second symbol in a bigram sequence. The cells in the matrix then indicate the number of times the respective sequence is found in the corpus. Since the bigram counts only reflect the occurrence of symbols in immediately adjacent position and the linear order of the two symbols is not taken into account, the matrix is necessarily symmetric. That is, for all cells in the matrix the values for identical symbols in reverse order are the same: $m_{ij} = m_{ji}$. The cells along the main diagonal would in principle be equal to twice the number of times each symbol occurs directly in sequence. However, since these values do not play a role for the calculation of the result, they are set to zero by convention ($m_{ii} := 0$).

$$M = \begin{pmatrix}
m_{11} & \cdots & m_{1n} \\
\vdots & \ddots & \vdots \\
m_{n1} & \cdots & m_{nn}
\end{pmatrix}$$
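In code, the construction of $M$ takes only a few lines. The following Python sketch is my own illustration of the construction just described, not part of the original description; the function name and input conventions are assumptions.

    def bigram_matrix(words):
        """Build the symmetric bigram matrix M for a list of words."""
        symbols = sorted(set("".join(words)))
        # n-by-n matrix of zeros, indexed by symbol rather than by position
        m = {x: {y: 0 for y in symbols} for x in symbols}
        for word in words:
            for first, second in zip(word, word[1:]):
                if first != second:           # main diagonal: m_ii := 0
                    m[first][second] += 1
                    m[second][first] += 1     # symmetry: m_ij = m_ji
        return m

The row sum for a symbol s, which drives the classification described below, is then simply sum(m[s].values()).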

To illustrate the computation with a concrete example, consider the sample corpus $C = \{$saat, salat, tal, last, stall, lese, seele$\}$, for which we obtain the following $5 \times 5$ matrix for all symbols that occur in the corpus ($n = 5$, $S = \{s, a, t, l, e\}$). For ease of understanding, the symbols have been put in front of the cells of the matrix and the row sums in the last column.

14 The description here owes much to the accounts in Guy (1991) and Goldsmith and Xanthos (2009).

M =

        s    a    t    l    e   Sum
    s   0    3    2    0    3     8
    a   3    0    3    4    0    10
    t   2    3    0    0    2     7
    l   0    4    0    0    3     7
    e   3    0    2    3    0     8

Initially, Sukhotin’s algorithm considers all symbols to be consonants before entering into an iterative phase. In each cycle of the phase, the symbol with the highest row sum is classified as a vowel. The row sum for a symbol $s_a$ is calculated by adding up all occurrences of $s_a$ as a first or second member in a sequence: $\sum_{i=1}^{n} m_{ai}$. In our example, the first symbol to be detected and classified as a vowel is a, with a row sum of 10. For a more compact representation, I will only give the vectors of row sums for the individual steps in the calculation. The row sums thereby represent the difference between the number of times a symbol is found next to a consonant and the number of times it occurs in immediate adjacency to a vowel. The first vector would therefore look as follows:

RSum1 =

    s    a    t    l    e
    8   10    7    7    8

After a new vowel has been detected, its row sum is set to zero. Since this symbol is no longer considered to be a consonant, the row sums of all other symbols have to be updated in order to represent the current state of affairs. From the row sum of each remaining symbol, twice the number of times it occurs next to the new-found vowel is subtracted. The value is subtracted twice because the row sum marks the difference between occurring next to a consonant and occurring next to a vowel: each adjacency to the new-found vowel no longer counts as an occurrence next to a consonant (−1) and now additionally counts as an occurrence next to a vowel (−1 again), a net change of two. For instance, s is adjacent to the new vowel a three times, so its row sum drops from 8 to 8 − 2 · 3 = 2. In the second step, the row sum vector then looks as follows:

RSum2 =

    s    a    t    l    e
    2    0    1   −1    8

The vowel a now has a row sum of zero and all other symbols have been updated. The highest remaining sum is for the symbol e, which is thus considered to be a vowel. Again, the row sum for e is set to zero and the row sums for all other symbols are updated, which leaves us with the following vector of row sums:

RSum3 =

     s    a    t    l    e
    −4    0   −3   −7    0

Now the halting condition for the algorithm is reached: there are no more row sums with a value higher than zero. The algorithm therefore terminates with two symbols being classified as vowels.

The rationale behind this algorithm with respect to its basic assumptions is the following. The fact that the symbol with the highest row sum is considered to be a vowel reflects the assumption that the most frequent symbol in the corpus is a vowel. The row sums in turn encode the alternation index between consonants and vowels. Those symbols for which this index is highest, i.e., which have most frequently been found next to other (currently labeled) consonants, are then taken to be vowels themselves. This in turn guarantees that symbols which are most often encountered in adjacent positions are assigned to different classes.

I implemented the algorithm in Python, based on Aris Xanthos’s Perl implementation in the Arabica tool (Xanthos 2007). It takes as its input a list of words from a text file. After the algorithm has reached its final state, the program outputs the list of vowels and the list of consonants for the given input text.
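To make the whole procedure concrete, the following is a minimal Python sketch of the algorithm as described above. It is an illustration written for this exposition, not the implementation just mentioned and not Xanthos’s Perl code; the function name and data conventions are assumptions.

    from collections import defaultdict

    def sukhotin(words):
        """Partition the symbols of a word list into vowels and consonants."""
        # Build the symmetric adjacency counts (diagonal ignored: m_ii := 0).
        counts = defaultdict(lambda: defaultdict(int))
        symbols = set()
        for word in words:
            symbols.update(word)
            for first, second in zip(word, word[1:]):
                if first != second:
                    counts[first][second] += 1
                    counts[second][first] += 1

        # Initially, every symbol counts as a consonant; the row sum of a
        # symbol is then simply its total number of adjacencies.
        row_sums = {s: sum(counts[s].values()) for s in symbols}
        vowels = set()

        while row_sums:
            # Reclassify the symbol with the highest row sum as a vowel.
            candidate = max(row_sums, key=row_sums.get)
            if row_sums[candidate] <= 0:
                break              # halting condition: no positive sums left
            vowels.add(candidate)
            row_sums[candidate] = 0
            # Each adjacency to the new vowel switches from "next to a
            # consonant" (+1) to "next to a vowel" (-1), so subtract twice.
            for other in symbols - vowels:
                row_sums[other] -= 2 * counts[other][candidate]

        return vowels, symbols - vowels

Applied to the toy corpus from above, sukhotin(["saat", "salat", "tal", "last", "stall", "lese", "seele"]) returns a and e as vowels and s, t and l as consonants, in line with the worked example.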

4.3.3 Results and discussion

To my knowledge, the algorithm has so far only been tested on a number of (mostly closely related) languages. Evaluations are included in Sukhotin (1962, 1973), Sassoon (1992), Xanthos (2007) and Goldsmith and Xanthos (2009). With the exception of Finnish, Hungarian, Hebrew and Georgian in Sassoon’s study, all languages were taken from the Indo-European language family. The case of Hebrew shows that it is inappropriate to run the algorithm on a non-phonemic transcription. Similar problems arise when applying the method to orthographic texts in English, German or Russian, where the spelling system does not accurately represent the pronunciation of the words.

The material that will be used here is also orthographic but includes many languages whose spelling system was only recently introduced and thus corresponds much better to a phonemic transcription.

In this section, I provide a more diverse cross-linguistic sample of languages on which the method is tested. For this purpose, I ran the algorithm on my sample of Bible texts (mostly New Testament) in 30 languages. Compared to the early experiments on the usefulness of the approach, the present evaluation involves much larger corpora.

The sizes of the corpora in Sassoon (1992), for instance, range from 1,641 to 3,781 characters. The Bible texts, on the other hand, contain more than 100,000 characters (see Chapter 3.5 for the description of the material that is employed). After looking at the results for his sample of five languages, Sassoon (1992) comes to the conclusion that the algorithm works very well on languages that allow only few consonant clusters, but has more problems when a larger number of complex consonant clusters is found in the corpus. It is particularly interesting to see whether this conclusion can still be maintained when running the algorithm on larger corpora in those languages.

The expectation would be that the effect of consonant clusters is leveled out when more data are taken into consideration. And indeed, looking at those languages in Sassoon’s sample for which he found errors reveals no significant difference from other languages in the results on the basis of the present material. The overview in Table 4.1 shows the classification for all symbols that are contained in the Bible texts of the sample. Misclassified symbols are either very infrequent and happen to occur next to symbols of the same class, or are part of digraphs that are used in the spelling system of the language. In what follows, I will discuss the errors for some of the languages.

Table 4.1: Classification results of Sukhotin’s algorithm for all 30 languages in the sample.

Language | Vowels | Consonants | Misclassified
aau | a o e i u | h k m n p s l w y | —
aey | a e i o u | c b d g f h k j m l n q s t w | —
afr | e a i o u y | ch dj gh sj tj c b d g f h k j m l n ng p r t w v | s
azz | i a e o | ch h c j m l n p s t hu y x | u
cha | a i o u e | q ch c b d g f h j m l n ñ p s r t y | —
deu | e a i o u ü ä ö y | v ß b d g f h k j m l n q s r t w x z | c p
eng | e a i o u | c b d f k j m l n q p s r t w v x z | g h y
epo | a i o e u oj aj ej eu | cx gx hx jx sx c b d g f h k j m l n p s r t v z | ux uj
eus | a e i o u | ll ts rr tz d c b g ç f h k m l n q p s r t x z | y v
fin | a i e u ä o y ö | v b d g f h k j m l n p s r t | c
hat | a i e an è o ou en ò on oun | ch b d g f k j m l n p s r t w v y z | —
hix | o a e i u | dy ny ry tx b d f h k m n p s r t w y x | —
hun | e a o i ö u ü | cs gy ly ny sz ty zs c b d g f h k j m l n p s r t v z | —
ind | a e u i o | kh ny sy c b d g f h k j m l n ng p s r t w y z | —
kal | a i u e o | y q rn b d g f h k j m l n ng p s r t v z | —
kat | a e i o u | b d g v t z k’ m l n zh p’ s r t’ k p q’ gh ch sh dz ts ch’ ts’ j kh | h
lat | e i a u o y | c b d g f h m l n q p s r t v x z | —
mlt | a i e u o ie | ċ ġ ħ għ b d g f h k j m l n q p s r t w v x z ż | —
mri | a e i o u | w h k m n ng p r t wh | —
nld | e a i o u y | ch nj tj c b d g f h k j m l n ng p r t w v z | s
nus | i u a ɛ ä ɔ e o i̱ o̱ ö̱ ɛ̈ ë ɔ̱ ɛ̱̈ | dh ŋ ɣ th c b d g k j m l n p r t w y | h
pot | i a e’ u o | c ’ p k m sh n s t w y | —
quz | a i u o e | q ch ch’ k’ ll q’ p’ sh t’ c b d g f h k j m l n ñ p s r t w v y z | —
swh | a i u e o | b d g f h k j m l n p s r t w v y ch z | —
tgl | a i u o e | ly ny ts b d g h k m l n ng p s r t w y | —
tur | a e i ı u ü o ö | ğ ç ş c b d g f h k j m l n p s r t v y z | —
vie | a i u ê ô ư â e o ă ơ | đ q t ch gh gi nh c b d g h m l n ng s r v y x | k p
wbp | a u i | ny r rd k j m ly rt rn ng p rr t w y n l | —
wim | a u i e o ay ow iy oy aw ew | ’ ch nh ny th k m l n ng p r t w y | ey
wol | a e u o i à ë é ó | q ŋ c b d g f k j m l n ñ p s r t w y x | —

For most of the languages in the sample, minor modifications to the original texts have been made in that the most common digraphs have been treated as a single symbol. In Swahili, for instance, the digraph <ch>, which represents the affricate /tʃ/, has been treated as one symbol ch. With the original orthography, the symbol c was classified as a vowel by Sukhotin’s algorithm due to its exclusive co-occurrence with h. After the revision, all symbols are correctly classified. Similarly, for the Maori Bible text, the digraphs <wh> (/f, ɸ/) and <ng> (/ŋ/) (Harlow 2007:86) led to a misclassification of w and g when the algorithm was applied to the original version of the text. After the revision, both consonants are correctly labeled. Sometimes a symbol shows an unusual distribution because it only occurs in loanwords or proper names, where it is mostly associated with a digraph in the original language from which it is taken. In the original Warlpiri Bible text, for instance, the symbol h is wrongly classified as a vowel because it is not part of the writing system of the language and only occurs in the digraph th in English (or rather Greek) proper names in the Bible texts. For this reason, words including infrequent symbols that only occur in loanwords in the Bible texts have been largely ignored in the language sample.

One of the most severe problems of using orthographic texts for inferring phonological categories arises with symbols that represent both vowels and consonants depending on the context in which they occur. Since the classification is global, in the sense that no context information is taken into consideration during the processing of the algorithm, a decision has to be made across all occurrences of the symbol in the language. In the case of English y, which marks a consonant in words like yoghurt and a vowel in words like lady, the classification does not do justice to all occurrences of the symbol, no matter what the decision looks like. Such cases show that a prerequisite for the algorithm is a phonemic transcription.15 The underlying assumptions on which the algorithm is based rest upon a proper transcription of the pronunciation of words.

The main errors in the results are not due to the principle of operation but a consequence of the fact that the input is not optimal. Most errors involve symbols that are part of digraphs in the spelling system of the language (e.g., c or p in German, g or h in English) or occur so infrequently in the corpus that no reliable information about their distribution within words can be inferred. Figure 4.1 shows that the incorrectly classified symbols are mostly less frequent than the mean frequency in the corpus. The outlier is the consonant s in Afrikaans, which has been classified as a vowel due to its occurrence in digraphs. Other more frequent misclassified symbols (English h or Dutch s) also feature in digraphs. My results confirm the findings of Sassoon (1992) that the algorithm is highly sensitive to the corpus size and the frequency of occurrence of individual symbols. Experimenting with different amounts of data has shown that a low frequency of occurrence might lead to a wrong classification if the symbol tends to occur in contexts that are more diagnostic for the opposite class. One of the first symbols to be wrongly classified in such cases is the sibilant s, possibly due to its frequent occurrence in consonant clusters and its exceptional status in such contexts, where it can violate the sonority sequencing principle.

15 In fact, Kim and Snyder (2013) ignore the symbol y in their analyses because it cannot be uniquely attributed to one class.

[Figure: boxplots of relative frequency (relative to the mean corpus frequency) by classification result (correct vs. incorrect).]

Figure 4.1: Boxplots for all correctly and incorrectly classified symbols together with their relative frequency in the corpus (Sukhotin’s algorithm).

The results in this section show that the algorithm yields a very good accuracy in classifying symbols into vowels and consonants. Languages whose orthography is known to be close to the actual pronunciation of words (e.g., Finnish, Latin and Turkish) show an almost perfect classification.16 Even for a language with many and very complex consonant clusters, such as Georgian, the algorithm works very well, with only one misclassified symbol in the results.

Altogether, Sukhotin’s algorithm shows an average accuracy of 97.57%. This result is similar to what Xanthos (2007:70) reports for English, French (phonemic transcription) and Finnish (orthographic) data on average (0.96 precision and 0.94 recall). The quantitative evaluation for each language in the sample is given in Table 4.2. The worst result is for English, with three misclassified symbols and an accuracy of 88.46%. 18 of the 30 languages in the sample show a perfect classification. In addition, I tested Sukhotin’s algorithm on the data set used in Kim and Snyder (2013) and obtained an average accuracy of 94.06%. Although the accuracy of Sukhotin’s algorithm is lower than what Kim and Snyder (2013:1533) report for their clust method, it yields better results than the expectation maximization (EM) baseline and is much simpler than both methods.

16 The only misclassified symbol in Finnish (c) occurs only in loanwords (Branch 1990:597).

Table 4.2: Quantitative evaluation of Sukhotin’s algorithm for all languages in the sample.

Language | # symbols | # misclassified symbols | Accuracy
aau | 14 | 0 | 100.00%
aey | 20 | 0 | 100.00%
afr | 29 | 1 | 96.55%
azz | 18 | 1 | 94.44%
cha | 23 | 0 | 100.00%
deu | 30 | 2 | 93.33%
eng | 26 | 3 | 88.46%
epo | 33 | 2 | 93.94%
eus | 29 | 2 | 93.10%
fin | 24 | 1 | 95.83%
hat | 29 | 0 | 100.00%
hix | 23 | 0 | 100.00%
hun | 31 | 0 | 100.00%
ind | 27 | 0 | 100.00%
kal | 25 | 0 | 100.00%
kat | 33 | 1 | 96.97%
lat | 23 | 0 | 100.00%
mlt | 30 | 0 | 100.00%
mri | 15 | 0 | 100.00%
nld | 28 | 1 | 96.43%
nus | 34 | 1 | 97.06%
pot | 16 | 0 | 100.00%
quz | 34 | 0 | 100.00%
swh | 24 | 0 | 100.00%
tgl | 23 | 0 | 100.00%
tur | 29 | 0 | 100.00%
vie | 34 | 2 | 94.12%
wbp | 20 | 0 | 100.00%
wim | 27 | 1 | 96.30%
wol | 29 | 0 | 100.00%

The results on the basis of orthographic texts mostly reveal errors that are due to the non-phonemic aspects of the respective spelling systems or reflect infrequent symbols. In order to see how the algorithm performs on phonemic transcriptions, I also ran it on the CELEX data for English, German and Dutch (both for the lists of word forms and of lemmas; cf. Table 4.3) as well as on the list of word forms across all languages in the ASJP database (Table 4.5).

For English, the algorithm yields the same results when applied to the list

Table 4.3: Results for Sukhotin’s algorithm on the CELEX database for English, German and Dutch word forms and lemmas. Those cases where the classification swaps vowels and consonants are indicated with †.

Language Vowels Consonants Misclass.

Table 4.4: Quantitative evaluation of Sukhotin’s algorithm on the CELEX database for English, German and Dutch word forms and lemmas as well as the list of word forms across languages in the ASJP data set.

Language | # symbols | # misclassified symbols | Accuracy
English word forms | 54 | 5 | 90.74%
