The Induction of Phonological Structure

Dissertation submitted for the academic degree of Doctor of Philosophy

submitted by Thomas Mayer

at the

Humanities Section, Department of Linguistics

Date of the oral examination: 9 February 2012

Referee: Prof. Frans Plank

Referee: Prof. Miriam Butt

Referee: Prof. Emily Bender

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-262292

Contents

Acknowledgments
Zusammenfassung
Abstract

1 Introduction

Part I

2 Background
2.1 Introduction
2.2 Setting the stage
2.3 Linguistic typology
2.4 A computational approach
2.5 Motivation
2.5.1 Distributional correlates of features
2.5.2 Quantifying typological parameters
2.5.3 A phonological component for the unsupervised learning of morphology
2.6 Summary

3 Methods and Language Data
3.1 Introduction
3.2 Statistics
3.2.1 Co-occurrence counts
3.2.2 Conditional probabilities
3.2.3 Association measures
3.3 Cluster analysis and multidimensional scaling
3.3.1 Cluster analysis with dendrograms
3.3.2 Multidimensional scaling
3.4 Visualization
3.4.1 Why data visualization?
3.4.2 Visual analytics
3.4.3 Matrix visualizations
3.4.4 PhonMatrix
3.5 Language data
3.5.1 CELEX data
3.5.2 ASJP data
3.5.3 Maltese roots
3.5.4 Bible texts
3.6 Summary

Part II

4 Vowels and Consonants
4.1 Introduction
4.2 Previous approaches
4.2.1 Fischer-Jørgensen (1952)
4.2.2 O’Connor and Trim (1953)
4.2.3 Ševoroškin (1963)
4.2.4 Boy (1977)
4.2.5 Finch and Chater (1991)
4.2.6 Ellison (1994)
4.2.7 Powers (1997)
4.2.8 Knight et al. (2006)
4.2.9 Calderone (2009)
4.2.10 Goldsmith and Xanthos (2009)
4.2.11 Kim and Snyder (2013)
4.2.12 Summary
4.3 Sukhotin’s algorithm
4.3.1 Typological basis
4.3.2 Description of the algorithm
4.3.3 Results and discussion
4.4 The substitution approach
4.4.1 Description of the algorithm
4.4.2 Results and discussion
4.5 Conclusion

5 Automatic Syllabification
5.1 Introduction
5.2 Two unsupervised language-independent approaches
5.3 A totally unsupervised language-independent method
5.4 The problem of evaluating syllabification methods
5.5 Evaluation
5.6 Discussion
5.7 Conclusions

6 Vowel Harmony
6.1 Introduction
6.2 The phenomenon
6.2.1 Turkish
6.2.2 Finnish
6.3 Related work
6.4 Analyzing vowel successions
6.5 Generating the visualization
6.5.1 Data mapping
6.5.2 Matrix arrangement
6.6 Visualization results
6.6.1 Finnish and Turkish
6.6.2 Hungarian
6.6.3 Warlpiri
6.6.4 Maori
6.6.5 Swahili
6.6.6 Maltese
6.6.7 Udihe
6.6.8 Summary
6.7 Clustering vowels
6.8 Conclusion

7 Place of Articulation
7.1 Introduction
7.2 The principle of similar place avoidance
7.2.1 Testing SPA
7.2.2 Maltese results
7.2.3 CELEX results
7.2.4 ASJP results
7.2.5 Parameters of SPA
7.2.6 The repulsion of likes
7.2.7 Summary
7.3 Clustering consonants in terms of place features
7.3.1 Maltese roots
7.3.2 English lemmas from the CELEX database
7.3.3 Cross-linguistic sample of word forms from the ASJP database
7.4 Relevance for phonological theory
7.5 Conclusion

8 Conclusion

Bibliography

Appendices

A Language data
A.1 Language sample
A.2 ASJP family coverage

B SPA Results
B.1 Maltese verbal roots
B.2 ASJP data
B.3 CELEX English
B.4 CELEX German
B.5 CELEX Dutch
B.6 Vowel length
B.7 Manner of articulation and laryngeal features

List of Figures

3.1 Dendrogram for the normalized dissimilarity matrix
3.2 MDS plot for the normalized dissimilarity matrix
3.3 Anscombe’s Quartet: Scatter plots with linear regression lines
3.4 Unsorted visualization matrix of φ values
3.5 Sorted visualization matrix of φ values
3.6 The processing pipeline of the PhonMatrix visualization tool
3.7 Screenshot of step 2 of the PhonMatrix tool
3.8 World map with languages in the ASJP database
3.9 World map of the languages in the sample

4.1 Boxplots for all correctly and incorrectly classified symbols together with their relative frequency in the corpus (Sukhotin’s algorithm)
4.2 The φ substitution matrices for all 30 languages in the sample
4.3 Visualization of the substitution matrix for orthographic English text
4.4 Boxplots for all correctly and incorrectly classified symbols together with their relative frequency in the corpus (substitution method)

6.1 The visualized φ matrix for Turkish and Latin
6.2 Unsorted and sorted φ matrices for Finnish
6.3 φ matrices for 30 languages sorted according to decreasing average (absolute) φ values
6.4 φ matrix visualizations for Turkish and Finnish
6.5 φ matrix visualization for Hungarian
6.6 φ matrix visualizations for Warlpiri and Maori
6.7 φ matrix visualizations for Swahili and Maltese
6.8 φ matrix visualization for Udihe
6.9 Dendrogram for Turkish vowels
6.10 MDS plot for Turkish vowels
6.11 Dendrogram for Finnish vowels

7.1 Scatter plot and boxplots for languages in the ASJP database
7.2 Map of all languages in the ASJP database and their conformity to SPA
7.3 The six macro areas of the world according to Dryer (1989, 1992)
7.4 Dendrogram for Maltese consonants
7.5 Dendrogram for consonants on the basis of their distribution in all word forms in the English CELEX database
7.6 Dendrogram for consonants on the basis of their distribution in all word forms in the ASJP database
7.7 MDS plot for consonants on the basis of their distribution in all word forms in the ASJP database

List of Tables

2.1 Greenberg’s (1960) quantitative results for morphological types

3.1 Absolute values of bigram frequencies in the English toy corpus
3.2 Conditional probability values of bigram frequencies in the English toy corpus
3.3 Fourfold contingency table example
3.4 Expected and observed values for the letter a
3.5 φ values of bigrams in the English toy corpus
3.6 Euclidean distance between matrix row vectors
3.7 Dissimilarity matrix for bigram φ values
3.8 Dissimilarity matrix for normalized bigram φ values
3.9 Data for Anscombe’s Quartet
3.10 The division of labor between humans and machines
3.11 Number of word forms and lemmas in the CELEX database
3.12 Languages in the sample

4.1 Classification results for Sukhotin’s algorithm
4.2 Quantitative evaluation of Sukhotin’s algorithm for all languages in the sample
4.3 Results for Sukhotin’s algorithm on the CELEX database for English, German and Dutch
4.4 Quantitative evaluation of Sukhotin’s algorithm on CELEX and ASJP data
4.5 Results for Sukhotin’s algorithm on all word forms in the ASJP database
4.6 Absolute counts of substitutions of sounds in the English text
4.7 Classification results of the substitution method
4.8 Quantitative evaluation of the substitution method for all languages in the sample
4.9 Results for the substitution method on the CELEX and ASJP data
4.10 Quantitative evaluation of the substitution method on the CELEX and ASJP data

5.1 Number of occurrences of different syllable types in initial and final position
5.2 Example calculations for determining syllable breaks in word-medial clusters
5.3 Different analyses of the syllabification of the English word happy
5.4 Examples of divergent analyses in CELEX and NETtalk syllabifications
5.5 Evaluation of the syllabification method for five languages

6.1 Turkish vowels
6.2 Turkish harmony classes
6.3 Turkish harmony spreading from left to right
6.4 Permissible vowel successions in Turkish harmonic words
6.5 Finnish vowels
6.6 Finnish harmony classes
6.7 Permissible vowel transitions in Finnish harmonic words
6.8 Example of a matrix with succession counts for the Finnish orthographic text
6.9 The matrix with the φ effect strength values for Finnish
6.10 The matrix of φ values for Turkish
6.11 The matrix of φ values for Latin
6.12 Hungarian short vowels
6.13 Hypotheses about potential harmony patterns in Udihe
6.14 Turkish vowels

7.1 Assignment of consonants to symbols for the ASJP orthography
7.2 Maltese consonants
7.3 Results for Maltese place combinations for triliteral roots
7.4 Results for Maltese identical place combinations in Pos 1 and 2
7.5 Results for Maltese identical place combinations in Pos 2 and 3
7.6 Results for Maltese identical place combinations in Pos 2 and 3, ignoring identical consonants
7.7 Results for Maltese identical place combinations in Pos 1 and 3
7.8 Twaddell’s results for CVC sequences in German with stressed vowels
7.9 Results for CVC sequences in German with stressed vowels
7.10 Results for CVC sequences in German lemmas
7.11 Results for CVC sequences in English lemmas
7.12 Results for CVC sequences in Dutch lemmas
7.13 Results for CVC sequences for all word forms in the ASJP database
7.14 Results for the Dryer test for all languages in the ASJP database
7.15 Results for all CVC sequences with long and short intervening vowels for English lemmas
7.16 Results for all CVC sequences with long and short intervening vowels for German lemmas
7.17 Results for all CVC sequences with long and short intervening vowels for Dutch lemmas
7.18 Twaddell’s results for CVC sequences in German with stressed vowels
7.19 Results for all CVC sequences in Maltese verbal roots with respect to their manner category

A.1 Summary of the language data
A.2 ASJP family coverage

B.1 Maltese place combinations (all verbal roots)
B.2 Maltese place combinations (triliteral roots)
B.3 Maltese identical place combinations
B.4 Maltese identical place combinations (Pos 1 and 3)
B.5 Maltese identical place combinations (Pos 2 and 3)
B.6 Maltese place combinations (all roots; ignoring identicals)
B.7 Maltese place combinations (ignoring identicals)
B.8 Maltese identical place combinations (Pos 1 and 2; ignoring identical consonants)
B.9 Maltese identical place combinations (Pos 1 and 3; ignoring identical consonants)
B.10 Maltese identical place combinations (Pos 2 and 3; ignoring identical consonants)
B.11 ASJP place combinations
B.12 ASJP place combinations (ignoring identicals)
B.13 CELEX place combinations for English lemmas
B.14 CELEX place combinations for English word forms
B.15 CELEX place combinations for CVC sequences with stressed vowels in English lemmas
B.16 CELEX place combinations for CVC sequences with stressed vowels in English word forms
B.17 CELEX place combinations in English lemmas (ignoring identicals)
B.18 CELEX place combinations in English word forms (ignoring identicals)
B.19 CELEX place combinations for CVC sequences with stressed vowels in English lemmas (ignoring identicals)
B.20 CELEX place combinations for CVC sequences with stressed vowels in English word forms (ignoring identicals)
B.21 CELEX place combinations in German lemmas
B.22 CELEX place combinations in German word forms
B.23 CELEX place combinations for CVC sequences with stressed vowels in German lemmas
B.24 CELEX place combinations for CVC sequences with stressed vowels in German word forms
B.25 CELEX place combinations in German lemmas (ignoring identicals)
B.26 CELEX place combinations in German word forms (ignoring identicals)
B.27 CELEX place combinations for CVC sequences with stressed vowels in German lemmas (ignoring identicals)
B.28 CELEX place combinations for CVC sequences with stressed vowels in German word forms (ignoring identicals)
B.29 CELEX place combinations in Dutch lemmas
B.30 CELEX place combinations in Dutch word forms
B.31 CELEX place combinations for CVC sequences with stressed vowels in Dutch lemmas
B.32 CELEX place combinations for CVC sequences with stressed vowels in Dutch word forms
B.33 CELEX place combinations in Dutch lemmas (ignoring identicals)
B.34 CELEX place combinations in Dutch word forms (ignoring identicals)
B.35 CELEX place combinations for CVC sequences with stressed vowels in Dutch lemmas (ignoring identicals)
B.36 CELEX place combinations for CVC sequences with stressed vowels in Dutch word forms (ignoring identicals)
B.37 CELEX place combinations with long intervening vowels for English lemmas
B.38 CELEX place combinations with short intervening vowels for English lemmas
B.39 CELEX place combinations with long intervening vowels for German lemmas
B.40 CELEX place combinations with short intervening vowels for German lemmas
B.41 CELEX place combinations with long intervening vowels for Dutch lemmas
B.42 CELEX place combinations with short intervening vowels for Dutch lemmas
B.43 Maltese manner combinations
B.44 Maltese sonorant/obstruent combinations
B.45 Maltese voiced/unvoiced combinations
B.46 CELEX voiced/unvoiced combinations for English lemmas
B.47 CELEX sonorant/obstruent combinations for English lemmas
B.48 CELEX voiced/unvoiced combinations for German lemmas
B.49 CELEX sonorant/obstruent combinations for German lemmas
B.50 CELEX voiced/unvoiced combinations for Dutch lemmas
B.51 CELEX sonorant/obstruent combinations for Dutch lemmas

Acknowledgments

This dissertation would not have been possible without the support and guidance that I have received from many people over the years. It is impossible to do justice to their contribution to this work by writing a few words of thanks here. Instead I hope that everybody who directly or indirectly had an impact on the final outcome of this work finds his or her share while reading through it. Having said that, I want to express my gratitude to the following people directly.

First of all, I would like to thank my supervisor Frans Plank. He has been my mentor ever since my first seminar at university and taught me many things about languages in general and linguistic typology in particular. I was very lucky to have him as a supervisor because he gave me the freedom to pursue my own ideas. I would also like to express my deep gratitude to Miriam Butt, my second advisor on the board, who has put so much energy into reading and correcting earlier versions of this work. I benefitted a lot from her enthusiasm in doing and organizing linguistic research.

Furthermore, I would like to thank Emily Bender, who kindly agreed to be the external examiner on my committee and sent me very detailed comments on a previous version of this dissertation. All remaining errors and omissions are my own responsibility.

I am deeply grateful to Bernhard Wälchli for the input he gave me on various issues that are integrated in this work. Many of the ideas and approaches that are discussed in this dissertation are a result of the numerous discussions that I had with him about the various topics or about language and (academic) life in general. But, first and foremost, I am thankful for his friendship.

I owe many thanks to Christian Rohrdantz, who was my colleague for over two years and with whom I discussed many of the computational aspects of this thesis. He contributed in particular to our joint work on the visualization of vowel harmony patterns (Chapter 6) and on the Dryer test for SPA in Chapter 7, for which Michael Hund gathered the necessary geographical information. Working together in the CALD project and in the group of Daniel A. Keim was a great experience for me.

Many thanks also go to all my colleagues in Konstanz with whom I shared offices, ideas and so many things that are not related to linguistics. Konstanz was an enormously inspiring environment for me. I would like to thank the whole department for their support in creating a very nice working atmosphere. In particular, I would like to mention Simon and Muna, who went through the same trouble with me for five years.

I would like to thank Maria V. Tolskaya and Irina Nikolaeva for providing the Udihe texts and Michael Spagnol for the list of Maltese verbal roots. In addition, I would like to thank Young-Bum Kim, who kindly shared the data on vowel/consonant discrimination with me, as well as Östen Dahl for sending me the geographical coordinates of the languages listed by Ethnologue. I am indebted to the reviewers of the papers that have been incorporated into this dissertation for their input (Michael Cysouw, John Goldsmith, Maria Koptjevskaja-Tamm, John Nerbonne as well as the anonymous reviewers). I also want to thank Michael Cysouw for the time he gave me to finish the writing process.

Last but not least, all this would not have been possible without the emotional support and encouragement that I received from my family and especially from Eva in the final stages of the writing process. Vielen Dank für alles!

This research was made possible through funding from the research initiative “Computational Analysis of Linguistic Development” and the DFG Sonderforschungsbereich 471 “Variation und Entwicklung im Lexikon” at the University of Konstanz.

Zusammenfassung (Summary)

The present work addresses the question of the extent to which phonological structures can be derived from the distribution of sounds in the words of a language. To this end, a typologically oriented approach is pursued that rests on methods from computational linguistics, data mining and visual analytics. The methods presented are understood as procedural universals in the typological, cross-linguistic sense: they can be applied to all natural languages, even though they yield different results for individual languages.

The basic assumption of all methods treated here is the observation that the co-occurrence of sounds in relevant contexts within the words of a language is restricted. These restrictions in turn lead to a corresponding distribution, which can be used to determine a distinction among the sounds of the language that can be related to a similar division into natural classes and features in phonological theory. The focus of the present work is not on the statistical methods needed to ultimately derive the structures, but rather on the linguistically motivated contexts that display the underlying constraints most efficiently.

The derivation of phonological structures from language data is of interest to linguistic research for several reasons. First, it is remarkable that phonological features, which are largely defined on the basis of physiological or acoustic characteristics, are also anchored in the language through their distributional properties. In this work I complement previous research on the automatic learning of phonological categories (e.g., Ellison 1994; Goldsmith and Xanthos 2009) with an approach to deriving consonant distinctions with respect to place of articulation. The method is based on the principle of similar place avoidance (SPA; Pozdniakov and Segerer 2007), which states that consonants in CVC sequences tend to exhibit different place-of-articulation features. I contribute to earlier work by showing that this principle holds not only within the Semitic languages (with a study of verbal roots in Maltese) but is also relevant for West Germanic languages (with an investigation of the CELEX data for English, German and Dutch) and for a cross-linguistic collection of word forms from the ASJP database (Dryer test for universality), which suggests that it constitutes a universal. This principle can be used to derive consonant distinctions with respect to place of articulation and yields almost perfect results on the ASJP data and the list of Maltese verbal roots. The automatically generated dendrograms correspond to the hierarchical structures of natural classes assumed in the phonological literature (Rice 1994; McCarthy 1994).

In addition, the present work complements earlier approaches to the machine recognition of phonological structure with a new method for automatically distinguishing the vowels and consonants of a language that does not rely on N-gram statistics. The substitution method is based on the frequency with which sounds occur as the distinguishing element in minimal pairs. Although the method does not reach the same level of accuracy as earlier approaches (e.g., Sukhotin 1962; Ellison 1994; Goldsmith and Xanthos 2009; Kim and Snyder 2013), it nevertheless shows that a distinction between vowels and consonants is also possible on the basis of their relation in absentia.

Second, the derivation of phonological structures is considered in this work as a means of examining a large amount of data for phonotactic constraints. To this end, a visual analytics approach to the detection of vowel harmony patterns is presented, intended as a proof of concept that statistical analyses supplemented with visualizations make potentially interesting patterns in the data more accessible to human perception. As the matrix visualizations show, languages that exhibit patterns of vowel harmony (or similar phenomena) can be distinguished from languages in which such phenomena are absent. The visualization approach can easily be extended to related phenomena, e.g., consonant harmony (Hansson 2010), synharmonism (Trubetzkoy 1939 [1967]) or any kind of (statistical) phonotactic constraints. The statistical measure on which the matrix visualizations are based can also serve as a typological measure by which languages can be compared with one another. The ordering of languages according to this measure roughly reflects the intuition about which languages show conspicuous patterns.

Abstract

This dissertation explores to what extent phonological structure can be inferred from the distribution of sounds within words. For this purpose, a typologically oriented computational approach is pursued, which rests on techniques from the fields of computational linguistics, data mining and visual analytics. The methods that are presented are considered to be procedural universals which can be applied to any natural language in the same way even though they yield different results for individual languages.

The basic assumption that underlies all methods is that the co-occurrence of sounds in relevant contexts within words of a language is constrained. The restrictions of combinations of sounds lead to a given distribution, which in turn can be used to induce a distinction in the sounds of the language that can be related to natural classes and features in phonological theory. The focus of the present approach is not so much on the statistical methods that are necessary to induce the latent structures, but on the linguistically motivated contexts which manifest the existing constraints most clearly.

The induction of phonological structure from language data is an interesting research topic for various reasons. First of all, it is remarkable that phonological features, which are mostly defined in terms of articulatory or acoustic properties, are also reflected in the distribution of sounds in a language. In this thesis, I complement previous work on learning phonological categories (e.g., Ellison 1994; Goldsmith and Xanthos 2009) with an approach to infer place of articulation distinctions in consonants. The method is based on the principle of similar place avoidance (SPA; Pozdniakov and Segerer 2007), which states that consonants in CVC sequences tend to exhibit different place features. I contribute to earlier work in this research area by showing that this principle is not only active in Semitic languages (with a study of Maltese verbal roots) but also holds for West Germanic languages (with an investigation of the entries in the CELEX database for English, German and Dutch) and a worldwide sample of word forms from the ASJP dataset (Dryer test for universality), leading to the conclusion that it is a statistical universal. Using this principle to infer place distinctions in consonants yields almost perfect results for the ASJP data and the list of Maltese verbal roots. The automatically generated dendrograms closely correspond to the hierarchical structures for natural classes that have been postulated in the phonological literature (e.g., Rice 1994; McCarthy 1994).

In addition, the present thesis complements previous work on the machine learning of phonological structure with a novel method to automatically discriminate vowels and consonants in a language that is not based on N-gram statistics. The substitution approach relies on the frequency with which sounds occur as the discriminating segments in minimal pairs. Although the method does not achieve the same level of accuracy as earlier approaches in this area (e.g., Sukhotin 1962; Ellison 1994; Goldsmith and Xanthos 2009; Kim and Snyder 2013), it shows that a distinction between vowels and consonants can also be inferred from the relation of sounds in absentia.

Second, the induction of phonological structure is considered in the present work as a way to explore a large amount of language data in search of phonotactic constraints. To this end, I present a visual analytics approach for the detection of vowel harmony patterns that is intended as a proof of concept that a graphically enhanced statistical analysis can make potentially interesting patterns in the data more accessible to human perception. As the matrix visualizations show, languages exhibiting patterns of vowel harmony (or similar phenomena) can be distinguished from languages without such constraints at a glance. The visualization approach can easily be extended to other related phenomena, e.g., consonant harmony (Hansson 2010), synharmonism (Trubetzkoy 1939 [1967]) or any kind of (statistical) phonotactic constraints. The statistical measure on which the vowel harmony visualizations are based can also serve as a typological measure on the basis of which languages can be compared. The ranking of languages according to this measure approximately reflects the intuition about which languages show conspicuous harmony patterns.

Introduction

The aim of this dissertation is the induction of phonological structure on the basis of the distribution of sounds within and across words. To this end, the co-occurrence of sounds is statistically analyzed with respect to relevant contexts where systematic constraints are in effect that restrict their possibilities of combination. The statistical analyses are performed mainly for two purposes. On the one hand, they serve as the input for a cluster analysis procedure that attempts to induce phonologically relevant distinctions in the sounds which can be related to a classification in terms of natural classes and phonological features. Since the constraints that are exploited to induce the distinctions are to a large extent tendencies rather than absolute restrictions and thus only reveal interesting findings for a larger amount of data, a computational analysis is employed. On the other hand, the statistical measures can be used for an automatically generated visualization that makes potential patterns in the data more easily accessible to human perception. The ultimate goal of a visualization approach is to allow for the detection of phonologically interesting patterns that would otherwise go unnoticed.

The relevant contexts from which the statistical values are extracted are motivated from research in linguistic typology (Comrie 1981; Croft 2002). Cross-linguistic knowledge is considered to be helpful in finding such contexts because it provides insights as to the unity and diversity of the world’s languages. The main aim of this thesis is therefore not to use sophisticated methods from machine learning or statistics (see, for example, Ellison 1994; Goldsmith and Xanthos 2009) but to concentrate on novel linguistically motivated contexts that serve as the basis for the statistical analyses. The statistical analyses are a prerequisite to account for the fact that the constraints under investigation are for the most part tendencies rather than absolute constraints.

The dissertation can be divided into two main parts: Part I (Chapters 2 and 3) gives background information on the motivations and methodology, whereas Part II (Chapters 4, 5, 6 and 7) shows how the approach can be applied to four different areas of research. Finally, Chapter 8 concludes the dissertation.

Chapter 2 introduces the main characteristics of the present approach and the reasons why the combination of typological and computational knowledge might help to further insights into the structure of language. Moreover, the chapter discusses the main motivations for the induction of phonological structure from the distribution of sounds within words. One important aspect of this investigation is thereby seen in the relationship between the definition of phonological features in terms of articulatory and acoustic properties and their description on the basis of distributional criteria. It is shown in this thesis that a set of distinctions of sounds with respect to articulatory features can be induced from their distribution. A second incentive concerns the extraction of typologically relevant features from data on the language. The idea is that the statistical values can also be used to calculate a typologically relevant measure on the basis of which languages can be compared. Finally, the usefulness of an unsupervised phonological component for the automatic induction of a morphological analysis of a language (cf. Roark and Sproat 2007; Hammarström and Borin 2011) is discussed, in particular with respect to the question of the relationship between syllable and morpheme boundaries.

Chapter 3 provides a closer look at the statistical methods that will be employed and how they can be motivated and interpreted. I also introduce the aims and motivations of methods from the field of visual analytics (Thomas and Cook 2005) and how the exploration of potentially interesting patterns in language data can benefit from an integration of these methods. Furthermore, the language data that will be used throughout this thesis are presented.

In Chapter 4, I discuss the discrimination of vowels and consonants from their distribution within words. The basic distinction between vowels and consonants is fundamental for all other patterns and structures that will be extracted in this thesis because they rely on contexts for which the distinction is a prerequisite. Two main approaches for the discrimination are presented together with their results. First, I describe Sukhotin’s algorithm (Sukhotin 1962, 1973), which is based on the assumption that vowels and consonants tend to alternate rather than group together in words.

A cross-linguistic evaluation of its success is given on the basis of word lists in a number of languages. Second, an alternative approach is introduced that does not rely on relationships in praesentia (such as bigrams as they are employed in Sukhotin’s algorithm and other approaches), but rather makes use of in absentia relations to discriminate vowels from consonants. The idea is that minimal pairs that are extracted from word lists in a language provide a useful context to investigate the similarity of sounds. The underlying assumption is that the more often sounds can be substituted in these maximally large contexts on the word level the more similar they are, especially with respect to their vocalic or consonantal properties. Again, an evaluation of the results is provided, together with a discussion of the possibilities of visually presenting such dependencies and patterns.
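The in absentia idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that each character of a word stands for one sound and that a minimal pair is a pair of words differing in exactly one position; the function name `substitution_counts` and the toy word list are hypothetical and are not taken from the method evaluated in Chapter 4.

```python
from collections import Counter, defaultdict


def substitution_counts(words):
    """Count how often two sounds replace each other in minimal pairs.

    Assumption: one character corresponds to one sound.
    """
    # Map each frame (word with one position left open) to the set of
    # sounds that are attested in the open position.
    frames = defaultdict(set)
    for word in set(words):
        for i, sound in enumerate(word):
            frames[(word[:i], word[i + 1:])].add(sound)

    # Every pair of sounds sharing a frame yields one minimal pair.
    counts = Counter()
    for sounds in frames.values():
        ordered = sorted(sounds)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                counts[(a, b)] += 1
    return counts


toy_words = ["pat", "pit", "pot", "bat", "bit"]
counts = substitution_counts(toy_words)
print(counts[("a", "i")])  # vowels substitute for each other
print(counts[("a", "p")])  # a vowel and a consonant do not
```

In this toy list, vowels only substitute for vowels and consonants only for consonants, which is the kind of pattern that a similarity measure based on such counts could exploit.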

Based on the vowel-consonant distinction, Chapter 5 explores the possibilities of inferring the syllabification of a word on the basis of the distribution of consonant and vowel clusters in reasonably-sized word lists in that language. Two major syllabification methods that have been described in the literature are discussed. One of them relies on the assumption that the distribution of word-peripheral consonant clusters helps in determining syllable boundaries in word-medial consonant clusters (Kuryłowicz 1948).

I present a qualitative and quantitative evaluation of my own method, which is a slight modification of an earlier approach (O’Connor and Trim 1953) that takes into account the individual frequencies of peripheral consonant clusters. The main point of this chapter is to show how much information about the structure of syllables is contained in the distribution of word-peripheral consonants rather than to improve existing syllabification methods.

Chapter 6 presents a visual analytic approach to the detection of vowel harmonic patterns in a language. For this purpose, an introduction to the phenomenon of vowel harmony is given, followed by a discussion of how the relevant patterns can be extracted from the data and visually represented so that an at-a-glance evaluation is made possible. I then go on to show that the underlying statistical values can also be employed to induce a clustering of harmony features in those languages where vowel harmony is present. This chapter presents a proof of concept for the fruitful integration of methods from visual analytics for the exploratory analysis of other phonotactic patterns. Vowel harmony is in that sense considered to be a case study for what is possible with such an approach. In addition, this chapter shows that the underlying statistics can also be employed to extract a typologically relevant measure that indicates the tendency for these languages to show conspicuous patterns of harmony.

Another study dealing with the induction of phonological features, this time looking at consonants exclusively, is provided in Chapter 7 for place of articulation categories.

To this end, the phenomenon of similar place avoidance (SPA; Pozdniakov and Segerer 2007) is investigated, which was originally formulated as a co-occurrence constraint in Semitic roots (cf. Greenberg 1950) but has recently been argued to be more widespread in the languages of the world. Its basic claim is that consonants in CVC sequences tend to have different place of articulation features. First, I demonstrate that the tendency for SPA indeed holds for a wider range of languages. For this purpose, I analyze such sequences in a cross-linguistic database of word forms. Then I argue that this principle can be exploited to infer a clustering of place features on the basis of CVC sequences. The assumption is that consonants are less similar with respect to their place feature the more often they co-occur in such contexts. The results of the clustering procedure are promising and also interesting for phonological theory with respect to the hierarchical structure and representation of place in feature geometries (Clements 1985; Clements and Hume 1994).

Some of the methods presented in this work have already been published and presented on various occasions. The PhonMatrix description in Section 3.4.4 is taken from Mayer and Rohrdantz (2013). Chapter 4 is partly based on Mayer (2010). The syllabification algorithm discussed in Chapter 5 is described in Mayer (2010), while Chapter 6 largely builds on Mayer et al. (2010a). Chapter 7 is an elaboration on Mayer (2014) and Mayer and Rohrdantz (2011).


2 Background

2.1 Introduction

The aim of this chapter is to give an overview of the framework of this thesis. First, the nature of the phenomena investigated is outlined in Section 2.2. Sections 2.3 and 2.4 then give an overview of the typological and computational components of the present work and how they fit together. This is followed by a description of the motivations for the present study (Section 2.5). An important contribution of this thesis addresses the question whether distinctive features, which are largely defined in terms of articulatory properties, are reflected in the distribution of speech sounds. Section 2.5.1 provides a brief overview of the different approaches in the early phonological literature to defining phoneme categories. I argue that the results in the subsequent chapters suggest that both approaches, the articulatory and the distributional approach, can be related to one another. Further, it is argued in Section 2.5.2 that typological research can benefit from a computational, data-driven approach in that parameters for cross-linguistic variation are automatically extracted from data and thus allow for a better comparability of the results. In Section 2.5.3, it is discussed to what extent a phonological component, in particular the segmentation of words into syllables, is relevant for higher level modules such as the unsupervised learning of the morphology of a language.

2.2 Setting the stage

The structure of words in a language is constrained in several ways. First and foremost, the language can only make use of a finite set of phonemes for composing its words, only rarely borrowing a sound from another language into its own system. The phoneme inventory is thus a characteristic feature of a language, which makes it stand apart from other languages. Yet it is not only the inventory as such in which languages differ but also the relative frequencies of individual sounds. Of course, different corpora from the same language will show varying frequency distributions. However, the overall tendency in these figures shows a consistent behavior (at least in its basic patterns) across corpora provided that they are large enough. Although the tendencies are quite strong and robust,1 it is still possible for a speaker of the language to violate some of these constraints and at the same time produce a natural passage of text. Consider the following paragraph from an English novel:

Branton Hills was a small town in a rich agricultural district; and having many a possibility for growth. But, through a sort of smug satisfaction with conditions of long ago, had no thought of improving such important adjuncts as roads; putting up public buildings, nor laying out parks; in fact a dormant, slowly dying community. So satisfactory was its status that it had no form of transportation to surrounding towns but by railroad, or “old Dobbin.” Now, any town thus isolating its inhabitants, will invariably find this big, busy world passing it by; glancing at it, curiously, as at an odd animal at a circus; and, you will find, caring not a whit about its condition. Naturally, a town should grow. You can look upon it as a child; which, through natural conditions, should attain manhood; and add to its surrounding thriving districts its products of farm, shop, or factory. It should show a spirit of association with surrounding towns; crawl out of its lair, and find how backward it is.

Ernest Vincent Wright, Gadsby (1939)

Most people would agree that this excerpt represents a grammatical and acceptable portion of the English language. However, it violates one of the strongest tendencies in English (orthographic) texts: the letter e, which is the most frequently occurring letter in English text, is never used in the whole novel of over 50,000 words. Its author Ernest Vincent Wright made sure not to use it by tying down the E-type bar of the typewriter. He mentions in the introduction to the novel that the greatest difficulty for him was to find substitutes for the past tense of regular verbs, all of which end in -ed.

This goes to show that tendencies in a language, even though they are very strong, do not necessarily have to be observed all the time, which is in contrast to absolute rules where there is a binary distinction between the presence or absence of a certain combination. Such rules are mostly claimed with respect to restrictions in the (co-)occurrence of sounds within a certain domain (e.g., syllable, morpheme, word), a subfield of phonology which is commonly referred to as the phonotactics of a language.

One of the tasks in a phonotactic analysis is to give absolute rules on the proper shape of syllables.2

The restrictions that I am mainly concerned with in this thesis are of the first type. They manifest themselves in statistical tendencies rather than absolute rules and can be detected by looking at the distribution of symbols. Harris (1963:5) defined the term distribution as “the freedom of occurrence of portions of an utterance relatively to each other.” The distributional or part/part relation constitutes one of two fundamental types of relations which characterize linear units that consist of discrete parts (the other one being the functional or part/whole relation). Juilland (1961:24-26) distinguishes between two relevant distributional positions, which are defined as Before and After, where “any part may (+) or may not (−) occur in these two positions relative to any and all parts of the same order.” The main question to be investigated here, however, is not whether a particular combination is permissible or not (i.e., absent or present in the language) but whether the combination is statistically underrepresented (or overrepresented) in the data, i.e., whether it is less (or more) frequently observed than expected under the assumption of independence of all sounds in the combination.3

1The relative frequencies for letters in English only differ slightly when compared across different input texts (cf. Trubetzkoy 1939 [1967]:233; Altmann and Lehfeldt 1980:102-103).

2A restriction on the combination of sounds would be the observation that English does not allow the consonant cluster [kn] at the beginning of a word.

The reason for statistically investigating such constraints is to distinguish between systematic and accidental patterns in the data. Every language shows phonotactic gaps in its lexicon due to the fact that the number of theoretically possible forms is much larger than what is needed for the average size of a lexicon. It is therefore necessary to make a distinction between systematic restrictions and accidental gaps, i.e., forms that do not exist in the language but could in principle be made use of. I will consider a constraint to be systematic if the restrictions involve groups of sounds that can be subsumed under natural classes.4 A gap is thereby taken to be more systematic the more general these classes are and the more languages show the same kinds of restriction. In this thesis, I will mainly be concerned with statistical tendencies rather than absolute gaps. For this purpose, it is not only tested whether a certain association is under- or overrepresented, but also to what degree the association holds. The quantitative assessment makes it possible to infer relationships between the members of the combinations, which, in turn, allow for the induction of hidden structures from the data.
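The comparison of observed and expected frequencies can be illustrated with a minimal sketch. It assumes, purely for illustration, that each character of a word stands for one sound and that the combinations of interest are pairs of adjacent sounds; the function name `observed_expected` and the toy word list are hypothetical, and the association measures actually used in this thesis are those described in Section 3.2.

```python
from collections import Counter


def observed_expected(words):
    """Return {(a, b): (observed, expected)} for adjacent sound pairs.

    Assumption: one character corresponds to one sound. The expected
    count is the count predicted if the first and the second member of
    a pair were chosen independently of each other.
    """
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))

    total = sum(pairs.values())
    first, second = Counter(), Counter()
    for (a, b), n in pairs.items():
        first[a] += n   # how often a occurs as first member
        second[b] += n  # how often b occurs as second member

    return {
        (a, b): (pairs[(a, b)], first[a] * second[b] / total)
        for a in first
        for b in second
    }


table = observed_expected(["abab", "ba"])
print(table[("a", "b")])  # observed above expected: overrepresented
print(table[("a", "a")])  # observed below expected: underrepresented
```

The ratio or difference of the two numbers gives a graded value, so that a combination is not merely classified as present or absent but placed on a scale of under- and overrepresentation.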

The present thesis combines a computational approach to investigate co-occurrence restrictions within words with cross-linguistic insights and motivations. The main characteristics of both components are laid out in the following two sections.

2.3 Linguistic typology

The goal of typological research in linguistics is to investigate the unity and diversity in the structures of the world’s languages. Cross-linguistic studies show that there is indeed a considerable amount of structural diversity. However, the fact that children are capable of learning the structure of any language within a relatively short period of time has given rise to a wealth of studies on universal characteristics of human languages (cf. Greenberg 1963).

Typological research (Comrie 1981; Croft 2002) is helpful in devising such universally applicable operations as it investigates cross-linguistically which patterns can be observed more frequently than others. One of the aims of linguistic typology is the search for language universals, i.e., patterns that can be found systematically across languages and which are potentially true for all languages.5 The latter is a very bold claim as typological studies have taught us that structural diversity is usually underestimated. Nevertheless, even though the more ambitious claim of an absolute universal may not be tenable, it should at least hold for a larger number of languages (so-called statistical universals).

3A description of how to compute whether a combination is under- or overrepresented is provided in Section 3.2.

4Halle (1961) defines a set of speech sounds to form a natural class “if fewer features are required to designate the class than to designate any individual sound in the class.”

5A large collection of language universals that have been claimed in the literature can be found in

Linguistic universals can be understood as structures or as processes (cf. Mayer et al. 2014). Whereas a structural universal claims that there are constant structures attested in all languages (e.g., all spoken languages have vowels), procedural universals refer to a universally applicable procedure that extracts different structures from different corpora in different languages. Some structural universals can be translated into a procedural form in order to integrate them into learning algorithms. As an example, consider the cross-linguistic finding that the universal model of the syllable consists of a consonant followed by a vowel.6 In this form, the universal states that all languages have syllables of the CV type in their repertoires whereas other syllable types can only be found if the language has a higher syllable complexity. Provided that syllables of the type CV are not only widespread cross-linguistically but also within languages as to their frequency of occurrence, this universal claim can be translated into a procedural universal. This procedural universal can in turn be formulated in a method that aims to induce a classification of all symbols into two groups which are involved in building up the universal (or unmarked) syllable type. If we assume that across a larger amount of data the general model of the syllable is CV and that words to a large extent consist of more than one syllable, an alternation of consonants and vowels is a plausible pattern that an inductive method should look for. The corresponding procedural universal can thus be formulated as follows. Assume a sequence of consonants and vowels (or CVCVCV...) to be the underlying pattern. A deviant structure where consonants (or vowels) group together is considered to be noise in the data. The desired result, a clustering of all symbols into vowels and consonants, is achieved if symbols which occur frequently together are taken to be from different groups. In Chapter 4, this procedural universal is tested with the implementation of Sukhotin’s algorithm.
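The procedural universal can be made concrete with a minimal sketch of Sukhotin’s algorithm as it is commonly described: all symbols start out as consonants, and the symbol that is most often adjacent to the remaining consonants is repeatedly reclassified as a vowel. The sketch assumes that each character of a word stands for one sound; the function name `sukhotin` and the toy word list are hypothetical, and the code is not the implementation evaluated in Chapter 4.

```python
from collections import defaultdict


def sukhotin(words):
    """Split the symbols of a word list into (vowels, consonants).

    Assumption: one character corresponds to one sound.
    """
    # Symmetric adjacency counts; pairs of identical symbols are ignored.
    counts = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:
                counts[a][b] += 1
                counts[b][a] += 1

    # Row sums: how often each symbol is adjacent to any other symbol.
    sums = {s: sum(counts[s].values()) for s in symbols}
    vowels = set()
    while True:
        candidates = {s: sums[s] for s in symbols - vowels}
        if not candidates:
            break
        # Sorting makes the choice deterministic in case of ties.
        best = max(sorted(candidates), key=candidates.get)
        if candidates[best] <= 0:
            break
        vowels.add(best)
        # Adjacency to a known vowel counts against being a vowel.
        for s in symbols - vowels:
            sums[s] -= 2 * counts[s][best]
    return vowels, symbols - vowels


vowels, consonants = sukhotin(["banana", "tomato"])
print(sorted(vowels), sorted(consonants))
```

The subtraction step is where the alternation assumption enters: a symbol that frequently stands next to an already identified vowel is pushed towards the consonant group.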

The study of language universals can generally be approached from two different angles which differ in a number of parameters (cf. Comrie 1981):

• the data base on which the existence of language universals is investigated (a sample of languages vs. only a single language);

• the degree of abstractness of the analysis on which the definition of the universals is based;

• and the kinds of explanations that are put forward in order to explain the existence of such universals.

the Konstanz Universals Archive at http://typo.uni-konstanz.de/archive/ (Plank and Filimonova 2002ff).

6“Since many languages lack syllables without a prevocalic consonant and/or with a postvocalic consonant, CV (Consonant + Vowel) proves to be the only universal model of the syllable” (Jakobson and Halle 1956 [2002]:51). See Chapter 5 for apparent counterexamples to this claim and their relevance for unsupervised syllabification methods.


The first approach is best known from the work of Noam Chomsky involving the notion of Universal Grammar (UG). Chomsky (1981:6) claims that “a great deal can be learned about UG from the study of a single language, if such study achieves sufficient depth to put forth rules or principles that have explanatory force but are underdetermined by evidence available to the language learner.” The second approach is most closely associated with the work of Joseph H. Greenberg, who opted for a study of language universals which is based on a wider range of languages and which remains theory-neutral as well as open-minded with respect to possible explanations.

The latter approach also reflects the orientation in this thesis, namely to test the various methods that are presented with data from a number of languages. They deal with the induction of comparatively simple and straightforward concepts, a limitation which is also largely due to the computational nature of the present approach.

Linguistic typology is not only interested in finding those patterns which are potentially true for all languages. It also investigates the structural diversity of languages, which naturally requires a cross-linguistic approach along the lines of Greenberg. A study which aims to induce phonological structures easily runs the risk of being too restrictive regarding the ways in which languages vary in their structure. Insights from cross-linguistic research can help to draw attention to the variety of structures that exist but at the same time show that this variation is still bounded. An induction system that is intended to be language-independent will therefore benefit from linguistic knowledge which takes this into account and is less prone to be overfitted to the languages on the basis of which the method is developed (cf. Bender 2009, 2011). I will come back to this point in the discussion of the syllabification procedure in Chapter 5.

In sum, typological knowledge is deemed helpful in finding relevant contexts for the induction of phonological structure for two reasons. On the one hand, it offers a variety of studies of those phenomena that are considered to be shared by all (or most) natural languages. On the other hand, cross-linguistic research shows the parameters of variation along which languages can differ. In this thesis, both kinds of knowledge are considered to be of great use when devising induction methods for phonological structures.

2.4 A computational approach

The goal of this thesis is to devise operations that are able to infer a certain structure from the data to which they are exposed. In that respect, the present work is similar in its goals to the research objectives in structural linguistics, for which Harris stated that it is about “the operations which the linguist may carry out in the course of his investigations, rather than a theory of the structural analyses which result from these investigations” (Harris 1963:1). These operations are intended to be valid across languages. To achieve this goal, linguistically motivated assumptions are formulated which constitute the basis for the operations. Apart from the typological motivation, the present work is also characterized by its computational approach. The operations are intended to be translatable into algorithms. An algorithm is usually defined as a finite list of well-defined instructions that transform a given input into an output (Cormen et al. 2009). Likewise, the methods that are discussed in this study can be seen as step-by-step procedures for inferring the latent phonological structures from the data. The aim in this study is to formulate the operations in a way that allows them to be implemented in a computer program rather than to be carried out by a linguist.

The major characteristics of this framework are the following (see Ellison 1994:1 for a similar characterization):

• The approach is unsupervised, i.e., no language-specific knowledge or training data with their correct classifications are used as the input for the methods.

• The input data are provided in a transcription which more or less reflects the actual pronunciation of the words in the language. An identification of the basic sounds of the language (phonemic representation) as well as the marking of word boundaries is therefore a prerequisite of all methods that are presented.7

• The algorithms are language-independent, i.e., they are supposed to work for input data of any spoken language provided it fulfills the above-mentioned criteria.

• The basic assumptions on which the algorithms are built are motivated by (cross-)linguistic research on the unity and diversity of language structures.

A computational approach has several advantages and limitations which can be compared to those of computer simulations of language evolution (cf. Cangelosi and Parisi 2002:8-15). The major advantage of a computational approach can be seen in the use of computer programs as experimental laboratories of how learning can be achieved given a certain input and certain assumptions about the learning procedure.

This makes it possible to distinguish those properties of the input data which are relevant for the acquisition of a certain structure from those which are irrelevant.

Testing such assumptions is accomplished by integrating them into the algorithms and checking the final results. Replicating the results for other languages or on different data of the same language is only a matter of changing the input to the program.

Additionally, the assumptions are testable on a larger number of input cases than could be done manually. This allows for the study of language as a complex system that is “made up of a large number of entities that by interacting locally with each other gives rise to global properties that cannot be predicted or deduced from” a smaller set of instances which can be handled manually (Cangelosi and Parisi 2002:11). At the same time, it avoids the danger of examining only a very limited amount of data to test certain predictions or rules of a theory.8 The researcher is in a position to test his or her assumptions on a larger scale with more input data (cf. Bender and Langendoen 2010). This is especially true if the assumptions under investigation can only be tested with enough input, which usually exceeds the amount of data that a human can handle in reasonable time. The concepts that are investigated in this thesis are a case in point.

7This requires a considerable amount of work on the part of the (field) linguist without which the methods described in this thesis would not be possible. In that sense, Pike (1947) describes a preliminary step to the present approach where the linguist designs a practical orthography for a language, thereby abstracting away from discernable but irrelevant phonetic differences.

8This point was made in Karttunen (2006) for two closely related analyses of Finnish prosody which turned out to have errors in their OT account of the problem when implemented in a finite-state approach and confronted with a larger number of test cases. Karttunen argues that a finite-state implementation of the GEN and EVAL functions guards against some of the errors but that debugging OT constraints is still a very difficult task.

Furthermore, the computational approach may detect errors in the assumptions that a human researcher might overlook when going through the analysis by hand.

This is particularly relevant if data from different languages are compared and even more so when various researchers take part in the process. In such cases, it is not guaranteed that everybody analyzes the data in the same way, which has the unwanted effect that the results are not directly comparable. A computer program, on the other hand, always processes the input identically. That is, even if the basic assumptions of the implementation turn out to be wrong at some point, the results that have been achieved are directly comparable. This is particularly important for a cross-linguistic approach where the extracted features should not be dependent on the analyst.9

When talking about its advantages the limits of a computational approach also have to be taken into consideration. The most obvious limitation that one has to face when using a computational approach is the simplification of the actual phenomena.

Most of the time, it is not possible to take into account all the relevant factors that might contribute to the final result. For that reason, most of the current computational approaches make simplifying assumptions about the input parameters and the results.10 On the other hand, the researcher can explicitly constrain the input to those factors which are to be tested for their relevance to the phenomenon at hand. Sometimes, however, more or less arbitrary decisions have to be made in order to be able to devise an algorithm. Whereas in theoretical investigations some aspects of an analysis can be left for future research, in a computational approach all details of the analysis must be fully specified (cf. Bender 2008).11

While the use of computational methods is a necessary and very promising approach to further insights about the structure of language, the simplifying assumptions that are built into these models in order to be implementable should still recognize that the input is taken from a natural language. This is especially true if the application of such a method is not of a practical nature but is seen in modeling aspects of human language learning with the help of sophisticated data mining techniques. The present approach is to take linguistic knowledge seriously in devising the implementation of the computational methods. The following example from the literature on language learning is meant to show what is not intended in the present computational approach.

9Although the computational approach guarantees that the input is processed in the same way, there is still a potential bias in the analysis step for the (phonemic) transcription of the input data by the (field) linguist. However, the computational approach does not introduce yet another source of bias in the study.

10However, it could be argued with John R. Pierce that the history of science has taught us at least one thing: “many of the most general and powerful discoveries of science have arisen, not through the study of phenomena as they occur in nature, but, rather, through the study of phenomena in man-made devices, in products of technology, if you will. This is because the phenomena in man’s machines are simplified and ordered in comparison with those occurring naturally, and it is these simplified phenomena that man understands most easily” (Pierce 1980:19).

11This has already been emphasized by Greenberg (1978:247): “It seems unavoidable, for purposes of valid comparison among languages, that one must make a decision in such matters which, even though it may be arbitrary, will be consistently applied.”


Rumelhart and McClelland (1985) present a connectionist approach to learning the English past tense verb forms from their present tense forms. Even though their method is primarily interesting from a morphological point of view, their success in mapping present to past tense forms on the basis of training examples is also phonologically significant as the ability to correctly predict the past tense forms rests on the identification of phonological classes which condition the choice of allomorphs of the past tense marker. The neural network that they train does not only learn to form the past tense of those words that they used as training data but is also able to generalize to unseen cases. In order to be able to predict such unseen forms it must have learned something about the phonological structure of the training words. In this sense, the method can be considered to have induced phonological generalizations about the data.

However, the problem with connectionist networks in general is that the acquired generalizations cannot be inspected (the so-called black-box property). The knowledge about the phonological structure that must have been learned is stored in the weights of the network connections and cannot be easily translated into a symbolic representation as no single parameter or weight in the network uniquely corresponds to a certain phonological structure or rule or to any single pair of present and past forms (see also Ellison 1990:8).

The most influential aspect of Rumelhart and McClelland’s work for this thesis is concerned with the non-language-specific structure of the network that was hinted at in Pinker and Prince (1988, 1989). Among many other things, one of the objections of Pinker and Prince to Rumelhart and McClelland (1985) is the fact that their network is not specifically tailored to model language. In fact, the network is capable of learning all sorts of relationships between two forms that are absent from natural languages.

In particular, Pinker and Prince (1988:100) state that the Wickelphone/Wickelfeature approach that Rumelhart and McClelland (1985) devised for the phonological representation of the word would also be able to learn a linguistically unrealistic mapping that relates a string to its mirror image reversal (i.e., understand would be mapped to dnatsrednu). Although theoretically possible, no language uses such a pattern for morphological purposes. However, Pinker and Prince (1988) show that the network that Rumelhart and McClelland set up would be able to learn such a mapping. In addition, Pinker and Prince (1989:185) remark that the feature triplet of the Wickelfeature approach suffers from the fact that it must represent both the decomposition of a string into its phonetic components and the order in which the components are arranged. The well-established units of phonological structure such as phonetic features, segments, syllables etc. are abandoned in favor of a unit that “demonstrably has no role in linguistic processes.” The main impact of the argumentation for the present work is with respect to the use of linguistically motivated contexts for the extraction of phonological information. In this sense, the approach by Rumelhart and McClelland (1985) can be regarded as an example of what is not intended in this thesis: the application of powerful machine learning techniques on linguistic structure at the expense of well-founded linguistically motivated assumptions.

The focus in this study is on the contexts which are used to calculate the statistical values on the basis of which the phonological structures are inferred. The intention is therefore not to give a full-fledged comparison of the various statistical and data mining techniques that could be used to infer the relevant structures. For the use of linguistically motivated contexts, I consider it more important to be able to inspect the results of the methods visually than to report figures of how closely they correspond to some gold standard. A visual inspection enables the researcher to examine specific problems that the methods have with some of the elements in the input data. These problems may be linguistically relevant or mere artifacts of the technique that is used. For this reason, a visual analytics approach (see Section 3.4) could also be of great importance for future research in this or similar areas.

In conclusion, the computational nature of the approach offers some desirable aspects for a scientific investigation of language data. It provides an objective way to analyze the input which is not influenced by the researcher’s idea of what the result of such an analysis should look like. At the same time, the analyses can easily be replicated with the same or different input data in order to test their underlying assumptions. Additionally, the computational approach also forces the investigator to be very explicit in setting up the basic assumptions and procedures of the analysis. Otherwise, the algorithms cannot be implemented in a computer program.

2.5 Motivation

Investigating co-occurrence constraints of sounds within words may be interesting in its own right, for example, for an analysis of the frequency distributions of symbols in the corpus or the description of preferences in the sequencing of sounds for individual languages. However, the reason why I consider statistical tendencies in the combination of sounds to be important for linguistic research is that they potentially contain latent information about the phonological structure of the languages. The main aim of this study is therefore to use these tendencies to induce this structure from the data. Interestingly, the information that is contained in the data can be related to structures that are typically derived in a different way, i.e., not by looking at the statistical distribution of sounds within the words of a language.

• Phonological features, such as the discrimination of vowels and consonants in Chapter 4 or the distinction of consonants regarding their place of articulation in Chapter 7, are usually defined in terms of articulatory or acoustic properties. This thesis shows that a clustering of sounds on the basis of these features can also be detected when looking at the distribution of sounds in relevant contexts within words.

• Patterns of vowel harmony (Chapter 6) are detected by systematically comparing contrasts in the exponents of morphological markers with respect to their vowels. In this work, it will be demonstrated that similar results can be achieved by investigating the distribution of VCV sequences within words.

• The syllabification of words is achieved by assuming certain underlying principles that determine the proper shape of syllables (Chapter 5). Alternatively, syllable boundaries can be approximated from the distribution of consonants at the periphery of words.
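The two distributional contexts named in the last two points can be sketched as follows. This is a minimal illustration under explicit assumptions, not the procedure developed in the later chapters. The vowel inventory `VOWELS` is stipulated here, whereas the thesis induces the vowel/consonant distinction from the data. Each symbol is a single character, the word list is an invented toy sample, and the function names `vcv_pairs` and `edge_clusters` are hypothetical.

```python
from collections import Counter

# Assumption: a fixed vowel inventory, given in advance for illustration only.
VOWELS = set("aeiou")


def vcv_pairs(word):
    """Return (V1, V2) for every vowel-consonant-vowel sequence in the word."""
    pairs = []
    for i in range(len(word) - 2):
        first, middle, last = word[i], word[i + 1], word[i + 2]
        if first in VOWELS and middle not in VOWELS and last in VOWELS:
            pairs.append((first, last))
    return pairs


def edge_clusters(word):
    """Return the consonant sequences at the left and right edge of the word."""
    start = 0
    while start < len(word) and word[start] not in VOWELS:
        start += 1
    end = len(word)
    while end > start and word[end - 1] not in VOWELS:
        end -= 1
    return word[:start], word[end:]


# Invented toy strings, not data from any of the corpora used in the thesis.
toy_words = ["kitabi", "strand", "evuler"]

vowel_cooccurrence = Counter(p for w in toy_words for p in vcv_pairs(w))
onsets = Counter(edge_clusters(w)[0] for w in toy_words)
codas = Counter(edge_clusters(w)[1] for w in toy_words)

print(vowel_cooccurrence)
print(onsets, codas)
```

Counts of this kind are raw co-occurrence frequencies. Turning them into evidence for harmony classes or syllable boundaries would require the statistical measures discussed in Chapter 3.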
