
2.4 A computational approach

The goal of this thesis is to devise operations that are able to infer a certain structure from the data to which they are exposed. In this respect, the present work shares the research objectives of structural linguistics, about which Harris stated that it concerns “the operations which the linguist may carry out in the course of his investigations, rather than a theory of the structural analyses which result from these investigations” (Harris 1963:1). These operations are intended to be valid across languages. To achieve this goal, linguistically motivated assumptions are formulated which constitute the basis for the operations. Apart from its typological motivation, the present work is also characterized by its computational approach: the operations are intended to be translatable into algorithms. An algorithm is usually defined as a finite list of well-defined instructions that transforms a given input into an output (Cormen et al. 2009). Likewise, the methods that are discussed in this study can be seen as step-by-step procedures for inferring the latent phonological structures from the data. The aim of this study is to formulate the operations in a way that allows them to be implemented in a computer program rather than carried out by a linguist.

The major characteristics of this framework are the following (see Ellison 1994:1 for a similar characterization):

• The approach is unsupervised, i.e., no language-specific knowledge or training data with their correct classifications are used as the input for the methods.

• The input data are provided in a transcription which more or less reflects the actual pronunciation of the words in the language. An identification of the basic sounds of the language (phonemic representation) as well as the marking of word boundaries is therefore a prerequisite of all methods that are presented.⁷

• The algorithms are language-independent, i.e., they are supposed to work for input data of any spoken language provided the data fulfill the above-mentioned criteria.

• The basic assumptions on which the algorithms are built are motivated by (cross-)linguistic research on the unity and diversity of language structures.

A computational approach has several advantages and limitations which can be compared to those of computer simulations of language evolution (cf. Cangelosi and Parisi 2002:8-15). The major advantage of a computational approach can be seen in the use of computer programs as experimental laboratories for studying how learning can be achieved given a certain input and certain assumptions about the learning procedure. This makes it possible to distinguish those properties of the input data which are relevant for the acquisition of a certain structure from those which are irrelevant. Testing such assumptions is accomplished by integrating them into the algorithms and checking the final results. Replicating the results for other languages or on different data of the same language is only a matter of changing the input to the program.

Additionally, the assumptions are testable on a larger number of input cases than could be checked manually. This allows for the study of language as a complex system that is “made up of a large number of entities that by interacting locally with each other gives rise to global properties that cannot be predicted or deduced from” a smaller set of instances which can be handled manually (Cangelosi and Parisi 2002:11). At the same time, it avoids the danger of examining only a very limited amount of data to test certain predictions or rules of a theory.⁸ The researcher is in a position to test his or her assumptions on a larger scale with more input data (cf. Bender and Langendoen 2010). This is especially true if the assumptions under investigation can only be tested with enough input, which usually exceeds the amount of data that a human can handle in reasonable time. The concepts that are investigated in this thesis are a case in point.

⁷ This requires a considerable amount of work on the part of the (field) linguist without which the methods described in this thesis would not be possible. In that sense, Pike (1947) describes a preliminary step to the present approach where the linguist designs a practical orthography for a language, thereby abstracting away from discernible but irrelevant phonetic differences.

⁸ This point was made in Karttunen (2006) for two closely related analyses of Finnish prosody which turned out to have errors in their OT account of the problem when implemented in a finite-state approach and confronted with a larger number of test cases. Karttunen argues that a finite-state implementation of the Gen and Eval functions guards against some of the errors but that debugging OT constraints is still a very difficult task.

Furthermore, the computational approach may detect errors in the assumptions that a human researcher might overlook when going through the analysis by hand. This is particularly relevant if data from different languages are compared, and even more so when various researchers take part in the process. In such cases, it is not guaranteed that everybody analyzes the data in the same way, which has the unwanted effect that the results are not directly comparable. A computer program, on the other hand, always processes the input identically. That is, even if the basic assumptions of the implementation turn out to be wrong at some point, the results that have been achieved are directly comparable. This is particularly important for a cross-linguistic approach where the extracted features should not be dependent on the analyst.⁹

Besides its advantages, the limits of a computational approach also have to be taken into consideration. The most obvious limitation that one has to face when using a computational approach is the simplification of the actual phenomena. Most of the time, it is not possible to take into account all the relevant factors that might contribute to the final result. For that reason, most of the current computational approaches make simplifying assumptions about the input parameters and the results.¹⁰ On the other hand, the researcher can explicitly constrain the input to those factors which are to be tested for their relevance to the phenomenon at hand. Sometimes, however, more or less arbitrary decisions have to be made in order to be able to devise an algorithm. Whereas in theoretical investigations some aspects of an analysis can be left for future research, in a computational approach all details of the analysis must be fully specified (cf. Bender 2008).¹¹

While the use of computational methods is a necessary and very promising approach to further insights about the structure of language, the simplifying assumptions that are built into these models to make them implementable should still respect the fact that the input is taken from a natural language. This is especially true if the application of such a method is not of a practical nature but is seen in modeling aspects of human language learning with the help of sophisticated data mining techniques. The present approach is to take linguistic knowledge seriously in devising the implementation of the computational methods. The following example from the literature on language learning is meant to show what is not intended in the present computational approach.

⁹ Although the computational approach guarantees that the input is processed in the same way, there is still a potential source of bias in the (phonemic) transcription of the input data by the (field) linguist. However, the computational approach does not introduce yet another source of bias in the study.

¹⁰ However, it could be argued with John R. Pierce that the history of science has taught us at least one thing: “many of the most general and powerful discoveries of science have arisen, not through the study of phenomena as they occur in nature, but, rather, through the study of phenomena in man-made devices, in products of technology, if you will. This is because the phenomena in man’s machines are simplified and ordered in comparison with those occurring naturally, and it is these simplified phenomena that man understands most easily” (Pierce 1980:19).

¹¹ This has already been emphasized by Greenberg (1978:247): “It seems unavoidable, for purposes of valid comparison among languages, that one must make a decision in such matters which, even though it may be arbitrary, will be consistently applied.”

Rumelhart and McClelland (1985) present a connectionist approach to learning the English past tense verb forms from their present tense forms. Even though their method is primarily interesting from a morphological point of view, their success in mapping present to past tense forms on the basis of training examples is also phonologically significant, as the ability to correctly predict the past tense forms rests on the identification of phonological classes which condition the choice of allomorphs of the past tense marker. The neural network that they train not only learns to form the past tense of the words used as training data but is also able to generalize to unseen cases. In order to predict such unseen forms, it must have learned something about the phonological structure of the training words. In this sense, the method can be considered to have induced phonological generalizations about the data.

However, the problem with connectionist networks in general is that the acquired generalizations cannot be inspected (the so-called black-box property). The knowledge about the phonological structure that must have been learned is stored in the weights of the network connections and cannot be easily translated into a symbolic representation, as no single parameter or weight in the network uniquely corresponds to a certain phonological structure or rule or to any single pair of present and past forms (see also Ellison 1990:8).

The aspect of Rumelhart and McClelland’s work that is most influential for this thesis concerns the non-language-specific structure of the network, as pointed out in Pinker and Prince (1988, 1989). Among many other things, one of the objections of Pinker and Prince to Rumelhart and McClelland (1985) is the fact that their network is not specifically tailored to model language. In fact, the network is capable of learning all sorts of relationships between two forms that are absent from natural languages.

In particular, Pinker and Prince (1988:100) state that the Wickelphone/Wickelfeature approach that Rumelhart and McClelland (1985) devised for the phonological representation of the word would also be able to learn a linguistically unrealistic mapping that relates a string to its mirror image reversal (i.e., understand would be mapped to dnatsrednu). Although theoretically possible, no language uses such a pattern for morphological purposes. However, Pinker and Prince (1988) show that the network that Rumelhart and McClelland set up would be able to learn such a mapping. In addition, Pinker and Prince (1989:185) remark that the feature triplet of the Wickelfeature approach suffers from the fact that it must represent both the decomposition of a string into its phonetic components and the order in which the components are arranged. The well-established units of phonological structure such as phonetic features, segments, syllables etc. are abandoned in favor of a unit that “demonstrably has no role in linguistic processes.” The main impact of this argumentation for the present work is with respect to the use of linguistically motivated contexts for the extraction of phonological information. In this sense, the approach by Rumelhart and McClelland (1985) can be regarded as an example of what is not intended in this thesis: the application of powerful machine learning techniques to linguistic structure at the expense of well-founded linguistically motivated assumptions.
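To make the representational point concrete, the following minimal Python sketch (my own illustration, using plain character trigrams in place of Rumelhart and McClelland's actual Wickelfeature encoding; the function names are invented for this example) shows why the Wickelphone decomposition treats a string and its mirror image as trivially related:

    def wickelphones(word, boundary="#"):
        """Decompose a word into its set of Wickelphones, i.e. the
        triples of a segment with its left and right neighbors, with
        '#' marking the word boundary (simplified to raw trigrams)."""
        padded = boundary + word + boundary
        return {padded[i - 1:i + 2] for i in range(1, len(padded) - 1)}

    def mirror(triples):
        """Reverse every triple. Over trigrams, this is all it takes to
        turn a word's decomposition into that of its reversal, which is
        the unattested mapping criticized by Pinker and Prince (1988)."""
        return {t[::-1] for t in triples}

    print(sorted(wickelphones("cat")))  # ['#ca', 'at#', 'cat']
    assert mirror(wickelphones("understand")) == wickelphones("dnatsrednu")

Since nothing in such a representation privileges attested phonological relations over string reversal, a learner defined over it has no built-in bias toward natural language mappings.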

The focus in this study is on the contexts which are used to calculate the statistical values on the basis of which the phonological structures are inferred. The intention is therefore not to give a full-fledged comparison of the various statistical and data mining techniques that could be used to infer the relevant structures. Given the use of linguistically motivated contexts, I consider it more important to be able to visually inspect the results of the methods than to give figures on how closely they correspond to some gold standard. A visual inspection enables the researcher to examine specific problems of the methods with some of the elements in the input data. These problems may be linguistically relevant or mere artifacts of the technique that is used. For this reason, a visual analytics approach (see Section 3.4) could also be of great importance for future research in this or similar areas.

In conclusion, the computational nature of the approach offers some desirable aspects for a scientific investigation of language data. It provides an objective way to analyze the input which is not influenced by the researcher’s idea of what the result of such an analysis should look like. At the same time, the analyses can easily be replicated with the same or different input data in order to test their underlying assumptions. Additionally, the computational approach forces the investigator to be very explicit in setting up the basic assumptions and procedures of the analysis; otherwise, the algorithms cannot be implemented in a computer program.

2.5 Motivation

Investigating co-occurrence constraints of sounds within words may be interesting in its own right, for example, for an analysis of the frequency distributions of symbols in the corpus or the description of preferences in the sequencing of sounds for individual languages. However, the reason why I consider statistical tendencies in the combination of sounds to be important for linguistic research is that they potentially contain latent information about the phonological structure of the languages. The main topic of this study is therefore to make use of these tendencies to induce this structure from the data. Interestingly, the information that is contained in the data can be related to structures that are typically derived in a different way, i.e., not by looking at the statistical distribution of sounds within words of a language.

• Phonological features, such as the discrimination of vowels and consonants in Chapter 4 or the distinction of consonants regarding their place of articulation in Chapter 7, are usually defined in terms of articulatory or acoustic properties. This thesis shows that a clustering of sounds on the basis of these features can also be detected when looking at the distribution of sounds in relevant contexts within words (a classical illustration follows after this list).

• Patterns of vowel harmony (Chapter 6) are detected by systematically comparing contrasts in the exponents of morphological markers with respect to their vowels. In this work, it will be demonstrated that similar results can be achieved by investigating the distribution of VCV sequences within words.

• The syllabification of words is achieved by assuming certain underlying principles that determine the proper shape of syllables (Chapter 5). Alternatively, syllable boundaries can be approximated from the distribution of consonants at the periphery of words.
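As a classical illustration of the first point, consider Sukhotin's algorithm, which infers the vowel/consonant split purely from adjacency counts by exploiting the tendency of vowels and consonants to alternate. The following Python sketch is my own rendering of that well-known baseline on toy data; it stands in for, and is not necessarily identical to, the method developed in Chapter 4.

    from collections import Counter

    def sukhotin(words):
        """Split an alphabet into vowels and consonants from nothing
        but adjacency counts within words (Sukhotin's algorithm)."""
        # Symmetric adjacency matrix; the diagonal is conventionally zero.
        m = Counter()
        alphabet = set("".join(words))
        for w in words:
            for a, b in zip(w, w[1:]):
                if a != b:
                    m[a, b] += 1
                    m[b, a] += 1
        # Row sums; initially every symbol is assumed to be a consonant.
        sums = {s: sum(m[s, t] for t in alphabet) for s in alphabet}
        vowels = set()
        while sums:
            v, best = max(sums.items(), key=lambda kv: kv[1])
            if best <= 0:  # no remaining symbol behaves vowel-like
                break
            vowels.add(v)  # the symbol with the largest sum is a vowel
            del sums[v]
            for s in sums:  # discount adjacency to the newly found vowel
                sums[s] -= 2 * m[v, s]
        return vowels, alphabet - vowels

    words = ["banana", "patata", "motor", "sonata"]
    print(sukhotin(words))  # vowels: {'a', 'o'}; consonants: the rest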

Basically, the constraints on the structure of words that are investigated in this thesis can be divided into two groups: those which are considered to hold for all languages and those which are only effective in a subset of the world’s languages.

An example of the first type would be the constraint on the alternation of vowels and consonants, which can be used to induce a grouping of all sounds into the two major categories of vowels and consonants for all languages (see Chapter 4). The second type can be exemplified by restrictions on the co-occurrence of vowels in vowel-harmonic languages: for languages which do not have vowel harmony, an investigation of these contexts yields a more or less accidental patterning of vowels, whereas for vowel-harmonic languages conspicuous patterns emerge from the data (see Chapter 6). Even though some of the constraints are not universal and exhibit language-specific properties, the operation which is applied in order to detect the constraints is language-independent. This is similar to the phonotactic analysis of a language. Languages differ as to which combinations of sounds are permissible,¹² but the operations on how to arrive at such an analysis are always the same. In that sense, all of the methods that are presented in this thesis are considered to be (procedural) universals. The motivation for the present work can largely be divided into three areas which will be discussed in the following three sections.
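To give a flavor of what such an investigation of vowel co-occurrence looks like, the sketch below tallies which vowel pairs appear in VCV windows within words. It is my own toy illustration with invented Turkish-like forms and an assumed vowel inventory; the actual contexts and statistics used in Chapter 6 may differ.

    from collections import Counter

    def vcv_counts(words, vowels):
        """Count the vowel pairs that occur in vowel-consonant-vowel
        windows. In a vowel-harmonic language, harmonically compatible
        pairs should dominate the table; elsewhere the counts should
        look closer to chance co-occurrence."""
        counts = Counter()
        for w in words:
            for a, b, c in zip(w, w[1:], w[2:]):
                if a in vowels and b not in vowels and c in vowels:
                    counts[a, c] += 1
        return counts

    # Invented forms: front vowels pattern with front, back with back.
    words = ["dere", "kedi", "masa", "kapı"]
    print(vcv_counts(words, vowels=set("aeiıoöuü")))
    # Counter({('e','e'): 1, ('e','i'): 1, ('a','a'): 1, ('a','ı'): 1})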

2.5.1 Distributional correlates of features

One important motivation for this work can be seen in the study of how different approaches to the definition of phoneme categories can be related to one another. In early work on phonological theory, linguists conceived of two main approaches to classifying phonemes into different categories (cf. Fischer-Jørgensen 1952). American structuralists like Edward Sapir¹³ and Leonard Bloomfield were in favor of a grouping according to the possibilities of combination of phonemes in the speech chain, as it is most relevant for a structural description of the language. Bloomfield claimed that this is the only pertinent approach and argued that a classification in terms of distinctive (i.e., mostly articulatory) features is “irrelevant to the structure of the language because [tables with distinctive features] group the phonemes according to the linguist’s notion of their physiological character, and not according to the parts which the several phonemes play in the working of the language” (Bloomfield 1933:129-130).

Later, several works in the structuralist framework dealt with the question of how to describe linguistic data in an exact and well-defined way. One of the questions that was of particular interest was the classification of sounds in a language according to their distributional properties. Fischer-Jørgensen (1952) gives an overview of earlier ideas on how to define phoneme categories on a distributional basis. Some of them, such as her own, are methods which can be applied to any language whereas others are tailored to a particular language.¹⁴ Fischer-Jørgensen (1952) defines the basic unit on which the classification is based as the phonemic “syllable” in accordance with K. L. Pike, who considered it to be “the basic structural unit which serves best as a point of reference for describing the distribution of the phonemes in the language in question” (Fischer-Jørgensen 1952:15). The general problem might be that the unit serving as the best basis will not be the same across languages. However, in most languages the

¹² For instance, German allows the consonant cluster [kn] in the onset of a syllable whereas English does not.

¹³ The idea of using distributional criteria for the definition of phoneme categories can be traced back to Sapir (1925), who was presumably the first to conceive of the possibility of such a classification. Sapir (1925:48) writes: “How can a sound be assigned a ‘place’ in a phonetic pattern over and above its natural classification on organic and acoustic grounds? The answer is simple. A ‘place’ is intuitively found for a sound [...] in such a system because of a general feeling of its phonetic relationship resulting from all the specific phonetic relationships (such as parallelism, contrast, combination, imperviousness to combination and so on) to all other sounds.”