Related work - Vowel Harmony - The Induction of Phonological Structure

Vowel Harmony

6.3 Related work

In the previous section, we have seen the basic concepts of VH and how it can be analyzed in terms of vowel successions as they occur in word forms of the language.

Before explaining in more detail how the visualization is generated, this section provides an overview of related work that deals with data-driven computational analyses of vowel harmony patterns.

Several authors have already been concerned with the study and identification of VH on the basis of raw texts. In an early approach, Altmann (1986) derives a statistical method to calculate the tendency for languages to have identical vowels occurring next to each other (what he calls Tendenzielle Vokalharmonie, i.e., tendential vowel harmony). He illustrates the usefulness of his approach with a number of Indonesian languages, for all of which the tendency toward successions of identical vowels could be statistically confirmed. His work is therefore concerned more with the detection of reduplication processes in which identical vowels tend to follow one another (see Sections 6.6.4 and 6.6.3 for a comparison of genuine VH languages and those showing the tendency for succession of identical vowels).
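The idea of testing for a tendency toward identical vowel successions can be illustrated with a small sketch. It is not Altmann's (1986) statistic, whose details are not given here; it merely compares the observed share of identical successive vowels with a chance baseline that assumes the two positions are independent. The function names, the five-vowel inventory and the toy strings are assumptions made for illustration only.

```python
from collections import Counter

# Assumption: a toy five-vowel inventory; a real study would use the
# vowel inventory of the language under investigation.
VOWELS = frozenset("aeiou")


def vowel_successions(words, vowels=VOWELS):
    """Yield pairs of consecutive vowels within each word, skipping consonants."""
    for word in words:
        seq = [ch for ch in word.lower() if ch in vowels]
        yield from zip(seq, seq[1:])


def identical_succession_rates(words, vowels=VOWELS):
    """Return (observed, expected) proportions of identical vowel successions.

    The expected value is a simple independence baseline (hypothetical,
    not Altmann's statistic): the probability that both positions hold
    the same vowel if they were drawn independently from the observed
    first-position and second-position vowel frequencies.
    """
    pairs = list(vowel_successions(words, vowels))
    if not pairs:
        raise ValueError("no vowel successions found")
    n = len(pairs)
    observed = sum(v1 == v2 for v1, v2 in pairs) / n
    first = Counter(v1 for v1, _ in pairs)
    second = Counter(v2 for _, v2 in pairs)
    expected = sum((first[v] / n) * (second[v] / n) for v in vowels)
    return observed, expected


# Toy strings, not real corpus data.
observed, expected = identical_succession_rates(["banana", "tiki", "kelo"])
print(f"observed={observed:.4f} expected={expected:.4f}")
```

An observed proportion well above the baseline would point to a tendency toward identical successions; whether the difference is significant would still require a proper statistical test.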

Hare (1990) presents the modeling of VH in a connectionist framework on Hungarian data. In her account, the words are solely represented by their vowels, which are encoded according to their backness, height, rounding and sonority features in bit strings and then serve as the input for a simple recurrent network. The model thereby correctly predicts the pattern of behavior for both harmonic and transparent vowels.

In his approach to learning phonological structure, Ellison (1994) developed an account of VH using non-deterministic two-state automata and a methodology similar to the one used for the vowel/consonant discrimination task (cf. Section 4.2.6). His method is capable of discovering the correct harmonic patterns (including exceptions, some involving transparency and opacity) for data on Turkish, Kirghiz, Hungarian and Yoruba, while at the same time it finds no restrictive harmony for Latin vowel sequences.

More recently, John Goldsmith and his collaborators have approached the question of the nature of vowel harmony systems. Two papers in particular are concerned with developing a formal device for the induction of a grammar and a formal device for the evaluation of a grammar from data, respectively. Goldsmith and Riggle (2012) have concentrated on the latter and investigated VH in Finnish by taking advantage of information-theoretic concepts in order to better understand the phonological structure of the language. They developed a framework of phonological analysis (along the lines of the original framework of generative grammar) which is intended to determine the best candidate from a given set of possible analyses for a given set of data from a language (in Chomsky’s 1965 sense). In contrast to earlier work in generative phonology, however, their device is able to quantitatively and algorithmically defend such an analysis for a language rather than to discover a certain phenomenon in the data.

The discovery of VH is one of the goals in Goldsmith and Xanthos (2009). Their paper deals with several questions regarding phonotactics as well as the phonological categories that are required for the analysis of phonotactic constraints (see also their approach to discriminating vowels and consonants in Section 4.2.10). They present two approaches to discovering VH in Finnish. First, they describe a spectral approach in which the vowels of Finnish seem to fall into two groups, one including front and neutral vowels, the other containing the back vowels. Second, an HMM is trained with maximum-likelihood methods on the same data, which results in a partitioning of all values into four larger groups: the front vowels with a very large positive log ratio, the back vowels a and u with a very large negative log ratio, the two neutral vowels with a log ratio close to zero and, finally, a fourth group containing the back vowel o with a larger negative log ratio that is still separate from the larger values of the other back vowels. Both of the approaches for the discovery and evaluation of VH have been applied to a single language (Finnish) but could in principle be used for other languages as well.

Baker (2009) extends Goldsmith and Riggle (2012) and attempts to detect VH in four different languages (Turkish, Finnish, English and Italian).³ For this purpose, two methods are described for learning and modeling VH on the basis of text corpora. With the first method, a two-state HMM for Finnish resulted in a final configuration where both states have a high tendency (more than 70%) to remain in the same state, which is a clear sign that the language has VH. The method was less successful for Turkish, where only a four-state HMM could correctly learn the generalizations of Turkish VH. At the same time, it rightly rejected English and Italian as VH languages.

The second approach described in the paper uses pointwise mutual information between vowel pairs in a Boltzmann distribution to detect VH. Whereas it correctly models the two harmony types in Turkish, it has more difficulty capturing neutral vowels in Finnish. Again, it correctly recognizes English and Italian as non-VH languages.
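To make the notion of pointwise mutual information between vowel pairs concrete, the following minimal sketch computes PMI for successive vowels in a word list. It covers only the PMI step and does not reproduce Baker's (2009) model; in particular, the Boltzmann distribution is left out. The function name `vowel_pair_pmi`, the vowel inventory and the toy strings are assumptions made for illustration.

```python
import math
from collections import Counter


def vowel_pair_pmi(words, vowels):
    """Return PMI (in bits) for every observed pair of successive vowels.

    PMI(v1, v2) = log2( P(v1, v2) / (P1(v1) * P2(v2)) ), where P1 and P2
    are the marginal distributions of the first and second position.
    Unobserved pairs are omitted because their PMI would be minus infinity.
    """
    vowels = set(vowels)
    pair_counts = Counter()
    for word in words:
        # Reduce each word to its vowel sequence, then count adjacent pairs.
        seq = [ch for ch in word.lower() if ch in vowels]
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no vowel successions found")
    first, second = Counter(), Counter()
    for (v1, v2), count in pair_counts.items():
        first[v1] += count
        second[v2] += count
    return {
        (v1, v2): math.log2(
            (count / total) / ((first[v1] / total) * (second[v2] / total))
        )
        for (v1, v2), count in pair_counts.items()
    }


# Toy strings, not real corpus data.
pmi = vowel_pair_pmi(["kata", "sana", "pele", "tele"], vowels="ae")
for pair, value in sorted(pmi.items()):
    print(pair, round(value, 3))
```

Positive values mark vowel pairs that co-occur more often than their individual frequencies would suggest, negative values mark pairs that are avoided.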

A related approach to quantifying co-occurrence patterns in a corpus is the Vowel Harmony Calculator (Harrison et al. 2004).⁴ It is an online tool that computes the percentage of harmonic words in the input corpus and the harmony index, i.e., the extent to which the percentage of harmonic words exceeds random chance. Initially developed to measure backness harmony in Turkic languages, it can also be used to quantify other harmony systems, such as Uralic backness harmony and Bantu height harmony. One of the main advantages of the Vowel Harmony Calculator is that there is a web interface to its Perl implementation, so that researchers can easily access it to test their own corpora. A major drawback is that only ASCII text is accepted as input.

³ Baker’s analysis is based on an earlier manuscript of Goldsmith and Riggle’s (2012) paper.

⁴ http://www.swarthmore.edu/SocSci/harmony/public_html/

After uploading the corpus, the user is asked to specify two classes of vowels. Polysyllabic words whose vowels belong to only one of these two classes are taken to be harmonic. Monosyllabic words are ignored. The user can additionally specify neutral vowels, which are then ignored in the calculations. In order to determine the harmony index for the input corpus, the tool first calculates the harmony threshold, which is the percentage of words expected to be harmonic by chance. The harmony threshold is based on the distribution of vowels in the corpus and on the average syllable count of polysyllabic words. The harmony index is then calculated as the percentage of harmonic words minus the harmony threshold. In addition to the percentage of harmonic words, harmony index and threshold, the output of the calculator comprises the following results: mean syllable count, mean syllable count in polysyllabic words, harmony threshold in the first two syllables, percentage of harmonic words considering only the first two syllables and percentage of vowels in each class (class skewing). However, it only gives an overview of the overall harmonic strength in the corpus and does not directly provide information about individual vowel pairs. For this, the user has to inspect one of the three log files that are also produced as a result: a harmony log, which contains details on harmonic word distributions, a disharmony log, which lists all disharmonic words, and a frequency log, which shows vowel frequencies as well as symbol co-occurrence tables.
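The calculation described above can be sketched as follows. This is not the calculator's Perl implementation, and the exact threshold formula is not given here. As a loudly labeled assumption, the threshold is taken to be the chance that a word of average length draws all its vowels from a single class, p_A^n + p_B^n, where p_A and p_B are the class proportions and n is the mean syllable count of polysyllabic words. Syllables are approximated by counting vowels from the two classes, so neutral vowels and consonants are simply skipped. The function name and the toy strings are hypothetical.

```python
def harmony_index(words, class_a, class_b):
    """Return (percent_harmonic, threshold, index) for a word list.

    Characters outside both classes (consonants, neutral vowels) are skipped.
    Words with fewer than two class vowels count as monosyllabic and are
    ignored. The threshold formula is an assumption, not the documented one.
    """
    class_a, class_b = set(class_a), set(class_b)
    harmonic = 0
    lengths = []          # vowel counts of the polysyllabic words
    count_a = count_b = 0  # class totals, used for the chance estimate
    for word in words:
        seq = [ch for ch in word.lower() if ch in class_a or ch in class_b]
        if len(seq) < 2:
            continue
        lengths.append(len(seq))
        n_a = sum(ch in class_a for ch in seq)
        count_a += n_a
        count_b += len(seq) - n_a
        # Harmonic: all vowels of the word come from one class only.
        if n_a == 0 or n_a == len(seq):
            harmonic += 1
    if not lengths:
        raise ValueError("no polysyllabic words found")
    percent_harmonic = 100 * harmonic / len(lengths)
    mean_syllables = sum(lengths) / len(lengths)
    p_a = count_a / (count_a + count_b)
    p_b = 1 - p_a
    # Assumed chance level: every vowel of an average-length word falls
    # into class A, or every vowel falls into class B.
    threshold = 100 * (p_a ** mean_syllables + p_b ** mean_syllables)
    return percent_harmonic, threshold, percent_harmonic - threshold


# Toy strings, not real corpus data; "ka" is monosyllabic and ignored.
pct, thr, idx = harmony_index(["kala", "kele", "kale", "ka"], "aou", "e")
print(f"harmonic={pct:.1f}% threshold={thr:.1f}% index={idx:.1f}")
```

A harmony index close to zero means that the share of harmonic words is no higher than chance would predict under this baseline.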

The major difference between the Vowel Harmony Calculator and the visualization approach that is presented in this chapter is that the former is restricted to the context of vowel harmony and requires the user to input the harmony classes beforehand. The Vowel Harmony Calculator is a way to quantify the notion of vowel harmony given the harmony classes of the language as input. The present approach, however, is intended to detect such co-occurrence patterns in the first place, without requiring any language-specific information as input. The exploratory aspect is also enhanced through the visualization component that helps to discover regularities in the data more easily.

In her master’s thesis, Knowles (2012) compared a number of machine learning and NLP methods to explore vowel harmony patterns in an unsupervised manner. Her approach also features a tool for visualizing vowel harmony systems that is partly influenced by the work presented here (Mayer et al. 2010a). The visualization consists of a heat map whose vowel grid is manually ordered according to the feature table of the vowels (cf. Tables 6.1 and 6.5 for Turkish and Finnish, respectively). The coloring of the cells corresponds to the normalized probabilities of vowels to occur in a certain harmony class. Knowles (2012:44) claims that by comparing the visualizations for each harmony class the user can quickly determine whether or not the language under consideration appears to have vowel harmony. A major difference from the approach presented here is that her visualization is only semi-automatic because the layout of the grid has to be given as an input, while the visualizations that will be presented in this chapter are generated fully automatically.

With the exception of Knowles (2012), I am not aware of any approaches that aim to visually analyze linguistic phenomena such as VH and compare such language properties across a larger set of languages. VH is here considered to be a case study on what is possible for the visual analysis of phonotactic constraints. The main purpose of a visualization component is to generate new hypotheses about the data at hand, in this case with respect to phonotactic restrictions (between or across vowels and consonants) in a language that have so far gone unnoticed in the literature. In the following sections, I will show how such an approach can be set up for the visualization of VH. The main components of the visualization and a description of how to generate it have already been introduced in Section 3.4.3. In this chapter, I will describe the statistical analyses for the vowel harmony patterns.
