
3.5 Language data

After presenting the statistical methods and techniques that are employed to infer the hidden structure from the data, this section gives an overview of the language data on which these methods are tested. The choice of the languages was largely constrained by the amount of data that was needed to obtain an approximately representative sample of the language with respect to the phenomena under consideration.20 The set of languages is thus a convenience sample that was also selected on the basis of language-external factors, such as the availability of electronic resources and the use of an alphabetic writing system. In addition, the material should adequately reflect the actual pronunciation in the languages.

The approach of the present work, as has been set out in the previous chapter, is data-driven. The results that will be discussed are achieved on the basis of a limited set of data for the language under investigation. When talking about the results for a particular language, I refer to the results on the basis of the given language data, which does not imply that they hold for other data of the same language. For this reason, it is more appropriate to talk about “doculects” in this respect.21 Languages show high internal variability, which is also reflected in different data on the same (or even a slightly different variety of the same) language. Luckily, the degree of internal variability is lower for phonological phenomena. As has been discussed in the previous chapter, speakers can intentionally refrain from using certain constructions or sounds of their language. However, in a reasonably sized corpus of the language the common patterns prevail. This is especially true for phonological phenomena, where the amount of data that is necessary to extract reliable patterns is relatively small in comparison to semantic structures, for instance.22

20 This is mostly relevant for the substitution method presented in Chapter 4.

Due to the lack of sufficiently large phonemically transcribed resources for a larger number of languages, one of the sources of language data is written Bible texts, as they are more widely available, in particular for less well-documented languages. These texts are mostly available in practical orthographies, which might seem to be a major problem in light of the research topic that is pursued here. Luckily, most of these practical orthographies are not as idiosyncratic as those of English, French and German, a fact which has already been acknowledged for Finnish (cf. Goldsmith and Xanthos 2009:10) and Turkish (cf. Kornfilt 2009:524), for instance.23 In this work, I will use data from written texts for a larger set of languages where the practical orthographies were devised only recently and therefore correspond more closely to a phonemic transcription. On this account, the problem of using written texts is less severe than is sometimes thought. Of course, one still has to be careful when using such sources. Some of the errors in the results might be due to the inadequacy of the input data (e.g., the fact that many Bible translations are written by non-native speakers) rather than an artifact of the method that is used to infer the results. Minor problems with inadequate phonemic data can be treated as noise. The quality of the results, however, is considered to be influenced in only one direction: the lack of phonemic language data may lead to worse results in the induction of phonological structure in comparison to established phonological categories, whereas it is not to be expected that a non-phonemic aspect in the transcription yields a better performance of the algorithms. In future work, the methods can then be tested with entirely phonemic data to assess their actual performance. The results in this thesis are thus seen as a promising first step for further investigations in this direction.

The language data that will be used in the subsequent chapters basically consist of word lists that differ with respect to their basic elements and the type of transcription.

The idea was to experiment with several data sources in order to test the usefulness of the various approaches. The language data can broadly be divided into the following categories:

Phonetically transcribed word lists: Two main databases with phonetically transcribed entries are used: (i) the CELEX lexical database including a large number of word forms and lemmas (ranging from around 50,000 for English to over 300,000 for German) for the languages English, German and Dutch; (ii) the ASJP database including Swadesh list items for almost 5,000 languages and dialects.

List of verbal roots: A comprehensive list of almost 2,000 verbal roots for the Semitic language Maltese is analyzed with respect to the place features of consonants.

Practical orthographies: Lists of word forms in practical orthographies have been extracted from written Bible texts for a sample of 30 languages.24

21 “The term ‘doculect’ is sometimes used for a variety of a language that has been described or otherwise documented in a coherent way” (http://www.glottopedia.de/index.php/Doculect, accessed on September 3rd, 2011).

22 See, for instance, Mayer et al. (2010a) for how much data is required to get meaningful results for the vowel harmony visualization.

23 In this respect, Dixon (2010:66) states that those writing systems that have been devised by professional linguists are generally excellent in representing the pronunciation while ‘traditional’ writing systems are less than fully adequate.

With the exception of the list of verbal roots, all data sources more or less closely reflect the actual pronunciation of words on the surface. Since the aim of this work is to explore the possibilities of inducing structures from data, the constraints that are investigated and used to induce the respective information are considered to be visible on the surface representation. In what follows, I will give a short description of the language data.

3.5.1 CELEX data

Among other things, the CELEX lexical database (Baayen et al. 1995) contains large lists of phonetically transcribed word forms and lemmas for English, German and Dutch, which were compiled from various sources (mostly dictionaries in the respective languages). Table 3.11 shows the number of word forms and lemmas that are included in the phonology section of the database.

Table 3.11: Number of word forms and lemmas in the CELEX lexical database (Baayen et al. 1995).

Language  Type        # items
English   word forms   87,922
English   lemmas       45,266
German    word forms  310,661
German    lemmas       50,481
Dutch     word forms  281,810
Dutch     lemmas      117,048

The CELEX data represent the largest amount of word forms for individual languages as well as the closest phonemic representation for all sources that are used in the present work and thereby allow for a more fine-grained study of the assumptions that are to be tested. It is also useful to compare the results on the list of word forms with those for the collection of lemmas in order to assess the influence of grammatical markers on the results. This is particularly interesting for the induction of place distinctions in Chapter 7 and the vowel/consonant distinction in Chapter 4. In both cases, the presence of morphological markers is considered to decrease the quality of the results. I will come back to this point in the relevant chapters.

24 A more detailed overview of the amount of data for each language is given in Appendix A.

3.5.2 ASJP data

In order to test some of the assumptions for a wider range of languages, a cross-linguistically varied sample of phonetically transcribed word forms is required. For this purpose, I make use of the data provided by the Automated Similarity Judgement Program (ASJP; Wichmann et al. 2010a,b). The original aim of the ASJP project was to achieve a computerized lexicostatistical analysis of the world’s languages, going back to the lexicostatistical and glottochronological approach as worked out first by Morris Swadesh in the 1950s (cf. Swadesh 1951). The contents of the database are a reflection of the two major purposes of the project: “to provide a classification of all languages by a single, consistent and objective (if perhaps not ideal) method, and to perform various statistical analyses regarding the historical and areal behavior of lexical items” (http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm).25 For this reason, genealogical and areal information is provided for each language in the database.

The database consists of Swadesh list items for a large number of languages. The items are transcribed in the ASJP orthography, which uses standard ASCII characters to represent the sounds of the world’s languages but merges some of the distinctions that are made by the IPA into one symbol. Suprasegmental features, such as stress, tone and vowel length, are not marked in the database. However, for the purposes of the present investigation all necessary distinctions (e.g., vowel/consonant, place of articulation) are adequately reflected in the transcription.26 The number of Swadesh list items per language varies from one to one hundred, with many languages comprising the 40 entries which have been shown to be most stable in the list (see Holman et al. 2008).

The database is constantly updated with new entries and languages. In this work, I use versions 12 and 13. They basically differ in the number of languages and total number of word forms. Whereas version 12 contains 188,475 word forms from 4,335 languages and dialects, version 13 already comprises 207,290 word forms from 4,816 different varieties. As can be seen from the number of languages and word forms, the data for individual languages is very small, with an average of 43 word forms per language.
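The average quoted above follows directly from the two totals. As a quick check, the following sketch recomputes it for both versions; the variable and function names are chosen for this illustration only and are not part of the ASJP distribution.

```python
# Totals quoted in the text for ASJP versions 12 and 13.
asjp_versions = {
    12: {"word_forms": 188_475, "varieties": 4_335},
    13: {"word_forms": 207_290, "varieties": 4_816},
}


def mean_forms_per_variety(word_forms: int, varieties: int) -> float:
    """Average number of word forms per language or dialect."""
    return word_forms / varieties


# Map each version number to its average.
averages = {
    version: mean_forms_per_variety(**counts)
    for version, counts in asjp_versions.items()
}

for version, average in averages.items():
    print(f"ASJP version {version}: {average:.1f} word forms per variety")
```

Both versions round to the 43 word forms per language mentioned above.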

The location of the languages in version 12 of the ASJP database is shown in Figure 3.8. Languages that are included in the database are plotted with green dots, while languages that are listed in Ethnologue27 but not included in the ASJP database are given in red.28 The map shows that the languages in the ASJP database cover most of the world’s regions. Furthermore, the languages in the ASJP database are taken from a large variety of language families. Appendix A.2 lists all families given in Ethnologue together with the percentage of languages that are included in the ASJP database. Although some smaller language families are not included, the database covers a fair proportion of the world’s language families. Hence, I will assume that the database contains a more or less representative sample of word forms that can be encountered in the languages of the world. In this sense, I will make use of the database when testing the validity of some of the assumptions for universal characteristics, i.e., independent of a particular language.

25 Accessed on September 4th, 2011.

26 The only exception is the symbol !, which represents all varieties of click sounds from different places of articulation. For this reason, click sounds have been ignored in Chapter 7.

27 http://www.ethnologue.com

28 The geographical coordinates for the languages listed in Ethnologue were provided by Östen Dahl.

Figure 3.8: World map with languages in the ASJP database (legend: languages in ASJP version 12 vs. languages not in ASJP version 12).

3.5.3 Maltese roots

For the analysis of co-occurrence restrictions with respect to consonantal place distinctions I complement previous work in this area with a study on Maltese verbal roots.

Maltese belongs to the Semitic family of languages, yet has been largely influenced over the last centuries by (Sicilian) Italian and, more recently, (British) English. The phenomenon of similar place avoidance that will be investigated in Chapter 7 was originally claimed to be active in non-derived forms of Semitic languages. In order to test the validity of the principle for Maltese and the impact of the non-concatenative component from the contact languages on its results, I made use of a comprehensive list of 1,958 Maltese verbal roots. The list has been compiled by a native speaker of Maltese (Michael Spagnol; cf. Spagnol 2011) using the dictionaries of Serracino-Inglott (1975-1989) and Aquilina (1987-1990) as well as the collection of loan verbs in Mifsud (1995:272-295). The list consists of roots of three and four consonants as well as weak roots (those involving a glide as one of their root consonants) and also includes recent borrowings from European languages that have been fully integrated into the non-concatenative root-and-pattern system of the Semitic type.29 For the investigation of similar place avoidance in Chapter 7 the list of triliteral (containing three consonants) roots will mostly be used. The roots are given in their orthographic forms, which are, except for the digraph għ, reasonably close to a phonemic transcription. In particular, the place of articulation of consonants is sufficiently distinguished for the purposes of the present investigations.

29 An updated version of the list of roots is available on http://mlrs.research.um.edu.mt/.

3.5.4 Bible texts

In order to test some of the approaches on reasonably sized texts for a larger number of languages, I compiled a collection of word lists from Bible texts (mostly New Testament). Since the data is available in practical orthographies, one has to be careful in interpreting the results that are obtained from these data sets. However, as the results also show, the practical orthographies of most languages are not as idiosyncratic as those of English, German and French. The collection of word forms extracted from the Bible texts comprises 30 languages in total. For some languages, only Bible portions could be obtained. The languages in the sample are listed in Table 3.12.30 When choosing the languages of the sample the aim was to cover most of the major language families in the world. However, some families (e.g., Dravidian, Sino-Tibetan) are not represented because it is hard to get larger amounts of texts in an alphabetic writing system. Nevertheless, as the distribution of the languages on the world map in Figure 3.9 also shows, the sample covers most of the world’s regions.

Symbols that occur fewer than 10 times in the texts have been ignored, and words containing those symbols have not been included in the word lists. In some cases, minor modifications in the spelling systems have been made in order to make the representation more consistent with a phonemic transcription. To this end, frequent digraphs in the language have been treated as one symbol. These digraphs can be seen in the results for the vowel-consonant discrimination. Accents marking stress or vowel length have been deleted so that only the vowel quality is marked for each distinct symbol.
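The preprocessing steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the word lists: the function names, the lowercasing step, the order of the steps, counting symbols over running tokens, and the idea of passing the digraphs and the accents to be removed as per-language parameters are all hypothetical choices made for the example.

```python
import unicodedata
from collections import Counter


def strip_marks(word, marks):
    """Remove the given combining marks (e.g. stress or length accents)."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize("NFC", kept)


def segment(word, digraphs):
    """Split a word into symbols, treating listed digraphs as one symbol."""
    symbols, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            symbols.append(word[i:i + 2])
            i += 2
        else:
            symbols.append(word[i])
            i += 1
    return symbols


def build_word_list(tokens, digraphs=frozenset(), marks=frozenset(), min_count=10):
    """Return the sorted word types that contain no rare symbol."""
    # Assumption: case is folded before segmentation.
    segmented = [
        tuple(segment(strip_marks(token.lower(), marks), digraphs))
        for token in tokens
    ]
    # Assumption: symbol frequencies are counted over running tokens.
    counts = Counter(symbol for word in segmented for symbol in word)
    rare = {symbol for symbol, n in counts.items() if n < min_count}
    types = {word for word in segmented if word and not rare.intersection(word)}
    return sorted(types)


# Toy input: "x" occurs only three times, so the word containing it is dropped.
toy_tokens = ["acha"] * 12 + ["\u00e1xa"] * 3
toy_words = build_word_list(toy_tokens, digraphs={"ch"}, marks={"\u0301"})
print(toy_words)
```

Passing the accents to be removed explicitly, rather than stripping all diacritics, keeps marks that encode vowel quality intact, which is what the description above requires.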

Table 3.12: Languages in the sample.

Language                 ISO 639-3  Subfamily             Family
Abau                     aau        Abau                  Sepik
Afrikaans                afr        Germanic              Indo-European
Amele                    aey        Madang                Trans-New Guinea
Basque                   eus        Isolate               Isolate
Chamorro                 cha        Malayo-Polynesian     Austronesian
Dutch                    nld        Germanic              Indo-European
English                  eng        Germanic              Indo-European
Esperanto                epo        Artificial            Artificial
Finnish                  fin        Finnic                Uralic
Georgian                 kat        Georgian              Kartvelian
German                   deu        Germanic              Indo-European
Haitian Creole           hat        French based          Creole
Hixkaryána               hix        Waiwai                Cariban
Hungarian                hun                              Uralic
Indonesian               ind        Malayo-Polynesian     Austronesian
Latin                    lat        Romance               Indo-European
Maltese                  mlt        Semitic               Afro-Asiatic
Maori                    mri        Malayo-Polynesian     Austronesian
Highland Puebla Nahuatl  azz        Southern Uto-Aztecan  Uto-Aztecan
Nuer                     nus        Eastern Sudanic       Nilo-Saharan
Potawatomi               pot        Algonquian            Algic
Cusco Quechua            quz        Peripheral Quechuan   Quechuan
Swahili                  swh        Atlantic-Congo        Niger-Congo
Tagalog                  tgl        Malayo-Polynesian     Austronesian
Turkish                  tur        Turkic                Altaic
Vietnamese               vie        Mon-Khmer             Austro-Asiatic
Warlpiri                 wbp        Pama-Nyungan          Australian
West Greenlandic         kal        Eskimo                Eskimo-Aleut
Wik-Mungkan              wim        Pama-Nyungan          Australian
Wolof                    wol        Atlantic-Congo        Niger-Congo

30 An overview of the number of types and symbols in the Bible texts is given in Appendix A.

Figure 3.9: World map of the languages in the sample (languages are labeled with their ISO 639-3 codes).

3.6 Summary

In this chapter, I have given an overview of the statistical methods and techniques as well as the language data that will be used in order to test the approaches in the subsequent chapters. Although the contexts from which the association of symbols is derived differ for the individual methods that will be proposed, the underlying statistical method to calculate the degree of association (from which a (dis-)similarity relation is derived) is basically the same. From the various tests for dependence which are suitable to compute an association strength value, I opted for the φ coefficient, which is derived from the χ² value for two dichotomous variables. The advantage of choosing this statistical value is that it is bounded in the interval between 0 and 1 (or [−1, 1] when the direction of association is indicated by the algebraic sign) and therefore allows for an easy mapping of values to visual variables across corpora of different sizes. Moreover, the φ value is easier to interpret as it corresponds to the correlation coefficient under such circumstances. In light of the results of the subsequent chapters, the use of this statistical value seems to be appropriate for the present purposes. Yet it has to be kept in mind that the aim of this work is not to apply new and more sophisticated statistical methods. Instead, the intention is to propose the application of novel linguistically motivated contexts that can be used to infer phonological structure from the input data. The statistical part is therefore a tool that is indispensable to yield first results. Testing those contexts with other statistical methods, however, is left for future research.
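The properties of the φ coefficient mentioned above can be illustrated for a single 2×2 contingency table. The sketch below is not the implementation used in this work: the function names and the cell layout (a = both symbols occur in the context, b and c = only one of them occurs, d = neither occurs) are assumptions made for the example. It relies on the standard identities χ² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)) and |φ| = √(χ²/N), where N = a + b + c + d.

```python
import math


def chi_squared(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # an empty row or column: no association measurable
    return n * (a * d - b * c) ** 2 / margins


def phi_coefficient(a, b, c, d):
    """Signed phi in [-1, 1]; the sign gives the direction of association."""
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0
    return (a * d - b * c) / math.sqrt(margins)


# Hypothetical counts for two symbols in some context.
small = (30, 10, 10, 30)
large = tuple(10 * count for count in small)  # same proportions, larger corpus

phi_small = phi_coefficient(*small)
phi_large = phi_coefficient(*large)
print(phi_small, phi_large, chi_squared(*small), chi_squared(*large))
```

In this toy example the χ² value grows with the size of the table while φ stays the same, which is why φ lends itself to a mapping onto visual variables that is comparable across corpora of different sizes.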

One of the goals of this chapter was to point out the usefulness of including visualization techniques in the analysis of data. Section 3.4 has given a brief introduction to the field of visual analytics, whose aim is to develop methods that enable the user to see hidden patterns in the data at a glance. An example of how linguistic research can benefit from the integration of a visual analytics approach is demonstrated in Chapter 6 for the visualization of vowel harmony patterns. In addition, this chapter has described two standard methods for the inference of structure from data: agglomerative hierarchical clustering and multidimensional scaling. Both methods come with their own graphical display techniques (dendrograms and MDS plots, respectively), which will be used to present the results of the approaches.

Finally, this chapter has provided an overview of the language data that is used in this work. Since the studies deal with the induction of phonological structure, the type of data that is employed merely consists of word lists rather than corpora of running text in the languages at hand. The data sets range from orthographic texts for a larger number of languages to phonetically transcribed word forms and lemmas for only a few closely related languages. In order to test some of the assumptions about word forms in general (i.e., across languages and families) the database of the ASJP program is used. I will consider their collection of word forms from almost 5,000 different varieties to be a representative sample of the structure of words that can occur in the world’s languages.

This chapter marks the end of the first part of this thesis where the theoretical
