The substitution approach - Vowels and Consonants


4.4 The substitution approach

We have seen before in Section 4.2 that a number of methods have been proposed in the literature to discriminate vowels and consonants on the basis of their distribution within words. The approaches mainly differ in the statistical techniques that are employed in order to yield a reasonable partitioning of symbols into two groups. Most of them are based on N-gram frequencies of symbols (similar to the bigram frequencies in Sukhotin’s algorithm) and thereby exploit the nature of the linear sequence of symbols to show an alternation between the major categories of vowels and consonants as has been described above for Sukhotin’s algorithm and which also lays the basis for the other approaches that have been described in Section 4.2.¹⁷

Table 4.5: Results for Sukhotin’s algorithm on all word forms across languages in the ASJP database. The symbols of the ASJP orthography are given followed by their corresponding IPA symbols in brackets.

Language    ASJP word forms
Vowels      3 [1,9,@,3,0,8,Æ]  E [a,æ,E,œ]  a [5]  e [e,ø]  i [i,ı,y,Y]  o [7,2,A,6,o,O]  u [W,u]
Consonants  4 [nˆ]  5 [ñ]  7 [P]  8 [T,D]  C [Ù]  G [G]  L [L,í,L]  N [N]  S [S]  T [c,Í]  X [X,K,è,Q]  Z [Z]  b [b,B]
            c [ts,dz]  d [d]  f [f]  g [g]  h [h,H]  j [Ã]  k [k]  l [l]  m [m]  n [n]  p [p,F]  q [q]  r [r,R,etc.]
            s [s]  t [t]  v [v]  w [w]  x [x,G]  y [j]  z [z]
Misclass.   ! [!,|,{,}]

In this section, I introduce a novel approach to the problem of discriminating vowels and consonants on the basis of a list of word forms in a language. This new method differs from the previous approaches in a crucial respect: it does not use distributional information in linear sequences (on the syntagmatic level) of words in the speech chain; rather, it exploits the distribution of sounds on the paradigmatic level, namely their tendency to be substituted for one another in minimal pairs of the word list. In other words, no N-gram statistics are taken into account to infer the vowel/consonant distinction, unlike in almost all of the methods mentioned before. N-grams rest upon the relationship of sounds in praesentia, i.e., on their co-occurrence in the relevant contexts within words.¹⁸ The technique described in this section, however, is based on the relationship of symbols in absentia. This means that the relevant context for inferring similarities between symbols is not their co-occurrence in a specified context within the word but has to be established across words. The notion of substitutability and/or complementarity has proven to be a useful tool in linguistic research for finding structure in languages. The basic idea is that elements that can be substituted for one another are assumed to be of the same type. A well-known application of this principle in phonology is the identification of phonemes with the help of minimal pairs.

In what follows, I will make use of this principle for learning the vowel/consonant distinction in symbols of the language.

17 The only approach that does not take N-gram statistics into account to discriminate vowels and consonants is Ellison (1994).

18 Saussure’s terminology for those two relations would be syntagmatique for the former and associatif (which was later called paradigmatic by Hjelmslev; cf. Fischer-Jørgensen 1975:119) for the latter (de Saussure 1916 [1967]:156).

4.4.1 Description of the algorithm

The relevant context for inferring the relationship across words is the tendency of sounds to be substituted for one another in so-called minimal pairs of the language.¹⁹ A minimal pair is a pair of words that differ in only one sound.²⁰ The idea is that the tendency for sounds to occur in such minimal pair substitutions (e.g., t - p in the minimal pair tin and pin) constitutes a relationship that marks their similarity with respect to the vowel/consonant distinction.²¹ The assumption is that sounds from the same category are more likely to occur in such substitutions than sounds from different categories. Whereas it is easy to come up with examples like the one above, where two consonants are substituted in the minimal pair, or similarly with cases where two vowels are substituted, as in pot and pet, instances in which a vowel and a consonant replace each other are much harder to find. Nevertheless, such cases do exist. For instance, in the minimal pair orca and orcs, the vowel a and the consonant s are the two sounds in which the members of the pair differ. The occurrences of two sounds in such substitutions can be counted in the same way as co-occurrences in linear sequences are counted for N-grams.

We then get a symmetric matrix of pairs of symbols where each cell gives the number of times the two symbols have been substituted for each other in minimal pairs in the corpus.

The extraction of minimal pairs from a corpus is not a trivial task, as the position of the sound that is substituted can vary from one word pair to another. It may be in initial position (as in the pair tin-pin above), in medial position (as in the example pot-pet mentioned before) or in final position (as in the pair tin-tip). It is therefore not sufficient to check only specific positions in which words may show the relevant distinctions. The extraction method should be able to find the relevant pairs of sounds in any position. For the implementation of the method, I extended the Levenshtein distance algorithm (Levenshtein 1966)²² from the LingPy library (List and Moran 2013) with a function that outputs those cases where the distance is equal to one and where no insertion (or deletion) is encountered. As a result, we get a list of substitutions from which the symmetric matrix of sound substitutions can be built, where each cell gives the number of times the substitution of the two sounds was found in the list.
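The extraction step can be illustrated with a minimal sketch. It is not the LingPy-based implementation described above: it assumes that every character is one sound and simply compares all pairs of equal-length words position by position, which yields the same pairs as an edit distance of one without insertions or deletions. The function names substitution_pairs and substitution_matrix are hypothetical.

```python
from collections import Counter
from itertools import combinations


def substitution_pairs(words):
    """Yield (x, y) for every pair of equal-length words that differ
    in exactly one position (a minimal pair with one substitution).

    Assumption: each character of a word is one sound/symbol.
    """
    unique = sorted(set(words))
    for w1, w2 in combinations(unique, 2):
        if len(w1) != len(w2):
            continue  # insertions and deletions are not considered
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1:
            yield diffs[0]


def substitution_matrix(words):
    """Count substitutions symmetrically: cell (x, y) equals cell (y, x)."""
    counts = Counter()
    for x, y in substitution_pairs(words):
        counts[(x, y)] += 1
        counts[(y, x)] += 1
    return counts


# The examples used in the text.
words = ["tin", "pin", "tip", "pot", "pet", "orca", "orcs"]
matrix = substitution_matrix(words)
print(matrix[("t", "p")], matrix[("e", "o")], matrix[("a", "s")])
```

Comparing all word pairs is quadratic in the size of the word list, which is acceptable for a sketch but may be too slow for large corpora.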

19 Note that the concept of “minimal pairs” is not elementary as it requires a completed phonemic analysis (Fischer-Jørgensen 1975:183; see Section 2.4).

20 According to Fischer-Jørgensen (1975:6), the term “commutation test” was coined by Hjelmslev to indicate “a test which consists of replacing a sound sequence forming a minimal utterance with another in order to find out whether the change is accompanied by a change in meaning.”

21 This relationship is reminiscent of the notion of functional load of a phoneme (cf. Trubetzkoy 1939 [1967]:240; King 1967; Meyerstein 1970).

22 The Levenshtein algorithm (Levenshtein 1966) is a dynamic programming algorithm which computes the minimal cost that is necessary to change one string into another. The cost function depends on the actual implementation. Usually, three different kinds of operations are defined which add up to the final cost: insertion, deletion and substitution of a symbol. For the purpose of determining the substitutions in minimal pairs, the deletion and insertion operations are not relevant because only strings of the same length are compared. See Gusfield (1997) for a detailed description of the algorithm.

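Table 4.6: Absolute counts of substitutions of sounds in the English text.

     a   e   i   o   u   b   c   d   f   g   h   j   k   l   m   n   p   q   r   s   t   v   w   x   y   z
a    0  87 123 124  59   4   6   6   5   3   8   0   7  16   5  18   8   0  14   6  10   0   8   1   4   2
e   87   0  51  81  36  11   4  27   5  11  15   0  15   4  20  18  17   0  12  17  30   0   9   1  30   0
i  123  51   0  86  60   5   1   7   1   0   6   2   3   4   8  17   3   0  11   7   8   2   6   1   8   1
o  124  81  86   0  46   5   3   3   1   5   5   0   1   5   3   1   4   0   5   4   2   0   2   0   2   1
u   59  36  60  46   0   5   1   2   1   2   2   0   1   1   1   8   4   0   6   0   3   0   5   0   2   0
b    4  11   5   5   5   0  31  35  48  27  43   8   9  37  63  23  38   0  50  56  35   9  46   0   8  22
c    6   4   1   3   1  31   0  21  29  25  30   2  13  30  32  31  37   0  24  34  38   8  22   0   1   4
d    6  27   7   3   2  35  21   0  21  29  27   6  34  45  48  56  50   0  89  89  66  15  31   0  18  16
f    5   5   1   1   1  48  29  21   0  14  38   4  16  30  32  26  42   0  24  42  29   8  44   1   7   2
g    3  11   0   5   2  27  25  29  14   0  23   8  17  23  20  17  21   0  28  26  25   6  19   6   5   7
h    8  15   6   5   2  43  30  27  38  23   0   5  13  44  41  39  37   0  46  63  47   8  46   2  11  12
j    0   0   2   0   0   8   2   6   4   8   5   0   3   8  10  17  10   0  13   6   7   0   2   0   0   9
k    7  15   3   1   1   9   3  34  16  17  13   3   0  12  29  11  17   0  27  19  16  22  17   6   3   9
l   16   4   4   5   1  37  30  45  30  23  44   8  12   0  49  46  29   0 115  57  55  16  30   1   5  11
m    5  20   8   3   1  63  32  48  32  20  41  10  29  49   0  46  40   0  53  39  43  19  36   2  12  18
n   18  18  17   1   8  23  31  56  26  17  39  17  11  46  46   0  23   0  99  54  48  12  30   4  11  12
p    8  17   3   4   4  38  37  50  42  21  37  10  17  29  40  23   0   0  26  44  39  16  33   2   9   8
q    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
r   14  12  11   5   6  50  24  89  24  28  46  13  27 115  53  99  26   0   0  58  49  22  39  10   8  14
s    6  17   7   4   0  56  34  89  42  26  63   6  19  57  39  54  44   1  58   0  60  14  37   4  29  33
t   10  30   8   2   3  35  38  66  29  25  47   7  16  55  43  48  39   0  49  60   0  16  57   6  17  11
v    0   0   2   0   0   9   8  15   8   6   8   0  22  16  19  12  16   0  22  14  16   0  15   2   5   2
w    8   9   6   2   5  46  22  31  44  19  46   2  17  30  36  30  33   0  39  37  57  15   0   2  16   3
x    1   1   1   0   0   0   0   0   1   6   2   0   6   1   2   4   2   0  10   4   6   2   2   0   3   0
y    4  30   8   2   2   8   1  18   7   5  11   0   3   5  12  11   9   0   8  29  17   5  16   3   0   1
z    2   0   1   1   0  22   4  16   2   7  12   9   9  11  18  12   8   0  14  33  11   2   3   0   1   0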

Table 4.6 shows the substitution matrix for the English text in the sample. For instance, the vowel a can be found in 87 minimal pairs where it is substituted by the vowel e, whereas there are only 4 minimal pairs which only differ in the sounds a and b.

It can already be seen in the matrix that there is a very strong tendency for sounds to be substituted by sounds of the same class, whereas sounds of different classes rarely occur in minimal pairs as the differentiating factors.

The absolute values in Table 4.6 can be compared to the bigram counts used in other methods (e.g., Sukhotin’s algorithm) to infer the vowel/consonant distinction. In the case of bigram counts, however, the higher the frequency of co-occurrence, the less similar both sounds are considered to be.²³ With the substitution counts, on the other hand, two sounds are taken to be more similar the higher their substitution count. The absolute values suffer from the fact that certain sounds are more frequent than others and therefore tend to have a higher value for all substitutions. It is thus necessary to make use of the statistical methods introduced in Section 3.2 in order to give equal weight to all sound substitutions independent of the overall frequency of the individual sounds. To this end, the corresponding φ values are calculated for all cells in the matrix. A description of this procedure is given in Section 3.2, where the computation of φ coefficients is explained in detail.
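As a rough illustration of how a φ value corrects for overall frequency, the following sketch computes the standard φ coefficient of a 2×2 contingency table. How the table is derived from the substitution matrix is an assumption here (Section 3.2 is not reproduced): for a pair (x, y), the cell count is set against the remaining counts in the row of x, the remaining counts in the column of y, and all other cells. The names phi and phi_for_pair and the toy counts are hypothetical.

```python
from math import sqrt


def phi(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]]."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def phi_for_pair(counts, x, y):
    """Phi value for symbols x and y from a symmetric count matrix.

    Assumed contingency table (not taken from the source):
      a = substitutions of x with y
      b = substitutions of x with anything but y
      c = substitutions of y with anything but x
      d = all remaining substitutions
    """
    total = sum(counts.values())
    row_x = sum(v for (row, _), v in counts.items() if row == x)
    col_y = sum(v for (_, col), v in counts.items() if col == y)
    a = counts.get((x, y), 0)
    b = row_x - a
    c = col_y - a
    d = total - row_x - col_y + a
    return phi(a, b, c, d)


# Toy symmetric counts: two vowels, two consonants, one cross-class pair.
toy = {("a", "e"): 5, ("e", "a"): 5,
       ("t", "p"): 5, ("p", "t"): 5,
       ("a", "t"): 1, ("t", "a"): 1}
print(phi_for_pair(toy, "a", "e"))  # positive: same class
print(phi_for_pair(toy, "a", "p"))  # negative: different classes
```

In this toy example the same-class pair receives a positive value and the cross-class pairs negative values, regardless of how frequent the individual symbols are.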

The clustering of sounds as well as their visual representation described below is based on the φ values that are computed from the absolute counts of substitutions. We have seen before that a visual analysis of results may provide the user with an overview of potential patterns in the data. An introduction to the main assets of a visualization in terms of a matrix of sounds has been given in Section 3.4.3.²⁴ The same visualization technique can also be applied to the similarity matrix of sounds that is generated from the substitution counts in minimal pairs. The φ values of sound pairs thereby indicate their similarity. The closer the φ value is to +1, the more similar are two sounds.

The values are then mapped to colors, where red indicates a negative association and blue indicates a positive association between sounds. The color saturation marks the strength of the association. Most importantly, all sounds in the matrix are ordered according to the leaf order in the dendrogram after Ward’s clustering (Ward 1963) has been applied.
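The color mapping can be sketched as follows. The exact color scale is not specified in the text, so this example assumes a simple linear scale that fades from white (φ = 0) to fully saturated red (φ = -1) or blue (φ = +1); the function name phi_to_rgb is hypothetical. The ordering of rows and columns by Ward’s clustering is not covered by this sketch.

```python
def phi_to_rgb(phi_value):
    """Map a phi value in [-1, 1] to an (R, G, B) tuple.

    Assumed scale: white at 0, pure blue at +1, pure red at -1,
    with saturation growing linearly with the absolute value.
    """
    strength = min(abs(phi_value), 1.0)
    fade = int(255 * (1 - strength))  # 255 = no saturation (white)
    if phi_value >= 0:
        return (fade, fade, 255)  # positive association: blue
    return (255, fade, fade)      # negative association: red


for value in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(value, phi_to_rgb(value))
```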

The results of such visualizations for all 30 languages in our sample are given in Figure 4.2. For most of the languages and substitution matrices a clear pattern emerges. Two blocks can be distinguished in the visualizations. Sounds within each block tend to be associated with sounds of the same block and to avoid association with sounds of the other block. On closer inspection, the row and column labels show that the blocks represent the two major phonological categories of sounds in the language, viz. vowels and consonants. For the English orthographic text, for instance, the full matrix (Figure 4.3) shows the vowels in the left block and the consonants in the right block. The colors mark the delineation of the two blocks: sound pairs belonging to the same block are shown in blue and those belonging to different blocks in red.

23 This is due to the hypothesis that vowels and consonants tend to alternate within words.

24 For another example which clearly demonstrates the usefulness of the visualization approach see Chapter 6 where vowel harmony patterns are visualized as a vowel succession matrix.
