Sequence Alignment

(1) Sequence Alignment. Gerhard Jäger. ESSLLI 2016.

(2) Sequence alignment: Motivation.

(3) Sequence alignment: Motivation. Example.

Suppose we have no information except word lists. Goals:
- estimate distances between languages
- estimate cognate classes
- track individual sound changes

| Meaning | Italian | English | Cognate |
|---|---|---|---|
| few | ’pɔko | fju: | 1 |
| rub | fre’gare | rʌb | 0 |
| dull | ot’tuzo | dʌl | 0 |
| hunt | kat’tʃare | hʊnt | 0 |
| year | ’anno | jɪə | 0 |
| this | ’kwesto | ðɪs | 0 |
| fish | ’peʃʃe | fɪʃ | 1 |
| rotten | ’martʃo | ’rɒtən | 0 |
| right | ’dʒusto | raɪt | 0 |
| when | ’kwando | wɛn | 1 |
| drink | ’bere | drɪŋk | 0 |
| heavy | pe’sante | ’hɛvɪ | 0 |
| heavy | ’grɛve | ’hɛvɪ | 0 |
| egg | ’wɔvo | ɛg | 1 |
| earth | ’tɛrra | ɜ:θ | 0 |
| dust | ’polvere | dʌst | 0 |
| laugh | ’ridere | lɑ:f | 0 |
| grass | ’ɛrba | grɑ:s | 0 |
| sharp | taʎ’ʎɛnte | ʃɑ:p | 0 |
| wash | la’vare | wɒʃ | 0 |

(5) Sequence alignment: Motivation. Preprocessing.

- IPA is open-ended: 107 letters, 52 diacritics, 4 prosodic marks → 200,000 combinations
- good practice: map IPA strings to a uniform representation with fewer symbols
- common choices:
  - 10 Dolgopolsky sound classes (Dolgopolsky 1986; used i.a. in List 2014)
  - 41 ASJP sound classes
- this course: ASJP

The word lists in ASJP transcription:

| Meaning | Italian | English |
|---|---|---|
| few | poko | fyu |
| rub | fregare | rob |
| dull | ottuzo | dol |
| hunt | kattSare | hunt |
| year | anno | yi3 |
| this | kwesto | 8is |
| fish | peSSe | fiS |
| rotten | martSo | rot3n |
| right | dZusto | rait |
| when | kwando | wEn |
| drink | bere | driNk |
| heavy | pesante | hEvi |
| heavy | grEve | hEvi |
| egg | wovo | Eg |
| earth | tErra | 38 |
| dust | polvere | dost |
| laugh | ridere | lof |
| grass | Erba | gros |
| sharp | tallEnte | Sop |
| wash | lavare | woS |

(7) Pairwise alignment.

(8) Pairwise alignment. Levenshtein alignment.

- Levenshtein distance, also known as edit distance
- defines the distance between two strings as the minimal number of edit operations needed to transform one string into the other
- edit operations: deletion, insertion, replacement
- example: German mEnS vs. Cimbrian menEs
  1. mEnS → menS (replace)
  2. menS → menES (insert)
  3. menES → menEs (replace)

$$d_L(\text{mEnS}, \text{menEs}) = 3$$

(9) Pairwise alignment. Levenshtein alignment.

Alternative presentation: alignment

m E n − S
| | | | |
m e n E s

- the distance for a particular alignment is the number of non-identities
- the Levenshtein distance is the number of mismatches for the optimal alignment

(10) Pairwise alignment. Computing the Levenshtein distance.

Recursive definition, with $\epsilon$ the empty string, $l(\alpha)$ the length of $\alpha$, and $\delta(x, y) = 0$ if $x = y$ and $1$ otherwise:

1. $d_L(\epsilon, \alpha) = d_L(\alpha, \epsilon) = l(\alpha)$
2. $$d_L(\alpha x, \beta y) = \min \begin{cases} d_L(\alpha, \beta) + \delta(x, y) \\ d_L(\alpha x, \beta) + 1 \\ d_L(\alpha, \beta y) + 1 \end{cases}$$

- apparently requires an exponentially growing number of comparisons ⇒ computationally not feasible
- but: if $l(\alpha) = n$ and $l(\beta) = m$, there are $n + 1$ prefixes of $\alpha$ and $m + 1$ prefixes of $\beta$
- hence there are only $(n + 1)(m + 1)$ different comparisons to be performed
- computational complexity is polynomial (quadratic in $l(\alpha) + l(\beta)$)
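The recursive definition can be turned into code almost literally. The following is a minimal sketch, not taken from the slides: the function name `levenshtein` and the use of `functools.lru_cache` for memoization are choices made here for illustration. Caching on the pair of prefix lengths is what keeps the number of distinct comparisons at $(n + 1)(m + 1)$.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the recursive definition, memoized on prefix lengths."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # clause 1: distance to the empty string is the length
        if j == 0:
            return i
        delta = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            d(i - 1, j - 1) + delta,  # match or replacement
            d(i, j - 1) + 1,          # insertion
            d(i - 1, j) + 1,          # deletion
        )

    return d(len(a), len(b))


print(levenshtein("mEnS", "menEs"))
```

For long strings an iterative table avoids Python's recursion limit; the recursion is kept here because it mirrors the definition.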

(11) Pairwise alignment. Computing the Levenshtein distance. Dynamic programming.

The first row and column are initialized with 0, 1, 2, …; the remaining cells are then filled in one by one:

|   | − | m | e | n | E | s |
|---|---|---|---|---|---|---|
| − | 0 | 1 | 2 | 3 | 4 | 5 |
| m | 1 | 0 | 1 | 2 | 3 | 4 |
| E | 2 | 1 | 1 | 2 | 2 | 3 |
| n | 3 | 2 | 2 | 1 | 2 | 3 |
| S | 4 | 3 | 3 | 2 | 2 | 3 |

Memorizing in each step which of the three cells to the left and above gave rise to the current entry lets us recover the corresponding optimal alignment. Here there are two optimal alignments:

m E n − S
m e n E s

m E n S −
m e n E s
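The table-filling procedure with back-pointers can be sketched as follows. This is an illustrative implementation, not code from the course: the function name `levenshtein_align`, the labels `"diag"`, `"up"`, `"left"`, and the tie-breaking order (diagonal first) are assumptions. Because ties are broken arbitrarily, only one of the optimal alignments is returned.

```python
def levenshtein_align(a: str, b: str):
    """Fill the DP table, remember the move that produced each cell, trace back."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0], back[i][0] = i, "up"
    for j in range(1, m + 1):
        dist[0][j], back[0][j] = j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            candidates = [
                (dist[i - 1][j - 1] + delta, "diag"),
                (dist[i - 1][j] + 1, "up"),
                (dist[i][j - 1] + 1, "left"),
            ]
            # min() returns the first minimal candidate, so ties prefer "diag".
            dist[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Follow the back-pointers from the bottom-right cell to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, top, bottom = levenshtein_align("mEnS", "menEs")
print(distance)
print(top)
print(bottom)
```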

(42) Pairwise alignment. Normalization for length.

- German mEnS (Mensch, ’person’) and Hindi manuSya are (partially) cognate
- German ze3n (sehen, ’see’) and Hindi deg are not cognate
- still:
  - $d_L(\text{mEnS}, \text{manuSya}) = 4$
  - $d_L(\text{ze3n}, \text{deg}) = 3$
- normalization: divide the Levenshtein distance by the length of the longer string:
  - $d_{LD}(\text{mEnS}, \text{manuSya}) = 4/7 \approx 0.57$
  - $d_{LD}(\text{ze3n}, \text{deg}) = 3/4 = 0.75$
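A small sketch of the normalization, assuming the function names `levenshtein` and `ldn` (both chosen here) and returning 0 for two empty strings, a corner case the slides do not address:

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (x != y),  # match or replacement
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(round(ldn("mEnS", "manuSya"), 2))
print(ldn("ze3n", "deg"))
```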

(43) Pairwise alignment. How well does normalized Levenshtein distance predict cognacy?

[Figure, two panels. One panel plots the empirical probability of cognacy against LDN; the other shows LDN for cognate (yes) and non-cognate (no) word pairs.]

(44) Pairwise alignment. Problems.

- binary distinction: match vs. non-match
- frequently, genuine sound correspondences in cognates are missed:

  c v a i      n a z 3      - - - f i S
  - - t u      n - o s      p i s k i s

- corresponding sounds count as mismatches even if they are aligned correctly:

  h a n t      h a n t
  h E n d      m a n o

- substantial amount of chance similarities

(45) Pairwise alignment. Background: probability theory.

- Given two sequences: how likely is it that they are aligned?
- More general question: given some data and two competing hypotheses, how likely is it that the first hypothesis is correct?
- Bayesian inference!
- given:
  - data: $d$
  - hypotheses: $h_1$, $h_0$
  - model: $P(d|h_1)$, $P(d|h_0)$
- wanted: $P(h_1|d) : P(h_0|d)$

(46) Pairwise alignment. Bayesian inference.

Bayes' theorem:

$$P(h|d) = \frac{P(d|h)P(h)}{\sum_{h'} P(d|h')P(h')}$$

ergo:

$$P(h_1|d) : P(h_0|d) = P(d|h_1)P(h_1) : P(d|h_0)P(h_0)$$

$$P(h_1|d) : P(h_0|d) = \frac{P(d|h_1)}{P(d|h_0)} \cdot \frac{P(h_1)}{P(h_0)}$$

$$\log(P(h_1|d) : P(h_0|d)) = \log\frac{P(d|h_1)}{P(d|h_0)} + \log\frac{P(h_1)}{P(h_0)}$$

(47) Pairwise alignment. Bayesian inference.

Suppose we have many independent data: $\vec{d} = d_1, \ldots, d_n$.

$$P(\vec{d}|h) = \prod_{i=1}^{n} P(d_i|h)$$

$$\log P(\vec{d}|h) = \sum_{i=1}^{n} \log P(d_i|h)$$

$$\log\frac{P(\vec{d}|h_1)}{P(\vec{d}|h_0)} = \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)}$$

$$\log(P(h_1|\vec{d}) : P(h_0|\vec{d})) = \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)} + \log\frac{P(h_1)}{P(h_0)}$$

(48) Pairwise alignment. Bayesian inference.

- main argument against using Bayes' rule: the prior probabilities $P(h_1)$, $P(h_0)$ are not known
- there are various heuristics, but no generally accepted way to obtain them
- if $n$ is large, though, $\log P(h_1)/P(h_0)$ doesn't matter very much:¹

$$\log(P(h_1|\vec{d}) : P(h_0|\vec{d})) \approx \sum_{i=1}^{n} \log\frac{P(d_i|h_1)}{P(d_i|h_0)} = \log(P(\vec{d}|h_1) : P(\vec{d}|h_0))$$

- the quantity $\log(P(\vec{d}|h_1) : P(\vec{d}|h_0))$ is called log-odds

¹ Also, if we choose an uninformative prior with $P(h_1) = P(h_0)$, we have $\log P(h_1)/P(h_0) = 0$ anyway.

(49) Pairwise alignment. Log-odds.

- log-odds can take any real value
- a positive value indicates evidence for $h_1$ and a negative value evidence for $h_0$
- the higher the absolute value, the stronger the evidence
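As a toy illustration of how log-odds accumulate over independent observations, the sketch below sums per-observation log likelihood ratios and adds the prior term. The function name `log_odds`, the `prior_h1` parameter, and the coin example (a coin biased 0.8 towards heads as $h_1$ versus a fair coin as $h_0$) are invented for this example and are not part of the slides.

```python
import math


def log_odds(lik_h1, lik_h0, prior_h1=0.5):
    """log(P(h1|d) : P(h0|d)) for independent observations.

    lik_h1[i] and lik_h0[i] are P(d_i|h1) and P(d_i|h0).
    With prior_h1 = 0.5 the prior term vanishes.
    """
    evidence = sum(math.log(p1 / p0) for p1, p0 in zip(lik_h1, lik_h0))
    prior = math.log(prior_h1 / (1.0 - prior_h1))
    return evidence + prior


# Hypothetical data: heads, heads, tails, heads.
biased = [0.8, 0.8, 0.2, 0.8]  # P(d_i | h1), coin with P(heads) = 0.8
fair = [0.5, 0.5, 0.5, 0.5]    # P(d_i | h0), fair coin
print(log_odds(biased, fair))  # positive: the data favour h1
```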

(50) Pairwise alignment. Weighted alignment.

- suppose our data are two aligned sequences $\vec{x}$, $\vec{y}$
- for the time being, we assume there are no gaps in the alignment
- $h_1$: they developed from a common ancestor via substitutions
- $h_0$: they are unrelated
- additional assumption (a rough approximation in biology, pretty much off the mark in linguistics): substitutions in different positions occur independently

(51) Pairwise alignment. The null model.

- if $\vec{x}$ and $\vec{y}$ are unrelated, their joint probability equals the product of their individual probabilities
- as a start (quite wrong both in biology and in linguistics): let us assume the strings have no “grammar”; each position is independent of all other positions
- then

$$P(\vec{x}, \vec{y}|h_0) = P(\vec{x}|h_0)P(\vec{y}|h_0) = \prod_i P(x_i|h_0)P(y_i|h_0)$$

$$\log P(\vec{x}, \vec{y}|h_0) = \sum_i \left(\log P(x_i|h_0) + \log P(y_i|h_0)\right)$$

(52) Pairwise alignment. The null model.

- suppose $\vec{x}$ and $\vec{y}$ are generated by the same process (reasonable for DNA and protein comparison, false for cross-linguistic word comparison)
- then $P(x_i|h_0)$, $P(y_i|h_0)$ are simply the probabilities of occurrence
- $q_a$: probability that symbol $a$ occurs in a sequence

$$\log P(\vec{x}, \vec{y}|h_0) = \sum_i \log q_{x_i} + \sum_j \log q_{y_j}$$

- $q$ can be estimated from relative frequencies

(53) Pairwise alignment. The alignment model.

- suppose $\vec{x}$ and $\vec{y}$ evolved from a common ancestor via independent substitution mutations
- independence between positions:

$$P(\vec{x}, \vec{y}|h_1) = \prod_i P(x_i, y_i|h_1)$$

- $p_{a,b}$: probability that a position in the latest common ancestor of $\vec{x}$ and $\vec{y}$ evolved into an $a$ in sequence $\vec{x}$ and into a $b$ in sequence $\vec{y}$

$$P(\vec{x}, \vec{y}|h_1) = \prod_i p_{x_i, y_i}$$

$$\log P(\vec{x}, \vec{y}|h_1) = \sum_i \log p_{x_i, y_i}$$

(54) Pairwise alignment. The log-odds score.

Taking things together, we have

$$\log(P(\vec{x}, \vec{y}|h_1) : P(\vec{x}, \vec{y}|h_0)) = \sum_i \log\frac{p_{x_i, y_i}}{q_{x_i} q_{y_i}}$$

- $\log\frac{p_{ab}}{q_a q_b}$: score of the alignment of $a$ with $b$
- also goes by the name of Pointwise Mutual Information (PMI)
- assembled in a PMI substitution matrix

(55) Pairwise alignment. Substitution matrices.

- in bioinformatics, there are several commonly used substitution matrices for nucleotides and proteins
- based on explicit models of evolution and careful empirical testing
- for nucleotides:

|   | A | G | T | C |
|---|---|---|---|---|
| A | 2 | −5 | −7 | −7 |
| G | −5 | 2 | −7 | −7 |
| T | −7 | −7 | 2 | −5 |
| C | −7 | −7 | −5 | 2 |

(56) Pairwise alignment. Substitution matrices.

- for proteins: different matrices for different evolutionary distances
- for instance: BLOSUM50

(57) Pairwise alignment. Substitution matrix for the ASJP data.

1. Identify a large sample of pairs of closely related languages (using expert information or heuristics based on aggregated Levenshtein distance):

- An.NORTHERN_PHILIPPINES.CENTRAL_BONTOC / An.MESO-PHILIPPINE.NORTHERN_SORSOGON
- An.SOUTHERN_PHILIPPINES.KAGAYANEN / An.NORTHERN_PHILIPPINES.LIMOS_KALINGA
- WF.WESTERN_FLY.IAMEGA / WF.WESTERN_FLY.GAMAEWE
- An.MESO-PHILIPPINE.CANIPAAN_PALAWAN / An.NORTHWEST_MALAYO-POLYNESIAN.LAHANAN
- Pan.PANOAN.KASHIBO_BAJO_AGUAYTIA / Pan.PANOAN.KASHIBO_SAN_ALEJANDRO
- NC.BANTOID.LIFONGA / NC.BANTOID.BOMBOMA_2
- AA.EASTERN_CUSHITIC.KAMBAATA_2 / AA.EASTERN_CUSHITIC.HADIYYA_2
- IE.INDIC.WAD_PAGGA / IE.INDIC.TALAGANG_HINDKO
- ST.BAI.QILIQIAO_BAI_2 / ST.BAI.YUNLONG_BAI
- NC.BANTOID.LINGALA / NC.BANTOID.LIFONGA
- An.SULAWESI.MANDAR / An.OCEANIC.RAGA
- An.CENTRAL_MALAYO-POLYNESIAN.BALILEDO / An.CENTRAL_MALAYO-POLYNESIAN.PALUE
- An.SULAWESI.TANETE / An.SAMA-BAJAW.BOEPINANG_BAJAU
- AuA.MUNDA.HO / AuA.MUNDA.KORKU
- UA.AZTECAN.NAHUATL_HUEYAPAN_TETELA_DEL_VOLCAN / UA.AZTECAN.NAHUATL_CUENTEPEC_TEMIXCO
- MGe.GE-KAINGANG.KAYAPO / MGe.GE-KAINGANG.APINAYE

(58) Pairwise alignment. Substitution matrix for the ASJP data.

2. Pick a concept and a pair of related languages at random.
   - languages: Pen.MAIDUAN.MAIDU_KONKAU, Pen.MAIDUAN.NE_MAIDU
   - concept: one
3. Find the corresponding words from the two languages: nisam, niSem.
4. Do Levenshtein alignment:

   n i s a m
   n i S e m

5. For each sound pair, count the number of correspondences: nn: 1; ii: 1; sS: 1; ae: 1; mm: 1.
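Step 5 can be sketched in a few lines. The function name `count_pairs`, the use of `-` as the gap symbol, the decision to skip positions containing a gap, and the choice to keep pairs ordered (top symbol first) are assumptions made here, since the slides do not spell out these details.

```python
from collections import Counter


def count_pairs(alignments):
    """Count aligned sound pairs over a list of (top, bottom) alignments."""
    counts = Counter()
    for top, bottom in alignments:
        for x, y in zip(top, bottom):
            if x != "-" and y != "-":  # assumption: gap positions are not counted
                counts[(x, y)] += 1
    return counts


counts = count_pairs([("nisam", "niSem")])
for (x, y), c in counts.items():
    print(f"{x}{y}: {c}")
```

In the full procedure the counter would be updated across all 100,000 sampled word pairs rather than a single one.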

(59) Pairwise alignment. Substitution matrix for the ASJP data.

Steps 2-5 are repeated 100,000 times. Example alignments:

- klem / klom
- S3--v / S37on
- ligini / ji---p
- kulox / Gulox
- Naltir---i / Naltirtiri
- …

Counts of aligned sound pairs, most frequent first:

| Pair | Count |
|---|---|
| a a | 56,047 |
| i i | 33,955 |
| u u | 23,731 |
| n n | 21,363 |
| o o | 19,619 |
| m m | 18,263 |
| t t | 16,975 |
| k k | 16,773 |
| e e | 12,745 |
| r r | 11,601 |
| l l | 11,377 |
| b b | 8,965 |
| s s | 8,245 |
| d d | 6,829 |
| p p | 6,681 |
| w w | 6,613 |
| N N | 6,275 |
| h h | 5,331 |
| y y | 5,321 |
| 3 3 | 5,255 |
| … | … |

Least frequent pairs (count 2 each): 4 8; 4 a; G t; i !; G y; d !; s G; Z 5; G s; X z; ! k; q 8; a !; a !; ! y; ! E; j G; G i; E !

(60) Pairwise alignment. Substitution matrix for the ASJP data.

6. Determine the relative frequency of occurrence of each sound within the entire database:

a: 0.1479; i: 0.0969; u: 0.0696; o: 0.0626; n: 0.0614; e: 0.0478; k: 0.0478; m: 0.0465; t: 0.0449; r: 0.0346; l: 0.0331; b: 0.0248; s: 0.0243; w: 0.0232; 3: 0.0228; y: 0.0222; d: 0.0214; h: 0.0213; p: 0.0202; N: 0.0201; g: 0.0178

E: 0.0134; 7: 0.0124; C: 0.0073; S: 0.0064; x: 0.0062; c: 0.0056; f: 0.0052; 5: 0.0049; v: 0.0045; q: 0.0041; z: 0.0035; j: 0.0035; T: 0.0029; L: 0.0027; X: 0.0022; 8: 0.0014; Z: 0.0011; !: 0.0009; 4: 0.0002; G: 0.0001

(61) Pairwise alignment. Substitution matrix for the ASJP data.

7. Estimate $p_{ab}$ as the relative frequency of co-occurrence of $a$ with $b$, and $q_a$, $q_b$ as the individual relative frequencies, and determine the PMI scores $\log_2\frac{p_{ab}}{q_a q_b}$.

Highest-scoring pairs (first two column groups), then, after an omitted middle range, the lowest-scoring pairs (last column group):

| Pair | PMI | Pair | PMI | Pair | PMI |
|---|---|---|---|---|---|
| G G | 11.2348 | Z j | 4.9386 | o q | −3.2842 |
| ! ! | 10.0202 | d d | 4.9263 | C a | −3.2893 |
| 4 4 | 9.1480 | g g | 4.8958 | j o | −3.2914 |
| 8 8 | 8.0650 | b b | 4.8906 | a m | −3.2915 |
| Z Z | 7.9575 | s s | 4.8277 | E v | −3.3035 |
| X X | 7.9375 | 4 5 | 4.7508 | ! w | −3.3079 |
| L L | 7.6276 | E E | 4.7143 | ! u | −3.3087 |
| z z | 7.2624 | w w | 4.6512 | 5 q | −3.3116 |
| q q | 7.2542 | h h | 4.5819 | T o | −3.3158 |
| f f | 6.9117 | G x | 4.5573 | ! k | −3.3526 |
| v v | 6.8418 | Z z | 4.4943 | e z | −3.3763 |
| 5 5 | 6.7731 | y y | 4.4637 | ! s | −3.3788 |
| j j | 6.7587 | l l | 4.4037 | f q | −3.3942 |
| T T | 6.6580 | ! G | 4.3760 | N S | −3.3954 |
| S S | 6.6054 | 3 3 | 4.3692 | ! b | −3.4077 |
| c c | 6.5989 | r r | 4.3061 | L b | −3.4558 |
| C C | 6.2439 | X q | 4.1200 | T u | −3.4690 |
| 4 G | 6.1943 | m m | 4.1087 | 4 i | −3.5529 |
| x x | 6.1210 | t t | 4.1021 | 5 a | −3.8294 |
| G X | 5.3342 | G Z | 4.0429 | C N | −3.8451 |
| G q | 5.3017 | k k | 3.9046 | ! t | −4.2625 |
| 7 7 | 5.2111 | X x | 3.8116 | ! e | −4.3534 |
| p p | 5.0693 | T Z | 3.7380 | ! i | −4.3712 |
| N N | 4.9821 | 8 G | 3.6993 | ! a | −4.9817 |
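The PMI computation in step 7 might look like the sketch below. The function name `pmi_scores` and the toy counts and frequencies are made up for illustration; they are not ASJP data. Pairs that never co-occur get no entry here, and how the course handles such zero counts (for example by smoothing) is not stated in the slides.

```python
import math


def pmi_scores(pair_counts, sound_freqs):
    """PMI log2(p_ab / (q_a * q_b)) for every observed sound pair.

    pair_counts: {(a, b): number of times a was aligned with b}
    sound_freqs: {a: relative frequency q_a of sound a}
    """
    total = sum(pair_counts.values())
    return {
        (a, b): math.log2((c / total) / (sound_freqs[a] * sound_freqs[b]))
        for (a, b), c in pair_counts.items()
    }


# Hypothetical toy numbers, not taken from the database.
toy_counts = {("a", "a"): 3, ("a", "e"): 1}
toy_freqs = {"a": 0.5, "e": 0.5}
scores = pmi_scores(toy_counts, toy_freqs)
for pair, score in scores.items():
    print(pair, round(score, 4))
```

A positive score means the pair is aligned more often than expected by chance, a negative score less often.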

(62) Pairwise alignment. Evaluation.

[Figure: heat map of the PMI scores for all pairs of ASJP sound classes, colour scale from −10 to 10. Both axes list the sound classes in the order j Z z L 8 y l d r C T c S s t ! 4 5 x X g h 7 q k G f v w p b n N m i e E 3 o u a.]

(64) Pairwise alignment. Gap penalties.

- gaps in an alignment correspond either to an insertion or to a deletion
- simplifying assumption: insertions and deletions are equally likely at all positions; symbols are inserted according to their general frequency of occurrence
- suppose an item $x_i$ is aligned to a gap; let $\alpha$ be the probability that an insertion occurred since the latest common ancestor, and $\beta$ the probability of a deletion

$$P(x_i, -|h_1) = \alpha q_{x_i} + \beta q_{x_i}$$

$$P(x_i, -|h_0) = q_{x_i}$$

$$\log(P(x_i, -|h_1) : P(x_i, -|h_0)) = \log(\alpha + \beta) = -d$$

- i.e., there is a constant term for each gap
- as $\alpha + \beta < 1$, this term is negative, i.e. there is a constant penalty for each gap

(65) Pairwise alignment. Affine gap penalties.

- deletions/insertions frequently apply to entire blocks of symbols (both in biology and in linguistics)
- the probability of a gap of length $n$ is higher than the product of the probabilities of $n$ individual gaps
- the penalty $e$ for extending a gap is lower than the penalty $d$ for opening a gap
- $g$: length of a gap

$$\gamma(g) = -d - (g - 1)e$$

- there is no principled way to derive the values of $d$ and $e$; they have to be fixed via trial and error
- $d = 2.5$ and $e = 1.6$ work quite well for the ASJP data
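The affine gap score is a one-liner. The sketch below uses the values $d = 2.5$ and $e = 1.6$ quoted above as defaults; the function names `gap_score` and `linear_gap_score` are chosen here, and the linear variant is included only for comparison with a constant per-symbol penalty.

```python
def gap_score(length: int, d: float = 2.5, e: float = 1.6) -> float:
    """Affine score gamma(g) = -d - (g - 1) * e for a gap of the given length."""
    if length < 1:
        raise ValueError("a gap has length at least 1")
    return -d - (length - 1) * e


def linear_gap_score(length: int, d: float = 2.5) -> float:
    """Constant penalty d for every gap symbol, for comparison."""
    return -d * length


for g in range(1, 5):
    print(g, round(gap_score(g), 2), round(linear_gap_score(g), 2))
```

For every gap longer than one symbol the affine score is less negative than the linear one, which is the intended effect: one long gap is cheaper than several independent short ones.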

(66) Pairwise alignment. Weighted alignment.

- so far, we assumed that the alignment between $\vec{x}$ and $\vec{y}$ is known
- to assess the strength of evidence for $h_1$ given $\vec{x}$, $\vec{y}$, we need to consider all alignments between $\vec{x}$ and $\vec{y}$
- enumeration is infeasible, because the number of alignments between two sequences of length $n$ is

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

- computation is nonetheless possible using Pair Hidden Markov Models
- simpler task: find the most likely alignment and determine its log-odds!
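To get a feeling for how quickly this number grows, the following sketch evaluates the exact binomial coefficient next to the approximation. The function names are chosen here; `math.comb` requires Python 3.8 or later.

```python
import math


def n_alignments(n: int) -> int:
    """Exact value of the binomial coefficient (2n choose n)."""
    return math.comb(2 * n, n)


def n_alignments_approx(n: int) -> float:
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 4 ** n / math.sqrt(math.pi * n)


for n in (5, 10, 20):
    print(n, n_alignments(n), round(n_alignments_approx(n)))
```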

(67) Pairwise alignment. The Needleman-Wunsch algorithm.

Almost identical to the Levenshtein algorithm, except:
- matches/mismatches are not counted as 0 and 1, but as the log-odds scores of the corresponding symbol pair
- insertions/deletions are counted as gap penalties
- by convention, the similarity rather than the distance is counted, i.e. we try to find the alignment that maximizes the score

Let $\vec{x}$ have length $n$ and $\vec{y}$ length $m$, let $s_{ab}$ be the log-odds score of $a$ and $b$, and let $d$ and $e$ be the gap penalties.

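(68) Pairwise alignment. The Needleman-Wunsch algorithm.

$F(i, j)$ is the best score for aligning the first $i$ symbols of $\vec{x}$ with the first $j$ symbols of $\vec{y}$; $G(i, j) = 1$ records that the best path into cell $(i, j)$ ends in a gap, so that a further gap counts as an extension. With $d$ and $e$ as positive penalties, in line with $\gamma(g) = -d - (g - 1)e$, they are subtracted from the score:

$$F(0, 0) = 0, \qquad G(0, 0) = 0$$

For all $i$ with $0 < i \le n$:

$$F(i, 0) = F(i-1, 0) - G(i-1, 0)\,e - (1 - G(i-1, 0))\,d, \qquad G(i, 0) = 1$$

For all $j$ with $0 < j \le m$:

$$F(0, j) = F(0, j-1) - G(0, j-1)\,e - (1 - G(0, j-1))\,d, \qquad G(0, j) = 1$$

For all $i, j$ with $0 < i \le n$, $0 < j \le m$:

$$F(i, j) = \max \begin{cases} F(i-1, j) - G(i-1, j)\,e - (1 - G(i-1, j))\,d \\ F(i, j-1) - G(i, j-1)\,e - (1 - G(i, j-1))\,d \\ F(i-1, j-1) + s_{x_i y_j} \end{cases}$$

$$G(i, j) = \begin{cases} 0 & \text{if the maximum is attained by the third case, } F(i-1, j-1) + s_{x_i y_j} \\ 1 & \text{else} \end{cases}$$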
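The recurrences for $F$ and $G$ can be sketched as below. Several things are assumptions made for this illustration rather than details given in the slides: the function name `needleman_wunsch`, treating $d$ and $e$ as positive penalties that are subtracted, breaking ties in favour of a gap, and the toy scoring function `toy_score` (+2 for identical symbols, −1 otherwise), which stands in for a real PMI matrix. As in the recurrences, a single flag $G$ marks whether the best path into a cell ends in a gap; it does not distinguish gaps in $\vec{x}$ from gaps in $\vec{y}$.

```python
def needleman_wunsch(x, y, score, d=2.5, e=1.6):
    """Return (best score, table F) for the global alignment of x and y."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    G = [[0] * (m + 1) for _ in range(n + 1)]  # 1: best path into the cell ends in a gap

    def gap_cost(i, j):
        # Extending a gap costs e, opening a new one costs d.
        return e if G[i][j] else d

    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap_cost(i - 1, 0)
        G[i][0] = 1
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap_cost(0, j - 1)
        G[0][j] = 1

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                F[i - 1][j] - gap_cost(i - 1, j),             # x[i-1] against a gap
                F[i][j - 1] - gap_cost(i, j - 1),             # y[j-1] against a gap
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # x[i-1] against y[j-1]
            ]
            best = max(range(3), key=options.__getitem__)
            F[i][j] = options[best]
            G[i][j] = 0 if best == 2 else 1
    return F[n][m], F


def toy_score(a, b):
    """Hypothetical stand-in for a PMI substitution matrix."""
    return 2.0 if a == b else -1.0


best, table = needleman_wunsch("mEnS", "menEs", toy_score)
print(best)
print([round(v, 1) for v in table[0]])  # boundary row: 0, -d, -d - e, ...
```

Storing back-pointers alongside `F`, as in the Levenshtein case, would allow the optimal alignment itself to be recovered.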

(69) Pairwise alignment. Finding the best alignment. Dynamic programming.

The first row and column hold the accumulated gap penalties; the remaining cells are then filled in one by one:

|   | − | m | E | n | S |
|---|---|---|---|---|---|
| − | 0 | −2.5 | −4.1 | −5.7 | −7.3 |
| m | −2.5 | 4.13 | 1.53 | 0.03 | −1.47 |
| e | −4.1 | 1.53 | 5.65 | 3.05 | 1.55 |
| n | −5.7 | 0.03 | 3.05 | 9.2 | 6.6 |
| E | −7.3 | −1.47 | 4.75 | 6.6 | 7.62 |
| s | −8.9 | −2.97 | 2.15 | 5.1 | 8.84 |

Memorizing in each step which of the three cells to the left and above gave rise to the current entry lets us recover the corresponding optimal alignment.

(99) Pairwise alignment. Evaluation. Left: Levenshtein alignment; right: Needleman-Wunsch alignment. Each cell gives the German row / the Latin row.

Levenshtein                  Needleman-Wunsch
-iX / ego                    iX- / ego
-blat / folyu                b-lat / folyu
han-t / manus                han-t / manus
du / tu                      du / tu
haut-- / -kutis              haut-- / k-utis
--brust / pektus-            b--rust / pektus-
vir / nos                    vir / nos
---blut / saNgwis            ---blut / saNgwis
leb3r / yekur                leb3r / yekur
ains / unus                  ain-s / -unus
knoX3n / --o--s              knoX3n / --os--
triNk3n / -bibere            triNk3n- / -bi-bere
cvai / -duo                  cvai / duo-
horn- / kornu                horn- / kornu
--ze3n / widere              --ze3n / widere
---mEnS / persona            mEnS--- / persona
-au-g3 / okulus              a-ug3- / okulus
-her3n / audire              --her3n / audire-
---fiS / piskis              fiS--- / piskis
na-z3 / nasus                naz3- / nasus
Sterb3n / -mor--i            Sterb3n / -mor-i-
hun-t / kanis                hun-t / kanis
chan / dens                  chan- / d-ens
khom3n / wenire              khom3n--- / w---enire
-----laus / pedikulus        ------laus / pedikul-us
-chuN3 / liNgwE              chuN--3 / -liNgwE
zon3 / so-l                  zon3 / sol-

(100) Pairwise alignment. Evaluation (continued).

Levenshtein                  Needleman-Wunsch
vas3r / -akwa                --vas3r / akwa---
Stain / lapis                Sta-in / -lapis
-foia / iNnis                fo-ia / iNnis
pfat / viya                  p-fat / viya-
bErk / mons                  bErk / mons
n-at / noks                  na-t / noks
---fol / plenus              fol---- / p-lenus
no--i / nowus                no-i- / nowus
nam-3 / nomen                nam3- / nomen

(101) Pairwise alignment. German–Swabian

concept    German     Swabian    score
I          iX         i            0.3
you        du         du          8.26
we         vir        mia        -1.09
one        ains       ois         4.63
two        cvai       cvoi        16.0
person     mEnS       mEnZE      12.61
fish       fiS        fiS        16.35
louse      laus       laus       15.01
tree       baum       bom         6.57
leaf       blat       blad       11.92
skin       haut       haut       14.42
blood      blut       blud       12.88
bone       knoX3n     knoXE      16.88
horn       horn       hoan        8.75
tongue     chuN3      cuN          9.8
knee       kni        knui        7.77
hand       hant       hEnd         8.6
breast     brust      bXuSt      14.81
liver      leb3r      leba       10.01
drink      triNk3n    dXiNg       4.99
see        ze3n       se          0.63
die        Sterb3n    StEab      10.16
come       khom3n     khom       11.84
sun        zon3       sonE        8.79
star       StErn      StEan      16.16
water      vas3r      vaza         7.8
stone      Stain      Stoi       10.36
fire       foia       fuia       12.43

(102) Pairwise alignment. German–English

concept    German     English    score
I          iX         Ei          -2.3
you        du         yu          2.34
we         vir        wi          2.21
one        ains       w3n         -2.3
two        cvai       tu         -5.25
fish       fiS        fiS        16.35
dog        hunt       dag        -7.46
tree       baum       tri        -7.83
leaf       blat       lif        -0.47
blood      blut       bl3d        9.46
bone       knoX3n     bon        -1.36
horn       horn       horn       15.73
eye        aug3       Ei          -4.1
nose       naz3       nos         1.63
tongue     chuN3      t3N        -0.63
knee       kni        ni          3.86
hand       hant       hEnd         8.6
breast     brust      brest      16.93
liver      leb3r      liv3r      14.65
drink      triNk3n    drink       7.48
see        ze3n       si         -3.04
die        Sterb3n    dEi         -7.7
come       khom3n     k3m         1.22
sun        zon3       s3n         1.95
star       StErn      star         8.2
water      vas3r      wat3r      12.06
stone      Stain      ston        6.75
fire       foia       fEir        6.79

(103) Pairwise alignment. German–Latin

concept    German     Latin        score
I          iX         ego          -3.87
you        du         tu            3.62
we         vir        nos          -5.06
one        ains       unus          2.39
two        cvai       duo          -5.51
person     mEnS       persona      -4.66
fish       fiS        piskis        0.29
louse      laus       pedikulus    -0.08
tree       baum       arbor        -3.85
leaf       blat       folyu        -3.57
skin       haut       kutis        -0.25
blood      blut       saNgwis      -9.18
bone       knoX3n     os           -5.72
horn       horn       kornu         7.55
nose       naz3       nasus         4.49
tooth      chan       dens         -2.78
tongue     chuN3      liNgwE        -3.4
knee       kni        genu           0.8
hand       hant       manus         0.73
breast     brust      pektus        1.39
liver      leb3r      yekur         5.37
see        ze3n       widere       -4.15
hear       her3n      audire       -4.24
die        Sterb3n    mori         -6.12
come       khom3n     wenire       -9.25
sun        zon3       sol           0.97
star       StErn      stela         5.72
water      vas3r      akwa          -5.4

(104) Pairwise alignment. How well does PMI similarity predict cognacy? Expert cognacy judgments are used as the gold standard.

[Figure: for LDN (axis from 0 to 1) and for PMI (axis from −20 to 20), the empirical probability of cognacy plotted against the measure, and the values of the measure for cognate (yes) and non-cognate (no) pairs.]

(105) Pairwise alignment. How well does PMI similarity predict cognacy?

[Figure: precision-recall curves for LDN and PMI; x-axis: recall, y-axis: precision.]

Average precision: PMI 0.864; LDN 0.847.
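For orientation, average precision over a ranked list of word pairs can be computed along the following lines. This is a sketch of one common definition (mean of the precision values at the ranks of the true cognates); the slides do not say which variant was used, and the function name is illustrative. Ties between scores are not treated specially here.

```python
def average_precision(scores, labels):
    """Mean of precision@rank over the ranks at which a true cognate pair occurs.

    scores: higher means "more likely cognate" (for a distance such as LDN,
            pass the negated distance).
    labels: 1 for pairs judged cognate by the experts, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

A perfect ranking (all cognate pairs ahead of all non-cognate pairs) yields 1.0 under this definition.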

(106) Estimating distances from pairwise alignments.

(107) Estimating distances from pairwise alignments. Probability of cognacy: logistic regression to predict the probability of cognacy from PMI similarity.

Lowest predicted probabilities:

concept    Italian        English    predicted prob.    expert judgment
sharp      tallEnte       Sop        0.004              0
float      galleddZare    fl3ut      0.004              0
kill       ammattsare     kil        0.007              0
bark       skordza        bok        0.009              0
husband    marito         hozb3nd    0.010              0
walk       kamminare      wok        0.011              0
eat        mandZare       it         0.011              0
bark       kortettSa      bok        0.013              0
know       sapere         n3u        0.015              0
come       venire         kom        0.016              1
swim       nwotare        swim       0.016              0
back       dosso          bEk        0.018              0
burn       ardere         b3n        0.018              0
think      pensare        8iNk       0.019              0
dust       polvere        dost       0.019              0
wife       molle          waif       0.020              0
swell      gonfyare       swEl       0.021              0
sing       kantare        siN        0.022              0
knee       rotElla        ni         0.022              0
dry        aSSutto        drai       0.022              0
five       tSinkwe        faiv       0.023              1
skin       pElle          skin       0.024              0
hand       mano           hEnd       0.025              0
blood      sangwe         blod       0.025              0
flow       skorrere       fl3u       0.026              0
wipe       aSSugare       waip       0.026              0
turn       dZirare        t3n        0.026              0

Highest predicted probabilities:

concept    Italian        English    predicted prob.    expert judgment
father     padre          fo83       0.480              1
when       kwando         wEn        0.483              1
night      notte          nait       0.508              1
and        eed            End        0.518              0
name       nome           neim       0.519              1
worm       vErme          w3m        0.521              1
round      tondo          raund      0.526              1
many       molti          mEni       0.569              0
wind       vEnto          wind       0.573              1
two        due            tu         0.600              1
mother     madre          mo83       0.624              1
thou       tu             8au        0.629              1
child      fantSullo      tSaild     0.638              0
long       lungo          loN        0.651              1
fish       peSSe          fiS        0.659              1
count      kontare        kaunt      0.660              1
star       stella         sto        0.664              1
belly      vEntre         bEli       0.679              0
sun        sole           son        0.692              1
fly        volare         flai       0.742              0
three      tre            8ri        0.744              1
flow       fluire         fl3u       0.759              0
heavy      grEve          hEvi       0.769              0
person     persona        p3s3n      0.799              1
animal     animale        Enim3l     0.947              1
vomit      vomitare       vomit      0.960              1
fruit      frutto         frut       0.966              1
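A one-predictor logistic regression of this kind can be sketched as below. The training data are invented toy values, not the course data, and plain gradient descent is used only to keep the example dependency-free; the slides do not say how the model was fitted or whether further predictors were included.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.05, epochs=5000):
    """Fit P(cognate | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return sigmoid(w * x + b)


# Toy data (hypothetical): PMI similarity scores and 0/1 cognacy judgments.
toy_pmi = [-8.0, -5.0, -3.0, -1.0, 0.0, 2.0, 5.0, 8.0, 12.0, 16.0]
toy_cognate = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(toy_pmi, toy_cognate)
```

With a fitted `w` and `b`, `predict(w, b, score)` plays the role of the "predicted prob." column in the tables above.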

(108) Estimating distances from pairwise alignments. Estimating distances.

Average or maximal predicted probability of cognacy per concept = expected relative frequency of cognate pairs.
Expected relative frequency of cognate pairs = e^{-t}.
⇒ distance estimation from raw data
⇒ applicable across language families

[Figure: scatter plot of predicted distances (axis "prediction", 0.0 to 2.5) against expert-based distances (axis "expert", 0.0 to 2.5); each point is labeled with a language pair, from closely related pairs such as Russian−Ukrainian and Portuguese−Spanish to distant pairs such as Welsh−Nepali and Breton−Nepali.]
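Solving "expected relative frequency of cognate pairs = e^{-t}" for t gives t = −ln(frequency). A minimal sketch, assuming the per-concept probabilities are first aggregated (maximum or average, as on the slide) and then averaged over concepts; the second averaging step, the data layout and the function names are assumptions made for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def estimated_distance(probs_by_concept, aggregate=max):
    """Estimate t from predicted cognacy probabilities.

    probs_by_concept: dict mapping a concept to the predicted probabilities of
                      all word pairs for that concept (several if there are synonyms).
    aggregate:        max or mean, applied per concept.
    """
    per_concept = [aggregate(ps) for ps in probs_by_concept.values()]
    freq = mean(per_concept)  # expected relative frequency of cognate pairs
    return math.inf if freq <= 0.0 else -math.log(freq)


# Toy input (hypothetical probabilities, not taken from the course data).
toy = {"fish": [0.66], "two": [0.60, 0.20], "sharp": [0.004]}
t_max = estimated_distance(toy, aggregate=max)
t_avg = estimated_distance(toy, aggregate=mean)
```

Because averaging over synonyms can only lower the per-concept value relative to the maximum, `t_avg` is never smaller than `t_max`.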

(109) Estimating distances from pairwise alignments. Estimating distances: Neighbor Joining tree.

[Figure: Neighbor Joining tree; leaves in the order shown: Irish, Welsh, Breton, Romanian, French, Catalan, Italian, Spanish, Portuguese, Danish, Swedish, Icelandic, English, German, Dutch, Greek, Bengali, Nepali, Hindi, Lithuanian, Bulgarian, Ukrainian, Russian, Czech, Polish.]

(110) Estimating distances from pairwise alignments. Languages of Eurasia / ASJP data. Cf. Jäger (2015); full tree can be inspected here.

[Figure: tree of Eurasian language families with percentage labels (96.8%, 96.9%, 99.4%, 99.9%, 100%). Families shown: Indo-European, Uralic, Chukotko-Kamchatkan, Nivkh, Turkic, Yukaghir, Tungusic, Mongolic, Tai-Kadai, Austronesian, Hmong-Mien, Sino-Tibetan, Austroasiatic, Ainu, Japonic, Nakh-Daghestanian, Dravidian, Yeniseian.]

(111) Estimating distances from pairwise alignments. Languages of the World / ASJP data. Cf. Jäger and Wichmann (2016); full tree can be inspected here.

[Figure: world tree of languages, colored by macro-area (Africa, Eurasia, Papunesia, Australia, America). Legible group labels include Tai-Kadai, Timor-Alor-Pantar, Sino-Tibetan, Uto-Aztecan, Austro-Asiatic, Austronesian, Australian, Afro-Asiatic, Niger-Congo, Altaic, Uralic, Indo-European, Dravidian, Nilo-Saharan, Kadugli and Khoisan.]

(112) Multiple sequence alignment.

(113) Multiple sequence alignment. Needleman-Wunsch only does pairwise alignment. Desirable: aligning all sequences of a taxon into one matrix. This is necessary for character-based phylogenetic inference, and it improves the quality of the alignment.

(114) Multiple sequence alignment. Example: 'one'. PIE: oinos; Bosnian: yedan; Kashubian: yEdEn.

Optimal pairwise alignments:

oinos    oinos    yedan
yedan    yEdEn    yEdEn

Optimal multiple alignment (maximizing the sum of pairwise similarities per column):

yEdEn--
-o-inos
yedan--

The alignment of all 'n's is etymologically correct.
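The objective named on this slide, the sum of pairwise similarities per column, can be sketched as follows. The similarity function is a toy stand-in (the course uses PMI scores), and the vowel set, the gap penalty and the treatment of gap/gap pairs are assumptions made for illustration. The rows are the 'one' alignment as reconstructed from the slide.

```python
from itertools import combinations

VOWELS = set("aeiouE3")  # assumed ASJP-style vowel symbols for the toy score


def toy_sim(x, y):
    # Hypothetical similarity: identical symbols 2, two different vowels 1, otherwise -2.
    if x == y:
        return 2
    if x in VOWELS and y in VOWELS:
        return 1
    return -2


def sum_of_pairs(rows, sim, gap="-", gap_score=-1):
    """Sum the pairwise similarities of all symbol pairs in every column."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == gap and y == gap:
                continue  # two gaps facing each other carry no information
            if x == gap or y == gap:
                total += gap_score
            else:
                total += sim(x, y)
    return total


gapped = ["yEdEn--", "-o-inos", "yedan--"]   # Kashubian, PIE, Bosnian
ungapped = ["yEdEn", "oinos", "yedan"]       # naive position-by-position stacking

score_gapped = sum_of_pairs(gapped, toy_sim)
score_ungapped = sum_of_pairs(ungapped, toy_sim)
```

Under this toy scoring the gapped alignment, which puts all three 'n's in one column, scores higher than the naive stacking; with real PMI scores the numbers would differ.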

(115) Multiple sequence alignment. In principle, the Needleman-Wunsch algorithm can be generalized to aligning k sequences. However, aligning k sequences of length n has complexity O(n^k) ⇒ computationally intractable. Two strategies: heuristic search; progressive alignment.

(116) Multiple sequence alignment. Progressive sequence alignment. Start with a guide tree (using some heuristics like pairwise alignment + Neighbor Joining). Working bottom-up, at each internal node, do a pairwise alignment of the block alignments at the daughter nodes. Complexity is O(n^2 k^3) ⇒ computationally feasible.

(117) Multiple sequence alignment. T-Coffee. Progressive alignment only uses (phylogenetically) local information; erroneous decisions cannot be corrected later.

[Figure: guide tree over the words dendron, 8enro, dru and tri, with the block alignment built at each internal node, bottom-up:]

dendron    dendron    dendron
8en-ro-    8en-ro-    8en-ro-
           d---ru-    d---ru-
                      ---tri-

(121) Multiple sequence alignment. T-Coffee: Tree-based Consistency Objective Function for alignment Evaluation (Notredame et al., 2000) (slightly adapted for linguistic application).

1. pairwise alignment for all word pairs, using PMI scores
2. ternary alignments via relation composition
3. indirect alignment scores between sound occurrences
4. progressive alignment using those scores

[Figure: the pairwise and ternary alignments of dendron, 8enro, dru and tri, and the guide tree with the block alignment built at each internal node, bottom-up:]

dendron    dendron    dendron
8en-ro-    8en-ro-    8en-ro-
           d---ru-    d---ru-
                      t---ri-
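Steps 2 and 3 can be illustrated with a small sketch: each pairwise alignment is turned into a relation between sound positions, relations are composed through a third word, and the support for each position pair is counted. Unit weights are an assumption made here for simplicity; Notredame et al. (2000) weight the contributions, and the slides do not spell out the linguistic adaptation. The alignment rows are taken from the slide as reconstructed.

```python
from collections import Counter


def matched_pairs(row_a, row_b, gap="-"):
    """Return the set of (position in word A, position in word B) aligned to each other."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs


def compose(a_to_c, c_to_b):
    """Relation composition: A-position i and B-position j both aligned to some C-position."""
    return {(i, j) for (i, k) in a_to_c for (k2, j) in c_to_b if k == k2}


def indirect_scores(direct, via):
    """Count direct support plus support through every intermediate word (unit weights)."""
    scores = Counter()
    for pair in direct:
        scores[pair] += 1
    for a_to_c, c_to_b in via:
        for pair in compose(a_to_c, c_to_b):
            scores[pair] += 1
    return scores


dendron_8enro = matched_pairs("dendron", "8en-ro-")
enro_dru = matched_pairs("8enro", "d--ru")
dendron_dru = matched_pairs("dendron", "---dru-")  # direct pairwise alignment

scores = indirect_scores(dendron_dru, [(dendron_8enro, enro_dru)])
```

In this toy run the direct alignment matches the 'd' of dru with the second 'd' of dendron, while the path through 8enro supports the first 'd'; the 'r' and the final vowel receive support from both sources.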

(122) Multiple sequence alignment. Examples.

cognate class    language      word
one:A            German        -a-i--n-
one:A            Dutch         ---e--n-
one:A            English       -w-o--n-
one:A            Danish        ---e--n-
one:A            Swedish       ---E--n-
one:A            Icelandic     ---eidn-
one:A            Irish         ---e--n-
one:A            Breton        ---i--n-
one:A            French        ---E----
one:A            Catalan       ---u--n-
one:A            Spanish       ---u--no
one:A            Portuguese    ---u----
one:A            Italian       ---u--no
one:A            Romanian      ---u--nu
one:A            Bengali       ---E--k-
one:A            Nepali        ---e--k-
one:A            Czech         yEdE--n-
one:A            Polish        yEdE--n-
one:A            Ukrainian     -odi--n-
one:A            Russian       -adi--n-
one:A            Bulgarian     -3di--n-

cognate class    language      word
heart:J          German        h-Er-t--s-
heart:J          Dutch         h-or-t----
heart:J          English       h-o--t----
heart:J          Danish        y-Ea-d--3-
heart:J          Swedish       y-E--t--a-
heart:J          Icelandic     S-ar-t--a-
heart:J          French        k-Er------
heart:J          Catalan       k-or------
heart:J          Spanish       k-ora8--on
heart:J          Portuguese    k-uras--aw
heart:J          Italian       kwor----e-
heart:J          Hindi         h--r-d--ai
heart:J          Lithuanian    S-ir-dis--
heart:J          Czech         s--r-t-sE-
heart:J          Polish        s-Er-t-sE-
heart:J          Ukrainian     s-Er-t-sE-
heart:J          Russian       s-Erdt-sE-
heart:J          Bulgarian     s-3r-t-sE-
heart:J          Greek         k-ar-8-Sa-

(123) Multiple sequence alignment. Examples.

cognate class    language      word
two:A            German        tsvai-
two:A            Dutch         t-we--
two:A            English       t--u--
two:A            Danish        d--o--
two:A            Swedish       t-vo--
two:A            Icelandic     t-veir
two:A            French        d--e--
two:A            Catalan       d--o-s
two:A            Spanish       d--o-s
two:A            Portuguese    d--oiS
two:A            Italian       d--ue-
two:A            Romanian      d--o-y
two:A            Nepali        d--ui-
two:A            Czech         d-va--
two:A            Polish        d-va--
two:A            Ukrainian     d-wa--
two:A            Russian       d-va--
two:A            Bulgarian     d-va--
two:A            Greek         8-io--

cognate class    language      word
mother:A         German        mu-t--a-
mother:A         Dutch         mu-d--3r
mother:A         English       mo-8--3-
mother:A         Danish        mo----a-
mother:A         Swedish       mu-d--3r
mother:A         Icelandic     mou8--ir
mother:A         French        mE---r--
mother:A         Catalan       ma---r3-
mother:A         Spanish       ma-8-re-
mother:A         Portuguese    ma----i-
mother:A         Italian       ma-d-re-
mother:A         Czech         ma-t-ka-
mother:A         Polish        ma-t-ka-
mother:A         Ukrainian     ma-t--i-
mother:A         Russian       ma-t----
mother:A         Bulgarian     ma-y-k3-
mother:A         Greek         mi-tera-

(124) Multiple sequence alignment. Examples.

cognate class    language      word
tongue:W         German        ---tsuN--3
tongue:W         Dutch         ---t-oN---
tongue:W         English       ---t-oN---
tongue:W         Danish        ---d-oN--3
tongue:W         Swedish       ---t-3N--a
tongue:W         Icelandic     ---t-uNg-a
tongue:W         French        ---l-o-g--
tongue:W         Catalan       ---l-ENgw3
tongue:W         Spanish       ---l-eNgwa
tongue:W         Portuguese    ---l-i-gua
tongue:W         Italian       ---l-ingwa
tongue:W         Romanian      ---l-im-b3
tongue:W         Hindi         ---dZi--b-
tongue:W         Czech         ya-z-ik---
tongue:W         Polish        yEwz-3k---
tongue:W         Ukrainian     ya-z-ik---
tongue:W         Russian       yi-z-3k---
tongue:W         Bulgarian     -3-z-ik---

cognate class    language      word
tooth:B          Greek         8-ondi
tooth:B          German        tsan--
tooth:B          Dutch         t-ont-
tooth:B          English       t-u-8-
tooth:B          Danish        d-an--
tooth:B          Swedish       t-and-
tooth:B          Icelandic     t-En--
tooth:B          French        d-o---
tooth:B          Catalan       d-en--
tooth:B          Spanish       dyente
tooth:B          Portuguese    d-e-t3
tooth:B          Italian       d-Ente
tooth:B          Romanian      d-inte
tooth:B          Bengali       d-o-t-
tooth:B          Hindi         d-a-t-

(125) Multiple sequence alignment. Examples.

cognate class    language      word
dog:A            Lithuanian    S-u---o-
dog:A            Ukrainian     s-obaka-
dog:A            Russian       s-abaka-
dog:A            Danish        h-u-n---
dog:A            Swedish       h-3-nd--
dog:A            Icelandic     h-i-ndir
dog:A            German        h-u-nt--
dog:A            Dutch         h-o-nt--
dog:A            Welsh         k-----i-
dog:A            Breton        k-----i-
dog:A            Irish         k-----u-
dog:A            French        S-i---E-
dog:A            Italian       k-a-n-e-
dog:A            Portuguese    k-a---u-
dog:A            Romanian      k-3yn-e-
dog:A            Greek         Tio-n---

cognate class    language      word
tree:C           Danish        d---G-E--
tree:C           Swedish       t---r-Ed-
tree:C           Icelandic     t---ryE--
tree:C           English       t---r-i--
tree:C           Ukrainian     dE--r-Ewo
tree:C           Russian       dE--r-Evo
tree:C           Polish        d---Z-Evo
tree:C           Bulgarian     d3--r--vo
tree:C           Greek         8endr---o

(126) Wrapping up.

(127) Wrapping up.

Important topics not covered:
Bayesian tree estimation ⇒ no good introductory texts so far (that I would be aware of); the best starting point might be Chen et al. (2014), esp. chapters 1 and 7.
Estimation of time depths in years (rather than in "expected number of mutations") ⇒ Chang et al. (2015).
Automatic cognate detection ⇒ first part of the Jäger/List manuscript on the course homepage.

Hot research topics:
Automatic discovery of regular sound correspondences and sound laws.
Automatic reconstruction of proto-forms.
Factoring vertical descent from language contact.
Integrated probabilistic inference of sequence alignment and phylogenies.

(128) References.

Chang, W., C. Cathcart, D. Hall, and A. Garrett (2015). Ancestry-constrained phylogenetic analysis supports the Indo-European steppe hypothesis. Language, 91(1):194–244.

Chen, M.-H., L. Kuo, and P. O. Lewis (2014). Bayesian Phylogenetics. Methods, Algorithms and Applications. CRC Press, Abingdon.

Dolgopolsky, A. B. (1986). A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia. In V. V. Shevoroshkin, ed., Typology, Relationship and Time: A collection of papers on language change and relationship by Soviet linguists, pp. 27–50. Karoma Publisher, Ann Arbor.

Jäger, G. (2015). Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences, 112(41):12752–12757. Doi: 10.1073/pnas.1500331112.

Jäger, G. and S. Wichmann (2016). Inferring the world tree of languages from word lists. In S. G. Roberts, C. Cuskley, L. McCrohon, L. Barceló-Coblijn, O. Feher, and T. Verhoef, eds., The Evolution of Language: Proceedings of the 11th International Conference (EVOLANG11). Available online: http://evolang.org/neworleans/papers/147.html.

List, J.-M. (2014). Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf.
