
IGOR YANOVICH

igor.yanovich@uni-tuebingen.de

DFG Center for Advanced Studies “Words, Bones, Genes and Tools”

University of Tübingen, Rümelinstraße 23, Rm. 604b, Tübingen, 72070, Germany

Abstract. Sapir’s drift and unidirectionality with exceptions are two empirical patterns observed in language change. In Sapir’s drift, languages descending from the same proto-language undergo very similar changes after their divergence, constituting independent parallel evolution. In unidirectionality with exceptions, grammaticalization processes, such as when lexical verbs turn into tense markers as the Latin verb habeo ‘have’ turned into the French future marker -a in lir-a ‘will read’, are largely unidirectional, but there are rare instances of back-developments. Both patterns are problematic enough for the conventional views on language change that there have been attempts in the literature to deny their existence despite the evidence. I set up a stochastic framework for analyzing time-courses of language change in which both Sapir’s drift and unidirectionality with exceptions are actually expected to arise and can be meaningfully studied. The uttered tokens of the old and new linguistic variants are treated as analogous to alleles at the same locus in population genetics, copied under possible reanalysis and innovation (i.e. mutation) and differential proneness to be repeated by speakers (i.e. fitness). In multi-speaker networks, the finite size of speaker memory creates substantial genetic drift, which in turn causes Sapir’s drift. The existence of exceptions to unidirectionality depends on whether token-level linguistic mutations have precise back-mutations. Certain previously puzzling empirical data from historical sociolinguistics receive a new explanation under our model. Finally, as some evolutionary parameters are invariant over speaker-network structures, the road to practical evolutionary inference opens. However, identifiability is relatively poor in the general case, so background constraints on the model will need to be obtained through classical historical-linguistic analysis before successful inference can be conducted.

Key words and phrases. Language change, Sapir’s drift, grammaticalization, population genetics, inference of evolutionary parameters.


Contents

1. Two phenomena in language change: Sapir’s drift and unidirectionality with exceptions
2. Applying evolutionary modeling to language change
3. Modeling language change as replication of tokens
3.1. Atomic events of change
3.2. The evolutionary process of language maintenance and change
3.3. The base process: a single speaker or a small group
3.4. The behavior of the base process
3.5. The interesting case: multiple populations
3.6. The behavior of the multi-speaker process
4. Flexibility of the model: how to add more realism
5. Empirical benefits
6. The road to inference from empirical data
Acknowledgements
References
Supplementary Information
SI 1. Base process
SI 2. Multi-speaker process

1. Two phenomena in language change: Sapir’s drift and unidirectionality with exceptions

Sapir’s drift [1, 2, 3] is a pattern in language change whereby languages related by descent show very similar developments long after their separation. For example, the prehistoric Iranian language Old Avestan has both short and long endings for -a-stem verbs expressing the first person singular present indicative, namely -ā and -āmi, but Younger Avestan preserves only the long ending -āmi.

Similarly, Vedic Sanskrit, from the Indic language family which is the closest relative to the Iranian languages, has short and long endings for the subjunctive first person singular present, -ā and -āni, but its descendant Classical Sanskrit only preserves the long ending [4]. Both languages changed their inflection systems in the same direction in the first millennium BCE, many centuries after their separation from the common ancestor, which happened before the mid-second millennium BCE.

As another example, languages of the Tibeto-Burman family frequently feature so-called anti-ergative, or secundative, marking of nominal arguments in a clause [2]: the ‘case’ marker of the direct object ‘a deer’ in ‘Mary saw a deer’ is the same in this marking scheme as that on the goal argument ‘to Ann’ in ‘Mary gave the book to Ann’, but different from the case on ‘the book’. This system is relatively rare cross-linguistically [5], but was present in 84 Tibeto-Burman languages in the sample of [2], while only 20 had a system analogous to e.g. German, where ‘a deer’ and ‘the book’ from the above sentences have the same case. Crucially, the linguistic markers used by secundative Tibeto-Burman languages cannot stem from any common ancestor. They are so different that they must have arisen independently many times throughout the family [2].

In yet another case, present-day English must, Dutch moeten and German müssen express various flavors of necessity and descend from the same common-Germanic verb mōtan. In the early recorded texts in their predecessor languages, these verbs all have a different meaning, sometimes analyzed as possibility [6, 7] and other times as an unusual meaning indicating that possibility and necessity collapse together [8, 9]. The old meaning was lost at different times during the last millennium, when German, Dutch and English were already quite different languages.

This list can be continued. While there can be questions about any single case, the phenomenon of parallel independent development, or Sapir’s drift, is evidently present in at least some cases of language change. When changes after separation are typologically common, this is not particularly surprising, as they could happen simply by chance. But when the shared independent developments are uncommon cross-linguistically, this is puzzling for the following reason. There must be some factor special to the family in question that determines the later uncommon parallel developments.

But if that factor is present in all languages of the family, it must have been already present in their common ancestor. But then why did the change itself not happen in that ancestor already?

The stochastic framework for analyzing language change developed below answers those questions by demonstrating that under many natural settings, a delay of this type is actually expected.

Unidirectionality with (rare) exceptions concerns a different type of case. Many linguistic changes are so common across the world that we can speak of regular change pathways, and it is an empirical fact that many such pathways allow change predominantly in one direction. For example, there are many documented cases where expressions of internal ability develop a new meaning of permission, as English can did: in Middle English, it completely lacked the sense of permission [10] that it has now. But an opposite development has never been registered to date in a large sample of examined languages [11, 12, 13, 14]. The regular change pathway in this case is [internal ability] → [circumstantial possibility] → [permission]. However, while there are no known cases of permission verbs turning into internal-ability verbs, the back-development from circumstantial possibility to internal ability appears sometimes possible, if not very common [14, 15]. This is a typical situation: most known grammaticalization developments are unidirectional, but there is a small number of exceptional back-shifts [16, 17].

The problem with unidirectionality with exceptions is that currently there is no conceptual framework that can derive it. If regular pathways did not allow exceptions, unidirectionality would have had the status of a law of nature, plain and simple. Unfortunately, the reality is more complex.

But if a pathway allows movement in both directions, why then do we mostly see only one of them? The quantitatively oriented framework developed here will allow us to easily understand processes of change with such characteristics without having to claim that there are never true back-developments [18].

2. Applying evolutionary modeling to language change

In this section, we first consider the


3. Modeling language change as replication of tokens

Globally speaking and simplifying slightly, language change is when a linguistic construction A present at time t in the linguistic output of some speech community gives rise to, or gets replaced by, a construction B at a later time t′. A and B can belong to any level of linguistic structure: they can be sounds, suffixes, word order patterns, form-meaning pairings, etc. We model change and maintenance by directly modeling the linguistic output over time, in the spirit of [19]. Unlike the Utterance Selection model [20], also based on [19], our model abstracts away from speaker grammars, whose influence is only captured indirectly in the utterance reproduction rule.1

3.1. Atomic events of change. On the most local level, crucial atomic steps of change involve reanalysis and innovation. Reanalysis occurs on the hearer side. You hear what the speaker intended to be utterance α(A), the notation that denotes a linguistic sentence α with an element A substituted into it. But the overall context of the conversation is such that it’s perfectly reasonable for you to assume that the speaker intended some β(B). In general, there wouldn’t be too many such confusion contexts available, and even when they occur, you wouldn’t always recover β(B) and not the intended α(A). But sometimes you might. Now, α and β would often be constrained to have many identical elements, so the reanalysis would affect relatively small elements A and B, to be deduced from the bigger structure while keeping everything else fixed [21]. As an example, Ann may say ‘I am going to read’ intending to describe her current action of walking (literal, directional meaning of ‘be going to’), while Mary takes Ann to have meant that there is a future event of reading by Ann (futurate meaning of ‘be going to’). Elements like ‘I’ or ‘read’ remain constant between Ann’s and Mary’s analyses, and the thing changed is the meaning of the going-construction. Note that the target B can be widely present in the language (as the futurate meaning of ‘be going to’ in Present-Day English), or be absent from it (as when somebody made a similar reanalysis for the first time). The probability of reanalysis in these two cases will differ, but the abstract effect of the change event remains the same.

The second type of atomic change, innovation, happens on the speaker side. Imagine Jim, who wants to express the thought ‘In the near future, there’s an event of my walking to the library’. People around him use the futurate ‘be going to’ construction to express such a meaning, but always articulate it more or less fully. Jim is in a hurry, and he instead pronounces something closer to ‘I’m gonna walk to the library’. What just happened is that Jim substituted a new form-meaning pairing B = ⟨be gonna, futurate⟩, whose formal side is more reduced than in the original pairing A = ⟨be going to, futurate⟩.

Both types of atomic change events can be usefully viewed as mutation: a change that makes the copied linguistic unit differ from the source it was copied from. For reanalysis, there will by definition exist exact back-mutations. Indeed, suppose the speaker produced innovative β(B) in a confusion context where α(A) would also be appropriate. The hearer could then misanalyze that utterance as α(A). Confusion contexts shield the hearer from reliably identifying the speaker’s intentions, and this necessarily can produce both mistakes in favor of β(B) and of α(A), though the relative probabilities of the two misidentifications can differ. For instance, if people use futurates much more often than directionals, then a rational hearer would interpret an ambiguous expression as a futurate more often. So the rate for [directional]→[futurate] will be greater than for [futurate]→[directional].
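To make the asymmetry concrete, here is a minimal sketch (my own illustration, with made-up numbers, not an analysis from the paper) of the rational-hearer reasoning: if a hearer resolves a token in a confusion context in proportion to the assumed prior frequencies of the two readings, the two reanalysis rates come out unequal.

```python
# Toy numbers, purely illustrative: none of them come from the paper.
p_futurate = 0.9      # assumed prior share of futurate uses of 'be going to'
p_directional = 0.1   # assumed prior share of literal directional uses
p_confusable = 0.05   # assumed chance that a produced token sits in a confusion context

def pick_prob(target_prior):
    """A hearer resolving an ambiguous token in proportion to the prior frequencies."""
    return target_prior / (p_futurate + p_directional)

r_dir_to_fut = p_confusable * pick_prob(p_futurate)     # meant as directional, heard as futurate
r_fut_to_dir = p_confusable * pick_prob(p_directional)  # meant as futurate, heard as directional

print(r_dir_to_fut, r_fut_to_dir)  # ~0.045 vs ~0.005: the forward rate dominates
```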

Innovation, in contrast, can allow or disallow back-mutations depending on the actual linguistic process in question. Consider first a speaker aiming to reach a particular articulatory target A, but missing and producing B instead. Since articulatory mistakes can in principle go both ways, there would be a parallel possibility to aim for B but produce A instead, though the probabilities of A→B and B→A may differ. In contrast, consider a speaker omitting a case ending of some word. If somebody then uses that produced form as their target, the probability of them spontaneously adding the prior case ending will be extremely small if that form is the only source of information for the new production: the form just does not by itself encode anything about the missed ending. So in some innovation-driven changes, unlike in those driven by reanalysis, the probability of atomic back-mutation will be negligible.

1 Mathematically, the focus of [20] is on diffusion approximations valid for very large populations of uttered tokens and of speakers. Our interest, in contrast, is in the exact behavior in small-size settings, which is required to better understand Sapir’s drift.

3.2. The evolutionary process of language maintenance and change. Most speech production by speakers involves faithful reproduction rather than atomic change events. A speaker will construct a new token of construction A based on previous experience with earlier tokens of A. Here, we model the speaker as having a finite-size memory pool designated for this construction and its alternatives. The pool stores previously heard or produced tokens. This exemplar-based memory, while having parallels in the literature [22], is not intended here to necessarily be the true picture of what happens in our brains as we speak: that picture is for psycholinguists to develop. It is only an idealized conceptualization that captures the important fact that we mostly make use of linguistic units that we have already encountered in the past.

Note that we are only interested in modeling a set of alternative linguistic units, which thus form a category of sorts. Of course, in reality changes between such alternatives are related to the state of the language as a whole. But we model dependencies on factors external to our set of alternatives only through the evolutionary parameters of mutation rates, explained above, and “fitnesses”.

We allow different construction types to have different proneness to being copied into new utterances. This proneness is characterized by a single numerical value called fitness. The way the formal model is construed below, the absolute values of fitnesses do not matter; only their relative size does. Fitness differentials between linguistic types may stem from various sources. To name just a few: B could be cognitively easier to process than A; B could be associated with greater social prestige; B could work better given how the rest of the language is set up. As in biology, fitness is an aggregate measure of reproductive success which many heterogeneous factors contribute to. Abstracting away from the underlying causes of fitness differentials, we can understand the general workings of diverse factors that affect differential reproduction of linguistic constructions. More generally, it is worth stressing that the model is not intended as the true and only model of language change. Instead, it aims to capture in a simplified and thus tractable form some important abstract characteristics of the process of language perpetuation and change.

Putting this all together, we can view the process of language perpetuation and change as an evolutionary process where memory pools of different speakers are populations of tokens of linguistic units that (i) get reproduced and copied into future pools via memory retention and perception of new speech events, (ii) experience fitness effects, and (iii) are subject to mutation in the form of reanalysis and innovation. Crucially, the units subject to reproduction are tokens of linguistic constructions [19]. This differs from the majority of work on evolutionary modeling of language where the evolving units are speaker grammars [23, 24, 25, 26]. We will now define our model formally.

3.3. The base process: a single speaker or a small group. We first consider a single memory pool, or population, x̄ = ⟨xA, xB, ...⟩ associated with a restricted set of alternative constructions A, B, etc. Here xA is the number of As in the population, and so forth. The pool models the memory of one speaker or of a small tightly-knit linguistic community. The whole model is given as ⟨N, R, F, x̄(t0), repr⟩. Here, N is the number of tokens stored in the pool. R contains the mutation rates between variants, rA→B, etc. F determines the fitnesses fA, fB, etc. x̄(t0) is the initial state of the memory pool. repr is the reproductive rule. In principle, N, R and F may be allowed to vary according to the current time t and the current population state x̄, but we will mostly consider the cases where those are constant. Some further formal details are provided in SI 1.
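As a reading aid only, the model tuple ⟨N, R, F, x̄(t0), repr⟩ for the two-variant case can be written down as a small container. This is a sketch of my own; the names and default values are illustrative, not prescribed by the paper.

```python
from dataclasses import dataclass

@dataclass
class BaseProcess:
    """The tuple <N, R, F, x(t0), repr> for two variants, with the Wright-Fisher rule assumed."""
    N: int = 50          # number of tokens stored in the memory pool
    r_AB: float = 1e-4   # mutation (reanalysis/innovation) rate A -> B
    r_BA: float = 1e-5   # mutation rate B -> A
    f_A: float = 1.0     # fitness of A (only ratios matter)
    f_B: float = 1.0     # fitness of B
    x_B: int = 0         # initial number of B tokens; x_A = N - x_B
```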


Time flows in discrete steps, also called generations. Objects in our pools are not speakers, but tokens of linguistic units, so a generation here does not refer to the biological generations of speakers. Instead, a “generation” simply represents a time slice of the linguistic output and memory that influence the speaker. A new generation is formed on the basis of the preceding generation according to the reproductive rule. For simplicity, below we work with the classical Wright-Fisher rule: parents for the units in the new generation are drawn with replacement from the current pool, with probabilities of a unit being drawn linearly proportional to the relative fitness of its type. Once a parent linguistic unit is selected, its child can be of the same type, or it can experience a mutation and thus change into another type, with mutation rates rA→B, rB→A, etc. This is the familiar Wright-Fisher model of population genetics, with linguistic constructions playing the role of alleles, and linguistic processes of reanalysis and innovation viewed as mutation.2 Unlike in biology, we do not have a natural generation length in our model. Generations should be short enough for the linguistic output at the preceding time slice to directly affect a speaker’s behavior, so psycholinguistic research should eventually inform the choice. Here, we provisionally note that intervals like a day, a week or a month seem reasonable as the first approximation.

Importantly, generation length and the reproductive rule are interdependent in the model. For instance, it is more likely that a token heard yesterday will survive in today’s memory than a token heard a month ago, so with shorter generations, the probability of each token leaving at least one offspring should be greater. This in turn means that the reproductive rule needs to be adjusted if we rescale the time from day-long generations to month-long generations. Generally, differences in reproductive rules can lead to substantially different behavior [27]. But when we can rescale the generation length accordingly, this becomes of much less importance than in the biological case.

Henceforth we restrict ourselves to the Wright-Fisher rule in this paper.
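For concreteness, the following sketch (mine, not code from the paper) implements one Wright-Fisher generation for the two-variant case: a child’s parent is of type B with probability proportional to the fitness-weighted counts, and the copy then mutates at the appropriate rate. Iterating it produces trajectories of the kind shown in Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def wf_step(x_B, N, f_A, f_B, r_AB, r_BA):
    """One Wright-Fisher generation for a pool of N tokens containing x_B tokens of type B."""
    x_A = N - x_B
    # probability that a child's parent is of type B, weighted by fitness
    p_parent_B = (x_B * f_B) / (x_B * f_B + x_A * f_A)
    # probability that the child is B after possible mutation of the copied token
    p_child_B = p_parent_B * (1 - r_BA) + (1 - p_parent_B) * r_AB
    return rng.binomial(N, p_child_B)

# example trajectory from an all-A pool, with the parameters of Fig. 1(d)
x = 0
for t in range(30000):
    x = wf_step(x, N=100, f_A=1.0, f_B=1.0, r_AB=1e-4, r_BA=1e-5)
print(x)   # number of B tokens after 30000 generations of one run
```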

The base process just defined models one speaker or a small community of such. In the real world, speakers die, and new speakers get born and learn the language. We do not represent this explicitly for simplicity. A new speaker replacing the old speaker is modeled in the same way as today’s time slice replacing yesterday’s slice of the same speaker. Given that (i) speaker replacement is rare on our scale of linguistic-memory time slices, and (ii) children do copy their linguistic units from those of older speakers, this seems a reasonable simplification.

3.4. The behavior of the base process. The dynamics of Wright-Fisher models is well studied, and we can draw on its known properties to build an initial understanding of Sapir’s drift and unidirectionality with exceptions. Of course, we are interested in whether those phenomena arise in large communities of interacting speakers, and not in an isolated base process representing a single speaker or a small group thereof. But it will be easier to analyze the interesting case by referring back to the base process.

For convenience, we restrict our attention to the case with only two alternative linguistic unit types, A and B. In this setting, the current population state is fully characterized by the current share of Bs.

The crucial property of a finite-population base process is its inherent stochasticity. In an infinite population of linguistic units, the evolutionary trajectories of each type’s share are fully determined by the initial state, mutation rates and differences in fitness. While each unit leaves a random number of child units in the next generation, in an infinite population the randomness thus introduced is averaged out, and the actual trajectory is the same as the expected trajectory.

In contrast, in a finite population, the randomness of reproduction often moves the process away from the expected value. That random force is called genetic drift (NB: no relation to “drift” in Sapir’s drift).

2 This model is applicable when each variant A, B, etc. is connected to some other by just one elementary step of reanalysis or innovation. Among our motivating examples above, Indo-Iranian verbal endings appear to pass this test. The changing meanings of Germanic mōtan words and secundative case marking in Tibeto-Burman languages are more complex phenomena, and may require more than two variants to be explicitly modeled. When the interest is in constructions A and B that are several reanalysis or innovation steps away from each other, it would not in general be appropriate to abstract away from the actual chain of micro-changes and only include the endpoints A and B. Just as with chain reactions in chemical kinetics, the overall dynamics of the chain may be complex and not characterizable in terms of a single mutation rate between A and B, so the intermediate variants would need to be directly modeled.

Figure 1. Example trajectories for the share of Bs in A-and-B populations of size (a) infinite, (b) N = 10000 linguistic tokens, (c) N = 1000, (d) N = 100. In all cases, there is no fitness differential; mutation rates rA→B = 10^−4, rB→A = 10^−5. (Each panel plots the share of Bs, 0–100%, against generations 0–30000.)

Both mutation and fitness nudge the population in particular, determined directions. E.g., if rA→B ≫ rB→A, mutation favors Bs. In contrast, genetic drift is blind to the labels of linguistic constructions. Together, mutation, fitness and genetic drift determine both the distribution of the base process in the limit of infinite time and the likelihood of particular trajectories. When mutation or fitness are strong compared to genetic drift, the stochasticity introduced by the latter is relatively mild, Fig. 1(b). When genetic drift is strong, the stochastic component dominates on the short time scale, while the long-term properties of the process are still affected by mutation rates and fitness values. In particular, strong drift favors pure states, where the whole population is of the same type, Fig. 1(c-d).

Crucially, the strength of genetic drift is determined solely by population size. To get an intuitive grasp on this, consider that in a uniform-fitness population with N = 10000, the probability that at least 80% of current units won’t leave any offspring is 0.2^10000 ≈ 2×10^−6990, while for N = 100, it is 0.2^100 ≈ 1×10^−70. While fairly improbable in both cases, such an extreme event is less far-fetched for the smaller population.
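The quoted figures correspond to 0.2^N, the probability that all N children happen to be drawn from one fixed fifth of the pool; a short check in logarithms (mine) recovers both orders of magnitude.

```python
import math

for N in (10_000, 100):
    log10_p = N * math.log10(0.2)      # log10 of 0.2**N
    exponent = math.floor(log10_p)
    mantissa = 10 ** (log10_p - exponent)
    print(f"N={N}: 0.2^N ~ {mantissa:.1f}e{exponent}")
# N=10000: 0.2^N ~ 2.0e-6990
# N=100:   0.2^N ~ 1.3e-70
```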

For a small set of alternative linguistic constructions (e.g., for tokens of going to associated with directional or with futurate meanings), a population size of 50 or 100 tokens stored in the memory of one speaker appears reasonable. On the practical side, modeling populations of such size is reasonably tractable. The true token population size is crucial for predictions, as it determines the relative strength of genetic drift, Fig. 1. But that relative strength also depends on mutation rates and fitness differentials. We will further discuss estimation of the evolutionary parameters from data below in Section 6. Here, we study the general behavior of our process using some illustrative choices for the model parameters, including cases where genetic drift is relatively strong.

Genetic drift in the base process leads to an outcome analogous to the linguistic Sapir’s drift. Imagine several independent instances of the same base process starting from an all-A population, but with mutation rates or fitnesses or both favoring B. Suppose that those directional forces are relatively weak compared to genetic drift. In each generation, there will only be a small probability of some Bs emerging. Even when they do emerge, they will often be eliminated before achieving high frequencies, as the initial upticks in Fig. 1(c-d) demonstrate. Sooner or later a takeover by Bs will inevitably happen, but its timing will differ.

Recall that in Sapir’s drift, genealogically related languages show the same change, but at different times. It was a challenge for the classical conceptual frameworks for language change. They could not explain why a force already present in the common ancestor would not make the change happen already there. With our base process, we see that weak but persistent forces favoring change (such as small and crucially different rates of reanalysis or innovation) determine that the change will happen, but cannot force it right away. The time when the actual shift to the favored B happens can vary widely. If language change is anything like our base process, then we should not be surprised that, say, Avestan and Sanskrit implement parallel changes at different times even if those changes are determined by exactly the same factors already present in their common ancestor.

Generally, whenever the triggers for change are relatively weak, we expect exactly such behavior in the base process. Below, we show that it survives into the multi-speaker setting as well.

Let’s turn to unidirectionality with exceptions. As we noted in Section 3.1, all reanalysis mutations allow exact back-mutations, while innovation mutations may or may not allow such back-developments. The prediction of the model is straightforward: whenever back-mutation is present, our base process will in the limit of infinity oscillate between all possible states. A useful tool for studying such long-term behavior is the stationary distribution of the process. Formally, our process is a Markov chain characterized by a transition matrix T, where Tij is the probability of moving from i Bs to j Bs at the next step. When both forward and backward mutations are present, so that all states can turn into each other over time with positive probability, by the Perron-Frobenius theorem there exists a stationary distribution φ such that φT = φ. The entry φi of φ is the probability that the process will be in the state with i Bs at an arbitrary time in the limit of infinity. Consequently, φi is also the share of time that the process will spend in that state in the limit.
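A sketch (my own, not the paper’s code) of computing φ numerically for the two-variant base process: assemble the (N+1)×(N+1) transition matrix T from the binomial Wright-Fisher kernel with mutation and fitness, and take the left eigenvector of T for eigenvalue 1.

```python
import numpy as np
from scipy.stats import binom

def transition_matrix(N, f_A, f_B, r_AB, r_BA):
    """T[i, j] = probability of moving from i Bs to j Bs in one generation."""
    i = np.arange(N + 1)
    p_parent_B = (i * f_B) / (i * f_B + (N - i) * f_A)
    p_child_B = p_parent_B * (1 - r_BA) + (1 - p_parent_B) * r_AB
    return binom.pmf(np.arange(N + 1)[None, :], N, p_child_B[:, None])

def stationary_distribution(T):
    """Left eigenvector phi with phi @ T = phi, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(T.T)
    phi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return phi / phi.sum()

# strong drift, strong mutation asymmetry, as in Fig. 2(b)
T = transition_matrix(N=100, f_A=1.0, f_B=1.0, r_AB=1e-4, r_BA=1e-5)
phi = stationary_distribution(T)
print(phi[0], phi[-1])   # limiting time shares of the all-A and all-B states
```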

Examples of stationary distributions for different base processes are given in Fig. 2. We can see that even when B is favored over A, under some settings the process still spends considerable time in the all-A state in the limit, especially under strong drift. This means that there must be back-shifts from all-B to all-A. Such back-shifts are “exceptions” to the predominant unidirectionality of the process.

While the probability of a back-shift is always non-zero when back-mutation is present, it can vary dramatically depending on the properties of the process. Naturally, the magnitude of mutation rate or fitness asymmetry between A and B is a very important factor. The strength of genetic drift is important, too: with strong drift the process is mostly in the pure states, Fig. 2(a-b), while with relatively weak drift, it either hovers near one of the pure states, Fig. 2(c), or in the middle of the state space, Fig. 2(d), depending on the mutation and/or fitness differential. Practically observable exceptions to unidirectionality are thus expected to arise in the base process under a considerable range of conditions. This survives into the multi-speaker setting, just as Sapir’s drift does.

Figure 2. Stationary distributions of different base processes, on a log scale (time share in each state, 0.01%–100%, against the number of Bs, 0–100). (a) Strong drift, strong fitness asymmetry: fB/fA = 1.01, rA→B = rB→A = 10^−4. (b) Strong drift, strong mutation asymmetry: fB = fA, rA→B = 10^−4, rB→A = 10^−5. (c) Weak drift, strong mutation asymmetry: fB = fA, rA→B = 10^−2, rB→A = 10^−3. (d) Weak drift, weak mutation asymmetry: fB = fA, rA→B = 10^−2, rB→A = 0.8×10^−3.

On the other hand, if we can show that a particular innovation-based global change has no back-mutation on the atomic-change level, it follows that there should never be global back-shifts.

Sooner or later, the token population would reach the all-B state. Without back-mutations, As will not be able to arise again. In the limit of infinity, the process will always be in the all-B state.

Typical linguistic examples of the two types of changes are as follows. (i) Grammaticalization processes are often argued to arise due to reanalysis. In our model, it follows that we should expect to see some exceptions to predominant unidirectionality for such changes. (ii) Erosion processes involve the loss of previously present material, such as when heavily reduced endings get omitted altogether. Given the absence of back-mutation in such settings, we trivially predict there to be no back-shifts after the old forms have all been obliterated from the population. Thus the crucial difference between the two types of cases is that with back-mutation, a back-shift may arise even after all the old forms have disappeared, while without it, we can only go back to the initial language state if the old forms have still been preserved somewhere.

Figure 3. Example probabilities that a single node will have a particular number of Bs (0–50) at a given generation (0–30000). For the infinite-graph cases (a) and (b), the actual probability is shown; for the finite-graph case (c), it is the simulated probability, i.e. the share of occurrences in 1000 simulations. (a) Infinite speaker population, overall number of incoming external units per generation τ = 1. (b) Infinite speaker population, τ = 5. (c) Fully connected five-speaker population, τ = 4. Across all conditions, the number of linguistic units per node is N = 50, fitnesses fA = fB, mutation rates rA→B = 10^−4, rB→A = 10^−5. Consequently, the expected share of Bs has the same trajectory in (a)-(c), cf. Fig. S1.

3.5. The interesting case: multiple populations. We model a (large) community of speakers as a directed edge-weighted graph ⟨V, E⟩. Each vertex v ∈ V hosts an instance of our base process, modeling an individual speaker. An edge v → u represents that some tokens produced by v enter speaker u’s memory pool. The weight of that edge reflects the number of such “traveler” tokens. By convention, for small graphs we use natural numbers recording the constant number of travelers copied at one step from v’s process into u’s, denoting that number as ξvu. For large graphs, the weight is the probability that a token will be copied along the edge in one step. We write the overall number of units entering v in one step as τv = Σu ξvu. Some further details on our modeling can be found in SI 2.

At the first part of each time step, the base process at v generates N children as it would in isolation. Then τv linguistic tokens come from v’s neighbors into v according to the edge weights, displacing some of the locals. In parallel, v’s own output affects its neighbors in the same way. We call this second part of the step the travel process.3
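The following sketch (my simplification; the sequential processing of edges, in particular, is an approximation chosen for brevity) implements one full time step for a small graph: each node runs its base Wright-Fisher step, and then ξvu traveler tokens sampled from each neighbor’s fresh output displace randomly chosen local tokens.

```python
import numpy as np

rng = np.random.default_rng(1)

def wf_step(x_B, N, f_A, f_B, r_AB, r_BA):
    """One Wright-Fisher generation of a node's base process."""
    p_parent_B = (x_B * f_B) / (x_B * f_B + (N - x_B) * f_A)
    p_child_B = p_parent_B * (1 - r_BA) + (1 - p_parent_B) * r_AB
    return rng.binomial(N, p_child_B)

def community_step(x, xi, N, f_A, f_B, r_AB, r_BA):
    """x[v] = current number of Bs at node v; xi[v][u] = traveler tokens from v to u per step."""
    V = len(x)
    base = np.array([wf_step(x[v], N, f_A, f_B, r_AB, r_BA) for v in range(V)])
    new = base.copy()
    for v in range(V):              # edges are processed sequentially for simplicity
        for u in range(V):
            k = xi[v][u]
            if v == u or k == 0:
                continue
            # k traveler tokens sampled from v's fresh output ...
            travelers_B = rng.hypergeometric(base[v], N - base[v], k)
            # ... displace k randomly chosen tokens currently at u
            displaced_B = rng.hypergeometric(new[u], N - new[u], k)
            new[u] += travelers_B - displaced_B
    return new

# fully connected five-speaker community, one traveler token per directed edge
V, N = 5, 50
xi = [[0 if v == u else 1 for u in range(V)] for v in range(V)]
x = np.zeros(V, dtype=int)          # start from all-A everywhere
for _ in range(30000):
    x = community_step(x, xi, N, f_A=1.0, f_B=1.0, r_AB=1e-4, r_BA=1e-5)
print(x)                            # per-node counts of B after 30000 generations
```

Running several independent copies of such a community from the all-A state gives the kind of staggered takeovers discussed in Section 3.6: each run eventually reaches all-B, but at a different generation.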

3.6. The behavior of the multi-speaker process. For simplicity, let’s first assume that the parameters and the initial state of all base processes are identical. An interesting initial state for us is the all-A pure state, and we want to see how the model moves from there to the all-B state. The overall number of Bs at node v at time t+1, written vB(t+1), is the sum of the Bs generated by the base process and not displaced, and those brought in by the travel processes. The expectation for the base part is simply ((N − τv)/N) · Ebase(vB(t+1) | vB(t)). The expectation for the second, travel part is a function of Ebase(uB(t+1)) for v’s neighbor nodes u. If all vertices started from the same conditions, their expectations for the second generation are the same. So τv traveler units are copied from processes with the same expected share of Bs, which results in the same expected share as for the base process. By induction, the expectations of uniform nodes starting from the same conditions will always remain the same. Thus, remarkably, the aggregate share of Bs in a homogeneous multi-speaker community grows along the very same trajectory as in a single base process, regardless of the community’s size and structure, Fig. S1.

3 While there are obvious parallels to migration in biology and neighbor replacement in evolutionary graph theory, our setting differs from either. When speaker v produces some token perceived by u, that token is copied into u rather than moved for good, as in migration. Furthermore, current evolutionary graph theory deals with Moran processes where each vertex is occupied by a single unit. This differs from our vertices having sets of units in them. To my knowledge, no explicit studies of the travel process needed for our linguistic case yet exist.

Of course, unlike the aggregate B share itself, the internal distribution of Bs giving rise to that same aggregate share differs dramatically across speaker-community structures. The base component, under small linguistic-unit population sizes, features great stochasticity due to genetic drift. The travel component, though fed with populations that individually are equally stochastic, draws each node towards the states of its neighbors. Depending on the structure of the multi-speaker community, this results in a wide range of behaviors.

Consider first the extreme analytical case of an infinite well-mixed speaker community, where all vertices are neighbors and are equally likely to provide traveler tokens to each other. Fix an arbitrary vertex v. Its effect on the infinite community is infinitely small, so we can only consider the community’s effect on v. A traveler token can come into v from any other process, so the probability that it will be a B is equal to the aggregate share of Bs across all speakers. Thus neighbor influence in this case draws each individual vertex towards the mean. When τv, the overall number of incoming traveler tokens, is small, the base process still dominates, and ensures a relative preference for the pure states, Fig. 3(a). But with larger τv, the mass becomes concentrated around the expectation, Fig. 3(b).

In finite and especially in small graphs, each vertex has a non-negligible effect on the speaker community, so the whole community needs to be analyzed jointly. We consider the effects of changes in (i) the number of speakers in the graph, (ii) the amount of neighbor influence, and (iii) the graph topology. By necessity, we only consider a small number of cases, each with a small graph, but this will demonstrate the general principles that would hold for larger multi-speaker structures. Again, our interest is in how the community develops from the all-A state.

A larger number of speakers means a greater chance that one of them would depart from a pure state into a middle one: for each individual node, this is not very probable in high-drift settings, but the more speakers there are, the more regularly that small chance would be realized, Fig. S2. What happens after such a departure into the middle depends on the level of coherence of the graph. The stronger the ties between the nodes, the closer their shares of Bs will be, Fig. S3. For example, in a ring graph where each node has only two neighbors, it is relatively easy for a contiguous group of nodes to become all-B while the rest of the community remains close to all-A. In a fully-connected graph, this is much less probable, Fig. S4. Overall, as is to be expected, genetic drift pushing the population towards the pure states is drastically stronger in a small finite graph than in an infinite one, cf. Fig. 3(b) and 3(c).

But while the probability distribution over states differs depending on the graph, Sapir’s drift behavior remains very similar to the single-speaker setting. If anything, it gets stronger with larger speaker networks. Fig. 4 illustrates how fast different finite graphs may reach their first all-B state after starting from all-A. Fig. S5 provides the probabilities that an arbitrary node in an infinite well-mixed graph with the same base evolutionary parameters will be in the all-B state, which is strictly less than the expected share of nodes that have reached the all-B state at least once. Together, these illustrate that in quite different multi-speaker settings, Sapir’s drift still occurs, namely the behavior whereby separate evolutionary processes that start with the same, relatively small bias towards a new state all eventually reach it, but at different times. No further contact between such multi-speaker communities is necessary to ensure the parallel change.4

4 The speed at which a single node in a finite graph reaches all-B, as opposed to the whole graph reaching it, expectedly differs, Fig. S6. The more individual nodes there are, the faster the first of them gets to all-B. In the other direction, the more coherent the graph, the longer it takes for the first node to reach all-B, as the other nodes would be pulling such an advanced innovator node back.

Figure 4. The share of multi-speaker processes across 1000 simulations that have reached the all-B state at least once by a given generation (0–30000). Conditions shown: 3 speakers with ξ = 1; 3 speakers with ξ = 5; 5 speakers with ξ = 1, well-mixed and ring; 5 speakers with ξ = 5, well-mixed and ring; 10 speakers with ξ = 1, well-mixed and ring. Across all conditions, N = 50, fA = fB, rA→B = 10^−4, rB→A = 10^−5.

Unidirectionality with exceptions also survives into the multi-speaker setting if it occurs in the base process. This happens trivially: as long as we have exact back-mutations, the probability always exists for going back to the initial state even after all archaic constructions have been eliminated from the linguistic output of the community. The interesting question in our framework is thus not whether unidirectionality with exceptions exists, but how probable those exceptions are under different conditions, and especially why we do not seem to observe many such exceptional backshifts in empirical studies of language change.

First, in a high-drift setting, backshifts become less and less probable as the bias towards the innovative state increases, Fig. 2(a) vs. 2(c). Therefore one possibility is that we do not observe too many back-developments because we mostly look at highly asymmetrical empirical change processes. Note that high asymmetry does not imply high speed of change: very slow changes caused by very small biases can nevertheless be highly asymmetrical.

The second reason for not observing too many backshifts may be “speaker-grammar effects”. In our model so far, we have not incorporated the fact that in addition to being affected by the tokens they hear, speakers also have sophisticated grammatical representations of what is and is not possible in their language. In particular, if innovative B is disallowed by a given speaker’s internal grammar, then any heard instance of B will be misperceived by that speaker as another construction, say A. Only a critical mass of B-input that cannot be plausibly interpreted otherwise would force the speaker to change their grammar. One brute-force way of emulating this effect in our model is to simply forbid states of the memory pool where one of the variants has a very low frequency. For example, we may treat the states with one or two Bs as impossible, and similarly for one or two As. In such a setting, the bias brought in by an asymmetry of mutation or fitness will be amplified, with the disfavored, archaic state getting a smaller share of the process’s time in the limit of infinity, Fig. S7.
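One crude way to emulate this filter in code (my own reading of the brute-force suggestion above; the threshold and the choice to absorb stray tokens into the majority variant are both assumptions, and other implementations are possible) is a post-processing step applied after each generation.

```python
def grammar_filter(x_B, N, threshold=3):
    """Treat pools with fewer than `threshold` tokens of either variant as pure pools,
    emulating hearers who misparse a handful of deviant tokens as the familiar form."""
    if 0 < x_B < threshold:
        return 0
    if 0 < N - x_B < threshold:
        return N
    return x_B

# applied after each Wright-Fisher generation, e.g. (wf_step as sketched earlier):
# x = grammar_filter(wf_step(x, N, f_A, f_B, r_AB, r_BA), N)
```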

Third, backshifts could take an enormous amount of time to actually happen. E.g., if the forward shift towards the innovative state takes on average a thousand years for some A→B process, and the backward shift takes on average ten times longer, we just wouldn’t realistically observe such a backshift very often, as the recorded histories of human languages cover only a few millennia even in the best-documented cases.

Finally, in actual change processes, the innovative state B will rarely be the endpoint of an evolutionary chain. E.g., a once new future marker, such as modern English going to, can after some time disappear from the language and be replaced by a new construction. Indeed, the English going to construction itself is currently encroaching onto the territory of the older will and shall future markers. Then the full process is not just A ↔ B, but rather A ↔ B ↔ C, and it is to be expected that many times, the second forward shift B → C would occur on average earlier than the backshift B → A. In such a case, we would not see backshifts B → A not because they are particularly improbable as such, but rather because the conditions for them are frequently destroyed by further language change.

There are thus plenty of reasons why backshifts would not occur too frequently in particular language change processes. Interestingly, our model also raises yet another analytical possibility: backshifts so frequent that they have not been noticed by researchers. In conventional language change models, change is often implicitly assumed to be largely monotonic and slow: the innovative state gradually displaces the archaic one, and if reversals occur, they are assumed to be monotonic, too, as a norm. But in our model, this need not be the case. Under some sets of evolutionary parameters, the system would oscillate very frequently between the pure states, with many back and forth movements even on a short time scale. Moreover, such back and forth movements would not need to be uniform across the large overall community of a language’s speakers: different sections of the large community may be in different phases of the frequently oscillating process. In such a case, there would be evidence for frequent backshifts in the empirical data, but given that it would not be easily interpretable in the classical conceptual paradigms on language change, such evidence could have been misinterpreted, for example as evidence of dialectal or register differences. Reanalyzing the empirical data in view of the possibility of rapid transitions may produce evidence for such highly volatile processes of change. One example of such non-monotonic and rapid developments could be the evolution of irregular verb forms in English dialects. For example, modern old and young York vernacular speakers show greater use of non-standard past come (instead of came) than those in the middle age cohorts [28]. But despite such irregularity on the short time scale and in a particular community, across all English speakers, irregular verbs have been on the decline over the last millennium [29]. This suggests that their evolution is an asymmetrical, biased process, yet one which allows for large stochastic fluctuations.

4. Flexibility of the model: how to add more realism

The variant of the multi-speaker setting that we worked with is a relatively simple model compared to the reality of language maintenance and change. An obvious worry is whether our behaviors of interest in this paper, Sapir’s drift and unidirectionality with exceptions, would survive in a more realistic model. Also more generally, is our framework flexible enough to provide us with insight into more complex conditions? Here, I discuss two specific ways to improve the adequacy of the model as an illustration of how we can model more sophisticated conditions within the proposed framework: (i) allowing different speakers to have different evolutionary parameters; and (ii) relaxing the assumption that fitness forces are constant across time and the speech community.

Above, we worked with a multi-speaker network that consisted of identical base processes. But all human speakers of a language would hardly be exactly identical. In particular, they may differ in their evolutionary parameters that affect the dynamics of their base process — e.g., memory sizes, rates of mutation (that is, reanalysis and innovation), etc.

The trajectory of the expected aggregate share of the innovation for a speaker network is a weighted sum of the expected trajectories for the different individual speakers, and will be bounded by the most extreme of those trajectories. Consider the simple case where there are only two speaker types α and β with different parameters (for example, different rates for A → B mutations). α speakers pull the overall population share of Bs towards the pure-α trajectory. That pull is proportional to the overall influence of the α type, which in turn is determined by the number of α speakers (the base-process part) and their effect on the other speakers in the community (the travel part). β speakers similarly pull the overall share towards the β trajectory. The result will be in between the α and β trajectories.

The second modification we consider is fluctuating fitness. It is unrealistic to assume that linguistic types have the same fitness over time modulo the linguistic state. Linguistic constructions receive explicit and implicit social evaluation, which would affect how much they are copied. So while e.g. the cognitive ease of processing would depend only on the current linguistic state, the social component of fitness is likely to experience fluctuations. It is straightforward to model such fluctuations in general. Consider the base process. Let fB be the baseline fitness of B, and f′B = fB + ε be the actual fitness that differs from fB by a random variable ε. If ε > 0 at a given generation, then B-type potential parents receive a corresponding reproductive boost. If ε < 0, they receive a relative punishment.

Precise analysis of fluctuating fitnesses is challenging, as is known in the biological literature [30]. In particular, analyzing the effect that the probability distribution of ε has on the overall process is non-trivial. But some general observations are easy to make, and for specific simple distributions of ε, we can straightforwardly compute how they would affect the base process. First, note that changes in fitness can only affect mixed states, where several allele types are present. In such mixed states, the extra randomness added by ε results in random boosts and punishments. These add to the randomness due to sampling, that is, genetic drift. So the behavior of the process in the mixed states becomes more erratic, with wider jumps.

We can observe how such wider jumps actually affect the behavior of the base process in a simple but illustrative case. We compare a baseline two-allele process dominated by asymmetric mutation rates and drift to two fluctuating-selection modifications of it. The process with small fluctuations receives symmetric 1% boosts and penalties 25% and 25% of the time, respectively. The process with large fluctuations receives 25% boosts and penalties, with the same frequency. The expected trajectory of the small-fluctuations process almost coincides with that of the baseline, while the trajectory of the large-fluctuations process shows a steeper rise, Fig. S8(a). The added stochasticity thus pushes the process faster towards the preferred innovative state. At the same time, in the limit of infinite time, the fluctuating-selection processes slightly increase the share of time spent in the archaic, dispreferred state, Fig. S8(b-c). Importantly, while fluctuating selection changes the behavior of the process, the changes are not particularly large, and do not change the general character of the evolutionary behavior.
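A sketch (mine) of the fluctuating-fitness step: each generation, B’s fitness is perturbed by a random ε before the Wright-Fisher draw. Here a “1% boost or penalty” is interpreted, as an assumption, as ε = ±0.01·fB, each occurring 25% of the time.

```python
import numpy as np

rng = np.random.default_rng(2)

def wf_step_fluctuating(x_B, N, f_A, f_B, r_AB, r_BA, delta=0.01):
    """Wright-Fisher step in which B's fitness is f_B + eps, with
    eps = +delta*f_B (prob 0.25), -delta*f_B (prob 0.25), or 0 (prob 0.5)."""
    eps = rng.choice([delta * f_B, -delta * f_B, 0.0], p=[0.25, 0.25, 0.5])
    f_B_now = f_B + eps
    x_A = N - x_B
    p_parent_B = (x_B * f_B_now) / (x_B * f_B_now + x_A * f_A)
    p_child_B = p_parent_B * (1 - r_BA) + (1 - p_parent_B) * r_AB
    return rng.binomial(N, p_child_B)
```

Setting delta=0.25 gives the large-fluctuations variant; note that the perturbation only changes anything in mixed states, since in pure states the parent pool contains a single type regardless of its fitness.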

5. Empirical benefits

An important practical consequence of our analysis is that the aggregate share of the innovation in a linguistic community is a meaningful value even when individual speakers show huge deviations from the mean. This is an unintuitive finding for a linguist: if we see speakers with vastly different behavior, doesn’t it mean that there is important linguistic stratification in the society, and therefore the aggregate mean averaged across all groups has no direct value? But we have shown above that in a multi-speaker network consisting of uniform individuals starting from the same state, the aggregate share develops, on the one hand, along the same expected trajectory as a single base process, while on the other hand, the individual speakers in the network can behave very differently from each other.

Figure 5. Frequency of speaker types by the innovation’s share in their speech (both axes 0–100%) in a sample from an empirical population. The shown distributions may be compared to single-generation cross-sections of the surfaces in Fig. 3. By-period data on the two changes shown are provided in Fig. S9. (a) The change process: replacement of the archaic -t endings of the past tense with innovative -d endings in Scots English, years 1600–1619. Data from [33]. (b) The change process: replacement of the archaic subject pronoun ye with the innovative form you (which was previously used as the object form of the pronoun) in English, years 1540–1560. Data from [34].

Of course, sometimes linguistic stratification does obtain [31, 32]. But our model shows that it cannot be assumed to hold simply because individual observed speakers behave differently. This point is of practical relevance because for some changes in progress, historical sociolinguists have found widely dispersed speaker distributions, as in the two examples in Fig. 5, from [33, 34]. The usual explanatory move in such cases is to assume that even when we cannot pin down exact differences in social and linguistic background based on available historical data, there must have been some such differences. But comparing Fig. 5 and Fig. 3, we can see that this need not be the case. Under some evolutionary conditions, we expect the linguistic output of an underlyingly uniform speaker community to nevertheless look like Fig. 5.

6. The road to inference from empirical data

We have demonstrated that our model can give rise to particular qualitative behaviors that correspond to the linguistic Sapir’s drift and unidirectionality with exceptions. But the behavior of the model depends on the evolutionary parameters. The real interest would be in inferring those parameters from the empirical data, and testing the model’s fit to the actual change processes. Unfortunately, we will see shortly that an unconstrained version of our model is very poorly identifiable. Fortunately, constraints on the model can come from historical-linguistic investigations of particular phenomena. The promising path is therefore to couple classical historical-linguistic analysis and evolutionary inference.

A common problem for inference in biological evolution is distinguishing neutral vs. non-neutral mutations, where neutral means not affecting fitness and thus irrelevant for natural selection. In the linguistic case, this problem is not significant. First, in biology, many single-nucleotide mutations are synonymous, preserving the coded amino acid, and thus are by definition neutral. In linguistics, there are no changes which by definition are unable to affect fitness. So truly neutral linguistic mutations would be a much rarer phenomenon from the start. Second, directional forces in the linguistic case are not limited to fitness. In genetics, most types of mutation have back-mutations so improbable that they are ignored in the widely used models such as infinite sites and infinite alleles. But, for instance, linguistic mutations due to reanalysis do have exact back-mutations. Asymmetric evolution can stem from asymmetric forward and backward mutation rates, not from fitness differentials. Such evolution can reasonably be called “non-neutral” as well. Consequently, there will be very few truly neutral evolutionary processes in language change, and so detecting neutrality is not as interesting as in the genetic case.

Figure 6. Expected trajectories of the aggregate share of the innovative form B (0–100%, over generations 0–30000). (a) Different mutation rates, equal fitnesses, N = 50: rA→B = 5×10^−4 with rB→A = 5×10^−6 or 5×10^−5, and rA→B = 10^−4 with rB→A = 10^−6 or 10^−5. (b) Non-equal fitnesses, symmetric mutation rate r = 10^−4, N = 50: fA/fB = 1.05, 1.01, 1.005, 1.001. (c) Different memory sizes (N = 100, 75, 50) with non-equal fitnesses, fA/fB = 1.01. (d) Poor identifiability of mutation- vs. fitness-driven processes: the plot repeats several trajectories from (a) and (b), in the same colors (e.g., rA→B = 5×10^−4, rB→A = 5×10^−6 alongside fA/fB = 1.05, and rA→B = 10^−4, rB→A = 10^−6 alongside fA/fB = 1.005).

To infer evolutionary parameters, we need to overcome two sources of uncertainty. First, we will always only have a small sample of data based on which we guess what the whole linguistic distribution in the speaker population is like. Second, even if we had, hypothetically, the complete knowledge of the observable data, they could still be compatible with multiple sets of parameters of the model (for analogous examples from chemical kinetics, see [35]).

We consider inference from the observed trajectory of the aggregate share of the innovation vs. the old form. For simplicity, suppose that all speakers in the community have identical evolutionary parameters and starting conditions. Then they all have the same expected trajectories. Sampling speakers widely across the community, we get relatively independent estimates of the true aggregate innovation share.

Schematically, a change trajectory is characterized by the shape of its slope and the height of the eventual plateau. In practice, the height of the plateau will often be hard to measure: as we discussed in Section 3.6, the innovative variant of one process will often feed into further processes of change, in which case we won’t see a plateau as such in the data. We are thus left with the shape of the steep part of the curve as our cue. What can and cannot we tell from it?

If we only vary one parameter while keeping the rest fixed, inference is possible. Fig. 6(a) shows that the slopes of the aggregate innovation share are quite different for different leading mutation rates, other things being equal. The leading rates should therefore be identifiable from data. The smaller mutation rate, in contrast, does not affect the slope’s curve as much and would be much harder to identify, though its magnitude is crucial for the probability of back-shifts and therefore of interest. Fig. 6(b) demonstrates that the magnitude of the fitness differential should also be recoverable from data, as it affects the shape of the slope significantly. For fitness-asymmetry processes, effective memory size N visibly affects the trajectory, Fig. 6(c), but in mutation-asymmetry processes it does not (not shown). Thus differences along just one dimension can often be reasonably reconstructed from observations as long as we fix the values of the rest of the parameters.
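As an illustration of such one-dimensional inference, here is a sketch (mine, under strong assumptions: equal fitnesses, a known back rate, and noise-free observations of the aggregate share) that recovers the leading mutation rate by matching the expected trajectory, which for equal fitnesses follows the exact recursion E[p(t+1)] = E[p(t)](1 − rB→A) + (1 − E[p(t)])rA→B and does not depend on N.

```python
import numpy as np

def expected_trajectory(r_AB, r_BA, generations, p0=0.0):
    """Exact expected share of B over time for equal fitnesses."""
    p = np.empty(generations)
    p[0] = p0
    for t in range(1, generations):
        p[t] = p[t - 1] * (1 - r_BA) + (1 - p[t - 1]) * r_AB
    return p

# pretend these are empirical aggregate shares, observed every 1000 generations
observed = expected_trajectory(1e-4, 1e-5, 30000)[::1000]

# grid search over the leading rate, with the back rate held fixed
candidates = 10 ** np.linspace(-5, -3, 81)
errors = [np.sum((expected_trajectory(r, 1e-5, 30000)[::1000] - observed) ** 2)
          for r in candidates]
print(candidates[int(np.argmin(errors))])   # recovers ~1e-4
```

With real data, observation noise and the identifiability problems discussed next would of course make this far less clean.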

But if more than one parameter is unknown, we immediately get a virtually unsolvable puzzle. Fig. 6(d) illustrates. It shows two pairs of processes whose trajectories of change are very similar either all the time or for a long initial segment. But in each pair, one process is driven by a mutation asymmetry, while the other, by a fitness asymmetry. Thus if we do not know in advance which asymmetry drives the process and observe such a trajectory, we would not be able to tell whether
