4.1.3 An Interactive Visualization for Semantic Change

In order to investigate semantic change, we designed a processing pipeline consisting of automated and visual methods. The automated data processing involved context extraction, vector space creation, and sense modeling. As Schütze [155] showed, a context window of 25 words before and after a target word provides enough information to disambiguate word senses. Each extracted context is complemented with the time stamp from the corpus. To reduce the dimensionality, all context words were lemmatized and stop words were filtered out.
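The context extraction step can be illustrated with a minimal Python sketch. It is not the implementation used for the experiments: the function name, the dictionary layout, and the tiny stop-word list are assumptions made for illustration, and the input is assumed to be already tokenized and lemmatized.

```python
# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "through"}


def extract_contexts(tokens, target_forms, year, window=25, stop_words=STOP_WORDS):
    """Return one context per occurrence of the target word.

    tokens:       lemmatized, lower-cased tokens of one document (assumed input)
    target_forms: set of lemmas that count as the target word
    year:         time stamp of the document, copied onto every context
    window:       number of tokens taken before and after the target word
    """
    contexts = []
    for i, token in enumerate(tokens):
        if token not in target_forms:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Drop stop words to reduce the dimensionality of the later vector space.
        words = [w for w in left + right if w not in stop_words]
        contexts.append({"year": year, "words": words})
    return contexts
```

Called with `window=25`, this yields up to 50 context words per occurrence before stop-word filtering; occurrences near the start or end of a document simply receive shorter contexts.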

For the set of all contexts of a target word, a bag-of-words vector space was created and a global lda model was trained. In the course of the experiments it became clear that the lda topics/senses were much easier to interpret than the dimensions of the vector space after applying lsa (or mds); see Section 4.1.4 for an example. For this reason, the temporal analysis was limited to the lda topics, where each context is assigned to its most probable topic/sense. Contexts for which the highest probability was less than 40% were omitted because they could not be assigned to a sense unambiguously. The distribution of contexts across different senses and the distribution of senses over time were then visualized.
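The assignment rule described above can be sketched as follows. The sketch assumes that the per-context topic distributions are already available as lists of probabilities (for instance exported from the topic model); the function name and the data layout are hypothetical.

```python
def assign_senses(topic_probs, threshold=0.40):
    """Map context index -> index of its most probable topic/sense.

    topic_probs: one probability list per context (assumed input format).
    Contexts whose highest probability is below `threshold` are omitted,
    since they cannot be attributed to a single sense unambiguously.
    """
    assignments = {}
    for idx, probs in enumerate(topic_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            assignments[idx] = best
    return assignments
```

For example, a context with the distribution `[0.35, 0.35, 0.30]` is dropped, whereas `[0.7, 0.2, 0.1]` is assigned to topic 0.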

Visualization

In order to visually analyze the development of word contexts over time, an interactive visualization tool was created, which displays the results of projecting word contexts in 2D using mds, lsa or lda. The tool contains a component with a separate visual representation for each target word occurrence (see Figure 4.3) and a component which provides aggregated views on the data (see Figure 4.4).

⁴ http://math.nist.gov/javanumerics/jama/, last revised on March 11th, 2013

⁵ http://www.inf.uni-konstanz.de/algo/software/mdsj/, last revised on March 11th, 2013

⁶ http://mallet.cs.umass.edu/, last revised on March 11th, 2013

Plotting individual contexts

Figure 4.2 shows the initial view of our tool and gives an overview of the possibilities for exploring individual contexts, using lda. The word under investigation is the verb to browse.

In the scatterplot, each context is represented by one dot. The axes correspond to lda, lsa or mds dimensions. In this case, 7 senses (dimensions) have been automatically learned and two at a time can be selected to be mapped to the axes for visual inspection. Here, the x-axis sense is characterized by the terms on top (shop, street, book, store, art, etc.). The further to the right a dot is situated, the more the corresponding word occurrence relates to the described x-axis sense. Accordingly, the y-axis is characterized by the terms on the left (software, microsoft, internet, netscape, window, etc.). The further to the bottom a dot is situated, the more the corresponding word occurrence relates to the y-axis sense. Contexts that belong to both senses are displayed along the diagonal of both axes. Yet, in this example screenshot there are no such cases.

As a further visual variable, the color of a dot indicates the time when the context appeared. The bipolar color map ranges from light green (year 1987) to dark purple (year 2007), optimized to contain a large number of distinguishable color tones. The color map on the right allows for arbitrary time intervals to be chosen with the sliders situated on the left (start-slider) and right (end-slider) of the color map, labeled with (a). Figure 4.2 shows contexts from 1987 to 1994. As can be seen, many word occurrences in this time relate to the x-axis sense, but only two strongly relate to the y-axis sense, labeled with (b). This can also be seen in an additional view where for each of the two selected sense axes the word occurrences (red dots) are plotted against time.
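The mapping from time stamp to dot color, together with the slider-based selection of a time interval, can be approximated by the following Python sketch. It uses plain linear interpolation between two RGB endpoints, which is a simplification: the color map of the tool is described as optimized for distinguishable tones, and the RGB values and function names below are assumptions.

```python
LIGHT_GREEN = (144, 238, 144)  # assumed RGB value for the year 1987
DARK_PURPLE = (75, 0, 130)     # assumed RGB value for the year 2007


def year_to_color(year, start=1987, end=2007):
    """Linearly interpolate an RGB color for a year within [start, end]."""
    t = (year - start) / (end - start)
    t = min(1.0, max(0.0, t))  # clamp years outside the covered range
    return tuple(round(a + t * (b - a)) for a, b in zip(LIGHT_GREEN, DARK_PURPLE))


def filter_by_time(contexts, first_year, last_year):
    """Keep contexts inside the interval chosen with the start and end sliders."""
    return [c for c in contexts if first_year <= c["year"] <= last_year]
```

Selecting the interval 1987 to 1994, as in Figure 4.2, corresponds to `filter_by_time(contexts, 1987, 1994)`.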

The context of a word occurrence can be displayed by mouse-over-interaction.

Apart from these main features, the tool offers more options to optimize the visual display, including zooming, changing dot size and dot opacity, as well as strategies to reduce clutter.

Figure 4.3 shows further examples, where in each subfigure different combinations of senses of to browse are plotted. In this case the y-axis goes from bottom to top. A random jitter has been introduced to avoid overlaps. Contexts in the middle (not the lower left corner, but the middle of the graph, e.g., see e vs. f) belong to both senses with at least 40% probability. In cases where the middle of the plot is populated with many data points, the axis senses share many ambiguous contexts and can usually be considered to be similar. By mousing over a colored dot, its context is shown, allowing for an in-depth analysis.

Figure 4.2: The word context visualization tool. Dimension 1 has been selected to be mapped to the x-axis and dimension 4 has been selected to be mapped to the y-axis. Annotations: (a) sliders to limit the time range; (b) outliers in different views. The mouse-over in the screenshot shows a context dated Sun Dec 18, 1988: "A computer enthusiast who browsed for weeks through the unclassified computer system of a top Federal nuclear weapon laboratory has been identified and has apologized, officials said today."

Figure 4.3: Pairwise comparisons of different senses for the verb to browse. In each subfigure different combinations of lda dimensions are mapped on the axes. Reprinted from [145], © 2011 Association for Computational Linguistics.

With the help of the time sliders, analysts can filter the data for arbitrary time ranges to gain a better feeling for the diachronic development of the different senses.

Figure 4.4: Temporal development of different senses concerning the verbs to browse (left) and to surf (right). Reprinted from [145], © 2011 Association for Computational Linguistics.

Aggregated data plotting

While plotting every word occurrence individually offers the opportunity to detect and inspect outliers and investigate the relatedness of different senses, aggregated views on the data are able to provide further knowledge on overall temporal developments.

Figure 4.4 provides another view of the tool, where the percentage of word occurrences belonging to the different senses is plotted over time. For the verbs to browse and to surf seven senses have been learned with lda. Each sense corresponds to one line and is described by the top five terms identified by lda. The higher the grey area at a certain x-axis point, the more of the contexts of the corresponding year belong to the specific sense. Each shade of grey represents 10% of the overall data, i.e., three shades of grey mean that between 20% and 30% of the contexts can be attributed to that sense.
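The aggregation behind this view can be sketched in a few lines of Python. The sketch assumes that every context has already been reduced to a (year, sense) pair; the function names are hypothetical, and the treatment of values lying exactly on a 10% boundary is an assumption, since the text does not specify it.

```python
import math
from collections import Counter, defaultdict


def sense_share_by_year(assigned):
    """assigned: iterable of (year, sense) pairs -> {year: {sense: fraction}}.

    Fractions are relative to all assigned contexts of the same year.
    """
    counts = defaultdict(Counter)
    for year, sense in assigned:
        counts[year][sense] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {sense: n / total for sense, n in counter.items()}
    return shares


def shade_count(fraction, step=0.10):
    """Number of stacked grey shades: a share between 20% and 30% gives three."""
    return math.ceil(fraction / step)
```

A sense that accounts for a quarter of a year's contexts is therefore drawn with `shade_count(0.25)`, i.e. three shades.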

This method of presenting the data focuses less on the detection of outliers and more on general trends. It can easily be seen that certain senses appear at particular points in time, e.g., the senses belonging to the lines labeled e, f, j, and k in Figure 4.4. This provides a strong indication that the outlined senses might correspond to new ways of word usage.

4.1.4 Case Studies

In order to be able to judge the effectiveness of our new approach, we chose key words that are likely candidates for a change in use in the time from 1987 to 2007. That is, we explored the contexts of target words relating to the relatively recent introduction of the internet. The advantage of these terms is that the cause of change can typically be located precisely in time.

Browsing and Surfing. Figure 4.4 shows the temporal sense development of the verbs to browse and to surf, together with the topmost descriptive terms for each sense. Sense e for to browse and sense k for to surf pattern quite similarly. Inspecting their contexts reveals that both senses appear with the invention of web browsers, peaking shortly after the introduction of the groundbreaking Netscape Navigator (1994). For to browse, another broader sense (sense f) concerning browsing in both the internet and digital media collections shows a continuous increase over time, dominating in 2007.

The first occurrences assigned to sense f in 1987 are "browse data bases", "word-by-word browsing in databases" and "browsing files in the center's library", the latter referring to physical files, namely photographs. We speculate that the sense of browsing physical media might have given rise to the sense which refers to browsing electronic media, which in turn becomes the dominating sense with the advent of the web.

Figure 4.3 shows pairwise comparisons of word senses with respect to the contexts they share, i.e., contexts that cannot unambiguously be assigned to one or the other. Each context is represented by one dot colored according to its time stamp. It can be seen that senses d (animals that browse) and e (browsing the web) share no contexts at all. Senses d (animals that browse) and f (browsing files) share only a few outlier contexts. In turn, senses e (browsing the web) and f (browsing files) share a fair number of contexts, which is to be expected, as they are closely related. Single contexts, each represented by a colored dot, can be inspected via mouse-over. This allows for an in-depth look at specific data points and a better understanding of how the data points relate to a sense.

Figure 4.5 shows the first and second mds dimension of the browse-contexts. While Subfigure 4.5(a) shows the contexts of the whole time range, Subfigures 4.5(b), (c), and (d) show smaller selected time spans. One bias in (a) is that the contexts are plotted in temporal order and newer contexts potentially cover older ones. For this reason the interactive selection of time spans is important.

In (b) it becomes obvious that older contexts are all located on the same spot, whereas in (c) they are scattered across the whole plane and in (d) they are more limited to certain parts of the plane again.

This implies that the two main sense dimensions uncovered with mds are not present in the beginning (1987-1993), then increase between 1993 and 2003 and again lose some of their importance between 2003 and 2007. This observation is backed up by the plot of Figure 4.6, where the same phenomenon becomes visible when each of the two sense dimensions is plotted against the time dimension.

Further experiments revealed that there is no noteworthy difference between the top dimensions when performing lsa and mds. Additionally, the difference between mds applied on a distance matrix of pairwise Euclidean distances and applied on pairwise cosine distances was only marginal.

Figure 4.5: First and second mds dimension of the to browse-contexts; different time ranges have been selected. Panels: (a) all (1987-2007), (b) 1987-1993, (c) 1993-2003, (d) 2003-2007.

Figure 4.6: The coordinates of the first and second mds dimension of the browse-contexts plotted against time.
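The two kinds of distance matrix compared above can be computed as in the following Python sketch. It only covers the construction of the matrices that are handed to mds, not the mds projection itself; the function names are hypothetical, and the convention of returning a cosine distance of 1.0 for all-zero vectors is an assumption.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two bag-of-words vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for empty contexts
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors, dist):
    """Symmetric matrix of pairwise distances, usable as input to mds."""
    return [[dist(u, v) for v in vectors] for u in vectors]
```

Either matrix, `distance_matrix(vectors, euclidean_distance)` or `distance_matrix(vectors, cosine_distance)`, can then be passed to an mds implementation of choice.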

While the first mds dimensions are generally able to show developments over time, it is necessary for the understanding of the developments to read a number of contexts and compare contexts at different locations in the plane. The axes as such are not interpretable. This leads to the main advantage when applying lda: The sense labels give specific hints as to the kind of development observed in the data.

Figure 4.7: Development of different word senses for messenger. Descriptive terms per sense:

a: play, bag, god, make, story, music, world, man, year, death
b: work, year, service, bike, day, company, street, job, time, york
c: case, charge, man, year, police, tree, office, court, kill, yesterday
d: street, theater, bus, york, bicycle, avenue, john, opera, review, man
e: rna, cell, protein, gene, chemical, dr, make, dna, brain, call
f: message, people, shoot, god, campaign, blame, american, state, kill
g: instant, aol, message, program, service, user, online, msn, microsoft

Messenger. Figure 4.7 shows seven senses induced with lda from the contexts of the target word messenger in the NYT corpus as well as the respective sense developments over time. Not all of the senses can be interpreted easily, yet the analysis points to some interesting findings. Most of the less clear senses refer to human messengers in general, and bike messengers in particular. This includes senses referring to messenger bags or shoot-the-messenger quotes. One clear-cut sense is that of messengers in biochemistry (sense e). Another clear-cut sense came up only in 1997 and became much stronger in 2001 (sense g): It is about online instant messaging. The appearance of this new sense coincides with the first release of the very popular AOL Instant Messenger in May 1997.⁷

Figure 4.8: Development of different word senses for bug. Descriptive terms per sense:

a: bite, year, catch, day, time, people, school, work, play, york
b: plant, office, official, line, car, agent, police, find, federal, device
c: bug, spray, long, water, head, night, camp, insect, make, leg
d: year, computer, millennium, company, president, problem, bug
e: computer, software, system, program, web, bug, fix, windows
f: insect, bug, find, plant, year, beetle, tree, garden, mosquito, kill
g: bug, eye, show, make, film, children, book, insect, play, man

Bug. Figure 4.8 shows seven senses induced with lda from the contexts of the target word bug in the NYT corpus as well as the respective sense developments over time. Several senses can be discerned. Apart from the insect (senses c and f), a bug can be a wiretap (sense b), and refer to errors in computer software (senses d and e). While sense e is more general, sense d refers mainly to the famous millennium bug, known as Y2K, and consequently peaks shortly before the year 2000.

⁷ http://en.wikipedia.org/wiki/AOL_Instant_Messenger, last revised on February 5th, 2013

LSA dimension   Top 5 positively associated terms (value)
1               web 0.40, internet 0.38, software 0.36, microsoft 0.28, windows 0.18
2               microsoft 0.24, software 0.23, windows 0.13, internet 0.13, netscape 0.12
3               microsoft 0.27, store 0.22, shop 0.20, windows 0.19, software 0.16
4               shop 0.32, netscape 0.23, web 0.23, store 0.19, software 0.19
5               book 0.48, netscape 0.26, software 0.17, world 0.13, communication 0.12
6               internet 0.58, shop 0.25, service 0.16, computer 0.13, people 0.11
7               make 0.39, shop 0.34, site 0.16, windows 0.13, art 0.08
...             ...
15              find 0.30, people 0.22, year 0.19, deer 0.16, day 0.15

Table 4.1: Descriptive terms for the top lsa dimensions for the contexts of to browse. For each dimension the top 5 positively associated terms were extracted, together with their value in the corresponding dimension.