
Visual Analytics in Linguistic Research

2.1.1 State of the Art
2.1.2 Open Issues
2.1.3 Goals of this Thesis
2.2 Visual Analytics in Time-Oriented Text Mining
2.2.1 State of the Art
2.2.2 Open Issues
2.2.3 Goals of this Thesis

This chapter describes the current state of research with respect to the use of visual analytics in linguistic research (Section 2.1) and visual analytics in time-oriented text mining (Section 2.2). For both fields I will briefly describe the state of the art, identify open issues for research, highlight how the work presented in this thesis is embedded in these research areas, and point out which previously existing research gaps it fills.

2.1 Visual Analytics in Linguistic Research

The analysis of large text collections has emerged as a subfield of visual analytics only within the last few years. The goal of almost all approaches in this subfield is to enable analysts to gain insight into the topics and content contained in large document or text collections. Over the years more and more sophisticated computational linguistic methods have been integrated into such visual analytics approaches. However, visual analytics approaches that have the aim of directly supporting theory-driven linguistic research are very rare.

At the same time, until recently linguistic research has only marginally incorporated visualizations into its investigations. The work presented in this thesis is thus cutting-edge in terms of integrating visual analytics and linguistic research.

2.1.1 State of the Art

Two central works have so far set the research agenda with respect to the integration of computational linguistics and visual analytics. One is the Ph.D. dissertation by Christopher Collins, "Interactive Visualizations of Natural Language", at the University of Toronto [30]; the other is the Ph.D. dissertation by Daniela Oelke, "Visual Document Analysis: Towards a Semantic Analysis of Large Document Collections", at the University of Konstanz [134].

Collins' Thesis: Collins coins the term linguistic visualization divide in his thesis, which refers to the gulf separating sophisticated natural language processing algorithms and data structures from state-of-the-art interactive visualization design [30]. Through five design studies he gives examples of how this linguistic visualization divide can be overcome by combining sophisticated natural language processing algorithms with information visualization techniques grounded in evidence of human visuospatial capabilities [30]. Two of the design studies are meant to support computational linguistic research in the area of natural language processing, suggesting an innovative use of visualization methods. The remaining design studies deal with content analysis and with augmenting real-time computer-mediated communication.

Oelke's Thesis: Oelke describes a concept related to Collins' linguistic visualization divide. She states that for text analysis "Some semantic aspects are too complex to find good computational approximations. And even if we do have a good approximation, often, there still exists a gap between the computational efforts and the analysis goals" and concludes that the analysis process therefore has to be designed in a way that the user can be incorporated to bridge the semantic gap [134]. To enable this, Oelke suggests a framework for analyzing document collections. The idea is to identify one or more semantic aspects of the text, called quasi-semantic properties, that are relevant for solving an analysis task. This permits to targetly search for combinations of (measurable) text features that are able to approximate the specific semantic aspect [134]; these combinations are named quasi-semantic measures. Concrete implementations of the abstract framework and quasi-semantic measures are discussed and evaluated for the application areas of literature analysis, readability analysis, term extraction, and sentiment and opinion analysis. All examples include visual interfaces and visualizations that support the different steps of the analytic process.
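The notion of a quasi-semantic measure as a combination of measurable text features can be illustrated with a small sketch. The two features, the weights, and the function names below are assumptions chosen for illustration only; they are not the measures defined in [134].

```python
import re

def text_features(text: str) -> dict[str, float]:
    """Compute two simple, measurable surface features of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "avg_word_length": 0.0}
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

def quasi_semantic_measure(text: str, weights: dict[str, float]) -> float:
    """Weighted combination of features approximating one semantic aspect.

    Hypothetical example: a crude proxy for the property 'reading difficulty'.
    The weights are illustrative assumptions, not values from the literature.
    """
    features = text_features(text)
    return sum(weights[name] * features[name] for name in weights)

# Hypothetical weights for a difficulty proxy.
WEIGHTS = {"avg_sentence_length": 1.0, "avg_word_length": 2.0}
score = quasi_semantic_measure("The cat sat. The dog ran.", WEIGHTS)
```

In an interactive setting, the analyst would inspect such scores visually and adjust the feature combination, which is the user-in-the-loop step the framework calls for.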

Further approaches: Further related approaches using visualization to support linguistic tasks are rather sparsely scattered and have appeared at different venues of different research communities. On the technical level the approaches are quite diverse and can hardly be compared. For example, Honkela et al. [81] have obtained visual syntactic category clusters by generating self-organizing maps based on word context vectors. Later, Wattenberg and Viégas [178] created the Word Tree visualization that was primarily aimed at visualizing the content structure of texts, but can also be used to visualize language features, as shown by the example of a tree containing Greek nominal suffixes. Further subfields of computational linguistics that have used visualization are machine translation [4, 40] and discourse parsing [188], where the output can be interactively explored and corrected. One of the very few examples where visualization is applied to investigate a phenomenon of language change was published in 2012 by Lyding et al. [112]. They use a parallel coordinates display to visually explore the distribution of modal verbs in academic discourse. Several academic disciplines can be compared for two points in time in order to detect changes.
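To make the idea of word context vectors concrete, the following sketch builds simple co-occurrence vectors and compares them with cosine similarity. It is a minimal illustration under stated assumptions (a window of one token, raw counts, a toy sentence); it does not reproduce the setup of [81], and the self-organizing map step is omitted entirely.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens, window=1):
    """Map each word to a count vector of the words occurring near it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical): nouns share contexts with nouns, not with verbs.
tokens = "the cat sleeps the dog sleeps a cat eats a dog eats".split()
vectors = context_vectors(tokens)
noun_noun = cosine(vectors["cat"], vectors["dog"])
noun_verb = cosine(vectors["cat"], vectors["sleeps"])
```

Words of the same syntactic category tend to receive similar vectors, which is what allows a subsequent clustering or mapping step to place them close together.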

2.1.2 Open Issues

Both theses mentioned in the previous subsection share the conclusion that combining computational methods with interactive visualizations enables text analyses that go far beyond what can be achieved with standard methods.

Typically, analysis tasks come from the context of business, marketing, and security applications where large text collections have to be explored. Systems designed to support such analyses usually incorporate linguistic and natural language processing methods to achieve a higher analytic quality and grant deeper insight. In other words, visual text analytics profits from advances in (computational) linguistic research. In the case of Collins' thesis, the converse also holds: visualization methods were used to support linguistic research and improve linguistic methods. However, the improved methods (machine translation and automatic speech recognition) belong to the subfield of computational linguistics, which naturally interfaces with other computational methods such as visualization. The computational linguistic researchers profiting from his novel visualizations already had a high affinity for computational methods: "Preliminary discussions revealed that they spent most of their time sitting at a computer, programming" [30].

Apart from Collins' work, the related work on visual analytics systems that support linguistic researchers in their tasks is not very advanced, is often rather vague with respect to the tasks to be supported, and is also rather superficial on a technical level. The most obvious gap in previous research is the lack of visual analytics approaches that support subfields of linguistics without a long tradition of computer-aided research. Manual data analysis is still widespread, for example in research on historical language developments and cross-linguistic comparisons. Manual data analysis is very accurate on a detail level, but does not scale to the exploration of large data repositories. On the other hand, purely computational analyses, as they are usually performed within the research field of corpus linguistics, do not provide enough flexibility, because a concrete analysis procedure has to be determined beforehand and the analytic process cannot be interacted with and guided on-the-fly. Thus, the insight is limited and the trustworthiness of results cannot be confirmed easily. At the same time, more and more linguistic data is becoming available in digital format and waiting to be explored in-depth.

2.1.3 Goals of this Thesis

This thesis opens up a novel area of research that is about to become a new subdiscipline in linguistics and computational linguistics. In that sense some of the presented research is groundbreaking. Chapters 3 and 4 advance the previous state of research by introducing novel visual analytics approaches which support subfields of linguistic research that traditionally rely on manual analyses: Linguistic Typology and Historical Linguistics.

First, in Chapter 3 we integrate an extensive amount of available information about languages into one visual data analysis environment to support the field of areal typology in its research. We show how past language change in phonology and morphology can be traced and explored by performing cross-linguistic comparisons of multi-variate language features. To this end, in the first part of the chapter language features are presented both in areal and genealogical contexts. In the second part of the chapter, we present a novel matrix-based visual analytics method that enables linguists to compare different languages with respect to complex features, i.e. vowel and consonant sequence patterns within words. We show that with our method, languages containing special sound patterns can easily be depicted visually based on processing limited fragments of text. For both parts of the chapter we show that it is important to arrange visualizations in a way that interesting visual patterns are likely to emerge. In the provided case studies we demonstrate that a meaningful spatial sorting or ordering of visual objects, based on their feature values, makes unexpected interesting patterns in the data visible.
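As a rough illustration of comparing languages by consonant and vowel sequence patterns, and of ordering matrix rows by a feature value, consider the sketch below. It is not the method of Chapter 3: the orthographic vowel set, the bigram patterns, the sorting criterion, and the sample labels "A" and "B" are all assumptions made for this example.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: a crude orthographic proxy for vowels

def cv_skeleton(word):
    """Reduce a word to its consonant/vowel skeleton, e.g. 'strand' -> 'CCCVCC'."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in word.lower() if ch.isalpha()
    )

def pattern_profile(words, n=2):
    """Relative frequency of each C/V n-gram over a sample of words."""
    counts = Counter()
    for word in words:
        skeleton = cv_skeleton(word)
        for i in range(len(skeleton) - n + 1):
            counts[skeleton[i:i + n]] += 1
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}

def ordered_matrix(samples, sort_by="CC"):
    """Build a language-by-pattern matrix with rows sorted by one feature value."""
    patterns = ["CC", "CV", "VC", "VV"]
    profiles = {lang: pattern_profile(words) for lang, words in samples.items()}
    order = sorted(profiles, key=lambda lang: profiles[lang].get(sort_by, 0.0))
    rows = [(lang, [profiles[lang].get(p, 0.0) for p in patterns]) for lang in order]
    return patterns, rows

# Hypothetical word samples standing in for short text fragments of two languages.
samples = {"A": ["strand", "sprint"], "B": ["banana", "potato"]}
patterns, rows = ordered_matrix(samples)
```

Sorting the rows by the share of consonant clusters places similar samples next to each other, which is the kind of ordering that lets patterns become visible once the matrix is rendered.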

In Chapter 4 we introduce novel computational methods for tracking and understanding change in lexical semantics coupled with interactive visual result representations. In the first part of the chapter we show that methods from topic modeling are well-suited to induce word senses from word contexts. The visualization is generated fully automatically from a large diachronic corpus and reveals the appearance of new word senses. This includes a description of the new word sense, the point in time when it first appeared, and the frequency development over time in relation to other senses of the same word. In the second part of the chapter we suggest visualizations to investigate the dynamics of the cross-linguistic spread of new coinages like words ending in the suffix -gate.
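The quantities named for the first part of Chapter 4 (when a sense first appeared and how its frequency develops relative to other senses) can be sketched as follows. The sketch assumes that sense labels have already been assigned to dated occurrences by some induction step, which is not shown; the function name, the labels, and the years are hypothetical.

```python
from collections import Counter, defaultdict

def sense_timeline(occurrences):
    """Summarize dated, sense-labelled occurrences of one word.

    occurrences: iterable of (year, sense_label) pairs.
    Returns (first_seen, shares), where first_seen maps each sense to the
    earliest year it occurs and shares maps each year to the relative
    frequency of every sense observed in that year.
    """
    per_year = defaultdict(Counter)
    first_seen = {}
    for year, sense in occurrences:
        per_year[year][sense] += 1
        if sense not in first_seen or year < first_seen[sense]:
            first_seen[sense] = year
    shares = {}
    for year in sorted(per_year):
        total = sum(per_year[year].values())
        shares[year] = {s: c / total for s, c in per_year[year].items()}
    return first_seen, shares

# Hypothetical labelled occurrences of a single word.
occurrences = [
    (1990, "animal"), (1990, "animal"),
    (2000, "animal"), (2000, "device"),
    (2010, "device"), (2010, "device"), (2010, "animal"), (2010, "device"),
]
first_seen, shares = sense_timeline(occurrences)
```

Plotting the per-year shares as stacked areas or lines would then show a new sense emerging and gaining ground against the older one.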

For feature extraction, in most cases we bear Oelke's framework in mind, using concepts similar to her quasi-semantic properties in order to computationally model the linguistic research tasks.

The part of the thesis dealing with visual analytics for linguistic inquiry is cutting-edge in that it brings visual analytics research to a new application domain. Only after our first joint work had been published in 2010 and 2011 did the first workshops and conferences in this field appear, which led to repeated citations of our work: the EACL 2012 Joint Workshop of LINGVIS & UNCLH, Visualization of Linguistic Patterns and Uncovering Language History from Multilingual Resources, April 23-24, 2012, Avignon, France; the AVML 2012 Conference on Advances in Visual Methods for Linguistics, York, United Kingdom, September 5-7, 2012; and the Workshop on the Visualization of Linguistic Patterns at the Annual Conference of the German Linguistic Society 2013, March 12-15.

Interdisciplinary collaborations between computer science and the humanities are an emerging line of research termed enhanced or digital humanities. This thesis promotes such interdisciplinary research efforts by bringing together visual analytics and linguistics. As part of the conclusions (see Section 7.2.1), best practices, pitfalls, and lessons learned are also discussed, which in turn may be beneficial for other branches of the digital humanities.

2.2 Visual Analytics in Time-Oriented Text