
Visualization of

Large Document Corpora

Dissertation submitted to the Department of Computer and Information Science of the Universität Konstanz for the academic degree of Dr. rer. nat.

by

Dipl.-Inf. Hendrik Strobelt

from Zwickau/Sa.

Date of the oral examination: November 6, 2012

Referee: Prof. Dr. rer. nat. Oliver Deussen
Referee: Prof. Dr. rer. nat. Daniel A. Keim
Chair of the examination board: Prof. Dr. rer. nat. Michael Berthold

Konstanzer Online-Publikations-System (KOPS)


Учиться, учиться и ещё раз учиться. (Learn, learn, and learn again.)

teacher of W. I. Lenin

Profanity sucks. (14)

Be more or less specific. (15)

Analogies in writing are like feathers on a snake. (19)

excerpt from Rules of Writing
by Frank L. Visco (June 1986, in Writer’s Digest)


Learn, learn, and learn again is the dedication of this thesis. I have met many people so far and have learned from them.

The most outstanding share, however, belongs to two groups of people whom I want to honor especially: my parents and grandparents. They formed the material and emotional foundation of my life; they are my backing and my unconditional supporters.

During my time in Konstanz I learned a lot from several people, laughed with them or debated with them, drank coffee with them or non-non-alcoholic beverages – they deserve my thanks:

• for supervision, support, and discussions – Prof. Dr. Oliver Deussen, Daniel Keim, Michael Berthold, Ulrik Brandes, Dietmar Saupe

• for being there, listening, and untangling many an expense report – Ingrid Baiker

• as colleagues and fellow sufferers – Michael Balzer, Thomas Luft, Andreas Urra, Joachim Böttger, Boris Neubert, Daniel Heck, Thomas Schlömer, Sören Pirk, Joachim Braun

• for many hours of joint work and the resulting friendships – Josua Krause, Marc Spicker, Michael Zinsmaier, Julian Kratt

• Daniela Oelke, Christian Rohrdantz, Andreas Stoffel, Andrada Tatu, Enrico Bertini, the Keimlinge

• Iris Adä, Heather Fyson, Peter Burger, the BioMLers, the KNIME-linge

• Uwe Nagel, Martin Mader, Roman Rädle, the Brandes group, the Reiterer group

• countless friends whom I was fortunate to meet here and who have enriched my life.


Abstract

Documents appear regularly in daily life in various designs and lengths, serving different purposes. We are used to reading novels, newspapers, advertising flyers, instruction manuals, bus tickets, tube maps, etc. In addition, a lot of professional life is based on browsing through and understanding documents. Methods to reduce the stacks of printed paper on our desks and to allow greater scalability than an office room would offer are the driving research objectives of this thesis. As casual as this vision sounds, as profound and manifold are the research questions related to it.

The thesis at hand covers topics from content acquisition to interaction with visualizations. A compact introduction motivates document visualization from different viewpoints and discusses former efforts. As a preliminary for later use, specific methods for content extraction from document files are described. Document Cards use this content to represent a document’s textual and image highlights as rich, small-scale representatives. The cards are intended to be used in larger applications to replace dots in collection browsers.

For higher abstraction, tag clouds can summarize document collections. How CDTE Tag Clouds can reflect content and context changes of dynamically evolving collections is shown in the corresponding chapter.

A common and important visual variable used in all visualizations in this thesis is position. Positions of data representatives can express closeness, reveal groupings, and help build mental maps. When objects of non-zero extent, like text snippets or Document Cards, represent entities at specific positions, overlap can occur, resulting in visual clutter. A review and evaluation of practical methods to remove overlap leads to the invention of Rolled-Out Wordles, a simple but effective method for dense visualization scenarios.

The last chapter describes a design study, an interaction paradigm, and the challenges of interdisciplinary work. HiTSEE for KNIME allows biochemists to observe structure-activity relationships in high-throughput screening experiments as an integration into the KNIME platform. Although based on biochemical data and tasks, the fundamental methods for visualization and interaction are applicable to a wide range of large-data visualization systems, including document collection browsers.

Finally, a conclusion summarizes insights and describes future work ideas.


Zusammenfassung

We encounter documents every day, in different lengths and manifold forms, serving different purposes. We are used to reading novels, newspapers, advertising flyers, bus tickets, subway maps, instruction manuals, etc. In addition, a large part of professional life is based on searching through and understanding documents. Developing techniques that reduce the paper stacks on our desks and allow us to manage more documents than would fit into our offices – these are the driving questions of this dissertation. As simple as this vision may sound, the scientific questions tied to it are as profound and manifold.

This dissertation covers topics ranging from content access to interaction with visualizations. A compact introduction motivates the field of text visualization from different perspectives and discusses previous efforts. As a prerequisite for later use, special techniques for content extraction are described. Document Cards use this content to represent a document at small size by means of text and image highlights. The cards are meant to replace dots as representatives in document browser applications. To reach a higher level of abstraction, tag clouds can summarize entire text collections. How CDTE Tag Clouds reflect the content and context changes of dynamically evolving collections is described in the corresponding chapter.

A general and important visual variable used in all visualizations of this work is position. Positions of data representatives can express proximity and group membership and help build mental maps. When objects of non-zero extent, such as text snippets or Document Cards, represent entities, occlusions can occur that are perceived as visual clutter. A review and evaluation of practical techniques for removing these occlusions leads to the proposal of Rolled-Out Wordles, a simple but effective method for densely packed arrangements.

The last chapter describes a design study, an interaction paradigm, and the challenges of interdisciplinary collaboration. HiTSEE for KNIME allows biochemists to investigate structure-activity relationships for high-throughput screening experiments within the KNIME platform. Although the project is based on biochemical data and tasks, its fundamental methods of visualization and interaction are transferable to systems for displaying large amounts of data, such as document browsers.

A conclusion summarizes the insights gained, and future research topics derived from them are discussed.


Contents

1 Motivation
  1.1 Text Visualization
    1.1.1 The historic trail
    1.1.2 The psychological approach
    1.1.3 Related Work in Text Visualization
  1.2 More-Than-Text Visualization
  1.3 Scientific Contributions
  1.4 Additional Scientific Contributions

2 Small-scale, Rich Document Representatives – Document Cards
  2.1 Introduction
    2.1.1 Related Work
    2.1.2 Design Considerations
  2.2 Algorithms
    2.2.1 Key Term Extraction
    2.2.2 Image Weight and Image Classification
    2.2.3 Layout
  2.3 Application
    2.3.1 Analyzing the InfoVis 2008 proceedings
    2.3.2 Interactive tool
    2.3.3 Large scale system: The Conference Kiosk
    2.3.4 Application to Small Devices
  2.4 Evaluation – Preliminary User Study
  2.5 Conclusion & Future Work

3 Visualizations for Text Collections Evolving over Time – Context-Dependent Time-Evolving TagClouds
  3.1 Introduction
    3.1.1 Related Work
  3.2 Algorithms
    3.2.1 Positioning
    3.2.2 Dealing with overlap
    3.2.3 Animation
  3.3 Applications
    3.3.1 Data and Term Extraction
    3.3.2 Temporal overviews
    3.3.3 Evolving context around a query term
  3.4 Evaluation
  3.5 Conclusions and Future Work

4 Overlap Removal for Data Representatives in 2D Space – Rolled-Out Wordles
  4.1 Introduction
    4.1.1 Related Work
    4.1.2 Common Algorithms
  4.2 Rolled-Out Wordle Algorithm
  4.3 Application
    4.3.1 Geolocated Data
    4.3.2 Projected Data
  4.4 Evaluation
    4.4.1 Measures
    4.4.2 Evaluation Results
  4.5 Conclusion
    4.5.1 Future Work

5 Interaction with Data Representatives in 2D space – HiTSEE KNIME
  5.1 Introduction
    5.1.1 Related work
    5.1.2 High-Throughput Screening (HTS)
  5.2 Algorithms
    5.2.1 Data pre-processing
    5.2.2 Tasks
    5.2.3 HiTSEE
    5.2.4 HiTSEE for KNIME
  5.3 Evaluation – Case Studies
    5.3.1 Case study 1: the Kif18A data set
    5.3.2 Case study 2: NCI AIDS antiviral screen
  5.4 Conclusion
    5.4.1 Lessons learned
    5.4.2 Future work

6 Conclusion

A Content Extraction from PDF files
  A.1 Related Work
  A.2 Our Method

Chapter 1

Motivation

Large document collections are essential resources for a wide variety of professionals, such as scientists, lawyers, and analysts. As documents can nowadays be accessed easily and their amount continuously increases at a high rate, processing and curating them is a tedious task. This thesis describes methods to reduce this information overload when one is confronted with large (document) data. Our research is based on the consideration that documents are sets of texts and images. We introduce text visualizations first and later approaches integrating figurative and textual content.

Contents

1.1 Text Visualization
  1.1.1 The historic trail
  1.1.2 The psychological approach
  1.1.3 Related Work in Text Visualization
1.2 More-Than-Text Visualization
1.3 Scientific Contributions
1.4 Additional Scientific Contributions


1.1 Text Visualization

A serious introduction to text visualization has to state that it cannot be complete. Why? When starting to work in the field, researchers are already confronted with the main problem itself: a large collection of documents covering many different aspects related to the subject text. Psychological research, e.g., investigates the perception and cognition of letters, the psychology of spoken and written language, or the psychology of reading. Linguistics describes, inter alia, models of language structure, language function, language features, etymology, and linguistic transformations. While both disciplines already fill books and would require introductions by themselves, we have so far considered neither visual appearance (typography) nor the evolution of sign systems. As a practical approach, we limit this introduction to key aspects in the development of text and text visualizations, taking the historic tour (Section 1.1.1), describing psychological backgrounds (Section 1.1.2), and describing landmarks in text visualization (Section 1.1.3). As a further simplification, we consider written text to stem from an alphabetic system.

1.1.1 The historic trail

This section relies widely on facts taken from textbooks by Andrew Robinson [Rob09] and Donald Jackson [Jac81]. Both references are recommended for further reading.

Early humans started representing and saving information as sequential paintings on cave walls, so-called proto-writing. The paintings from Chauvet cave [CDH96] date back at least 21,000 years. They are considered to be “the oldest and the most elaborate ever discovered” (Sadier et al. [SDB+12]). These paintings served as both pictures and written text at that time; the mostly abstract images already tell a story. The divergence between image and text representations started 5,000 years ago in Mesopotamia, where writing systems like the Sumerian cuneiform evolved from pictographic into logographic form. While pictograms are stylized symbols of images, logographs represented morphemes, the smallest units of meaning (semantics) within a language. At the same time, Egyptian hieroglyphs were already combining pictographic, morphemic, and phonemic elements. Their sign system included 24 signs representing consonants, which could be considered an early form of alphabet. Several circumstances, like the ease of writing on papyrus vs. writing in stone, prevented a simplification to only this subset of signs. While the intermediate steps of development from hieroglyphs to an alphabet are a subject of discussion, it is commonly accepted that the Phoenician alphabet is one of the earliest, developed 3,000 years ago. The Phoenicians were traveling salesmen, which explains why the roots of their system are a mixture of Mediterranean cultures. Their abjad is the first known system mapping exactly one symbol to one phoneme, replacing the one-symbol-to-one-syllable association. Consequently, the Greeks named their ordered set of letters the alphabet, in reference to its first entries α and β.

Figure 1.1: Examples of Chauvet cave painting (∼20,000 years ago), Egyptian hieroglyphs (∼3,000 years ago), and part of the Nibelungenlied script (∼900 years ago). (License: Wikimedia Commons Public Domain)

In Europe, the Romans became dominant, and the Latin capital letters were invented, as well as their italic form. During the times of Charlemagne (8th century) and the medieval period, writing and copying remained a manual process creating sheets of image-text art. While printing had already been developed during the 8th century in China, the printing method with movable letters by Gutenberg (15th century) allowed fast reproduction. The impact on page style was a clearer functional separation of text and image content, although for a long time initials or Schnörkel remained as decoration. The industrial revolution led to the invention of typewriters (1867), and during WW2 the first electronic calculating machines were invented. The successors of these machines influenced recent history by setting two milestones for text (and image) content creation: personal computers with word-processor applications (1970s/80s) and the popularization of the World Wide Web (1990s) lowered the costs of document production and document distribution to a minimum.

1.1.2 The psychological approach

We have already seen that text is nowadays as rapidly producible and distributable as never before, but we have not yet shed light on how humans “consume” text. Schönpflug & Schönpflug [SS95] and Rayner & Pollatsek [RP94] provide extensive details on the psychological processes involved in reading, which we summarize in this section.

The consumption of text can mainly be split into reading as the perceptual part and understanding as the cognitive part. For reading, the human visual system performs saccadic eye movements to process lines of text. Each saccade¹ takes on average 20 to 35 ms to bridge a range of 7 to 9 characters. Between saccades, the eye fixates for 150 to 500 ms. While mainly moving forward, 10–15% of saccades are regression saccades that re-investigate already-read text. Fast readers are trained to lower the number of these regression movements. A second deviation from moving forward are line endings, which require return sweeps to move to the next line. These sweeps are combinations of two saccades, the first bridging the long distance and the second refining the precise position.

¹ “Saccades are ballistic movements; once started, they can not be altered.” (Rayner and Pollatsek [RP94])

During reading, a combination of parallel letter processing and sound processing transforms the text (partially automatically) into words, which are held in working memory. This process does not necessarily involve understanding. As an example, we can read words and clearly pronounce them without generating any meaning from their sequence. Understanding requires another processing step: semantic analysis. Semantic processing extracts meaning and identifies concepts and their relations. Essential for this acquisition is the correspondence with syntactic analysis, which investigates the structure of texts, e.g. by assigning roles to words within a sentence. Syntactic analysis is often described as happening before semantic analysis, but there are also theories of parallel processing of both. Nonetheless, both are needed to extract the meaning of texts. Finally, in a pragmatic step, the text is interpreted, involving context about the writer or the writing style.

Text mining systems reproduce these steps of psycho-lingual analysis as algorithms on the lexical, syntactic, and semantic levels (see Ward et al. [WGK10]). During lexical analysis, the electronic text is transformed into atomic tokens like words or n-grams (analogous to phonemes or morphemes). Each token is assigned syntactic attributes that mark the token’s function within a sentence (part-of-speech analysis) and its cardinality (singular, plural), or tokens are grouped into token sequences like person names, time expressions, or phone numbers (named entity recognition). Ideally, semantic analysis discovers the meaning of a sentence or a text snippet. This understanding cannot be completely retrieved or represented by an electronic device. For automatic machine-based processing, common text mining systems deliver information that is the result of pre-defined tasks. These quasi-semantic analysis tasks are e.g. summarization, sentiment analysis, or event detection.
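To make these steps concrete, the following is a minimal sketch of the lexical and syntactic stages using the NLTK library; the toolkit choice and the example sentence are illustrative assumptions, not part of the original work.

    import nltk  # assumes the models punkt, averaged_perceptron_tagger,
                 # maxent_ne_chunker and words have been downloaded

    text = "Document Cards were presented at InfoVis 2009 in Atlantic City."

    tokens = nltk.word_tokenize(text)  # lexical analysis: atomic tokens
    tagged = nltk.pos_tag(tokens)      # syntactic attributes: POS tags
    entities = nltk.ne_chunk(tagged)   # grouping tokens into named entities

    print(tagged[:3])  # e.g. [('Document', 'NN'), ('Cards', 'NNS'), ('were', 'VBD')]
    print(entities)    # a tree grouping e.g. 'Atlantic City' as one entity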

1.1.3 Related Work in Text Visualization

So far, human text consumption and the historical development of sign systems have been introduced. This section provides an overview of the landmarks in text visualization.

At first, the question might arise as to whether the term text visualization itself can be considered a pleonasm. The historic trail reveals that text itself is a very abstract visualization of human thoughts and feelings, so text visualization seems to be over-defined. To clarify the terminology, we will consider all visualization approaches that take text (documents) as input and apply analytics to them for later display as text visualization. This textual input is, as mentioned earlier, easily available – and at a large scale.


Projects like Wikipedia [Wik12], the New York Times Corpus [San08], BioMedCentral [BMC12], Twitter [Twi], or the Open Library [Ope12] are only a few examples of the text (and image) sources accessible nowadays. As they are not only large but constantly growing, gaining insights becomes more demanding. To circumvent this information overload, a form of abstraction is needed to “show the amounts of information that are beyond the capacity of textual display” (Chaomei Chen [Che05]). Further evidence for the need for text visualization is given by Thomas and Cook [TC05], who define it to be part of the grand challenges in the field of visual analytics. To show the grouping of related work, we introduce Figure 1.2, which depicts a hierarchy of text aggregation levels ranging from a single letter to sets of document collections. The categories are characterized by the aggregation levels they strongly relate to, either as data input or as object of visualization.

Figure 1.2: Categorization of text visualization approaches w.r.t. different levels of aggregation (letter, word, word group, sentence, paragraph, section, chapter, document, document cluster, corpus, corpus of corpora), grouped into linguistic visualization, single document visualization, and document collection visualization.

Linguistic visualizations mostly represent statistical linguistic measures. Christian Rohrdantz presents good examples in his work, like observations about vowel harmonies across languages [RMB+10]. Another important linguistic research question involves different corpora – visualizations for language comparison, like the Languages Explorer [RHM+12]. Remarkable linguistic visualizations also stem from Christopher Collins. Collins et al. [CP06] applied visualization to a multi-lingual chat scenario to show uncertainties during machine translation. DocuBurst [CCP09] shows document content by mapping term counts onto the linguistically based hierarchy WordNet [Mil95, Fel98]. Wattenberg et al. [WV08] build a Word Tree from a given root term. Each branch of this tree represents a different text context related to the root term within a text corpus. Daniela Oelke contributed strongly in the linguistic domain. Keim and Oelke [KO07] generate literature fingerprints from quasi-semantic, linguistic text measures and map them to color in a pixel-based visualization for documents or whole collections. This allows the observation of patterns like peculiarities in authorship or syntactical repetition in bible texts. In their work on readability analysis, Oelke et al. [OSSK12] reduced a set of 141 readability features to a set of five features with significant variance. Their tool (VisRa) consists of different views, including pixel-based visualizations, to allow the investigation of each feature from the sentence level up to the document level.

Single document visualizations give an overview of one document and its content. Besides the scientific approaches, we should keep in mind that the most used, simplistic visualizations for documents are document reader applications or web browsers. Most of them offer simple search and highlighting functionality. Our approach (Stoffel et al. [SSDK12]) enhances the common highlighting: we adapt the font size of the text to a degree-of-interest function² while retaining the overall paragraph layout.

² A simple DOI function is: doi = 1 if a specified search term is found, and doi = 0 otherwise.

Besides modifications of the document display, several methods summarize (textual) content. Prominent examples are tag clouds like Wordle [VWF09] or TagCrowd [Ste06], which will be discussed intensively in Chapter 3. TextArc by Paley [Pal02] presents text lines along an ellipse while important terms are positioned within the ellipse. Interaction allows further inspection: words can be selected, and lines including the selected word are highlighted and linked via strokes to the selection. The aforementioned DocuBurst [CCP09], Word Tree [WV08], and VisRa [OSSK12] show special aspects of a document’s textual content and belong to this category, as does Frank van Ham’s Phrase Net (van Ham et al. [vHWV09]). Phrase Nets are aesthetic network visualizations of entities and their relationships. A relation (like “is a”) and entity types (like names) are selectable. As an example of interesting patterns, the relationship “X begat Y” was visualized for male name entities in bible texts. The resulting cyclic graph required an explanation, which is given in the publication. During content creation, revisions of documents are regularly made persistent. For tracking changes between these revisions, Chevalier et al. [CDBF10] recommend the use of animated transitions to support change awareness. Another category of algorithms, related to small-scale representations of documents, is the central topic of Chapter 2.

Document collection visualizations are widely addressed in the literature. Early attempts relate to structured text like program code. Eick et al. [ESJ92] introduced Seesoft, which displays lines of code as colored lines organized file-wise in stripes. The color of each line is determined by a degree-of-interest function, which can e.g. represent the authorship of lines of code.

For unstructured text, spatial visualizations are common. Kaski et al. [KHLK98] applied self-organizing maps (SOM [Koh90]) to text documents, arranging them on a regular grid so that, with least error, similar documents fall into neighboring cells. Olsen et al. [OKS+93] describe the VIBE system, which positions document representatives at interpolations between points of interest. These points of interest are collections of key terms associated with positions. Wise et al. [WTP+95, Wis99] arrange key terms in Galaxies and ThemeScapes on the 2D plane and reflect semantic dissimilarity by Euclidean distance. The authors call this mapping of information to aspects of natural environments the “ecological approach” [Wis99]. Miller et al. [MCWBF98] follow this idea and create TOPIC ISLANDS, a 3D landscape of topics extracted from texts. The combination of these island views and further visualizations should allow the user to “browse a document, generate fuzzy document outlines, summarize text by levels of detail and, according to user interests, define meaningful subdocuments, query text content, and provide summaries of topic evolution” [MCWBF98]. Oesterling et al. [OST+10] give an island-like visualization and extensively discuss the problem of measuring distances of high-dimensional data like bag-of-words representations of documents. FacetAtlas from Cao et al. [CSL+10] represents textual entities within different facets and encodes links between them as edge bundles connecting clusters in a landscape of entities. Paulovich et al. [PTT+12] describe ProjCloud, a system that generates clouds of key terms and arranges them in 2D space so that closely related terms form clusters, which are represented as groups of terms. Thiel et al. [TDKB07] present methods to analyze topic shifts within corpora over time. Terms are represented as period-frequency vectors, and multidimensional scaling is applied to their pairwise distances. The resulting map shows terms and their temporal distribution in one visualization.

As already mentioned, document collections can underlie temporal changes. While [TDKB07] strongly relates to the landscape metaphor, the following methods specifically address temporal evolution in a linear way. ThemeRiver [HHWN02] represents the development of topic importance as stacked bar charts along the x-axis. To support the flow metaphor and to ease perception, the charts are vertically centered and the connecting lines are smoothed. Milos Krstajic shows in his work (Krstajic et al. [KBK11]) the use of CloudLines, an online method to display event episodes for more than one time series. A zoom lens can be used within the system to distinguish overlapping events. More related work on time-evolving corpus visualization can be found in Chapter 3.

Several approaches facilitate special purposes. Jigsaw uses different coordinated views for the forensic analysis of criminal records: Stasko et al. [SGLS07] invented the system to explore connections between entities extracted from full text and metadata. As a business-relevant application, Oelke et al. [OHR+09] visualize the results of opinion analysis on customer feedback data. Opinion tendency and strength for different product parts and products are summarized in a tabular view. For metadata analysis, citation or co-author networks [KBV04] are well known. TileBars by Marti Hearst [Hea95] is a good example of search result visualization. Documents retrieved from a search are represented by a title and a bar. The length of the bar indicates document length. Each bar is split into squared compartments that correspond to text segments. Their coloring and intensity indicate query term occurrence in the particular text segment. She states that “the representation simultaneously and compactly indicates relative document length, query term frequency, and query term distribution”. We use this encoding for interactive components in Document Cards (see Section 2.2.3). As a last example, text collections and related metadata from library databases are objects of investigation in HCI research. Thudt et al. [THC12] describe an exhibition setup in a library using metadata like book color or book size to facilitate serendipitous findings. In this context, Rädle et al. [RWH+12] recently focused on collaborative aspects of book searches.

As stated at the beginning, related work can only reflect milestones and examples of branches in this wide field of visualization for text documents and text document collections. Nonetheless, the reader should now be able to take the given related work as a seed point for further research. Additional sources of information are the books of Marti Hearst [Hea09] and Matthew Ward et al. [WGK10]. A good summary of text stream visualizations is given by Artur Silic [vB10].

1.2 More-Than-Text Visualization

Text visualizations are applied to large corpora of documents but often do not consider the value of the additional information given by images. Both types of content have different properties and fulfill different functions. Text is capable of describing thoughts and feelings as universally as no other representation. It allows the description of abstract concepts like e.g. “freedom” or “god” [Naj98]. Text is read sequentially, which makes it a good transmitter of sequential data, while its processing of parallel data is slow. A text’s universality is limited by language dependencies. In contrast, images are processed highly in parallel and can be language independent (think of photographs). While they allow a very detailed description of a fact or a scene, abstract concepts require a common knowledge of their representation. Figure 1.3 gives examples of images where abstract concepts require shared context knowledge among observers to be readable. The circles around the heads in the Madonna painting can be interpreted as a sign of holiness by people who know Christian symbolism. The formula (a pictogram) depicting that for all x there is at least one y that fulfills y = x + 1 requires mathematical symbolic conventions. On the other hand, an image like the photograph in Figure 1.3 is a good example of how richly images describe scenes in very compact form. To convey the whole information (including e.g. the blue car) in text would require considerable effort and much space. Both the photograph and the Madonna image exemplify that pictures are good transmitters of empathy.

Figure 1.3: Examples of figures depicting various concepts (panels: concept holiness, concept infinity, photograph): the Benois Madonna (∼1478, Leonardo da Vinci) on the left visualizes the concept of holiness by the commonly shared symbol of a halo; the formula at the top right requires common knowledge of mathematical symbols to be interpretable; the photograph exemplifies how expressive images can be in a limited space. (License: Wikimedia Commons Public Domain)

The following selection of statements on images and texts has been largely collected from Colin Ware’s book [War04] and Strothotte and Strothotte [SS97]:

• Images perform better at showing structural relationships. Bartram [Bar80] examined journey planning for bus rides and found that graphical representations worked better than tables.

• Visual information is generally remembered better than verbal information, but not for abstract images. [BKD75]

• Text is better than graphics for conveying program logic. [War04]

• Verbs are awkward to express in presentational pictures. [SS97]

• Presentational pictures are good for communicating structural information. [SS97] Without this agreement, information visualization would not be beneficial. Especially important in this context are two references: the first, by Cleveland and McGill [CM84], covers pre-attentive features and how precisely we can perceive visual features before really drawing attention to them; as a second reference, Gestalt theory describes laws that affect human perception when looking at spatial arrangements of visual items [Wer23].

After reflecting on aspects that separate images and texts, we now look for evidence that their combined use can be of benefit. Bieger and Glock [BG86] observed that providing pictorial context for an assembly task reduced assembly time and slightly increased correctness. Bock [Boc78] performed experiments on the detection and processing of ambiguous words and sentences. As a result, “the subject’s awareness of sentence ambiguity, and hence the depth of semantic analysis, was found to depend on the pictorial context in which the sentences were presented. The pictorial context was also found to affect the depth of processing of unambiguous sentences, which, when presented without a picture, were more time-consuming in comprehension and less well recalled than when preceded by a picture”. These findings indicate the potential of using images as contextual information. A good example of the intuitive use of images to disambiguate texts is given by Collins et al. [CP06]: for a machine translation scenario, they use Flickr³ images as word representatives if the measured word ambiguity between two languages is too high. A classical example of integrating graphics into text is a drawing by Oliver Byrne (1847) visualizing the proof of the Pythagorean theorem (see Figure 1.4).

Before related work is discussed, we take a look at how images and texts are linked in corpora of mixed documents. During the analysis of such a corpus, we can build a network of images and terms, e.g. via reference decomposition. When creating such a bi-partite concept graph, it is hard to detect what acts as a concept and what are its members. E.g., the photograph in Figure 1.3 could be a concept for its members “child” and “netherlands”, while “child” could itself be a concept for multiple images showing children. This problem is known in the literature; for further reading see the PhD thesis of Tobias Kötter [Kö12] or Kötter and Berthold [KB12]. The adoption of methods for visualizing this duality of images and texts is interesting future work.
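As a toy illustration of such a bi-partite image-term graph (the file names and terms are hypothetical, mirroring the example above):

    # images mapped to the terms extracted for them (illustrative data)
    image_terms = {
        "photo_001.png": {"child", "netherlands", "car"},
        "photo_002.png": {"child", "playground"},
    }

    # invert the mapping: term -> images it is linked to
    term_images = {}
    for image, terms in image_terms.items():
        for term in terms:
            term_images.setdefault(term, set()).add(image)

    # "child" links to several images; whether it acts as a concept or as
    # a member is exactly the ambiguity discussed above
    print(term_images["child"])  # {'photo_001.png', 'photo_002.png'}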

³ http://www.flickr.com


Figure 1.4: Visual proof of the Pythagorean theorem in "The First Six Books of The Elements of Euclid" (1847 by Oliver Byrne). (License: Wikimedia Commons Public Domain)


Related work

Systems that use figurative and textual content in combination are often dominated by one type. As examples of image-dominated systems we consider image search engines (like Flickr), image annotation systems (like CATMAID [SCHT09]), or the image tagging and commenting functions of social networks (like Facebook or Imgur⁴). Text-dominated systems have been discussed as text visualization systems in Section 1.1.3. Here we want to focus on integrated visualizations.

The BioText search engine from Hearst et al. [HDG+07] integrates abstracts, titles, and figures in a single list of search results. Pafilis et al. [POJ+09] developed the reflect⁵ system, which tags names of genes, proteins, or small molecules on biological websites. When clicking on an annotation, it presents a concise summary of image and text information on the specific entity. Jhaveri and Räihä [JR05] describe their method, Session Highlights, of representing webpage thumbnails as a user-selected browser history. Meanwhile, rendering page previews has become an essential part of web browsers and search engines. These methods are not selective with respect to content and do not summarize; they just display “first pages”. More selective are Marian Dörk’s projects of Visual Backchannel [DGWC10], integrating text and images for following communication structure, and his display of linked VisGets [DCCW08]. We discuss advanced approaches to document summarization further in Chapter 2.

We conclude with the WordsEye approach of Coyne and Sproat [CS01]. They convert textual descriptions into 3D scenes. After linguistic analysis, 3D models are queried from a large model database and model attributes are derived. The resulting images are simplistically iconic. For the 3D construction, assumptions have to be made that are not part of the original text and can therefore induce false information (analogous to pareidolia).

⁴ http://www.facebook.com or http://www.imgur.com

⁵ http://reflect.ws


1.3 Scientific Contributions

This thesis is based on projects that focus on different aspects of investigating large data sets. Although they have different application domains and scopes, the insights and algorithms support a joint vision of a system that allows the browsing of large document corpora.

Chapter 2 describes a method for the automatic creation of small document representatives that contain important key terms and images. It is the foundation of this thesis, and large parts have been published:

H. Strobelt, D. Oelke, C. Rohrdantz, A. Stoffel, D. A. Keim, and O. Deussen; Document Cards: A Top Trumps Visualization for Documents; IEEE Transactions on Visualization and Computer Graphics, Volume 15, Issue 6, pages 1145-1152 (InfoVis 2009), November 2009.

*authors’ contributions: HS mainly developed the Document Cards system and was lead writer, co-authored by DO. AS focused on PDF text extraction, CR focused on text mining for term extraction.

While publication corpora are updated at low frequency (monthly, yearly), Chapter 3 describes an attempt to visualize an evolving news collection that updates at high frequency. We modify text cloud methods in a way that the evolution of term importance and term relations can be observed. This work reflects the outcome of a collaboration with Iris Adä, Enrico Bertini, Martin Mader, Kilian Thiel, Michael R. Berthold, Ulrik Brandes, and Oliver Deussen.⁶

⁶ Contributions: HS iteratively developed prototypes based on ideas of HS, IA, MB, and OD. HS was lead writer, while MM and UB provided details for anchoring.

Chapter 4 evaluates and discusses algorithms to remove overlap between data representatives positioned in 2D space. As a method that addresses a general problem, it can be widely applied. The work has been published:



H. Strobelt, M. Spicker, A. Stoffel, D. Keim, O. Deussen; Rolled-out Wordles: A Heuristic Method for Overlap Removal of 2D Data Representatives; Computer Graphics Forum, Volume 31, Issue 3pt3, pages 1135-1144 (EuroVis 2012), June 2012

*authors’ contributions: HS wrote the main parts of the text and formalized and extended an idea of MS.

Chapter 5 addresses a biomedical task. The interaction metaphor used in the resulting tool, named “Project and Expand”, is transferable to a collection browser to allow the investigation of document clusters and to support serendipitous findings. The HiTSEE chapter is based on two publications:

i E. Bertini, H. Strobelt, J. Braun, O. Deussen, U. Groth, T. U. Mayer, D. Merhof; HiTSEE: A Visualization Tool for Hit Selection and Analysis in High-Throughput Screening Experiments; Proceedings of the 1st IEEE Symposium on Biological Data Visualization (BioVis 2011), 2011

ii H. Strobelt, E. Bertini, J. Braun, O. Deussen, U. Groth, T. U. Mayer, D. Merhof; HiTSEE KNIME: a visualization tool for hit selection and analysis in high-throughput screening experiments for the KNIME platform; BMC Bioinformatics 2012, 13(Suppl 8):S4, May 2012

*authors’ contributions: HS and EB developed and programmed the prototype. HS integrated HiTSEE into KNIME. All authors contributed to the writing under the lead of HS and EB. Passages that are exclusively mappable to individual authors are marked as citations.


1.4 Additional Scientific Contributions

During my PhD period I contributed to additional projects that are not further highlighted in this thesis. Related to text and summary visualization is the following work together with Andreas Stoffel:

A. Stoffel, H. Strobelt, O. Deussen, D. A. Keim; Document Thumbnails with Variable Text Scaling; Computer Graphics Forum, Volume 31, Issue 3, pp. 1165-1173 (EuroVis 2012), June 2012

For the following publications I substantially helped with writing and su- pervising students:

i Julian Kratt, Hendrik Strobelt, Oliver Deussen; Improving Stability and Compactness in Street Layout Visualizations; Proceedings of VMV 2011: Vision, Modeling and Visualization, 2011

ii Michael Zinsmaier, Ulrik Brandes, Oliver Deussen, Hendrik Strobelt; Interactive Level-of-Detail Rendering of Large Graphs; IEEE Transactions on Visualization and Computer Graphics, Volume 18, Issue 12, pp. 2486-2495 (InfoVis 2012), November 2012

iii Josua Krause, Marc Spicker, Leonard Wörteler, Matthias Schäfer, Leishi Zhang, Hendrik Strobelt; Interactive Visualization for Real-time Public Transport Journey Planning; to appear at SIGRAD 2012


Chapter 2

Small-scale, Rich Document Representatives –

Document Cards

Contents

2.1 Introduction
  2.1.1 Related Work
  2.1.2 Design Considerations
2.2 Algorithms
  2.2.1 Key Term Extraction
  2.2.2 Image Weight and Image Classification
  2.2.3 Layout
2.3 Application
  2.3.1 Analyzing the InfoVis 2008 proceedings
  2.3.2 Interactive tool
  2.3.3 Large scale system: The Conference Kiosk
  2.3.4 Application to Small Devices
2.4 Evaluation – Preliminary User Study
2.5 Conclusion & Future Work


2.1 Introduction

As motivated in Chapter 1, it is an exhausting task for readers to get an overview of large document collections. In this chapter, an approach for the compact visual representation of documents is introduced, called Document Card (DC), which makes use of important key terms and images (see Fig. 2.1). By using terms as well as images, we follow the idea of combining the informative value of texts with the descriptive nature of images in one view. The visualization aims at a compact size to scale to large numbers of documents on display devices of different resolutions.

We reflect on related approaches for the creation of small-scale document representatives (Section 2.1.1), define design constraints for DCs (Section 2.1.2), specify the methods used for content extraction and layout in Section 2.2, give application examples (Section 2.3), and describe initial user tests in Section 2.4.


Figure 2.1: Document Cards help to display the important key terms and images of a document in a single compact view.

2.1.1 Related Work

A wide range of previous work can be considered for every sub-part of our method. We discuss scientific articles that specifically aim towards a practical solution, knowing that methods for finding key terms, classifying images, applying layout methods, etc. are broadly addressed elsewhere.

General Approaches

Operating systems integrate common solutions to explore collections of documents within file browsers. E.g., Microsoft Windows Explorer or Apple Finder provide a thumbnail preview of a document’s first page. Setlur et al. [SABG05] create document icons that include representative images from a web image database, found by key text features of a file. Other thumbnail approaches discuss the use of 3D icons, which map each piece of information onto a side of a cube (Henry and Hudson [HH90]), while Lewis et al. [LRFN04] focus on distinctive icons as a graphical solution to the “lost in hyperspace” problem. Previewing technologies like Apple Cover Flow or Quick Look add the capability to browse through document pages in place. Cockburn et al. [CGA06] show that representing all pages of a document in one view (Space-Filling Thumbnails) allows fast document navigation.

Visualizations for small devices aim at compact representations. Breuel et al. [BJPB02] propose to use and rearrange original text snippets from a text image to circumvent OCR¹ parsing problems. Berkner et al. [BSM03] extend this approach and create combined text-and-image thumbnails called SmartNails. The used images are scaled and cropped to automatically extracted regions of interest. How to find such regions is also described by Suh et al. [SLBJ03]. Suh and Woodruff [SWRG02] introduced Enhanced Thumbnails, which overlay and highlight extracted keywords on a scaled and saturation-reduced version of a web page. The idea of creating thumbnails of PDF files is discussed by Sauer et al. [BFH05]. They extract images from documents, sort them by file size, and arrange the “top few” of them on the front page. Berkner [Ber06] discusses an approach for finding the best scale for a document page related to its type of content. Lam et al. [LB05] introduced the concept of Summary Thumbnails, which represent webpages as thumbnail views enhanced with shortened text fragments in a larger font size. The main layout of the webpage remains (as does the total line count). Erol et al. [EBJ06] use the audio capability of a handheld device to auto-generate a small film introducing a document. The film contains images, and the most relevant terms are spoken; by using zoom and pan technologies, the layout is preserved. Russell and Dieberger [RDCJ03] describe how to automatically instantiate manually created Summary Design Patterns using texts and images.

¹ Optical character recognition.

Our approach combines images and key terms in a single Document Card. In contrast to other methods, we build a compact representation of a whole document that combines representative images with expressive key terms. These Document Cards can be used on a wide range of display sizes, e.g. as representatives in a collection browser or as a single summarization of a publication on a smart phone (see Figure 2.5(a)).

Term Extraction

Approaches for keyword or key term extraction originate from information retrieval, like the prominent TFIDF method ([SJ72], [SWY75]). An extensive survey of information retrieval methods was published by Kageura and Umino [KU96]. But key term extraction methods also play a role in text mining research, as pointed out by Feldman et al. [FFK+98]. Usually, a measure is defined to score terms with respect to a document or a document collection. A certain number of top-scored terms according to the measure are then extracted as key terms. Terms usually get a higher score if they are frequent and/or their occurrence distribution shows certain characteristics.

Whereas most approaches require a suitable document corpus for comparison in order to extract key terms from a single document, Matsuo and Ishizuka [MI04] describe a method that is able to extract key terms from a single document without further resources. The approach is based on the co-occurrence of terms in sentences and the χ²-measure to determine biased co-occurrence distributions in order to assess the importance of terms.

Our approach also uses an extension of the χ²-measure to identify important key terms. However, we base our extraction method on the structure of the document; the rationale for this is explained in Section 2.2.1.

Image Classification

For image classification, Chapelle et al. [CHV99] suggest Support Vector Machines operating on image histograms. Moreno et al. [MHV03] and Vasconcelos et al. [Vas04] suggest using the Kullback-Leibler divergence as a distance measure between two histograms.
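A minimal sketch of such a histogram distance is shown below; the symmetrized form is one common choice and not necessarily the exact variant used in the cited works.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # Kullback-Leibler divergence D(p || q) of two histograms;
        # eps avoids log(0) and division by zero
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p /= p.sum()  # normalize to probability distributions
        q /= q.sum()
        return float(np.sum(p * np.log(p / q)))

    def symmetric_kl(p, q):
        # KL is asymmetric; summing both directions gives a symmetric
        # dissimilarity usable as a distance between two histograms
        return kl_divergence(p, q) + kl_divergence(q, p)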

Layout

Placing a set of rectangular images optimally into a given rectangular canvas is known as rectangle packing. It belongs to the class of NP-complete problems.

(34)

From the wide range of algorithms which provide approximate solutions, three approaches are referenced here. A method from computer graphics uses efficient packing methods to create texture atlases [Sco12]. Murata et al. [MFNK95] introduced sequence pairs to transform the problem into a P-admissible solution space problem. The approach generates packings of high quality in reasonable time for offline use (as in VLSI design). Itoh et al. [IYIK04] have shown a fast, geometric algorithm for placing rectangles in a short time. The algorithm does not use global optimization, but produces packings of good quality. A survey of rectangle packing is given by Korf [Kor03].
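To make the problem class concrete, below is a minimal greedy “shelf” packing heuristic; it illustrates an approximate rectangle packer and is not one of the cited algorithms.

    def shelf_pack(rects, canvas_width):
        # rects: list of (width, height); returns a top-left (x, y) per rect.
        # Sort by height (tallest first) and fill horizontal shelves.
        order = sorted(range(len(rects)), key=lambda i: -rects[i][1])
        positions = [None] * len(rects)
        x, y, shelf_h = 0, 0, 0
        for i in order:
            w, h = rects[i]
            if x + w > canvas_width:  # rect does not fit: open a new shelf
                x, y = 0, y + shelf_h
                shelf_h = 0
            positions[i] = (x, y)
            x += w
            shelf_h = max(shelf_h, h)
        return positions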

Seifert et al. [SKK+08] give an overview of recent approaches for generating text clouds. Their approach describes an iterative algorithm that optimizes font sizes and string truncation to place text bounding boxes into given polygonal spaces. We adapt this approach in Section 2.2.3.

2.1.2 Design Considerations

Summarization is necessarily a lossy compression and requires decisions about what can be preserved and what has to be excluded. Document Cards address this problem with special foci that are reflected in the following constraints and design decisions:

• Document Cards are fixed-size thumbnails that are self-explanatory. Approaches like [BJPB02], [SWRG02], [LB05], and [EBJ06] preserve the main structure of a document in a fixed-size view, but they require interaction, like browsing or listening, to get an overview of the whole document. In [Ber06] the optimal scale for each page is calculated, which breaks the constraint of a fixed-size representation. As Document Cards shall also be applicable on small-screen devices like handhelds or mobile phones, it is an important feature that they provide meaningful global representations in a given limited space.

• Document Cards represent the document’s content as a mixture of figurative and textual representatives. Erol et al. [EBJ06] evaluated the most important parts of a document for the tasks of searching for it and understanding its content; the top three are: title, figures, and abstract. Since we are aiming at a small representation, we include the title (as the top feature) and a filtered collection of figures, and we extract important keywords as an approximation of the content. Previous approaches aiming at even smaller representations focus either on the semantic content (Semanticons [SABG05]) or on the contained images and image texts (SmartNails [BSM03]), but not both. We present novel methods that carefully filter the most meaningful representatives of both categories and combine them in one view.

• Document Cards should be discriminative and should have high recognizability. Summary Design Patterns [RDCJ03] provide a uniform look for summaries of picture collections. In contrast, we propose that Document Cards be distinguishable and recognizable by the layout of images and texts within a card. Since the elements are laid out individually for each DC, the outer shape of this layout describes the card uniquely. In addition, the card’s background is color-coded as described in Section 2.2.3. Nonetheless, it is future work to show how effective these considerations are w.r.t. discrimination and recognition.

2.2 Algorithms

In this section we describe the pipeline for creating Document Cards (see Figure 2.2). First, relevant key terms are extracted (Section 2.2.1). We then discuss how images are weighted, classified, and finally selected for presentation (Section 2.2.2). In Section 2.2.3, we show how the chosen images and terms are assembled into a Document Card. A necessary prerequisite is the content extraction from PDF files, which is described in Appendix A.

2.2.1 Key Term Extraction

Key terms are extracted for each document to describe its main content. For biomedical text mining, the distribution of keywords in scientific publications has been examined several times. Shah et al. [SPIBA03] searched for keywords in five standard sections and came to the conclusion that “information is unevenly distributed across the sections of the article, that is, different sections contain different kind of information”. A study by Schuemie et al. [SWS+04] that also examined five standard sections had a similar outcome, which was that “30-40 % of the information mentioned in each section is unique to that section”. Both studies come to the conclusion that abstracts, while having the highest keyword density, do not even nearly cover all the information (keywords) contained in a full-text scientific article.

Figure 2.2: The Document Card pipeline: full-text extraction (§ A.2) and image extraction (§ A.2) feed into key term extraction (§ 2.2.1) and image filtering (§ 2.2.2), whose results are combined by term placement and image packing (§ 2.2.3).

Based on these findings, we decided not to limit the term extraction to abstracts. Instead, we use full-text articles, regarding section boundaries also as topic boundaries. An author usually starts a new section when writing about a different topic or subtopic. As a result, non-relevant terms will appear equally distributed over all sections of the document, while the important key terms will not: they have higher frequencies in the sections of their particular topics and a lower frequency in the others. Thus, the non-equally distributed terms are the key terms we are looking for.

As a first step, the text has to be extracted and grouped into sections. A detailed description of how to retrieve this information is given in Appendix A. This structured text is the input to the key term extraction, which consists of several steps: the text is cleaned up in the preprocessing and candidate filtering steps, noun phrases are processed specially, and finally terms are extracted based on term scoring.

Preprocessing and Candidate Filtering

The preprocessing comprises sentence splitting, part-of-speech tagging, and noun phrase chunking with the OpenNLP tools [Kot12], as well as a base form reduction of words according to Kuhlen’s algorithm [Kuh77].


Next, in the candidate filtering step, we eliminate stop words and noise. Verbs are also deleted, a decision based on the empirical observation that even verbs with a characteristic distribution are of a rather general nature. For many papers the salient verbs are e.g. “work”, “show”, or “compute”, whereas approach-specific verbs mostly also appear in their nominalized form. For example, for the chapter at hand it would be much more meaningful to extract the terms “image extraction” or “term extraction” than the verb “extract”.

Special Noun-Phrase Processing

Compound nouns, i.e. noun phrases consisting of at least two terms, have the highest potential to be very descriptive for a certain paper. Among the 130 index terms that the authors of the InfoVis 2008 publications manually assigned to their papers, 92 (about 70 %) correspond to compound nouns, which emphasizes their importance. This is because they often correspond to technical terms that are very specific and descriptive for a described approach.

At the same time, we also consider sub-phrases of larger noun phrases. The noun phrase “a term extraction algorithm” has several sub-phrases that might be interesting. Our algorithm deletes leftmost articles like “a” and then builds every rightmost sub-phrase², e.g. in this case “term extraction algorithm”, “extraction algorithm”, and “algorithm”. In most cases, by shortening the noun phrase in this particular way, the shorter representations are generalizations of the longer ones, which are likely to appear more often. In the next section we describe how our method weights noun phrases specially to take into consideration the lower probability of re-occurrence.

² Sub-phrases are conjunctions of adjectives and nouns that can be determined via patterns of POS tags.
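A minimal sketch of this sub-phrase construction, assuming the noun phrase is already tokenized by whitespace:

    def rightmost_subphrases(noun_phrase):
        # strip leftmost articles, then emit every rightmost sub-phrase
        ARTICLES = {"a", "an", "the"}
        tokens = noun_phrase.lower().split()
        while tokens and tokens[0] in ARTICLES:
            tokens = tokens[1:]
        return [" ".join(tokens[i:]) for i in range(len(tokens))]

    # rightmost_subphrases("a term extraction algorithm") returns
    # ['term extraction algorithm', 'extraction algorithm', 'algorithm']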

Term Scoring and Term Extraction

For term scoring, the occurrence of every term is counted for each section separately. As a result, we get a vector for every term where each dimension corresponds to a section and each dimension’s value is the number of occurrences of that term in the section. We keep only those terms that occur at least seven times in the document; all other terms are considered too infrequent to be key terms.



For each of the remaining vectors we calculate how strongly it deviates from an equal distribution using an extension of the χ²-measure:

\[
\chi^2_{\mathrm{sec}}(t, D) = \sum_{s \in D}
\begin{cases}
\dfrac{\left(\mathrm{freq}(t,s) - \mathrm{freq}(t,D) \cdot \frac{\mathrm{size}(s)}{\mathrm{size}(D)}\right)^2}{\mathrm{freq}(t,s)} & \text{if } \mathrm{freq}(t,s) > 0\\[6pt]
0 & \text{else,}
\end{cases}
\]

where D denotes the document, s the section, and t the term. Accordingly, freq(t, s) is the occurrence count (observed frequency) of term t in section s, freq(t, D) the term’s count in document D, and size(x) the number of terms in a text unit x. The part freq(t, D) · size(s)/size(D) thus describes the expected frequency of a term t in a section s if we assume an equal distribution.

For every section, the squared deviation of the observed frequency from the expected frequency is summed up after normalizing it by the observed frequency. Usually, in the χ²-test the normalization is done by dividing by the expected frequency; we change this here to avoid overestimating terms in very short sections. Consider, for example, a term that appears once within a section of 10 words in a paper of 1000 words. With the standard normalization, the summand for this term and this section would be (1 − 1 · (10/1000))² / (1 · (10/1000)) ≈ 98, which is inappropriately high and would distort the overall result. With our normalization, the corresponding summand is only 0.98. The modified normalization still scores terms with strongly deviating distributions higher, but without the undesired effect of potentially over-scoring terms that appear in very short sections. At the same time, sections where a term is not contained do not contribute to the term score. Hence, high scores are assigned to terms that not only have a skewed distribution but are also present in several sections. This guarantees that terms are preferred that do not only appear in one section but ideally play a vital role in distinct parts of the document.
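A direct transcription of this scoring scheme into Python might look as follows; sections are assumed to be given as lists of base-form tokens, and the minimum-frequency threshold of seven occurrences is applied before scoring. This is a sketch under these assumptions, not the original implementation:

from collections import Counter

def chi2_sec(term, sections):
    # deviation from an equal distribution over sections,
    # normalized by the observed (not the expected) frequency
    doc_size = sum(len(s) for s in sections)
    doc_freq = sum(s.count(term) for s in sections)
    score = 0.0
    for s in sections:
        observed = s.count(term)
        if observed > 0:
            expected = doc_freq * len(s) / doc_size
            score += (observed - expected) ** 2 / observed
    return score

def score_terms(sections, min_count=7):
    # keep only terms that occur at least min_count times in the document
    counts = Counter(t for s in sections for t in s)
    return {t: chi2_sec(t, sections)
            for t, c in counts.items() if c >= min_count}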

Despite their descriptive nature, compound nouns are usually not among the highest scored terms according to the described method. To improve the score of the compound nouns, we boost them by doubling their occurrence counts compared to normal terms.

After scoring the terms with our χ²sec scoring function, the top-k terms with the highest scores are extracted. If the top-k terms contain compound nouns that are themselves contained in other compound nouns of the top-k, the shorter ones are discarded and replaced by the terms with the next highest scores. For example, if the terms “extraction algorithm” and “algorithm” are both present within the top-k terms, we delete the latter, keeping only the longest and thus most specific compound noun. The number k of terms to extract is determined by the available layout space in a DC (Section 2.2.3).
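Assuming that containment means the shorter term is a rightmost sub phrase of the longer one (as produced above), this filtering can be sketched like so:

def select_top_k(ranked_terms, k):
    # ranked_terms: terms sorted by descending chi2_sec score
    selected = []
    for term in ranked_terms:
        if any(other.endswith(" " + term) for other in selected):
            continue  # a longer compound noun already covers this term
        # drop previously selected terms that the new term contains
        selected = [o for o in selected if not term.endswith(" " + o)]
        selected.append(term)
        if len(selected) == k:
            break
    return selected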

Corpus-independent vs. Corpus-dependent Term Extraction

Our algorithm for key term extraction is corpus-independent. This lowers the preconditions for using the DC method because it does not depend on additional data sources and can be applied to a single document. TFIDF, as a method that relies on a comparison corpus, prefers to extract terms that discriminate one document from others, while our aim is to represent each document with its main content, which does not have to be exclusively discriminative. Additionally, topics that dominate a document corpus would not be extracted because of their lacking discriminative power.

Alternatively, if corpus information is given, the corpus itself could be described with important key terms, which can then be discarded from the contained Document Cards. For example, the kiosk system described in Section 2.3.3 is applied to a corpus of visualization publications. Terms like "visualization" or "analytics" can be used to describe the whole corpus. Mentioning these terms explicitly in every DC is unnecessary redundancy, while discrimination between the publications is clearly wanted. In this case, TFIDF would be an appropriate term extraction method for DCs, while corpus-wide topics are common to the whole collection and can be extracted via simple term counting. But before aiming at such a hierarchical approach, the question has to be answered whether a human is capable of perceiving and operating with such a representation (see future work in Section 2.5).

2.2.2 Image Weight and Image Classification

In addition to full-text, figures are extracted from documents (see Appendix A).

We use a figure’s textual context for weighting and ranking, figure content for classification, and finally define a selection strategy based on content and context.


Image weighting

Image weighting is based on the text associated with a figure. We specify a figure’s textual context as the concatenation of its caption text and referencing text, i.e. sentences in the document full-text that refer to this specific figure. Our method uses regular expressions to find these references. We then consider an image important if an important key term is found in its dedicated text.
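A possible pattern for collecting referencing sentences is sketched below; the regular expression and the crude sentence splitting are illustrative simplifications, not the exact expressions of our implementation:

import re

def referencing_sentences(full_text, figure_number):
    # naive sentence split on terminal punctuation
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    pattern = re.compile(r"\b(?:Fig\.|Figure)\s*%d\b" % figure_number)
    return [s for s in sentences if pattern.search(s)]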

Next to this automatically retrieved importance measure, we take into account the author’s intention. It is very likely that an author embeds important images more prominently, and thus bigger, than less relevant ones. We combine both aspects by resizing an image with the scaling function:

\[
\mathrm{scale} = 1.0 + \mathrm{scale}_{\max} \cdot w_{\max}, \qquad
\mathrm{size}_{\mathrm{image}} = \mathrm{size}_{\mathrm{image}} \cdot \mathrm{scale},
\]

where w_max is set to the maximum weight of all key terms found in the descriptive text, and scale_max is a constant factor that controls the influence of the key terms with respect to the size of the images. We experimentally set scale_max to a value of 0.5. Doing so, we combine the original image size (the author’s intention) and a semantic size boost (the automatic quasi-semantic measure) for later processing (see Selection Strategy).
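In code, this combination reduces to a few lines; term_weights is assumed to map extracted key terms to scores normalized to [0, 1]:

SCALE_MAX = 0.5  # experimentally chosen constant

def boosted_size(image_size, context_text, term_weights):
    # scale the original image size by the best key term
    # found in the caption and referencing sentences
    found = [w for term, w in term_weights.items() if term in context_text]
    w_max = max(found, default=0.0)
    return image_size * (1.0 + SCALE_MAX * w_max)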

Image classification

Images in a scientific document can show different content and fulfill different purposes. We define the following practical categories:

• A table (T) is a set of facts systematically displayed. Its colors are mostly black and white.

• An image of category (A) is a diagram, a sketch, or a graph image which shows a concept or has an explanatory character. It uses a reduced number of colors.

• An image of category (B) is a photograph or rendered image which shows a real-world scenario or an expressive computer-generated scenario. It is characterized by many colors, which have a rather complex distribution across the color space.


For classification, each image is represented as an HSV color histogram with 8 bins per color channel and 8 additional bins for grayscale, resulting in 8³ + 8 = 520 dimensions. Histogram values are normalized by the total number of pixels and sorted in decreasing order. This allows us to compare different images with respect to color distribution characteristics instead of the specific colors that were used. As recommended in [Vas04], [MHV03], and [CHV99], we use Support Vector Machines (SVMs) for classification (implementation of the LIBSVM library [CL12]) and the Kullback-Leibler divergence as distance function. For our application, we used a radial basis function kernel in the SVM and trained it with 57 representative images from the IEEE Vis 2008 proceedings corpus. In the classification step, the most probable class label is assigned to each figure.
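The feature vector can be reproduced roughly as follows (a sketch using Pillow and NumPy; the joint HSV binning and the grayscale handling are our assumptions, and the SVM training itself is left to LIBSVM):

import numpy as np
from PIL import Image

def color_feature(path, bins=8):
    # joint HSV histogram (bins^3) plus a grayscale histogram (bins),
    # normalized by the pixel count and sorted in decreasing order
    img = Image.open(path).convert("RGB")
    hsv = np.asarray(img.convert("HSV"), dtype=np.float64)
    gray = np.asarray(img.convert("L"), dtype=np.float64)

    hsv_hist, _ = np.histogramdd(hsv.reshape(-1, 3),
                                 bins=(bins,) * 3, range=[(0, 256)] * 3)
    gray_hist, _ = np.histogram(gray, bins=bins, range=(0, 256))

    feature = np.concatenate([hsv_hist.ravel(), gray_hist]) / gray.size
    # sorting makes the comparison independent of the specific colors used
    return np.sort(feature)[::-1]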

Selection strategy

In each Document Card, the number of images selected for display is limited by an absolute maximum max_im. Images are selected in the following order:

1. Omit tables. Images that have been classified as tables are omitted. The reason for this is that downscaled tables do not provide much information because their text is not readable at small font sizes. Only if no other image is available is a table shown in the DC.

2. Sort images in descending size order. Remember that the image size is influenced by the textual context related to each image.

3. Select the first max_im images and check if the following constraint is fulfilled: we want to display at least one image from each category. Thus, if there is no image of category A (or B) included in the list, the last image in the list is discarded and substituted with the largest image of category A (B, respectively).

4. Filter too small images, i.e. if the area of an image is smaller than 25 % of the largest image area, the image is discarded.

We give an example of such a selection list in Table 2.1; a code sketch of the procedure follows the table. The generated image selection list is the input to the layout process described in the following section.


figure   size    class   selected
fig. 1   12000   B       yes
fig. 3   10200   B       yes
fig. 5    9600   T       no (table)
fig. 2    8800   B       no (again type B)
fig. 4    7500   A       yes

Table 2.1: Example of a size-sorted image list that is used for selecting images for DCs (max_im = 3). Figure 5 and Figure 2 are not selected because they fail the selection criteria.
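Under the assumption that each image is given as a (name, size, category) record with its size already boosted, the four selection steps could be implemented roughly like this (our own sketch, not the original code):

def select_images(images, max_im=3, min_area_ratio=0.25):
    non_tables = [im for im in images if im[2] != "T"]
    pool = sorted(non_tables or images, key=lambda im: -im[1])  # steps 1+2
    selected = pool[:max_im]                                    # step 3
    for cat in ("A", "B"):
        if not any(im[2] == cat for im in selected):
            candidates = [im for im in pool if im[2] == cat]
            if candidates:
                selected[-1] = candidates[0]
    largest = selected[0][1]                                    # step 4
    return [im for im in selected if im[1] >= min_area_ratio * largest]

images = [("fig. 1", 12000, "B"), ("fig. 2", 8800, "B"),
          ("fig. 3", 10200, "B"), ("fig. 4", 7500, "A"),
          ("fig. 5", 9600, "T")]
print(select_images(images))  # fig. 1, fig. 3, fig. 4, as in Table 2.1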

2.2.3 Layout

In the previous sections we explained how to extract images and key terms. Images are resized according to their quasi-semantic weight and assigned to one of the image classes before a subset is selected for presentation. In this section we describe how to integrate this bag of terms and set of images into a compact view.

Image placement

Firstly, the filtered set of images has to be placed onto the DC canvas. Packing image bounding boxes to fit optimally into a given aspect ratio is an NP-complete problem. Therefore, a good approximation is needed that provides a fast solution with sufficient results. Itoh et al. [IYIK04] have presented such an algorithm, which we adopt and extend. They suggest using a penalty function for image insertion which penalizes increases of the resulting bounding box and deviations from the target bounding box aspect ratio. We extend the penalty function by an additional term that considers the difference between an image’s position in the layout and its position on its original page. That means that images appearing in the upper right of their original page tend to appear in the upper right of the summary visualization, which optimizes transitions for later interaction. After arranging the bounding boxes, the calculated layout is scaled to fit into the DC canvas.
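The extended penalty can be summarized as a weighted sum; the weights, the concrete distance measures, and the box representation below are our assumptions for illustration, with the first two terms following Itoh et al.:

def insertion_penalty(new_box, layout_box, target_ratio,
                      pos_in_dc, pos_on_page, w=(1.0, 1.0, 1.0)):
    # boxes are (x0, y0, x1, y1); positions are normalized (x, y) pairs
    gx0, gy0 = min(layout_box[0], new_box[0]), min(layout_box[1], new_box[1])
    gx1, gy1 = max(layout_box[2], new_box[2]), max(layout_box[3], new_box[3])
    # growth of the layout bounding box caused by the insertion
    area_increase = ((gx1 - gx0) * (gy1 - gy0)
                     - (layout_box[2] - layout_box[0])
                     * (layout_box[3] - layout_box[1]))
    # deviation from the target aspect ratio of the DC canvas
    ratio_diff = abs((gx1 - gx0) / (gy1 - gy0) - target_ratio)
    # displacement from the image's position on its original page
    displacement = ((pos_in_dc[0] - pos_on_page[0]) ** 2 +
                    (pos_in_dc[1] - pos_on_page[1]) ** 2) ** 0.5
    return w[0] * area_increase + w[1] * ratio_diff + w[2] * displacement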

[Figure 2.3: The split algorithm used for finding empty space rectangles. After insertion of image 1 the canvas is split into 4 regions; the bottom region is further split into 3 new regions on insertion of image 2.]

The images are positioned iteratively on the Document Card according to the coordinates that are given by the packing algorithm. At the same time we collect information about free areas in the canvas, which will be used later for key term placement. This is done as follows: for each insertion of an image, the surrounding free space rectangle is split into up to 4 new rectangles located on the top, bottom, left, and right side. Figure 2.3 illustrates the procedure for the insertion of the first and second rectangles. After inserting the first image at its position, the DC canvas is split into a left, right, top, and bottom section (left subfigure). The second image is placed beneath the first one in this example; it splits the free space rectangle at the bottom into three new sections: left, right, and bottom (right subfigure). The following algorithm details the process:

a list L_i of images with calculated positions;
a list L_r of free space rectangles;
initialize L_r with the DC canvas bounding box;
for all i in L_i do
    for all r in L_r that intersect i do
        split r into r_T, r_B, r_L, and r_R;
        add all r_X to L_r;
        remove r from L_r;
    end for
end for
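A runnable version of this splitting step, with rectangles as (x0, y0, x1, y1) tuples in a top-left coordinate system, could look as follows (a sketch; whether the four side regions may overlap is a design choice we leave open here):

def intersects(r, i):
    return r[0] < i[2] and i[0] < r[2] and r[1] < i[3] and i[1] < r[3]

def split_free_space(canvas, images):
    free = [canvas]  # list L_r, initialized with the DC canvas
    for img in images:
        next_free = []
        for r in free:
            if not intersects(r, img):
                next_free.append(r)
                continue
            x0, y0, x1, y1 = r
            candidates = [
                (x0, y0, x1, max(y0, img[1])),  # top
                (x0, min(y1, img[3]), x1, y1),  # bottom
                (x0, y0, min(x1, img[0]), y1),  # left
                (max(x0, img[2]), y0, x1, y1),  # right
            ]
            # keep only non-degenerate rectangles
            next_free += [c for c in candidates
                          if c[2] > c[0] and c[3] > c[1]]
        free = next_free
    return free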

By splitting the canvas horizontally we support the creation of free space rectangles with a width/height ratio larger than one, which is beneficial for placing key terms.
