
Proactive Visualization of Search Queries in Hierarchical Document Collections

Master Thesis

submitted by

Arlind Nocaj

to the

University of Konstanz

Faculty of Sciences

Computer and Information Science

1st Referee: Prof. Dr. Ulrik Brandes
2nd Referee: Prof. Dr. Daniel Keim

Konstanz, May 2011

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-147984


Abstract

Given a large collection of documents, a normal search interface helps the user only when the desired information is among the top 10 results. Although a hierarchical structure often exists as an organization paradigm, it is rarely used.

Here we propose an extension to the normal search interface which places search results in a hierarchical document structure to provide the user with a sense of context.

Our search extension is implemented as follows. First, in a preprocessing step, we create mental map positions of the document hierarchy according to document similarities. Next, we use Multidimensional Scaling to ensure that similar documents are close together. By combining Voronoi Treemaps with Stress Majorization we obtain a visualization which can proactively show the user the important parts of the hierarchy for a given search query.

Document similarity is taken into account, and by using the mental map positions as the initial layout the overall structure is largely maintained, as our measures show. The available space is used efficiently, and the context of the result documents is shown by drawing them as nodes and their dependencies as hierarchically bundled edges.

Our approach is scalable and widely applicable. The Voronoi Treemap is analytically computed in O(k · n log n), where k is the number of iterations and n the number of nodes in the hierarchy; previous approaches used Monte Carlo based methods and needed O(k · n² + n² log n).

The combination of Voronoi Treemaps and Stress Majorization might be used in any field where hierarchy, size and location of elements play an important role.


Acknowledgement

First of all, I would like to thank my supervisor Prof. Dr. Ulrik Brandes, who always found time for me. I am also grateful to the members of the Algorithmics group for the helpful discussions and to Natalie for providing access to the office.

Further thanks go to Prof. Dr. Daniel Keim for being the second referee and for providing the data. I am also thankful to Christine, Conrad and Matthias for proofreading.

Finally, my deepest thanks go to my parents and especially to my wife, who supported me all along.


Contents

1. Introduction 1

1.1. Motivation . . . 1

1.2. Thesis overview . . . 2

1.3. Proactive Query . . . 3

1.4. Preliminaries . . . 4

1.4.1. Hierarchically Clustered Graph . . . 4

1.4.2. Multidimensional Scaling . . . 6

2. Related Work 9

2.1. Placement by similarity . . . 9

2.2. Placement by hierarchical structure . . . 11

2.3. Placement by similarity and hierarchical structure . . . 13

3. Design 17

3.1. Requirements . . . 17

3.2. Techniques . . . 18

3.2.1. Treemaps . . . 18

3.2.2. Multidimensional Scaling . . . 19

3.2.3. Dependency Visualization . . . 20

3.3. Design & Color scheme . . . 21

4. Preprocessing 23

4.1. Data Input . . . 23

4.2. Document Similarity . . . 24

4.3. Search Index . . . 26

4.4. Mental Map creation . . . 26

5. Layout 29

5.1. Multidimensional Scaling with Anchoring . . . 30

5.1.1. Anchoring nodes . . . 32

5.1.2. Constrained Stress Majorization . . . 33

5.2. Voronoi Treemap . . . 34

5.2.1. Voronoi diagram . . . 36

5.2.2. Additive weighted Voronoi diagram . . . 38

5.3. Monte Carlo method . . . 40

5.4. Analytical method . . . 42

5.4.1. The Mapping ∗ . . . 43


5.4.2. The Fortune Algorithm . . . 45

5.4.3. The Fortune Algorithm for the weighted Voronoi diagram . . . 48

5.4.4. Find a region Rq on L which contains point p . . . 51

5.4.5. Intersection of bisectors . . . 53

5.4.6. From diagram to the polygon . . . 55

5.4.7. Representation of the diagram . . . 57

5.4.8. Analytical Voronoi Treemap . . . 59

5.5. Overlap Reduction . . . 61

5.6. Hierarchical Edge Bundling . . . 62

5.7. Quality Measurements . . . 67

5.7.1. Mental Map Stability . . . 67

5.7.2. Aspect Ratio . . . 70

6. Application Example 73

7. Summary 77

7.1. Outlook . . . 78

A. Quality Measurements 79

B. Visualization Examples 84

List of Figures 87

List of Algorithms 91

Bibliography 93


1. Introduction

1.1. Motivation

Information management is poor in many domains of today’s society. With huge amounts of documents or textual data, it is inadequate to only have a full-text search on this data, since we usually only consider the first few results in the result set, which often do not contain what we are looking for.

Current approaches and visualizations [18, 65, 64, 59, 75] do not offer the user a satisfactory view of how search results fit into the hierarchical document collection. ResultMaps [18], for example, uses nested rectangles and thus lacks the flexibility to adapt the visualization without a complete reorganization.

A hierarchy is often used to organize documents. The hierarchy is an organization paradigm that one can use to reduce the result set by choosing a certain branch of the hierarchy. In an evaluation of digital libraries, Fowler et al. [19] noted that in most cases the hierarchy is not used for more than a filtering option.

Although we may have used a search interface hundreds of times, and the data never changed, we often do not know anything about the underlying structure, which might help us understand the results.

Figure 1.1.: Normal approach for a search interface. The hierarchy is represented by a navigation menu and the most important results are visible when searching for something. The user gets no impression of where in the search space the results are located.1

1 Partly from http://www.studium.uni-konstanz.de/studienangebot/, visited in April 2011.


To get a sense of how knowledge of hierarchical organization can aid the search for information, consider visiting a physical library. When going to a library and searching for a book, one learns something each time from the spatial structure of the library. Although you never know exactly where a book is, you know where in the search space certain topics are. One might, e.g., know that on the left side of a corner there are some mathematics books.

In the case of a real library, a mental map of the search space is created from the underlying hierarchical structure and the positions of the books in the library. Besides considering the hierarchy, it would also be beneficial to see the dependencies between the documents, which can be based on metadata or on the textual content.

Goal: Our idea is to extend the normal search interface for hierarchically structured documents with a visualization which helps us understand the search space and the dependencies between the results we get. Our goal is thus to create a stable and flexible visualization which can be used as an extension of a search interface and helps the user to understand the underlying hierarchical structure of the document collection. By visualizing the dependencies between the documents, the user should understand in which context a certain result appears. In addition, the available area should be used for the important parts and thus adapt dynamically to the search query. The dependencies between documents should also be considered in the layout.

1.2. Thesis overview

In the following section we explain what proactive means in our context and provide some preliminary information and definitions.

After discussing related work on search query visualization in Chapter 2, we explain the design decisions for the different techniques in Chapter 3.

In Chapter 4 we continue with the preprocessing step, which creates the necessary indices for the search interface and for the Mental Map.

Then in Chapter 5 we provide detailed information on the layout algorithm, which combines Voronoi Treemaps with Stress Majorization, removes node overlaps and creates a graph layout with hierarchical edge bundling. We also introduce the analytical approach to the Voronoi Treemap computation and explain the necessary algorithms besides the Fortune Algorithm. This chapter furthermore provides some empirical measurements of the layout quality.

We then test our approach with some news data in Chapter 6 and finally conclude with a summary and an outlook on possible improvements.

The Appendix contains all figures on the quality measurements and a further visualization example.


1.3. Proactive Query

Queries are mostly understood as a sequence of words. The normal behaviour of a query system is that one enters a query and gets its results after starting the search process. But often a reformulation of the query is necessary to get the desired data. Google, for example, responds to each entered character with a list of results when the user is logged in. A more general formulation would thus understand a query as a sequence of characters with variable time intervals between them.

When the user has not yet completed his query one could either be lazy and wait for the query to finish or one could be eager and try to use the currently available information. We define a query q as a sequence of characters with time intervals between the characters;

q = (t1, c1, t2, c2, . . . , tk, ck),

where ti ∈ R+ for i ∈ {1, . . . , k} is the time duration and ci is the i-th character.

We can assume that t1 = 0 since we cannot be active before the first character is entered. What advantage might we get by proactively considering the individual characters? What consequences does this behaviour have on the resulting visualization?
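The timed-character query model above can be sketched as a small data structure (the names are illustrative, not part of the thesis):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedChar:
    delay: float  # t_i: time since the previous character (t_1 = 0)
    char: str     # c_i: the entered character

# q = (t1, c1, t2, c2, ..., tk, ck) as a list of (delay, char) pairs
query = [TimedChar(0.0, "v"), TimedChar(0.4, "o"), TimedChar(0.3, "r")]
text = "".join(tc.char for tc in query)
```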

We see two possibilities for handling the query so that the user gets a proactive response while entering it:

• Separate handling

• Dynamic handling

Separate handling: In this situation we compute the visualization for each query from scratch. This has the advantage that the resulting visualization is deterministic for a certain query. When the user starts entering a query, the computation is started in the background. If the computation is finished before the user enters the next character, it is shown. If it is not finished when the user enters a new character, the current computation is stopped and the new query is used for visualization.

This method has one big advantage: the sequence of characters and the operators used between the individual words of the query are flexible. Any operator which the search system supports may be used. This is preferable for many users since they are used to this kind of behaviour from search engines: they just enter some words and do not want to think about the operators between them.
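The restart-on-keystroke behaviour of separate handling could be sketched roughly as follows; `compute` and `show` are hypothetical callbacks, and comparing against the latest query stands in for stopping a running computation:

```python
class SeparateHandler:
    """Restart the visualization computation from scratch on every keystroke.

    compute(query, is_cancelled) is a hypothetical long-running layout routine
    that polls is_cancelled() between iterations and returns None when aborted;
    show(result) displays a finished layout.
    """
    def __init__(self, compute, show):
        self._compute, self._show = compute, show
        self._current = None  # the query whose computation is running

    def on_character(self, query):
        # A new character invalidates any running computation ...
        self._current = query
        # ... and starts a fresh one for the extended query (in a real
        # system this would run in a background thread).
        result = self._compute(query, lambda: query != self._current)
        if result is not None and query == self._current:
            self._show(result)

shown = []
h = SeparateHandler(lambda q, cancelled: sorted(q), shown.append)
h.on_character("v")
h.on_character("vo")
```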

Dynamic handling: The idea of dynamic handling is to start the computation of the visualization as soon as a character is entered. When the next character is typed by the user, the currently available layout is used as a starting point for the final layout. Depending on the iterative algorithm, the new search character is not

(10)

allowed to increase the result set and thus the hierarchy. This would cause a lot of problems if the technique used cannot easily adapt to the extended hierarchy.

If the size of the result set decreases monotonically with each character in the query, one can continue the computation of the final visualization by either removing parts of the hierarchy or changing their weights according to the smaller result set. A consequence is that no OR-queries with several words can be handled dynamically, because the result set could be increased by the next word.

The following points have to be considered when using the Dynamic Handling:

• Non-deterministic visualization result:

The user could get two different visualizations depending on how and how fast he entered the search query. This is not necessarily a disadvantage: it might even be beneficial if the layout is optimized more for the characters and words that stayed longer in the search query.

• Not every sequence of characters can be handled if the technique requires a monotonic decrease of the result set.

1.4. Preliminaries

1.4.1. Hierarchically Clustered Graph

A graph G is defined as G = (V, E), where V is the set of nodes and E ⊆ V × V is the set of edges, i.e. a set of pairs of nodes. We call the graph undirected if the pairs of nodes are unordered; if they are ordered, it is a directed graph. For our purpose it is enough to consider undirected graphs. A hierarchically clustered graph GH is an extension of G.

The hierarchical clustering is defined using another set C of nodes and another set H of edges between nodes n1, n2 ∈ (C ∪ V). C represents the clusters and H ⊆ (C × V) ∪ (C × C) represents the hierarchy. An edge between two clusters denotes that one cluster is contained in the other. An edge between a node v ∈ V and a cluster c ∈ C denotes that v is contained in cluster c. Since it is a hierarchy, it must have a unique root r ∈ C.

Definition 1.1 (Hierarchically Clustered Graph). A hierarchically clustered graph (HCG) is given by the triple GH = (V ∪ C, E ∪ H, r), where the subgraph T = (C ∪ V, H) is a rooted tree with root r ∈ C.

Fig. 1.2 shows an example of a graph and a hierarchically clustered graph. The edges of H in Fig. 1.2 are only directed for easier understanding of the hierarchy.

Note that it is not necessary that the edges in H are directed, since the direction is clearly defined by giving the root r. For more on graph terminology such as subgraph and tree, see [53].
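A hierarchically clustered graph can be represented with little more than a parent map for H and an edge set for E; the following minimal sketch uses string identifiers (all names are illustrative):

```python
class HCG:
    """Minimal sketch of GH = (V ∪ C, E ∪ H, r): the tree edges H are
    stored as a child-to-parent map, the graph edges E as a set."""
    def __init__(self, root):
        self.root = root
        self.parent = {root: None}  # encodes H; the root has no parent
        self.edges = set()          # encodes E (undirected)

    def add_child(self, parent, child):
        self.parent[child] = parent

    def add_edge(self, u, v):
        self.edges.add(frozenset((u, v)))  # unordered pair

g = HCG("r")
g.add_child("r", "c1"); g.add_child("r", "c2")
g.add_child("c1", "v1"); g.add_child("c2", "v2")
g.add_edge("v1", "v2")
```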

Definition 1.2 (height(n)). The height of a node n ∈ (V ∪ C) is defined as the graph theoretic distance of n from the root r. The root thus has height zero.


Figure 1.2.: (a) Example graph G = (V, E); (b) example of a hierarchically clustered graph GH

Definition 1.3 (parent(n)). The parent of a node n ∈ (V ∪ C) is defined as the node which lies on the path from r to n and whose height is one smaller than height(n):

parent(n) := { x ∈ C | (x, n) ∈ H ∧ height(x) = height(n) − 1 }.

Definition 1.4 (children(n)). The children of a node n ∈ (V ∪ C) are the nodes of which n is the parent:

children(n) := { x ∈ (V ∪ C) | parent(x) = n }.

Definition 1.5 (ancestor(n)). The ancestors of a node n ∈ (V ∪ C) are the nodes which lie on the path from r to n:

ancestors(n) := { x ∈ C | ∃ (n1, n2, . . . , nk) : (n1 = r) ∧ (nk = n) ∧ (x = ni for some i) ∧ (parent(ni) = ni−1 for i ∈ {2, . . . , k}) }   (1.1)

Definition 1.6 (descendant(n)). The descendants of a node n ∈ (V ∪ C) are the nodes for which n lies on their path to r:

descendant(n) := { x ∈ (V ∪ C) | ∃ (n1, . . . , nk) : (n1 = r) ∧ (nk = x) ∧ (n = ni for some i) ∧ (parent(nj) = nj−1 for j ∈ {2, . . . , k}) }   (1.2)

Definition 1.7 (neighbour(n)). The neighbours of n are the nodes which have the same parent as n:

neighbour(n) := { x ∈ (V ∪ C) \ {n} | parent(x) = parent(n) }.

Definition 1.8 (CA(n, m)). The common ancestors of two nodes n, m ∈ (V ∪ C) are the nodes which are ancestors of both n and m:

CA(n, m) := { x ∈ C | x ∈ ancestors(n) ∧ x ∈ ancestors(m) }.

Definition 1.9 (LCA(n, m)). The least common ancestor of two nodes n, m ∈ (V ∪ C) is the common ancestor with the shortest path to the nodes:

LCA(n, m) := { x ∈ C | x ∈ CA(n, m) ∧ ∀ y ∈ CA(n, m) \ {x} : height(x) > height(y) }   (1.3)

Definition 1.10 (path(u, v)). A path from u to v in a hierarchically clustered graph GH = (V ∪ C, E ∪ H, r), where u, v ∈ (V ∪ C), is an ordered sequence of nodes which describes the path on the hierarchy of GH:

path(u, v) := (n1, n2, . . . , nk) : (ni ∈ (V ∪ C) for i ∈ {1, . . . , k}) ∧ (n1 = u) ∧ (nk = v) ∧ ((nj, nj+1) ∈ H for j ∈ {1, . . . , k − 1})   (1.4)

Uniqueness of LCA

Proof. Let GH = (V ∪ C, E ∪ H, r) be a hierarchically clustered graph. The least common ancestor of two nodes n, m ∈ (V ∪ C) is unique. This follows from the fact that the subgraph T = (V ∪ C, H) is a rooted tree with root r: if there were two least common ancestors a1, a2 ∈ C with height(a1) = height(a2) and a1 ≠ a2, then there would exist two different paths from r to n, which would form a cycle in T. Since T is a tree, it cannot contain a cycle.
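Under the definitions above, height and least common ancestor can be computed by simply walking the parent pointers; a small sketch with an illustrative parent map:

```python
# child -> parent map of the rooted tree T; the root maps to None.
parent = {"r": None, "c1": "r", "c2": "r", "v1": "c1", "v2": "c2"}

def height(n):
    """Graph-theoretic distance of n from the root r (Definition 1.2)."""
    h = 0
    while parent[n] is not None:
        n = parent[n]
        h += 1
    return h

def lca(n, m):
    """Least common ancestor (Definition 1.9): collect the ancestors of n,
    then walk up from m until the first of them is met."""
    seen = set()
    while n is not None:
        seen.add(n)
        n = parent[n]
    while m not in seen:
        m = parent[m]
    return m
```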

1.4.2. Multidimensional Scaling

Multidimensional Scaling (MDS) is concerned with the geometric positioning of objects whose pairwise similarities (or dissimilarities) are given. The positioning should be such that the Euclidean distance between two objects represents their similarity (or dissimilarity). Looking at such a visualization then helps in analyzing the objects and also in understanding the measure used.

The commonly used techniques relevant for our thesis are classical scaling and distance scaling. Classical scaling is based on spectral decomposition and yields a unique solution, while distance scaling improves given initial positions by changing them to fit certain distances. The favoured Stress Majorization belongs to the distance scaling techniques and seems to be superior to the often-used force-directed techniques [14].
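As an illustration of classical scaling, the following sketch recovers 2D positions from a matrix of pairwise distances via the spectral decomposition of the double-centered squared distances (a textbook construction, not the implementation used in this thesis):

```python
import numpy as np

def classical_scaling(D, dim=2):
    """Classical MDS: embed points so Euclidean distances approximate D.
    D is an (n x n) symmetric matrix of pairwise dissimilarities."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    w, V = np.linalg.eigh(B)                 # spectral decomposition
    idx = np.argsort(w)[::-1][:dim]          # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Four corners of a unit square: the pairwise distances are reproduced exactly.
D = np.array([[0, 1, 1, np.sqrt(2)],
              [1, 0, np.sqrt(2), 1],
              [1, np.sqrt(2), 0, 1],
              [np.sqrt(2), 1, 1, 0]], float)
X = classical_scaling(D)
```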


2. Related Work

There are different approaches to visualizing search results or to give the user a feeling of where in the search space the results are located. In a study of educational digital libraries Clarkson et al. [19] note that there is great potential for using the underlying hierarchical structure of the libraries and that the current systems do not go beyond simple filtering.

There are different approaches [10, 67, 44] which try to visualize search results in 3D and help the user in exploration and navigation, but it is rather difficult for users to understand these systems and visualizations.

The related work can be categorized as follows:

• Approaches using similarity for the layout:

All these approaches have the disadvantage that they do not consider a given hierarchical structure in the visualization.

• Approaches using hierarchy for the layout:

The similarity between documents is not considered.

• Approaches using similarity and hierarchy for the layout:

Although they are a good starting point, they have disadvantages which will be explained in the following.

2.1. Placement by similarity

Websom [48]: Websom uses self-organizing maps [74] to create a thematic map.

In a preprocessing step, document vectors are created and used for training a self-organizing map. For the different map regions, labels are automatically selected and shown to characterise the regions. After that, the document vector of a search document is used to return the best matching map locations.

The underlying neural networks need extensive training to achieve good results. The use of the area does not change for a given query, and thus the important parts are not brought to the foreground.

SPIRE [75]: SPIRE is similar to WEBSOM but offers a much more detailed view of the data. There is an overview with statistically derived key topics and also detailed views for a region, in which the documents represent stars in a galaxy.


Figure 2.1.: Websom example which uses self-organizing maps to mark important result regions (from [48])

Documents near each other are similar, and peaks in a region indicate a high concentration of similar documents. It has the same disadvantages as WEBSOM but offers more possibilities in mapping metadata to the visualization.

Figure 2.2.: SPIRE uses self-organizing maps to create maps which consider document similarities. Different attributes can be mapped to the visual properties.

Galaxy of News [64]: Galaxy of News constructs relations between news articles and visualizes them by showing topical keywords, which can be used to show more details on a topic down to the news article level. On each selection the view changes, which makes it hard to maintain orientation. It also offers no query capability that would help in searching for the desired information.


2.2. Placement by hierarchical structure

Figure 2.3.: Cat-A-Cone visualization of the hierarchy using cone trees. By rotating the cone tree, elements in the background become visible. (see [44])

Cat-A-Cone [65, 44]: A navigation interface is provided by using cone trees in a three-dimensional view. To see hidden categories, one can rotate the cones. Different branches can be opened up to the document level. The problems of this representation are clear: space is not used efficiently and navigation in large hierarchies is very slow. Even for small hierarchies visibility can be a problem, since broad hierarchy levels lead to a lot of hidden objects which only become visible as soon as one rotates the tree. An example of this approach is shown in Fig. 2.3.

The Hyperbolic Browser [56]: The Hyperbolic Browser represents a tree in two dimensions by using hyperbolic arcs, see Fig. 2.4. Important elements can be brought into focus by reorganising the tree such that the desired elements are located in the center. This has the limitation that two completely different parts of the tree cannot be in focus at the same time. Because the whole hierarchy is always shown, a lot of space is wasted.

H3 Browser [59]: This approach is similar to the Hyperbolic Browser. It extends the hyperbolic method to a 3D sphere which the user can use to navigate the hierarchy. Similarity is again not taken into account, and just one element can be in focus.


Figure 2.4.: Hyperbolic Browser which changes the focus on the queried parts of the hierarchy. Elements which are in the middle are clearly visible.

ResultMaps [18]: ResultMaps is a treemap based visualization which is used in addition to the normal ranking based list representation, see Fig. 2.5a. This goes farthest in the direction of our goal, since it uses the whole space provided by the treemap approach. By using Squarified Treemaps [16] a good aspect ratio is achieved, but it is not possible to take similarity or neighbourhood relations into account. The space is not used for the important parts of the hierarchy but for showing the entire document space at all times.


Figure 2.5.: ResultMaps, which visualizes the document space and the corresponding results for a query by using a Squarified Treemap (from [18])

2.3. Placement by similarity and hierarchical structure

InfoSky [1]: InfoSky uses Voronoi diagrams to partition the space into galaxies where documents are represented as small stars. They also take document similarity into account, but do not use the space for the parts of the hierarchy which are important for the current query: the results of a query are merely marked in a precomputed layout. Another problem is that the document relations are not clearly visible because of the space partitioning. Two nearly identical documents which are in different categories might end up in completely different areas, which makes the user think that these documents are not similar at all.

Information Pyramids [2]: Information Pyramids uses a 3D treemap view to represent the hierarchy and the corresponding elements, see Fig. 2.7. The rectangular approach has the disadvantage that either the aspect ratio is good or the similarity is considered, but not both; combining them results in long and thin rectangles. A reorganization of the visualization for better space use is not done when a query is entered.

WebMaps InternetMap [50]: WebMaps is similar to a treemap. It creates a map, preserves the hierarchical structure, tries to optimize the distance between similar categories, and offers zooming into regions. This view of the data is static: the regions do not change, and the results are merely marked or shown as icons in the corresponding region. One also cannot see the relations between the items. WebMaps was a commercial attempt to visualise the search space and the results, see Fig. 2.8.


Figure 2.6.: InfoSky: visualization where the results of a search query are highlighted in a hierarchical search space representation by using yellow markers. (from [1])

General Problems: In general, one can say that at least one of the following points is not considered in the previous approaches:

• Document Similarity

• Hierarchy

• Area usage for important parts of the hierarchy

• Mental Map

• Document Relations

In the next sections of this thesis we will try to pay attention to all these points to create a solution which gives the user a view that helps him better understand the data.


Figure 2.7.: Information Pyramids: Hierarchy grows in the third dimension which results in a pyramidal representation of the hierarchy. (from [2])

Figure 2.8.: WebMap, which creates zoomable regions to visualize the corresponding hierarchy. The regions can get quite complex and thus hard to recognize. The view does not change for different queries. (from [28])


3. Design

Although the previously described techniques have a lot of advantages, they have some drawbacks which we will try to overcome. In the following we explain our goals and the combination of techniques they lead to.

3.1. Requirements

The requirements for the visualization are:

Hierarchical structure preservation

Since the underlying hierarchical structure rarely changes, it is beneficial to use it in the visualization and preserve the hierarchy as much as possible. This supports the user's orientation and interpretation. The results of a search query should then be visible within this visualization of the hierarchy.

Mental-map preservation

Since many changes cause confusion and make it hard for the user to recognize objects, it is beneficial to avoid changing the relative positions of, e.g., two categories of the hierarchy.

Similarity Consideration

When similar objects are near each other, it is easier for the user to recognize such a similarity; this way clusters become more easily visible. Thus the similarity of the documents and also the similarity of the categories has to be taken into account in the layout.

Efficient space usage

Since this visualization supports the normal search query, it cannot be expected to occupy the whole visible area. Even if the whole visible area were available, it would not be enough because of the huge amount of data. We thus have to use the space efficiently for the parts which are important at a certain point in time.

Representation of document relations

Since it is not possible to solve all of the above requirements optimally, it is important to represent the document relations such that the user can easily perceive the relation between two documents.


Ideally real-time

The visualization is part of a search query response and thus its computation must not take too long. So it is important to shift as much work as possible into a preprocessing step and to speed up the response time.

The goal is to help the user by visualizing the result space while still preserving as much of the structure of the search space as possible.

Motivated by the above goals, an area-inclusion method seems to be a good choice for efficient use of the available space and representation of the underlying hierarchical structure.

3.2. Techniques

Combining two techniques is often useful since one can profit from the advantages of both. In the following we describe the methods we decided to combine for our visualization.

3.2.1. Treemaps


Figure 3.1.: Small hierarchy and its treemap. The nodes are represented by nested rectangles, which preserves the hierarchy relations. The node weight (number in brackets) corresponds to the area of the rectangle.

The area-inclusion metaphor is widely used in information visualization. In software engineering, UML [37] uses the area-inclusion metaphor to represent certain relations. Treemaps were invented by Shneiderman [71], who used them to explore the file structure and file sizes on his disk. They extend the area-inclusion metaphor by completely using the available space to represent an attribute by the area of nested rectangles. Fig. 3.1 shows an example of a hierarchy and its corresponding treemap. There are different layout algorithms for nesting the rectangles to overcome, e.g., aspect ratio problems where the rectangles get too thin.
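The basic idea of such treemap layouts can be sketched with the original slice-and-dice scheme, where each level splits its parent's rectangle proportionally to the node weights, alternating the split direction (the dictionary tree encoding is illustrative):

```python
# Minimal slice-and-dice treemap sketch: rect = (x, y, width, height).
def treemap(node, rect, depth=0):
    x, y, w, h = rect
    out = {node["name"]: rect}
    children = node.get("children", [])
    total = sum(c["weight"] for c in children)
    offset = 0.0
    for c in children:
        frac = c["weight"] / total           # child's share of the area
        if depth % 2 == 0:                   # even depth: split horizontally
            sub = (x + offset * w, y, frac * w, h)
        else:                                # odd depth: split vertically
            sub = (x, y + offset * h, w, frac * h)
        offset += frac
        out.update(treemap(c, sub, depth + 1))
    return out

tree = {"name": "root", "weight": 3, "children": [
    {"name": "a", "weight": 2}, {"name": "b", "weight": 1}]}
rects = treemap(tree, (0, 0, 100, 100))
```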

Voronoi Treemaps [6, 5] are similar to the rectangular approach but use polygons instead of rectangles which results in many aesthetic and perceptual advantages.

Voronoi Treemaps are more flexible since changes can be accommodated more easily, without a complete reorganisation of the layout. They also have a good aspect ratio, which makes it easier for the user to perceive the area sizes. This leads to our decision to use Voronoi Treemaps to represent the hierarchical structure.


Figure 3.2.: (a) Rectangular Treemap; (b) Voronoi Treemap (from [6])

3.2.2. Multidimensional Scaling

Looking at Fig. 3.3a one can see two filled regions representing two categories which belong together because they contain many similar documents. If the regions of these two categories were closer to each other, the user could more easily understand the data or mentally divide the whole space into abstract regions concerned with certain topics. Using MDS to place similar objects at low Euclidean distance leads to Fig. 3.3b. This also reduces the clutter of the visualization, since possible edges between objects in the regions are shorter. There are two cases where MDS is useful.

Preprocessing time: The preprocessing gives us the possibility to lay out the hierarchy according to all document similarities. For this purpose PivotMDS combined with Stress Majorization seems to be a good choice [14]. Note that this is done only once and that a high iteration count can be used.

Query time: At query time we get a fixed number of results after entering enough keywords. For these results we have positions which were created in the preprocessing step by considering all documents. But now only the small part of the document graph induced by the results is visible. By using Stress Majorization we can adapt the layout to more strongly reflect the induced graph while maintaining the previous locations using anchoring, which will be explained in Chapter 5. The layout is thus adapted to the query but still keeps the mental map which was created in the preprocessing step. Since the number of results shown is not very high, this computation should be fast enough for a quick response.

Figure 3.3.: The filled regions are similar and thus should have low Euclidean distance. (a) normal Centroidal Voronoi Tessellation without MDS; (b) Centroidal Voronoi Tessellation combined with MDS
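The quantity optimized at query time can be sketched as the usual stress function plus an anchoring penalty toward the preprocessed positions X0; alpha and the weighting w_ij = d_ij^-2 are illustrative choices here, not the exact formulation of Chapter 5:

```python
import numpy as np

def anchored_stress(X, D, X0, alpha=0.5):
    """Stress of layout X w.r.t. target distances D, plus an anchoring
    term penalizing deviation from the precomputed positions X0."""
    n = len(X)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            s += (d - D[i, j]) ** 2 / D[i, j] ** 2   # weights w_ij = d_ij^-2
    s += alpha * np.sum((X - X0) ** 2)               # anchoring penalty
    return s
```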

3.2.3. Dependency Visualization

Although the hierarchy is an important part of the visualization, it is also beneficial for the user to see the direct relations between documents. The obvious use of straight lines does not solve this problem sufficiently, since it creates a lot of confusion. Orthogonal edge routing also does not seem to be a good choice: it would simply create more chaos in the visualization because there are no other orthogonal lines.

Holten’s edge bundling method [47] instead routes the edges using nice-looking curves and still considers a given hierarchical structure. The idea is to use the centers of the objects which lie on the shortest path between two connected nodes of the hierarchy to draw appealing curves. If the center of an element in the hierarchy is used by several edges, this leads to a bundling of these edges.

Highlighting certain edges when hovering over a node with the mouse further compensates for the ambiguity of individual edges, which would not be clearly distinguishable within a bundle.
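Holten's control-polygon construction can be sketched as follows. The tiny tree, the coordinates and the helper names below are invented for illustration, and a single Bézier curve (evaluated with De Casteljau's algorithm) stands in for the piecewise B-splines used in the actual method:

```python
# Sketch of Holten-style hierarchical edge bundling: the control polygon of an
# edge (u, v) consists of the centers of the hierarchy nodes on the path
# u -> LCA(u, v) -> v; a smooth curve is then drawn through these points.

parent = {"d1": "c1", "d2": "c2", "c1": "root", "c2": "root", "root": None}
center = {"d1": (0.0, 0.0), "c1": (1.0, 2.0), "root": (3.0, 4.0),
          "c2": (5.0, 2.0), "d2": (6.0, 0.0)}

def path_to_root(v):
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path

def control_polygon(u, v):
    """Centers of the hierarchy elements along u -> LCA(u, v) -> v."""
    up, vp = path_to_root(u), path_to_root(v)
    ancestors = set(up)
    lca = next(x for x in vp if x in ancestors)
    left = up[:up.index(lca) + 1]                # u ... lca
    right = list(reversed(vp[:vp.index(lca)]))   # (below lca) ... v
    return [center[x] for x in left + right]

def bezier(points, t):
    """De Casteljau evaluation of a Bezier curve at parameter t."""
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

poly = control_polygon("d1", "d2")           # passes through the root's center
curve = [bezier(poly, i / 10) for i in range(11)]
```

Note that Fig. 3.4 distinguishes control polygons with and without the LCA; dropping the LCA's center from `poly` before curve evaluation yields the second variant.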


Figure 3.4.: Hierarchical Edge Bundling. (a) Control polygon with LCA (Least Common Ancestor). (b) Control polygon without LCA.

3.3. Design & Color scheme

Designing good visualizations is not an easy task. There are many things which can go wrong and lead a viewer to draw false conclusions. Even when human perception is taken into account it remains difficult, since the interpretation of a visualization depends on the cultural and personal background of the viewer.

We choose blue for the nesting polygons because blue is a more neutral color than e.g. red, yellow or green, and take a pleasant scheme from ColorBrewer [42]. Instead of using overly striking colors for the edges we propose semi-transparent, light colors: the user's attention should not be dominated by the edges alone.


Figure 3.5.: Representation of the hierarchy with nested polygons using the blue color scheme. (a) Polygons without alpha blending; the lower edge shows the highlighting color. (b) Polygons with alpha blending on a black background; the lower edge shows the highlighting color.

Figure 3.6.: Using a darker color of the parent hierarchy level for the border makes the contour visible even for objects which are too small to be rendered sharply.


4. Preprocessing

Figure 4.1.: The XML input is preprocessed and the corresponding index structures are stored. The documents are cleaned in several filtering steps, and the document graph is created using the cosine similarity of the tf-idf vectors. Graph drawing techniques then determine positions such that similar documents are close together. The dotted boxes represent the data which is stored after preprocessing for later use at query time.

4.1. Data Input

Since document collections can have different formats, an input interface has to be defined which describes the expected format. This schema is defined using XML. Figure 4.2a shows the desired schema as XSD [78] and Fig. 4.2b shows an example of a small data set. The hierarchy is represented by nesting category nodes, and the text of a document is represented as the content of a document node.
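For illustration, an input document following the described structure could look as follows; the root element name and the attributes are only assumptions, since the authoritative definition is the XSD in Fig. 4.2a:

```xml
<!-- Illustrative sketch of the nesting described in the text: category
     nodes carry the hierarchy, document nodes carry the text. Element and
     attribute names other than category/document are hypothetical. -->
<collection>
  <category name="Computer Science">
    <category name="Graph Drawing">
      <document id="d1">Text of the first document ...</document>
    </category>
    <document id="d2">Text of the second document ...</document>
  </category>
</collection>
```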


Figure 4.2.: (a) XML Schema of the desired input. (b) Example of a data set with two documents.

4.2. Document Similarity

Determining document similarities requires combining several different fields such as Natural Language Processing, Information Retrieval and Data Mining. Proper understanding of textual documents and their similarity depends on many factors. World knowledge plays a big role, since e.g. anaphora resolution often requires additional information on the textual background. The understanding of similarity can also vary with the personal or cultural background of the reader.

There are different approaches which model document similarity [22, 23, 57, 66].

As an example, we decided to use the simple word-based Vector Space Model. Although credit for the development of the vector space model for information retrieval often goes to Salton through citations of a nonexistent article from 1975, it was developed over a longer time period; see [26] for details.

Other approaches such as Latent Semantic Analysis [23] could be used as well and possibly yield better results in terms of human similarity measure; they could easily be integrated.

In our case documents consist of a sequence of words. Layout information, which may be used in e.g. PDF files, is not considered.


Vector Space Model: A document $d_j \in D$ is represented as a vector

$$d_j = (w_{1j}, w_{2j}, \ldots, w_{mj}),$$

where $w_{ij} \in \mathbb{R}^+$, $i \in \{1, \ldots, m\}$, is a weight for a word (also called term) $t_i \in T$, and $m = |T|$ is the size of the vocabulary, i.e. the number of distinct words over all documents. Each term is represented by one dimension.

The weight describes the importance of a term in the document. One could e.g. use a weight of zero if the term is not contained in the document and one if it is. A better method is the tf-idf weighting, which exploits the fact that words contained in every document do not help in discriminating between documents: rare terms may be more informative than frequent terms. We thus define the weights as the product of the term frequency tf and the inverted document frequency idf:

$$w_{ij} = \log(1 + tf_{ij}) \times \log\frac{|D|}{df_i},$$

where $tf_{ij}$ describes the frequency of term $t_i$ in document $d_j$ and $df_i$ the number of documents in which term $t_i$ is contained. The logarithm dampens the raw counts, so that very frequent terms do not overwhelm the infrequent ones.

The weights define the so-called term-document matrix, a $|T| \times |D|$ matrix. Although one could use the Euclidean distance to define a similarity for two document vectors, this is not done, because the Euclidean distance may be large even though the term distributions of the documents are nearly the same. Instead, the angle between two document vectors is used to describe the similarity. Later we will use a threshold for the determination of the relations. Since the cosine decreases monotonically as the angle increases from $0$ to $\pi$, we get for two document vectors $\vec d, \vec e \in D$:

$$\cos(\vec d, \vec e) = \frac{\vec d \cdot \vec e}{|\vec d|\,|\vec e|} = \frac{\sum_{i=1}^{|T|} d_i e_i}{\sqrt{\sum_{i=1}^{|T|} d_i^2}\,\sqrt{\sum_{i=1}^{|T|} e_i^2}}.$$

Note that the term-document matrix is very sparse, i.e. most entries are zero. The number of distinct terms, and thus the dimensionality of the vector space, is also very high. This tends to produce low variance in the similarity measure. To at least partly reduce this problem, noise is removed by applying filtering techniques to the data; Fig. 4.1 shows when this step takes place.

Filtering steps:

• Transform upper case to lower case

• Remove punctuation marks

• Remove stopwords

• Identify and merge different word forms using stemming [17]

• Remove terms which occur in only one document
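The weighting and similarity computation described above can be sketched in a few lines. This is an illustrative toy pipeline with invented helper names; stemming, punctuation handling and the single-document-term filter are omitted for brevity:

```python
# Toy sketch of the described pipeline: lower-casing, stopword removal,
# tf-idf weights w_ij = log(1 + tf_ij) * log(|D| / df_i), cosine similarity.
import math
import re

STOPWORDS = {"the", "a", "of", "is"}

def tokenize(text):
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vectors(docs):
    toks = [tokenize(d) for d in docs]
    df = {}
    for ts in toks:
        for t in set(ts):
            df[t] = df.get(t, 0) + 1
    n = len(docs)
    vecs = []
    for ts in toks:
        vec = {}
        for t in set(ts):
            vec[t] = math.log(1 + ts.count(t)) * math.log(n / df[t])
        vecs.append(vec)
    return vecs

def cosine(d, e):
    dot = sum(w * e.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    ne = math.sqrt(sum(w * w for w in e.values()))
    return dot / (nd * ne) if nd and ne else 0.0

docs = ["graph drawing of trees", "drawing graph layouts", "cooking pasta recipes"]
vecs = tfidf_vectors(docs)
# The two graph-drawing documents are more similar to each other than to the third.
```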


Runtime Analysis: First the tf-idf matrix has to be created, then the pairwise document distances have to be computed. Creating the matrix clearly needs $O(|D| \cdot |T|)$ time and space. Although the number of words is theoretically unbounded due to word composition, in practice we can assume that $|T|$ is constant. The computation of the pairwise distances then needs $O(|D|^2)$ time and space. The peak space usage might be reduced by storing only important distances and ignoring non-similar relations.

4.3. Search Index

To be able to respond to the user very quickly, an index has to be built. We solve this by using Apache Lucene [3], a high-performance full-text search engine library. It is used in many applications which require full-text search. The fact that it is written in Java is also useful, since it works on different platforms.

It is well structured, so that one can easily reimplement individual components and adapt them to special needs. See [40] for more on Lucene.

Runtime Analysis: Creating a search index for $|D|$ documents needs $O(|D| \log |D|)$ time and $O(|D|)$ space using balanced trees.
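Conceptually, the index maps each term to the list of documents containing it. The following toy inverted index (a sketch, not Lucene's actual data structures) illustrates why a keyword lookup is fast:

```python
# Toy inverted index: a sorted term list (standing in for the balanced tree)
# plus postings lists, so a keyword query only touches the documents that
# actually contain the term.
from bisect import bisect_left

def build_index(docs):
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    terms = sorted(postings)
    return terms, postings

def lookup(index, term):
    terms, postings = index
    i = bisect_left(terms, term)        # O(log |T|) term lookup
    if i < len(terms) and terms[i] == term:
        return postings[terms[i]]
    return []

index = build_index(["voronoi treemap layout", "treemap hierarchy", "edge bundling"])
```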

4.4. Mental Map creation

Let $G_H = (V \cup C, E \cup H, r)$ be a hierarchically clustered graph. The document graph $G = (V, E)$ is determined using document similarity. Note that it could also be given by metadata defining that two documents are related.

Although this surely does not capture the full complexity of a mental map, we define it as the relative positions of the hierarchical clusters in $C$ with respect to the centroid of their siblings. There are two possibilities for obtaining the mental map:

Mental map as manual input: It may be the case that predefined relative positions, created manually for the clusters, already exist. If this is the case, these positions can be used directly for the layout process described later.

Mental map based on similarity: If no concrete mental map is given, we propose to determine the relative positions, and thus the mental map, according to the document graph. If a relation exists between two documents in different clusters, then these two clusters are related to each other. Since we want to use area inclusion for representing the hierarchical structure, we compute relative positions for the children of each cluster in $C$. These relative positions are later used to determine a layout which positions the clusters, and thus the hierarchy, preferably in the same way for each query. Algorithm 1 describes the computation of the relative vectors.


Algorithm 1: Computation of relative positions
Input: $G_H = (V \cup C, E \cup H, r)$
Output: relative vector $\vec{rv}_v$ for each node $v \in V \cup C$

1  for $c \in C$ do
2      for $v, w \in children(c)$ do
3          $dist_{vw} \leftarrow 0$
4  for $e = (v', w') \in E$ do
5      determine $v, w \in children(LCA(v', w'))$ such that $v$ lies on $path(LCA(v', w'), v')$ and $w$ lies on $path(LCA(v', w'), w')$
6      $dist_{vw} \leftarrow dist_{vw} + 1$
7  for $c \in C$ do
8      compute positions of $children(c)$ using MDS with $\frac{1}{dist_{vw}}$ as pairwise distance for $v, w \in children(c)$   // for $dist_{vw} = 0$ use a high distance and a low weight
9      $centroid \leftarrow$ average of the positions of $children(c)$
10     for $child \in children(c)$ do
11         $\vec{rv}_{child} \leftarrow$ position of $child$ $-$ $centroid$
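The edge-counting part of Algorithm 1 (lines 1-6) can be sketched as follows. The tiny hierarchy and the helper names are invented for illustration, and the MDS positioning step (lines 7-11) is omitted:

```python
# Each document edge is lifted to the pair of children of the LCA of its
# endpoints; the per-pair counts later serve as inverse distances for MDS.

parent = {"d1": "c1", "d2": "c1", "d3": "c2", "c1": "root", "c2": "root"}

def path_to_root(v):
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lift_edge(u, v):
    """Return the children of LCA(u, v) that contain u and v, respectively."""
    up, vp = path_to_root(u), path_to_root(v)
    ancestors = set(up)
    lca = next(x for x in vp if x in ancestors)
    return up[up.index(lca) - 1], vp[vp.index(lca) - 1]

def count_sibling_relations(edges):
    dist = {}
    for u, v in edges:
        a, b = lift_edge(u, v)
        key = tuple(sorted((a, b)))
        dist[key] = dist.get(key, 0) + 1
    return dist

# ("d1","d3") and ("d2","d3") lift to the sibling pair (c1, c2) below the
# root; ("d1","d2") lifts to the pair (d1, d2) below c1.
counts = count_sibling_relations([("d1", "d3"), ("d2", "d3"), ("d1", "d2")])
```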

Theorem 4.1. Algorithm 1 can be implemented to run in time $O(k \cdot n^2 + m \log n_c)$, where $k$ is the maximum number of iterations of the MDS step, $n = |V \cup C \setminus \{r\}|$, $m = |E|$ and $n_c = |C|$.

Proof. The first for-loop in line 1 iterates over all nodes in $C$ and then over all pairs of children, i.e. quadratically. Looking at all children means looking at all nodes of the hierarchically clustered graph. Let $C = \{c_1, c_2, \ldots, c_k\}$ and $n = |V \cup C \setminus \{r\}|$. Further let $x_i \in \mathbb{N}$ be the number of children of node $c_i \in C$. We then have

$$\sum_{i=1}^{k} x_i = n. \qquad (4.1)$$

Note that the square of a sum with positive summands is at least the sum of the squares of the summands:

$$(x_1 + \cdots + x_k)^2 = x_1(x_1 + \cdots + x_k) + \cdots + x_k(x_1 + \cdots + x_k) = \sum_{i=1}^{k} x_i^2 + \underbrace{\sum_{i \neq j} x_i x_j}_{\geq 0}. \qquad (4.2)$$

The for-loop in line 1 thus needs $O(n^2)$ operations:

$$x_1^2 + x_2^2 + \cdots + x_k^2 \overset{\text{Eq. }(4.2)}{\leq} (x_1 + \cdots + x_k)^2 \overset{\text{Eq. }(4.1)}{=} n^2 \in O(n^2). \qquad (4.3)$$


The second for-loop in line 4 clearly needs $O(m \log n_c)$ time, where $m = |E|$ and $n_c = |C|$. Note that we assume the hierarchy to be of logarithmic height; otherwise it would not be useful for navigation. The third for-loop in line 7 needs $O(k \cdot n^2)$ time, where $k$ is the number of iterations of the MDS step, which will be explained in Chapter 5. The proof is the same as for the first for-loop.

Overall runtime analysis: The overall runtime is dominated by the similarity computation and the creation of the relative positions. The preprocessing thus needs $O(k \cdot n^2 + m \log n_c)$ time, where $n$ is the total number of document and hierarchy elements, $m$ the number of relations between documents, $n_c$ the number of hierarchy clusters and $k$ the number of iterations of the MDS step.


5. Layout

Figure 5.1.: Overview of the layout steps. The search query yields the corresponding results, which are then visualized in a view of the search space adapted to the query.


In contrast to the preprocessing, we have to give fast responses at search time. This section describes the main steps of the layout procedure, which results in a Voronoi Treemap representation of the important parts of the hierarchy, combined with a graph view of the resulting documents. The layout tries to keep the positions of the mental map and thus gives the user a view of the search space adapted to the query. When the number of document results is small enough, the corresponding document graph is created. The nodes are placed according to their mental map positions, and Stress Majorization is used to move them closer to the nodes they are connected to while keeping the layout stable. Handling node overlap improves the clarity, and as a last step the edges are bundled using the underlying hierarchical structure, which reduces clutter; see Fig. 5.1.

5.1. Multidimensional Scaling with Anchoring

Multidimensional Scaling (MDS) is concerned with the geometric positioning of objects whose pairwise similarities (or dissimilarities) are given. The objects should be positioned such that the Euclidean distance between two objects represents their similarity (or dissimilarity). This technique is used several times in our visualization: in the preprocessing step to obtain initial layouts in which similar objects have similar positions, and in a post-processing step to influence the layout according to the results of the search query. More precisely, we will use the Stress Majorization approach [38] to reach the goal of MDS. Techniques for drawing special graph classes such as directed acyclic graphs cannot be used in our case, since we cannot constrain our data to have such properties.

Let $V = \{1, \ldots, n\}$ be a set of $n$ objects and let $D \in \mathbb{R}^{n \times n}$ be a square matrix of dissimilarities for each pair of objects in $V$:

$$D := \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}.$$

The goal of MDS is to find a matrix $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ of $d$-dimensional positions $x_1, \ldots, x_n \in \mathbb{R}^d$ such that

$$\|x_i - x_j\| \approx d_{ij} \qquad (5.1)$$

for all $i, j \in V$ is met as closely as possible. Note that since we are interested in a two-dimensional layout, it is enough to consider $d = 2$. In the case of graph drawing, the distances $d_{ij}$ reflect the graph-theoretic distance of two nodes $i$ and $j$. Since finding an optimal solution in graph drawing with fixed edge lengths is NP-hard in general [29], as are other graph drawing problems [51], different approaches exist to find local optima. More on MDS can be found in [9, 21].


Brandes et al. [14] conducted an experimental study on distance-based graph drawing which showed that using classical scaling as initialization and then improving the layout by minimizing the stress with Stress Majorization yields the best results in general. While classical scaling in the first step creates layouts with good representations of long distances, the second step improves local details of the layout. For the first step one should use PivotMDS [13], an approximation of classical scaling which scales very well to large graphs, since it needs only linear time.

In the following we describe Stress Majorization, which was introduced to graph drawing by Gansner, Koren and North to minimize the stress [38]. Note that the study of Brandes et al. [14] also showed that this technique is superior to force-directed methods, since it converges faster and yields better results with the same implementation effort.

As in general MDS, for each pair of nodes $i, j \in V$ there is an ideal distance $d_{ij} \in \mathbb{R}^+$. A $d$-dimensional layout is given by an $n \times d$ matrix $X$; node $i$ has position $X_i \in \mathbb{R}^d$, and the axes of the layout are $X^{(1)}, \ldots, X^{(d)} \in \mathbb{R}^n$. The deviation from the ideal distances causes the so-called stress [54]:

$$\text{stress}(X) = \sum_{i<j} w_{ij} \left(\|X_i - X_j\| - d_{ij}\right)^2. \qquad (5.2)$$

Credit for the stress in Eq. (5.2) also goes to Kruskal [54], who used it already in 1964. The challenge is to find a layout which minimizes the stress. Here $w_{ij}$ is the weight for a pair $i, j$ of nodes and is normally defined as $w_{ij} = \frac{1}{d_{ij}^2}$. This definition leads to better local details, because long distances are weighted less than short distances. $d_{ij}$ is in general chosen to be the graph-theoretic distance, i.e. the shortest path between two nodes. By expanding (5.2) we get

$$\text{stress}(X) = \sum_{i<j} w_{ij} d_{ij}^2 + \sum_{i<j} w_{ij} \|X_i - X_j\|^2 - 2 \sum_{i<j} \delta_{ij} \|X_i - X_j\|, \qquad (5.3)$$

where $\delta_{ij} = w_{ij} d_{ij}$.

Using the Cauchy-Schwarz inequality (see [38] for details) one can bound the stress function (5.3) from above by a quadratic majorant, such that $\text{stress}(X) \leq F_Z(X)$, where $Z$ is a constant $n \times d$ matrix:

$$F_Z(X) = \sum_{i<j} w_{ij} d_{ij}^2 + \text{Tr}(X^T L^w X) - 2\,\text{Tr}(X^T L^Z Z). \qquad (5.4)$$

Differentiating with respect to $X$ leads to a solvable system of equations, whose solution is optimal at least for the current majorant:

$$L^w X = L^Z Z. \qquad (5.5)$$

Eq. (5.5) can be solved separately for each of the axes: $L^w X^{(a)} = L^Z Z^{(a)}$, where $a = 1, \ldots, d$ denotes the dimension. A new layout is determined iteratively until convergence; each layout has minimal stress for the corresponding majorant. The local minimum reached at the end is of course determined by the initial layout.


Figure 5.2.: Blue nodes represent documents which were anchored at their original position (dotted line). The original positions of the mental map are adapted to the query results: similar documents move closer together but still try to keep their original position.

Intuitive Interpretation: Solving Eq. (5.5) requires an equation solver, which is more involved than moving a node to the equilibrium of some forces as force-directed (spring embedder) methods do. Gansner et al. [38] therefore also gave an intuitive interpretation of Stress Majorization, which results from fixing all nodes except some node $i$: each node $j$ votes for its desired placement of node $i$, and the new position of $i$ is the weighted average

$$X_i^{(a)} \leftarrow \frac{\sum_{j \neq i} w_{ij} \left( X_j^{(a)} + d_{ij}\,(X_i^{(a)} - X_j^{(a)})\,\text{inv}(\|X_i - X_j\|) \right)}{\sum_{j \neq i} w_{ij}}, \qquad (5.6)$$

where $a = 1, \ldots, d$ and $\text{inv}(x) = 1/x$ for $x \neq 0$ and $0$ otherwise. This realization is also called Localized Optimization. The advantage of this method is that a node $j$ with $w_{ij} = 0$ does not have to be considered in the computation, which leads to faster computation for sparse weight matrices.

5.1.1. Anchoring nodes

The idea of anchoring is to stabilize the iteratively improving layout by inserting dummy nodes and linking them to existing nodes in the graph. By fixing the dummy nodes at their positions, the graph is anchored at these points. Depending on the weights of the anchoring edges, it is possible to control how much the positions of the linked nodes, and thus also the other connected parts of the graph, are allowed to change.

When minimizing the stress it is important to fix the dummy nodes; otherwise we would just change the structure without stabilizing the layout. Fig. 5.2 shows an effect the anchoring can have. The resulting layout is a combination of the original layout created in the preprocessing step and the influence of the query results; it can be understood as a combination of the search space with the query space. Note that the stabilized MDS layout is only computed when the result set is small enough to be handled.


Figure 5.3.: Illustration of the scaled gradient projection for an iteration of the Stress Majorization. $p_i'$ is the voted position of the $i$-th node. The movement vector is scaled according to the constraining region; $p_i''$ is the point which decreases the stress the most for the majorant in the current iteration.

5.1.2. Constrained Stress Majorization

As we want to maintain the hierarchical structure, it is also important to constrain the Stress Majorization to certain regions: each node is only allowed to be positioned within its respective Voronoi region. This is done by extending the iterative Stress Majorization with a step which projects the layout back to a valid state, as Dwyer et al. [27] propose. This technique is also called scaled gradient projection.

Algorithm 2: Stress Majorization with region constraints
Input: initial layout $P = [p_1, \ldots, p_n]^T \in \mathbb{R}^{n \times d}$, distance matrix $D$, weight matrix $W$, regions $R_i$ for $i \in \{1, \ldots, n\}$
Output: final layout $P'' = [p_1'', \ldots, p_n'']^T \in \mathbb{R}^{n \times d}$

1  repeat
2      $P' \leftarrow$ positions proposed by Stress Majorization
3      foreach $i \in \{1, \ldots, n\}$ do
4          $\alpha \leftarrow$ scale factor for $\overrightarrow{p_i p_i'}$ such that constraint $R_i$ is not violated   // projection of $p_i'$ to the valid region
5          $p_i'' \leftarrow p_i + \alpha \cdot \overrightarrow{p_i p_i'}$
6  until relative change negligible or maximal iteration count reached

As Algorithm 2 describes, in each iteration of the Stress Majorization the positions of the nodes are projected to valid regions of the corresponding Voronoi cell. The projection only works if the given initial layout does not violate the constraints, which is guaranteed by placing the nodes according to their relative positions. The scaling factor is determined by computing the intersection point with the corresponding region; see Fig. 5.3 for an illustration.
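The projection step of Algorithm 2 can be sketched with a circular region standing in for the convex Voronoi cell. The bisection used to find the scale factor $\alpha$ is a simplification chosen here for brevity; a closed-form circle-segment intersection (or, for a polygonal cell, segment-edge intersections) would do the same:

```python
# Scaled gradient projection sketch: the proposed move p -> p_new is scaled by
# the largest alpha in [0, 1] that keeps the node inside its region (a disk).
import math

def project_move(p, p_new, center, radius):
    if math.dist(p_new, center) <= radius:
        return p_new                       # proposal already valid, alpha = 1
    # Bisect for the largest alpha with |p + alpha*(p_new - p) - center| <= radius.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        q = (p[0] + mid * (p_new[0] - p[0]), p[1] + mid * (p_new[1] - p[1]))
        if math.dist(q, center) <= radius:
            lo = mid
        else:
            hi = mid
    return (p[0] + lo * (p_new[0] - p[0]), p[1] + lo * (p_new[1] - p[1]))

p = (0.0, 0.0)                 # current position, inside the region
proposed = (3.0, 0.0)          # stress-majorization proposal, outside
q = project_move(p, proposed, (0.0, 0.0), 1.0)   # clipped to the boundary
```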


5.2. Voronoi Treemap

Figure 5.4.: Voronoi Treemap of a software hierarchy (from [6])

Balzer et al. [5, 6] introduced the Voronoi Treemap as a visualization technique with some nice advantages compared to e.g. rectangle-based techniques. Note that Andrews et al. [1] had already used a similar technique three years earlier by nesting Voronoi diagrams: they positioned cells of higher importance toward the outside of a region, resulting in Voronoi cells with larger area. The contribution of Balzer et al. was thus to combine Lloyd's method with the weighted Voronoi diagram to obtain a more flexible technique with better aspect ratios.

The goal of a Voronoi Treemap is to map a hierarchy into the plane and to use the available space completely. By nesting polygons within each other, the hierarchy is represented with the area-inclusion metaphor. It is not enough, however, to nest the polygons arbitrarily: the area of each polygon has to reflect the desired importance of the corresponding node in the hierarchy. In this section we describe in detail how a Voronoi Treemap can be constructed using a Monte Carlo method and using the analytical approach.


Figure 5.5.: A hierarchy (a) which is used to create a Voronoi Treemap (d). The initial positions of the first hierarchy layer (blue nodes) are used to generate a Voronoi diagram (b)-(c). Each resulting region is then used for the child nodes of the second layer.


5.2.1. Voronoi diagram

Figure 5.6.: Voronoi diagram of a set of sites in the plane. Outer cells are not bounded.

A Voronoi diagram is defined by a set of points $S$ in the plane which divide the plane into nonempty regions with certain characteristics. The points in $S$ are also called sites in the literature. Each site has exactly one contiguous region and each region belongs to exactly one site. For every point in a region, the corresponding site of the region is the nearest among all sites in $S$. There are many ways to formally define a Voronoi diagram; since we will compute it analytically with Fortune's algorithm [36], we use Fortune's definition.

Let $S = \{p_1, p_2, \ldots, p_n\}$ be a set of $n$ distinct points in the plane. Each site $p \in S$ has Cartesian coordinates $p = (p_x, p_y) \in \mathbb{R}^2$. The set of points is finite and has to divide the plane into at least two nonempty regions, so $2 \leq n < \infty$. Although it may already be clear from the fact that $S$ is a set, it is important to note that the sites in $S$ are distinct: $\forall p, q \in S, p \neq q: (p_x \neq q_x) \vee (p_y \neq q_y)$.

The Euclidean distance between two points $p = (p_x, p_y) \in \mathbb{R}^2$ and $q = (q_x, q_y) \in \mathbb{R}^2$ is defined as

$$e(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}. \qquad (5.7)$$

For $p \in S$ and $z \in \mathbb{R}^2$ let the functions $d, d_p : \mathbb{R}^2 \to \mathbb{R}$ be

$$d_p(z) = e(p, z) \qquad (5.8)$$

and

$$d(z) = \min_{p \in S} d_p(z). \qquad (5.9)$$

The bisector $B_{pq}$ of $p, q \in S$ is

$$B_{pq} = \left\{ z \in \mathbb{R}^2 \mid d_p(z) = d_q(z) \right\}, \qquad (5.10)$$

and it separates the two half-planes $R_{pq} = \{z \in \mathbb{R}^2 \mid d_p(z) \leq d_q(z)\}$ and $R_{qp} = \{z \in \mathbb{R}^2 \mid d_p(z) \geq d_q(z)\}$.


The Voronoi region of a site $p \in S$ is thus given by

$$R_p = \bigcap_{q \in S \setminus \{p\}} R_{pq}. \qquad (5.11)$$

The Voronoi diagram $V$ of a set $S$ can be described as the union of the region boundaries,

$$V(S) = \bigcup_{p \in S} \partial R_p, \qquad (5.12)$$

or in a more direct way as

$$V(S) = \left\{ z \in \mathbb{R}^2 \mid \exists p, q \in S: p \neq q \wedge d(z) = d_p(z) = d_q(z) \right\}. \qquad (5.13)$$

This means that all points which have at least two closest sites, and thus lie on a bisector, are part of the Voronoi diagram. As we can see in Fig. 5.6, a Voronoi diagram consists of lines, half-lines and segments.

Theorem 5.1. The complexity of a Voronoi diagram is in O(n).

Proof. For a Voronoi diagram with $n$ sites one might think that there could be $O(n^2)$ bisectors in the worst case. To show that this is not the case, we consider the Voronoi diagram as a planar graph and analyse its complexity. Unfortunately, the diagram contains half-lines, which do not exist in graph theory. To fix this, we connect all half-lines to a new node which is infinitely far away, see Fig. 5.7. Note that if two half-lines intersected arbitrarily far away from the sites, they would not be half-lines but two segments leading into one half-line after their intersection.

Figure 5.7.: Connecting the half-lines of a Voronoi diagram to a vertex which lies at infinity, in order to obtain a connected planar graph with a number of edges linear in $n$.

Given a graph $G = (V, E)$ which is connected (no node is isolated) and planar, we can draw it in the plane without two edges intersecting. We have $v = |V|$ and $e = |E|$. Let $f$ be the number of faces, i.e. the number of regions we would get by cutting the plane along the edges. Euler's formula tells us that

$$v - e + f = 2. \qquad (5.14)$$

For more information on Euler's formula and planar graphs we refer the reader to [60].

Having constructed the graph from our Voronoi diagram, we see that the number of faces is $f = n$. Let $n_v$ be the number of vertices of the Voronoi diagram, i.e. the points in which at least two bisectors intersect. Since we added one extra node at infinity, we have $v = n_v + 1$. Each edge belongs to two sites, and each node except the infinity node is incident to at least three edges. This implies $2e \geq 3n_v = 3(v - 1) = 3v - 3$. Using Euler's formula we get

$$\begin{aligned} v - e + f &= 2 \\ 3v - 3e + 3f &= 6 \\ (3v - 3) + 3 - 3e + 3f &= 6 \\ 2e + 3 - 3e + 3f &\geq 6 \\ -e + 3f &\geq 3 \\ -e + 3n &\geq 3 \\ e &\leq 3n - 3. \end{aligned}$$

This means that the number of edges in a Voronoi diagram is linear in the number of sites.

5.2.2. Additive weighted Voronoi diagram

Let $S$ be a set of $n$ sites. The weight of a site $p \in S$ is given by $w_p \in \mathbb{R}^+$. It turns out that the additive weighted (AW) Voronoi diagram is identical to the Voronoi diagram of circles, for which the metric is the Euclidean distance to the circle if the point lies outside the circle, and the negative Euclidean distance to the circle if the point lies inside. The radius $r_p$ of the circle $C_p$ with center $p$ is defined as $r_p = W - w_p$, where $W = \max_{q \in S} w_q$. The distance of a point $z \in \mathbb{R}^2 \setminus S$ to a site $p$ is defined as

$$e_p(z) = e(p, z) - r_p. \qquad (5.15)$$

Definition 5.2. Let $p, q \in S$, $p \neq q$. We say that $p$ dominates $q$ if $e(p, q) + w_p \leq w_q$. If neither $p$ dominates $q$ nor $q$ dominates $p$, then there exists a bisector $B_{pq}$. The bisector $B_{pq}$ and its two regions $R_{pq}$ and $R_{qp}$ are

$$B_{pq} = \{z \in \mathbb{R}^2 \mid e_p(z) = e_q(z)\}, \qquad (5.16)$$
$$R_{pq} = \{z \in \mathbb{R}^2 \mid e_p(z) \leq e_q(z)\}, \qquad (5.17)$$
$$R_{qp} = \{z \in \mathbb{R}^2 \mid e_p(z) \geq e_q(z)\}. \qquad (5.18)$$


Note that if $e_p(q) = w_q$ we define $B_{pq} = \emptyset$; otherwise we would have a half-line with endpoint in $q$ instead of a region. If $p$ dominates $q$, we have $B_{pq} = \emptyset$, $R_{pq} = \mathbb{R}^2$ and $R_{qp} = \emptyset$.
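A small sketch of the weighted distance (5.15) and the domination test of Definition 5.2, following the text's convention $r_p = W - w_p$ (so the site with the smaller weight gets the larger circle); the two sites are invented for illustration:

```python
# Additive-weighted distance e_p(z) = e(p, z) - r_p and domination test.
import math

def aw_radii(sites):
    """sites: dict point -> weight; returns the radii r_p = W - w_p."""
    W = max(sites.values())
    return {p: W - w for p, w in sites.items()}

def e_p(z, p, radii):
    return math.dist(z, p) - radii[p]     # signed distance to the circle around p

def dominates(p, q, sites):
    # Definition 5.2: p dominates q if e(p, q) + w_p <= w_q.
    return math.dist(p, q) + sites[p] <= sites[q]

sites = {(0.0, 0.0): 1.0, (0.5, 0.0): 5.0}
radii = aw_radii(sites)
# With r_p = W - w_p the low-weight site has the large circle: here
# e(p, q) + w_p = 0.5 + 1.0 <= 5.0, so its nearby neighbor's region is empty.
```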

The Voronoi region of a site $p \in S$ is again given by

$$R_p = \bigcap_{q \in S \setminus \{p\}} R_{pq}, \qquad (5.19)$$

and the Voronoi diagram $V$ of the set $S$ can be described as the union of the region boundaries,

$$V(S) = \bigcup_{p \in S} \partial R_p, \qquad (5.20)$$

or in a more direct way as

$$V(S) = \left\{ z \in \mathbb{R}^2 \mid \exists p, q \in S: p \neq q \wedge e_p(z) = e_q(z) = \min_{s \in S} e_s(z) \right\}. \qquad (5.21)$$

Centroidal Voronoi Tessellation: A Voronoi diagram is also called a Voronoi tessellation. A Centroidal Voronoi Tessellation is a special Voronoi diagram in which each site coincides with the centroid of its Voronoi region.

The centroid of $k$ points $p_1, p_2, \ldots, p_k \in \mathbb{R}^2$ in the plane is defined as

$$\text{Centroid}(\{p_1, \ldots, p_k\}) = \frac{p_1 + p_2 + \cdots + p_k}{k}. \qquad (5.22)$$

For a region $R_p \subseteq \mathbb{R}^2$ the centroid can be formulated as

$$\text{Centroid}(R_p) = \frac{\int_{R_p} \rho(x)\,x\;dx}{\int_{R_p} \rho(x)\;dx}, \qquad (5.23)$$

where $\rho$ is the density function, which in our case is constantly $1$. A Centroidal Voronoi Tessellation with regions $R_p$, $p \in S$, gives local minima of the following energy function [39]:

$$E(V(S), S) = \sum_{p \in S} \int_{R_p} \rho(x)\,e(x, p)^2\;dx. \qquad (5.24)$$

Centroidal Voronoi Tessellations can be derived iteratively using Lloyd's method [43], which for a given set $S$ of sites repeats the following steps:

• Compute $V(S)$

• $S \leftarrow \{\text{Centroid}(R_p) \mid p \in S\}$

Fig. 5.8a shows the initial site configuration and the movement of the sites under Lloyd's method. As we can see in Fig. 5.8b, the movements of the sites are not arbitrary but local, which we will exploit to obtain our layout.
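Lloyd's method can be sketched discretely by approximating the plane with a grid of sample points: assigning each sample to its nearest site approximates the Voronoi regions, and moving each site to the centroid of its region decreases the energy of Eq. (5.24). The sample resolution and the site positions below are illustrative choices:

```python
# Discrete Lloyd iteration on the unit square.
import math

SAMPLES = [(x / 20, y / 20) for x in range(21) for y in range(21)]

def assign(sites):
    """Approximate Voronoi regions: map each site to its sample points."""
    regions = {p: [] for p in sites}
    for z in SAMPLES:
        nearest = min(sites, key=lambda p: math.dist(z, p))
        regions[nearest].append(z)
    return regions

def energy(regions):
    # Discrete version of Eq. (5.24) with density rho = 1.
    return sum(math.dist(z, p) ** 2 for p, zs in regions.items() for z in zs)

def lloyd_step(sites):
    regions = assign(sites)
    return [(sum(z[0] for z in zs) / len(zs), sum(z[1] for z in zs) / len(zs))
            for _, zs in regions.items() if zs]

sites = [(0.1, 0.1), (0.15, 0.1), (0.9, 0.85)]   # clustered initial sites
e0 = energy(assign(sites))
for _ in range(10):
    sites = lloyd_step(sites)
e1 = energy(assign(sites))   # the sites spread out and the energy decreases
```

This is the same monotone-improvement property that k-means clustering exploits; the analytical Voronoi Treemap computation replaces the sample grid with exact cell geometry.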
