

4.3.2. Independent Variables

From the various possible settings and combinations of components, the following user interface configurations were tested:

• HTML-List only

• ResultTable only

• ScatterPlot + ResultTable

• BarGraph + ResultTable

• SegmentView + ResultTable.

For a long time during system development and preparation of the evaluation, the HTML-List was not planned to be implemented or tested. It had therefore not been mentioned in earlier publications like [Mann 1999] or [Mann, Reiterer 1999]. The ResultTable, traditionally presented in the form of a list, had been considered the standard against which the visualizations would be compared. During the final preparations of the evaluation, the idea came up that the usability of an interactive JAVA-Table with possibilities for configuration and sorting might be quite different from that of a truly traditional HTML result list. The HTML version was therefore quickly implemented. It was included in the evaluation as a baseline for the usability values.

A special case in the user interface dimension is the SegmentView. As described in Chapter 4.2.3, we had implemented five variants of TileBars and StackedColumns with the idea of comparing the different implementations. Initially, the plan had been to first perform an evaluation of the different SegmentView versions and then to select the most successful one for the comparison with the other components. Due to time and resource restrictions, this intermediate step was skipped, and the users had the possibility of using the version(s) they wanted.

It is important to keep in mind for the later discussion of the usability results for the various components that what was tested is the INSYDER implementation of a ScatterPlot or the INSYDER implementation of a TileBar. Studies show that even the wording used in a user interface may influence its success [Shneiderman, Byrd, Croft 1997]. Accordingly, the results for the INSYDER implementations of certain components may or may not be comparable with evaluations of other implementations.

To avoid side effects caused by additional functions of the INSYDER system, it was modified in such a way that all functions were suppressed that were not needed to perform the task or that allowed refinement steps other than view transformations. All functions used to create new searches or to access the watch or bookmark / news functionality were removed or deactivated. When using the visualizations, the subjects had functions like zoom or mark / unmark documents available, but they did not see functions that the INSYDER system normally offers, such as generating new queries, using relevance feedback, or re-ranking existing result sets by changing, adding, or deleting keywords.

In addition, special configuration files ensured that only the Sphere of Interest and the components for their current task were available to users. (See Figure 133 for an example.)

Figure 133: Example user interface conditions for the first three tasks of group 1.
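How a test condition restricted the set of visible components can be pictured roughly as follows. The sketch below is an illustrative Java fragment, not the actual INSYDER configuration format; the class name and the component identifiers are assumptions made for the example.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only: the five tested interface conditions expressed as a
    // mapping from condition name to the components left enabled for that condition.
    // The real INSYDER configuration files use their own format; names are assumed.
    public class InterfaceConditions {

        static final Map<String, List<String>> ENABLED_COMPONENTS = Map.of(
                "HTML-List",   List.of("HtmlList"),
                "ResultTable", List.of("ResultTable"),
                "ScatterPlot", List.of("ScatterPlot", "ResultTable"),
                "BarGraph",    List.of("BarGraph", "ResultTable"),
                "SegmentView", List.of("SegmentView", "ResultTable"));

        public static void main(String[] args) {
            ENABLED_COMPONENTS.forEach((condition, components) ->
                    System.out.println(condition + " -> " + components));
        }
    }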

4.3.2.2. Target User Group

As described above, the target user group for the INSYDER system was business analysts from small- and medium-sized enterprises. The decision to choose visual structures common to standard business graphics was motivated by this fact. During the INSYDER projects, members of the target user group were involved in the definition of requirements and in formative evaluations of mock-ups and prototypes. There are several known problems with performing studies with students as subjects. Nonetheless, the summative evaluation of the visualizations was done with students from different disciplines and with university staff. This decision was possible because searching the Web is an activity not restricted to the special target user group of the INSYDER system. For many people, especially students, it is nowadays an everyday task. Moreover, because the evaluation focused on the visualization components, most of the special functions of the INSYDER system created for usage in the context of business intelligence played only a marginal role. Last but not least, business graphics are quite common in everyday life, and the visualizations implemented in the INSYDER system are simple compared to many other ideas found in the literature.

That the initial target user group of the INSYDER system was not tested does not imply that we believe the success of the visualization components to be independent of the individuals or groups using the system. Our idea was to perform the study differentiating between “beginners” and “experts”, as other studies had done before us. Using an approach comparable to that of [Golovchinsky 1997a], we had an expert group, characterized by having at least received the formal training of an Information Retrieval course at a Faculty of Information Science, and a beginners group without this formal training. According to the headings from our 5T-Environment, this differentiation could theoretically also have been classified as changing the Training dimension.

“Training” in the 5T-Environment focuses, however, on the actually used system and not on the general characteristics of the user group. To eliminate other variables as much as possible, only subjects were used for the study who had, on the one hand, at least some practical computer usage experience and, on the other hand, some basic knowledge about the World Wide Web as well as about browsing and searching it. (For a discussion of the influence of expertise on search success, see Chapter 2.3.2.)

Another important factor possibly biasing the results is the language used for the study. As already mentioned, the ranking engine of the INSYDER system works best for English and French. The tests were performed in Germany with German users. A German version of the thesaurus was not available. We therefore decided to use English keywords to rank the documents. In Chapter 3.5, a study by [Morse 1999] was mentioned. In this study, conducted in the United States of America and Norway, non-native English speakers performed more slowly in each of the visualization conditions except for the table display135. Similar results appeared in an earlier study by [Morse, Lewis, Korfhage et al. 1998]. All subjects in the INSYDER study were non-native English speakers, which might have biased the results. In order to minimize those influences, only subjects with a sufficient level of English language skills were chosen. The questions were formulated in German to eliminate problems in understanding the task. In addition, the tasks were restricted to specific and extended fact-finding tasks, performable with only basic knowledge of English.

135 The “table display” used by [Morse 1999] is completely different from the ResultTable of the INSYDER system. Whereas Morse’s table contains text only in the cell headings and “+” / “-” signs or digits in the cells, the INSYDER table is a text table.

The four users for the pre-test and the 40 additional volunteer subjects for the main study (20 beginners and 20 experts) were all recruited at the University of Konstanz, Germany. A movie voucher was offered as motivation for participating in the main study. Figure 134 and Figure 135 show the characteristics of the user population of the main study.

The experts were in most cases either students of Information Science or staff from the Information Science Group, including research assistants and one professor. Most users classified as beginners were students from other university departments, including mathematics, physics, law, and psychology.

Figure 134: User characteristics: Age, Gender, and Profession

Concerning computer and software experience, the users had to classify themselves as beginners (little experience), advanced users (some), or experts (considerable). In a second question, the users were asked how much they depend in their work on information from the Web: very much, somewhat, or none. Finally, they were asked how often they use search engines or other Information Retrieval systems: seldom to never, several times per week, or daily. Whereas computer experience and Web-dependency showed the expected differences between users classified as experts and beginners, the values for the usage of search engines or IR-systems showed surprisingly little variation.


Figure 135: User characteristics: Computer Experience, WWW Dependency, Search engine/IR-system Usage

4.3.2.3. Type and Number of Data

Visualization components for Web search results should be tested using Web search results. For our study we therefore used real data collected from the World Wide Web. In Chapter 2.3, several findings were presented about how people search the Web. Among other factors, the number of keywords used to formulate queries and the number of documents examined in the result set were discussed. To summarize, it was found that the average length of a query is around two keywords, with an increasing tendency, and that only a small number of hits is examined by the users. For the purposes of the summative evaluation of the INSYDER visualizations, we planned to perform the test with varying numbers of keywords and varying sizes of result sets. The initial plan for the number of keywords was to use one, two, or three keywords, which corresponds to common values when searching the Web. Discussing the visualization of search results and the number of concepts displayed, [Cugini, Laskowski, Sebrechts 2000] report their experience that the resulting display became complex and difficult to interpret when the number of concepts reached seven or eight.

John V. Cugini had reported the same information in personal communication with the author in 1999. In view of this boundary, we changed the plan and used queries with one, three, or eight keywords. For the number of results displayed, we wanted to compare the effects of small and large result sets. We ultimately settled on two different sizes of result sets: 30 and 500 hits. A 30-document border is discussed in several papers, such as [Koenemann, Belkin 1996], [Eibl 1999], and [Cugini, Laskowski, Sebrechts 2000]. The 500-hit border emerged when preparing the result sets for the evaluation. The INSYDER system and its visualization components had been tested during development with result sets of up to 2000 hits. The time needed to load a locally stored result set with 30 hits into the visualization component was about one second on the machines used, about three seconds for a 500-hit result set, and about six to seven seconds for a 1000-hit result set.

This loading time occurred for every switch from the ResultTable to a visualization. The other way around, it was always less than one second. Tests by the development team revealed that the three-second waiting time seemed tolerable, but that six seconds was considered definitely too long. 500 hits was therefore chosen as the value for large result sets. Figure 136 to Figure 139 give some impression of how the visualizations looked with one, three, or eight keywords and with 30 or 500 hits. To improve recognition of details, the surrounding parts of the INSYDER user interface are clipped in the reproductions for this thesis.

Figure 136: Bargraph with 30 hits: one, three, or eight keywords

Figure 137: Bargraph with 500 hits: one, three, or eight keywords

Figure 138: SegmentView (TileBars 3 Steps): one, three, or eight keywords

Figure 139: ScatterPlot: 30 or 500 hits

Another important aspect of the data sets used for the evaluation is their quite heterogeneous content. The data sets that were prepared for the evaluation by searching the Web with different keywords for 12 topics showed a great variation, especially in top-30 precision. The first 30 documents had a very low precision, in particular for queries with three or eight keywords. This low precision was a product of a speciality of the crawling algorithm of the INSYDER system. For multiple-keyword queries with n keywords and s search engines as starting points, INSYDER sends (n+1)*s queries to get the first seed documents to start the analysis. The important factor is the “n+1”. As mentioned above, keywords are automatically OR-ed by INSYDER. To broaden the range of seed files, every search engine used as a starting point is not only queried with all keywords in one query, but also with every single keyword in additional queries. Thus a query such as “visualization search results internet” leads to the n+1 = 5 queries “visualization OR search OR results OR internet”, “visualization”, “search”, “results”, and “internet”. Theoretically this is redundant, but that holds true only when crawling time is not considered. The top-ranked links will depend on the ranking algorithm of the search engine(s) used as a starting point. In order to bypass this dependency and use the full power of the INSYDER analysis engine, the n+1 approach is used. The analysis engine and the knowledge base, with all its synonyms and semantic connections, are also used to detect promising links to follow in seed documents. This crawling mechanism is a science of its own and could not be changed in the project. The approach is quite powerful, but it has the side effect that the first documents crawled and analyzed too often contain only one of the keywords. The second “generation” of documents, crawled from these seed files, is in general much better ranked for the overall query. To counterbalance this effect, for all the document sets prepared, INSYDER was run until more than 500 documents per query had been crawled and ranked. The resulting document sets were then clipped at 30 or 500 documents.
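To make the (n+1)*s expansion concrete, the following sketch generates the seed queries for the example query above. It is a minimal Java illustration under assumed names, not the INSYDER crawler code.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the (n+1)*s seed-query expansion described above: each of
    // the s starting search engines receives one OR-combined query plus one query
    // per single keyword. Not INSYDER code; engine names are placeholders.
    public class SeedQueries {

        // Builds the n+1 queries sent to a single search engine.
        static List<String> expand(List<String> keywords) {
            List<String> queries = new ArrayList<>();
            queries.add(String.join(" OR ", keywords)); // all keywords OR-ed in one query
            queries.addAll(keywords);                   // plus one query per keyword
            return queries;                             // n + 1 queries in total
        }

        public static void main(String[] args) {
            List<String> keywords = List.of("visualization", "search", "results", "internet");
            List<String> engines = List.of("engineA", "engineB"); // s = 2, placeholder names
            for (String engine : engines) {
                for (String query : expand(keywords)) {
                    System.out.println(engine + " <- " + query);
                }
            }
            // 2 engines * (4 + 1) queries = 10 seed queries altogether
        }
    }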

The local storage of the documents, together with the fact that the machines had been disconnected from the Internet during the evaluation, meant that the document sets presented to the users consisted of pure HTML documents without pictures.

4.3.2.4. Task

In order to observe possible influences caused by the task to be done, we decided to use two of the four different types of information-seeking tasks described in [Shneiderman 1998] and listed in Table 2 on page 20. Half of the tasks that the users had to fulfill were of the type “specific fact-finding (known-item search)”; the other half were of the type “extended fact-finding”. For several reasons, including potential problems with the English language and the question of how to measure effectiveness, we did not include tasks of the types “open-ended browsing” or “exploration of availability”. The general concept behind the evaluation was to concentrate, within the information-seeking process, on the phase or step variously named review of results, evaluate results, or examine results. (See Figure 140 for the position of this step in the whole information-seeking process.)

Figure 140: Selected tasks and their position in the information seeking process (the figure shows the steps Information Need, Formulation, Action, Review of Results, and Refinement for the process models of [Marchionini 1997], [Hearst 1999], and [Shneiderman, Byrd, Croft 1997])

The situation so evaluated is somewhat artificial. We created an information need for the user by asking a question. The user then had to skip several steps, because we had already performed them for all users so as to eliminate influences from these phases. Even in the review of results, we restrained the user by not allowing steps like reformulation of the query or selection of other sources. In addition, we forbade browsing. The goals of the user may differ from real-world information needs; the Information-Seeking Strategy (ISS) is biased; and we cut off important mechanisms for information seeking on the Web.

What are the participants’ goals when working with the information-seeking system in this experiment? In most cases, a genuine information need does not stand behind the goals. Maybe the question asked woke the interest of the participant, and the assumed information need really contributed to the goals he pursued. In many other cases, the goal may merely have been to answer the question as quickly as possible in order to get back to the cafeteria, to do a favor for the questioner, or to get the promised movie voucher with minimum effort.

In light of the Information-Seeking Strategies defined by [Belkin, Marchetti, Cool 1993] / [Belkin, Cool, Stein et al. 1995] (see Table 3 on page 22), we assumed a situation that may be characterized as ISS15 (Method: Search, Mode: Specify), but we tested a task that may be typified as ISS5 (Method: Scan, Mode: Recognize). The common elements of both strategies are Goal: Select and Resource: Information.

In Chapter 2, the importance of the iterative nature of the information-seeking process in general, and of the following of links in result sets of Web searches in particular, was explained. For example, [Hölscher, Strube 2000], documenting the information-seeking strategies of twelve Internet experts, reported that in 47% of the cases in which the experts used a search engine, browsing episodes of varying length occurred. Nevertheless, we decided not to allow the following of links from documents of the result set to other documents. Our test setting allowed the machines used to be disconnected from the Internet, because all documents in the result sets had been locally stored. Consistency in the system’s answer times could thus be guaranteed for all users and all conditions. If browsing out to the Internet had been allowed, this controlled environment condition would have been defeated.

The preparation of the tasks and the corresponding result sets turned out to be really hard work. Points discussed included whether tasks should be included that tend to favor certain visualizations, for example a question like “What was the gross national product of Germany in 1999?”. Using the ScatterPlot with its default dimensions date / relevance seemed in this case to be an advantage: all documents with a last-modified date before 1999, other than those dated 1970, could be excluded from examination. For the questions finally chosen, all documents of the result set had to be examined manually to create lists of correct answers and to eliminate all documents from the result set that would allow the extended fact-finding tasks to be completed by referring to a single document. With the latter we tried to ensure that the extended fact-finding tasks really were different in nature from the specific fact-finding tasks. The main difference between these two types is that in the latter case (specific fact-finding) there is a clear stop criterion: the user finds a document that answers the question. In the former case (extended fact-finding), there is no such clear criterion to stop the examination of a result set, and therefore the investigation of a result set will be much broader in scope and possibly of longer duration. For example, if the task was to find all books by John Irving, we eliminated documents listing all books by John Irving. Otherwise, by finding such a document very early in the process, there would have been no difference in scanning effort compared to a specific fact-finding task. This step did not influence the size of the result sets. When we eliminated a document from the set prepared to be presented to the users, it was substituted by the first document not included so far, i.e. number 31 or 501. Two tasks using document sets with 500 hits, but none of the tasks with 30 hits, were manipulated in this way.
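A rough sketch of this clip-and-substitute step is given below. The manual relevance judgment is represented by a predicate, and all names and types are assumptions for illustration; this is not the procedure actually scripted for the evaluation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // Illustrative sketch: cut the ranked crawl result down to the target size and
    // replace every "give-away" document (one that would answer an extended
    // fact-finding task on its own) by the next-ranked document not yet included.
    public class ResultSetPreparation {

        static List<String> prepare(List<String> rankedDocs, int targetSize,
                                    Predicate<String> isGiveAway) {
            List<String> prepared = new ArrayList<>();
            int next = 0;
            while (prepared.size() < targetSize && next < rankedDocs.size()) {
                String doc = rankedDocs.get(next++);
                if (!isGiveAway.test(doc)) {   // stands in for the manual examination
                    prepared.add(doc);
                }
            }
            return prepared;                   // up to targetSize documents (30 or 500)
        }

        public static void main(String[] args) {
            List<String> ranked = new ArrayList<>();
            for (int i = 1; i <= 40; i++) {
                ranked.add("doc" + i);
            }
            // Pretend doc3 lists all books by John Irving: it is skipped, and doc31
            // moves up so that the prepared set still contains 30 documents.
            List<String> prepared = prepare(ranked, 30, doc -> doc.equals("doc3"));
            System.out.println(prepared.size() + " documents, last one: "
                    + prepared.get(prepared.size() - 1));
        }
    }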

The example of [Dempsey, Vreeland, Sumner et al. 2000] also shows how difficult it is to find neutral questions when designing a study. They found that in two of their four questions, which happened to contain proper names, an unguided Web search outperformed their carefully designed subject gateway. The authors’ explanation was that in the two tasks the proper names
