Functions of the INSYDER system - The INSYDER project

4. INSYDER

4.1. The INSYDER project

4.1.1. Functions of the INSYDER system

The INSYDER system comprises three main functions: Search, Watch, and Bookmark / News.

These functions can be organized in Spheres Of Interest (SOIs) that can be saved and loaded as user environments. Figure 101 and Figure 102 show two examples of the INSYDER user interface and predefined SOIs provided by the project partners responsible for the content of the system.

The Search function is the part of the system that is the precondition for the visualization of search results. It will be described below. The Watch function allows the monitoring of URLs and documents. Any modification of the documents or the emergence of user-defined terms is monitored and registered in user-defined time intervals. The idea is to support market or technology surveys in order to detect trends or discover strategic movements. The Bookmark function provides standard bookmarking functionality for URLs. Figure 102 shows a bookmarked page. The Bookmark function was also integrated in the system as the basis for a planned News function. Special Web portals can be integrated in the SOIs as bookmarked Web pages. The portals have been designed as an edited service, to be provided by some of the project partners. The pages are structured as collections of predefined links to national and international daily news. The source for this information is the Internet with its electronic newspapers, magazines, and press agencies.
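The change detection behind the Watch function can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the INSYDER implementation: the names `fingerprint` and `check_watch`, the hash-based comparison, and the sample texts are hypothetical, and fetching the documents at a user-defined interval is left out.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a hash of the document text, used only to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_watch(old_text: str, new_text: str, terms: list[str]) -> dict:
    """Compare two snapshots of a watched document (hypothetical helper).

    Reports whether the content changed and which user-defined terms
    occur in the new snapshot but not in the old one.
    """
    def present(term: str, text: str) -> bool:
        # Whole-word, case-insensitive match.
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None

    new_terms = [t for t in terms
                 if present(t, new_text) and not present(t, old_text)]
    return {
        "changed": fingerprint(old_text) != fingerprint(new_text),
        "new_terms": new_terms,
    }


# Two invented snapshots of the same watched page.
old = "Our product line covers steel beams."
new = "Our product line covers steel beams and a new CAD plugin."
report = check_watch(old, new, ["CAD", "steel"])
print(report)
```

A real watch job would store the previous snapshot (or only its fingerprint) between runs and call such a check once per interval.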

Figure 101: The INSYDER system, example of SOI building and construction

Figure 102: The INSYDER system, example of SOI CAD

As described, the main functions of the INSYDER system cover the areas search, monitoring, bookmarks, news, and administration. Because the visualizations discussed in this thesis focus on the representation of search results, the subsequent list of the system’s functions will focus on this area. The following features are implemented in the INSYDER system:

• Searching and loading of HTML- or TXT-based information from the Internet or an intranet.

• Entered search terms are automatically combined with a logical OR. The standard search mode offers no string search, no Boolean operators, and no proximity functions (for special functions see [Mußler, Reiterer, Mann 2000] and [Mußler 2002]).

• Any search engine or groups of search engines with URL-controlled interfaces may be used as starting points for a search. In addition, direct specification of URLs or URL lists is possible. In contrast to search engines used as starting points, the directly entered URLs are loaded and analyzed, but in the current implementation are not used for further crawling.

• Own crawling of all links returned by the search engines and further crawling of all links in analyzed documents, except documents directly entered as URLs. The exception is a bug rather than a feature (see above).

• Local storage of all crawled documents to allow off-line inspection (without images)

• Concept matching by parsing the entered natural language query in order to extract concepts to match against concepts in the crawled documents.

• Relevance ranking with the use of a thesaurus-based content analysis. No use of rankings from the external search engines used. The thesaurus is at present available in English and French.

• Automatic classification of host or site type according to the rules of a control file: e.g. “academic” for *.edu, .uni-, .fh-, *.ac.at, …; “european” for *.de, *.fr, *.it, *.at ...; or “competitor” for www.mycompetitor.com, www.meinkonkurrent.de, ...

• Automatic classification of the type of document as “catalog”, “bookmark list”, “text/images”, “frameset”, …

• Determination of the document date (last modified) through analysis of the relevant HTML tags (<META name=“date”... etc.) and the last modified value of the HTTP protocol.

• Representation of the search results in an interactively configurable and sortable table with the following attributes:

o Title, URL, document date (last modified), language, size in kB, size in words;

o Site type (academic, European, …), Document type (catalog, bookmark list, …);

o Relevance for query, relevance per keyword (concept) in query, 255-character query-dependent document extract as a mix of abstract¹³² and keywords in context (KWIC)¹³³;

o Select flag, relevance feedback flag.

• Visualization of the search results as ScatterPlot, BarGraph, or SegmentView (TileBars or StackedColumn).

• New ranking of an already obtained result set after a modification of the query

• Automatic generation of a new query through relevance feedback is possible (find similar, no preference, do not find similar)

• Storage of queries, starting points, and result sets

• Export function for result sets as HTML files with all attributes shown in the table like title, URL, date, extract etc.

• Monitoring of HTML and TXT documents for changes or the occurrence of keywords or concepts

• Bookmarking functionality

• Administration of queries and result sets, monitoring jobs, and bookmarks or news in topic-oriented Spheres Of Interest
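The rule-based host classification listed above could be approximated as shown below. The format of the INSYDER control file is not described here, so the `RULES` dictionary, the function name `classify_host`, and the decision to let a host fall into several categories are assumptions made for illustration only. The patterns are taken from the examples in the list.

```python
from fnmatch import fnmatchcase

# Hypothetical control rules: category -> host name patterns.
# The real control file format is not specified in the text.
RULES = {
    "academic": ["*.edu", "*.ac.at"],
    "european": ["*.de", "*.fr", "*.it", "*.at"],
    "competitor": ["www.mycompetitor.com", "www.meinkonkurrent.de"],
}


def classify_host(host: str) -> list[str]:
    """Return every category whose patterns match the host name."""
    host = host.lower()
    return [category
            for category, patterns in RULES.items()
            if any(fnmatchcase(host, pattern) for pattern in patterns)]


print(classify_host("www.uni-graz.ac.at"))
print(classify_host("www.meinkonkurrent.de"))
print(classify_host("example.com"))
```

Keeping the rules in data rather than in code mirrors the idea of a control file: new categories can be added without touching the classifier.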

132 “An abstract summarizes the main topics of the document but might not contain references to the terms within the query.” [Hearst 1999]

133 “A KWIC extract shows sentences that summarize the ways the query terms are used within the document.” [Hearst 1999]
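A KWIC extract in the sense of footnote 133, limited to the 255 characters used in the result table, might be produced along the following lines. This is a simplified sketch: the function name `kwic_extract`, the regex-based sentence splitting, the substring matching, and the `" ... "` separator are assumptions, and the mix with an abstract that INSYDER uses is not modeled.

```python
import re


def kwic_extract(text: str, terms: list[str], max_chars: int = 255) -> str:
    """Join the sentences that contain a query term, cut to max_chars."""
    if not terms:
        return ""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Case-insensitive substring match on any of the query terms.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = [s for s in sentences if pattern.search(s)]
    return " ... ".join(hits)[:max_chars]


# Invented sample document.
sample = ("INSYDER crawls the Web. It ranks documents with a thesaurus. "
          "Results are shown in a table.")
extract = kwic_extract(sample, ["thesaurus", "table"])
print(extract)
```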

In the case of the INSYDER system, content comprises:

• Country- and industry-specific preconfigured Spheres Of Interest, for example with selected bookmarks;

• Country- and industry-specific preconfigured lists of search engines and URLs that can be used as starting points for queries;

• Country- and industry-specific created thesauri for the improvement of the relevance ranking of hit pages;

• Country- and industry-specific preconfigured control files for an automatic classification of hosts or sites in categories.

INSYDER supports the process of collecting, analyzing, and classifying unstructured data in documents. For the document analysis and ranking, the INSYDER system uses a knowledge base (thesaurus, semantic network). It enables a semantic content analysis of the documents. The INSYDER system can thus find and correctly rank documents even in cases where they do not contain the search words themselves but contain similar concepts (e.g. synonyms). At present, the thesaurus exists in two languages: French and English. This permits a bilingual analysis of the documents. Accordingly, the system can, for example, find and evaluate documents in French with an English query. In addition to the thesauri for different languages, topic-specific thesauri are possible (e.g. for CAD, pharmacy). This enables INSYDER to be adapted to different enterprise needs.
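How a thesaurus lets an English query match a French document can be illustrated with a toy example. Everything in it is assumed for the sake of the sketch: the `THESAURUS` entries, the concept identifiers, and the scoring formula (share of query concepts found in the document) are hypothetical and much simpler than a semantic network.

```python
import re

# Hypothetical bilingual thesaurus: surface term -> concept identifier.
THESAURUS = {
    "car": "VEHICLE", "automobile": "VEHICLE", "voiture": "VEHICLE",
    "price": "PRICE", "cost": "PRICE", "prix": "PRICE",
}


def concepts(text: str) -> set[str]:
    """Map every known word in the text to its concept."""
    words = re.findall(r"\w+", text.lower())
    return {THESAURUS[w] for w in words if w in THESAURUS}


def relevance(query: str, document: str) -> float:
    """Share of query concepts that also occur in the document (0.0 to 1.0)."""
    query_concepts = concepts(query)
    if not query_concepts:
        return 0.0
    return len(query_concepts & concepts(document)) / len(query_concepts)


print(relevance("car price", "Le prix de la voiture"))
print(relevance("car price", "Une voiture rouge"))
```

Because matching happens on concepts rather than on strings, the French document scores fully although it shares no word with the English query.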

The Spheres Of Interest are representations of the areas in which the user is interested. The user can define various SOIs, e.g. technology, marketing, or competitors. Each SOI is shown as a folder. Within these folders, various searches, watches, and bookmarks can be defined and assigned to a specific interest area. Inside the SOIs, previous searches and watches as well as the current searches and watches and their current status are displayed. Each search or watch is indicated as being currently executed in the background, or as already finished. Moreover, it is possible to create predefined SOIs for the user and deliver them with the INSYDER system. A further possibility discussed has been the subscription to SOIs. An INSYDER system with particular SOIs could then automatically be updated as soon as the provider updates the SOIs. Like the topic-specific thesauri, the SOIs represent a further personalization possibility of the INSYDER system.
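The folder-like structure of an SOI can be pictured as a small data model. The class and field names below (`Job`, `SphereOfInterest`, `finished`, and so on) are hypothetical and only mirror the description above; they are not taken from the INSYDER code.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """A search or a watch together with its execution status."""
    kind: str               # "search" or "watch"
    query: str
    finished: bool = False  # False: still running in the background


@dataclass
class SphereOfInterest:
    """A folder that groups searches, watches, and bookmarks by topic."""
    name: str
    jobs: list[Job] = field(default_factory=list)
    bookmarks: list[str] = field(default_factory=list)

    def running(self) -> list[Job]:
        """Return the jobs that are still being executed."""
        return [job for job in self.jobs if not job.finished]


soi = SphereOfInterest("competitors")
soi.jobs.append(Job("search", "steel beam suppliers", finished=True))
soi.jobs.append(Job("watch", "new CAD tools"))
soi.bookmarks.append("www.mycompetitor.com")
print([job.query for job in soi.running()])
```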

The SOIs also support processes described by [Spink, Bateman, Jansen 1998] as the successive search phenomenon, a process of repeated, successive searching over time.

The INSYDER search is based on a dynamic search approach. The idea is to use an online search to discover relevant information by following links. The main advantage is that the system searches the current structure of the Web and not a possibly outdated index of a search engine. The dynamic search is based on special crawling agents. They use different heterogeneous sources (like search engines, Web directories, Web sites, documents) as starting points for following links. For example, the query terms are submitted to selected search engines and the hyperlinks in the search results are used for further crawling in the Web. Taking the returned hits as a starting point, INSYDER conducts an active search in the WWW. All documents found are analyzed incrementally to find out how well they match the query. Because every document is fetched at search time, the documents presented in the result list reflect the current state of the Web. Unlike other search systems, INSYDER is not designed to crawl the entire WWW and store its contents. Instead, it only crawls selected parts of the Web that seem to be relevant to a given user query. Every crawled document is then ranked by the INSYDER system. This way of specializing the search by specializing the crawling and ranking is intended to increase the precision compared to other meta-search engines, which rely only on the results from the search engines' indices.
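The dynamic search described above can be sketched as a breadth-first crawl that starts from search engine hits and ranks every fetched document. The sketch makes several assumptions: the `WEB` dictionary stands in for real HTTP requests, the `score` function is a toy term match rather than the thesaurus-based ranking, and the rule of following links only from documents that match the query is one possible reading of "selected parts of the Web".

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (text, outgoing links).
WEB = {
    "se://hits": ("", ["a", "b"]),  # result page of a search engine
    "a": ("cad software for construction", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("cad plugins and cad tools", []),
    "d": ("cad news", []),
}


def score(query_terms: list[str], text: str) -> float:
    """Toy relevance: fraction of query terms that occur in the text."""
    words = set(text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)


def dynamic_search(query_terms, seeds, max_docs=100):
    """Crawl from the seed links and rank every document that is fetched.

    Links are followed only from documents that match the query
    (an assumed pruning rule, see the lead-in).
    """
    queue = deque(seeds)
    seen = set(seeds)
    results = []
    while queue and len(results) < max_docs:
        url = queue.popleft()
        text, links = WEB.get(url, ("", []))
        relevance = score(query_terms, text)
        results.append((url, relevance))
        if relevance > 0:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    # Best-ranked documents first; ties keep their crawl order.
    return sorted(results, key=lambda item: item[1], reverse=True)


hits = WEB["se://hits"][1]  # links returned by the search engine
ranking = dynamic_search(["cad"], hits)
print(ranking)
```

In this toy run, document "d" is never fetched because the page linking to it does not match the query, which shows how pruning keeps the crawl focused at the cost of possibly missing relevant pages.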

To start a search with the INSYDER system, the user enters or creates a Sphere Of Interest. In the next step, which is the first phase of the four-phase framework from [Shneiderman, Byrd, Croft 1997] shown in Table 1 on page 12, the user formulates his information need as unstructured text (often, somewhat misleadingly, called “natural language”) and chooses sources from a list as starting points for the search (e.g. Web sites, search engines). In the subsequent action phase, the search is launched and runs until the user stops it. During or after the search, the user may review the results, i.e. look at the documents. A Web query, even when well focused, can produce so many potentially useful hits as to be overwhelming, i.e. several hundred or more. Recent work in visual information-seeking systems, capitalizing on general information visualization research, has dramatically expanded the limited traditional display techniques (e.g. ranked list of hits). Accordingly, a variety of information visualization techniques for displaying search results has been integrated in the INSYDER system. All visualizations simply try to make the result set of documents easier to handle. The refinement of the search is supported by relevance feedback. A detailed description of this and other information retrieval approaches used in the INSYDER system (e.g. weighted search terms, semantic analysis) can be found in [Reiterer, Mußler, Mann et al. 2000] and [Mußler 2002].

From the document: Visualization of search results from the World Wide Web (pages 129-133)