
10.2 XTREEM-SL Procedure

10.2.1 Creating the XTREEM-SL Index

[Figure 10.2: Dataflow diagram for creating an XTREEM-SL index. Stages: WWW → Web Crawling → Language Recognition → XHTML Conversion → Web Document Collection → Group-By-Path → Raw Sibling Groups → Character Filtering and Normalization → Token Length Filtering → Character Length Filtering → Tagpath Cardinality Filtering → Filtered Sibling Groups → Indexing → Indexed Web Document Collection]


Step 1 - Web Crawling: A necessary precursor is to have large amounts of Web documents locally available. Performing Web crawls is not only a conceptual problem but also one with a significant engineering aspect. The circumstances of Web crawls have already been touched upon in section 4.1.1. Here we rely only on the assumption that a Web crawl has been conducted and that the Web documents have been fetched locally. This Web crawl has to cover a sufficient amount of relevant pages. The size of such a Web document collection will typically range from thousands up to billions of pages. For our experiments we crawled about 20 million Web documents, approximately 1/1000 of the number of Web documents indexed by major Web search engines at the time when the crawl was performed².

Step 2 - Language Recognition: In principle, the presented procedure is language independent; documents from arbitrary languages can therefore be indexed together. In our earlier experiments we did not incorporate language identification, which was often not problematic with regard to the obtained results. For the sake of building a compact index that reflects the target domain, however, it can be important not to index documents in languages other than the target language. If flexibility regarding the indexed languages is desired, information about the recognized language can also be stored in the index, so that a restriction on languages can be applied at retrieval time. For language recognition the language recognizer³ provided by the Nutch Web crawler was used.

Step 3 - XHTML Conversion: A necessary requirement for performing the Group-By-Path operation is that the potentially non-XHTML-conformant Web documents are converted to well-formed XHTML documents.

Step 4 - Group-By-Path: On each XHTML document the Group-By-Path operation described in chapter 3 is applied. For each Web document several sets of text spans are obtained.
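The Group-By-Path operation of step 4 can be sketched as follows: text spans of a well-formed XHTML document are grouped by the tag path under which they occur, so that spans sharing a tag path form one sibling group. This is a minimal illustrative sketch, not the exact implementation from chapter 3; the function name and the path notation are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_by_path(xhtml: str) -> dict:
    """Group the text spans of a well-formed XHTML document by their tag path."""
    groups = defaultdict(list)

    def walk(element, path):
        current = f"{path}/{element.tag}"
        text = (element.text or "").strip()
        if text:
            groups[current].append(text)
        for child in element:
            walk(child, current)
            # text following a child element still belongs to the parent's path
            tail = (child.tail or "").strip()
            if tail:
                groups[current].append(tail)

    walk(ET.fromstring(xhtml), "")
    return dict(groups)

doc = """<html><body>
  <h1>Dangerous Sharks</h1>
  <ul><li>Great White Shark</li><li>Tiger Shark</li></ul>
</body></html>"""

groups = group_by_path(doc)
# sibling text spans share the same tag path, e.g. /html/body/ul/li
```

Here the two list items end up in one group under the tag path `/html/body/ul/li`, while the heading forms a singleton group under `/html/body/h1`.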

Step 5 - Character Filtering and Normalization: The aim of this step is to normalize the input text sequences. The input text spans from the Web documents were processed as follows:

• Characters are converted to lower case.

• Non-alphabetic characters are replaced by whitespace.

• Multiple occurrences of whitespace are collapsed into a single whitespace.

• Leading and trailing whitespace is eliminated.
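The four normalization rules above can be sketched as a small function (the function name is illustrative; the alphabetic character class assumes an English-language target):

```python
import re

def normalize(span: str) -> str:
    """Normalize a raw text span according to steps listed above."""
    span = span.lower()                  # convert to lower case
    span = re.sub(r"[^a-z]", " ", span)  # replace non-alphabetic chars by whitespace
    span = re.sub(r"\s+", " ", span)     # collapse runs of whitespace
    return span.strip()                  # trim leading/trailing whitespace

result = normalize("  3 Dangerous  Sharks! ")  # -> "dangerous sharks"
```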

² September / October 2005

³ http://wiki.apache.org/nutch/LanguageIdentifier,
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html

This filtering and normalization of the raw character sequences cleans the input text content of the manifold whitespace present in marked-up Web documents. The character filtering can also be based on other criteria, for example by allowing only the alphabetic characters of the target language. By doing so, numerical characters, as they appear in numbered list items or headings, are removed.

Punctuation is filtered out as well. For an index which is not targeted at any particular language or language family (such as the Latin languages), this step can be relaxed towards only eliminating numbers and punctuation and normalizing the whitespace occurrences.

The textual content of semi-structured Web documents can be of arbitrary character length. Consequently, a text span can consist of an arbitrary number of tokens, where a token is a character sequence without whitespace characters. For text spans which do not contain whitespace, the situation is straightforward: the text span is likely a valid term. Other text spans composed of several tokens often correspond to multiword terms. In principle, Group-By-Path groups text spans of arbitrary length. In practice, however, only short text spans are likely to be term expressions worth indexing and presenting to a human user. For example, figure 3.4 contains text spans such as “Dangerous Sharks”, “There are some shark species . . .”, “Great White Shark” and so on. Some of them are valid term expressions, for example “Dangerous Sharks”, whereas other text spans such as “There are some shark species . . .” are more complex linguistic constructs.

The terms of natural language tend to consist of a rather small number of words (tokens), and the number of terms in the vocabularies of natural languages decreases sharply with increasing word count. For example, only an extraordinarily small number of terms consist of more than five, six, or seven words. This observation can be used to create a more efficient index structure. If all text spans were indexed, the index would grow (at least) to the size of the textual content of the entire Web document collection. By limiting the maximum number of tokens an indexed text span may consist of, unlikely term expressions can be eliminated, yielding a more memory-compact index without losing much of the potentially valuable information.

Step 6 - Text Span Token Length Filtering: As already mentioned (for example, in section 7.4.2 of the XTREEM-T chapter), text spans consisting of large numbers of tokens are unlikely to be valid term expressions. By means of the maximum text span token length parameter, longer text spans can be rejected regardless of their occurrence frequency. This practically eliminates long passages of textual content, the unstructured parts of Web documents. A long paragraph of text is neither a desirable sibling term nor a term at all. In computational linguistics a length of up to 4 tokens (quadgrams) is often set as the maximum length of term expressions to be found or processed. Since the number of terms of more than 5 words in length is exceptionally low, we used a maximum text span token length of 5 tokens. Allowing for longer token lengths is relatively computationally cheap with XTREEM-SL, whereas for standard n-gram based systems higher values of n are more computationally expensive.

Step 7 - Text Span Character Length Filtering: Only text spans which have a character length between a minimum and a maximum character length are preserved. For the minimum text span character length it makes sense in many languages to require a length of at least two characters to focus on terms. This can differ for languages in which even a single character can be a valid and useful term. For those languages, a minimum text span character length of 1 may be chosen.

Only text spans with a length up to the maximum text span character length are indexed.

This parameter is to a certain degree correlated with the filtering performed in step 6, since long passages of text usually also contain whitespace. It additionally eliminates long text spans containing no whitespace, which pass step 6 despite not being promising term candidates. Practically, a maximum length of up to 50 characters should be sufficient.
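Steps 6 and 7 can be combined into a single filtering predicate. The sketch below uses the parameter values stated in the text (5 tokens, 2 to 50 characters); the function name is illustrative:

```python
def keep_span(span: str,
              max_tokens: int = 5,
              min_chars: int = 2,
              max_chars: int = 50) -> bool:
    """Apply the token length (step 6) and character length (step 7) filters."""
    if len(span.split()) > max_tokens:   # step 6: reject long multiword spans
        return False
    return min_chars <= len(span) <= max_chars  # step 7: character length bounds

keep_span("dangerous sharks")                                   # kept
keep_span("there are some shark species that attack humans")   # rejected: 8 tokens
keep_span("x")                                                  # rejected: too short
```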

Step 8 - Tagpath Cardinality Filtering: Whereas the former filtering steps were applied to single text spans, there is also a filtering according to the number of text spans occurring on a tagpath.

By requiring a minimum tagpath cardinality of at least two, text spans which have no sibling text spans are discarded. By means of a maximum tagpath cardinality, the size of text span groups to be indexed can be limited. Only groups of a reasonable size can be regarded as useful sibling groups. If too many text spans share the same tagpath, the tagpath is unlikely to be a good separator of a semantically coherent sibling group. Such a group is likely to originate from a not well-structured Web document, and its terms will therefore be mixed in an undesired way. Such large groups of sibling text spans can be excluded from the index since they are likely to introduce noise.
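Step 8 operates on whole sibling groups rather than single spans. A minimal sketch follows; the minimum cardinality of 2 comes from the text, whereas the maximum of 100 is an assumed illustrative value, since the text fixes no concrete number:

```python
def filter_sibling_groups(groups: dict,
                          min_cardinality: int = 2,
                          max_cardinality: int = 100) -> dict:
    """Keep only tagpaths whose number of text spans lies within the bounds (step 8).

    max_cardinality=100 is an assumed value for illustration only.
    """
    return {path: spans for path, spans in groups.items()
            if min_cardinality <= len(spans) <= max_cardinality}

groups = {
    "/html/body/ul/li": ["great white shark", "tiger shark", "bull shark"],
    "/html/body/h1": ["dangerous sharks"],  # singleton: no siblings, dropped
}
filtered = filter_sibling_groups(groups)
# only /html/body/ul/li survives
```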