10.0 THE BOYER-MOORE FAST STRING SEARCH ALGORITHM

By eliminating the need to look at each successive character when finding an occurrence of a particular character sequence in the text, the Boyer-Moore algorithm provides a fast method for performing string searches.

When presented with a character from the text, the algorithm determines the distance to the next significant character position. The characters between the current character and the next significant character need not be analyzed and are skipped.

To make decisions about significant character positions, the algorithm depends upon information from a preconstructed pattern table, discussed next.

10.0.0 The Pattern Table

The pattern table is a 256-byte table. There is one byte-length entry in the table for each of the 256 possible character codes. The pattern table is constructed prior to the start of a search with buildtable. The data required to build the pattern table are

1. the address of the character string (pattern) being searched for

2. the length of the pattern

3. the direction of the search

The address of the pattern table is kept in the ptable integer.

The address of the pattern string is kept in the pattern integer, and the length of the pattern is kept in the patlen integer. If a forward search is being used, the direction integer will hold a true flag.

The data to be used for this discussion are shown below. We are searching for the 3 character string "hij". The text we are searching through contains the first 16 characters of the alphabet:

Pattern: hij

Pattern Length: 3

Text Being Searched: abcdefghijklmnop

Search Direction: Forward (start-of-text to end-of-text)

Given this information, buildtable will construct the pattern table shown below. Note that the entries for all characters not included in the pattern contain either a 3 (the length of the pattern) or a 1. In a forward search, if a character is the first character in the pattern, its entry will hold the value length-1. In this case, h is the first character in the pattern so its entry contains 3-1=2. The second character in the pattern, i, receives length-2 or 3-2=1. The last character in the pattern, which turns out to be the most important character in a forward Boyer-Moore search, receives the value $FF.
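The table-building rules above can be sketched in Python. This is a minimal illustration, not the editor's actual buildtable routine: the names build_table and MATCH, the use of a bytearray, and the pattern-length check are assumptions made for the example. When a character occurs more than once before the last position, the sketch keeps the entry for its rightmost occurrence, which is also an assumption.

```python
MATCH = 0xFF  # value stored for the last pattern character


def build_table(pattern: bytes) -> bytearray:
    """Build a forward-search pattern table from the rules described above."""
    patlen = len(pattern)
    # Entries are single bytes and $FF is reserved, so this sketch
    # (an assumption, not a documented limit) rejects longer patterns.
    if not 0 < patlen < MATCH:
        raise ValueError("pattern must be 1 to 254 bytes long")
    table = bytearray(256)
    for code in range(256):
        # Characters outside the pattern: pattern length below $80, 1 from $80 up.
        table[code] = patlen if code < 0x80 else 1
    for index, code in enumerate(pattern[:-1]):
        # First character gets length-1, second gets length-2, and so on.
        table[code] = patlen - 1 - index
    table[pattern[-1]] = MATCH
    return table


table = build_table(b"hij")
for ch in "chij":
    print(ch, format(table[ord(ch)], "X"))
```

For the pattern hij this prints 3 for c, 2 for h, 1 for i and FF for j, matching the values described in the text.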


10.0.1 The Character Equivalence Table (maptable)

Character Equivalence Table:

Second    First Hex Digit ->
hex
digit    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F

  0      0   0  93   0   0  70   0   0  87   0   0   0   0   0   0   0
  1      0   0   0   0  61  71   0   0   0   0   0   0   0   0   0   0
  2      0   0   0   0  62  72   0   0   0  91   0   0   0   0   0   0
  3      0   0   0   0  63  73   0   0  84   0   0   0   0   0   0   0
  4      0   0   0   0  64  74   0   0   0   0   0   0   0   0   0   0
  5      0   0   0   0  65  75   0   0   0   0   0   0   0   0   0   0
  6      0   0   0   0  66  76   0   0   0   0   0   0   0   0   0   0
  7      0   0   0   0  67  77   0   0   0   0   0   0   0   0   0   0
  8      0   0   0   0  68  78   0   0   0   0   0   0   0   0   0   0
  9      0   0   0   0  69  79   0   0   0   0   0   0   0   0   0   0
  A      0   0   0   0  6A  7A   0   0   0   0   0   0   0   0   0   0
  B      0   0   0   0  6B   0   0   0   0   0   0   0   0   0   0   0
  C     0B   0   0   0  6C   0   0   0   0   0   0   0   0   0   0   0
  D      0   0   0   0  6D   0   0   0  8C   0   0   0   0   0   0   0
  E      0   0   0   0  6E   0   0   0   0   0   0   0   0   0   0   0
  F      0   0   0   0  6F   0   0   0  86   0   0   0   0   0   0   0

10.0.2 A Step-by-Step Explanation of the Algorithm

Pattern: hij

Text Being Searched: abcdefghijklmnop

1. The search starts, more or less, at the current cursor position.

abcdefghijklmnop

2. Since our string is three characters long, the first character to be examined by the search routines will be the third character in the text, the c.

abcdefghijklmnop

3. The ptable data for the character c determines the location of the next character to check. The entry for c in the ptable contains a 3.

What we know at this point:

We know that c is not the last character in the pattern because its ptable entry does not contain a $FF.

Since we are on the third character in our text and it is not the last character in the pattern, there is no way any of the characters we skipped over could contain the pattern.

Because we know that we could not have skipped over a possible match, we can skip ahead another full pattern length number of characters (3).

4. The next character we encounter is an f.

abcdefghijklmnop

Its ptable entry contains a 3 also. For the same reasons described in Step 3 above, we will skip ahead another three character positions.

5. The next character we encounter is an i.

abcdefghijklmnop

The entry for i in the ptable contains a 1. Whenever the search routines encounter an i, which is the next to last character in the pattern, they must be sure to check the character which follows the i.

6. After advancing by one character, we encounter a j.

abcdefghijklmnop

The ptable entry for j contains a $FF, which marks j as the last character in the pattern. Now the search routine knows it has a possible match. Only at this point will it take the time to explicitly compare each character in the pattern with the possible match in the text.

7. The text string matches the pattern so the search is finished. The cursor is placed over the h.
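The seven steps above can be traced with a short Python sketch. It is an illustration under stated assumptions, not the editor's search routine: the names build_table, search and MATCH are hypothetical, the search starts at the beginning of the text instead of the cursor, and the text does not say what happens when the explicit comparison fails, so the sketch simply advances by one character in that case.

```python
MATCH = 0xFF  # table value marking the last pattern character


def build_table(pattern: bytes) -> bytearray:
    """Pattern table for a forward search (pattern of 1 to 254 bytes)."""
    patlen = len(pattern)
    table = bytearray(patlen if code < 0x80 else 1 for code in range(256))
    for index, code in enumerate(pattern[:-1]):
        table[code] = patlen - 1 - index
    table[pattern[-1]] = MATCH
    return table


def search(text: bytes, pattern: bytes):
    """Return (match position or -1, list of text positions examined)."""
    table = build_table(pattern)
    patlen = len(pattern)
    probes = []
    pos = patlen - 1  # step 2: start on the patlen-th character
    while pos < len(text):
        probes.append(pos)
        entry = table[text[pos]]
        if entry != MATCH:
            pos += entry  # steps 3 to 5: skip ahead by the table entry
            continue
        start = pos - patlen + 1
        if text[start:pos + 1] == pattern:  # step 6: explicit comparison
            return start, probes  # step 7: position of the first character
        pos += 1  # assumption: move on by one after a failed comparison
    return -1, probes


text = b"abcdefghijklmnop"
position, probes = search(text, b"hij")
print(position, [chr(text[p]) for p in probes])
```

Run on the example data, the sketch examines c, f, i and j in that order and reports position 7 (counting from 0), which is the h.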

If a normal search -- a "compare-each-character-in-the-text-to-the-pattern" search -- had been used, eight character comparisons would have been performed before the match was located. With the Boyer-Moore method, only three numeric comparisons and two character comparisons were required. On the negative side, the Boyer-Moore search does require extra time to create the ptable.

As the length of the search pattern increases, the speed of the Boyer-Moore search surpasses the speed of the conventional string search, even when the table-building time is taken into account.

10.0.3 Handling Accent Characters

In the ptable you will notice that all of the entries from $80 up contain a 1. In this range the only entries which correspond to characters found in the text are the entries $80 -> $B8 and $C0 -> $C8. These are the entries for the accent characters. As you may recall from previous explanations of accents and accented characters, an accented character is stored as a 2-byte value in the text. The first byte holds the character code for the main character, and the second byte holds the code for the accent.

The search routines will only pay attention to the main characters in the text unless an accent is specifically included in the pattern. If the search routine happens to land on the accent part of an accented character, the 1 in the ptable entry will cause the search to be advanced by one, effectively skipping over the accent code. This means that both the word "Canada" and the same word written with an accented character will be found with the unaccented pattern "can."

If a more specific pattern that includes the accent is used, only the accented word will be found by the search routines.
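The effect of the 1 entries can be seen in a small Python sketch. Several things here are assumptions made for illustration: the accent code $80 stands in for whichever accent is actually used, the names build_table, search, MATCH and ACCENT are hypothetical, the comparison is a plain byte-for-byte check, and character equivalence (such as upper and lower case) is ignored. A pattern whose match would have to span an accent byte in the text is not handled by this sketch.

```python
MATCH = 0xFF   # table value marking the last pattern character
ACCENT = 0x80  # hypothetical accent code, chosen from the $80 -> $B8 range


def build_table(pattern: bytes) -> bytearray:
    """Pattern table for a forward search; entries from $80 up default to 1."""
    patlen = len(pattern)
    table = bytearray(patlen if code < 0x80 else 1 for code in range(256))
    for index, code in enumerate(pattern[:-1]):
        table[code] = patlen - 1 - index
    table[pattern[-1]] = MATCH
    return table


def search(text: bytes, pattern: bytes):
    """Return (match position or -1, list of text positions examined)."""
    table = build_table(pattern)
    patlen = len(pattern)
    probes = []
    pos = patlen - 1
    while pos < len(text):
        probes.append(pos)
        entry = table[text[pos]]
        if entry != MATCH:
            pos += entry  # an accent byte has entry 1, so it is stepped over
            continue
        start = pos - patlen + 1
        if text[start:pos + 1] == pattern:
            return start, probes
        pos += 1  # assumption: move on by one after a failed comparison
    return -1, probes


plain = b"canada"
accented = b"can" + bytes([ACCENT]) + b"ada"  # accent byte follows the n

print(search(plain, b"can")[0], search(accented, b"can")[0])
specific = b"can" + bytes([ACCENT])
print(search(plain, specific)[0], search(accented, specific)[0])
# The first probe of this search lands on an accent byte and moves on by one.
print(search(b"re" + bytes([ACCENT]) + b"sume", b"sum"))
```

The unaccented pattern is found in both byte strings, while the pattern that includes the accent code is found only in the accented one. In the last call the first position examined holds the accent byte, and its table entry of 1 moves the search to the following character.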
