Lexical Ordering - Digital Technical Journal

ignoring prefixes and abbreviations; there is no lexical rule to determine what part of the character

puter has only the kanji character in the character string, it cannot order or look up in the same way pronunciation represented with kana characters.

Or the computer might have strings such as (Mc)Millan with the convention that the parentheses indicate parts to be ignored for ordering and searching.

Modern dictionaries and telephone directories use lexical techniques as much as possible, which is better in a multicultural environment. It is much easier to understand and apply lexical rules for searching than to acquire intuitive knowledge of an unfamiliar culture.

Vol. 5 No. 3 Summer 1993 Digital Technical Journal

Words, Not Individual Letters

It is important to understand that people order words and names, not just the individual letters and symbols. Consequently, good-quality lexical ordering that comes close to how people work cannot be achieved by looking at all the characters in a string only once, from the first one through the last one.

This concept can best be illustrated with alphabetic scripts, and some English examples are given below.

When one looks for SOS in a modern English dictionary, one expects to see it between sort and soul. Now, to find SOS between sort and soul, one must ignore that SOS is in uppercase letters and sort and soul are in lowercase. This type of lookup is achievable by looking at all the letters once.

Now consider the abbreviation CAT, meaning clear air turbulence. CAT is listed between casual and catalyst. In this case, we cannot ignore the difference between CAT and cat. The dictionary lists both words, and some dictionaries consistently list lowercase words before uppercase words (or vice versa), so the order using lowercase first would be casual, cat, CAT, catalyst. Here cat and CAT must be placed in the correct order in a second step, in which uppercase and lowercase make a difference.

Dealing with uppercase and lowercase is not the only issue for alphabetic ordering. Many languages use letters with diacritical marks such as accents.

Words and names may also contain spaces or special symbols, such as hyphens, apostrophes, and points. Examples are big bang, best-seller, rock 'n' roll, and P.S. When ordering is strictly alphabetic, as is the case in many dictionaries, then accents on letters, spacing, and special symbols are ignored in the first step, but they are taken into account to resolve a tie. For example, the correct order in French might be denier, dénier, dernier; or Nb, NB, N.B., Nd, n.d., N.D. in English.

Table-driven Multilevel Ordering

The heart of ordering methods is the comparison of two character strings. If we have an algorithm to determine whether one string should precede, follow, or be considered equal to a second string, then arranging a list of strings in the correct order is straightforward.


The Ordering of Universal Character Strings

Single-step or One-level Compare

The single-step compare or one-level ordering algorithm is known by most readers:

Compare the first characters of the two strings; if equal, then compare the second characters; continue until a difference is found or until at least one string is exhausted. If a difference is found, then the character-collating sequence determines which string precedes the other. (Example: words pre

Multiple-step or Multilevel Compare

The state-of-the-art computer method for comparing character strings is a generalization of the single-step compare. If, after using the above algorithm with the first collating sequence, both strings are found to be equal, then in the second step the algorithm is repeated. Both strings are compared again, starting from their first characters, now using the second collating sequence. In the second step, the accents make a difference, but there is no distinction between lowercase and uppercase variants. That distinction is made in the third step: the collating sequence for W3 is <LC>, <UC>.

Product Internationalization

The weight matrix and the collating sequences can be placed in tables that are used by the ordering algorithm, hence the name table-driven multilevel ordering.

If this example matrix is extended in a similar way, then the multilevel algorithm would place the following words (most of which are real French words) in this correct order: denie, DlfNil:', denier, DENIER, dénier, DÉNIER, dimie1; dernier.
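The three-step compare described above can be sketched in a few lines. This is only an illustration, not the table layout of any standard: the `weights` function is a hypothetical stand-in for the weight matrix, and it derives W1 (letter), W2 (accent), and W3 (case) from Unicode decomposition instead of from hand-built tables. The ordering of one accent relative to another is left to code point order, which is an assumption.

```python
import unicodedata
from functools import cmp_to_key

def weights(ch):
    """Hypothetical weight matrix row for one character: (W1, W2, W3)."""
    decomposed = unicodedata.normalize("NFD", ch)
    w1 = decomposed[0].lower()      # W1: the base letter
    w2 = decomposed[1:]             # W2: accents; "" plays the role of <NONE>
    w3 = 1 if ch.isupper() else 0   # W3: <LC> (0) precedes <UC> (1)
    return (w1, w2, w3)

def multilevel_compare(a, b):
    """Return -1, 0, or 1, restarting from the first character at each level."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    for level in range(3):
        ka = [weights(ch)[level] for ch in a]
        kb = [weights(ch)[level] for ch in b]
        if ka != kb:
            return -1 if ka < kb else 1
    return 0

ordered = sorted(["dernier", "DENIER", "d\u00e9nier", "denier"],
                 key=cmp_to_key(multilevel_compare))
print(ordered)
```

Letters alone decide the first step, so dernier comes last; the accent separates dénier from the unaccented pair in the second step; case separates denier from DENIER only in the third.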

The method that is described here is also used in POSIX (ISO/IEC 9945-2.2 Shell and Utilities, LC_COLLATE Definition).¹ Rolf Gavare was among the first to publish a paper on multiple-step comparisons.² Alain LaBonté was the first to describe it as explained in this paper, and he also implemented it as a Canadian Standard (CSA Z243.4.1-1992). LaBonté devised a complete and predictable ordering method that corresponds to very fine detail with the best examples of French and English dictionary ordering.¹

Generate Comparison Key

With the multilevel method, it is also possible to have the algorithm generate a comparison key for a specific character string rather than always compare two strings. These comparison keys can be stored with the character strings; a one-level comparison of keys then gives the same result as a multilevel comparison of the original character strings.

For example, and again extending the example matrix given above, the comparison key for denie could be a convenient numerical representation of

<D><E><N><I><E><nil> <NONE><NONE><NONE><NONE><NONE><nil> <LC><LC><LC><LC><LC><nil>.

The <nil> precedes all other weights. Its presence at the end of the comparison key subfields guarantees that shorter strings precede longer strings. Efficient compression techniques exist for such comparison keys.

Variations of the Multilevel Method

The following section expands upon the multilevel method and gives examples of changes necessary to accommodate cultural differences in word order.

Special Symbols

With a small extension, the multilevel method can also handle special characters such as the hyphen and the apostrophe to mimic traditional human alphabetic ordering. Another weight column must be added to the matrix given above to distinguish letters from special characters:

Character                 W1      W2      W3      W4
LATIN SMALL LETTER E      <E>     <NONE>  <LC>    <LTR>
HYPHEN-MINUS              IGNORE  IGNORE  IGNORE  <HPH>

The IGNORE indicates that the character is skipped in the comparison algorithm in the first three steps. A collating sequence for W4, in which <LTR> precedes all symbols for special characters such as <HPH>, guarantees that words and names without special characters precede the ones with exactly the same letters, but with special characters.

A four-level ordering such as the one suggested here is sufficient for a good-quality, complete, and predictable alphabetic ordering with the Latin alphabet.
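The role of IGNORE and the fourth weight column can be sketched as follows. The `weights` function is a hypothetical, ASCII-only stand-in for the four-column matrix; the W4 ordering among the symbols themselves (hyphen, apostrophe, space) is an assumption, since the text only requires that <LTR> comes first.

```python
IGNORE = None
LTR, HPH, APO, SPC = 0, 1, 2, 3   # W4 sequence; <LTR> precedes all symbols

def weights(ch):
    """Hypothetical four-column weight row (W1, W2, W3, W4) for ASCII input."""
    if ch.isalpha():
        # W2 is always 0 here because this sketch has no accented letters.
        return (ch.lower(), 0, 1 if ch.isupper() else 0, LTR)
    symbols = {"-": HPH, "'": APO, " ": SPC}
    return (IGNORE, IGNORE, IGNORE, symbols[ch])

def level_key(word, level):
    """Weights of one level, skipping characters marked IGNORE."""
    return [w[level] for w in map(weights, word) if w[level] is not IGNORE]

def sort_key(word):
    return tuple(level_key(word, level) for level in range(4))

ordered = sorted(["cooper", "Coop", "co-op", "coop"], key=sort_key)
print(ordered)
```

The hyphen plays no part in the first three levels, so coop and co-op tie until the fourth level, where the all-letter form comes first.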

Additional Letters

For most languages written in Latin characters, the correct order of words would be senior, señorita, sentimental, separable. To achieve this order, W1 would be ..., <M>, <N>, <O>, ..., and the matrix would include LATIN SMALL LETTER N WITH TILDE, where W1 is <N>, W2 is <TILDE>, and W3 is <LC>.

In Spanish, the N WITH TILDE is considered a letter to be ordered between N and O, and the correct order is senior, sentimental, señorita, separable. To achieve this type of ordering, W1 would be ..., <M>, <N>, <NTILDE>, <O>, ..., and the matrix would add LATIN SMALL LETTER N WITH TILDE, where W1 is <NTILDE>, W2 is <NONE>, and W3 is <LC>.
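The two treatments of N WITH TILDE differ only in the table entry for that one character, which a short sketch can make concrete. The `make_key` helper and its `spanish` flag are hypothetical; W1 is modeled as a (letter, rank) pair so that <NTILDE> can sit between <N> and <O> without renumbering the alphabet.

```python
N_TILDE = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

def make_key(spanish):
    """Return a three-level key function; `spanish` selects the table variant."""
    def key(word):
        w1, w2, w3 = [], [], []
        for ch in word:
            lower = ch.lower()
            if lower == N_TILDE and spanish:
                w1.append(("n", 1))   # W1 = <NTILDE>, between <N> and <O>
                w2.append(0)          # W2 = <NONE>
            elif lower == N_TILDE:
                w1.append(("n", 0))   # W1 = <N>
                w2.append(1)          # W2 = <TILDE>
            else:
                w1.append((lower, 0))
                w2.append(0)
            w3.append(1 if ch.isupper() else 0)
        return (w1, w2, w3)
    return key

words = ["separable", "se\u00f1orita", "senior", "sentimental"]
general = sorted(words, key=make_key(spanish=False))
spanish = sorted(words, key=make_key(spanish=True))
print(general)
print(spanish)
```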

Ligatures

The multilevel method can also handle ligatures by allowing each matrix element to be a sequence of weights, rather than one weight. For Æ in French, the matrix would include LATIN SMALL LIGATURE AE, where W1 is <A><E>, W2 is <LG><LG>, and W3 is <LC><LC>. In these languages, LIGATURE AE is equivalent to two letters when ordering words. In Norwegian, the Æ is a letter on its own. W1 is ..., <Y>, <Z>, <AE>, <OSTROKE>, <ARING>. For the matrix element LATIN SMALL LIGATURE AE, W1 is <AE>, W2 is <NONE>, and W3 is <LC>.
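A sketch of matrix elements that hold sequences of weights follows. The `FRENCH` and `NORWEGIAN` dictionaries are hypothetical miniature tables covering only the unaccented basic alphabet plus the characters discussed above; the placement of <LG> after <NONE> in W2 is an assumption. The sample words are chosen only to exercise the tables.

```python
NONE, LG = 0, 1   # W2 weights; <LG> marks a ligature
LC, UC = 0, 1     # W3 weights
BASIC = list("abcdefghijklmnopqrstuvwxyz")

# Each table entry is (sequence of W1 names, sequence of W2 weights).
FRENCH = {"alphabet": BASIC,
          "table": {"\u00e6": (["a", "e"], [LG, LG])}}
NORWEGIAN = {"alphabet": BASIC + ["ae", "ostroke", "aring"],
             "table": {"\u00e6": (["ae"], [NONE]),
                       "\u00f8": (["ostroke"], [NONE]),
                       "\u00e5": (["aring"], [NONE])}}

def make_key(locale):
    rank = {name: i for i, name in enumerate(locale["alphabet"])}
    table = locale["table"]
    def key(word):
        w1, w2, w3 = [], [], []
        for ch in word:
            names, marks = table.get(ch.lower(), ([ch.lower()], [NONE]))
            w1.extend(rank[name] for name in names)
            w2.extend(marks)
            # One case weight per W1 weight, so the levels stay aligned.
            w3.extend([UC if ch.isupper() else LC] * len(names))
        return (w1, w2, w3)
    return key

words = ["cafe", "c\u00e6cum", "caduc", "czar"]
french = sorted(words, key=make_key(FRENCH))
norwegian = sorted(words, key=make_key(NORWEGIAN))
print(french)
print(norwegian)
```

With the French table the ligature sorts as the two letters a and e; with the Norwegian table it sorts after z.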

Logograms

Some special symbols, sometimes called logograms, can be seen as short notations for words: & + %. A culture-specific ordering may replace such symbols by the corresponding words. If the language is English, for example, then Research & Development can be ordered as Research and Development. As long as a fixed rule exists for replacing symbols by


equivalent words, the extension that was introduced for Æ can be applied in a similar way to obtain the desired ordering. On the other hand, if the replacement word depends on the language used in the rest of the string, then lexical ordering cannot do the job properly without more information coded in the character strings.
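The fixed-rule case can be sketched as a simple substitution applied before ordering. The `LOGOGRAMS` table below is a hypothetical English-only mapping, and the key is deliberately reduced to a single case-insensitive level to keep the example short.

```python
# Hypothetical fixed replacement rule for English text.
LOGOGRAMS = {"&": "and", "+": "plus", "%": "percent"}

def expand_logograms(text):
    """Replace each logogram by its English word; other characters pass through."""
    return "".join(LOGOGRAMS.get(ch, ch) for ch in text)

def sort_key(text):
    # One-level, case-insensitive key; a full implementation would feed the
    # expanded text into the multilevel method instead.
    return expand_logograms(text).casefold()

titles = ["Research Policy", "Research & Development", "Research Abstracts"]
ordered = sorted(titles, key=sort_key)
print(ordered)
```

A table like this works only when the language of the whole string is known in advance, which is exactly the limitation noted above.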

Fine Tuning for the Accents

The table-driven multilevel method, as explained so far, would place French words in this order: cote, coté, côte, côté, maçon, mâcon. In a traditional, correct ordering, they should be in the following order: cote, côte, coté, côté, mâcon, maçon. (In general, accents at the end of a French word are more important for understanding than other accents.) To obtain the desired ordering, another extension of the multiple-step method is needed: for the second step, the one that discriminates between quasi-homographs (words that differ only in their diacritical marks), the comparison algorithm should start from the end of the strings rather than from the beginning. For the other Western languages that use the Latin alphabet, this reverse processing for the accents is not needed. On the other hand, it does not hinder either, so the French method is acceptable as well.
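The effect of reversing only the accent level can be sketched as follows. The `accent_key` function and its `reverse_accents` flag are hypothetical; accents are taken from Unicode decomposition, with the empty string standing for <NONE>.

```python
import unicodedata

def accent_key(word, reverse_accents):
    """Three-level key; the accent level is optionally read from the end."""
    word = unicodedata.normalize("NFC", word)
    w1, w2, w3 = [], [], []
    for ch in word:
        decomposed = unicodedata.normalize("NFD", ch)
        w1.append(decomposed[0].lower())   # W1: base letter
        w2.append(decomposed[1:])          # W2: accents, "" means none
        w3.append(ch.isupper())            # W3: lowercase before uppercase
    if reverse_accents:
        w2.reverse()                       # second step starts at the end
    return (w1, w2, w3)

WORDS = ["c\u00f4t\u00e9", "ma\u00e7on", "cote", "m\u00e2con", "cot\u00e9", "c\u00f4te"]
forward = sorted(WORDS, key=lambda w: accent_key(w, False))
french = sorted(WORDS, key=lambda w: accent_key(w, True))
print(forward)
print(french)
```

Only the second level changes direction; letters and case are still compared from the first character onward.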

French is not the only language with such quasi-homographs. In new-Greek, with the modern monotoniko spelling, all multisyllabic words have one accent that indicates the stressed syllable. New-Greek has many quasi-homographs, including the following examples, which use a simple transcription of Greek letters to Latin letters: árguros, argurós, diakonía, diakoniá, métro, metró, pára, pará. The French method of reverse processing produces acceptable results for new-Greek as well.

Fine Tuning for the Special Symbols

With the tables extended as explained in the section Special Symbols, the multiple-step algorithm would order words as follows: unionized, union-ized, un-ionized. For the exceptional cases such as this one, in which two words are identical except for the placement of a special symbol, the order unionized, un-ionized, union-ized may seem more appropriate. Usually, the hyphen is perceived as a word break, not on the first level, but on a subsequent level, and with word breaks, shorter words always come first.

To obtain the latter ordering, one could use the same technique as for the diacritical marks: have


the algorithm start from the end of the strings for the level that deals with the special symbols. POSIX has a small extension to the multilevel method that gives similar results while still moving forward.

This extension adds the position of the symbol to its table weight during comparison.
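One way to read this position-based extension is sketched below; it is an interpretation for illustration, not the POSIX definition. Only the letter level and the special-symbol level are modeled, `HPH` and `LTR` are hypothetical weights, and the two W4 functions contrast the plain fourth level with a variant that pairs each symbol with its position.

```python
LTR, HPH = 0, 1   # hypothetical W4 weights: letters before the hyphen

def letters(word):
    """First-level weights: letters only, symbols ignored."""
    return [ch.lower() for ch in word if ch.isalpha()]

def plain_w4(word):
    """Fourth level as in the Special Symbols section: one weight per character."""
    return [LTR if ch.isalpha() else HPH for ch in word]

def positional_w4(word):
    """Fourth level with the symbol's position combined with its weight."""
    return [(pos, HPH) for pos, ch in enumerate(word) if not ch.isalpha()]

def key(word, w4):
    return (letters(word), w4(word))

words = ["un-ionized", "union-ized", "unionized"]
plain = sorted(words, key=lambda w: key(w, plain_w4))
positional = sorted(words, key=lambda w: key(w, positional_w4))
print(plain)
print(positional)
```

With positions included, an earlier hyphen sorts first, which matches the word-break intuition while the comparison still runs forward.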

Special Symbols in Combination with Uppercase and Lowercase Characters

This section does not introduce a new extension but reconsiders the extension for the special symbols. This method adds a fourth weight column:

Character                 W1      W2      W3      W4
LATIN SMALL LETTER E      <E>     <NONE>  <LC>    <LTR>
HYPHEN-MINUS              IGNORE  IGNORE  IGNORE  <HPH>

With W3 for uppercase and lowercase and W4 for the special characters, the distinctions between uppercase and lowercase are considered more important than the presence or absence of spacing and special symbols. In many cultures, this is indeed the case with proper names of people. The following order is desired with names that differ in use of uppercase or lowercase letters: deGroot, de Groot, Degroot, De groot, DeGroot, De Groot.

For some geographical names, it could be argued that special symbols are more significant than the difference between lowercase and uppercase.

For example, the desired order is Sanssouci,

To obtain the correct order, changes are sometimes necessary to the multilevel method. This section discusses cases in which it is less easy to adapt the table-driven multilevel method.

Digraphs and Collating Elements

CH and LL have special placement in the Spanish alphabet. Spanish is not unique in this respect; combinations of letters also have special placement in the Albanian, Hungarian, Vietnamese, and Welsh alphabets. The Welsh ordering alphabet, for example, is A B C CH D DD E F FF G NG H I J L LL M N O P PH R RH S T TH U W Y, and the following list of words is correctly ordered in Welsh: acw, achos, adwy,


addas, agwedd, angau, almon, allan, anfynych, anffodus, antur, anthem.

Before the multilevel method can be applied, it is necessary to replace the multiple-character combinations by pseudo-characters. In POSIX LC_COLLATE, such a mechanism is foreseen. One

lem. One complication, however, is that the two letters together do not always represent the special alphabet letter. In Welsh, for example, the N and G are separate letters in the Welsh words melyngoch, dangos, gwyngalchu, and mwynglawdd. The word melyngoch then is among words starting with melyn, not after the words with melyg. More information must be coded in the character strings that represent Welsh words to define a correct lexical ordering. complexity will probably also see that the collating-element extension makes the table-driven multilevel method less straightforward to implement. If there are only a few collating-element extensions, then simple workarounds might help, but what if there are thousands of them? (Improbable? Wait to form your opinion until you read the section Added Complexity with UCS Coding.)
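The replacement of letter combinations by pseudo-characters, and the way it can go wrong, can be sketched with a greedy tokenizer. The `welsh_key` function is a hypothetical first-level key only; it always prefers a digraph when one matches, which is the simplification that fails for words such as dangos.

```python
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
         "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
         "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(WELSH)}
DIGRAPHS = {letter for letter in WELSH if len(letter) == 2}

def welsh_key(word):
    """First-level key: greedily read digraphs as single pseudo-characters."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

words = ["acw", "achos", "adwy", "addas", "agwedd", "angau", "almon",
         "allan", "anfynych", "anffodus", "antur", "anthem"]
ordered = sorted(reversed(words), key=welsh_key)
print(ordered == words)
# The greedy rule misreads dangos, where N and G are separate letters:
print(RANK["ng"] in welsh_key("dangos"))
```

The tokenizer has no way to tell the two readings of NG apart from the characters alone, so the distinction would have to be carried in the string itself.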

Sequences, However Long

Other ordering requirements are difficult to accommodate with the matrix method. For example, the British standard on ordering, BS 1749:1985, requires that (in the first step) spaces, dashes, hyphens, and diagonal slashes and sequences of them be treated as a single space (which is significant), except at the beginning of an entry, where they should be ignored. Making a space significant for ordering is easy, but the collating-element extension unfortunately does not allow recursive definitions, so it cannot incorporate the sequences of spaces, etc.
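Outside the weight matrix, a rule of this kind is easy to express as a preprocessing pass, which may be why it fits the table-driven method so poorly. The sketch below is an assumption about how such a pass could look, not part of BS 1749 or POSIX; the set of dash characters it recognizes is chosen for illustration.

```python
import re

# Runs of spaces, hyphens, en/em dashes, and slashes (assumed separator set).
_SEPARATORS = re.compile(r"[ \-\u2013\u2014/]+")

def normalize_separators(entry):
    """Collapse each separator run to one space and drop a leading run."""
    collapsed = _SEPARATORS.sub(" ", entry)
    return collapsed.lstrip(" ")

print(normalize_separators("--rock - 'n' / roll"))
print(normalize_separators("best-seller"))
```

The normalized entry can then be handed to the multilevel method with the space treated as a significant first-level character.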