To build systems that process multilingual data, such as the one shown in Figure 1, a rich variety of text operations is necessary. This section categorizes such operations, but a complete specification of their interfaces would consume too much space in this paper. Text operations require parsing, value mapping, and operational functions, as described earlier.
Text Manipulation Services
Text manipulation services, such as those specified in C programming language standard ISO/IEC 9899:1990, System V Release 4 Multi-National Language Supplement (MNLS), or XPG4 run-time libraries (including character and text element classification functions, string and substring operations, and compression and encryption services) need to be extended to multilingual strings such as Strings(Unicode) and other DIFs, and to various text object class libraries.6,8,13
Vol. 5 No. 3 Summer 1993 Digital Technical Journal
International Distributed Systems-Architectural and Practical Issues
Figure 3 Layout and Rendering Services (diagram labels: PARAMETER, DOCUMENT INTERCHANGE FORMAT, LAYOUT, PAGE DESCRIPTION LANGUAGE, RENDER, FONT SERVER, FONT DATABASE)
Data Type Transformations
Data type transformations (e.g., speech to text, image-to-text optical character recognition [OCR], and handwriting to text) are operations where the data is transformed from a representation of one abstract data type to a representation of another abstract data type. The presentation form transformations T<->T_Presentation_Form and the fundamental input and output services are data type transformations. Care needs to be taken when parameterizing these operations with user preferences to keep the transformation thread-safe. Again, this is best accomplished by keeping the presentation form preferences attached to the data.
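
The thread-safety point can be illustrated with a minimal Python sketch. The names PresentationPrefs, TaggedNumber, and to_presentation_form are hypothetical and do not come from the paper; the only idea taken from the text is that the preferences travel with the data instead of living in shared global state.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentationPrefs:
    """Hypothetical user preferences for presenting a number as text."""
    decimal_sep: str = "."
    group_sep: str = ","


@dataclass(frozen=True)
class TaggedNumber:
    """A value that carries its own presentation preferences."""
    value: float
    prefs: PresentationPrefs


def to_presentation_form(item: TaggedNumber) -> str:
    """T -> T_Presentation_Form using only the preferences attached to the data.

    No global or per-process locale is read or written, so concurrent calls
    with different preferences cannot interfere with each other.
    """
    text = f"{item.value:,.2f}"  # fixed intermediate form, e.g. "1,234,567.89"
    # Swap separators through a placeholder so the replacements do not collide.
    text = text.replace(",", "\0").replace(".", item.prefs.decimal_sep)
    return text.replace("\0", item.prefs.group_sep)


us = PresentationPrefs(decimal_sep=".", group_sep=",")
de = PresentationPrefs(decimal_sep=",", group_sep=".")
items = [TaggedNumber(1234567.891, us), TaggedNumber(1234567.891, de)]

# Two threads format the same value under different preferences.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(to_presentation_form, items))
```

Because both dataclasses are frozen and the function reads no shared state, the two calls can run in any order or interleaving and still produce the same results.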
Encoding Conversions
Encoding conversions (between encoded character sets, DIFs, etc.) are operations where only the representation of a single data type changes. For example, to support Unicode, a system must have for each other encoded character set E a function to_uni: Strings(E) -> Strings(Unicode), which converts the code points in E to code points in Unicode.11 The conversion function to_uni has a partial inverse from_uni: Strings(Unicode) -> Strings(E), which is only defined on those encoded text elements in Unicode that can be expressed as encoded text elements in E. If s is in Strings(E), then from_uni(to_uni(s)) is equal to s. Other encoding conversions Strings(E) -> Strings(E') can be defined as a to_uni operation followed by a from_uni operation, for E and E' respectively. Another class of encoding conversions arises when the character set encoding remains fixed, but the conversion of a document in one DIF to a document in another DIF is required. A third class originates when Unicode or ISO 10646 strings sent over asynchronous communication channels must be converted to a Universal Transmission Format (UTF), thus requiring Strings(Unicode) <-> UTF encoding conversions.
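
As a hedged illustration, the to_uni and from_uni functions can be sketched with Python's built-in codecs. The function names follow the text; the codec names (iso8859_7, cp1253, latin_1) and the use of UTF-8 as the transmission format are assumptions chosen for the example, not choices made by the paper.

```python
def to_uni(s: bytes, e: str) -> str:
    """to_uni: Strings(E) -> Strings(Unicode), with E named by a Python codec."""
    return s.decode(e)


def from_uni(u: str, e: str) -> bytes:
    """Partial inverse of to_uni.

    Raises UnicodeEncodeError for text elements that E cannot express,
    which is how the 'only defined on' condition shows up here.
    """
    return u.encode(e)


def convert(s: bytes, e: str, e_prime: str) -> bytes:
    """Strings(E) -> Strings(E') as to_uni followed by from_uni."""
    return from_uni(to_uni(s, e), e_prime)


greek = "Ελληνικά".encode("iso8859_7")

# Round trip: from_uni(to_uni(s)) is equal to s for this string.
round_trip_ok = from_uni(to_uni(greek, "iso8859_7"), "iso8859_7") == greek

# E -> E' where both sets contain the needed characters.
windows_greek = convert(greek, "iso8859_7", "cp1253")

# E -> E' where the target cannot express the text: the partial inverse fails.
try:
    convert(greek, "iso8859_7", "latin_1")
    latin_defined = True
except UnicodeEncodeError:
    latin_defined = False

# Third class: Unicode string to a transmission format (UTF-8 assumed here).
wire = to_uni(greek, "iso8859_7").encode("utf-8")
```

The round-trip property holds for this single-byte example; encodings that allow several byte sequences for the same text may need extra care.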
Collation or Sorting Services
Another group of computation services, collation or sorting services, sorts lists of strings according to application-specific requirements. These services were discussed earlier in the paper.
Linguistic Services
Linguistic services such as spell checking, grammar checking, word and line breaking, content-based retrieval, translation (when existent), and style checking need standard APIs. Although the implementation of these linguistic services is natural language-specific, most can be implemented with the structure shown in Figure 2.
Also, large character sets such as Unicode and other multilingual structures require a uniform exception-handling and fallback mechanism because of the large number of unassigned code points. For example, a system should be able to uniformly handle exceptions such as "glyph not found for text element." Mechanisms such as global variables for error codes inhibit concurrent programming and therefore should be discouraged. Returning an error code as the return value of the procedure call is preferred, and when supported, raising and handling exceptions is even better.
Product Internationalization
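
A minimal sketch of the exception-and-fallback approach follows. The class names, the FONT table, and the choice of U+FFFD as the fallback glyph are illustrative assumptions; for simplicity each text element is treated as a single code point.

```python
from typing import Dict, List


class TextServiceError(Exception):
    """Base class so callers can handle all text-service failures uniformly."""


class GlyphNotFound(TextServiceError):
    def __init__(self, text_element: str) -> None:
        super().__init__(
            f"glyph not found for text element U+{ord(text_element):04X}"
        )
        self.text_element = text_element


# Hypothetical font: maps a text element to a glyph name.
FONT: Dict[str, str] = {
    "A": "glyph_A",
    "B": "glyph_B",
    "\uFFFD": "glyph_replacement",
}


def glyph_for(text_element: str) -> str:
    """Raise an exception instead of setting a global error variable."""
    try:
        return FONT[text_element]
    except KeyError:
        raise GlyphNotFound(text_element) from None


def render(text: str) -> List[str]:
    """Fallback policy: substitute the replacement glyph and keep going."""
    glyphs = []
    for element in text:
        try:
            glyphs.append(glyph_for(element))
        except GlyphNotFound:
            glyphs.append(FONT["\uFFFD"])
    return glyphs
```

The error travels with the call that caused it, so two threads rendering different strings never overwrite each other's error state, which is the problem with a shared global error code.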
System Naming, Synonyms, and Security
The multilingual aspect of Unicode can simplify system naming of objects and their attributes, e.g., in name services and repositories. Using encoded strings tagged with their encoding type for names is too rigid, because of the high degree of overlap in [...] canonical form in Unicode according to the following definitions. [...] characters followed by their assorted marking characters in some prescribed order. The recommended order is the Unicode "priority value."11,21 The canonical form should have the following property: When c(u) is equal to c(u), the plain text rep [...] strings used for names are desirable, e.g., the absence of special characters and trailing blanks. In a multivendor environment, both the canonical form and the name restrictions should be standardized. The X.500 working groups currently studying this problem plan to achieve comparable standardization.
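
A canonical form c(u) of the kind described above can be sketched in Python, assuming Unicode Normalization Form D from the standard unicodedata module as a modern stand-in. NFD orders combining marks by canonical combining class, which is not necessarily the "priority value" ordering the text refers to. The name restrictions shown (no trailing blanks, no control characters) are likewise hypothetical examples.

```python
import unicodedata


def canonical_form(u: str) -> str:
    """c(u): base characters followed by their marks in a fixed order (NFD here)."""
    return unicodedata.normalize("NFD", u)


def same_name(u: str, v: str) -> bool:
    """Two names match when their canonical forms are equal."""
    return canonical_form(u) == canonical_form(v)


def is_acceptable_name(u: str) -> bool:
    """Hypothetical name restrictions: no trailing blanks, no control characters."""
    if u != u.rstrip(" "):
        return False
    return all(unicodedata.category(ch) != "Cc" for ch in u)


# Precomposed and decomposed spellings of the same text element.
precomposed = "\u00e9"       # e with acute, one code point
decomposed = "e\u0301"       # e followed by combining acute

# The same two marks entered in different orders.
marks_a = "e\u0301\u0323"    # acute, then dot below
marks_b = "e\u0323\u0301"    # dot below, then acute
```

Comparing canonical forms rather than raw code point sequences lets a name service treat these spellings as one name without tagging each string with its origin.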
Since well-chosen names convey useful information, and since such names are entered and displayed in the end user's writing system of choice, it is often desirable for the system to store various translations or "synonyms" for a name. Synonyms, for whatever purpose, should have attributes such as long_name, short_name, language, etc., so that directory functions can provide easy-to-use interfaces. Access to objects or attribute values through synonyms should be as efficient as access by means of the primary name.
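
One way to meet the efficiency requirement is to index synonyms in the same table as primary names. The sketch below is only an assumption about how such a directory might look; the attribute names long_name, short_name, and language come from the text, while the Directory class and the sample names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Synonym:
    long_name: str
    short_name: str
    language: str


class Directory:
    """Hypothetical directory: every name, primary or synonym, is one hash lookup away."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}             # primary name -> object
        self._index: Dict[str, str] = {}               # any known name -> primary name
        self._synonyms: Dict[str, List[Synonym]] = {}  # primary name -> synonyms

    def register(self, primary: str, obj: Any) -> None:
        self._objects[primary] = obj
        self._index[primary] = primary
        self._synonyms[primary] = []

    def add_synonym(self, primary: str, synonym: Synonym) -> None:
        self._synonyms[primary].append(synonym)
        # Index both forms so either resolves as cheaply as the primary name.
        self._index[synonym.long_name] = primary
        self._index[synonym.short_name] = primary

    def lookup(self, name: str) -> Any:
        return self._objects[self._index[name]]

    def display_name(self, primary: str, language: str) -> str:
        for synonym in self._synonyms[primary]:
            if synonym.language == language:
                return synonym.long_name
        return primary  # no synonym in that language: fall back to the primary name


directory = Directory()
directory.register("PRN0042", {"kind": "printer"})
directory.add_synonym("PRN0042", Synonym("Imprimante du deuxième étage", "Imp2", "fr"))
directory.add_synonym("PRN0042", Synonym("Drucker im zweiten Stock", "Dr2", "de"))
```

Lookup by synonym costs one extra dictionary access compared with lookup by primary name, so both stay constant time on average.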
In a global network, public key authentication using a replicated name service is recommended.22 One principal can look up another in the name service by initially using a (possibly meaningless) name for the object in some common character set, e.g., {A-Z,0-9}. Subsequently, the principals can define their own synonyms in their respective languages. Attributes for the principals, such as net [...]
[...] distributed system is somewhat more complicated than for a monolingual system. The following is a partial list of the services that must be provided:
• Services for various monolingual subsystems
• Registration services for user preferences, locales, user-defined text elements, formats, etc.
• Both multilingual and multiple monolingual run-time libraries, simultaneously (see Figure 2)
• Multilingual database servers, font servers, logging and queuing mechanisms, and directory services
• Multilingual synonym services
• Multilingual diagnostic services
Since a system cannot provide all the services for every possible situation, registering the end users' needs and the system's capabilities in a global name service is essential. The name service must be configured so that a multilingual server can identify the language preferences of the clients that request services. This configuration allows the servers to tag or convert data from the client without the monolingual client's active participation. Therefore, the name service database must be updated with the necessary preference data at client installation time.
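
The registration flow can be sketched as follows. NameService, ClientPrefs, receive, and send are hypothetical names, and the in-memory dictionary stands in for a real replicated name service; the sketch only shows the server converting data on behalf of a client that knows nothing about Unicode.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ClientPrefs:
    language: str   # language tag registered for the client
    encoding: str   # name of the client's encoded character set (a Python codec)


class NameService:
    """In-memory stand-in for a global name service."""

    def __init__(self) -> None:
        self._prefs: Dict[str, ClientPrefs] = {}

    def register_client(self, client_id: str, prefs: ClientPrefs) -> None:
        # Done once, at client installation time.
        self._prefs[client_id] = prefs

    def prefs_for(self, client_id: str) -> ClientPrefs:
        return self._prefs[client_id]


def receive(ns: NameService, client_id: str, payload: bytes) -> Tuple[str, str]:
    """Server side: tag incoming bytes with a language and convert them to Unicode."""
    prefs = ns.prefs_for(client_id)
    return prefs.language, payload.decode(prefs.encoding)


def send(ns: NameService, client_id: str, text: str) -> bytes:
    """Server side: convert Unicode text back to the client's own encoding."""
    return text.encode(ns.prefs_for(client_id).encoding)


ns = NameService()
ns.register_client("pc-athens", ClientPrefs(language="el", encoding="iso8859_7"))

raw = "Καλημέρα".encode("iso8859_7")          # what the monolingual client sends
language, text = receive(ns, "pc-athens", raw)
reply = send(ns, "pc-athens", text)
```

The client sends and receives plain bytes in its own character set; all knowledge of its preferences lives in the name service entry.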
Typically, system managers for different parts of the system are monolingual end users (see Figure 1) who need to do their job from a standard PC. Thus, both the normal and the diagnostic management interfaces to the system must behave as multilingual servers, sending error codes back to the PC to be interpreted in the local language. Although the quality of the translation of an error message is not an architectural issue, translations at the system management level are generally poor, and the system design should account for this. Systems developers should consider giving both an English and a local-language error message as well as giving easy-to-use pointers into local-language reference manuals.
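
A small sketch shows how a PC-side client might interpret an error code sent by the server. The error codes, message texts, and manual pointers are invented for illustration; only the idea of showing English and local-language text together with a manual reference comes from the paragraph above.

```python
from typing import Dict, Tuple

# Hypothetical message catalog keyed by error code, then by language tag.
MESSAGES: Dict[str, Dict[str, str]] = {
    "E_QUEUE_FULL": {
        "en": "Print queue is full.",
        "fr": "La file d'attente d'impression est pleine.",
    },
    "E_NO_FONT": {"en": "Font server is unreachable."},
}

# Hypothetical pointers into local-language reference manuals.
MANUAL_POINTERS: Dict[Tuple[str, str], str] = {
    ("E_QUEUE_FULL", "fr"): "Guide de l'administrateur, chapitre 7",
}


def present_error(code: str, local_language: str) -> str:
    """Build the text shown to the system manager for one error code."""
    translations = MESSAGES.get(code, {})
    english = translations.get("en", code)   # unknown code: show the code itself
    lines = [f"{code}: {english}"]
    local = translations.get(local_language)
    if local is not None and local_language != "en":
        lines.append(f"{code}: {local}")
    pointer = MANUAL_POINTERS.get((code, local_language))
    if pointer is not None:
        lines.append(f"See: {pointer}")
    return "\n".join(lines)
```

Showing the English text alongside the translation gives support staff a stable string to search for even when the local translation is poor or missing.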
Data errors will occur more frequently because of the mixtures of character sets in the system, and attention to the identification of the location and error type is important. Logging to capture offending text and the operations that generated it is desirable.
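
As a hedged example of such logging, the sketch below records the location, error type, offending bytes, and originating operation when a conversion fails. The function name, the report fields, and the logger name are assumptions; the attributes read from UnicodeDecodeError (start, end, reason) are part of standard Python.

```python
import logging
from typing import Dict, Optional, Tuple

logger = logging.getLogger("text_data_errors")


def decode_with_report(
    payload: bytes, encoding: str, operation: str
) -> Tuple[Optional[str], Optional[Dict[str, object]]]:
    """Decode payload, or return a report describing where and why it failed."""
    try:
        return payload.decode(encoding), None
    except UnicodeDecodeError as err:
        report = {
            "operation": operation,                     # what generated the data
            "encoding": encoding,
            "offset": err.start,                        # location of the error
            "error_type": err.reason,                   # kind of error
            "offending": payload[err.start:err.end],    # the bytes themselves
        }
        logger.error("data error: %s", report)
        return None, report


good_text, good_report = decode_with_report(b"abc", "utf-8", "import_names")
bad_text, bad_report = decode_with_report(b"abc\xff", "utf-8", "import_names")
```

Keeping the offending bytes and the operation name in one record makes it possible to trace a mixed-character-set error back to the component that produced it.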
Incremental Internationalization
Multilingual systems and international components can be built incrementally. Probably the most powerful approach is to provide the services to support multiple monolingual subsystems. Even new operating systems, such as the Windows NT system, that use Unicode internally need mechanisms for such support.23 Multidimensional improvements in a system's ability to support an increasing number of variations are possible. Some such improvements are making more servers multilingual, supporting more multilingual data and end-user preferences, supporting more sophisticated text elements (the first release of the Windows NT operating system will not support Unicode's joiners), as well as adding more character set support, locales, and user-defined text elements. The key point is that, like safe programming practices, multilingual support in a distributed system is not an "all-or-nothing" endeavor.
Summary
Customer demand for multilingual distributed systems is increasing. Suppliers must provide systems without incurring the costs of expensive reengineering. This paper gives an overview of the architectural issues and programming practices associated with implementing these systems.
Modularity both in systems and in run-time libraries allows greater reuse of components and incremental improvements with regard to internationalization. Using the suggested safe software practices can lower reengineering and maintenance costs and help avoid costly redesign problems. Providing multilingual services to monolingual subsystems permits incremental improvements while at the same time lowering costs through increased reuse. Finally, the registration of synonyms, user preferences, locales, and services in a global name service makes the system cohesive.
Acknowledgments
I wish to thank Bob Ayers (Adobe), Joseph Bosurgi (Univel), Asmus Freytag (Microsoft), Jim Gray (Digital), and Jan te Kiefte (Digital) for their helpful comments on earlier drafts. A special thanks to Digital's internationalization team, whose contributions are always understated. In addition, I would like to acknowledge the Unicode Technical Committee, whose impact on the industry is profound and growing; I have learned a great deal from following the work of this committee.
References
1. D. Carter, Writing Localizable Software for the Macintosh (Reading, MA: Addison-Wesley, 1991).
2. Producing International Products (Maynard, MA: Digital Equipment Corporation, 1989). This internal document is unavailable to external readers.
3. Digital Guide to Developing International Software (Burlington, MA: Digital Press, 1991).
4. S. Martin, "Internationalization Made Easy," OSF White Paper (Cambridge, MA: Open Software Foundation, Inc., 1991).
5. S. Snyder et al., "Internationalization in the OSF DCE-A Framework," May 1991. This document was an electronic mail message transmitted on the Internet.
6. X/Open Portability Guide, Issue 3 (Reading, U.K.: X/Open Company Ltd., 1989).
7. X/Open Internationalization Guide, Draft 4.3 (Reading, U.K.: X/Open Company Ltd., October 1990).
8. UNIX System V Release 4, Multi-National Language Supplement (MNLS) Product Overview (Japan: American Telephone and Telegraph, 1990).
9. Information Technology-Universal Coded Character Set (UCS), Draft International Standard, ISO/IEC 10646 (Geneva: International Organization for Standardization/International Electrotechnical Commission, 1990).
10. A. Nakanishi, Writing Systems of the World, third printing (Rutland, Vermont, and Tokyo, Japan: Charles E. Tuttle Company, 1988).
11. The Unicode Consortium, The Unicode Standard-Worldwide Character Encoding, Version 1.0, Volume 1 (Reading, MA: Addison-Wesley, 1991).
12. R. Haentjens, "The Ordering of Universal Character Strings," Digital Technical Journal, vol. 5, no. 3 (Summer 1993, this issue): 43-52.
13. Programming Languages-C, ISO/IEC 9899:1990(E) (Geneva: International Organization for Standardization/International Electrotechnical Commission, 1990).
14. S. Martin and M. Mori, Internationalization in OSF/1 Release 1.1 (Cambridge, MA: Open Software Foundation, Inc., 1992).
15. J. Becker, "Multilingual Word Processing," Scientific American, vol. 251, no. 1 (July 1984): 96-107.
16. Coded Character Sets for Text Communication, Parts 1 and 2, ISO/IEC 6937 (Geneva: International Organization for Standardization/International Electrotechnical Commission, 1983).
17. J. Bettels and F. Bishop, "Unicode: A Universal Character Code," Digital Technical Journal, vol. 5, no. 3 (Summer 1993, this issue): 21-31.
18. Go Computer Corporation, "Compaction Techniques," Second Unicode Implementors' Conference (1992).
19. J. Becker, "Re: Updated [Problems with] Unbound (Open) Repertoire Paper" (January 18, 1991). This electronic mail message was sent to the Unicode mailing list.
20. V. Joloboff and W. McMahon, X Window System, Version 11, Input Method Specification, Public Review Draft (Cambridge, MA: Massachusetts Institute of Technology, 1990).
21. M. Davis, (Taligent) correspondence to the Unicode Technical Committee, 1992.
22. M. Gasser et al., "Digital Distributed Security Architecture" (Maynard, MA: Digital Equipment Corporation, 1988). This internal document is unavailable to external readers.
23. H. Custer, Inside Windows NT (Redmond, WA: Microsoft Press, 1992).