

Problems and Solutions to Manage Non-Western Languages in a Multilingual Database and in the World Wide Web

Mitsuo Matsumoto¹ and Satoshi Tsuyuki²

1 Forest Management Division, Forestry and Forest Products Research Institute, P.O. Box 15, Tsukuba Norin, Ibaraki, 305-8687, Japan

machan@ffpri.affrc.go.jp

2 Department of Global Agricultural Sciences, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan

tsuyuki@fr.a.u-tokyo.ac.jp

Abstract

SilvaVoc, IUFRO’s clearing-house for terminology, is building a forest terminology database and plans to publish it on the Internet. Non-Western languages such as Japanese, however, pose problems in multilingual databases, and present Web browsers have limited multilingual capabilities. For example, Web browsers for English cannot display French and Japanese in the same window. New technologies such as Unicode and HTML version 4.0 have been developed to solve these problems. We propose the following solutions for managing multilingualism in a terminology database and on the Internet:

- to treat non-Western characters as images,

- to display the characters using Java applets on Web browsers,

- and to apply HTML 4.0 with Unicode.

The choice among these methods depends on when the system must be completed. If there is enough time to wait until HTML 4.0 comes into widespread use, a combination of HTML 4.0 and Unicode will be the best solution. However, one of the other methods should be chosen if this wait is not feasible.

Keywords: multilingualism, Internet, SilvaVoc, HTML, Unicode, Java, World Wide Web

1 Introduction

SilvaVoc, IUFRO’s clearing-house for terminology, is building a multilingual forest terminology database and plans to publish it on the Internet. However, there are several problems in developing the database and the World Wide Web publishing system, including the management and handling of non-Western languages on the Internet. This report examines the problems in handling such languages and proposes appropriate solutions.

2 Background

Building a multilingual database and publishing it on the Internet requires careful management of character codes and fonts. If the wrong code or font is chosen for a text, incorrect characters are displayed on the computer monitor. Even if the correct code is chosen, incorrect characters may still be displayed if the computer does not have the proper fonts.

Moreover, HTML 3.2, the most popular language for Web pages today, has no fundamental support for multilingualism.


2.1 Character Codes and Character Sets

Generally speaking, the characters used in Western languages can be mapped to a single byte, and ISO has standardized one-byte character sets for them. For example, ISO 8859-1 is a basic character set that covers English, German, French, Spanish, Italian, Portuguese and so on. ISO 8859-2 covers Hungarian, Romanian, etc., ISO 8859-5 covers Russian, and ISO 8859-6 covers Arabic.

On the other hand, non-Western languages such as Japanese, Chinese and Korean usually have many characters. Nearly 8,000 characters are used in Japanese, for example, so one byte is not enough to represent them, and two-byte code systems are used instead. In addition, Japanese has several coding systems, such as JIS, Shift-JIS and EUC-JP, and this causes another problem when handling the language on computers: text read with the wrong coding system is displayed incorrectly.
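The effect of having several Japanese coding systems can be illustrated with a minimal Python sketch. It assumes that Python's standard codecs `iso2022_jp`, `shift_jis` and `euc_jp` are acceptable stand-ins for the JIS, Shift-JIS and EUC-JP systems named above; the sample word and the helper names are chosen only for illustration.

```python
# Illustration only: one Japanese word under three coding systems.
SAMPLE = "\u65e5\u672c\u8a9e"  # "Japanese language"; sample word assumed for this sketch

# Python codec names assumed to correspond to JIS, Shift-JIS and EUC-JP.
CODECS = {"JIS": "iso2022_jp", "Shift-JIS": "shift_jis", "EUC-JP": "euc_jp"}


def encode_all(text):
    """Return the byte sequence of `text` under each coding system."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}


def decode_as(data, codec):
    """Decode bytes with `codec`, marking undecodable bytes with U+FFFD."""
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    encoded = encode_all(SAMPLE)
    for name, data in encoded.items():
        print(name, data.hex())
    # Reading Shift-JIS bytes as if they were EUC-JP does not give the word back.
    print(repr(decode_as(encoded["Shift-JIS"], "euc_jp")))
```

The same word yields three different byte sequences, so a program has to know which coding system a text uses before it can display the text correctly.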

2.2 Unicode

Unicode is a universal character code and character set that covers non-Western scripts such as Japanese, Chinese, Korean and Arabic. In the latest version, Unicode 2.0, the length of a character code is variable: a character is coded in either two or four bytes.

The character sets are called UCS-2 and UCS-4 (Universal Character Set, two-byte and four-byte forms). UTF-8 (UCS Transformation Format, 8-bit) is an encoding method that makes UCS compatible with ASCII. Thus, even old operating systems and software that do not support Unicode can display ASCII characters when the text is encoded in UTF-8.
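The ASCII compatibility of UTF-8 can be checked with a short Python sketch. The helper names and sample characters are assumptions made for this illustration; `utf-16-be` is used here only to show the two-byte and four-byte code lengths mentioned above.

```python
def utf8_is_ascii_compatible(text):
    """True when UTF-8 and ASCII produce identical bytes for ASCII-only `text`."""
    return text.encode("utf-8") == text.encode("ascii")


def byte_lengths(char):
    """Bytes needed for one character in UTF-8 and in big-endian UTF-16."""
    return len(char.encode("utf-8")), len(char.encode("utf-16-be"))


if __name__ == "__main__":
    print(utf8_is_ascii_compatible("forest"))
    # A Latin letter, a kanji, and a character outside the two-byte range.
    for ch in ("A", "\u6797", "\U00020000"):
        print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

Because ASCII text is byte-for-byte identical in UTF-8, software that only understands ASCII still displays such text correctly.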

The latest Web browsers, such as Netscape Navigator 4 and Internet Explorer 4, partially support Unicode. For example, UTF-8 can be chosen as a character set in Netscape Navigator 4.01. Furthermore, the next versions of the major operating systems, such as Windows 98, Windows NT 5.0 and MacOS 8.5, will support Unicode, according to announcements from the respective companies.

2.3 Present Internet Technology

Another problem is the present state of Internet technology. HTML 3.2, currently the most popular version of HTML, has no fundamental support for multilingualism. For example, it can specify fonts but has no means of specifying the language of a text. In addition, Web browsers can use only one character set per window. This means, for example, that German and Japanese cannot be displayed together in one window.
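The one-character-set-per-window limitation can be made concrete with a small Python sketch. It assumes Python's `latin-1` codec for ISO 8859-1 and `shift_jis` for Shift-JIS; the sample string and the helper `can_encode` are hypothetical and serve only as an illustration.

```python
def can_encode(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# German and Japanese in one string (sample text assumed for this sketch).
MIXED = "Gr\u00fc\u00dfe / \u65e5\u672c\u8a9e"

if __name__ == "__main__":
    for codec in ("latin-1", "shift_jis", "utf-8"):
        print(codec, can_encode(MIXED, codec))
```

Neither the Western nor the Japanese character set can represent the whole string, whereas a Unicode encoding such as UTF-8 can.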

3 Solutions

To support the simultaneous display of Western and non-Western characters in a terminology database and its use on the Internet, we consider the following solutions: (1) the character image method, (2) the Java applet with Unicode method and (3) the HTML 4.0 with Unicode method. Each solution is described below.


3.1 Character Image Method

The character image method treats characters as images by converting character codes to raster images. Its advantages are that it is already available and that it is independent of operating systems, computers, Web browsers and font data. Japanese characters can therefore be displayed even on computers with no Japanese fonts. In addition, the image data can be managed easily in databases. On the other hand, search results cannot be reused by “cut & paste” because they are images, not character codes. A routine for converting character codes to raster images is also necessary.

Figure 1 shows a flowchart of the character image method. First, a user enters a word, or selects a word from a list, to query in a Web browser. Next, the query is sent to the Web server, which passes it to the database server via the Common Gateway Interface (CGI). The database returns the result to the Web server; in this system the result is a raster image. The result is then returned to the Web browser, which shows the word as a raster image in the window. The main point of this method is that images are handled instead of character codes.

A more advanced form of character imaging, called a delegate system, is also available. It uses a delegate server to convert character codes to raster images, as shown in Figure 2. The database server and the Web server manage characters as character codes. When the Web server sends the query results, the delegate server automatically converts the character codes into raster images, which are then displayed in the browser. Unlike the basic system, the delegate system does not require the development of a code-to-image conversion routine.
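The basic system can be sketched in Python as follows. This is only an illustration under stated assumptions: the raster images are assumed to have been rendered in advance, and the URL prefix `IMAGE_DIR`, the code-point file naming scheme, the GIF format and both helper functions are hypothetical, not part of the system described here.

```python
from html import escape

IMAGE_DIR = "/images/terms"  # hypothetical URL prefix for pre-rendered images


def image_name(term):
    """Derive an ASCII-only file name from the term's code points (assumed scheme)."""
    return "-".join(f"{ord(ch):04x}" for ch in term) + ".gif"


def result_fragment(term, gloss):
    """HTML fragment for a query result: the term itself appears only as an image."""
    src = f"{IMAGE_DIR}/{image_name(term)}"
    return f'<img src="{src}" alt="{escape(gloss, quote=True)}">'


if __name__ == "__main__":
    # A two-character Japanese term with an English gloss as alternative text.
    print(result_fragment("\u68ee\u6797", "forest"))
```

The fragment sent to the browser contains only ASCII characters, which is why the method works without Japanese fonts or character code support on the client, and also why the displayed word cannot be copied as text.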

3.2 Java Applet with Unicode Method

This method requires a purpose-built Java applet that can display non-Western characters using Unicode in Web browsers. A Java applet is a program that runs in the Java environment of a Web browser. Figure 3 shows the system for the Java applet with Unicode method. It differs from the previous method in the role of the Web server and in the Java applet running in the Web browser. The Web server sends the Java applet, the character codes and, when needed, the font data to the browser. The Java applet handles the characters in Unicode and displays them in the browser.

[Figure 1: query and result flow between the browser (client PC), the WWW server (Web page), CGI and the DB server (database); the result is returned as an image.]

Fig. 1. Character image method: basic system.

[Figure 2: the same flow with a delegate between the browser and the WWW server; the DB server returns the result as a code, and the delegate delivers it to the browser as an image.]

Fig. 2. Character image method: delegate system.


[Figure 3: query and result flow between the browser with a Java applet (client PC), the WWW server (Web page), CGI and the DB server (database); the result is returned as a code.]

Fig. 3. Java applet with Unicode method.

[Figure 4: query and result flow between the browser (UTF-8, client PC), the WWW server (HTML 4.0, Unicode), CGI and the DB server (database); the result is returned as a code.]

Fig. 4. HTML 4.0 with Unicode method.

The advantages of this method are that Java applets make multilingualism possible immediately, even in HTML 3.2 environments, and that various languages can be displayed in one browser window. On the other hand, it requires a purpose-built Java applet.

3.3 HTML 4.0 with Unicode Method

This method develops Web pages in an HTML 4.0 environment with Unicode. Figure 4 shows a flowchart of the HTML 4.0 with Unicode method. In this system, the Web server returns character codes in Unicode and, when needed, the font data.

HTML 4.0 is designed for multilingualism. For example, it supports Unicode and provides a language attribute on tags to specify the language of their content. However, HTML 4.0 was standardized only in December 1997 and is still not widely used. Therefore, few Web pages have been developed in HTML 4.0 with Unicode. It may take a few years for this technology to come into popular use.
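A minimal Python sketch of such a page follows. It assumes the HTML 4.0 `lang` attribute and a `meta` declaration of UTF-8 as the character set; the function name, page title and sample entries are hypothetical and only illustrate the idea.

```python
from html import escape


def multilingual_page(entries):
    """Build an HTML 4.0 page; `entries` is a list of (language code, text) pairs."""
    rows = "\n".join(
        f'<p lang="{escape(lang, quote=True)}">{escape(text)}</p>'
        for lang, text in entries
    )
    return (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">\n'
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        "<title>Term</title>\n</head>\n<body>\n"
        f"{rows}\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    # German and Japanese entries on the same page (sample data for this sketch).
    page = multilingual_page([("de", "Wald"), ("ja", "\u68ee\u6797")])
    data = page.encode("utf-8")  # bytes the Web server would send
    print(len(data), "bytes")
```

Each paragraph declares its own language while the whole page shares one Unicode encoding, so German and Japanese can appear in the same window.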

4 Conclusion

Three possible solutions have been examined. The question is which method should be chosen to allow the simultaneous display of Western and non-Western languages in a terminology database. The answer depends on the completion schedule and on the spread of new technologies such as HTML 4.0 and Unicode. If the database must be completed in a few weeks, the only solution is the character image method. If a few months are available, Java applets are better. If we can wait a few years, the method combining HTML 4.0 with Unicode will be the best solution. If the spread of HTML 4.0 and Unicode is delayed, however, the other two methods will have to be considered. In all cases except the basic character image method, the database structure is the same. Therefore, after the database is completed, a Web publishing method will have to be chosen according to the state of the Internet at that time.


5 Related Web Sites

- WWW Consortium; HTML 4.0:

<http://www.w3.org/International/>

- Unicode Inc.; Unicode 2.0, 3.0:

<http://www.unicode.org/>

- The following URL shows Japanese characters on any computer using a delegate server:

<http://www.lfw.org/shodouka/>
