
Umesh Chamling, Rupesh Rai, Rajesha N., Manasa G., Dr. Narayan Choudhary, Dr. L. Ramamoorthy

13.1 INTRODUCTION

Nepali is one of the official languages of the states of West Bengal and Sikkim and one of the 22 scheduled languages of India. It is spoken in most of the north-eastern states of India and also in other states such as Delhi, Uttaranchal, Uttar Pradesh, Bihar and Jharkhand. Nepali is also an official language of Nepal, and about a quarter of the population of Bhutan speaks Nepali. Nepali is written in the Devanagari script, also called Nagari, which runs from left to right. The Nagari script has its roots in the ancient Brāhmī script family and has long been used traditionally by religiously educated people in South Asia. The Devanagari script is used for over 120 languages, including Nepali, Hindi, Marathi, Bhojpuri and Maithili. It is closely related to the Nandinagari script commonly found in numerous ancient manuscripts of South India. The script is also used to write several minority languages of the Nepali community, such as Gurung, Magar, Bhujel and Thami.

The Nepali text corpus was collected from various libraries of Darjeeling, Sikkim, Assam and Uttaranchal, mostly in Kurseong, Mirik, Kalimpong, Silgadhi, Gangtok, Guwahati, Almora and Mussoorie. The greater part of the text was taken from Desbandhu District Library, Darjeeling; Sonada Library; Sarbajanik Sammelan Rural Library, Mirik; Sub Divisional Library, Kalimpong; and the NERLC (North-East Regional Language Centre, Guwahati) Library. LDC-IL tried to cover every category in its standard list. Some categories, such as novels and short stories, have a huge number of books, but others, such as Physics, Chemistry, Economics, Agriculture and Photography, have very few. Literary texts are easily available in Nepali, but scientific and technical texts are very difficult to obtain. Texts in some categories, such as Epigraphy, Finance and Oceanology, are very rare in Nepali.

13.2 PECULIARITIES OF NEPALI TEXT

The corpus of Nepali text can be broadly classified into two types: literary text and non-literary text. The two differ clearly in the frequency of word usage and in the variety they bring into the corpus. Literary texts are narrative texts that contain elements of fiction; novels, short stories and plays are examples. Non-literary texts are texts whose primary purpose is to convey information; examples are texts on various scientific or technical subjects, legal documents and articles in academic journals. In literary text, the language has emotional elements, cultural information, dialectal variation, ambiguity etc., whereas technical or scientific terms, foreign words etc. appear widely in non-literary texts.

13.2.1 Orthographic variation and eyelash ‘ra’ in Nepali

A glyph has no intrinsic meaning; it conveys distinctions in form. From time to time, users or developers have made small variations in the Devanagari script, and the same changes have come into Nepali. These affected अ, झ, ण and श.

This is not a feature unique to Nepali, but it brought small changes to the use of the Nepali orthographic system. We faced problems while inputting data from many texts.

Besides that, Nepali has its own typical orthography, called ‘Shaja’ [a publisher from Lalitpur, Kathmandu, Nepal], for /dz/ ‘jha’; no other users of the Devanagari script have the same.

Linguistic Resources for AI/NLP in Indian Languages 105

Nepali has the eyelash /r/ ‘ra’. This is ‘ra’ with halanta (र्), or half ‘ra’ (र). It has its own Unicode representation, and there are more than three ways to type eyelash ‘ra’.
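Because eyelash ‘ra’ can be typed in several ways, the same word may reach the corpus as different code point sequences. The sketch below is a hypothetical normalization step, not part of the documented LDC-IL pipeline. It assumes the two sequences that the Unicode Standard describes for eyelash ‘ra’, namely RRA + VIRAMA (U+0931 U+094D) and RA + VIRAMA + ZERO WIDTH JOINER (U+0930 U+094D U+200D), and maps the second onto the first. The function names and the choice of canonical form are assumptions made for illustration.

```python
# Hypothetical helper: unify two Unicode encodings of eyelash 'ra'.
RA, RRA, VIRAMA, ZWJ = "\u0930", "\u0931", "\u094d", "\u200d"

EYELASH_CANONICAL = RRA + VIRAMA      # assumed canonical form (U+0931 U+094D)
EYELASH_VARIANT = RA + VIRAMA + ZWJ   # alternative form (U+0930 U+094D U+200D)


def normalize_eyelash_ra(text: str) -> str:
    """Replace RA + VIRAMA + ZWJ with RRA + VIRAMA; other text is unchanged."""
    return text.replace(EYELASH_VARIANT, EYELASH_CANONICAL)


def count_eyelash_ra(text: str) -> int:
    """Count eyelash 'ra' occurrences, whichever of the two encodings was typed."""
    return normalize_eyelash_ra(text).count(EYELASH_CANONICAL)
```

A plain RA + VIRAMA without a joiner is left untouched, since it normally renders as the repha or a conjunct rather than the eyelash form.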

13.2.2 Transliterations in LDC-IL Nepali text corpus

For easy reference and uniformity of metadata, some entries in the metadata file, namely ‘Title’, ‘Headline’, ‘Author’, ‘Editor’ and ‘Translator’, are transliterated from Nepali to Roman letters.

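The LDC-IL transliteration scheme from Nepali to Roman is given below.

LDC-IL Transliteration Schema: Nepali characters to Roman

Vowels and Vowel Signs

| Vowel | Vowel sign | Roman |
|---|---|---|
| अ | | a |
| आ | ा | A |
| इ | ि | i |
| ई | ी | I |
| उ | ु | u |
| ऊ | ू | U |
| ऋ | ृ | x |
| ए | े | e |
| ऐ | ै | ai |
| ओ | ो | o |
| औ | ौ | au |
| अं | ं | M |
| अः | ः | H |

Consonants

| Nepali | Roman | Nepali | Roman | Nepali | Roman | Nepali | Roman | Nepali | Roman |
|---|---|---|---|---|---|---|---|---|---|
| क | ka | ख | kha | ग | ga | घ | gha | ङ | ng'a |
| च | ca | छ | cha | ज | ja | झ | jha | ञ | nj'a |
| ट | Ta | ठ | Tha | ड | Da | ढ | Dha | ण | Na |
| त | ta | थ | tha | द | da | ध | dha | न | na |
| प | pa | फ | pha | ब | ba | भ | bha | म | ma |
| य | ya | र | Ra | ल | la | व | va | श | sha |
| ष | Sa | स | sa | ह | ha | ड़ | La | ढ़ | Za |
| ഺ | TTTa | | | | | | | | |

Eyelash ra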
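The scheme above can be applied mechanically to a metadata entry. The following is a minimal sketch, assuming a simple character-by-character algorithm: a consonant drops its inherent ‘a’ when a vowel sign or halanta follows. The function name `transliterate`, the handling of halanta, and the pairing of ऋ with `x` (read off the order of the Roman row) are assumptions; the actual LDC-IL tooling is not described in this chapter. Only a subset of the table is covered, and no schwa deletion is attempted.

```python
# Illustrative sketch of the LDC-IL Nepali-to-Roman scheme (algorithm assumed).
VIRAMA, NUKTA = "\u094d", "\u093c"

VOWELS = {"अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u", "ऊ": "U",
          "ऋ": "x", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au"}
SIGNS = {"\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
         "\u0942": "U", "\u0943": "x", "\u0947": "e", "\u0948": "ai",
         "\u094b": "o", "\u094c": "au"}
MARKS = {"\u0902": "M", "\u0903": "H"}  # anusvara, visarga
DIGITS = {chr(0x0966 + d): str(d) for d in range(10)}  # ० .. ९
CONSONANTS = {
    "क": "ka", "ख": "kha", "ग": "ga", "घ": "gha", "ङ": "ng'a",
    "च": "ca", "छ": "cha", "ज": "ja", "झ": "jha", "ञ": "nj'a",
    "ट": "Ta", "ठ": "Tha", "ड": "Da", "ढ": "Dha", "ण": "Na",
    "त": "ta", "थ": "tha", "द": "da", "ध": "dha", "न": "na",
    "प": "pa", "फ": "pha", "ब": "ba", "भ": "bha", "म": "ma",
    "य": "ya", "र": "Ra", "ल": "la", "व": "va", "श": "sha",
    "ष": "Sa", "स": "sa", "ह": "ha", "\u095c": "La", "\u095d": "Za",
}
OTHER = {**VOWELS, **MARKS, **DIGITS}


def transliterate(text: str) -> str:
    """Transliterate Devanagari text to Roman; unknown characters pass through."""
    # Fold decomposed nukta letters (letter + U+093C) into single code points.
    text = text.replace("\u0921" + NUKTA, "\u095c").replace("\u0922" + NUKTA, "\u095d")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch][:-1])  # consonant without inherent 'a'
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in SIGNS:        # explicit vowel sign replaces inherent 'a'
                out.append(SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:     # halanta suppresses the vowel
                i += 1
            else:                   # keep inherent 'a'
                out.append("a")
        else:
            out.append(OTHER.get(ch, ch))
        i += 1
    return "".join(out)
```

For example, `transliterate("नेपाली")` gives `nepAlI`. Keeping the mapping in plain dictionaries makes it easy to compare against the printed table and to correct an entry.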

106 Nepali Raw Text Corpus

13.3 DATA SAMPLING NOTES

13.3.1 Principles of Data Sampling

Nepali text data sampling strictly followed the generic guidelines of LDC-IL text corpus collection which are noted in the generic LDC-IL corpus documentation.

13.3.2 Field Works Undertaken

The Nepali text corpus was collected from various libraries of Darjeeling, Sikkim, Assam and Uttaranchal. The text materials were collected during five field trips undertaken between 2009 and 2012.

The greater part of the text was taken from Khappandas Memorial Library, Soureni Busty, Mirik; Sub Divisional Library, Kalimpong; North Bengal University Library, Darjeeling; and various public libraries.

Overall, the following libraries served as the source of the Nepali text corpus:

• Mirik Sarbajanik Sammelan Rural Library, Mirik

• Garidhura Public Library, Kurseong

• Sub Divisional Library, Kalimpong

• Nava Yowak Sangha Rural Library, Rungbull

• Gorkha Jana Pustakalaya, Kurseong

• Khappandas Memorial Library, Soureni Busty, Mirik

• Pankhabari Public Library, Pankhabari

• North Bengal University Library, Darjeeling

• Kurseong College Library, Kurseong

• Desbandhu District Library, Darjeeling

• Devkota Sangh Pustakalaya, Silgadhi

• NERLC Library, Assam

• Central Institute of Indian Language Library, Mysore

• Personal Collections.

The collected text materials were published at various places within Darjeeling and in other states of India such as Sikkim, Assam, Manipur, Meghalaya, Arunachal Pradesh, Nagaland, Uttar Pradesh, Uttaranchal, Himachal Pradesh, Delhi, Bihar, Andhra Pradesh and Karnataka, as well as in other countries such as Nepal, Russia and Denmark.

An attempt has been made to cover every category in the standard list. Some categories, such as novels and short stories, have a huge number of books, but others, such as physics, chemistry and economics, have very few. Literary texts are easily available in Nepali, but scientific texts are very difficult to obtain. Texts in some categories, such as epigraphy, finance and oceanology, are very rare in Nepali.

Numerals (Devanagari to Hindu-Arabic)

० १ २ ३ ४ ५ ६ ७ ८ ९

0 1 2 3 4 5 6 7 8 9

Collecting text data in the field is a difficult job. Most libraries do not allow a large amount of text to be taken from their shelves at one time, because it is against their rules; for a given period, they issue a maximum of three or four books. Even when a librarian allowed many books to be taken at once, the photocopy kiosk was a bottleneck because of long queues.

Sometimes the photocopy attendants refused to copy randomly selected pages because of the waiting queue, as turning to scattered pages takes them more time than the continuous photocopying they are accustomed to. Another issue was that the field worker/linguist had to carry huge bundles of photocopies, which was often cumbersome to travel with.

Despite all these issues, the linguists working on the data collection had to cope and keep going.

13.3.3 Data Inputting

All the text was typed in Unicode using the InScript keyboard directly into the XML files. The data was input by Ms. Srilakshmi M P, Sithalakshmi M L, Vanamala B H, Rajeshwari R, Vidhyashree M, Padmashree H R, Radhika M and Mamatha, all native speakers of Kannada and Tamil but sufficiently familiar with the Devanagari script.

13.3.4 Proofreading

The Nepali text data has been proofread by internal and external resource persons. We conducted corpus normalization workshops with external resource persons from 4th June to 15th July 2010, 3rd January to 28th February 2013, 5th August to 4th October 2013, and 10th November to 7th January 2015. The text has always been kept true to the printed material; only typos introduced at the time of typing have been corrected.

The printed materials collected for the corpus are contemporary, mainly published after 1990.

13.3.5 Data Extracted from Web Sites

The Nepali news corpus data was extracted from the news websites of "Nepalsamachar Patra" (http://pknewspapers.com/) and "Gorkhapatra" (http://gorkhapatraonline.com/). The news content was categorized based on the content of the text and archived. The news corpus covers the period from 30 Jan 2009 to 11 Sep 2009.

13.4 OVERVIEW OF REPRESENTED DOMAINS

The LDC-IL Nepali Text Corpus contains 70,57,524 words (46879154 characters) drawn from 1,347 different titles, including extracts from newspapers. The data falls into two classes: typed+cleaned and crawled. The crawled data was taken mainly from news websites and archived using the standard LDC-IL text corpus preparation process.

The following table gives a summary of the typed and crawled text of the Nepali Raw Text Corpus.

| Text Type | Word Count | Keystroke/Character Count |
|---|---|---|
| Typed+Cleaned | 6787918 | 45104255 |
| Crawled | 269606 | 1774899 |
| Total | 7057524 | 46879154 |

Table 13-1 Representation of the Domains in Nepali Text Corpus
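As a quick consistency check on the summary figures, the typed and crawled counts can be added up and compared with the stated totals. The sketch below uses only the numbers from the table; the helper `indian_grouping`, which prints a number with the lakh/crore digit grouping used in "70,57,524", is a hypothetical convenience function and not part of any LDC-IL tool.

```python
# Figures copied from the summary table; names are illustrative only.
corpus = {
    "Typed+Cleaned": {"words": 6787918, "chars": 45104255},
    "Crawled": {"words": 269606, "chars": 1774899},
}

total_words = sum(row["words"] for row in corpus.values())
total_chars = sum(row["chars"] for row in corpus.values())


def indian_grouping(n: int) -> str:
    """Format a non-negative integer with Indian digit grouping (3, then 2s)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:            # peel off two digits at a time
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


print(indian_grouping(total_words), indian_grouping(total_chars))
```

Both sums agree with the "Total" row of the table (7057524 words and 46879154 characters).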

The representation of the six major domains covered has been shown in the table below:

| Domain | Word Count | Percentage |
|---|---|---|
| Aesthetics | 4072977 | 57.71% |
| Commerce | 30354 | 0.43% |
| Mass Media | 2271064 | 32.18% |
| Official Documents | 2426 | 0.03% |
| Science & Technology | 80306 | 1.14% |
| Social Sciences | 600397 | 8.51% |
| Total | 7057524 | 100.00% |

Table 13-2: Representation of the Domains in Nepali Text Corpus
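The percentage column follows directly from the word counts: each domain's share is its word count divided by the corpus total. A minimal sketch of that arithmetic, using only the figures in the table (the variable names are illustrative):

```python
# Word counts per domain, copied from the domain table.
domain_words = {
    "Aesthetics": 4072977,
    "Commerce": 30354,
    "Mass Media": 2271064,
    "Official Documents": 2426,
    "Science & Technology": 80306,
    "Social Sciences": 600397,
}

total = sum(domain_words.values())

# Share of each domain in the whole corpus, as a string with two decimals.
shares = {name: f"{100 * words / total:.2f}%" for name, words in domain_words.items()}

for name, share in shares.items():
    print(f"{name:22s} {domain_words[name]:>9d} {share:>8s}")
```

The same division, with a domain total as the denominator, gives the within-domain figures of a subdomain breakdown; for instance, Banking at 9416 words out of 30354 in Commerce is about 31.02%.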

As each domain has several subdomains, the following table shows the representation of each subdomain, both within its own domain and across the whole corpus.

| Domain | Subdomain | Word Count | % (within Domain) | Overall Percentage |
|---|---|---|---|---|
| Aesthetics | Autobiographies | 24754 | 0.61% | 0.35% |
| Aesthetics | Biographies | 307829 | 7.56% | 4.36% |
| Aesthetics | Cinema | 3258 | 0.08% | 0.05% |
| Aesthetics | Culture | 96596 | 2.37% | 1.37% |
| Aesthetics | Fine Arts-Dance | 11002 | 0.27% | 0.16% |
| Aesthetics | Fine Arts-Drawing | 740 | 0.02% | 0.01% |
| Aesthetics | Fine Arts-Music | 10070 | 0.25% | 0.14% |
| Aesthetics | Fine Arts-Musical Instruments | 6620 | 0.16% | 0.09% |
| Aesthetics | Fine Arts-Sculpture | 10525 | 0.26% | 0.15% |
| Aesthetics | Folk Tales | 621 | 0.02% | 0.01% |
| Aesthetics | Folklore | 27720 | 0.68% | 0.39% |
| Aesthetics | Humour | 35026 | 0.86% | 0.50% |
| Aesthetics | Literature-Children's Literature | 10479 | 0.26% | 0.15% |
| Aesthetics | Literature-Criticism | 863007 | 21.19% | 12.23% |
| Aesthetics | Literature-Diaries | 307052 | 7.54% | 4.35% |
| Aesthetics | Literature-Epics | 200 | 0.00% | 0.00% |
| Aesthetics | Literature-Essays | 425981 | 10.46% | 6.04% |
| Aesthetics | Literature-Letters | 4835 | 0.12% | 0.07% |
| Aesthetics | Literature-Novels | 629468 | 15.45% | 8.92% |
| Aesthetics | Literature-Plays | 233675 | 5.74% | 3.31% |
| Aesthetics | Literature-Science Fiction | 7178 | 0.18% | 0.10% |
| Aesthetics | Literature-Short Stories | 788433 | 19.36% | 11.17% |
| Aesthetics | Literature-Speeches | 39681 | 0.97% | 0.56% |
| Aesthetics | Literature-Text Books (School) | 103956 | 2.55% | 1.47% |
| Aesthetics | Literature-Travelogues | 92892 | 2.28% | 1.32% |
| Aesthetics | Mythology | 27922 | 0.69% | 0.40% |
| Aesthetics | Photography | 3457 | 0.08% | 0.05% |
| Commerce | Banking | 9416 | 31.02% | 0.13% |
| Commerce | Business | 8391 | 27.64% | 0.12% |
| Commerce | Finance | 6957 | 22.92% | 0.10% |
| Commerce | Industry | 782 | 2.58% | 0.01% |
| Commerce | Tourism | 4808 | 15.84% | 0.07% |
| Mass Media | Article | 109118 | 4.80% | 1.55% |
| Mass Media | Classifieds | 454 | 0.02% | 0.01% |
| Mass Media | Discussions | 99652 | 4.39% | 1.41% |
| Official Document | Police Documents | 2426 | 100.00% | 0.03% |
| Science and | | | | |
| Social Sciences | Anthropology | 13856 | 2.31% | 0.20% |
| Social Sciences | Archeology | 696 | 0.12% | 0.01% |
| Social Sciences | Economics | 7890 | 1.31% | 0.11% |
| Social Sciences | Education | 61967 | 10.32% | 0.88% |
| Social Sciences | Fisheries | 305 | 0.05% | 0.00% |
| Social Sciences | Geography | 1475 | 0.25% | 0.02% |
| Social Sciences | Health and Family Welfare | 16799 | 2.80% | 0.24% |
| Social Sciences | History | 228123 | 38.00% | 3.23% |
| Social Sciences | Home Science | 612 | 0.10% | 0.01% |
| Social Sciences | Journalism | 12733 | 2.12% | 0.18% |
| Social Sciences | Law | 7407 | 1.23% | 0.10% |
| Social Sciences | Linguistics | 49593 | 8.26% | 0.70% |
| Social Sciences | Philosophy | 12488 | 2.08% | 0.18% |
| Social Sciences | Physical Education | 549 | 0.09% | 0.01% |
| Social Sciences | Political Science | 68929 | 11.48% | 0.98% |
| Social Sciences | Public Administration | 467 | 0.08% | 0.01% |
| Social Sciences | Religion/Spiritual | 66272 | 11.04% | 0.94% |
| Social Sciences | Sociology | 31702 | 5.28% | 0.45% |
| Social Sciences | Sports | 12135 | 2.02% | 0.17% |
| Social Sciences | Text Book (Social Science) | 6399 | 1.07% | 0.09% |

Table 13-3: Representation of Subdomains in Nepali Text Corpus

13.5 COPYRIGHT CONSENTS

The Nepali text corpus has been collected from various sources, and the copyright remains with those sources. However, for the purposes of this corpus, consent has been sought from all the stakeholders. Most of the copyrights (around 90%) belong to private parties, with only about 10% belonging to government agencies, either state or central.