Shahnawaz Alam, Sunil Kumar, Rajesha N, Manasa G, Narayan Choudhary, L. Ramamoorthy

4.1 INTRODUCTION

Dogri is an Indo-Aryan language spoken by about five million people in India and Pakistan, chiefly in the Jammu region of Jammu and Kashmir and in Himachal Pradesh, and also in northern Punjab and other parts of Jammu and Kashmir. Dogri was originally written in the Dogri script, which is very close to the Takri script. The language is now more commonly written in Devanagari in India, and in the Nastaʿliq form of Perso-Arabic in Pakistan and Pakistani-administered Kashmir.

Dogri has several varieties, all with greater than 80% lexical similarity (within Jammu and Kashmir). Before it gained language status, the Census of India classified Dogri as one of the many varieties of Punjabi, alongside varieties such as Majhi or Doabi.

Western Pahari languages, Punjabi and Punjabi dialects are frequently tonal, which is very unusual for Indo-European languages (although Swedish and Norwegian are also tonal). This tonality makes it difficult for speakers of other Indo-Aryan languages to gain facility in Dogri, though native Punjabi speakers (especially speakers of Northern dialects such as Hindko and Mirpuri) may find the transition easier.

Official recognition of the language has been gradual but progressive. On 2 August 1969, the General Council of the Sahitya Academy, Delhi recognized Dogri as an "independent modern literary language" of India, based on the unanimous recommendation of a panel of linguists (Indian Express, New Delhi, 3 August 1969).

In 2005, a collection of over 100 works of prose and poetry in Dogri published over the last 50 years was made accessible online at the Central Institute of Indian Languages (CIIL), Mysore. This included works by the eminent writers Dhinu Bhai Panth, Professor Madan Mohan Sharma, B.P. Sathai and Ram Nath Shastri.

The Dogri text corpus was collected from various libraries in Jammu and Kashmir, mostly in Jammu. The greater part of the text was taken from the library of the Department of Dogri, Jammu University; the Jammu University Library; the J&K Academy of Arts, Culture and Languages; and Dogri Sansatha-Jammu.

LDC-IL tried to cover every category in its standard list. Some categories, such as novels and short stories, have a large number of books, while others, such as physics, chemistry and economics, have very few. Literary texts are easily available in Dogri, but scientific texts are very difficult to obtain.

Texts in some categories, such as epigraphy, finance and oceanology, are extremely rare in Dogri.

4.2 PECULIARITIES OF DOGRI TEXT

The corpus of Dogri text can be broadly classified into two types: literary text and non-literary text. The two differ clearly in the frequency of word usage and in the variety they bring into the corpus. Literary texts are narrative texts that contain elements of fiction; novels, short stories and plays are examples. Non-literary texts are texts whose primary purpose is to convey information; examples are texts on various scientific or technical subjects, legal documents and articles in academic journals. In literary text, language carries emotional elements, cultural information, dialectal variation, ambiguity, etc., whereas technical or scientific terms, foreign words, etc. appear widely in non-literary texts.

Linguistic Resources for AI/NLP in Indian Languages 27
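The contrast in word frequency and lexical variety between literary and non-literary text can be made concrete with a simple profile. The following is a minimal sketch, not the LDC-IL procedure: the function name `lexical_profile`, the whitespace tokenisation and the two toy English strings are assumptions made purely for illustration, and no corpus data is used.

```python
from collections import Counter


def lexical_profile(text):
    """Return (tokens, types, type-token ratio, three most frequent words)."""
    tokens = text.split()  # assumption: words are whitespace-delimited
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens) if tokens else 0.0
    return len(tokens), len(counts), ttr, counts.most_common(3)


# Toy placeholder strings (hypothetical, not taken from the corpus).
literary = "the river sang and the night listened and the river slept"
technical = "soil moisture sensor reads soil moisture and logs soil moisture"

for name, sample in (("literary", literary), ("non-literary", technical)):
    n_tokens, n_types, ttr, top = lexical_profile(sample)
    print(name, n_tokens, n_types, round(ttr, 2), top)
```

A higher type-token ratio indicates more varied vocabulary; on real data the comparison is only meaningful between samples of similar length.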

4.3 DATA SAMPLING NOTES

4.3.1 Principles of Data Sampling

Dogri text data sampling strictly followed the generic guidelines for LDC-IL text corpus collection, which are set out in the generic LDC-IL corpus documentation.

4.3.2 Field Works Undertaken

The Dogri text corpus was collected from various libraries in Jammu & Kashmir, mostly in Jammu. The text materials were collected during a single field trip between August and October 2010. The greater part of the text was taken from the Library of the Department of Dogri, Jammu University; the Jammu University Library; and Dogri Sansatha-Jammu.

Overall, the following libraries served as the source of the Dogri text corpus:

• Library of PG Department of Dogri, University of Jammu, Jammu

• J&K Academy of Art, Culture and Languages, Jammu & Kashmir

• Dogri Sansatha-Jammu

The collected text materials were published at various places within J&K and elsewhere in India, such as Himachal Pradesh, Delhi and Mumbai.

An attempt has been made to cover every category in the standard list. Some categories, such as novels and short stories, have a large number of books, while others, such as physics, chemistry and economics, have very few. Literary texts are easily available in Dogri, but scientific texts are very difficult to obtain. Texts in some categories, such as epigraphy, finance and oceanology, are extremely rare in Dogri.

Collecting text data in the field is difficult. Most libraries do not allow a large amount of material to be taken from their shelves at one time, as it is against their rules; for a given period they issue a maximum of three or four books. Even when a librarian allowed many books to be taken at once, the photocopy kiosk posed problems because of long queues.

Sometimes the photocopy attendants refused to copy randomly selected pages because of the waiting queue, since turning to scattered pages takes them more time than the continuous page-by-page photocopying they are accustomed to. Another issue was that the field worker/linguist had to carry large bundles of photocopies, which were often cumbersome to travel with.

Despite all these issues, the linguists working on the data collection had to cope and carry on.

4.3.3 Data Inputting

All the text has been typed in Unicode using the InScript keyboard, directly into the XML files. The data was entered by Mrs. Rajeshwari.
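To illustrate what storing Unicode text directly in XML involves, here is a minimal sketch using Python's standard library. The element names `document`, `metadata` and `text`, and the function name `build_document`, are hypothetical; the actual LDC-IL XML layout is not described in this chapter, and the sample string is only a placeholder.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, body_text):
    """Build a small XML tree holding metadata fields and Unicode body text."""
    doc = ET.Element("document")           # hypothetical root element
    meta = ET.SubElement(doc, "metadata")  # hypothetical container
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = value
    ET.SubElement(doc, "text").text = body_text
    return doc


sample = "डोगरी"  # placeholder Devanagari string, not corpus data
doc = build_document({"Title": sample}, sample)

# Serialise as a Unicode string and parse it back to confirm that the
# Devanagari text survives the round trip unchanged.
xml_string = ET.tostring(doc, encoding="unicode")
round_trip = ET.fromstring(xml_string)
print(round_trip.find("text").text == sample)
```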

28 Dogri Raw Text Corpus

4.3.4 Proofreading

Dogri text data has been proofread by internal resource persons. The text has been kept true to the printed material; only typos introduced at the time of typing have been corrected.

The printed material collected for the corpus is contemporary, mainly published after 1990. Hence the text is in the reformed script, which came into effect in 1969.

4.3.5 Validation and Normalization Workshops

A 45-day workshop was conducted at the Linguistic Data Consortium from 19 September to 31 October 2013 with three resource persons from Jammu. The input Dogri text data was cleaned by these external resource persons as well as by internal resource persons.

4.4 TRANSLITERATIONS IN LDC-IL DOGRI TEXT CORPUS

For easy reference and uniformity of metadata, some entries in the metadata file, namely ‘Title’, ‘Headline’, ‘Author’, ‘Editor’ and ‘Translator’, are transliterated from Dogri to Roman letters. Numeric characters are the same as in Roman.

The LDC-IL transliteration scheme of Dogri to Roman is given below.

Consonants

क ka    ख kha    ग ga    घ gha    ङ ng'a

च ca    छ cha    ज ja    झ jha    ञ nj'a

ट Ta    ठ Tha    ड Da    ढ Dha    ण Na

त ta    थ tha    द da    ध dha    न na

प pa    फ pha    ब ba    भ bha    म ma

य ya    र ra    ल la    व va    श sha    स sa    ह ha
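Read as a lookup table, the scheme maps each Devanagari character to a Roman string. The following is a minimal sketch in Python, not the LDC-IL tool itself. The function name `transliterate`, the handling of the virama (U+094D), the rule that a consonant without a vowel sign carries the inherent `a`, and the pass-through of unlisted characters are all assumptions made for illustration; the consonant and vowel mappings follow the scheme given in this chapter.

```python
# Consonants without their inherent "a" (scheme value minus the final "a").
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ng'",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "nj'",
    "ट": "T", "ठ": "Th", "ड": "D", "ढ": "Dh", "ण": "N",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "sh", "स": "s", "ह": "h",
}
VOWELS = {
    "अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u",
    "ऊ": "U", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au",
}
# Dependent vowel signs, written as escapes to keep the code points explicit.
VOWEL_SIGNS = {
    "\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
    "\u0942": "U", "\u0947": "e", "\u0948": "ai", "\u094b": "o",
    "\u094c": "au",
}
VIRAMA = "\u094d"  # assumption: suppresses the inherent vowel


def transliterate(text):
    """Transliterate Devanagari text to Roman, character by character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:      # consonant + vowel sign
                out.append(VOWEL_SIGNS[nxt])
                i += 2
            elif nxt == VIRAMA:         # consonant with no vowel
                i += 2
            else:                       # assumption: inherent "a"
                out.append("a")
                i += 1
        else:
            # Independent vowels are mapped; anything else passes through.
            out.append(VOWELS.get(ch, ch))
            i += 1
    return "".join(out)


print(transliterate("कमल"))  # prints "kamala"
```

Because the sketch applies no schwa deletion, डोगरी comes out as `DogarI`; how the actual metadata treats such cases is not stated here.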

Vowels and Vowel Signs

Vowel    Vowel Sign    Roman
अ        -             a
आ        ◌ा            A
इ        ◌ि            i
ई        ◌ी            I
उ        ◌ु            u
ऊ        ◌ू            U
ए        ◌े            e
ऐ        ◌ै            ai
ओ        ◌ो            o
औ        ◌ौ            au

LDC-IL Transliteration Scheme: Dogri characters to Roman

4.5 OVERVIEW OF REPRESENTED DOMAINS

The LDC-IL Dogri Text Corpus contains 801,771 words and 4,125,617 characters, drawn from 183 different titles, including extracts from newspapers.

The following table gives a summary of the typed and crawled text of the Dogri Raw Text Corpus.

Text Type        Word Count    Keystroke/Character Count
Typed+Cleaned    801,771       4,125,617

Table 4-1 Representation of Typed and Cleaned Dogri Text Corpus
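How words and characters are counted is not specified in this chapter. As a hedged illustration, the sketch below counts whitespace-delimited words and Unicode code points (excluding whitespace); both choices, the function name `corpus_counts` and the sample strings are assumptions. Note that in Devanagari a vowel sign is a separate code point, so a character count of this kind is larger than the number of visible syllables.

```python
def corpus_counts(texts):
    """Return (word_count, character_count) for an iterable of strings."""
    words = 0
    chars = 0
    for text in texts:
        tokens = text.split()  # assumption: whitespace-delimited words
        words += len(tokens)
        # Count code points per token, so whitespace is excluded.
        chars += sum(len(token) for token in tokens)
    return words, chars


# Placeholder strings, not corpus data: 5 + 3 + 3 code points in 3 words.
print(corpus_counts(["डोगरी", "कमल राम"]))  # prints (3, 11)

# Average characters per word implied by the figures reported in Table 4-1.
average = 4_125_617 / 801_771
print(round(average, 2))
```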

The representation of the five major domains covered is shown in the table below:

Domain                  Word Count    Percentage
Mass Media              156,756       19.55%
Science & Technology    2,730         0.34%
Aesthetics              594,609       74.16%
Commerce                1,350         0.17%
Social Sciences         46,326        5.78%
Total                   801,771       100%

Table 4-2 Representation of the Domains in Dogri Text Corpus
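The percentages in Table 4-2 follow directly from the word counts. A short check in Python, using only the figures from the table (the variable names are illustrative):

```python
# Word counts per domain, as reported in Table 4-2.
DOMAIN_WORDS = {
    "Mass Media": 156_756,
    "Science & Technology": 2_730,
    "Aesthetics": 594_609,
    "Commerce": 1_350,
    "Social Sciences": 46_326,
}

total = sum(DOMAIN_WORDS.values())

# Share of each domain in the whole corpus, rounded to two decimals.
shares = {
    domain: round(100 * words / total, 2)
    for domain, words in DOMAIN_WORDS.items()
}

print(total)
for domain, share in shares.items():
    print(f"{domain}: {share:.2f}%")
```

The domain counts sum to the reported corpus size of 801,771 words, and the rounded shares match the table.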

As each domain has several sub-domains, the following tables show the representation of each sub-domain, both within its domain and across all domains.

4.5.1 Mass Media

The Mass Media category of the Dogri text corpus covers 4 sub-categories, with a total of 156,756 words and an overall share of 19.55%. The representational details are given in the table below.

Subdomain       Word Count    % (within Domain)    Overall Percentage
Discussions     947           0.604124%            0.12%
Editorial       74,555        47.56118%            9.30%
General News    80,828        51.56294%            10.08%
Letters         426           0.27176%             0.05%
Total           156,756       100%                 19.55%

Table 4-3 Mass Media Category Representation
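Each sub-domain table reports two percentages: the share within the domain (sub-domain words divided by the domain total) and the overall share (sub-domain words divided by the corpus total). A minimal sketch using the Mass Media figures from Table 4-3; the function name `subdomain_shares` is illustrative.

```python
CORPUS_TOTAL = 801_771  # total corpus word count, from Table 4-2

# Mass Media sub-domain word counts, from Table 4-3.
MASS_MEDIA = {
    "Discussions": 947,
    "Editorial": 74_555,
    "General News": 80_828,
    "Letters": 426,
}


def subdomain_shares(subdomains, corpus_total):
    """Map each sub-domain to (% within its domain, % of the whole corpus)."""
    domain_total = sum(subdomains.values())
    return {
        name: (100 * count / domain_total, 100 * count / corpus_total)
        for name, count in subdomains.items()
    }


for name, (within, overall) in subdomain_shares(MASS_MEDIA, CORPUS_TOTAL).items():
    print(f"{name}: {within:.2f}% within domain, {overall:.2f}% overall")
```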

4.5.2 Science and Technology

The Science and Technology category of the Dogri text corpus covers one sub-category, with a total of 2,730 words and an overall share of 0.34%. The representational details are given in the table below.

Subdomain      Word Count    % (within Domain)    Overall Percentage
Agriculture    2,730         100%                 0.34%

Table 4-4 Science and Technology Category Representation

4.5.3 Aesthetics

The Aesthetics category of the Dogri text corpus covers 14 sub-categories, with a total of 594,609 words and an overall share of 74.16%. The representational details are given in the table below.

Subdomain                   Word Count    % (within Domain)    Overall Percentage
Autobiographies             8,758         1.472901%            1.09%
Biographies                 34,892        5.868058%            4.35%
Cinema                      18,740        3.151651%            2.34%
Culture                     11,972        2.013424%            1.49%
Fine Arts-Sculpture         2,464         0.41439%             0.31%
Folklore                    50,178        8.438823%            6.26%
Humour                      3,536         0.594677%            0.44%
Literature-Criticism        32,139        5.405065%            4.01%
Literature-Essays           121,110       20.36801%            15.11%
Literature-Novels           85,273        14.34102%            10.64%
Literature-Plays            77,736        13.07347%            9.70%
Literature-Short Stories    138,874       23.35552%            17.32%
Literature-Speeches         931           0.156573%            0.12%
Literature-Travelogues      8,006         1.346431%            1.00%
Total                       594,609       100%                 74.16%

Table 4-5 Aesthetics Category Representation

4.5.4 Commerce

The Commerce category of the Dogri text corpus covers one sub-category, with a total of 1,350 words and an overall share of 0.17%. The representational details are given in the table below.

Subdomain    Word Count    % (within Domain)    Overall Percentage
Business     1,350         100%                 0.17%

Table 4-6 Commerce Category Representation

4.5.5 Social Sciences

The Social Sciences category of the Dogri text corpus covers 6 sub-categories, with a total of 46,326 words and an overall share of 5.78%. The representational details are given in the table below.

Subdomain                    Word Count    % (within Domain)    Overall Percentage
Food and Wellness            582           1.256314%            0.07%
Health and Family Welfare    3,846         8.302033%            0.48%
Linguistics                  3,673         7.928593%            0.46%
Religion/Spiritual           3,610         7.7926%              0.45%
Sociology                    2,664         5.75055%             0.33%
Sports                       31,951        68.96991%            3.99%
Total                        46,326        100%                 5.78%

Table 4-7 Social Science Category Representation

4.6 COPYRIGHT CONSENTS

The Dogri text corpus has been collected from various sources, and the copyright stays with those sources. However, for the purposes of this corpus, consent has been sought from all the stakeholders. Around 49% of the copyrights belong to private parties and about 51% to government agencies, either state or central.