
ERROR RATE OF AUTOMATED PART-OF-SPEECH TAGGING OF ESTONIAN ACADEMIC LEARNER ENGLISH

BA thesis

KARL AUGUST KALJUSTE

SUPERVISOR: ASSOC. PROF. JANE KLAVAN, PhD

TARTU

2021


ABSTRACT

Corpora are a great tool for linguistic research and for improving learner language. At the moment, there exists the Tartu Corpus of Estonian Learner English (TCELE). However, it is small and lacks academic learner English. Building a corpus of Estonian academic learner English (EALE) could fill this gap in TCELE and provide worthwhile information for students, teachers and researchers alike. Modern corpora include various types of annotation, and tagging words for their part of speech (POS) is the most common of them, but manual tagging is an overwhelmingly long and difficult task. Automated taggers can make this process relatively fast and easy. However, while automated tagger performance has been evaluated on both native writing and learner writing, there is little research on automated tagger performance on academic learner writing. This paper aims to study the accuracy of automated POS tagging of EALE. To achieve this, a corpus of EALE was built and tagged using the Natural Language Toolkit (NLTK) POS tagger, with the results compared against a sample of manually added tags.

The thesis begins with an introduction, which gives an overview of the motivation behind this paper as well as the structure of the thesis. It is followed by a literature review, describing previous studies and their findings on corpora, corpus building, POS tagging and specialised corpora, including TCELE. The empirical analysis section details the methodology, describing how the corpus was built and tagged and how the tags were evaluated, followed by the results and the discussion. The thesis ends with a conclusion.


TABLE OF CONTENTS

ABSTRACT
LIST OF ABBREVIATIONS
INTRODUCTION
1. LITERATURE REVIEW
1.1 Corpus and its components
1.2 POS tagging
1.3 Specialised corpora
1.4 TCELE
2. EMPIRICAL ANALYSIS
2.1 Building the corpus of EALE
2.2 POS tagging of EALE
2.3 Results
2.4 Discussion
CONCLUSION
REFERENCES
APPENDICES
Appendix 1
Appendix 2
RESÜMEE


LIST OF ABBREVIATIONS

CALE – Corpus of Academic Learner English
EALE – Estonian academic learner English
ELFA – Corpus of English as a Lingua Franca in Academic Settings
ICLE – International Corpus of Learner English
POS – Part of speech
TCELE – Tartu Corpus of Estonian Learner English


INTRODUCTION

Throughout the history of linguistics, there has been a desire to study language by analysing large collections of text; the indexing of individual words in the Christian Bible dates as far back as the 13th century (McCarthy & O’Keeffe 2010: 3). In the modern day, it is no longer necessary to sift through thousands of pages of text and manually mark the occurrences of each word to study the use of language. Instead, this type of linguistic study is generally conducted with the help of corpora containing vast amounts of text from many different authors, which can then be analysed with the help of computer software. This thesis focuses on the use of corpora and the related computer software to study learner language.

There are many different types of corpora available, including specialised corpora, making it possible to research not only general language but also language used in specific contexts, such as learner language. One such corpus is the Tartu Corpus of Estonian Learner English (TCELE). At the moment, TCELE consists of university entrance exam essays. Expanding the corpus by adding texts of academic learner language could provide valuable information about Estonian academic learner writing in English. This could in turn be used to study and help improve Estonian academic learner English (EALE) by comparing the data to native speakers’ academic English corpora. The present study aims to create a mini corpus of EALE, preparing academic texts for future linguistic studies.

Van Rooy and Schäfer (2002: 328) point out that corpora should be annotated to enhance their uses for several different purposes, and according to Reppen (2010: 35), part-of-speech (POS) grammar tags are the most common annotation added to corpora. However, adding POS tags manually to a vast amount of text is a prohibitively long and difficult undertaking. Automated POS tagging can be used to speed up the process greatly, offloading the task to computer software that can rapidly tag large collections of text with relative ease.


As Manning (2011: 2) explains, human tagging has a limit of around 97% accuracy, while automated POS taggers have an accuracy slightly above 97%, making automated tagging a comparable or even superior solution. Nevertheless, de Haan (2000: 69) notes that automated tagging of learner English can have a higher error rate, with an average of about 5%.

In addition to studies of tagging learner English in general, Aare Undo (2018) has previously carried out a study calculating the error percentage of automated POS tagging of Estonian learner English, a topic very similar to the present thesis. However, unlike the previous study, this thesis is focused on academic learner English. The differences arise in the level of English used as well as the writing style. The previous study was based on automated POS tagging of TCELE, which consists of BA entrance exam essays (Undo 2018: 5). In contrast, the present thesis explores the automated POS tagging of the BA theses written at the English Department of the University of Tartu.

The aim of this thesis is to work through the process of building a corpus of EALE and to find out the per-token (individual words rather than whole sentences) error rate of an automated POS tagger used on said corpus, in order to determine the viability of automated POS tagging of EALE. The hypothesis is that due to the higher level of English used in academic writing, combined with the extra care taken when writing academic texts, the automatic tagging results will be more comparable to the results of native English writing than learner English. One possible reason why there may be a lower error rate in the POS tagging of EALE texts is that there are presumably no spelling or typographical errors. However, the non-native status of learners, the academic register and field-specific vocabulary may still cause the automated tagger to make errors.

The first part of the thesis is the literature review, which explores different corpora, what corpora consist of, specialised corpora and the use of POS tags. It also explains the necessity and uses of learner language corpora, as well as POS tagging, and gives an overview of past studies of POS tagging of learner language. Section 1.1 discusses the definition of corpora in linguistics. Section 1.2 describes POS tags and the automated tagging process, giving examples of results from previous studies, including those related to learner language. Section 1.3 explains the use of specialised corpora, especially in a language learning setting. Section 1.4 looks at TCELE in particular, identifying its weaknesses and ways to alleviate them.

In the second part of the thesis, the methodology and empirical analysis are detailed. The subsection about corpus building gives a detailed overview of the process of building a corpus of EALE, followed by the process of adding automatic POS tags and how the accuracy of automatic tagging was determined. At the end of this part, the results of the analysis are presented, followed by the discussion.


1. LITERATURE REVIEW

The literature review section of this thesis explains what corpora are, what a modern corpus consists of and what the uses of specialised corpora are, and identifies gaps in existing corpora. In section 1.2, the practice and uses of POS tagging are further explored. In later sections, the error rates of manual and automated POS tagging are compared, judging the overall viability of automating this process, finishing with previous studies of automated POS tagging of learner English.

1.1 Corpus and its components

Leech (2013: 1) describes a corpus as a collection of texts, which in the modern day is stored digitally and, with the help of computers, allows for various kinds of language research. Corpora can also be a great tool to help with language learning (Dahlmeier et al. 2013: 22). Regarding the use of corpora in language learning, Timmis (2015: 7) states that the information found through corpus analysis is more plentiful and useful than what could be obtained in the past through other means. However, a corpus cannot be of much use if it contains large amounts of random, unrelated or irrelevant text. The design of a corpus is more important than its size, as general language use and language use in specific contexts can be vastly different (O’Keeffe et al. 2007: 4). Therefore, in order to study Estonian learner English, for example, it is necessary to collect texts written in English by Estonian students. Although such a collection of texts on its own can be helpful for linguistic research (word frequency, collocations), manually going through each text and trying to find useful or interesting data points is long and painstaking, if not impossible should the corpus be too big.

Modern electronic corpora are first prepared using computer software (with some manual labour) to make future research with them easier. In addition to adding information about the text as a whole (author, year, etc.), corpora generally have annotations providing detailed information about individual elements of the text. Leech (2013: 4) emphasises that information has to be added to a corpus in order to get anything useful out of it, while Van Rooy and Schäfer (2002: 328) point out that annotation can enhance the usefulness of a corpus, but is not necessary. Therefore, in order to build a corpus that proves especially useful for linguistic research, there should be additional data, such as annotation or metadata, added to the texts. Reppen (2010: 35) states that tagging each word with its grammatical category, that is POS tagging, is the most widespread type of annotation for corpora.

However, annotation is not limited to grammar – various different types of annotation can be added as needed, such as phonetic, syntactic, semantic annotation and so on (Leech 2013: 12). As previously mentioned, annotation is not strictly necessary for corpora to be useful in linguistic studies (Van Rooy & Schäfer 2002: 328). It can, however, be more necessary for corpora of transcribed spoken language, as prosodic and phonetic annotation can provide a better representation of the data, although such annotation is largely influenced by interpretation (Leech 2013: 3).

Annotation, much like the rest of the content of a corpus, should be decided based on the intended uses of the corpus in question. One type of annotation that Leech (2013: 15) points out in relation to learner corpora is error tagging, that is, the marking of grammatical and lexical errors made by non-native language learners. Annotation can be added to corpora at any point, meaning the decision to use annotation does not have to be made before building a corpus. This thesis focuses on POS tagging, as the corpus data consists of written texts and, due to the higher level of English, learner errors are less likely.

1.2 POS tagging

POS tagging is the process of adding grammatical word class information to each word in a text. As mentioned in section 1.1, Leech (2013: 4) emphasises that corpora need information added to them to be useful, and Reppen (2010: 35) states that POS tagging is the most common information to be added to corpora. For this reason, POS tags were chosen as the focus of this thesis. However, Leech and Smith (1999: 24) argue that annotation is not necessarily required, but agree that the value of a corpus is increased by it. An example of the usefulness of POS tags in corpora is looking at the collocations of certain word classes rather than individual words (Bird et al. 2009: 187). Studying language with POS tags can give a better understanding of sentence structures, as well as information about word class frequencies.
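As an illustration of the kind of word-class query that POS tags make possible, the following sketch uses the NLTK calls employed later in this thesis to collect adjective–noun pairs from a tagged sentence (the sentence and the expected output are hypothetical examples, not data from the corpus):

```python
import nltk

# Tokeniser and tagger models (resource names as used by NLTK 3.5).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# A made-up example sentence; the point is querying by word class, not by word.
text = "Specialised corpora provide detailed insight into academic language use."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Adjective-noun pairs: a JJ tag followed by any noun tag (NN, NNS, NNP, NNPS).
pairs = [(w1, w2) for (w1, t1), (w2, t2) in nltk.bigrams(tagged)
         if t1 == "JJ" and t2.startswith("NN")]
print(pairs)  # e.g. [('detailed', 'insight'), ('academic', 'language')]
```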

POS tagging can be done manually, but considering the scope of corpora, the task would take a lot of time even with a small team of people. In their study, Marcus et al. (1993: 8) found that the average speed of manual tagging was 1,000 words in 44 minutes per person, with an average error rate of 5.4%. By employing computer software, the tagging process can be automated, making it fast and relatively easy.

As stated by Granger (2002: 18), the automated tagging results should be checked before any analysis of an automatically tagged corpus can follow. The accuracy of automated POS taggers is on average slightly above 97% (Manning 2011: 171).

There are many different automated taggers available. The differences between the taggers can significantly alter the accuracy and usefulness of the results. The tagset used by the tagger is the primary difference between taggers (Van Halteren 1999: 96). Some tagsets may provide more detailed results than others, while adding detail increases the likelihood of incorrect tagging. In addition to tagsets, the methods for tag selection used by the taggers are important, with different methods like rule-based tagging and statistics-based tagging generally used in combination (Van Halteren 1999: 102). Van Rooy and Schäfer (2002: 329) presented an example of three different taggers: Brill, which is an entirely rule-based tagger, TOSCA, which is a hybrid tagger leaning more on probability, and CLAWS, which is a hybrid tagger leaning more on rules. Additionally, the Brill tagger uses the Penn Treebank tagset with 36 tags, the CLAWS tagger uses the CLAWS7 tagset with 137 tags, and the TOSCA tagger uses the TOSCA-ICLE tagset with 220 tags (Van Rooy & Schäfer 2002: 328).

Using an automated tagger on learner language can yield very different results from tagging native language. Learner errors can cause the tagger to not recognise words or to misinterpret them due to incorrect spelling or grammar, which can cause a significant number of tagging errors and thus increase the inaccuracy of automated tagging (Van Rooy & Schäfer 2002: 334). The type of tagger used is also important, as a tagger that uses n-gram (groups of words that appear together) probability to determine the correct tags can be confused by incorrect sentence structures. De Haan (2000: 69) states that the average percentage of incorrect tags in automatically tagged learner English is about 5%.

Van Rooy and Schäfer (2002: 326) evaluated the accuracy of different taggers on the Tswana Learner English Corpus. The results showed that the TOSCA tagger had an accuracy of 87%, the Brill tagger had an accuracy of 89% and the CLAWS tagger had an accuracy of 96% (Van Rooy & Schäfer 2002: 334). However, after correcting the learners’ spelling errors, the results were 90% accuracy with the TOSCA tagger, 91% accuracy with the Brill tagger and 98% accuracy with the CLAWS tagger (Van Rooy & Schäfer 2002: 334). The results of this study show that, in the case of learner English, the use of hybrid taggers should yield better results than a purely rule-based tagger. The CLAWS tagger specifically had a significantly higher accuracy than both of the other two taggers. This could be because the tagger is a hybrid tagger leaning on rule-based tagging, but could also be due to the tagset used by the tagger. In the case of the CLAWS tagger, the results were close to the previously mentioned 5% average error rate of automated tagging of learner English.

In a study of automated POS tagging of Estonian learner English, Undo (2018: 46) initially revealed an error rate of about 10.2% on average, which is significantly higher than the previously discovered average 5% error rate of automated tagging of learner English. However, after making a number of exceptions (e.g., ignoring differences between verb forms) to the evaluation of the tagger, Undo (2018: 49) found the error rate of automated POS tagging of Estonian learner English to be only about 1.06% on average. This is yet again a significant difference, in the other direction, from the previously stated 5% average.

1.3 Specialised corpora

When setting out to build a corpus, it is first necessary to figure out exactly what is needed and what the overall purpose is. Specialised corpora consist of language use in a very specific field, genre, setting or topic (Timmis 2015: 14). Examples of specialised corpora include the International Corpus of Learner English (ICLE), the Michigan Corpus of Academic Spoken English, the Cambridge and Nottingham Business English Corpus, and many more (O’Keeffe et al. 2007: 287-293). Although building a new corpus can be a lengthy and difficult task, Timmis (2015: 14) gives multiple reasons for undertaking it, such as studying learner language, as corpora can be built with very specific goals in mind where existing general corpora cannot provide beneficial information. A specialised corpus will inevitably be significantly smaller than any general corpus, yet Koester (2010: 67) argues that a specialised corpus is more closely tied to the context of its contents, therefore providing more detailed insight into language use in said context.

For the purposes of researching learner English, only texts written or spoken as part of students’ studies are needed, except for comparison purposes. Native English texts alone do not provide information about learner English and, in the case of academic learner English, neither do texts written outside of the classroom context. Of course, corpora with different contexts or native speaker corpora can be used in linguistic research to compare language use between said contexts. In terms of additional information added to the texts in a learner corpus of English, general metadata (information about the source of the text) and grammatical tags (parts of speech) should be enough as a starting point for studying EALE.

There are many different possible uses of corpora for language learning. They can be used to compile dictionaries by finding common uses of words in context (O’Keeffe et al. 2007: 17), which in itself proves useful for learning purposes. By searching for keywords in context, students can study word collocations and how words are used in sentences. In textbook writing, the language used is mostly based on perception rather than data (O’Keeffe et al. 2007: 21). Corpora can provide the necessary empirical data to accurately compile textbooks for language learning by analysing a native speaker corpus and a learner language corpus and comparing the two. Significant differences can highlight problematic topics for learners, and material can be adjusted to focus more on such areas.

1.4 TCELE

For Estonian learner English, there currently exists only the Tartu Corpus of Estonian Learner English (TCELE). However, Piiri (2020: 19) points out that the TCELE corpus is heavily influenced by its source material, since the entrance exam essays currently in the corpus rely heavily on the original text set as the essay topic for the exam. As Koester (2010: 69) explains, the samples included in a corpus have to be varied and accurately represent the language in question. Toom (2020: 29) observes additional issues regarding TCELE, such as the short length of the texts compared to other learner corpora. Therefore, some improvements could be made to TCELE, namely new texts could be added from a wide variety of different topics.

As TCELE is still a small corpus, additional sub-corpora could be built to increase its size and usefulness. Academic texts could be added as a sub-corpus, creating a mini-corpus of Estonian academic learner English (EALE). Coxhead (2010: 466) suggests that corpora comparison can provide useful insight for learners of academic English and teachers alike, as well as for compiling study materials. Therefore, both students and English language teachers in Estonian academia could potentially benefit from a corpus of EALE.

However, much like the previously mentioned shortcomings of TCELE in its current state, the potential shortcomings of adding academic texts should be kept in mind. Although plagiarism is a serious offence, it is never entirely certain whether the academic texts were written by the students themselves. Additionally, supervisors influence students’ writing, and even if their corrections are small, such as adding punctuation where it is lacking, the reliability of using academic texts to study learner language drops. Without the texts being written in a controlled environment, such variables cannot be entirely eliminated. The present study makes use of Bachelor’s theses, which may carry the previously mentioned influences.

Other examples of learner English corpora include the ICLE, the Corpus of English as a Lingua Franca in Academic Settings (ELFA), and others (O’Keeffe et al. 2007: 286-287). According to Kirsimäe (2017: 15), ELFA also has a written sub-corpus, “consisting of 1.5 million words of academic text collected from unedited research papers, PhD examiner reports and research blogs”. In terms of academic learner English corpora, in addition to ELFA, there are the Corpus of Academic Learner English (CALE), the Taiwanese learner academic writing corpus and the Scientext English Learner Corpus, among others (Centre for English Corpus Linguistics 2021). As Callies and Zaytseva (2013: 126) point out, general learner English corpora cannot be compared to academic writing, as their creative and argumentative texts and lack of academic register make them too different from academic writing. The present thesis focuses on building a corpus of Bachelor’s theses written at the English Department of the University of Tartu; however, the corpus of EALE could be further expanded with additional theses, dissertations and essays written by students.


To conclude the literature review section of this thesis, it can be said that corpora have an important place in linguistic research. Modern corpora generally have extra information, such as annotation, added to them to increase their functionality, with POS tagging being the most commonly added annotation to corpora. POS tagging is a large undertaking if done by hand, but modern automated taggers can do it rapidly and are close to – if not exceeding – human accuracy. Although there already exist many large corpora, building a new corpus, no matter the size, is still a worthwhile task, depending on the intended goals. Specialised corpora can be smaller in size, but contain more relevant data for their specific context. A corpus can provide information about word frequencies, grammar use, vocabulary and so on, which in turn can reveal problematic areas in learner language when comparing learner corpora to native corpora. TCELE is a corpus project in progress, aiming to be beneficial for English pedagogy in Estonia, but it is currently lacking in size and is in need of expansion.


2. EMPIRICAL ANALYSIS

In this part of the thesis, the methodology and analysis are detailed. In section 2.1, the process of building a corpus to be analysed is explained, starting from the collection of texts, then going over the automated cleaning and manual verification of the texts and preparing them for automated tagging and analysis. Section 2.2 explains the process of using an automated tagger to add POS tags to the corpus and the way in which the accuracy of the tagger was determined. The results are given at the end of the empirical analysis section.

As stated in the introduction of this thesis, the hypothesis is that the accuracy of automated POS tagging of EALE is closer to the accuracy of tagging native English rather than learner English, due to the higher level of English used in academia, combined with the extra care taken when writing academic texts such as a BA thesis.

2.1 Building the corpus of EALE

In order to study Estonian academic learner English (EALE), the Bachelor’s theses written at the English Department of the University of Tartu were used. The materials were obtained from the University of Tartu DSpace, a repository for electronic materials. The University of Tartu DSpace contains Bachelor’s theses in digital form, presented in PDF file format. These files were downloaded one by one from the Department of English Studies Bachelor’s theses category. A total of 77 written EALE texts were collected. Some of the texts were omitted from the study for various reasons that made them unsuitable, such as mostly containing text in a third language, leaving very little English text to analyse. The final total number of texts used for analysis in this paper is 74.

The second major step was text cleaning. The cleaning of texts was initially automated using R (version 4.0.3, R Core Team 2020). R is a scripting language generally used for statistical analysis, but it is also an efficient tool for data manipulation (Hornik 2020). In this thesis, R is used for the functionality of automatically reading a list of files and using a series of commands to find and replace patterns of text in each file. As all the texts were originally in the PDF file format, the texts first had to be converted to TXT format to make them readable with the different tools used in this paper and carry out further analysis. To achieve that, the R package pdftools (version 2.3.1, Ooms 2020) was used to read the original PDF documents in R as plain text, which could then be processed using R scripts. The conversion process was not perfect and a lot of junk data was added to the converted texts, such as line break symbols and page numbers. The scripts used several combinations of regular expressions to find and remove patterns of unwanted text. Any excess information left over from the PDF format was removed, such as page numbers and unnecessary whitespace. Then, any pages not containing text useful for this study were removed, like the title page, table of contents, and list of references, including everything that follows the list of references. Finally, quotes and citations were removed, again using regular expressions, as they do not reflect the students’ own writing. The final R script used for automatically cleaning the texts can be accessed on the author’s account on GitHub (https://github.com/KaljusteAugust/Baka2021/blob/main/textextraction.R).
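The author’s actual cleaning pipeline is the R script linked above. Purely as an illustration of the kind of pattern-based cleaning described, a minimal Python sketch is given below; the regular expressions and folder names are hypothetical examples, not the patterns used in the thesis:

```python
import re
from pathlib import Path

def clean_text(raw: str) -> str:
    """Illustrative removal of PDF-conversion artefacts; patterns are examples only."""
    text = re.sub(r"(?m)^\s*\d+\s*$", "", raw)              # lines holding only a page number
    text = re.sub(r"\([A-Z][^()]*\d{4}[^()]*\)", "", text)  # simple parenthetical citations
    text = re.sub(r"\s+", " ", text)                        # leftover line breaks and whitespace
    return text.strip()

Path("corpus_clean").mkdir(exist_ok=True)
for path in Path("corpus_txt").glob("*.txt"):               # hypothetical folder of converted texts
    cleaned = clean_text(path.read_text(encoding="utf-8"))
    (Path("corpus_clean") / path.name).write_text(cleaned, encoding="utf-8")
```

As the thesis notes next, patterns like these are brittle: inconsistent quotation marks or misplaced sentence-final periods easily make a regular expression consume too much or too little text.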

Many difficulties arose when writing these regular expression patterns. The students’ writing often contained irregularities, such as switching between different types of quotation marks, which caused the script to delete text beyond the targeted quote. Some citation mistakes also made it difficult to automatically identify sentences. A mistake made in several texts was the placement of the sentence-ending period inside the citation parentheses, rather than outside. This caused the script to not consider the sentence finished, therefore once again deleting more text than necessary or leaving some unwanted text untouched. In addition to the errors, it was impossible to identify long quotations with regular expressions, as these are written in a smaller font and without quotation marks, of which no information was retained in the texts upon conversion from the PDF format. It was also impossible to identify tables and figures using regular expressions.

All these issues made it necessary to manually check and clean the texts one by one, by comparing the original PDF files to the output TXT files. Although this was a long and difficult process, all 74 texts were cleaned and manually verified to only contain text originally written by the students themselves, without any excess data. The final corpus consists of 491,198 words.

2.2 POS tagging of EALE

Once the texts were cleaned, code was written in the Python programming language (version 3.7.6, Van Rossum & Drake 2009) for automated POS tagging. The final code for POS tagging can be accessed on the thesis author’s GitHub account (https://github.com/KaljusteAugust/Baka2021/blob/main/POS_tagger.py). The Python code made use of the Natural Language Toolkit (NLTK) library (version 3.5, Bird et al. 2009). NLTK is a set of natural language processing tools for Python with various different purposes, including automated POS tagging. NLTK was chosen for this task as it is open source (anyone can see and edit the original code) and freely available. The POS tagger in NLTK uses a combination of statistical and rule-based tagging (Bird et al. 2009: 209). The code first tokenised the texts – that is, each word, punctuation mark and symbol was separated. It then automatically added POS tags to each token in the previously cleaned texts.
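A minimal sketch of the tokenise-and-tag pipeline described above, using the NLTK calls named in the thesis (the example sentence is hypothetical; the resource names are those used by NLTK 3.5):

```python
import nltk

# One-time downloads of the tokeniser and default tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# A made-up sentence standing in for a cleaned thesis text.
text = "This thesis analyses the error rate of automated POS tagging."

tokens = nltk.word_tokenize(text)  # words, punctuation and symbols separated
tagged = nltk.pos_tag(tokens)      # Penn Treebank tags, e.g. ('thesis', 'NN')
print(tagged)
```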

The POS tagger in NLTK used the Penn Treebank Project tagset, which is also the tagset later used when manually tagging the texts. The tagset was chosen for both the automatic and manual tagging because it is the default tagset used by NLTK, and by using the same tagset for both tasks, the results can be directly compared. The Penn Treebank tagset consists of 36 POS tags, with an additional 12 for punctuation (Marcus et al. 1993: 5). The tagset has separate tags for the eight common POS of English: adjective, adverb, noun, pronoun, verb, preposition, conjunction, and interjection. The tagset also differentiates between singular and plural nouns (NN vs NNS), common nouns and proper nouns (NN vs NNP), regular, comparative and superlative adjectives (JJ, JJR, JJS) and adverbs, different verb forms (base form, past tense, gerund, etc.) and so on. Considering the fact that the tagset covers all the common POS with further detailed tags for some of them, the tagset is good enough for the purposes of this study and for future research with the corpus. The full list of tags used in the Penn Treebank tagset can be found in Appendix 1.
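NLTK also ships documentation for this tagset; a quick way to look up what a tag means (assuming the optional "tagsets" resource is installed) is:

```python
import nltk

nltk.download("tagsets", quiet=True)  # tagset documentation resource
nltk.help.upenn_tagset("JJ.*")        # prints all adjective tags with examples
```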

Ten samples of five consecutive sentences each were chosen at random using the random number generator at random.org (Haahr 2021). First, a random number was generated to pick a text, and then another random number was generated to pick the first sentence in a sequence of five sentences. This was repeated ten times. If the random number generator gave a number that had already appeared, a new number was generated. The total number of words in the collected sample was 1,195, which the previously written Python code split into 1,317 tokens (including punctuation). The automatically tagged sample of texts is presented in Appendix 2.
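The thesis drew its random numbers from random.org; a sketch of the same sampling procedure using Python’s standard random module instead (the sentence lists are hypothetical placeholders):

```python
import random

# Hypothetical corpus: one list of sentences per thesis text.
texts = [[f"Text {i}, sentence {j}." for j in range(60)] for i in range(74)]

samples, used = [], set()
while len(samples) < 10:
    t = random.randrange(len(texts))         # pick a text at random
    s = random.randrange(len(texts[t]) - 4)  # pick the first of five consecutive sentences
    if (t, s) in used:                       # redraw numbers that already appeared
        continue
    used.add((t, s))
    samples.append(texts[t][s:s + 5])
```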

This sample of texts was then manually tagged without any prior knowledge of the automated tagging results, again using the Penn Treebank Project tagset. In case of any uncertainties, the Merriam-Webster (n. d.) online dictionary was used to check for the correct POS of words, with the help of the Penn Treebank Project tagging guidelines (Santorini 1990) used for ambiguous parts of speech. After the manual tags were added, the automatic tags were looked at to see if there were any obvious mistakes made in the manual tagging of words. Any differences between the manual and automatic tags were closely inspected to decide if the manually added tags were indeed correct. Again, the Penn Treebank Project tagging guidelines (Santorini 1990) were used to help with ambiguity.


The manual tagging proved to be a great challenge, as ambiguity was very common and, even after reading the guidelines, many of the word classes were still difficult to identify correctly. The main issue was differentiating between verb forms, specifically between the past tense and the past participle. Through extensive reading of the guidelines and repeatedly going over all the tags, it is believed that all the manual tags are correct. However, it must be noted that word class identification is a difficult task and even trained annotators can have an average disagreement rate of 7.2% and an error rate of 5.4% (Marcus et al. 1993: 8); therefore, it is possible that some mistakes in the manual tagging may still be found in this study.

2.3 Results

The results of both the automatic tagging and the manual tagging were placed in a table to compare the two and find errors in the automatic tagging, with the manually added tags considered the correct tags. From there, the error percentage of the automated tagging was calculated using the following formula: error percentage = (number of incorrect tags / total number of tags) × 100. The percentages presented in this thesis are rounded to the nearest hundredth.
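A sketch of this comparison and the formula in Python (the two tag lists are hypothetical stand-ins for the automatic and manual results over the same tokens):

```python
# Hypothetical parallel tag lists for the same six tokens.
auto_tags   = ["DT", "NN", "VBZ", "JJ", "NN", "."]
manual_tags = ["DT", "NN", "VBZ", "RB", "NN", "."]  # manual tags taken as correct

incorrect = sum(a != m for a, m in zip(auto_tags, manual_tags))
error_percentage = incorrect / len(manual_tags) * 100
print(f"{incorrect} incorrect tag(s), error rate {error_percentage:.2f}%")  # 16.67%
```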

The automated tagger produced a total of 1,317 POS tags for the sample. No tokens were left untagged, meaning the tagger successfully tagged each token. The same number of manual tags were added to the text. By comparing the automated tagger results to the manual tags, 48 incorrect tags by the automated tagger were identified, with 1,269 tags considered correct. This comes out to an error percentage of about 3.64%, giving the automated tagger a per-token accuracy of 96.36%. The analysis results for individual texts are given in Table 1.


Table 1. Automated POS tagging errors in individual texts (including punctuation errors)

Text    Errors   Total tags   Error percentage
1       2        107          1.87%
2       4        152          2.63%
3       3        82           3.66%
4       8        136          5.88%
5       4        145          2.76%
6       8        118          6.78%
7       6        185          3.24%
8       4        174          2.30%
9       5        118          4.24%
10      4        100          4.00%
Total   48       1,317        3.64%

However, when looking at the automatic tagger results, it was found that the tagger had incorrectly identified all apostrophes and some other punctuation (see example 1), and as such, it also misidentified the possessive ending ‘s (see example 2).

1) based (VBN) on (IN) two (CD) binaries (NNS) – (VBP) good (JJ) and (CC) bad (JJ)
2) Oedipa (NNP) ’ (NNP) s (VBD)

This might be due to the text conversion from PDF to TXT, causing the symbols used to be unknown to the NLTK tagger. A total of 15 errors due to misidentified punctuation were discovered, making up a significant amount (31.25%) of all errors. By considering any errors caused by misidentified punctuation as correct instead, the tagger would have an error rate of 2.51%, making it 97.49% correct, which is a considerable improvement.

The most common tagging error (besides punctuation) was found to be incorrectly tagging nouns as something else. A total of 13 nouns were misidentified, which is 27.08% of all errors. The POS tag assigned by the automatic tagger for these incorrectly tagged nouns varied, but seven of the nouns were tagged as an adjective (see example 3).

3) the (DT) past (NN) and (CC) the (DT) present (JJ)

All other tagging errors were found to be mostly unique in their pairings of correct and incorrect tags. However, the next most common errors were incorrectly tagging seven verbs and four prepositions or subordinating conjunctions. In one case, a verb was tagged as a noun (see example 4), and there was also one case where it is believed that one tagging error caused the tagger to also misidentify the following words (see example 5). Other errors that were found include incorrectly tagging an adverb (see example 6) and “one” used as a pronoun (see example 7).

4) to (TO) cross (VB) reference (NN) the (DT)
5) an (DT) indicator (NN) that (WDT) Twain (VBP) preferred (VBN) to (TO)
6) the (DT) character (NN) was (VBD) first (JJ) described (VBN)
7) one (CD) has (VBZ) to (TO) be (VB)

The composition of all incorrectly tagged POS can be seen in Figure 1, which shows the correct tags. For possessive ending errors, both the apostrophe and the trailing ‘s’ were tagged separately, and both were corrected to the possessive ending tag. Short descriptions of each tag can be found in Appendix 1.


Figure 1. Composition of incorrectly tagged POS, showing the correct tags: Noun 10 (20.83%), Possessive 10 (20.83%), Punctuation 5 (10.42%), Preposition/Subordinating conjunction 4 (8.33%), Adjective 3 (6.25%), Verb base form 3 (6.25%), Verb non-3rd person singular present 3 (6.25%), Adverb 2 (4.17%), Personal pronoun 2 (4.17%), Proper noun 2 (4.17%), Determiner 1 (2.08%), Plural noun 1 (2.08%), Verb past tense 1 (2.08%), Wh-determiner 1 (2.08%)

As shown in Table 1, text number six contained the most errors, with an error percentage significantly higher than all other texts in the sample. This was found to be due to this particular text containing a combination of all the most common errors: misidentified verbs, nouns and possessive endings. Incorrectly tagging one word can also cause the following words to be incorrectly tagged, as the tagger then works with an incorrect understanding of the context of the word. Such an effect can be examined in this text, where the phrase “Twain’s writing” has both the possessive ending and the word ‘writing’ incorrectly tagged by the automatic tagger. Without the possessive ending, the word ‘writing’ appears to be a verb and would likely seem to be a verb to a human tagger as well.

2.4 Discussion

The results of this research show that the error percentage of automated tagging of EALE (3.64%) is lower than the average error percentage of tagging learner English (5%) found in previous studies according to de Haan (2000: 69). In comparison to previous studies of Estonian learner English, Undo’s (2018) work is the best direct comparison, as it used the same tagger and the same tagset, and it is currently the only known previous study of automated POS tagging of Estonian learner English. However, Undo’s work did not include punctuation, which previous studies have included and which has increased the accuracy percentage of the tagger (Manning 2011: 171). The error rate of automated tagging found in the present thesis is much lower than the initial average found by Undo (2018: 46) of 10.2% and much higher than the adjusted error rate found by Undo (2018: 50) of 1.06% after making a number of exceptions. The exceptions made by Undo are the following:

1. If a word has been tagged as a verb, but an incorrect type of verb form (the tense must be correct), the tagger has been correct
2. If superlative or comparative adjectives have been tagged as “just” adjectives, the tagger has been correct
3. If “there” has been tagged as an existential there, and it is a pronoun, the tagger has been correct
4. If numbers, even when used as adjectives or nouns, are tagged as “cardinal number”, the tagger has been correct (Undo 2018: 48)

It can be argued that the first exception is too excessive, as important information about verb forms is lost for the sake of better tagger results. In addition, as previously mentioned, Undo’s work did not include punctuation at all. By applying the same exceptions to the texts in the current study, with 1,200 total tags and 30 errors, the combined error rate comes out to 2.50%, which is still considerably higher than the results of Undo’s previous study. However, despite the language analysed in this thesis being learner language, no learner errors were identified in the sample of texts. Therefore, unlike in the previous studies, the error percentage of the automated POS tagging was not increased by learner errors, which means that the results might not be directly comparable to the results of previous learner language tagging. However, the present thesis only looked at ten of the 74 texts in the corpus, identifying POS tags for only five sentences in each text. It is possible that there are texts in the corpus with identifiable learner errors that would significantly influence the overall error rate of the automatic POS tagging.
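A sketch of how such exceptions can be applied when counting errors (a simplified illustration of exception-style evaluation, not Undo’s exact rules; exception 1 is reduced here to treating any two verb tags as a match):

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def tags_match(auto: str, manual: str) -> bool:
    """Compare tags under simplified, Undo-style exceptions (illustrative only)."""
    if auto == manual:
        return True
    if auto in VERB_TAGS and manual in VERB_TAGS:       # exception 1, simplified: verb form ignored
        return True
    if manual in {"JJR", "JJS"} and auto == "JJ":       # exception 2: adjective degree ignored
        return True
    if manual == "PRP" and auto == "EX":                # exception 3: existential there for a pronoun
        return True
    if manual in {"JJ", "NN", "NNS"} and auto == "CD":  # exception 4: numbers kept as cardinal number
        return True
    return False

errors = sum(not tags_match(a, m) for a, m in [("VBD", "VBN"), ("JJ", "RB")])
print(errors)  # 1: the verb-form difference is excused, the adjective/adverb one is not
```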

The results do show that, at least in the small sample used for analysis, the level of English in academic writing by students at the English Department of the University of Tartu is relatively high. There were no spelling or typographical errors found in the sample.


Although no learner errors were identified, some of the errors could have been caused by poor sentence structures, which may have gone unnoticed in this study. NLTK’s statistical tagging method could, in such cases, incorrectly tag words as belonging to a different POS, as it expects a different type of word to appear in collocation with the surrounding words in a sentence. Such a situation was observed in cases where the automatic tagger had incorrectly tagged possessive endings as something else, causing the tagger to then also misidentify the following words. Additionally, the greater use of reported speech and other elements of academic writing could confuse the tagger if its training data comes from different types of texts.

However, as mentioned in the results section, a significant number of errors were caused by misidentified punctuation. When all punctuation tagging errors are counted as correct, the accuracy of the automated tagging increases to a percentage similar to the one Manning (2011: 171) reports for general automated tagging of English. Such errors could also be prevented in the future by further processing the texts beforehand (e.g., replacing all apostrophes with a symbol recognised by the tagger as an apostrophe). Alternatively, the tags added to punctuation could be removed after the automated tagging, and the tags for possessive endings could easily be corrected as well.
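A sketch of the kind of pre-processing suggested here, replacing typographic apostrophes and quotation marks with their ASCII counterparts before tagging (the character list is illustrative, not exhaustive):

```python
# Map typographic punctuation to the ASCII forms the tagger was trained on.
REPLACEMENTS = {
    "\u2019": "'",   # right single quotation mark, used as an apostrophe
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def normalise_punctuation(text: str) -> str:
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text

print(normalise_punctuation("Oedipa\u2019s first sign"))  # -> Oedipa's first sign
```

With the ASCII apostrophe in place, NLTK’s tokeniser splits “Oedipa's” into “Oedipa” and “'s”, and the possessive ending can receive its POS tag.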

Future research could carry out a more in-depth analysis and error checking of the EALE corpus. This study only checked 1,195 words out of 491,198, or 0.24% of the entire built corpus – a minuscule part of the whole. Additionally, in the course of future research, MA theses could be added to TCELE. As the present study also discovered a considerable percentage of errors made by automated tagging, all the tags would ideally eventually be manually checked and corrected. This would still take a lot of time, but should be more than twice as fast and more accurate than manually tagging from scratch (Marcus et al. 1993: 8). Future studies could also determine the level of language use in all the texts in the corpus, to find the theses with the weakest use of English, and then carry out the same analysis as in this thesis on those weaker texts.


CONCLUSION

Corpus building is a worthwhile task for linguistic research, even if the resulting corpus is small, as building a corpus with specific goals in mind can help with linguistic research in a narrow field. An important part of corpus building is choosing what information to provide as annotations alongside the texts to make future research easier. While POS tags are the most common type of annotation added to corpora, going through an entire corpus and tagging each word by hand can be a time-consuming and difficult process. Automated POS tagging significantly speeds up this process, and previous studies have demonstrated that automated tagging is also fairly accurate.

However, as automatic taggers have been trained on native speaker English, the quality of learner English tagging should be looked into. Previous studies (e.g., Van Rooy & Schäfer 2002; Undo 2018) found that automated tagging of learner English had an error rate varying from five to ten percent. However, the viability of using automated tagging with academic learner English is a much less studied field. The present thesis aimed to determine the quality of automatic POS tagging of EALE in the hopes of speeding up the process of building a corpus of Estonian learner English and providing valuable data for linguistic research and pedagogy in academia.

In order to achieve this, a corpus of EALE was built using Bachelor’s theses written at the English Department of the University of Tartu. The theses were cleaned up, stripped of any excess data and of quotes, which do not represent the students’ own writing. Once the texts were ready for annotation, they were automatically tagged for POS using NLTK. A sample of five consecutive sentences from each of ten texts was then manually tagged as a control sample. The automatic tags were compared to the manual tags, with the manual tags considered correct.

The sample contained 1,317 tokens, and the analysis found a total of 48 errors in the sample, making the error rate 3.64%. Even without any exceptions to the counted errors, the error rate of automatic POS tagging of EALE is significantly lower than the previously found error rates of automatic tagging of learner English. However, due to an error in text processing, a number of punctuation symbols, which should otherwise be easily identifiable by computer software, were incorrectly tagged. By making the exception of counting punctuation identification errors as correct, the error percentage drops to 2.51%.

By further making the same exceptions to the list of errors as made by Undo (2018: 48), the error percentage comes out to be 2.50%.

The most common error was found to be misidentifying nouns, with 13 errors, or 27.08% of all errors made by the automatic tagger, belonging to this category. This was followed by ten possessive ending errors and five punctuation errors. Without counting errors caused by misidentified punctuation (which includes the possessive ending errors), the next most common errors were incorrectly tagging seven verbs and four prepositions or subordinating conjunctions.

The hypothesis raised in the introduction, that the higher level of English used in academic writing would yield an automatic tagging error rate more comparable to that of native English than to that of learner English, held true. At least in the sample used in this study, no learner errors were identified that confused the automatic tagger. Potentially, many of the errors made by the automatic tagger in this research could have been due to the academic register and narrow-field vocabulary, which the tagger was not trained for.

It must be noted that POS tagging is a difficult task, and while guidelines were strictly followed, there may also have been errors made in the manual tagging. In addition, this thesis only checked 0.24% of the entire corpus that was built and automatically tagged. While the sample was picked randomly to ensure unbiased coverage, other texts, or even other sections of the same texts, could potentially lead to different results.


This research is a good starting point for future research on EALE and automatic tagging of academic learner English. The corpus of EALE could be used to study academic learner English in various ways, for example by identifying common constructions or overused words in EALE writing and comparing them to native writing. Automatic POS tagging has proven to be worthwhile, and the same methods used in this thesis can be applied to further expand the EALE corpus and merge it with TCELE. Future research can also address the shortcomings of this thesis, such as the limited number of texts manually tagged or the lack of an automatic tagger trained for academic writing, to fill the gaps and provide insight into the level of English used in Estonian academic papers written in English.


REFERENCES

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. Sebastopol: O’Reilly Media, Inc.

Callies, Marcus and Ekaterina Zaytseva. 2013. The Corpus of Academic Learner English (CALE): A new resource for the assessment of writing proficiency in the academic register. Dutch Journal of Applied Linguistics, 2: 1, 126-132.

Centre for English Corpus Linguistics. 2021. Learner Corpora around the World. Louvain-la-Neuve: Université catholique de Louvain. Available at https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html, accessed May 2021.

Coxhead, Averil. 2010. What can corpora tell us about English for Academic Purposes? In Michael McCarthy and Anne O’Keeffe (eds). The Routledge Handbook of Corpus Linguistics, 458-470. London and New York: Routledge.

Dahlmeier, Daniel, Hwee Tou Ng and Siew Mei Wu. 2013. Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 22-31. Atlanta: Association for Computational Linguistics.

De Haan, Pieter. 2000. Tagging non-native English with the TOSCA-ICLE tagger. In Christian Mair and Marianne Hundt (eds). Corpus Linguistics and Linguistic Theory, 69-79. Amsterdam: Rodopi.

Granger, Sylviane. 2002. A Bird’s-eye view of learner corpus research. In Sylviane Granger, Joseph Hung and Stephanie Petch-Tyson (eds). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, 3-33. Amsterdam and Philadelphia: John Benjamins Publishing Company.

Haahr, Mads. 2021. RANDOM.ORG: True Random Number Service. Available at https://www.random.org, accessed March 2021.

Hornik, Kurt. 2020. The R FAQ. Available at https://CRAN.R-project.org/doc/FAQ/R-FAQ.html, accessed March 2021.

Kirsimäe, Merli. 2017. The Compilation and Lexicogrammatical Analysis of an Estonian Spoken Mini-Corpus of English as a Lingua Franca. Unpublished MA thesis. Department of English Studies, University of Tartu, Tartu, Estonia.

Koester, Almut. 2010. Building small specialised corpora. In Michael McCarthy and Anne O’Keeffe (eds). The Routledge Handbook of Corpus Linguistics, 66-79. London and New York: Routledge.

Leech, Geoffrey. 2013. Introducing corpus annotation. In Roger Garside, Geoffrey Leech and Tony McEnery (eds). Corpus Annotation: Linguistic Information from Computer Text Corpora, 1-18. London and New York: Routledge.

Leech, Geoffrey and Nicholas Smith. 1999. The Use of Tagging. In Hans van Halteren (ed). Syntactic Wordclass Tagging, 23-36. Dordrecht: Springer Science+Business Media.

Manning, Christopher D. 2011. Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In Alexander Gelbukh (ed). Computational Linguistics and Intelligent Text Processing: 12th International Conference CICLing 2011, Tokyo, Japan, February 20-26, 2011. Proceedings, Part I, 171-189. Heidelberg: Springer.

Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Technical Report. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, United States.

McCarthy, Michael and Anne O’Keeffe. 2010. Historical perspective: what are corpora and how have they evolved? In Michael McCarthy and Anne O’Keeffe (eds). The Routledge Handbook of Corpus Linguistics, 3-13. London and New York: Routledge.

Merriam-Webster. n.d. Available at https://www.merriam-webster.com, accessed March 2021.

O’Keeffe, Anne, Michael McCarthy and Ronald Carter. 2007. From Corpus to Classroom: language use and language teaching. New York: Cambridge University Press.

Piiri, Andreas. 2020. A corpus based study of formulaic language use by native and non-native speakers. Unpublished BA thesis. Department of English Studies, University of Tartu, Tartu, Estonia.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Reppen, Randi. 2010. Building a corpus: what are the key considerations? In Michael McCarthy and Anne O’Keeffe (eds). The Routledge Handbook of Corpus Linguistics, 31-37. London and New York: Routledge.

Santorini, Beatrice. 1990. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision). Technical Report. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, United States.

Timmis, Ivor. 2015. Corpus Linguistics for ELT. London and New York: Routledge.

Toom, Sandra-Leele. 2020. The use of phrasal verbs by Estonian EFL learners: a corpus-based study. Unpublished BA thesis. Department of English Studies, University of Tartu, Tartu, Estonia.

Undo, Aare. 2018. Calculating the Error Percentage of an Automated Part-of-Speech Tagger when Analyzing Estonian Learner English – An Empirical Analysis. Unpublished MA thesis. Department of English Studies, University of Tartu, Tartu, Estonia.

Van Halteren, Hans. 1999. Selection and Operation of Taggers. In Hans van Halteren (ed). Syntactic Wordclass Tagging, 95-104. Dordrecht: Springer Science+Business Media.

Van Rooy, Bertus and Lande Schäfer. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20: 4, 325-335.

Van Rossum, Guido and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley: CreateSpace.


APPENDICES

Appendix 1

CC – Coordinating conjunction
CD – Cardinal number
DT – Determiner
EX – Existential there
FW – Foreign word
IN – Preposition/Subordinating conjunction
JJ – Adjective
JJR – Comparative adjective
JJS – Superlative adjective
LS – List item marker
MD – Modal verb
NN – Noun, singular or mass
NNS – Noun, plural
NNP – Proper noun, singular
NNPS – Proper noun, plural
PDT – Predeterminer
POS – Possessive ending
PRP – Personal pronoun
PRP$ – Possessive pronoun
RB – Adverb
RBR – Comparative adverb
RBS – Superlative adverb
RP – Particle
SYM – Symbol
TO – to
UH – Interjection/Exclamation
VB – Verb, base form
VBD – Verb, past tense
VBG – Verb, gerund/present participle
VBN – Verb, past participle
VBP – Verb, non-3rd person singular present
VBZ – Verb, 3rd person singular present
WDT – Wh-determiner
WP – Wh-pronoun
WP$ – Possessive wh-pronoun
WRB – Wh-adverb

Additional punctuation tags


Appendix 2

Text 1

In (IN) written (VBN) language (NN) there (EX) are (VBP) very (RB) few (JJ) grammatical (JJ) differences (NNS) between (IN) Australian (JJ) and (CC) British (JJ) English (NNP) . (.) However (RB) , (,) there (EX) are (VBP) some (DT) distinctive (JJ) features (NNS) of (IN) AusE (NNP) , (,) which (WDT) are (VBP) different (JJ) from (IN) both (DT) BrE (NNP) and (CC) AmE (NNP) . (.) For (IN) instance (NN) , (,) it (PRP) is (VBZ) usual (JJ) in (IN) AusE (NNP) to (TO) use (VB) thanks (NNS) rather (RB) than (IN) please (VB) in (IN) requests (NNS) : (:) Can (MD) I (PRP) have (VBP) a (DT) cup (NN) of (IN) tea (NN) , (,) thanks (NNS) ? (.) Further (RB) , (,) as (IN) it (PRP) has (VBZ) already (RB) been (VBN) mentioned (VBN) in (IN) the (DT) section (NN) 1.2 (CD) , (,) special (JJ) features (NNS) of (IN) Australian (JJ) variety (NN) appear (VBP) at (IN) the (DT) level (NN) of (IN) the (DT) colloquial (JJ) speech (NN) . (.) An (DT) example (NN) of (IN) this (DT) is (VBZ) a (DT) tendency (NN) to (TO) use (VB) she (PRP) to (TO) refer (VB) to (TO) inanimate (VB) nouns (NNS) in (IN) impersonal (JJ) constructions (NNS) . (.)

Text 2

As (IN) Harry (NNP) Potter (NNP) series (NN) ’ (NNP) characters (NNS) are (VBP) very (RB) clearly (RB) based (VBN) on (IN) two (CD) binaries (NNS) – (VBP) good (JJ) and (CC) bad (JJ) but (CC) also (RB) ambivalent (JJ) - (:) it (PRP) is (VBZ) useful (JJ) to (TO) take (VB) that (DT) into (IN) account (NN) in (IN) the (DT) analysis (NN) . (.) Harry (NNP) , (,) Ron (NNP) and (CC) Hermione (NNP) from (IN) my (PRP$) analysis (NN) , (,) for (IN) example (NN) , (,) are (VBP) the (DT) good-willed (JJ) characters (NNS) , (,) Malfoy (NNP) , (,) Dudley (NNP) and (CC) Voldemort (NNP) the (DT) villainous (JJ) ones (NNS) and (CC) Snape (NNP) and (CC) Quirrell (NNP) the (DT) ambivalent (JJ) characters (NNS) . (.) Character (NNP) descriptions (NNS) In (IN) the (DT) following (NN) , (,) by (IN) way (NN) of (IN) introduction (NN) in (IN) the (DT) book (NN) , (,) I (PRP) will (MD) compare (VB) character (JJ) descriptions (NNS) in (IN) the (DT) original (JJ) PS (NNP) with (IN) the (DT) Estonian (JJ) translations (NNS) from (IN) TK (NNP) . (.) For (IN) this (DT) , (,) I (PRP) analysed (VBD) both (DT) books (NNS) and (CC) wrote (VBD) out (RP) the (DT) descriptions (NNS) where (WRB) the (DT) character (NN) was (VBD) first (JJ) described (VBN) , (,) or (CC) the (DT) description (NN) acquired (VBD) more (JJR) detail (NN) . (.) My (PRP$) assumption (NN) is (VBZ) that (IN) given (VBN) the (DT) relative (JJ) simplicity (NN) of (IN) the (DT) book (NN) , (,) complicated (VBN) words (NNS) and (CC) complex (JJ) sentence (NN) structures (NNS) will (MD) not (RB) be (VB) used (VBN) and (CC) the (DT) translation (NN) is (VBZ) very (RB) close (RB) to (TO) the (DT) original (JJ) . (.)

Text 3

This (DT) thesis (NN) analyses (VBZ) Joy (NNP) Kogawa (NNP) ’ (NNP) s (VBD) novel (JJ) Obasan (NNP) in (IN) an (DT) attempt (NN) to (TO) discover (VB) the (DT) main (JJ) factors (NNS) which (WDT) influence (VBP) the (DT) development (NN) of (IN) the (DT) identity (NN) of (IN) the (DT) third (JJ) generation (NN) Japanese (JJ) Canadians (NNPS) . (.) It (PRP) also (RB) aims (VBZ) to (TO) determine (VB) the (DT) reasons (NNS) behind (IN) their (PRP$) identity (NN) crisis (NN) . (.) The (DT) thesis (NN) consists (VBZ) of (IN) four (CD) parts (NNS) : (:) the (DT) introduction (NN) , (,) two (CD) chapters (NNS) , (,) and (CC) the (DT) conclusion (NN) . (.) The (DT) introduction (NN) provides (VBZ) general (JJ) information (NN) about (IN) the (DT) thesis (NN) . (.) It (PRP) states (VBZ) the (DT) research (NN) questions (NNS) and (CC) explains (VBZ) the (DT) necessity (NN) of (IN) this (DT) research (NN) . (.)

Text 4

The (DT) phrase (NN) could (MD) also (RB) be (VB) seen (VBN) as (IN) Oedipa (NNP) ’ (NNP) s (VBD) first (JJ) sign (NN) of (IN) being (VBG) paranoid (JJ) herself (NN) , (,) settling (VBG) the (DT) fear (NN) of (IN) others (NNS) having (VBG) suspicions (NNS) about (IN) her (PRP) . (.) Later (RB) that (DT) day (NN) , (,) as (RB) well (RB) as (IN) the (DT) other (JJ) days (NNS) Oedipa (NNP) and (CC) Metzger (NNP) stay (VBP) at (IN) the (DT) motel (NN) The (DT) Paranoids (NNP) work (NN) at (IN) , (,) the (DT) band (NN) members (NNS) constantly (RB) check (VBP) in (IN) on (IN) the (DT) couple (NN) to (TO) see (VB) if (IN) they (PRP) are (VBP) doing (VBG) anything (NN) sexual (JJ) together (RB) . (.) Surveillance (NN) of (IN) this (DT) kind (NN) could (MD) certainly (RB) provoke (VB) a (DT) feeling (NN) of (IN) being (VBG) constantly (RB) watched (VBN) , (,) the (DT) first (JJ) symptom (NN) of (IN) paranoia (NN) . (.) Mike (NNP) Fallopian (NNP) is (VBZ) the (DT) president (NN) of (IN) a (DT) radical (JJ) right (NN) wing (VBG) group (NN) called (VBD) ‘ (NNP) Peter (NNP) Pinguid (NNP) Society (NNP) ’ (NNP) . (.) Peter (NNP) Pinguid (NNP) himself (PRP) did (VBD) not (RB) do (VB) much (RB) : (:) he (PRP) tried (VBD) to (TO) open (VB) a (DT) second (JJ) front (NN) during (IN) the (DT) American (NNP) Civil (NNP) War (NNP) , (,) but (CC) retreated (VBD) shortly (RB) afterwards (NNS) getting (VBG) scared (VBN) by (IN) Russians (NNPS) . (.)

Text 5

Rosenbach (NNP) also (RB) stated (VBD) that (IN) animacy (NN) is (VBZ) a (DT) more (RBR) important (JJ) factor (NN) than (IN) weight (NN) , (,) at (IN) least (JJS) according (VBG) to (TO) these (DT) sentences (NNS) which (WDT) were (VBD) chosen (VBN) for (IN) her (PRP$) research (NN) . (.) The (DT) difference (NN) between (IN) this (DT) research (NN) and (CC) Rosenbach (NNP) research (NN) is (VBZ) that (IN) Rosenbach (NNP) ’ (NNP) s (VBD) respondents (NNS) ’ (VB) were (VBD) native (JJ) speakers (NNS) but (CC) in (IN) this (DT) case (NN) the (DT) respondents (NNS) speak (VBP) English (NNP) as (IN) a (DT) foreign (JJ) language (NN) . (.) The (DT) current (JJ) research (NN) shows (VBZ) that (IN) it (PRP) does (VBZ) not (RB) matter (VB) whether (IN) the (DT) respondents (NNS) are (VBP) native (JJ) speakers (NNS) or (CC) whether (IN) they (PRP) are (VBP) advanced (JJ) learners (NNS) of (IN) English (NNP) , (,) the (DT) same (JJ) choices (NNS) of (IN) genitive (JJ) variation (NN) are (VBP) made (VBN) in (IN) both (DT) cases (NNS) , (,) at (IN) least (JJS) for (IN) the (DT) conditions (NNS) included (VBD) in (IN) the (DT) questionnaire (NN) . (.) However (RB) , (,) this (DT) research (NN) has (VBZ) some (DT) shortcomings (NNS) . (.) For (IN) example (NN) , (,) it (PRP) can (MD) not (RB) be (VB) clearly (RB) stated (VBN) that (IN) Estonian (JJ) learners (NNS) of (IN) English (NNP) do (VBP) not (RB) overuse (VB) the (DT) s-genitive (JJ) , (,) since (IN) the (DT) questionnaire (NN) used (VBN) four (CD) different (JJ) cases (NNS) where (WRB) genitives (NNS) had (VBD) to (TO) be (VB) chosen (VBN) . (.)

Text 6

One (CD) possible (JJ) way (NN) to (TO) expand (VB) on (IN) the (DT) work (NN) done (VBN) for (IN) this (DT) thesis (NN) is (VBZ) to (TO) cross (VB) reference (NN) the (DT) sentiment (NN) words (NNS) that (WDT) were (VBD) commonly (RB) used (VBN) during (IN) the (DT) time (NN) period (NN) of (IN) Twain (NNP) ’ (NNP) s (VBD) writing (VBG) with (IN) the (DT) words (NNS) that (WDT) were (VBD) indexed (VBN) for (IN) this (DT) thesis (NN) . (.) This (DT) may (MD) provide (VB) new (JJ) insights (NNS) for (IN) corpus (NN) linguistics (NNS) regarding (VBG) what (WP) kinds (NNS) of (IN) words (NNS) were (VBD) more (RBR) or (CC) less (RBR) frequent (JJ) in (IN) literature (NN) during (IN) a (DT) given (VBN) time (NN) period (NN) . (.) The (DT) analysis (NN) of (IN) the (DT) shape (NN) of (IN) the (DT) stories (NNS) of (IN) the (DT) five (CD) books (NNS) by (IN) Twain (NNP) has (VBZ) yielded (VBN) an (DT) interesting (JJ) insight (NN) . (.) All (PDT) the (DT) books (NNS) ended (VBD) with (IN) an (DT) increase (NN) in (IN) positive (JJ) sentiment (NN) word (NN) usage (NN) . (.) This (DT) might (MD) be (VB) an (DT) indicator (NN) that (WDT) Twain (VBP) preferred (VBN) to (TO) end (VB) a (DT) story (NN) on (IN) a (DT) positive (JJ) note (NN) . (.)

Text 7

I (PRP) picked (VBD) these (DT) categories (NNS) because (IN) I (PRP) have (VBP) always (RB) been (VBN) interested (JJ) in (IN) horse (NN) colours (NNS) and (CC) the (DT) origin (NN) of (IN) some (DT) of (IN) the (DT) terms (NNS) used (VBN) to (TO) describe (VB) the (DT) colours (NN) . (.) Horses (NNS) can (MD) perform (VB) many (JJ) different (JJ) gaits (NNS) and (CC) movements (NNS) , (,) which (WDT) has (VBZ) led (VBN) to (TO) the (DT) creation (NN) of (IN) a (DT) great (JJ) amount (NN) of (IN) terms (NNS) used (VBN) to (TO) describe (VB) them (PRP) ; (:) I (PRP) picked (VBD) out (RP) a (DT) few (JJ) that (WDT) seemed (VBD) the (DT) most (RBS) interesting (JJ) to (TO) me (PRP) . (.) Each (DT) term (NN) will (MD) be (VB) analysed (VBN) as (IN) follows (VBZ) : (:) how (WRB) the (DT) current (JJ) form (NN) of (IN) the (DT) term (NN) was (VBD) formed (VBN) ; (:) what (WP) are (VBP) and (CC) have (VBP) been (VBN) its (PRP$) uses (NNS) and (CC) meanings (NNS) both (DT) in- (JJ) and (CC) outside (JJ) of (IN) equestrian (JJ) terminology (NN) ; (:) a (DT) conclusion (NN) based (VBN) on (IN) the (DT) gathered (JJ) evidence (NN) whether (IN) the (DT) term (NN) originates (VBZ) from (IN) in- (NN) or (CC) outside (IN) of (IN) equestrian (JJ) terminology (NN) . (.) If (IN) a (DT) term (NN) has (VBZ) an (DT) uncertain (JJ) background (NN) or (CC) different (JJ) uses (NNS) of (IN) a (DT) term (NN) originate (NN) from (IN) different (JJ) areas (NNS) of (IN) life (NN) , (,) a (DT) discussion (NN) about (IN) it (PRP) might (MD) follow (VB) . (.) The (DT) main (JJ) purpose (NN) of (IN) this (DT) work (NN) is (VBZ) to (TO) determine (VB) whether (IN) the (DT) terms (NNS) mostly (RB) originate (VBP) from (IN) equestrian (JJ) specialty (NN) field (NN) or (CC) outside (IN) it (PRP) and (CC) whether (IN) one (CD) category (NN) is (VBZ) more (RBR) likely (JJ) to (TO) originate (VB) from (IN) either (DT) of (IN) them (PRP) than (IN) the (DT) other (JJ) . (.)

Text 8

According (VBG) to (TO) Erelt (NNP) , (,) the (DT) most (RBS) important (JJ) similarity (NN) is (VBZ) that (IN) there (EX) are (VBP) some (DT) instances (NNS) when (WRB) the (DT) object (NN) of (IN) the (DT) Estonian (JJ) impersonal (JJ) sentence (NN) has (VBZ) the (DT) qualities (NNS) of (IN) a (DT) subject (NN) . (.) Object (NN) can (MD) be (VB) in (IN) the (DT) partitive (NN) , (,) genitive (JJ) and (CC) nominative (JJ) cases (NNS) in (IN) Estonian (JJ) . (.) Thus (RB) , (,) when (WRB) the (DT) object (NN) , (,) especially (RB) the (DT) direct (JJ) object (NN) is (VBZ) in (IN) the (DT) nominative (JJ) case (NN) , (,) which (WDT) is (VBZ) usually (RB) a (DT) quality (NN) of (IN) a (DT) subject (NN) , (,) and (CC) appears (VBZ) at (IN) the (DT) beginning (NN) of (IN) the (DT) sentence (NN) , (,) it (PRP) may (MD) resemble (VB) an (DT) entity (NN) that (WDT) has (VBZ) been (VBN) promoted (VBN) into (IN) the (DT) status (NN) of (IN) a (DT) subject (NN) although (IN) it (PRP) keeps (VBZ) all (PDT) its (PRP$) object (JJ) properties (NNS) . (.) Another (DT) similarity (NN) is (VBZ) that (IN) it (PRP) is (VBZ) possible (JJ) to (TO) use (VB) a (DT) poolt-phrase (NN) to (TO) express (VB) the (DT) agent (NN) of (IN) the (DT) sentence (NN) , (,) very (RB) similarly (RB) to (TO) a (DT) by-phrase (NN) in (IN) English (NNP) , (,) albeit (IN) it (PRP) is (VBZ) somewhat (RB) bureaucratic (JJ) and (CC) has (VBZ) been (VBN) considered (VBN) a (DT) foreign (JJ) influence (NN) . (.) However (RB) , (,) there (EX) are (VBP) many (JJ) ways (NNS) in (IN) which (WDT) the (DT) active-passive (JJ) distinction (NN) and (CC) personal- (JJ) impersonal (JJ) distinction (NN) differ (NN) , (,) making (VBG) it (PRP) necessary (JJ) to (TO) recognise (VB) the (DT) personal-impersonal (JJ) opposition (NN) as (IN) a (DT) distinct (JJ) voice (NN) category (NN) from (IN) active- passive (JJ) opposition (NN) . (.)

Text 9

Infinite (NNP) Jest (NNP) , (,) advocates (VBZ) engagement (JJ) with (IN) entertaining (VBG) cycles (NNS) as (IN) an (DT) ethical (JJ) choice (NN) . (.) Dulk (NNP) believes (VBZ) that (IN) characters (NNS) arrive (VBP) at (IN) the (DT) realization (NN) that (IN) to (TO) have (VB) a (DT) meaningful (JJ) life (NN) , (,) one (CD) has (VBZ) to (TO) be (VB) connected (VBN) to (TO) the (DT) outside (JJ) world (NN) , (,) leaving (VBG) behind (IN) their (PRP$) solipsistic (JJ) worldviews (NNS) . (.) Bartlett (NNP) also (RB) argues (VBZ) against (IN) the (DT) common (JJ) notion (NN) that (IN) Hal (NNP) Incandenza (NNP) is (VBZ) a (DT) conduit (NN) for (IN) Wallace (NNP) in (IN) the (DT) text (NN) . (.) He (PRP) instead (RB) believes (VBZ) Hal (NNP) ’ (NNP) s (VBD) father (NN) Jim (NNP) Incandenza (NNP) to (TO) fill (VB) that (DT) role (NN) . (.) Bartlett (NNP) draws (VBZ) a (DT) parallel (NN) between (IN) Jim (NNP) Incandenza (NNP) and (CC) Wallace (NNP) in (IN) that (DT) Jim (NNP) creates (VBZ) the (DT) movie (NN) Infinite (NNP) Jest (NNP) in (IN) order (NN) to (TO) communicate (VB) with (IN) his (PRP$) son (NN) i.e (NN) . (.) the (DT) next (JJ) generation (NN) , (,) which (WDT) Wallace (NNP) has (VBZ) said (VBD) was (VBD) his (PRP$) goal (NN) with (IN) Infinite (NNP) Jest (NNP) . (.)


Text 10

With (IN) progressive (JJ) infinitive (NN) , (,) will (MD) and (CC) shall (MD) can (MD) have (VB) the (DT) meaning (NN) of (IN) future (JJ) as (IN) the (DT) matter (NN) of (IN) course (NN) . (.) The (DT) progressive (JJ) construction (NN) may (MD) also (RB) add (VB) tactfulness (NN) . (.) Huddleston (NNP) and (CC) Pullum (NNP) add (VBP) that (WDT) will (MD) can (MD) be (VB) an (DT) epistemic (JJ) modal (NN) with (IN) the (DT) meaning (NN) of (IN) futurity (NN) . (.) As (IN) futurity (NN) and (CC) modality (NN) are (VBP) closely (RB) related (VBN) by (IN) the (DT) fact (NN) that (IN) future (NN) is (VBZ) not (RB) as (RB) clear (JJ) as (IN) the (DT) past (NN) and (CC) the (DT) present (JJ) , (,) it (PRP) can (MD) still (RB) be (VB) said (VBD) that (IN) the (DT) futurity (NN) will (MD) is (VBZ) a (DT) modal (JJ) verb (NN) , (,) not (RB) a (DT) future (NN) tense (NN) marker (NN) . (.) Another (DT) meaning (NN) of (IN) will (MD) that (WDT) Huddleston (NNP) and (CC) Pullum (NNP) point (NN) out (RP) is (VBZ) propensity (NN) . (.)
