MarMoT can also be used from the command line. The input to MarMoT is a file in a one-token-per-line format. Sentence boundaries should be marked by an empty line. A valid example can be found in Figure A.1.
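To illustrate the format, the tokens of the sentence tagged in Figure A.2 would be written one per line, with an empty line after the sentence:

```
Murmeltiere
sind
im
Hochgebirge
zu
Hause
.

```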
If the sentences in Figure A.1 were stored in a file called text.txt, a pretrained MarMoT model could be run by executing:
java -cp marmot.jar marmot.morph.cmd.Annotator \
    --model-file de.marmot \
    --test-file form-index=0,text.txt \
    --pred-file text.out.txt
where form-index=0 specifies that the word form can be found in the first column. The MarMoT Annotator produces output in a truncated CoNLL 2009 format (the first 8 columns).
The output for the first sentence is shown in Figure A.2.

0 Murmeltiere _ _ _ NN _ case=nom|number=pl|gender=masc
1 sind _ _ _ VAFIN _ number=pl|person=3|tense=pres|mood=ind
2 im _ _ _ APPRART _ case=dat|number=sg|gender=neut
3 Hochgebirge _ _ _ NN _ case=dat|number=sg|gender=neut
4 zu _ _ _ APPR _ _
5 Hause _ _ _ NN _ case=dat|number=sg|gender=neut
6 . _ _ _ $. _ _

Figure A.2: Example output of the MarMoT Annotator.

A new model can be trained from this output file by running:

java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer \
    -train-file form-index=1,tag-index=5,morph-index=7,text.out.txt \
    -tag-morph true \
    -model-file en.marmot
where tag-index=5 specifies that the POS tag can be found in the sixth column and morph-index=7 that the morphological features can be found in the eighth column. An important training parameter is the model order. The default is a second-order model, but for some languages such as German a higher order might give better results. For completeness we give a list of all the available training options:
prune: Whether to use pruning. Default: true
effective-order: Maximal order to reach before increasing the level. Default: 1
seed: Random seed to use for shuffling; 0 for a nondeterministic seed. Default: 42
prob-threshold: Initial pruning threshold. Changing this value should have almost no effect. Default: 0.01
very-verbose: Whether to print a lot of status messages. Default: false
oracle: Whether to do oracle pruning. Probably not relevant. Default: false
trainer: Which trainer to use. Default: marmot.core.CrfTrainer
num-iterations: Number of training iterations. Default: 10
candidates-per-state: Average number of states to obtain after pruning at each order. These are the µ values. Default: [4, 2, 1.5]
max-transition-feature-level: Something for testing the code. Default: -1
beam-size: Beam size of the n-best decoder. Default: 1
order: Model order. Default: 2
initial-vector-size: Size of the weight vector. Default: 10000000
averaging: Whether to use averaging (perceptron only). Default: true
shuffle: Whether to shuffle between training iterations. Default: true
verbose: Whether to print status messages. Default: false
quadratic-penalty: L2 penalty parameter. Default: 0.0
penalty: L1 penalty parameter. Default: 0.0

Table A.1: General MarMoT options
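Assuming the order parameter from Table A.1 is passed on the command line like the options above, and using a hypothetical model file name, a third-order model could be trained along the following lines:

```
java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer \
    -train-file form-index=1,tag-index=5,morph-index=7,text.out.txt \
    -tag-morph true \
    -order 3 \
    -model-file de-order3.marmot
```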
observed-feature: Whether to use the observed feature. Default: true
split-pos: Whether to split POS tags. See subtag-separator. Default: false
form-normalization: Whether to normalize word forms before tagging. Default: none
shape: Whether to use shape features. Default: false
special-signature: Whether to mark if a word contains a special character in the word signature. Default: false
num-chunks: Number of chunks. CrossAnnotator only. Default: 5
restrict-transitions: Whether to only allow POS→MORPH transitions that have been seen during training. Default: true
type-dict: Word type dictionary file (optional).
split-morphs: Whether to split MORPH tags. See subtag-separator. Default: true
rare-word-max-freq: Maximal frequency of a rare word. Default: 10
type-embeddings: Word type embeddings file (optional).
tag-morph: Whether to train a morphological tagger or a POS tagger. Default: true
subtag-separator: Regular expression to use for splitting tags (has to work with Java's String.split). Default: \\|
internal-analyzer: Use an internal morphological analyzer. Currently supported: 'ar' for AraMorph (Arabic). Default: none
model-file: Model file path. Default: none
train-file: Input training file. Default: none
test-file: Input test file (optional for training). Default: none
pred-file: Output prediction file in CoNLL 2009 format (optional for training). Default: none
shape-trie-file: Path to the shape trie. Will be created if non-existent. Default: none

Table A.2: Morphological MarMoT options
MarLiN Implementation and Usage
In this appendix we explain the important implementation details of MarLiN (Martin et al., 1998). The latest version of the MarLiN source code and its documentation can be found at http://cistern.cis.lmu.de/marlin/.
B.1 Implementation
Our implementation follows the ideas explained in Martin et al. (1998). The most important part is the assignment of a word form to a specific class. This can be implemented efficiently if we keep track of the left and right contexts of each word. The following C++ code shows how this is implemented in MarLiN:
void incrementBigrams(int word, int klass, int factor) {
  forvec (_, Entry, entry, left_context_[word]) {
    int cword = entry.item;
    if (cword != word) {
      int cclass = word_assignment_[cword];
      addTagTagCount(cclass, klass, factor * entry.count);
    } else {
      addTagTagCount(klass, klass, factor * entry.count);
    }
  }
  forvec (_, Entry, entry, right_context_[word]) {
    int cword = entry.item;
    if (cword != word) {
      int cclass = word_assignment_[cword];
      addTagTagCount(klass, cclass, factor * entry.count);
    }
  }
}
left_context_ and right_context_ map each form to the list of its left and right neighbors, respectively. addTagTagCount increments the transition count of class klass preceding class cclass.
We also found that a large speed-up could be obtained if n·log n was precomputed for all n ≤ 10,000 and cached in an array:
size_t cache_size_ = 10000;
vector<double> nlogn_cache_;

void init_cache() {
  nlogn_cache_.resize(cache_size_);
  for (size_t i = 0; i < cache_size_; i++) {
    nlogn_cache_[i] = (i + 1) * log(i + 1);
  }
}

double nlogn(int n) {
  assert(n >= 0);
  if (n == 0) {
    return 0;
  }
  if (n - 1 < cache_size_) {
    return nlogn_cache_[n - 1];
  }
  return n * log(n);
}