MarMoT can also be used from the command line. The input to MarMoT is a file in a one-token-per-line format. Sentence boundaries should be marked by an empty line. A valid example can be found in Figure A.1.
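To illustrate the format, the tokens of the sentence tagged in Figure A.2 would be written one per line, with an empty line after the sentence:

```
Murmeltiere
sind
im
Hochgebirge
zu
Hause
.

```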
If the sentences in Figure A.1 were stored in a file called text.txt, a pretrained MarMoT model could be run by executing:
java -cp marmot.jar marmot.morph.cmd.Annotator \
    --model-file de.marmot \
    --test-file form-index=0,text.txt \
    --pred-file text.out.txt
where form-index=0 specifies that the word form can be found in the first column. The MarMoT Annotator produces output in a truncated CoNLL 2009 format (the first 8 columns).
The output for the first sentence is shown in Figure A.2.

0 Murmeltiere _ _ _ NN _ case=nom|number=pl|gender=masc
1 sind _ _ _ VAFIN _ number=pl|person=3|tense=pres|mood=ind
2 im _ _ _ APPRART _ case=dat|number=sg|gender=neut
3 Hochgebirge _ _ _ NN _ case=dat|number=sg|gender=neut
4 zu _ _ _ APPR _ _
5 Hause _ _ _ NN _ case=dat|number=sg|gender=neut
6 . _ _ _ $. _ _

Figure A.2: Example output of the MarMoT Annotator.

A new model can be trained from this output file by running:

java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer \
    -train-file form-index=1,tag-index=5,morph-index=7,text.out.txt \
    -tag-morph true \
    -model-file en.marmot
where tag-index=5 specifies that the POS tag can be found in the sixth column and morph-index=7 that the morphological features can be found in the eighth column. An important training parameter is the model order. The default is a second-order model, but for some languages such as German a higher order might give better results. For completeness we give a list of all the available training options:
prune: Whether to use pruning. Default: true
effective-order: Maximal order to reach before increasing the level. Default: 1
seed: Random seed to use for shuffling; 0 for a nondeterministic seed. Default: 42
prob-threshold: Initial pruning threshold. Changing this value should have almost no effect. Default: 0.01
very-verbose: Whether to print a lot of status messages. Default: false
oracle: Whether to do oracle pruning. Probably not relevant. Default: false
trainer: Which trainer to use. Default: marmot.core.CrfTrainer
num-iterations: Number of training iterations. Default: 10
candidates-per-state: Average number of states to obtain after pruning at each order. These are the µ values. Default: [4, 2, 1.5]
max-transition-feature-level: Something for testing the code. Default: -1
beam-size: Beam size of the n-best decoder. Default: 1
order: Model order. Default: 2
initial-vector-size: Size of the weight vector. Default: 10000000
averaging: Whether to use averaging (perceptron only). Default: true
shuffle: Whether to shuffle between training iterations. Default: true
verbose: Whether to print status messages. Default: false
quadratic-penalty: L2 penalty parameter. Default: 0.0
penalty: L1 penalty parameter. Default: 0.0

Table A.1: General MarMoT options
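Assuming the order parameter from Table A.1 is passed on the command line like the options above, and using a hypothetical model file name, a third-order model could be trained along the following lines:

```
java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer \
    -train-file form-index=1,tag-index=5,morph-index=7,text.out.txt \
    -tag-morph true \
    -order 3 \
    -model-file de-order3.marmot
```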
observed-feature: Whether to use the observed feature. Default: true
split-pos: Whether to split POS tags. See subtag-separator. Default: false
form-normalization: Whether to normalize word forms before tagging. Default: none
shape: Whether to use shape features. Default: false
special-signature: Whether to mark if a word contains a special character in the word signature. Default: false
num-chunks: Number of chunks. CrossAnnotator only. Default: 5
restrict-transitions: Whether to only allow POS→MORPH transitions that have been seen during training. Default: true
type-dict: Word type dictionary file (optional).
split-morphs: Whether to split MORPH tags. See subtag-separator. Default: true
rare-word-max-freq: Maximal frequency of a rare word. Default: 10
type-embeddings: Word type embeddings file (optional).
tag-morph: Whether to train a morphological tagger or a POS tagger. Default: true
subtag-separator: Regular expression to use for splitting tags (has to work with Java's String.split). Default: \\|
internal-analyzer: Use an internal morphological analyzer. Currently supported: 'ar' for AraMorph (Arabic). Default: none
model-file: Model file path. Default: none
train-file: Input training file. Default: none
test-file: Input test file (optional for training). Default: none
pred-file: Output prediction file in CoNLL 2009 format (optional for training). Default: none
shape-trie-file: Path to the shape trie. Will be created if non-existent. Default: none

Table A.2: Morphological MarMoT options
MarLiN Implementation and Usage
In this appendix we explain the important implementation details of MarLiN (Martin et al., 1998). The latest version of the MarLiN source code and its documentation can be found at http://cistern.cis.lmu.de/marlin/.
B.1 Implementation
Our implementation follows the ideas explained in Martin et al. (1998). The most important part is the assignment of a word form to a specific class. This can be implemented efficiently if we keep track of the left and right contexts of each word. The following C++ code shows how this is implemented in MarLiN:
void incrementBigrams(int word, int klass, int factor) {
  forvec (_, Entry, entry, left_context_[word]) {
    int cword = entry.item;
    if (cword != word) {
      int cclass = word_assignment_[cword];
      addTagTagCount(cclass, klass, factor * entry.count);
    } else {
      addTagTagCount(klass, klass, factor * entry.count);
    }
  }
  forvec (_, Entry, entry, right_context_[word]) {
    int cword = entry.item;
    if (cword != word) {
      int cclass = word_assignment_[cword];
      addTagTagCount(klass, cclass, factor * entry.count);
    }
  }
}
left_context_ and right_context_ map each form to the list of its left and right neighbors, respectively. addTagTagCount increments the transition count of class klass preceding class cclass.
We also found that a large speed-up could be obtained if n·log n was precomputed for all n ≤ 10,000 and cached in an array:
size_t cache_size_ = 10000;
vector<double> nlogn_cache_;

void init_cache() {
  nlogn_cache_.resize(cache_size_);
  for (size_t i = 0; i < cache_size_; i++) {
    nlogn_cache_[i] = (i + 1) * log(i + 1);
  }
}

double nlogn(int n) {
  assert(n >= 0);
  if (n == 0) {
    return 0;
  }
  if (n - 1 < cache_size_) {
    return nlogn_cache_[n - 1];
  }
  return n * log(n);
}