
Text Analysis and Machine Learning for Stylometrics and Stylogenetics


Academic year: 2022



Walter Daelemans, University of Antwerp

Abstract

Automatic text categorization, i.e. learning to assign documents to specific categories (e.g. in topic assignment or spam filtering), has been an influential application in Natural Language Processing. Text categorization systems consist of two components: a first one that constructs representations of documents (mostly bags of words represented as binary or numeric vectors), and a second one that uses standard machine learning techniques to learn mappings between such document vectors and their topics. Recently, this general approach has been put to use for other, linguistically more interesting “stylometric” applications, such as assigning authorship to documents or determining the gender of the author of a document. Such applications need linguistically more sophisticated document representations and provide insight into which linguistic properties of documents are relevant for predicting the author or the author's gender. In my presentation, I will give a brief overview of results in this approach and describe a number of applications of the methodology we are currently investigating in the CNTS research group. To create linguistically more interesting document representations, we use a memory-based shallow parser that analyzes documents at the levels of morphology, part of speech, phrases, and grammatical relations. More specifically, I will describe results on authorship attribution in the context of journalists writing about the same topic (politics). A more challenging task is personality assignment on the basis of text. We constructed a corpus consisting of 145 documents describing the contents of the same documentary, written by 145 different students who also took a personality test. We show which linguistic features correlate with different dimensions of personality and how well personality can be predicted from these features. Finally, I will describe work on what we dubbed “stylogenetics”: stylistic analysis of literary works based on the same general architecture, but using clustering as a machine learning technique rather than supervised learning.
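The two-component architecture described in the abstract can be illustrated with a minimal sketch. It is not the CNTS system: the toy documents, the labels, and the function names (`bag_of_words`, `train`, `predict`) are hypothetical, and a simple nearest-neighbour classifier with cosine similarity stands in for the "standard machine learning techniques" the abstract mentions.

```python
from collections import Counter
import math


def bag_of_words(text, vocabulary):
    """Component 1: represent a document as a numeric vector of word counts."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Component 2 (training): build the vocabulary and store the labelled vectors."""
    vocabulary = sorted({w for text, _ in labelled_docs for w in text.lower().split()})
    memory = [(bag_of_words(text, vocabulary), label) for text, label in labelled_docs]
    return vocabulary, memory


def predict(text, vocabulary, memory):
    """Component 2 (prediction): return the label of the most similar stored document."""
    vec = bag_of_words(text, vocabulary)
    _, best_label = max(memory, key=lambda item: cosine(vec, item[0]))
    return best_label


# Hypothetical toy data, for illustration only.
TOY_DOCS = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the agenda", "ham"),
]

if __name__ == "__main__":
    vocabulary, memory = train(TOY_DOCS)
    print(predict("free prize money", vocabulary, memory))
    print(predict("monday meeting agenda", vocabulary, memory))
```

For stylometric tasks such as authorship attribution, the same skeleton would apply with a different first component: the word counts would be replaced by features from a linguistic analysis (for example part-of-speech or phrase-level features), while the learning component could stay unchanged.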
