28 Chapter 3 Information Modelling and Representation

engineering, often restricted scenarios and problems are given, e.g. expert systems or information models in a special context. The term knowledge is defined in a limited way, according to mostly formal operations which can be performed by a machine. One should note that the basis of computer systems is calculation processes. Although humans can perform many operations by using machines, these systems do not have the complex structure of a human brain. When thinking processes are transferred to machines, they still have to be simplified and limited. In addition, other human actions, for example emotions or creative processes, are mostly excluded because they are still too complex to model.111 In AI and knowledge engineering, information and data do not seem to be ranked or grouped in the same manner as in some other disciplines. As mentioned, the terms information and knowledge are often mingled without a clear distinction.

This presentation of theories and approaches does not claim to be complete, but it shows how complex and difficult definitions of information can be. For this reason, it is not yet possible to give a general definition: how data, information and knowledge are regarded depends on the perspective. Being aware of this discussion may help in understanding the methods described in the next sections. In this thesis, a computer-aided approach is presented which models a special kind of information: information that is given in literature. Therefore, more restricted definitions have to be used.

In this thesis, information is defined as something that is embedded in a (literary) text or other media. The text or other media themselves are treated as data. In a communication process, information has to be extracted by a human. Then, according to Schneider’s approach (see section 2.5), information from literary texts can be processed in different ways, depending on the background of a recipient and on guidance, for example by a narrator.

3.2 Representation of Information

3.2.1 Text Annotation


“... Akt- und Versende sowie Strophenbegrenzungen bis zum Autornamen und den Werktitel ...”112 (“... endings of acts and verses as well as stanza boundaries, up to the author’s name and the title of the work ...”) This means that different kinds of information, such as information about the design or structure of a text, can be attached to it by using mark-up. Jannidis also stresses that every mark-up is an interpretation of a text and represents a particular view of it.113

The most popular markup language family is SGML (Standard Generalized Markup Language) with its derivatives XML (Extensible Markup Language) and HTML (Hypertext Markup Language). SGML was developed by Charles Goldfarb, Ed Mosher and Ray Lorie in the 1970s.114 Although SGML provides semantic and structural markup, the specification is so extensive and complex that only parts of it were ever implemented. By restricting the conception of SGML, HTML – an SGML application with a specific and limited set of markup – was developed to describe the presentation of data, text and pictures in a web browser.115 Another restricted version of SGML is XML, developed by Jon Bosak, Tim Bray, C. M. Sperberg-McQueen, and James Clark in the late 1990s. Using XML, it is possible to describe and preserve the structure of data. In contrast to HTML, XML is meant for a more general usage than presenting content in a web browser. Because of its specification, XML mark-up offers great flexibility for individual purposes, e.g. to structure literary texts such as verses and dramas as well as textual data in the natural sciences or documents of a company. Furthermore, the possibility of re-using and exchanging textual data enriched with meta-information in XML is of interest.
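As a minimal sketch of such individual mark-up – the element names here are invented for illustration and follow no standard – a short dramatic passage could be structured as follows:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- illustrative, non-standard element names -->
<drama>
  <act n="1">
    <scene n="1">
      <speech speaker="First Speaker">
        <line>The first line of the dialogue.</line>
        <line>The second line of the dialogue.</line>
      </speech>
    </scene>
  </act>
</drama>
```

The same structural information could be expressed with entirely different element names; XML itself only prescribes well-formedness rules, not a particular vocabulary.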

In order to share and interchange textual data, it is useful for people working with it to agree on a more standardised usage of XML. For this reason, encoding schemes or so-called DTDs (Document Type Definitions) can be introduced to restrict XML mark-up to a commonly accepted set.116 Several organisations and groups have developed schemes to provide mark-up in a standardised way, e.g. DocBook or the TEI. The TEI (Text Encoding Initiative) was developed to provide a standard for work with digital text sources. The main goal of the TEI is described as follows: “The Text Encoding Initiative (TEI) Guidelines are an international and interdisciplinary standard that enables libraries, museums, publishers, and individual scholars to represent a variety of literary and linguistic texts for online research, teaching, and preservation.”117 This means that a scheme has been developed to support encoding work in the context of language and culture. A wide range of mark-up for textual components and concepts is included in the scheme, e.g. to enrich data with linguistic and literary meta-information. This makes it possible to mark textual elements and the results of linguistic analysis as well as dramas, novels or verses.
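As a rough sketch – abbreviated and not a complete, valid TEI document – a stanza of verse encoded with elements from the TEI scheme might look like this:

```xml
<!-- fragment using TEI element names (lg = line group, l = line);
     the header content is abbreviated for illustration -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><!-- bibliographic meta-information --></teiHeader>
  <text>
    <body>
      <lg type="stanza">
        <l>First line of the stanza</l>
        <l>Second line of the stanza</l>
      </lg>
    </body>
  </text>
</TEI>
```

Because every project uses the same element names for the same concepts, such documents can be exchanged and processed with shared tools.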

The large repertoire of mark-up can be restricted to a subset that is required for the special purposes of a research project. In the presented approach, a special subset of TEI mark-up elements is used, which will be explained in section 6.1.4.

112 Jannidis 1999, p. 43

113 Jannidis 1999, p. 45

114 Harold and Means 2001, p. 8

115 Harold and Means 2001, p. 8

116 Harold and Means 2001, p. 4

117 Text Encoding Initiative 2003, homepage: http://www.tei-c.org/ (last accessed October 30, 2007)


An important aspect for understanding data marked with XML is that it is represented as a tree. A document always consists of a single root node followed by one or more child nodes, which can themselves contain children (cf. Fig. 3.1).

root
├── text → #PCDATA
└── text → #PCDATA

Figure 3.1: Scheme of (textual) data with XML mark-up. The data is represented as a tree with branches and nodes. The last nodes contain so-called PCDATA (parsed character data), which represents the content of the data itself.
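The scheme of Fig. 3.1 corresponds, for instance, to the following well-formed document (element names and content invented for illustration): a root element with two child elements whose character content forms the PCDATA leaves of the tree:

```xml
<root>
  <text>first portion of character data</text>
  <text>second portion of character data</text>
</root>
```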

One should note that mark-up can be included in, e.g., a drama to mark dialogues. However, because XML mark-up is specified individually, marked parts are not automatically presented in a special, visual way: XML separates the marked text from its layout, so documents encoded with XML mark-up contain no declarations for visualisation. Even if, for example, a dialogue structure is marked in a drama, a computer system cannot infer from the mark-up alone how this information should be displayed.

Therefore, if the text is to be visualised, a transformation to other formats is necessary. By encoding documents with XML, it is possible to choose different formats and visualisations for the transformation, like print versions or versions for a web browser. In addition, one can select parts of the marked text sections and transform them into different XML structures for further processing. The processing can be done with programming languages like Java or Perl, or with XSL (Extensible Stylesheet Language)118, which has been developed specifically for that purpose:

“XSLT is published by the World Wide Web Consortium (W3C) and fits into the XML family of standards, most of which are also developed by W3C. As the name implies, XSL is intended to define formatting and presentation of XML documents for display on screen, on paper, or in the spoken word.”119
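As a minimal sketch of such a transformation – the source element name line is an assumption for illustration – an XSLT stylesheet that renders every marked line as an HTML paragraph could look as follows:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- each <line> element of the source becomes an HTML paragraph -->
  <xsl:template match="line">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>
```

An XSLT processor applies the template to every matching element; nodes not covered by an explicit template are handled by XSLT’s built-in template rules, which simply recurse into child nodes and copy text content through.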

In summary, different ways of representing information and enriching textual data with meta-information have been presented here. Textual data mostly arises when working with already existing text sources, for example corpora of novels, corpora of interviews or learner corpora. In doing so, characteristic aspects have to be detected and marked. In linguistics, for instance, grammatical structures or syntactic relations like cohesion are of interest. In the humanities, the focus is put on structures and semantic relations concerning, for example, argumentations in a philosophical text or elements of a plot in a literary text.

118 For encoding and checking XML documents, editors like XMLSpy or Oxygen can be used, which support the syntax of XML.

119 Kay 2001, p. 25


In linguistic corpora, it may be possible to encode, e.g., grammatical elements using semi-automatic approaches. In contrast, text corpora in the humanities, especially in literature studies, cannot be processed in that way because the relevant aspects, e.g. semantic relations in texts, cannot yet be detected automatically.

Apart from corpora consisting of primary literature, like dramas or novels, it might also be challenging to model and mark theories from secondary literature. This means that theories about literature and the information they contain can also be interpreted and represented by using computer systems. Here, cognitive processes and the information representations they involve first have to be produced and can then be modelled: “Das liegt nicht zuletzt daran, daß solche literaturwissenschaftliche Fragestellungen sich selten direkt auf ein gegebenes primäres Datum Text oder Sprache richten, sondern zumeist auf Interpretations- oder Beschreibungsverhalte, die erst in Abhängigkeit von einem Primärtext erarbeitet werden müssen.”120 (“This is not least because such literary-studies research questions are rarely directed at a given primary datum of text or language, but mostly at interpretative or descriptive findings which first have to be worked out in dependence on a primary text.”) In this thesis, it is especially this kind of relation between primary and secondary literature that is of interest. In order to represent these aspects, an encoding using XML might not be enough:

“Because XML alone does not provide sufficient semantics for marked-up or annotated documents [...]”121 For this reason, further methods have to be explored which allow a more complex modelling of information. As mentioned, several such methods are used in disciplines like AI, which concentrate on simulating human processes in machines. In this thesis, theories of literature studies developed by humans are modelled, which also involve interpretation and comprehension processes. Therefore, approaches supporting such a modelling are outlined and discussed in the following. These sections should also serve a better understanding of the chosen modelling, which will be presented in chapter 4.

One idea for modelling information and semantics is contained in the effort to turn parts of the unstructured Internet into a Semantic Web, so that advanced retrieval becomes possible:

“The Semantic Web is a vision: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.” (Lacher and Decker 2001, p. 313)

To model data and information, different formats and methods have been released. These methods range from enriching data with semantic (meta-)information to complex models which structure data in an enhanced way.

Popular representation formats introduced in the context of the Semantic Web are RDF (Resource Description Framework) and OWL (Web Ontology Language), both initiated by the W3C (World Wide Web Consortium). Using RDF, it is possible to enrich web sites and documents with semantic information.122 RDFS (RDF Schema) is more expressive and also includes elements for creating a hierarchy. OWL was created to enrich web sites with semantic information and to make the Internet usable as a structured information source.123 OWL is based on RDF and RDFS, but additionally provides constructs that are necessary to create an ontology.124 These formats inherited older ideas from methods of modelling like semantic

120 Meister 1999, p. 78

121 Park and Hunting 2003, p. 124

122 See Mintert 2004, p. 124

123 See Ziegler 2004, p. 126

124 See Ziegler 2004, p. 127


networks, taxonomies, and ontologies. According to the way they model and structure data, they are divided in this thesis into two groups, which are outlined in the following.
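As a small sketch of such semantic enrichment in RDF/XML syntax – the resource URI and the choice of Dublin Core properties are assumptions for illustration – a statement attaching a title and an author to a web resource could be written as:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- subject: the resource; predicates: dc:title and dc:creator -->
  <rdf:Description rdf:about="http://example.org/texts/faust">
    <dc:title>Faust. Eine Tragödie.</dc:title>
    <dc:creator>Johann Wolfgang von Goethe</dc:creator>
  </rdf:Description>
</rdf:RDF>
```

Each such statement is a triple of subject, predicate and object; RDFS and OWL add vocabulary for classes, hierarchies and ontological constraints on top of this triple model.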