
5.1 ONDEX as text mining framework

5.1.1 Basic definitions and ONDEX framework overview

Ontologies

Although ontologies are defined in many different ways in the literature (Gruber, 1993; Köhler et al., 2003, 2004; Smith et al., 2005), in the most basic and still widely used form an ontology can be described as an extension of a controlled vocabulary CV, which is a collection of well-defined terms:

CV := named set of concepts c, (5.1)

with a concept c defined as

c := (identifier, names, definition). (5.2)

The symbols identifier, names and definition represent different data types and can be implemented in various ways. The identifier can be a string or a number, but it must be unique; definition is usually a string and names is a set of strings. A concept can have more than one name, i.e. usually a term and its synonyms are stored here. There can be a preferred or main term, but concepts are not identified by any of their names. This ensures that a CV does not contain ambiguous concepts such as homonyms (terms that share the same name but have different definitions). A CV can still contain homonymous names, but the respective concepts have different identifiers and unambiguous definitions.

Thus, a concept can be uniquely identified, and hence a controlled vocabulary is more than a loose collection of terms. Each concept appears only once, with one specific meaning.
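Definitions 5.1 and 5.2 can be illustrated with a small data structure. The following is only a minimal sketch under stated assumptions: the class names `Concept` and `ControlledVocabulary`, the method names and the example entries are hypothetical and are not taken from the ONDEX implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept c := (identifier, names, definition), following Definition 5.2."""
    identifier: str
    names: frozenset
    definition: str


@dataclass
class ControlledVocabulary:
    """A named set of concepts, keyed by their unique identifier (Definition 5.1)."""
    name: str
    concepts: dict = field(default_factory=dict)

    def add(self, concept):
        # Identifiers must be unique; names may repeat across concepts (homonyms).
        if concept.identifier in self.concepts:
            raise ValueError(f"duplicate identifier: {concept.identifier}")
        self.concepts[concept.identifier] = concept

    def lookup(self, name):
        # A name can match several concepts; only the identifier tells them apart.
        matches = [c for c in self.concepts.values() if name in c.names]
        return sorted(matches, key=lambda c: c.identifier)


# Hypothetical example entries: two concepts sharing the homonymous name "bank".
cv = ControlledVocabulary("example")
cv.add(Concept("C1", frozenset({"bank", "river bank"}), "sloping land beside a body of water"))
cv.add(Concept("C2", frozenset({"bank", "financial institution"}), "institution that accepts deposits"))
homonyms = cv.lookup("bank")
```

Looking up "bank" returns both concepts, which remain distinguishable by identifier and definition, whereas adding a second concept with the identifier "C1" is rejected.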

An ontology O can then be defined as an extension of a controlled vocabulary (Köhler et al., 2003):

O := Graph G(CV, E), with edges E ⊆ CV × CV. (5.3)

The types of the edges (i.e. the relation types) are given by a function t defined as:

t: E → T, with T := {set of possible edge types}, (5.4)

i.e. each type in T describes the semantics of an edge in natural language and its algebraic relational properties (transitivity, symmetry and reflexivity).

All ontologies have an edge type ‘is-a’ ∈ T. If two concepts c1, c2 ∈ CV are connected by an edge of this type, the natural language meaning is “c1 is a c2”. For example, the concepts “vertebrate”, “animal” and “organism” are connected by transitive ‘is-a’ relations, i.e. vertebrate ‘is-a’ animal and animal ‘is-a’ organism. The transitivity of the ‘is-a’ relation can then be used to derive the fact that vertebrate ‘is-a’ organism. Furthermore, an ontology is defined as an acyclic graph with respect to its ‘is-a’ relations, i.e. circular definitions in the ‘is-a’ structure are not allowed. A further common relation type is ‘part-of’. Examples of widely used ontologies are the Gene Ontology (GO, see Ashburner et al., 2000), the Unified Medical Language System (UMLS, see Bodenreider, 2004) and WordNet (Fellbaum, 1998). Another ontology important in this context is the Cell Ontology (CL, see Bard et al., 2005), which contains a hierarchy of cell types.
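The typed graph of Definitions 5.3 and 5.4, the transitive use of ‘is-a’ and the acyclicity requirement can be sketched as follows. This is an illustrative sketch only: the class `Ontology` and its methods are hypothetical names, and as a simplification each ordered pair of concepts carries at most one edge type.

```python
from collections import defaultdict


class Ontology:
    """Graph G(CV, E) with a type function t: E -> T (Definitions 5.3 and 5.4)."""

    def __init__(self):
        # Maps each edge (source, target) to its type, i.e. the function t.
        self.edge_type = {}
        # Direct 'is-a' parents of each concept, used for closure and cycle checks.
        self.is_a_parents = defaultdict(set)

    def ancestors(self, concept):
        """Return all concepts reachable from `concept` via 'is-a' edges."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.is_a_parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_edge(self, source, target, kind):
        if kind == "is-a" and (source == target or source in self.ancestors(target)):
            # The target already reaches the source, so this edge would close a cycle.
            raise ValueError(f"'is-a' cycle: {source} -> {target}")
        self.edge_type[(source, target)] = kind
        if kind == "is-a":
            self.is_a_parents[source].add(target)


onto = Ontology()
onto.add_edge("vertebrate", "animal", "is-a")
onto.add_edge("animal", "organism", "is-a")
derived = onto.ancestors("vertebrate")  # includes "organism" by transitivity
```

With these two edges, the derived ancestors of "vertebrate" contain "organism", and an attempt to add organism ‘is-a’ vertebrate is rejected as a circular definition.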


A similar concept is that of a semantic network, with the difference that in a semantic network the type of the relation between two concepts is not as strictly defined as in an ontology. Nor does a semantic network require that the ‘is-a’ relation exists. Thus, an ontology is a form of semantic network, but not every semantic network can be regarded as an ontology. In computational biology, semantic networks are used, for example, to model intracellular signaling pathways (Hsing et al., 2004). The UMLS is sometimes also referred to as a semantic network (McCray and Nelson, 1995).

ONDEX database scheme

Databases, ontologies and other sources are imported via specialized parsers into the ONDEX database by converting the data structure of the external source into the concept-relation scheme of ONDEX. Figure 5.2 shows the main parts of the entity-relationship diagram of the ONDEX database as far as they are relevant to this thesis. These parts are the ONDEX core (left side of Figure 5.2), which stores the data sources as one unified ontology graph, and the text mining part (right side of Figure 5.2), which contains the selected texts and the concepts to be mined as well as the text mining results. A further part of ONDEX is the generalized data structure (GDS), which allows importing data that has no pre-defined data types in the ONDEX database (not shown in Figure 5.2). This gives greater flexibility for importing and managing heterogeneous data, but is not used within this thesis.

The central table of the ONDEX core is CONCEPT. Here the identifier (field: id) and the description (field: description) of a concept (according to Definition 5.2) are stored. Each concept can have one or several names (synonyms) in the joined table CONCEPT_NAME. One of the names can be flagged as the preferred or main name (field: is_preferred), but this is not necessary for an unambiguous identification of concepts. Synonyms are thus not stored as related concepts, but rather as properties of the concept itself. The field is_unique is set to true if no other concept in the database carries the same name (i.e. there are no homonyms). Since the concept names are processed with several natural language processing (NLP) tools (see Section 5.1.2, step 1), there are several fields for storing the original name and the different names resulting from NLP processing. The table CONCEPT_ACC is used for storing different accession numbers of the same concept. An accession number in this context is a reference that maps a database entry to entities in other databases; a protein in Swiss-Prot, for example, can have links to corresponding genes in the Gene Ontology.
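How the homonym flag could be derived from the stored names can be sketched with an in-memory SQLite database. This is a hedged illustration, not the ONDEX schema: only the fields id, description, is_preferred and is_unique are named in the text, while the lower-case table names, the columns `concept_id` and `name`, the underscore spelling and the example rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (
        id          INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE concept_name (
        concept_id   INTEGER NOT NULL REFERENCES concept(id),  -- assumed column
        name         TEXT    NOT NULL,                         -- assumed column
        is_preferred INTEGER NOT NULL DEFAULT 0,
        is_unique    INTEGER NOT NULL DEFAULT 1
    );
""")

# Hypothetical example data: "CAT" is a name of two different concepts.
conn.executemany("INSERT INTO concept VALUES (?, ?)", [
    (1, "enzyme that decomposes hydrogen peroxide"),
    (2, "enzyme that acetylates chloramphenicol"),
])
conn.executemany(
    "INSERT INTO concept_name (concept_id, name, is_preferred) VALUES (?, ?, ?)",
    [
        (1, "catalase", 1),
        (1, "CAT", 0),
        (2, "chloramphenicol acetyltransferase", 1),
        (2, "CAT", 0),
    ],
)

# A name is unique only if exactly one concept carries it (no homonyms).
conn.execute("""
    UPDATE concept_name
    SET is_unique = (
        SELECT COUNT(DISTINCT other.concept_id) = 1
        FROM concept_name AS other
        WHERE other.name = concept_name.name
    )
""")
rows = conn.execute(
    "SELECT concept_id, name, is_unique FROM concept_name ORDER BY concept_id, name"
).fetchall()
```

After the update, both "CAT" rows are flagged as not unique, while the two full enzyme names keep the flag set.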

In this version of ONDEX, relations defined in the imported data sources are distinguished from mappings created by algorithms in ONDEX. For this purpose, imported relations are stored in RELATION, and additional mappings between concepts derived by one of the ONDEX mapping algorithms (see Section 5.1.2, step 2) are written into MAPPING.

Figure 5.2: Entity relationship diagram of the ONDEX core and the text mining part. The generalized data structure (GDS) part is not used in this thesis and is therefore left out.

Each relation and mapping is further characterized by a RELATION_TYPE (e.g. is_a) and a MAPPING_METHOD, respectively. In a RELATION the direction is captured by distinguishing between the source (FROM_CONCEPT) and the target (TO_CONCEPT), whereas a MAPPING consists simply of two concepts without any order. The re-designed ONDEX system currently under development unifies relations and mappings into the RELATION table by assigning appropriate types in RELATION_TYPE. Undirected mappings will then be represented by double entries (one for each direction) in the RELATION table.
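The unification of directed relations and undirected mappings described above can be sketched as follows. The function name `unify`, the tuple layout and the type names in the example are hypothetical; the sketch only illustrates the double-entry idea and assumes that the mapping method is reused as the relation type.

```python
def unify(relations, mappings):
    """Merge directed relations and undirected mappings into one relation list.

    relations: iterable of (from_concept, to_concept, relation_type)
    mappings:  iterable of (concept_a, concept_b, mapping_method)
    """
    unified = list(relations)
    for a, b, method in mappings:
        # An undirected mapping becomes two directed rows, one per direction.
        unified.append((a, b, method))
        unified.append((b, a, method))
    return unified


# Hypothetical example: one imported relation and one derived mapping.
relation_rows = unify(
    relations=[("c1", "c2", "is_a")],
    mappings=[("c2", "c3", "name_match")],
)
```

A query for all outgoing edges of a concept then no longer needs to consult two tables or check both positions of a mapping, at the cost of storing each mapping twice.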

A further central table of the ONDEX core is CV. CV stands for controlled vocabulary; the table contains identifiers for all imported data sources, so that any concept or relation can be traced back to its origin. Internally derived mappings are also identifiable by a CV entry.

The CV table also reflects our basic understanding of ontologies implemented in the ONDEX system: initially, controlled vocabularies contain concepts consisting of an identifier, a description and a set of names. The concepts can subsequently be linked by relations (either imported or newly created), which together constitute the ONDEX ontology. Thus, ONDEX is able to import simple controlled vocabularies without any pre-defined relations as well as ontologies and databases (at least those databases that can be interpreted as ontologies). The unified concept-relation structure supports the search for additional mappings between concepts of the various imported data sources.

| CV | Full name | Content | Concepts | Relations |
|---|---|---|---|---|
| CL | Cell Ontology | Cell types | 699 | 1 056 |
| EC | Enzyme Nomenclature Committee | Enzyme classification | 4 502 | 4 496 |
| DRA | Drastic Insight Database | Host-pathogen interactions | 5 438 | 9 907 |
| TF | Biobase TransFac | Transcription factor database | 13 945 | 9 179 |
| GO | Gene Ontology | Gene and gene product attributes in any organism | 17 427 | 25 337 |
| AC | AraCyc | Biochemical pathway database for arabidopsis | 18 909 | 18 424 |
| MC | MetaCyc | Metabolic pathway database for different organisms | 21 345 | 34 384 |
| MESHD | Medical Subject Headings (MeSH) Descriptors | NLM’s standard medical index terms | 22 995 | 40 525 |
| TP | Biobase Transpath | Signal transduction database | 45 243 | 59 825 |
| WN | WordNet | Lexical reference system of English nouns, verbs, adjectives and adverbs | 115 424 | 213 335 |
| TX | NCBI Taxonomy | Organism names and classification | 220 927 | 220 927 |
| BREND | BRENDA database | Enzyme information system | 286 983 | 691 799 |
| KEGG | KEGG database | Kyoto encyclopedia of genes and genomes | 2 170 188 | 3 073 502 |

Table 5.1: Overview of some of the ontologies and databases available for importing into ONDEX. The gray lines mark the data sources imported and used within this thesis. “CV” is the identifier of the database/ontology. The table is ordered by increasing numbers of concepts and relations provided by the data source.

The text mining part of ONDEX consists mainly of the texts imported into the table TEXT. There they are stored both unchanged and after processing by the same NLP tools used for importing the databases and ontologies. When mining very large amounts of text, a split into a number of text tables might be necessary (see step 3 in Section 5.1.3 for a more detailed problem description and Section 5.2 for the specific procedure applied to the MEDLINE texts imported in this context).

For each text mining task a PROJECT can be defined that specifies a subset of selected concepts (CONCEPT_SUBSET) and a subset of selected texts (TEXT_SUBSET). The concept-based indexing algorithm (see Section 5.1.3, step 4) then records in IDENTIFIED_CONCEPT, for each concept of CONCEPT_SUBSET, the texts in which it is found.
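The role of the two subsets and of the result table can be illustrated with a deliberately naive matcher. This is not the ONDEX indexing algorithm, which is described in Section 5.1.3; it is a sketch that assumes plain case-insensitive whole-word matching of concept names, and the function name `index_concepts`, the dictionary layout and the example data are hypothetical.

```python
import re


def index_concepts(concept_names, texts, concept_subset, text_subset):
    """Return {concept_id: sorted ids of the texts in which one of its names occurs}.

    concept_names: {concept_id: set of names}
    texts:         {text_id: text content}
    Only concepts in concept_subset and texts in text_subset are considered.
    """
    identified = {}
    for concept_id in concept_subset:
        names = sorted(concept_names[concept_id])
        if not names:
            # A concept without names cannot be found in any text.
            identified[concept_id] = []
            continue
        # Whole-word, case-insensitive match of any name of the concept.
        alternatives = "|".join(re.escape(name) for name in names)
        pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
        identified[concept_id] = sorted(
            text_id for text_id in text_subset if pattern.search(texts[text_id])
        )
    return identified


# Hypothetical project: two concepts and three texts.
concept_names = {"c1": {"catalase", "CAT"}, "c2": {"peroxidase"}}
texts = {
    "t1": "Catalase activity was increased.",
    "t2": "No enzyme was measured.",
    "t3": "Peroxidase and CAT levels were compared.",
}
identified = index_concepts(concept_names, texts, ["c1", "c2"], ["t1", "t2", "t3"])
```

The returned dictionary plays the role of the result table: each selected concept is associated with the selected texts that mention it, and texts outside the chosen subset are never inspected.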

With the table ANNOTATION, texts can be additionally annotated by entries of other controlled vocabularies. The MEDLINE database for instance provides links to databases and other external sources.