
5.2 Applying ONDEX to cell-cell relation mining

5.2.5 Information extraction

Based on the texts indexed by the entities of interest (cell types, messenger substances and receptors), the actual cell-cell relations can be extracted by performing concept-based co-occurrence searches and subsequent hypothesis generation (step 5). Probably the simplest approach to begin with is to search for texts containing concepts of all elements sufficient to describe a signaling relation, i.e. two cell types (source and target cell) and a first messenger. However, only very few texts deal with complete cellular interactions such as “cell type A interacts with cell type B through messenger substance C”. Even constraining the search by including keywords indicating interactions or signaling, e.g. “interact”, “release” or “signal”, did not result in more specific texts.

Consequently, the co-occurrence search approach had to be refined. Reconsidering the biological background (Section 2.1), any cell-cell signal can be decomposed into three components (Section 3.1): the messenger release (source cell → messenger), the ligand-receptor binding (messenger → receptor) and the occurrence of receptors in cells (receptor → target cell). Thus, each of these components can be searched independently by separate co-occurrence searches and subsequently combined into cell-cell relation hypotheses.

Double co-occurrence searches

After decomposing a cell-cell signal into its three components, the most straightforward approach is to search the abstracts for double co-occurrences, i.e. tuples each consisting of two concepts from different concept groups. To this end, three concept groups are defined according to Definition 5.8 by applying the previously defined set of concept types CT (Definition 5.10):

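\begin{equation}
\begin{aligned}
CG_1 &:= \{c \in CS \mid ct(c) = \mathit{cell}\},\\
CG_2 &:= \{c \in CS \mid ct(c) = \mathit{msngr}\},\\
CG_3 &:= \{c \in CS \mid ct(c) = \mathit{rec}\},
\end{aligned}
\tag{5.11}
\end{equation}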

where the concepts $c$ are from the concept subset $CS \subseteq C$, which is generated from the manually created concept lists (Section 5.2.1), and $ct$ is the function that returns the concept type of a concept $c$ (Definition 5.7). Thus, each concept group $CG_g$ consists of concepts possessing a specified concept type $t_g \in CT$.

Using these concept groups, the co-occurrence groups containing all concept tuples to be searched are (according to Definition 5.9):

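\begin{equation}
\begin{aligned}
COC_{\text{cell-msngr}} &:= CG_1 \times CG_2,\\
COC_{\text{msngr-rec}} &:= CG_2 \times CG_3,\\
COC_{\text{rec-cell}} &:= CG_3 \times CG_1.
\end{aligned}
\tag{5.12}
\end{equation}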

Hence, each co-occurrence group contains a set of concept tuples, with each concept tuple consisting of two concepts from different concept groups. For instance, tuples in the co-occurrence group for cell and msngr could look like

\[
COC_{\text{cell-msngr}} := \{(\text{erythrocyte}, \text{insulin}),\ (\text{hepatocyte}, \text{FSH}),\ \ldots\}.
\]
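The construction of concept groups and the double co-occurrence search can be illustrated with a minimal sketch. It is not the ONDEX implementation: the dictionaries `concept_types` and `text_index`, the function names and the toy entries are all assumptions made for illustration, and the concept-to-text index is taken as given.

```python
from itertools import product

# Hypothetical toy data (not from the thesis): concept -> concept type.
concept_types = {
    "erythrocyte": "cell",
    "hepatocyte": "cell",
    "insulin": "msngr",
    "FSH": "msngr",
    "insulin receptor": "rec",
}

# Hypothetical concept index: text id -> set of concepts found in that text.
text_index = {
    "t1": {"hepatocyte", "insulin", "insulin receptor"},
    "t2": {"erythrocyte", "FSH"},
    "t3": {"insulin"},
}


def concept_group(types, concept_type):
    """Return all concepts of the given type (one concept group)."""
    return {c for c, t in types.items() if t == concept_type}


def double_cooccurrences(group_a, group_b, index):
    """Map each tuple of the Cartesian product to the texts containing both concepts."""
    hits = {}
    for a, b in product(group_a, group_b):
        texts = {tid for tid, concepts in index.items()
                 if a in concepts and b in concepts}
        if texts:  # keep only tuples that actually co-occur somewhere
            hits[(a, b)] = texts
    return hits


cg_cell = concept_group(concept_types, "cell")
cg_msngr = concept_group(concept_types, "msngr")
cell_msngr_hits = double_cooccurrences(cg_cell, cg_msngr, text_index)
```

The sketch enumerates the full Cartesian product, which mirrors the definition directly; for large concept lists one would more likely iterate over the texts and pair the concepts found in each.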

Hypotheses are then generated by combining those tuples of the three co-occurrence searches that share their connecting concepts, i.e. the same msngr and the same rec concept.
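The combination step amounts to a join of the three tuple sets on the shared msngr and rec concepts. The following sketch shows one possible way to do this; the function name `generate_hypotheses` and the toy tuples are hypothetical and make no biological claim.

```python
def generate_hypotheses(cell_msngr, msngr_rec, rec_cell):
    """Join the three tuple sets on identical messenger and receptor concepts.

    Returns (source cell, messenger, receptor, target cell) tuples.
    """
    hypotheses = set()
    for source, msngr in cell_msngr:
        for msngr2, rec in msngr_rec:
            if msngr2 != msngr:
                continue  # messenger must be the same in both tuples
            for rec2, target in rec_cell:
                if rec2 == rec:  # receptor must be the same in both tuples
                    hypotheses.add((source, msngr, rec, target))
    return hypotheses


# Hypothetical toy tuples, chosen only to exercise the join.
cell_msngr = {("erythrocyte", "insulin"), ("hepatocyte", "FSH")}
msngr_rec = {("insulin", "insulin receptor")}
rec_cell = {("insulin receptor", "hepatocyte"), ("insulin receptor", "adipocyte")}

hypotheses = generate_hypotheses(cell_msngr, msngr_rec, rec_cell)
```

In this toy example the tuple (hepatocyte, FSH) yields no hypothesis, because no messenger-receptor tuple for FSH was found.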

Triple co-occurrence searches

To further restrict the co-occurrence search, three additional lists are applied which contain keyword concepts indicating that a text expresses the searched fact. These lists consist of concepts describing the cellular release or production of molecules (rword), the binding or interaction of molecules (bword) and the ability of cells to contain or express molecules (cword), respectively (Section 5.2.1). The respective concept groups (according to Definition 5.8) are:

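\begin{equation}
\begin{aligned}
CG_4 &:= \{c \in CS \mid ct(c) = \mathit{rword}\},\\
CG_5 &:= \{c \in CS \mid ct(c) = \mathit{bword}\},\\
CG_6 &:= \{c \in CS \mid ct(c) = \mathit{cword}\}.
\end{aligned}
\tag{5.13}
\end{equation}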

In contrast to the double co-occurrence searches, it is not important here which concept of a keyword concept group (Definition 5.13) is contained in a text, but only whether a text contains any keyword concept at all. For example, if a text containing the concept tuple (erythrocyte, insulin) is found, it is now additionally checked whether any keyword such as “release”, “secrete” or “production” appears. Thus, it is not necessary to check all concept triples that would be generated for e.g. $CG_1 \times CG_2 \times CG_4$, but only triples that have the set of all keyword concepts as their third element. Therefore, we define new concept groups, each of which contains one of the concept groups defined in Definition 5.13 as its only element:

\begin{equation}
\begin{aligned}
CG'_4 &:= \{CG_4\},\\
CG'_5 &:= \{CG_5\},\\
CG'_6 &:= \{CG_6\}.
\end{aligned}
\tag{5.14}
\end{equation}

Using these concept groups, the new co-occurrence groups for triple co-occurrence searches are (according to Definition 5.9) defined as:

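\begin{equation}
\begin{aligned}
COC_{\text{cell-msngr-rword}} &:= CG_1 \times CG_2 \times CG'_4,\\
COC_{\text{msngr-rec-bword}} &:= CG_2 \times CG_3 \times CG'_5,\\
COC_{\text{rec-cell-cword}} &:= CG_3 \times CG_1 \times CG'_6,
\end{aligned}
\tag{5.15}
\end{equation}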

i.e. each co-occurrence group now contains a set of concept triples, with each triple consisting, as before (Definition 5.12), of two cell, msngr or rec concepts and additionally a set of keyword concepts. For example, a part of the co-occurrence group for cell, msngr and rword could look like

\[
COC_{\text{cell-msngr-rword}} := \{(\text{erythrocyte}, \text{insulin}, \{\text{release}, \text{secrete}, \text{produce}, \ldots\}),\ (\text{hepatocyte}, \text{FSH}, \{\text{release}, \text{secrete}, \text{produce}, \ldots\}),\ \ldots\}.
\]

Thus, the texts found in the previous step are here checked for additional keywords. Hypotheses are then generated from the resulting reduced set of concept co-occurrences as before.
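Because the third element of each triple is a whole keyword set, the triple search reduces to filtering the double co-occurrence hits by a non-empty intersection with that set. The sketch below illustrates this under assumed data structures; `triple_cooccurrences`, the index layout and the toy entries are hypothetical.

```python
def triple_cooccurrences(double_hits, index, keyword_group):
    """Keep only those texts of each double hit that contain any keyword concept."""
    filtered = {}
    for pair, text_ids in double_hits.items():
        # A text qualifies if it shares at least one concept with the keyword set.
        kept = {tid for tid in text_ids if index[tid] & keyword_group}
        if kept:
            filtered[pair] = kept
    return filtered


# Hypothetical toy data: hits from the double search and the concept index.
double_hits = {
    ("erythrocyte", "insulin"): {"t1", "t2"},
    ("hepatocyte", "FSH"): {"t3"},
}
index = {
    "t1": {"erythrocyte", "insulin", "release"},
    "t2": {"erythrocyte", "insulin"},
    "t3": {"hepatocyte", "FSH", "bind"},
}
rword = frozenset({"release", "secrete", "produce"})

triple_hits = triple_cooccurrences(double_hits, index, rword)
```

Here text t3 is dropped although it contains a keyword, because "bind" belongs to a different keyword group than the one used for the cell-msngr search.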

Triple co-occurrence searches in sentences

To get more specific results and to increase precision, the texts obtained by the two preceding steps (double and triple co-occurrence searches) are split into their individual sentences and searched again for triple co-occurrences. Technically, the same processes are applied as for the co-occurrence searches in whole texts: the sentences are regarded as “texts” (i.e., each sentence generates a single entry in the TEXT table; see also Figure 5.2) and indexed with all concepts of the concept subset. Finally, the searches for the concept triples of the previously defined co-occurrence groups (Definition 5.15) are performed.
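Treating sentences as texts can be sketched as follows. The source does not specify how sentences are split or how concepts are matched, so the regular-expression splitter, the case-insensitive substring matching, the identifier scheme `text_id#n` and the example abstract are all simplifying assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_entries(text_id, text):
    """Create one 'text' entry per sentence, keyed by a derived identifier."""
    return {f"{text_id}#{i}": s for i, s in enumerate(split_sentences(text))}


def index_sentence(sentence, concepts):
    """Naive concept indexing by case-insensitive substring matching."""
    lowered = sentence.lower()
    return {c for c in concepts if c.lower() in lowered}


# Invented example abstract, used only to exercise the functions.
abstract = ("Hepatocytes were cultured for two days. "
            "Insulin was later released by the hepatocyte population.")
concepts = {"hepatocyte", "insulin", "release"}

entries = sentence_entries("t1", abstract)
sentence_index = {sid: index_sentence(s, concepts) for sid, s in entries.items()}
```

The resulting `sentence_index` has the same shape as a text-level concept index, so the same triple co-occurrence search can be run on it unchanged. A splitter this simple will break on abbreviations such as "e.g.", which is one reason a dedicated sentence detector would be preferable in practice.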

Validation

The concept co-occurrences in each of the three tasks described above are validated by manual inspection of 100 randomly selected co-occurrence hits, each with one respective text. The resulting precision value is the fraction of the sample texts that indeed describe the fact assumed by the co-occurrence. Although 100 samples may be too few for a reliable estimate, the precision values help to indicate tendencies.

For instance, a text containing a cell and a msngr concept is considered correct if the text describes that this cell type is able to release this messenger. Any further conditions under which this signaling might take place are neglected. Thus, the “semantic range” of probably relevant texts is larger than if only special types of interactions were searched. This means that, in the case of messenger release, not only texts describing a “release” or “secretion” of ligands are taken into account, but also texts about the “production”, “synthesis” or “expression” of messenger substances. For our purpose it is assumed that cells which are able to produce a substance listed in the input lists as a messenger substance are probably also able to secrete this substance. Similar assumptions hold for the other components as well: for ligand-receptor bindings, texts describing any “interaction” between both substances are also evaluated positively, and receptor occurrence in a cell can be characterized by the “expression” of the receptor or simply by the statement that a cell type “contains” receptor molecules.

These assumptions are also reflected by the choice of additional keywords (see the lists for rword, bword and cword in the appendix, Section D).

Furthermore, a co-occurrence hit is rated as a false positive if one of the searched concepts does not occur in the text at all, i.e. if the concept-text match generated by the concept-based indexing is incorrect; such indexing errors are not checked separately here.

The validation measure used here is the precision, i.e. the proportion of relevant entities among all entities retrieved (as defined in Section 2.3.1). Unfortunately, a recall value (i.e. the fraction of all relevant entities that are correctly identified) cannot be measured, since it is not known a priori whether a text contains relevant information (regarding recall and precision measures see also Section 2.3.1). Note also that the same co-occurrence tuple can be selected for evaluation several times with different texts.
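The sampling and the precision estimate can be sketched as below. The sampling unit is assumed to be a (tuple, text) pair, which is consistent with the remark that one tuple may be drawn several times with different texts; the function names, the fixed seed and the example judgements are hypothetical and do not reproduce results from this work.

```python
import random


def sample_hits(hits, k, seed=0):
    """Draw up to k distinct (tuple, text id) pairs at random.

    The same concept tuple may appear several times, each time with a different text.
    """
    pairs = sorted((pair, tid) for pair, tids in hits.items() for tid in tids)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(pairs, min(k, len(pairs)))


def precision(judgements):
    """Fraction of manually inspected hits that were judged correct."""
    if not judgements:
        raise ValueError("no judgements given")
    return sum(judgements) / len(judgements)


# Hypothetical hits and hypothetical manual judgements (True = text states the fact).
hits = {
    ("erythrocyte", "insulin"): {"t1", "t2"},
    ("hepatocyte", "FSH"): {"t3"},
}
sample = sample_hits(hits, k=2)
example_precision = precision([True, False, True, True])
```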

The generated hypotheses are difficult to evaluate for the same reasons as discussed for the database reconstruction approach (Section 4.5), i.e. many of them might hold true but have not yet been investigated and reported explicitly. Hypothesis evaluation is most feasible for a small subset of cell types selected for a specific application (Section 6).