Experimental Evaluation of General Concept Inclusions Learned from Textual Data

(1)

Hybris B1: Automatic Generation of Description Logic-based Biomedical Ontologies

Daniel Borchmann, Anas Elghafari, and Yue Ma

TU Dresden

2015-06-08

(2)

Goal

Automatically construct biomedical ontologies from text:

Learn concept definitions from text

Learn terminological knowledge from text

Evaluation

Example (Terminological Knowledge)

Genes are not protein complexes, and vice versa.

Gene ⊓ ProteinComplex ⊑ ⊥

Proteins contain amino acids

ProteinDomain ⊓ ∃hasPart.⊤ ⊑ ∃hasPart.AminoAcid

Experimental Evaluation of GCIs Learned from Textual Data 2015-06-08 2 / 14

(11)

Looking Back

Previous Approaches

Exploit approach of learning SNOMED definitions from text.

Generate GCIs and check for their occurrence in the text.

GCIs from attribute exploration of certain basic concept descriptions, with DL reasoner as expert

did not finish (≥ 2 weeks)

GCIs produced mostly nonsense

Compute implications in instance-data generated from annotated text

obtained terminological knowledge

“good” quality, measured with precision and recall

only restricted form of concept descriptions (at most 2 conjuncts on the left-hand side, of pre-defined form)

Current Goal

Learn all GCIs that are valid in the text corpus

Find a way to evaluate them

(25)

Learning GCIs [Baader and Distel, 2007]

Allows learning all valid ℰℒ-GCIs from finite interpretations

[Figure: a finite interpretation whose elements are labeled with Person, Artist, and Writer and connected by child edges]

∃child.Writer ⊑ Artist, . . .

Computes a base of all such GCIs

Can also compute a base of minimal cardinality

Can include role-depth bounds [Distel, 2012; Borchmann et al., 2015]

Implementations available (prototypes)
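
What it means for a GCI to be valid in a finite interpretation can be sketched in a few lines of Python. This is only a model-checking illustration, not the base-computation algorithm of Baader and Distel; the class names `Concept` and `Interpretation`, the helper functions, and the toy data are assumptions made for this sketch, and the toy interpretation is only loosely modeled on the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """EL concept: a conjunction of concept names and existential restrictions.

    An empty conjunction stands for the top concept.
    """
    names: frozenset = frozenset()
    exists: tuple = ()  # pairs (role_name, Concept)


@dataclass
class Interpretation:
    domain: set
    concepts: dict  # concept name -> set of elements
    roles: dict     # role name -> set of (source, target) pairs


def extension(concept, interp):
    """Return all elements of the interpretation that satisfy the concept."""
    result = set(interp.domain)
    for name in concept.names:
        result &= interp.concepts.get(name, set())
    for role, filler in concept.exists:
        targets = extension(filler, interp)
        result &= {x for (x, y) in interp.roles.get(role, set()) if y in targets}
    return result


def is_valid(lhs, rhs, interp):
    """lhs ⊑ rhs is valid if the extension of lhs is contained in that of rhs."""
    return extension(lhs, interp) <= extension(rhs, interp)


# Hypothetical toy interpretation (not the exact one from the slide).
toy = Interpretation(
    domain={"a", "b", "c"},
    concepts={"Person": {"a", "b", "c"}, "Artist": {"a"}, "Writer": {"b"}},
    roles={"child": {("a", "b"), ("c", "a")}},
)
writer_child = Concept(exists=(("child", Concept(names=frozenset({"Writer"}))),))
artist = Concept(names=frozenset({"Artist"}))
person = Concept(names=frozenset({"Person"}))

print(is_valid(writer_child, artist, toy))  # checks ∃child.Writer ⊑ Artist
print(is_valid(person, artist, toy))        # checks Person ⊑ Artist
```

In this toy data the only element with a Writer child is also an Artist, so the first GCI holds, while Person ⊑ Artist fails because not every Person is an Artist.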

(33)

Application

Experimental Setup

Take annotated text from the biomedical domain (GRO)

Turn annotation into relational data

Learn valid GCIs of a particular role-depth

Evaluate

Evaluation

How many learned GCIs follow from the GRO? (certainly true positives)

How many GCIs cause inconsistency or unsatisfiable classes in the GRO? (certainly false positives)

How many GCIs of the GRO follow from the GCIs we learned? (“recall”)

“Small” Issue

Annotation uses open-world semantics

Learning uses closed-world semantics
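
The first two evaluation questions amount to sorting each learned GCI into one of three buckets. The sketch below shows that sorting step only. The two predicate arguments are hypothetical stand-ins for calls to a DL reasoner over the GRO, and here they are faked with plain lookup sets so the example runs on its own.

```python
def classify_gcis(gcis, follows_from_reference, breaks_reference):
    """Sort learned GCIs into certainly correct, certainly wrong, and inconclusive.

    follows_from_reference(gci): True if the reference ontology entails the GCI.
    breaks_reference(gci): True if adding the GCI causes inconsistency or
    unsatisfiable classes in the reference ontology.
    """
    buckets = {"certainly_correct": [], "certainly_wrong": [], "inconclusive": []}
    for gci in gcis:
        if follows_from_reference(gci):
            buckets["certainly_correct"].append(gci)
        elif breaks_reference(gci):
            buckets["certainly_wrong"].append(gci)
        else:
            buckets["inconclusive"].append(gci)
    return buckets


# Stand-ins for reasoner calls (assumption: a real setup would query a DL reasoner).
entailed = {"∃encodes.⊤ ⊓ Protein ⊑ Gene"}
harmful = {"CellComponent ⊓ Nucleus ⊑ ⊥"}

learned = [
    "∃encodes.⊤ ⊓ Protein ⊑ Gene",
    "CellComponent ⊓ Nucleus ⊑ ⊥",
    "Cell ⊓ Virus ⊑ ⊥",
]
result = classify_gcis(learned, entailed.__contains__, harmful.__contains__)
for bucket, members in result.items():
    print(bucket, members)
```

GCIs that land in neither of the first two buckets stay inconclusive: the reference ontology neither confirms nor refutes them.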

(46)

The Data-Set

Gene Regulation Ontology task at BioNLP Shared Task 2013 (http://2013.bionlp-st.org)

200 manually annotated PubMed abstracts on gene regulation processes

Annotations from the Gene Regulation Ontology (GRO)

Entities (Cell, Protein, Tissue, . . . )

Events (Mutation, Localization, Experimental Intervention, . . . )

Relations (encodes, locatedIn, fromSpecies, . . . )

Example (Entities and Events)

Activin addition strongly promotes an interaction between these two proteins.

Entity and event annotations, in sentence order: Protein, Activation, ProteinProteinInteraction, Protein

Relation annotations: hasAgent, hasPatient, hasPatient
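
Turning such annotations into relational data could look roughly like the following sketch. The mention identifiers (`T1`, `E1`, ...) and the endpoints of the three relations are assumptions made for illustration, since the extracted slide does not show which mentions each arc connects. The function name `annotations_to_interpretation` is likewise hypothetical.

```python
from collections import defaultdict


def annotations_to_interpretation(mentions, relations):
    """Build a finite interpretation from annotated mentions and relations.

    mentions: iterable of (mention_id, concept_name)
    relations: iterable of (role_name, source_id, target_id)
    """
    concepts = defaultdict(set)
    roles = defaultdict(set)
    for mention_id, concept_name in mentions:
        concepts[concept_name].add(mention_id)
    for role_name, source_id, target_id in relations:
        roles[role_name].add((source_id, target_id))
    domain = {mention_id for mention_id, _ in mentions}
    return domain, dict(concepts), dict(roles)


# Annotations of the example sentence; ids and arc endpoints are assumed.
mentions = [
    ("T1", "Protein"),                    # "Activin"
    ("E1", "Activation"),                 # "promotes"
    ("E2", "ProteinProteinInteraction"),  # "interaction"
    ("T2", "Protein"),                    # "proteins"
]
relations = [
    ("hasAgent", "E1", "T1"),
    ("hasPatient", "E1", "E2"),
    ("hasPatient", "E2", "T2"),
]

domain, concepts, roles = annotations_to_interpretation(mentions, relations)
print(sorted(domain))
print(concepts)
print(roles)
```

Every annotated mention becomes one element of the interpretation, so the 200 abstracts together yield one large relational structure.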

(58)

Evaluation

Experiment

considered only the 30 most frequent concept-names (reason: performance)

Resulting interpretation has 7399 elements, 30 concept-names, and 7 role-names

role-depth bound 1

Results

1552 GCIs extracted

GRO with these GCIs is still consistent

but has 321 unsatisfiable classes (out of 507)

49 GCIs (each on its own) cause unsatisfiable classes (≈ 3.2%)

Removal of 56 GCIs results in no unsatisfiable classes (≈ 3.6%)

319 are entailed by the GRO (≈ 20.6%)

Recall not yet available
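
The percentages are shares of the 1552 extracted GCIs. A short check of the arithmetic, with dictionary keys chosen here only as labels:

```python
total_gcis = 1552

counts = {
    "cause unsatisfiable classes on their own": 49,
    "removed to restore satisfiability": 56,
    "entailed by the GRO": 319,
}

# Share of each group among all extracted GCIs, in percent, one decimal place.
shares = {label: round(100 * count / total_gcis, 1) for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```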

(71)

Certainly correct GCIs

Example

∃encodes.⊤ ⊓ ∃hasPart.⊤ ⊓ Chromosome ⊑ ⊥

∃hasPart.⊤ ⊓ Cell ⊑ ∃hasPart.CellComponent

∃encodes.⊤ ⊓ Protein ⊑ Gene

∃hasPart.⊤ ⊓ ∃locatedIn.⊤ ⊓ Gene ⊓ Protein ⊑ ∃fromSpecies.Eukaryote

∃encodes.⊤ ⊓ ∃fromSpecies.Eukaryote ⊓ ∃hasPart.Peptide ⊓ ∃hasPart.ProteinDomain ⊓ Gene ⊓ Protein ⊑ ∃encodes.MessengerRNA

(78)

Inconclusive GCIs

Example

Cell ⊓ Eukaryote ⊑ ⊥

∃encodes.Eukaryote ⊑ ⊥

Cell ⊓ Virus ⊑ ⊥

Eukaryote ⊓ SignalingPathway ⊑ ⊥

Observation

Two reasons (at least) for inconclusive GCIs

simply wrong

GRO incomplete

(86)

Unsatisfiable Classes

Question: Where do they come from?

Example

CellComponent ⊓ Nucleus ⊑ ⊥

Data-set did not contain any occurrence of an individual that is both CellComponent and Nucleus

In the GRO, CellComponent is a super-class of Nucleus

So, the annotation is incomplete

Conclusion

unsatisfiable classes can arise through the closed-world interpretation of the open-world data-set.

all disjointness axioms containing only concept-names are caused by this
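
The effect can be reproduced on a two-element toy data-set. The sketch below reads off every pair of concept-names that share no instance, which is exactly what a closed-world learner would report as a disjointness axiom. The second helper, which propagates instances to super-classes, is not part of the approach described on these slides. It is added here only to show that the spurious axiom disappears once the annotation is completed. All names and data are hypothetical.

```python
from itertools import combinations


def disjoint_pairs(concepts):
    """Pairs of concept names with no common instance in the data."""
    return {
        (a, b)
        for a, b in combinations(sorted(concepts), 2)
        if concepts[a].isdisjoint(concepts[b])
    }


def close_under_superclasses(concepts, superclasses):
    """Add each instance to the listed super-classes of its concept.

    Assumes `superclasses` already lists all (also indirect) super-classes.
    """
    closed = {name: set(members) for name, members in concepts.items()}
    for name, members in concepts.items():
        for parent in superclasses.get(name, ()):
            closed.setdefault(parent, set()).update(members)
    return closed


# Toy annotation: the nucleus mention x2 is tagged only with its most specific class.
annotated = {"CellComponent": {"x1"}, "Nucleus": {"x2"}}
gro_superclasses = {"Nucleus": ["CellComponent"]}

print(disjoint_pairs(annotated))
completed = close_under_superclasses(annotated, gro_superclasses)
print(disjoint_pairs(completed))
```

On the raw annotation the pair (CellComponent, Nucleus) is reported as disjoint. After completion, x2 also counts as a CellComponent and no disjoint pair remains.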

(96)

Unsatisfiable Classes

Example

∃locatedIn.Cell ⊓ ∃locatedIn.Nucleus ⊑ Protein

Causes the class NuclearExportOfmRNA to become unsatisfiable

GRO entails

NuclearExportOfmRNA ⊓ Protein ⊑ ⊥

NuclearExportOfmRNA ⊑ ∃locatedIn.Nucleus

NuclearExportOfmRNA ⊑ ProteinTargeting ⊑ ∃locatedIn.Cell

But data-set does not contain any reference to NuclearExportOfmRNA

Approach could not learn this counterexample

Idea

Remove concept-names not occurring in the data-set before evaluation?
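
One possible reading of this idea is to restrict the satisfiability check to classes that the data-set actually mentions. The sketch below does only that filtering step; the function name and the sample sets are hypothetical, and removing names from the ontology itself would need more care than this.

```python
def restrict_to_observed(reference_classes, observed_names):
    """Keep only the reference classes that occur in the data-set."""
    return {name for name in reference_classes if name in observed_names}


# Hypothetical sample: class names from the reference ontology and from the data.
reference_classes = {"NuclearExportOfmRNA", "Protein", "Cell", "Nucleus"}
observed_names = {"Protein", "Cell", "Nucleus"}

to_evaluate = restrict_to_observed(reference_classes, observed_names)
print(sorted(to_evaluate))
```

With this restriction, NuclearExportOfmRNA would no longer be counted, since the learner never had a chance to see a counterexample involving it.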
