Integrating Ontologies
Revision as of 15:50, 11 November 2009
Participants and interests
- Jim Case: reference models as a basis for ontology
- Julie Thompson and Brandon Chisham: CDAO
- Rutger Vos: Treebase
- Peter Midford: Phenoscape
- John Wieczorek and Stan Blum: Darwin Core, TDWG
- Rosemary Shrestha: GCP crop ontology
Introduction
This group results from the merging of three pitches (from Julie, John and Rutger), all identifying the problem of developing, maintaining and integrating ontologies. The participants (see above) come from different but overlapping problem domains, so the group started with some participants formally presenting their individual interests (CDAO, TreeBASE/RDF, Phenoscape, TDWG). The pitchers agree that the goal of the merger should be to identify recommendations and best practices for managing the proliferation of ontologies over recent years, in particular from the perspective of promoting interoperability.
The participants note that ontology alignment is a frequently advocated approach for promoting interoperability between their projects, but actual examples of this practice are scarce. Some solutions exist: for example, automated tools (e.g. LOOM, BGee), tables that align NCBI OBO-format ontologies provided by BioPortal, and extensive research into ontology matching. However, alignment sometimes needs to be done by hand, and the participants want this group to inform how this is best done.
Objectives
Determine best practices for the building, maintenance and integration of ontologies in a community with rapidly emerging and changing requirements. More specifically, should the community concentrate on a monolithic ontology for its domain, or is it ultimately more fruitful to create smaller modular reference ontologies?
Approach
Begin with a use case that demonstrates integration by bridging between two existing ontologies to see how the exercise informs the building and maintenance of technologies.
Use cases
Case I: Find the most recent common ancestor of all CDAO OTUs with a given Phenoscape state
- Create an adapter ontology importing CDAO and adding ShapeCharacter as a subclass of CDAO Character and a new ShapeDatum as a subclass of CharacterStateDatum.
- Import the Phenoscape ontologies (TAO, TTO, PATO) into the adapter ontology in order to build relations between Phenoscape character states and the CDAO character matrix.
- Declare the newly created ShapeCharacter and ShapeDatum classes equivalent to the corresponding PATO classes.
- Get instances: 1. get a character state matrix from the Phenoscape KB, 2. build a dummy NeXML tree and convert it to CDAO RDF.
The initial attempt at the exercise failed to demonstrate the use case because the states in Phenoscape are embedded in XML literals rather than expressed as references to external resources, so the triples could not be built. We need a workaround to transform the literals, at least for a subset of the data, so that we can go ahead with the rest of the use case. In addition to parsing out the XML literal, we will need to add stable namespace references to TAO, PATO, and BPSO: the data refer to terms in these ontologies, but the references cannot be resolved without the (known) stable URLs that define the prefixes.
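The workaround described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the layout of the XML literal (elements carrying an `about` attribute with a prefixed term ID), the term IDs, the subject name `ex:state1` and the `example.org` namespace URLs are all hypothetical placeholders, not the actual Phenoscape serialization or the real stable URLs.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces (assumption): substitute the stable URLs that the
# ontologies actually publish for these prefixes.
PREFIXES = {
    "TAO": "http://example.org/obo/TAO#",
    "PATO": "http://example.org/obo/PATO#",
    "BPSO": "http://example.org/obo/BPSO#",
}

# Hypothetical XML literal with made-up term IDs; the real literal layout may differ.
SAMPLE = (
    '<phenotype>'
    '<entity about="TAO:0000001"/>'
    '<quality about="PATO:0000002"/>'
    '</phenotype>'
)


def expand_curie(curie, prefixes=PREFIXES):
    """Turn a prefixed ID such as 'TAO:0000001' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in prefixes or not local:
        raise ValueError(f"cannot resolve term reference {curie!r}")
    return f"{prefixes[prefix]}{prefix}_{local}"


def literal_to_triples(subject, xml_literal):
    """Replace an opaque XML literal by (subject, predicate, object-URI) triples.

    The element name is used as a stand-in predicate; a real conversion would
    map it to a property of the adapter ontology.
    """
    triples = []
    for elem in ET.fromstring(xml_literal).iter():
        curie = elem.get("about")
        if curie:
            triples.append((subject, elem.tag, expand_curie(curie)))
    return triples


for triple in literal_to_triples("ex:state1", SAMPLE):
    print(triple)
```

Raising an error for an unknown prefix, rather than passing the ID through unchanged, makes unresolved references visible at conversion time instead of at query time.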
Case II: One taxon in Darwin Core, one taxon in TTO - what is their most recent common ancestor?
- Get Darwin Core records from VertNet, get IDs and create a new document. Convert to RDF.
- Import Darwin Core into the adaptor ontology.
- Map CDAO TU has_external_reference to TTO ID and Darwin Core acceptedNameUsageID.
- Perform an SQL query that understands the CDAO-DwC mapping.
- Perform reasoning in Pellet to find the MRCA of the two taxa.
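To make the target of the last step concrete, here is a minimal Python sketch of what "most recent common ancestor" means on a rooted tree. It is a plain parent-pointer traversal, not the Pellet-based reasoning used in the exercise, and the tree and node names are made up for illustration.

```python
# Toy rooted tree as child -> parent links (hypothetical node names).
PARENT = {
    "taxonA": "n1",
    "taxonB": "n1",
    "taxonC": "n2",
    "n1": "n2",
    "n2": "root",
}


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a, b, parent):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a, parent))
    # Walk upward from b; the first node also above a is the MRCA.
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    return None  # the two nodes are not in the same tree


print(mrca("taxonA", "taxonB", PARENT))
print(mrca("taxonA", "taxonC", PARENT))
```

In the use case itself the two taxa come from different sources (Darwin Core and TTO), so they first have to be mapped onto nodes of the same tree; the sketch assumes that mapping has already been done.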
Case III: Integrating a prototype 'TDWG domain' ontology with behavioral traits
Specific use case: Find countries in which social taxa occur given taxa on a given tree and sociality traits in a character matrix expressed with CDAO. Ultimately this might allow such questions as: in what environments do we find social tuco-tucos? Where did they originate?
- Get a tree for a set of tuco-tucos from Treebase http://treebasedb-dev.nescent.org:6666/treebase-web/phylows/tree/TB2:Tr4387
- Modify the tree to refer to actionable globally unique identifiers in place of the string-literal taxon names in the tree (we used ITIS, NCBI and UniProt, and made up dummy URIs for taxa not in the taxon authorities)
- Create a character matrix consisting of 2 characters: sociality (social, solitary) and habitat (fossorial, subterranean)
- Use CDAO to represent the tree and the character matrix (character state and tree input file)
- Create a prototype TDWG-ontology-compliant dataset in Protégé, with classes Country and Occurrence, where Occurrence has the properties acceptedNameUsageID and country (range Country), and populate it with Darwin Core Occurrence data for tuco-tucos
SPARQL query for finding social tucos:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX tdwg09: <http://www.tdwg.org/tdwg09.rdf#>
PREFIX cdao: <http://www.evolutionaryontology.org/cdao.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT $name $country
WHERE {
  $datum cdao:belongs_to_TU $tu.
  $datum cdao:has_Standard_State tdwg09:s2.
  $name cdao:represents_TU $tu.
  $name tdwg09:country $country.
}
The above query identifies the following social tucos based on the example data files.
Query Results (2 answers):
name | country
=============================
Spalacopus_cyanus | CL
Ctenomys_sociabilis | AR
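For readers unfamiliar with SPARQL, the join that the query performs can be mimicked in a few lines of Python. This is only a sketch: the triples below are toy data made up to mirror the result table, not the workshop's example files, and the identifiers `datum1`, `tu1`, `hypothetical_taxon` and so on are hypothetical.

```python
# Toy triples (assumption): a tiny stand-in for the CDAO and occurrence data.
TRIPLES = {
    ("datum1", "cdao:belongs_to_TU", "tu1"),
    ("datum1", "cdao:has_Standard_State", "tdwg09:s2"),  # state matched by the query
    ("Spalacopus_cyanus", "cdao:represents_TU", "tu1"),
    ("Spalacopus_cyanus", "tdwg09:country", "CL"),
    ("datum2", "cdao:belongs_to_TU", "tu2"),
    ("datum2", "cdao:has_Standard_State", "tdwg09:s2"),
    ("Ctenomys_sociabilis", "cdao:represents_TU", "tu2"),
    ("Ctenomys_sociabilis", "tdwg09:country", "AR"),
    ("datum3", "cdao:belongs_to_TU", "tu3"),
    ("datum3", "cdao:has_Standard_State", "tdwg09:s1"),  # a different state
    ("hypothetical_taxon", "cdao:represents_TU", "tu3"),
    ("hypothetical_taxon", "tdwg09:country", "AR"),
}


def pairs(triples, predicate):
    """Return all (subject, object) pairs linked by the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


def names_with_state(triples, state="tdwg09:s2"):
    """Follow datum -> TU -> name -> country, like the four triple patterns."""
    data = {d for d, st in pairs(triples, "cdao:has_Standard_State") if st == state}
    tus = {tu for d, tu in pairs(triples, "cdao:belongs_to_TU") if d in data}
    names = {n for n, tu in pairs(triples, "cdao:represents_TU") if tu in tus}
    # Sorted distinct rows; SPARQL itself would return one row per solution.
    return sorted((n, c) for n, c in pairs(triples, "tdwg09:country") if n in names)


for name, country in names_with_state(TRIPLES):
    print(f"{name} | {country}")
```

The sketch only works because every step of the chain is an explicit resource-to-resource link, which is the same property the query relies on.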
Conclusions
For ontology integration: our work has led us to conclude that:
- instance data should be fully ontologized. For example, our Phenoscape use case could not be completed because Phenoscape uses XML literals to express trait post-composition. These traits were consequently inaccessible for the purpose of data integration.
- ontologies should be designed as reusable modules rather than monolithic artifacts. Aligning CDAO with DarwinCore was relatively easy because DarwinCore doesn't have a lot of structure (which is a good thing from the perspective of re-use), although DarwinCore still needs to be ontologized.
- data integration is most easily achieved by developing small adaptor ontologies rather than by merging large (and potentially well-established and "stable") artifacts. Merging large ontologies has a greater potential to produce irreconcilable incongruities. Adapting smaller ontologies requires immediate reconciliation, but insulates the practitioner from irrelevant inconsistencies. Implementations are likely to be more efficient and scalable. Nevertheless, if two domains have significant overlap, it is probably better to merge them, reconcile the inconsistencies and thereby decrease the overall noise in subsequent use of the domain.
- URIs (URLs) for terms should be carefully constructed, predictable and stabilized, perhaps using PURLs. For example, several queries failed to produce expected results due to omission of www prefixes or # suffixes in URLs.
- several tools (Homonto developed by BGee, LOOM) and a lot of research (Ontology Matching) have already gone into the problem of ontology alignment. However, expert knowledge for manual alignment is often still necessary.
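The URI pitfall noted above (queries failing because of an omitted www prefix or # suffix) can be caught mechanically before running a query. The following Python sketch compares the namespaces used in a query or data file against the namespaces that are known to be correct and reports near misses. The function names are hypothetical, and the comparison rule (ignore a leading "www." and a trailing "#" or "/") is an assumption about what counts as a near miss.

```python
from urllib.parse import urlsplit


def namespace_key(uri):
    """Reduce a namespace URI to a key that ignores 'www.' and a trailing '#' or '/'."""
    parts = urlsplit(uri.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # A bare trailing '#' parses as an empty fragment, so it does not affect the key.
    return (host, parts.path.rstrip("/"), parts.fragment)


def find_near_misses(expected, used):
    """Return (used, expected) pairs that differ only by 'www.' or a trailing '#' or '/'."""
    by_key = {namespace_key(ns): ns for ns in expected}
    problems = []
    for ns in used:
        match = by_key.get(namespace_key(ns))
        if match is not None and match != ns:
            problems.append((ns, match))
    return problems


# Namespaces as declared in the SPARQL query on this page.
EXPECTED = [
    "http://www.evolutionaryontology.org/cdao.owl#",
    "http://www.tdwg.org/tdwg09.rdf#",
]

# Variants of the kind that silently return no results.
USED = [
    "http://evolutionaryontology.org/cdao.owl#",  # missing www
    "http://www.tdwg.org/tdwg09.rdf",             # missing #
    "http://www.w3.org/2002/07/owl#",             # unrelated, not flagged
]

for wrong, right in find_near_misses(EXPECTED, USED):
    print(f"{wrong} should probably be {right}")
```

A check like this treats the symptom only; stable, predictable term URIs (for example via PURLs) remove the cause.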
For ontology management: Review of vocabulary management processes has led us to conclude that:
- Darwin Core Namespace Policy is a good example of a social process (used within TDWG) for term management modeled on Dublin Core.
- Darwin Core actually fits well with the term management mechanisms in the OBO community, which include request submission, trackers, and feedback at the level of terms. Requests go out for (archived) discussion until resolution is reached, after which the project curator(s) make the change to the ontology and inform the subscribed stakeholders.
- The OBO Foundry is a reasonable model for the creation of new well-scoped ontologies that need to be reconciled with other ontologies within the appropriate sphere of knowledge. TDWG could manage ontologies for biodiversity through similar processes.
Deliverables
The deliverables listed below are all available from public repositories (in the case of reusable code) or from this wiki (for one-off example files).
- An XSLT stylesheet that transforms NeXML to CDAO, updated to handle character state matrices. The output it produces validates against the W3C RDF validator. This stylesheet is in current use for the creation of TreeBASE RDF/XML output.
- An example NeXML instance document of semantically annotated OTUs. This file is part of a growing collection of canonical NeXML example files.
- A CDAO RDF/XML translation of the NeXML instance document. This file is part of a growing collection of canonical NeXML example files.
- An adaptor ontology that aligns CDAO with DarwinCore.
- An ontology for occurrence instance data. This file is a one-off example.
Follow-up
- Build a web-based service to allow a user to choose two ontologies and align them in order to produce an adapter ontology as an output.