Tree Annotation


Synopsis: Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".

Synopsis

We (1) identified a set of 10 large trees useful as phylotastic source trees, (2) created free-text annotations (metadata) for citations, sources and methods, (3) encoded the data and metadata as RDF using CDAO and a new ontology, (4) loaded the encoded information into a triplestore, and (5) demonstrated logical querying based on data and metadata attributes. During the hackathon, group members spent their time developing and revising a strategy, interpreting source materials, developing language support, encoding annotations, working out technical bugs in the workflow, and addressing emerging challenges. The tangible outcomes of this exercise include:

  • a set of 10 source trees (720 to 250,000 species) with low-res metadata (see AnnotatedPhylotasticSourceTrees)
  • a demonstration of semantic annotation, data-basing, and querying (see TreestoreMetadataQueryDemonstration)
    • a workflow plan for encoding tree data and metadata and loading them into a treestore
    • a treestore instance populated with some of these data and metadata
  • advances in Minimum Information About a Phylogenetic Analysis (see AdvancingMIAPA)
    • a new "MIAPA" ontology that leverages several existing ontologies
    • recommendations on the draft checklist and input form
  • optionally
    • a screencast
    • a draft GSOC proposal

Background, Motivation, and Aims

Metadata annotations represent an essential part of the design of phylotastic systems, for two reasons. First, while we do not have a robust and detailed understanding of how users will make use of phylotastic systems, we assume that they will wish to identify trees based on sources and methods. For instance, a user may restrict a phylotastic query so as to include only trees inferred by Maximum Likelihood, or to exclude grafted trees, or to include only the tree associated with the publication by Bininda-Emonds, 2007.

Second, one of the design criteria of phylotastic systems is to provide credible results, which in the scientific world means providing a description of provenance suitable for a scientific publication. To be credible, a tree generated by a phylotastic system must include a description of how it was derived, which includes information on source trees as well as a description of any subsequent manipulations. Yet, metadata play little or no role in current phylotastic component implementations.

Some guidance may be obtained from prior art relating to databases and to metadata. Two databases for trees exist already, TreeBASE and Dryad. Dryad does not provide any explicit support for tree-specific annotations. The TreeBASE input interface allows citation data, creates links to species ident, and links a matrix to a tree with an "analysis" link that may indicate a particular software program. Though useful, the TreeBASE model falls far short of the recommendations for a "minimum information" standard for phylogeny metadata known as MIAPA, or "Minimum Information About a Phylogenetic Analysis". For instance, the draft MIAPA checklist from TDWG2011 calls for an explicit indication of whether a tree is a gene or species tree, whether it is rooted, what software (and version number) was used to derive it, and so on.

The TreeAnnotation team of hackathon 2 (Enrico, Hilmar, Joachim, Arlin, Ramona and 0.5 of Andrea) decided to conduct an annotation exercise that would cover the flow of information from initial annotation of trees, to querying of treestores (not including the annotation of subsequent phylotastic manipulations such as pruning or scaling).

The motivation for this exercise relates partly to the aspirational nature of MIAPA, which was proposed many years ago but has never evolved into a clear standard supported with convenient technology. Those of us who have been involved in MIAPA-related efforts sensed a need for practical experiences in real-world uses of annotations. While the scope of Phylotastic is narrower than that of MIAPA, in the sense of covering only species trees, and mainly large ones, the challenge of supporting useful metadata queries in a treestore represents a critical test of the relevance of the MIAPA checklist and the technology for encoding and managing semantic annotations.

We also hoped to enrich current phylotastic implementations by providing metadata for a specific set of useful trees. Hackathon participants have been using a handful of trees (APGIII, Bininda-Emonds, etc.) without any metadata on citations or methods.

Thus, our approach has 3 inter-connected aims:

  • to create a set of 10 usefully annotated source trees
  • to demonstrate the feasibility of metadata-based querying in a treestore
  • to leverage a practical annotation exercise to advance the MIAPA project

Approach

Our approach consisted of the following steps:

  1. identify 10 useful source trees with available publications
  2. generate free-text annotations
  3. encode citations and annotations in computable form
  4. load the citation, annotations, and trees into a treestore
  5. demonstrate querying based on metadata

In particular, we chose to gather metadata corresponding to the MIAPA draft checklist, to encode it as RDF using a new ontology that imports several other ontologies, and to load the results into Ben Morris's Virtuoso-based treestore implementation. On Day 3, we decided to begin by focusing on citations, which are not in the MIAPA checklist, with the plan to carry citation data through steps 3 to 5.

workflow, in more detail

  1. Identify 10 trees for use as phylotastic source trees. Most of the trees were identified pre-hackathon by Arlin. One of the trees was replaced on day 2 due to lack of metadata (the unpublished fish tree of Westneat & Lundberg).
  2. Annotate them in free-text form. This was done on Day 2 as a team effort by Ramona, Enrico, Arlin and Andrea.
    • we created a web form in Google Docs for input of annotations, based on the MIAPA draft checklist from the TDWG 2011 workshop
    • the spreadsheet has pull-down menus, plus options for free-text entries under "other"
  3. Transform annotations into formal-language statements in RDF. This was done on Days 3 and 4 by Ramona, Arlin, Enrico and Hilmar.
    • Literature Citations
      • after some discussion, we decided to use BIBO (not dc or prism alone)
      • we spent 6 to 8 person-hours trying to do this interactively in Protege before finding an automated pathway of discovery and conversion via PubMed --> EndNote --(bibtex export)--> Zotero --> bibo export (bibliontology RDF).
      • here is the File:10trees bibliontology.rdf
    • Hilmar developed an annotation ontology that incorporates CDAO, OBI, PROV and other ontologies
  4. Load trees into TreeStore. On days 2 to 3, Joachim worked on the technology for getting our encodings into a triplestore. Part of the challenge was deciding on a URL scheme.
  5. Execute queries to demonstrate success. On Day 3 we had success in querying for citation metadata. On Day 5 ? ?  ?
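
Steps 3 to 5 can be illustrated with a minimal sketch in plain Python. This is not the Virtuoso treestore and not SPARQL: a list of tuples stands in for the triplestore, and `match` is a hypothetical helper that mimics a single triple pattern. The `ex:` prefix, the tree and publication identifiers, and the property names `ex:inferredBy` and `ex:citation` are placeholders invented for illustration, not terms from CDAO, BIBO, or the MIAPA ontology.

```python
# Toy stand-in for a triplestore: each fact is a (subject, predicate, object) tuple.
# All identifiers below are hypothetical placeholders, not real treestore URIs.
TRIPLES = [
    ("ex:tree1", "rdf:type", "ex:Tree"),
    ("ex:tree1", "ex:inferredBy", "ex:MaximumLikelihood"),
    ("ex:tree1", "ex:citation", "ex:pub1"),
    ("ex:tree2", "rdf:type", "ex:Tree"),
    ("ex:tree2", "ex:inferredBy", "ex:Grafting"),
    ("ex:tree2", "ex:citation", "ex:pub2"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def trees_inferred_by(triples, method):
    """Find trees whose metadata names the given inference method."""
    return sorted(s for s, _, _ in match(triples, p="ex:inferredBy", o=method))


def citations_for(triples, tree):
    """Look up the citation nodes attached to one tree."""
    return [o for _, _, o in match(triples, s=tree, p="ex:citation")]


# Metadata-based query: which trees were inferred by Maximum Likelihood,
# and which citation goes with each of them?
ml_trees = trees_inferred_by(TRIPLES, "ex:MaximumLikelihood")
ml_citations = {tree: citations_for(TRIPLES, tree) for tree in ml_trees}
print(ml_trees)
print(ml_citations)
```

In a real treestore the same question would be posed as a SPARQL query against the loaded RDF; the sketch only shows the shape of the lookup, where a method constraint selects trees and a second pattern retrieves their citations.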

Model for semantic encoding

additional suggestions for MIAPA ontology

From annotation session on afternoon of 1/31.

  • It would be good to generate an instance of useMaximumLikelihood ("Maximum Likelihood algorithm") in MIAPA, so we don't have to create one for each annotation. Filed as Issue #8
  • Alternatively, maybe make classes of software (like PhyML or RAxML) implement the ML algorithm, rather than having to assert it for each instance we create. Some software can use multiple algorithms, so we can't do this in every case.
    • Note that in OWL, classes cannot be asserted to have property values; only instances can. We can put property restrictions with existential quantification on classes, and an OWL reasoner could then infer that an instance must have at least one such property association (and thus a DL query should in principle return the instance). However, this inference would not be available in a plain RDF triple store, so we could not actually query for these things in SPARQL.
    • Note also that there can be multiple swo:implements assertions for a software instance, so multiple algorithms can easily be asserted. However, this wouldn't then also say which of those implemented algorithms was the one utilized for the generation of the tree or alignment. The idea is that this would be evident from the miapa:'Parameter specification'.
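
The point about class-level restrictions can be made concrete with a small sketch. A plain triple lookup sees only asserted triples, so knowledge held at the class level is invisible unless it is first materialized onto instances. The `materialize` helper below is a hypothetical forward-chaining step, not an OWL reasoner, and the `ex:` names are placeholders; only `swo:implements` and `rdf:type` come from the discussion above. Note the approximation: an existential restriction entails only that some ML algorithm is implemented, whereas the sketch links to one shared instance, in the spirit of the first suggestion in this list.

```python
# Asserted instance data: what a plain RDF triple store would hold.
# All ex: identifiers are hypothetical placeholders.
ASSERTED = {
    ("ex:phyml_software_1", "rdf:type", "ex:PhyML"),
    ("ex:multi_tool_1", "rdf:type", "ex:MultiAlgorithmTool"),
    ("ex:multi_tool_1", "swo:implements", "ex:maximum_likelihood"),
    ("ex:multi_tool_1", "swo:implements", "ex:neighbor_joining"),
}

# Class-level knowledge kept in a separate, hypothetical lookup table.
# Approximation: points at one shared algorithm instance rather than
# expressing "some instance of the ML algorithm class".
CLASS_IMPLEMENTS = {"ex:PhyML": "ex:maximum_likelihood"}


def implementers(triples, algorithm):
    """Plain triple-pattern query: who is asserted to implement this algorithm?"""
    return sorted(
        s for s, p, o in triples
        if p == "swo:implements" and o == algorithm
    )


def materialize(triples, class_implements):
    """Copy class-level 'implements' knowledge onto each typed instance."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "rdf:type" and o in class_implements:
            inferred.add((s, "swo:implements", class_implements[o]))
    return inferred


before = implementers(ASSERTED, "ex:maximum_likelihood")
after = implementers(
    materialize(ASSERTED, CLASS_IMPLEMENTS), "ex:maximum_likelihood"
)
print(before)  # the PhyML instance is missing: nothing was asserted for it
print(after)   # it appears once the class-level fact is materialized
```

Even after materialization, the lookup cannot tell which of the two algorithms `ex:multi_tool_1` actually used for a given tree; as noted above, that would have to come from the parameter specification.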

more annotations

miapa ontology

  • topology
    • gene tree vs species tree: Network:Tree:'Gene tree' or SpeciesTree
    • rooted: Network:Tree:RootedTree or UnrootedTree
    • 'Consensus tree'
  • otus
    • toTaxon, object property, points to taxon concept, can be URI from NCBI or other authority
    • derived_from specimen
    • location imported from geo
  • branch properties
    • branch lengths:
      • data property edge length
      • object property has_Annotation edge_length
    • branch support: data property has support value, either bootstrap or posterior probability
  • character matrix
  • alignment method
    • name of software, version
    • parameters
    • manual correction
  • tree inference method
    • name of software, version: tree wasGeneratedBy (activity=) software procedure; software procedure wasAssociatedWith instance of software agent named "RAxML"
    • parameters: (activity) used instance of a parameter specification (which is a kind of plan)
    • character weights
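
The provenance pattern under "tree inference method" can be sketched as a short chain of triples. `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, and `prov:used` are PROV-O properties; the `ex:` identifiers, the use of `rdfs:label` for the software name, and the helpers `software_for` and `parameters_for` are assumptions made for illustration and may not match the actual MIAPA ontology.

```python
# Hypothetical provenance triples for a single tree.
PROV_TRIPLES = [
    ("ex:tree1", "prov:wasGeneratedBy", "ex:inference1"),
    ("ex:inference1", "prov:wasAssociatedWith", "ex:software1"),
    ("ex:inference1", "prov:used", "ex:paramspec1"),
    ("ex:software1", "rdfs:label", "RAxML"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def software_for(triples, tree):
    """Walk tree -> generating activity -> software agent; return agent labels."""
    labels = []
    for activity in objects(triples, tree, "prov:wasGeneratedBy"):
        for agent in objects(triples, activity, "prov:wasAssociatedWith"):
            labels.extend(objects(triples, agent, "rdfs:label"))
    return labels


def parameters_for(triples, tree):
    """Walk tree -> generating activity -> parameter specifications it used."""
    specs = []
    for activity in objects(triples, tree, "prov:wasGeneratedBy"):
        specs.extend(objects(triples, activity, "prov:used"))
    return specs


print(software_for(PROV_TRIPLES, "ex:tree1"))
print(parameters_for(PROV_TRIPLES, "ex:tree1"))
```

Routing both the software agent and the parameter specification through the activity node keeps "which program" and "with which settings" tied to one inference run, so a tree produced by several runs would not mix them up.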