Tree Annotation
Synopsis
Annotate a small set of large trees used as sources of phylogenetic knowledge in "Phylotastic", an automated delivery system for tree-of-life knowledge.
Quick links
Reports
- AnnotatedPhylotasticSourceTrees - report on the set of source trees, focusing on the types of metadata available, and how they might be used in phylotastic systems
- TreestoreMetadataQueryDemonstration - report on the model of semantic encoding, the technology for translation, the treestore technology, and the implications of this for supporting phylotastic querying
- AdvancingMIAPA report page - report on the adequacy of the MIAPA checklist, recommendations for revisions, ontology development, challenges of semantic encoding, and (overlapping with the report above) the model of semantic encoding
Other tangible outcomes
- new MIAPA ontology
- GSOC project proposal
Key resources
- checklist from TDWG 2011
Overview
Metadata annotations represent an essential part of the design of phylotastic systems, enabling users to find trees based on sources and methods, and to generate a credible report of provenance for phylotastically generated trees. Yet, metadata play no role in current phylotastic component implementations. The TreeAnnotation team of hackathon 2 (Enrico, Hilmar, Joachim, Arlin, Ramona and 0.5 of Andrea) set out to address this deficiency. We developed an approach with 3 inter-connected goals:
- create a set of 10 usefully annotated source trees
- demonstrate metadata-based querying in a treestore
- leverage this exercise to advance the MIAPA project
Our approach consisted of the following steps:
- identify 10 useful source trees with available publications
- generate free-text annotations
- encode citations and annotations in computable form
- load the citation, annotations, and trees into a treestore
- demonstrate querying based on metadata
In particular, we chose to gather metadata corresponding to the MIAPA draft checklist, to encode it as RDF using a new ontology that imports several other ontologies, and to load the results into Ben Morris's Virtuoso-based treestore implementation.
During the hackathon, group members spent their time developing and revising a strategy, interpreting source materials, developing language support, encoding annotations, implementing tools, and addressing emerging challenges.
The tangible outcomes of the group relate to phylotastic source trees (a set of trees with metadata); software tools for processing, storage and querying; an ontology to support MIAPA annotations, along with a revised MIAPA checklist and form; and written reports on these 3 types of outputs, available on this wiki.
Detailed approach
- develop plan (day 1)
- revise as needed
- some work is done in parallel
- main workflow
- identify 10 trees for use as phylotastic source trees
- annotate them in free-text form
- create web form in Google docs for input of annotations, based on MIAPA draft checklist from TDWG 2011 workshop
- Spreadsheet has pull-down menus, plus options for free-text entries under "other"
- transform annotations into formal language statements in RDF
- encoding process is iterative with ontology editing
- Hilmar is working on language support
- Joachim is working on the technology for getting this into a triplestore
- Get URI for tree from TreeStore, add annotations to that URI in Protege
- Load trees into TreeStore
- Will need to have trees in the correct format
- execute queries to demonstrate success
Log and accomplishments
- initial plan (day 1)
- initial MIAPA checklist-based input form (day 1)
- revised input form
- plan for (temporarily) storing trees and matrices (data) separate from metadata
- annotations of 10 trees
- translation technology
- NEXUS issues, DendroPy
- Protege deals poorly with unnamed individuals
- ontology for annotation
From day 4, Media:followup_goals.jpg from white board.
citation exercise
goal: annotate trees with citation data, encode, import into treestore, demonstrate querying based on citation metadata
notes on encoding
- after some discussion, we decided to use BIBO (not dc or prism alone)
- we failed to find any pre-existing method to auto-convert EndNote (or BibTeX or Zotero) into BIBO
- so we started hand-encoding them using Protege instances
- authors
- articles
- used Data property "short title" instead of object property title
- used date of issue for publication year
- author-lists (RDF:list?)
- ultimately we ended up getting the encoded citations via PubMed --> EndNote --> BibTeX export --> Zotero --> BIBO export (bibliontology RDF).
- here is the File:10trees bibliontology.rdf
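The hand-encoding choices noted above (short title as a data property, date of issue for the publication year, an ordered author list) can be sketched in Python with only the standard library. This is an illustrative sketch, not the tool the team used: the function names, the `ex:` namespace, and the sample citation values are hypothetical placeholders, and the property choices (`bibo:shortTitle`, `dcterms:issued`, `bibo:authorList` written as an RDF collection) are assumptions based on the notes.

```python
# Hypothetical helper: emit a Turtle description of one article using BIBO terms.
PREFIXES = """@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/citations#> .
"""


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def citation_to_turtle(pub_id, short_title, year, authors):
    """Return Turtle text for one article and its ordered author list."""
    # One (hypothetical) identifier per author, numbered in list order.
    author_ids = ["ex:%s_author%d" % (pub_id, i)
                  for i, _ in enumerate(authors, 1)]
    lines = [PREFIXES]
    lines.append("ex:%s a bibo:AcademicArticle ;" % pub_id)
    lines.append("    bibo:shortTitle %s ;" % turtle_literal(short_title))
    lines.append("    dcterms:issued %s ;" % turtle_literal(str(year)))
    # Turtle collection syntax "( a b )" expands to an rdf:List, keeping order.
    lines.append("    bibo:authorList ( %s ) ." % " ".join(author_ids))
    for author_id, name in zip(author_ids, authors):
        lines.append("%s a foaf:Person ; foaf:name %s ."
                     % (author_id, turtle_literal(name)))
    return "\n".join(lines)


# Placeholder values only; not one of the 10 annotated trees.
print(citation_to_turtle("pub1", "Example short title", 2011,
                         ["A. Author", "B. Author"]))
```

The output could be saved to a file and converted or loaded like any other annotation file.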
additional suggestions for MIAPA ontology
From annotation session on afternoon of 1/31.
Add a class for parsimony under algorithm.
It would be good to generate an instance of useMaximumLikelihood ("Maximum Likelihood algorithm") in MIAPA, so we don't have to create one for each annotation.
Alternatively, maybe make classes of software (like PhyML or RAxML) implement ML algorithm, rather than having to assert it for each instance we create. Some software can use multiple algorithms, so we can't do this for each case.
- Note that in OWL, classes cannot be asserted to have property values; only instances can. We can put property restrictions with existential quantification on classes, and an OWL reasoner could then infer that an instance must have at least one such property association (and thus a DL query should in principle return the instance). However, a plain RDF triple store does not perform this inference, so we could not actually query for these things in SPARQL.
- Note also that there can be multiple swo:implements assertions for a software instance, so multiple algorithms can be easily asserted. However, this would not also say which of the implemented algorithms was the one used to generate the tree or alignment. The idea is that this would be evident from the miapa:'Parameter specification'.
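One possible workaround for the triple-store limitation, sketched here as an assumption rather than something the group implemented, is to materialize the class-level knowledge as explicit instance-level `swo:implements` triples before loading. All class and instance names below (`ex:RAxML`, `ex:MultiMethodSoftware`, `:software1`, and so on) are hypothetical.

```python
# Hypothetical class-level knowledge: which algorithms each software class
# implements. A software class may implement more than one algorithm.
CLASS_IMPLEMENTS = {
    "ex:RAxML": ["ex:MaximumLikelihoodAlgorithm"],
    "ex:PhyML": ["ex:MaximumLikelihoodAlgorithm"],
    "ex:MultiMethodSoftware": ["ex:MaximumLikelihoodAlgorithm",
                               "ex:ParsimonyAlgorithm"],
}


def materialize_implements(instance_types):
    """Expand {instance: software class} into explicit swo:implements triples.

    The result can be queried in a plain triple store without a reasoner.
    It still does not say which algorithm was actually used for a given tree.
    """
    triples = []
    for instance, software_class in sorted(instance_types.items()):
        for algorithm in CLASS_IMPLEMENTS.get(software_class, []):
            triples.append((instance, "swo:implements", algorithm))
    return triples


for triple in materialize_implements({":software1": "ex:RAxML",
                                      ":software2": "ex:MultiMethodSoftware"}):
    print("%s %s %s ." % triple)
```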
Remove class for SILVA.
Add new class for set of trees.
more annotations
miapa ontology
- topology
- gene tree vs species tree: Network:Tree:'Gene tree' or SpeciesTree
- rooted: Network:Tree:RootedTree or UnrootedTree
- 'Consensus tree'
- otus
- toTaxon, object property, points to taxon concept, can be URI from NCBI or other authority
- derived_from specimen
- location imported from geo
- branch properties
- branch lengths:
- data property edge length
- object property has_Annotation edge_length
- branch support: data property has support value either bootstrap or posterior prob
- character matrix
- alignment method
- name of software, version
- parameters
- manual correction
- tree inference method
- name of software, version: tree wasGeneratedBy (activity=) software procedure; software procedure wasAssociatedWith instance of software agent named "RAxML"
- parameters: (activity) used instance of a parameter specification (which is a kind of plan)
- character weights
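To illustrate how the tree inference annotations above could support metadata-based querying, here is a minimal Python sketch that holds the provenance pattern as plain (subject, predicate, object) tuples and looks up entities by the name of the software agent. The identifiers (`:tree1`, `:tree_activity1`, `:raxml1`, `:params1`), the use of `rdfs:label` to carry the agent name, and the function names are hypothetical; a real treestore would answer this with a SPARQL query.

```python
# Toy in-memory triples following the pattern:
#   tree --wasGeneratedBy--> activity --wasAssociatedWith--> software agent
#   activity --used--> parameter specification
TRIPLES = [
    (":tree1", "prov:wasGeneratedBy", ":tree_activity1"),
    (":tree_activity1", "prov:wasAssociatedWith", ":raxml1"),
    (":tree_activity1", "prov:used", ":params1"),
    (":raxml1", "rdfs:label", "RAxML"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def entities_generated_with(triples, software_name):
    """Find entities whose generating activity is associated with the
    software agent carrying the given label."""
    found = []
    for entity, predicate, activity in triples:
        if predicate != "prov:wasGeneratedBy":
            continue
        for agent in objects(triples, activity, "prov:wasAssociatedWith"):
            if software_name in objects(triples, agent, "rdfs:label"):
                found.append(entity)
    return found


print(entities_generated_with(TRIPLES, "RAxML"))  # prints [':tree1']
```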
semantic links for tree, citation, methods, etc
- tree has unique URI produced during loading: http://phylotastic.org/hack2/...number.../...treename...#tree1
- how rooted tree connects together
:tree1 has_root node0 ;
- how unrooted tree connects together, using the belongs_to_tree relation
:node9 obo:CDAO_0000200 :tree1 ;
- and the same for all the other nodes and edges.
- how tree connects with citation (assume that pub1 is the root of the <bibo:AcademicArticle> individual )
:tree1 dcterms:isReferencedBy :pub1 ;
- some other ideas
- :pub1 IAO:is_about :tree1
- :pub1 documents :tree1
- :pub1 cito:provides_methods_for :tree1
- :pub1 cito:provides_data_for :tree1
- how tree connects with methods annotation
:tree1 prov:wasGeneratedBy :tree_activity1 ;
- how char matrix connects with methods annotation
:align1 prov:wasGeneratedBy :align_activity1 ;
- how tree connects with char matrix
:tree1 prov:wasDerivedFrom :align1 ;
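The four main links above (tree to citation, tree to its generating activity, alignment to its generating activity, tree to alignment) can be written out together. The following Python sketch generates them as Turtle; the function names and the base URI `http://example.org/trees/demo#` are hypothetical stand-ins, since the real base is produced by the treestore at load time.

```python
# Namespace IRIs for the two vocabularies used in the link statements.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
}


def link_statements(tree, pub, tree_activity, alignment, align_activity):
    """Return the linking statements as (subject, predicate, object) tuples."""
    return [
        (tree, "dcterms:isReferencedBy", pub),
        (tree, "prov:wasGeneratedBy", tree_activity),
        (alignment, "prov:wasGeneratedBy", align_activity),
        (tree, "prov:wasDerivedFrom", alignment),
    ]


def to_turtle(statements, base):
    """Serialize statements as Turtle, binding the empty prefix to `base`."""
    header = ["@prefix %s: <%s> ." % (name, iri)
              for name, iri in sorted(PREFIXES.items())]
    header.append("@prefix : <%s> ." % base)
    body = ["%s %s %s ." % statement for statement in statements]
    return "\n".join(header + [""] + body)


statements = link_statements(":tree1", ":pub1", ":tree_activity1",
                             ":align1", ":align_activity1")
# The base URI here is a placeholder, not a real treestore URI.
print(to_turtle(statements, "http://example.org/trees/demo#"))
```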
Annotation Workflow
Example file: Tree_2_Peters_et_al.newick
1. `python treestore.py add Tree_2_Peters_et_al.newick newick Peters2011hymenoptera`
- reads Newick file `Tree_2_Peters_et_al.newick`
- stores the tree in the named graph `http://prefix/Peters2011hymenoptera`
- the URI prefix is automatically generated; it is a hash that (more or less) uniquely identifies the data loaded
2. `python treestore.py uri`
- lists tree URIs in the triple store
- will show something along the lines of: "Peters2011hymenoptera http://phylotastic.org/hack2/bd414f8f72a8fabb9454b4ea72cf0e8a760171ba/Peters2011hymenoptera#tree0000001"
3. `rdfcat -out N-TRIPLE annotations.rdf > annotations.ntriples`
- takes annotations (saved with Protege as RDF/XML, Turtle, or other format)
- outputs N-Triples
4. `python treestore.py add annotations.ntriples ntriples http://phylotastic.org/hack2/bd414f8f72a8fabb9454b4ea72cf0e8a760171ba/Peters2011hymenoptera`
- adds the annotations to the named graph `http://phylotastic.org/hack2/bd414f8f72a8fabb9454b4ea72cf0e8a760171ba/Peters2011hymenoptera`
- the URI for the named graph is the URI returned by `python treestore.py uri` up to the `#` character
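The rule in step 4 (take the tree URI up to the `#` character) is easy to get wrong by hand, so it may help to script it. The sketch below derives the named graph URI and builds the step 4 command as an argument list without running it. The helper names `named_graph_uri` and `add_annotations_command` are hypothetical; only the `treestore.py add` invocation shown in step 4 is taken from the workflow above.

```python
def named_graph_uri(tree_uri):
    """Return the part of a tree URI before the '#' character."""
    graph, separator, _fragment = tree_uri.partition("#")
    if not separator:
        raise ValueError("tree URI has no '#' fragment: %r" % tree_uri)
    return graph


def add_annotations_command(tree_uri, ntriples_path="annotations.ntriples"):
    """Build the step 4 command as an argument list (not executed here)."""
    return ["python", "treestore.py", "add", ntriples_path, "ntriples",
            named_graph_uri(tree_uri)]


tree_uri = ("http://phylotastic.org/hack2/"
            "bd414f8f72a8fabb9454b4ea72cf0e8a760171ba/"
            "Peters2011hymenoptera#tree0000001")
print(named_graph_uri(tree_uri))
print(" ".join(add_annotations_command(tree_uri)))
```

For the example tree, the first line printed is the named graph URI used in step 4. The argument list could be passed to `subprocess.run` to execute the command.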