Tree Annotation
Synopsis Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".
Overview
The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or by specifying search criteria that it satisfies).
An initial attempt at prioritizing
- support the most common user criteria for searching a treestore to get the right tree (whatever they are)
- query on OTU identifiers for tips (most common), internal identifiers (e.g., taxonomy markups), sources or types of data, method?
- support adequate annotation of the provenance of the resulting phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it?
- support the proper assignment of credit (blame) for tree-producers and phylotastic service-providers
- support licensing that protects creators, resource-providers and end-users
Questions & possible approaches
What are sensible rules for treating annotations of trees subject to manipulation? For instance, a bootstrap value is typically associated with a split on an unrooted tree, but people think of it as being associated with a node on a rooted tree. If we have pruned several groups on one side of the split, the split doesn't have the same meaning anymore. It seems to me that, under some conditions, the original value becomes an underestimate of the true support value.
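The pruning problem can be made concrete with a toy case. The sketch below is only an illustration: the six-tip tree, the support values, and the helper name `induced_split` are all made up for the example. It represents each split by the set of tips on one side, restricts it to the tips kept after pruning, and shows that several distinct original splits can collapse onto the same split of the pruned tree, each bringing a different support value.

```python
from collections import defaultdict

def induced_split(side, all_tips, kept):
    """Restrict a split to the kept tips.

    `side` is the set of tips on one side of the original split.
    Returns the induced bipartition as a frozenset of its two sides,
    or None if pruning makes the split trivial (fewer than 2 tips on a side).
    """
    a = frozenset(side & kept)
    b = frozenset((all_tips - side) & kept)
    if len(a) < 2 or len(b) < 2:
        return None
    return frozenset({a, b})

# Hypothetical unrooted tree ((A,B),(C,(D,(E,F)))) with made-up support values.
all_tips = frozenset("ABCDEF")
support = {
    frozenset("AB"): 95,   # split AB | CDEF
    frozenset("ABC"): 60,  # split ABC | DEF
    frozenset("EF"): 80,   # split ABCD | EF
}

# Prune C and D.
kept = frozenset("ABEF")

collapsed = defaultdict(list)
for side, value in support.items():
    key = induced_split(side, all_tips, kept)
    if key is not None:
        collapsed[key].append(value)

for key, values in collapsed.items():
    sides = " | ".join("".join(sorted(s)) for s in sorted(key, key=sorted))
    print(sides, "inherits support values", sorted(values))
```

In this toy case all three original splits become AB | EF after pruning. Any replicate tree that contains one of the original splits also contains AB | EF once restricted to the kept tips, so each inherited value can only be a lower bound on the support of the induced split. A rule for propagating annotations would have to decide which value, if any, to report.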
Probably we want to develop some concrete test-cases. These could be real trees, stripped-down versions of real trees, or imaginary (but realistic) trees, but the important thing is that we have multiple cases of trees for which there are concrete instances of metadata. Some of these will be tree-level annotations (apply to whole tree) and some will be OTU- or node- or branch-associated annotations (in principle).
We could create free-text versions of annotations, based on the most important criteria from the MIAPA checklist. We could carry out a thought-experiment of asking what we need to represent in order to process queries and create annotations for modified trees.
The next step would be to try to encode some of this stuff more formally. For instance, we could use NeXML files.
If we have NeXML plus ontology plus translation to CDAO RDF (all stuff that has been used at previous hackathons), then we can feed the test files into a triple store and try to execute some queries using SPARQL.
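Before setting up a real triple store, the shape of such a query can be prototyped in plain Python. The sketch below is a stand-in, not SPARQL and not CDAO: the `ex:` identifiers, the `ex:hasTip` predicate, and the `match` helper are hypothetical placeholders, and the tip assignments are made up. It answers the most common kind of question (which trees contain a given set of OTUs) by intersecting two triple patterns, which is roughly what a SPARQL join over a shared variable would do.

```python
# Toy triple list; all identifiers are hypothetical placeholders.
triples = [
    ("ex:t1", "ex:hasTip", "Homo_sapiens"),
    ("ex:t1", "ex:hasTip", "Mus_musculus"),
    ("ex:t2", "ex:hasTip", "Homo_sapiens"),
    ("ex:t2", "ex:hasTip", "Gallus_gallus"),
    ("ex:t3", "ex:hasTip", "Mus_musculus"),
    ("ex:t3", "ex:hasTip", "Homo_sapiens"),
]

def match(pattern, store):
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for triple in store:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple

def trees_with_tip(tip, store):
    """Return the set of tree identifiers that have `tip` as a tip."""
    return {s for s, _, _ in match((None, "ex:hasTip", tip), store)}

# Trees containing both OTUs: the intersection plays the role of a join.
both = sorted(
    trees_with_tip("Homo_sapiens", triples) & trees_with_tip("Mus_musculus", triples)
)
print(both)
```

Once the test files are available as RDF, the same question would be expressed as a SPARQL query with two triple patterns sharing the tree variable.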
Resources
MIAPA
- The main MIAPA page. Primarily:
- MIAPA draft checklist from TDWG 2011 workshop.
- Emily McTavish's Tree Annotation Vocabulary as resulting from the Phylotastic I hackathon. Includes a mapping to MIAPA draft checklist attributes.
- Slideshow with some results from the MIAPA community survey orchestrated by the Open Tree of Life project in fall 2012.
- Ontologies
Annotation
- representation of metadata: NeXML
- some stuff that Rutger did mapping the metadata from ToLWeb XML format onto semantic annotations in NeXML.
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml
Possible deliverables
- a set of >10 trees with a succinct version of minimal information
- a free-text version
- an encoded version (NeXML, NEXUS, PhyloXML)
- sample queries based on metadata
- a set of queries that a TreeStore should be able to process (cf. Nakhleh et al., 2003)
- a set of tests based on input trees (e.g., find_molecular_trees( TestTreeSet ) ==> return the correct list of trees annotated as being based on molecular data)
- formal language support for this annotation
- a list of terms with free-text definitions
- a reference list of relevant ontologies, e.g., OBI, CDAO
- an ontology or extension to existing ontologies
- a token TreeStore implementation that satisfies tests
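The `find_molecular_trees( TestTreeSet )` test mentioned above could look something like the following. This is a minimal sketch under stated assumptions: the `AnnotatedTree` record, its `data_types` field, and the four test trees are hypothetical, and a real TreeStore would read these annotations from NeXML or a triple store rather than from an in-memory list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedTree:
    """Hypothetical tree-level annotation record."""
    tree_id: str
    data_types: frozenset  # e.g. {"molecular"}, {"morphological"}, or empty

def find_molecular_trees(tree_set):
    """Return the IDs of trees annotated as based (at least partly) on molecular data."""
    return [t.tree_id for t in tree_set if "molecular" in t.data_types]

# Made-up test set covering the obvious cases.
TEST_TREE_SET = [
    AnnotatedTree("t1", frozenset({"molecular"})),
    AnnotatedTree("t2", frozenset({"morphological"})),
    AnnotatedTree("t3", frozenset({"molecular", "morphological"})),
    AnnotatedTree("t4", frozenset()),  # no data-type annotation at all
]

def test_find_molecular_trees():
    assert find_molecular_trees(TEST_TREE_SET) == ["t1", "t3"]

test_find_molecular_trees()
```

Writing the test forces a decision that the free-text annotation can leave vague: here a tree built from mixed data counts as molecular, and an unannotated tree does not. Either choice could be reversed, but the test set should state it.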
Getting started
List of trees with description and links to sources & methods
- NCBI taxonomy tree (http://www.ncbi.nlm.nih.gov/guide/taxonomy/)
- 250000 species
- available as an SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ (see the README file)
- manually curated
- FYI, NCBI provides an interactive way to get a tree phylotastically (http://www.ncbi.nlm.nih.gov/guide/howto/gen-com-tree/)
- Supertree of mammals from Bininda-Emonds et al. 2007
- 4510 species
- file: NEXUS format, species-level, includes branch lengths (File:Bininda-emonds 2007 mammals.nex)
- link to supplementary data including description of phylogeny methods: http://www.nature.com/nature/journal/v446/n7135/suppinfo/nature05634.html
- Angiosperm phylogeny group (APG) tree of APGIII
- free full text version
- file File:Phylomatictree.nex
- Nodes with IDs: 1,827
- max ID Length: 34 (harrimanelloideae_to_vaccinioideae)
- Manually curated, right?
- Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
- 16K tips ?
- needs conversion to Newick or NEXUS format
- (note: this tree structure in TOL.xml.zip is *old*, it is from October 22, 2006)
- Angiosperm phylogeny from Smith et al. 2011
- file: Newick format, species-level (File:Smith 2011 angiosperms.txt)
- Nodes with IDs: 55,473
- max ID Length: 63 (Aesculus_glabra_var__arguta_x_Aesculus_sylvatica_var__pubescens)
- Tree of 720 taxa from The Genomic Encyclopedia of Bacteria and Archaea (GEBA)
- file: Nexus, includes branch lengths (File:GEBAtree.nex)
- NEXML and Phyml trees available from TreeBASE
- link to website; includes link to publication
- link to TreeBASE page
- Tree of all Eukaryotes in Genbank from Goloboff et al. 2009
- free full text with methods
- 73,060 terminal taxa analyzed with parsimony in TNT
- file: zipped version of TNT-formatted treefiles and diagrams (File:Goloboff Trees.zip)
- here is the attempt to convert this to Newick (needs testing): File:Goloboff molecules only shortest.nwk.txt
- the github repository has scripts used to convert from TNT
- Avian phylogeny of Jetz et al. 2012
- 9,993 birds
- http://www.nature.com/nature/journal/v491/n7424/full/nature11631.html#/supplementary-information
- methods are described in the supplementary info PDF
- the trees are provided (NEXUS format) in the supplementary data package "MCC_trees.zip" in the supplementary data files above
- Peters et al. Hymenoptera tree (1100 species)
- Megaphylogeny of 800+ living and fossil families of fishes, Westneat and Lundberg unpublished
- file: NEXUS format including Mesquite extensions, family level with other higher-taxa labelled (File:Westneat Lundberg BigFishTree.nex)
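Figures such as "Nodes with IDs" and "max ID Length" in the list above can be recomputed from a Newick file with a few lines of code. The sketch below is a rough approach, not a full parser: it assumes unquoted labels and no bracketed comments, the function name `newick_labels` is made up, and the example string is a toy tree rather than one of the files listed.

```python
import re

# A label is the token that directly follows "(", "," or ")".
# Branch lengths follow ":" and are therefore never captured.
LABEL_PATTERN = re.compile(r"[(),]\s*([^(),:;\s]+)")

def newick_labels(newick):
    """Return all tip and internal node labels in order of appearance.

    Assumes unquoted labels and no [comments]; quoted labels with
    spaces or punctuation would need a real Newick parser.
    """
    return LABEL_PATTERN.findall(newick)

# Toy tree, made up for the example.
toy = "((A_one:1,B_two:2)inner_1:0.5,C_three:3)root;"

labels = newick_labels(toy)
longest = max(labels, key=len)

print("Nodes with IDs:", len(labels))
print("max ID Length:", len(longest), f"({longest})")
```

The same counts are a cheap sanity check after converting a tree from TNT or ToLWeb XML to Newick: if the number of labels or the longest label changes, something was lost or mangled in the conversion.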
Hackathon workflow
- Create spreadsheet in Google Docs for annotations, based on ...
- Spreadsheet has pull-down menus, plus options for free-text entries under "other"
- Hackers read papers or other documentation to fill in the spreadsheet
- Convert free text
A note from Brian O: btw, I made an R package (phyloorchard) to hold large trees; it has a few in there now, and I'll add the ones above. People should feel free to request to be added to that project if they want to do more with it. --BrianOMeara 15:56, 25 April 2012 (EDT)