Tree Annotation
Synopsis
Annotate a small set of large trees used as sources of phylogenetic knowledge in "Phylotastic", an automated delivery system for tree-of-life knowledge.
Overview
The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or by specifying search criteria that it satisfies).
An initial attempt at prioritizing
- support the most common user criteria for searching a treestore to get the right tree (whatever they are)
- query on OTU identifiers for tips (most common), internal identifiers (e.g., taxonomy markups), sources or types of data, method?
- support adequate annotation of the provenance of the resulting Phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it?
- support the proper assignment of credit (blame) for tree-producers and Phylotastic service-providers
- support licensing that protects creators, resource-providers and end-users
Questions & possible approaches
What are sensible rules for treating annotations of trees subject to manipulation? For instance, a bootstrap value is typically associated with a split on an unrooted tree, but people think of it as being associated with a node on a rooted tree. If we have pruned several groups on one side of the split, the split no longer has the same meaning. It seems to me that, under some conditions, the original value becomes an underestimate of the true support for the split that remains.
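To make the pruning problem concrete, here is a minimal sketch (not part of any Phylotastic code) that treats each bootstrap value as an annotation on a bipartition of the taxon set and then restricts the bipartitions to the retained taxa. The taxa, the support values, and the function names are all made up for illustration.

```python
from collections import defaultdict

def make_split(side, all_taxa):
    """Build a bipartition as a frozenset holding its two sides."""
    side = frozenset(side)
    return frozenset((side, frozenset(all_taxa) - side))

def restrict_split(split, kept):
    """Restrict a bipartition to the retained taxa.

    Returns None when a side is left with fewer than two taxa,
    because the split is then uninformative.
    """
    a, b = (side & kept for side in split)
    if len(a) < 2 or len(b) < 2:
        return None
    return frozenset((a, b))

def induced_supports(supports, kept):
    """Group the original support values by the split each one induces."""
    kept = frozenset(kept)
    grouped = defaultdict(list)
    for split, value in supports.items():
        induced = restrict_split(split, kept)
        if induced is not None:
            grouped[induced].append(value)
    return dict(grouped)

# Hypothetical six-taxon tree with made-up bootstrap values.
TAXA = "ABCDEF"
SUPPORTS = {
    make_split("AB", TAXA): 90,
    make_split("ABC", TAXA): 60,
    make_split("EF", TAXA): 80,
}

# Prune taxon C: AB|CDEF and ABC|DEF both collapse to AB|DEF.
pruned = induced_supports(SUPPORTS, set(TAXA) - {"C"})
for split, values in pruned.items():
    print(sorted("".join(sorted(side)) for side in split), values)
```

After pruning, two different original splits map to the same induced split, so no single original value describes it. Any bootstrap replicate that contained either original split also contains the induced one, so each original value can only be a lower bound on the induced split's support. This is consistent with the suspicion that carried-over values underestimate support.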
Probably we want to develop some concrete test-cases. These could be real trees, stripped-down versions of real trees, or imaginary (but realistic) trees, but the important thing is that we have multiple cases of trees for which there are concrete instances of metadata. Some of these will be tree-level annotations (apply to whole tree) and some will be OTU- or node- or branch-associated annotations (in principle).
We could create free-text versions of annotations, based on the most important criteria from the MIAPA checklist. We could carry out a thought-experiment of asking what we need to represent in order to process queries and create annotations for modified trees.
The next step would be to try to encode some of this stuff more formally. For instance, we could use NeXML files.
If we have NeXML plus ontology plus translation to CDAO RDF (all stuff that has been used at previous hackathons), then we can feed the test files into a triple store and try to execute some queries using SPARQL.
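Before setting up a real triple store, the kind of query we have in mind can be prototyped in a few lines. The sketch below is only a stand-in: it is not SPARQL, and the `ex:` predicates and `tree:` identifiers are hypothetical placeholders rather than CDAO terms. It evaluates a conjunction of triple patterns, which is the core of a basic SPARQL graph pattern.

```python
# Hypothetical metadata triples; predicates and identifiers are placeholders.
TRIPLES = {
    ("tree:t1", "ex:dataType", "molecular"),
    ("tree:t1", "ex:hasTip", "Homo_sapiens"),
    ("tree:t2", "ex:dataType", "morphological"),
    ("tree:t2", "ex:hasTip", "Homo_sapiens"),
    ("tree:t3", "ex:dataType", "molecular"),
    ("tree:t3", "ex:hasTip", "Danio_rerio"),
}

def match(pattern, triples):
    """Yield variable bindings for one (s, p, o) pattern.

    Terms that start with '?' are variables; everything else is a constant.
    """
    for triple in triples:
        binding = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield binding

def query(patterns, triples):
    """Return all bindings that satisfy every pattern (a conjunctive query)."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for solution in solutions:
            # Substitute variables that are already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(bound, triples):
                extended.append({**solution, **binding})
        solutions = extended
    return solutions

# "Which trees are based on molecular data and include Homo_sapiens as a tip?"
results = query(
    [("?tree", "ex:dataType", "molecular"),
     ("?tree", "ex:hasTip", "Homo_sapiens")],
    TRIPLES,
)
print(sorted(r["?tree"] for r in results))
```

Once the test files are translated to CDAO RDF, the same question would be posed as a SPARQL query against the triple store, with real ontology terms in place of the placeholders.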
Resources
- reporting standards and related information
- MIAPA draft checklist from TDWG 2011 workshop
- some slides from an OToL project showing which types of metadata consumers want (and producers are willing to provide)
- ontologies
- representation of metadata: NeXML
- some stuff that Rutger did mapping the metadata from ToLWeb XML format onto semantic annotations in NeXML.
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml
Possible deliverables
- a set of >10 trees with a succinct version of minimal information
- a free-text version
- an encoded version (NeXML, NEXUS, PhyloXML)
- sample queries based on metadata
- a set of queries that a TreeStore should be able to process (cf. Nakhleh et al., 2003)
- a set of tests based on input trees (e.g., find_molecular_trees( TestTreeSet ) ==> return the correct list of trees annotated as being based on molecular data)
- formal language support for this annotation
- a list of terms with free-text definitions
- a reference list of relevant ontologies, e.g., OBI, CDAO
- an ontology or extension to existing ontologies
- a token TreeStore implementation that satisfies tests
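The test deliverable above (find_molecular_trees( TestTreeSet )) could take roughly the following shape. This is a sketch under assumptions: the `TreeRecord` fields, the tree identifiers, and the "molecular"/"morphological" vocabulary are invented here, and a real TreeStore would query stored annotations rather than an in-memory list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TreeRecord:
    """Hypothetical minimal annotation record for one source tree."""
    tree_id: str
    data_types: frozenset  # e.g. {"molecular"}, {"morphological"}, or both

def find_molecular_trees(tree_set):
    """Return the ids of trees annotated as being based on molecular data."""
    return [t.tree_id for t in tree_set if "molecular" in t.data_types]

# A made-up test set with known answers.
TEST_TREE_SET = [
    TreeRecord("t1", frozenset({"molecular"})),
    TreeRecord("t2", frozenset({"morphological"})),
    TreeRecord("t3", frozenset({"molecular", "morphological"})),
    TreeRecord("t4", frozenset()),  # no data-type annotation at all
]

def test_find_molecular_trees():
    assert find_molecular_trees(TEST_TREE_SET) == ["t1", "t3"]

test_find_molecular_trees()
```

Writing the expected answers down first, as in `TEST_TREE_SET`, forces a decision on edge cases such as trees with mixed data or with no annotation.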
Getting started
List of trees with description and links to sources & methods
- NCBI taxonomy tree (http://www.ncbi.nlm.nih.gov/guide/taxonomy/)
- 250000 species
- available as an SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ (see the README file)
- manually curated
- FYI, NCBI provides an interactive way to get a tree phylotastically (http://www.ncbi.nlm.nih.gov/guide/howto/gen-com-tree/)
- Supertree of mammals from Bininda-Emonds et al. 2007
- 4510 species
- file: NEXUS format, species-level, includes branch lengths (File:Bininda-emonds 2007 mammals.nex)
- link to supplementary data including description of phylogeny methods: http://www.nature.com/nature/journal/v446/n7135/suppinfo/nature05634.html
- Angiosperm Phylogeny Group (APG) tree from APG III
- free full text version
- file File:Phylomatictree.nex
- Nodes with IDs: 1,827
- max ID Length: 34 (harrimanelloideae_to_vaccinioideae)
- Manually curated, right?
- Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
- 16K tips ?
- needs conversion to Newick or NEXUS format
- (note: this tree structure in TOL.xml.zip is *old*, it is from October 22, 2006)
- Angiosperm phylogeny from Smith et al. 2011
- file: Newick format, species-level (File:Smith 2011 angiosperms.txt)
- Nodes with IDs: 55,473
- max ID Length: 63 (Aesculus_glabra_var__arguta_x_Aesculus_sylvatica_var__pubescens)
- 800k-node GreenGenes tree from early 2011
- file: Newick format, includes branch lengths (File:Greengenes2011.txt)
- Nodes with IDs: 413,004
- max ID Length: 135 (c__Thermolithobacteria; o__Thermolithobacterales; f__Thermolithobacteraceae; g__Thermolithobacter; s__Thermolithobacter ferrireducens)
- Tree of all eukaryotes in GenBank from Goloboff et al. 2009
- free full text with methods
- 73,060 terminal taxa analyzed with parsimony in TNT
- file: zipped version of TNT-formatted treefiles and diagrams (File:Goloboff Trees.zip)
- here is the attempt to convert this to Newick (needs testing): File:Goloboff molecules only shortest.nwk.txt
- the github repository has scripts used to convert from TNT
- Avian phylogeny of Jetz et al. 2012
- 9,993 birds
- http://www.nature.com/nature/journal/v491/n7424/full/nature11631.html#/supplementary-information
- methods are described in the supplementary info PDF
- the trees are provided (NEXUS format) in the supplementary data package "MCC_trees.zip" in the supplementary data files above
- Hymenoptera tree of Peters et al. (1100 species)
- Megaphylogeny of 800+ living and fossil families of fishes, Westneat and Lundberg unpublished
- file: NEXUS format including Mesquite extensions, family level with other higher-taxa labelled (File:Westneat Lundberg BigFishTree.nex)
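The "Nodes with IDs" and "max ID Length" figures in the list above can be recomputed for any Newick file with a small script. The sketch below is an assumption about how such counts might be obtained, not the script that produced the numbers above. It assumes the Newick string has no bracketed comments, and it counts any label on an internal node, so support values written as internal labels would be counted too. The example tree and its tip names are made up; only the long internal label comes from the list above.

```python
import re

# A token is a quoted label, a punctuation mark, or a run of other characters.
TOKEN = re.compile(r"'(?:[^']|'')*'|[(),:;]|[^(),:;'\s]+")
PUNCTUATION = set("(),:;")

def node_labels(newick):
    """Return every tip and internal-node label in a Newick string."""
    labels = []
    previous = None
    for token in TOKEN.findall(newick):
        # A non-punctuation token right after ':' is a branch length.
        if token not in PUNCTUATION and previous != ":":
            if token.startswith("'"):
                # Drop the enclosing quotes and unescape doubled quotes.
                token = token[1:-1].replace("''", "'")
            labels.append(token)
        previous = token
    return labels

def label_stats(newick):
    """Return (number of labelled nodes, max label length, longest label)."""
    labels = node_labels(newick)
    longest = max(labels, key=len, default="")
    return len(labels), len(longest), longest

example = ("((tipA:1.0,tipB:1.0)harrimanelloideae_to_vaccinioideae:0.5,"
           "'tip C':2.0)root;")
count, max_length, longest = label_stats(example)
print(count, max_length, longest)
```

For the very large files listed here (e.g., the GreenGenes tree), reading the whole string into memory and tokenizing it in one pass should still be feasible, but that has not been tested.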
A note from Brian O: btw, I made an R package (phyloorchard) to hold large trees; it has a few in there now, and I'll add the ones above. Feel free to request to be added to that project if you want to do more with it. --BrianOMeara 15:56, 25 April 2012 (EDT)