Tree Annotation
Revision as of 03:56, 30 January 2013
Synopsis
Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-of-life knowledge called "Phylotastic".
Overview
The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or via satisfaction of search criteria).
An initial attempt at prioritizing
- support the most common user criteria for searching a treestore to get the right tree (whatever they are)
- return tree with maximal coverage of list ::= { species }
- specify namespace of <list>
- limits on source trees
- restrict by publication status (published or not)
- restrict by publication year (e.g., no trees older than 5 years)
- restrict by author (e.g., author = Bininda-Emonds)
- restrict by other citation information (e.g., journal = Index Fungorum)
- restrict by method
- use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted, ... }
- use (exclude) trees made with evolutionary model = { }
- use (exclude) trees made with software = { RAxML, BEAST, PAUP*, ... }
- other restrictions (e.g., supertree, ...)
- restrict by availability of source data such as character matrix
- restrict by minimum number of characters in matrix
- restrict by type of source data = { molecular, morphological, mixed }
- restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
- require feature
- require support values
- require rooting
- require branch lengths
- require other features
- limits on phylotastic manipulations
- disallow species substitution
- disallow grafting of source trees
- scaling: provide median age estimates
- scaling: provide only lower age estimates
- TNRS: prohibit fuzzy matching
- TNRS: use matches from source = { NCBI, ITIS, etc. }
- support adequate annotation of the provenance of the resulting phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
- citation information
- support the proper assignment of credit (blame) for tree-producers and phylotastic service-providers
- support licensing that protects creators, resource-providers and end-users
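The search criteria above (maximal coverage of a species list, subject to limits on source trees and required features) can be sketched as a filter over tree metadata records. This is only an illustration under stated assumptions: the `TreeRecord` class, its field names, and the `select_tree` function are hypothetical and are not part of any existing Phylotastic component.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class TreeRecord:
    """Hypothetical metadata record for one source tree."""
    name: str
    species: FrozenSet[str]          # tip labels, already in the query namespace
    year: Optional[int] = None       # None stands for "unpublished"
    method: str = "unknown"          # e.g. "parsimony", "supertree", "Bayesian"
    has_branch_lengths: bool = False
    is_rooted: bool = False


def select_tree(records: Iterable[TreeRecord],
                query_species: Iterable[str],
                min_year: Optional[int] = None,
                exclude_methods: Iterable[str] = (),
                require_branch_lengths: bool = False,
                require_rooting: bool = False) -> Optional[TreeRecord]:
    """Return the admissible record covering the most query species, or None."""
    query = set(query_species)
    excluded = set(exclude_methods)
    best, best_coverage = None, 0
    for record in records:
        # Limits on source trees: publication year and method class.
        if min_year is not None and (record.year is None or record.year < min_year):
            continue
        if record.method in excluded:
            continue
        # Required features.
        if require_branch_lengths and not record.has_branch_lengths:
            continue
        if require_rooting and not record.is_rooted:
            continue
        coverage = len(query & record.species)
        if coverage > best_coverage:
            best, best_coverage = record, coverage
    return best
```

Ties are resolved in favor of the first record seen, and a tree that covers none of the query species is never returned. A real treestore would express the same restrictions as queries over the stored annotations rather than as an in-memory loop.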
Resources
MIAPA
- The main MIAPA page. Primarily:
- MIAPA draft checklist from TDWG 2011 workshop.
- Emily McTavish's Tree Annotation Vocabulary as resulting from the Phylotastic I hackathon. Includes a mapping to MIAPA draft checklist attributes.
- Slideshow with some results from the MIAPA community survey orchestrated by the Open Tree of Life project in fall 2012.
- Ontologies
- Development:
- MIAPA repo on Github
- Template development after the TNRS ontology developed at (and after) Phylotastic I.
Annotation
- representation of metadata: NeXML
- some stuff that Rutger did mapping the metadata from ToLWeb XML format onto semantic annotations in NeXML.
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml
Getting started
Source trees targeted for annotation
Annotated
- Supertree of mammals from Bininda-Emonds et al. 2007
- 4510 species
- file NEXUS format, species-level, includes branch lengths (File:Bininda-emonds 2007 mammals.nex)
- we are using the "mammalST_bestDates" tree out of 3 in the NEXUS file
- link to supplementary data including description of phylogeny methods: http://www.nature.com/nature/journal/v446/n7135/suppinfo/nature05634.html
- URI: http://phylotastic.org/data/Bininda-emonds_2007_mammals.nex
- Angiosperm phylogeny group (APG) tree of APGIII
- free full text version
- file File:Phylomatictree.nex
- Nodes with IDs: 1,827
- max ID Length: 34 (harrimanelloideae_to_vaccinioideae)
- Manually curated, right?
- URI: http://phylotastic.org/data/APGIII_Phylomatic_tree.nex
- Peters et al. Hymenoptera tree (1100 species)
- File:Tree 2 Peters et al.tre has properly formatted bootstraps like "(<node contents>):<length>[<support>]"
- File:Tree 1 Peters et al.tre has improperly formatted bootstraps like "(<node contents>)<support>:<length>"
- URI: http://phylotastic.org/data/Peters_etal_Hymenoptera.nwk
- Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
- reference is 2007 paper by Maddison, Schulz and Maddison
- 16K tips
- needs conversion to Newick or NEXUS format
- (note: this tree structure in TOL.xml.zip is *old*, it is from October 22, 2006)
- angiosperm phylogeny from Smith et al. 2011, [1]
- data file at Dryad
- file Newick format, species-level (File:Smith 2011 angiosperms.txt)
- Nodes with IDs: 55,473
- max ID Length: 63 (Aesculus_glabra_var__arguta_x_Aesculus_sylvatica_var__pubescens)
- URI: http://dx.doi.org/10.5061/dryad.8790/1
- Tree of 720 taxa from The Genomic Encyclopedia of Bacteria and Archaea (GEBA)
- file: Nexus, includes branch lengths (File:GEBAtree.nex)
- NEXML and Phyml trees available from TreeBASE
- link to website; includes link to publication
- link to TreeBASE page
- URI: http://purl.org/phylo/treebase/phylows/tree/TB2:Tr25470
- Avian phylogeny of Jetz, et al. 2012
- 9,993 birds
- http://www.nature.com/nature/journal/v491/n7424/full/nature11631.html#/supplementary-information
- methods are described in the supplementary info PDF
- the trees are provided (NEXUS format) in the supplementary data package "MCC_trees.zip" in the supplementary data files above
- we got an arbitrary tree File:One arbitrarily chosen jetz tree.tre from the birdtree.org
- this is the first tree from the file EricsonStage2_9001_10000.zip
- URI: http://phylotastic.org/data/Jetz_etal_2012_one_birdtree.nwk
- All-species living tree of life based on SSU rRNA
- publication: http://eigr.grupoei.com/i/i8031/publicaciones/80-LIVING_TREE.pdf
- web site with tree: http://www.arb-silva.de/projects/living-tree/
- you can get the alignment and the Newick tree from this site
- Newick file has metadata section
- most recent tree: File:LTPs108 SSU tree.txt This is a Newick tree.
- URI: http://www.arb-silva.de/fileadmin/silva_databases/living_tree/LTP_release_108/LTPs108_SSU_tree.newick
- Tree of all Eukaryotes in Genbank from Goloboff et al. 2009
- free full text with methods
- 73,060 terminal taxa analyzed with parsimony in TNT
- file: zipped version of TNT-formatted treefiles and diagrams (File:Goloboff Trees.zip)
- here is the attempt to convert this to Newick (needs testing): File:Goloboff molecules only shortest.nwk.txt
- the github repository has scripts used to convert from TNT
- URI: http://phylotastic.org/data/Goloboff_etal_2009_molecules_only_shortest.nwk
- NCBI taxonomy tree (http://www.ncbi.nlm.nih.gov/guide/taxonomy/)
- 250,000 species
- available as an SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ (see the README file)
- the converted tree in Newick format is available from http://itol.embl.de/other_trees.shtml, as follows
- File:Ncbi complete collapsed with names.tre: complete tree, using scientific names; internal nodes with only one child are removed. This is a newick tree, but can only upload to wiki as .tre. Switch back to .newick for github.
- Note: This tree has every taxon in NCBI, including ones like "Insertion_sequence_IS2" and "Plasmid_pHV2". For Phylotastic, we would probably want to generate a tree that is some subset of this.
- manually curated
- FYI, NCBI provides an interactive way to get a tree phylotastically (http://www.ncbi.nlm.nih.gov/guide/howto/gen-com-tree/)
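The two bootstrap layouts noted for the Peters et al. files differ only in where the support value sits relative to the branch length. A minimal sketch of converting the "(<node contents>)<support>:<length>" layout into the "(<node contents>):<length>[<support>]" layout follows. It assumes that every purely numeric label directly after a closing parenthesis is a support value and that labels are unquoted; the function name is hypothetical, and this is not the procedure actually used on the files above.

```python
import re

# A numeric label right after ")" followed by ":" and a branch length,
# e.g. ")95:0.3" or ")100:1.5e-2".
_SUPPORT_BEFORE_LENGTH = re.compile(
    r"\)(\d+(?:\.\d+)?):(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def move_support_after_length(newick: str) -> str:
    """Rewrite ')<support>:<length>' as '):<length>[<support>]'."""
    return _SUPPORT_BEFORE_LENGTH.sub(r"):\2[\1]", newick)


if __name__ == "__main__":
    # Toy tree with made-up tip names, for illustration only.
    print(move_support_after_length("((A:0.1,B:0.2)95:0.3,C:0.4);"))
```

Strings already in the bracketed layout pass through unchanged, because the pattern requires a digit immediately after the closing parenthesis. Internal nodes that carry numeric names rather than support values would be rewritten incorrectly, so the output needs checking against the publication.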
Not annotated yet
- Megaphylogeny of 800+ living and fossil families of fishes, Westneat and Lundberg unpublished
- file: NEXUS format including Mesquite extensions, family level with other higher-taxa labelled (File:Westneat Lundberg BigFishTree.nex)
- not yet annotated, because it is hard to get the metadata from just the nexus file.
- URI: http://phylotastic.org/data/Westneat_Lundberg_BigFishTree.nex
Hackathon plan
- develop plan (day 1)
- revise as needed
- some work is done in parallel
- main workflow
- identify 10 trees for use as phylotastic source trees
- annotate them in free-text form
- create web form in Google docs for input of annotations, based on MIAPA draft checklist from TDWG 2011 workshop
- Spreadsheet has pull-down menus, plus options for free-text entries under "other"
- transform annotations into formal-language statements in RDF
- encoding process is iterative with ontology editing
- Hilmar is working on language support
- Joachim is working on the technology for getting this into a triplestore
- Get URI for tree from TreeStore, add annotations to that URI in Protege
- Load trees into TreeStore
- Will need to have trees in the correct format
- execute queries to demonstrate success
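The step of turning form-based annotations into RDF statements attached to a tree URI can be sketched as plain N-Triples output. The vocabulary namespace `http://example.org/miapa-sketch#` and the property names (`treeFormat`, `hasBranchLengths`) are placeholders invented for this sketch; the real terms would come from the MIAPA ontology work, and the real workflow goes through Protege and the TreeStore rather than a script like this.

```python
def ntriples_literal(value) -> str:
    """Render a value as a plain N-Triples string literal with basic escaping."""
    text = str(value)
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def annotations_to_ntriples(tree_uri: str,
                            annotations: dict,
                            vocab: str = "http://example.org/miapa-sketch#") -> str:
    """Emit one triple per annotation, using the tree URI as the subject.

    `vocab` is a placeholder namespace, not a published ontology.
    """
    lines = []
    for key, value in sorted(annotations.items()):
        lines.append(f"<{tree_uri}> <{vocab}{key}> {ntriples_literal(value)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # The URI and the two facts are taken from the Bininda-Emonds entry above;
    # the property names are hypothetical.
    print(annotations_to_ntriples(
        "http://phylotastic.org/data/Bininda-emonds_2007_mammals.nex",
        {"treeFormat": "NEXUS", "hasBranchLengths": "true"},
    ))
```

Every value is written as an untyped string literal, which keeps the sketch small. Annotations that point at other resources (publications, software, naming authorities) would be better expressed as URIs once the ontology terms are settled.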
Log and accomplishments
- initial plan (day 1)
- initial MIAPA checklist-based input form (day 1)
- revised input form
- plan for (temporarily) storing trees and matrices (data) separate from metadata
Lessons learned from tree-finding and annotation
- data frequently is not readily accessible online
- e.g., for Jetz, one must go to birdtree.org
- e.g., we obtained hymenoptera tree by pers. communication from Peters
- only one of the studies has trees in TreeBASE (GEBA)
- another study has a tree in Dryad (Smith)
- authors often don't provide minimal information explicitly
- whether or not a tree is rooted is usually not explicit
- e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
- e.g., Peters, et al 2011 never invokes the term "root" but we infer rooting from the description of outgroups not present in the final tree
- the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
- a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
- e.g., Goloboff molecules only vs. molecules with morphology
- e.g., Bininda-Emonds best dates vs min dates vs max dates
- sometimes a study has many trees that all represent outputs of the same method
- e.g., Jetz provide a large sample from the posterior distribution
- process of constructing tree does not follow sequences -> alignment -> tree
- e.g., supertree method in Bininda-Emonds
- e.g., hand-crafted APG, Smith, trees
- process of constructing tree cannot be condensed easily
- e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
- GEBA tree, { missing explanation }
- alignments may be partitioned (e.g., Smith et al.), but MIAPA refers to "model of evolution" as though there were only one, whereas in Peters et al. there are 2 models for 2 partitions
- clustering to define orthologs not included in checklist, but seems important
- Smith phlawd, alignment: no pre-orthology.
- mixed data is common
- Goloboff has morpho and molecular
- multiple studies have DNA (e.g., SSU rDNA) and protein sequences
- concatenated alignments are common, e.g., multiple proteins
- this means the accession:OTU mapping is not 1:1 but many-to-one
- not encountered in our inputs but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
- many important trees do not have branch lengths
- e.g., APGIII is a taxonomic framework
- e.g., some supertrees don't have branch lengths
- do binomials count as meaningful external identifiers for OTUs?
- in some cases, the methods make clear that these come from a specific source
- e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
- e.g., Bininda-Emonds publication declares that naming authority is Wilson & Reeder (Mammal Species of the World)
- e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
- usually the naming authority is not clear
- was any study straightforward?
- OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
- OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
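The many-to-one accession:OTU mapping noted above for concatenated alignments suggests storing a list of accessions per OTU rather than a single external identifier. A minimal sketch of that structure follows; the function names are hypothetical, and any accession strings passed in are whatever identifiers the source study provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_accessions_by_otu(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect (accession, otu) pairs into an OTU -> [accessions] mapping."""
    by_otu = defaultdict(list)
    for accession, otu in pairs:
        by_otu[otu].append(accession)
    return dict(by_otu)


def otus_with_multiple_accessions(by_otu: Dict[str, List[str]]) -> List[str]:
    """List the OTUs whose data come from more than one accession."""
    return sorted(otu for otu, accessions in by_otu.items() if len(accessions) > 1)
```

A checklist field that expects one identifier per OTU could use the second function to flag the OTUs where that expectation does not hold.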