Tree Annotation

From Evolutionary Interoperability and Outreach
Revision as of 03:56, 30 January 2013 by Hilmar (talk | contribs) (→‎Getting started)

Synopsis

Annotate a small set of large trees used as sources of phylogenetic knowledge in an automated delivery system for tree-o-life knowledge called "Phylotastic".

Overview

The current Phylotastic system (which is at a very early stage of development) fails to deliver (1) metadata for the original source trees and (2) any description of further manipulations. Ultimately, Phylotastic won't be useful for research without this kind of information. Documenting sources and methods is not the only reason to have this annotation: the (hypothetical) design of Phylotastic also calls for ways to identify source trees based on their metadata (e.g., a user might want to select a particular source tree, either directly or by specifying search criteria that it satisfies).

An initial attempt at prioritizing

  1. support the most common user criteria for searching a treestore to get the right tree (whatever they are)
    • return tree with maximal coverage of list ::= { species }
      • specify namespace of <list>
    • limits on source trees
      • restrict by publication status (published or not)
      • restrict by publication year (e.g., no trees older than 5 years)
      • restrict by author (e.g., author = bininda-emonds)
      • restrict by other citation information (e.g., jrnl = index fungorum)
      • restrict by method
        • use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted, . . . }
        • use (exclude) trees made with evolutionary model = { }
        • use (exclude) trees made with software = { RAxML, BEAST, PAUP*, . . . }
        • other restrictions (e.g., supertree, . . . )
      • restrict by availability of source data such as character matrix
        • restrict by minimum number of characters in matrix
      • restrict by type of source data = { molecular, morphological, mixed }
      • restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
      • require feature
        • require support values
        • require rooting
        • require branch lengths
        • require other features
    • limits on phylotastic manipulations
      • disallow species substitution
      • disallow grafting of source trees
      • scaling: provide median age estimates
      • scaling: provide only lower age estimates
      • TNRS: prohibit fuzzy matching
      • TNRS: use matches from source = { NCBI, ITIS, etc. }
  2. support adequate annotation of the provenance of the resulting phylotastic tree
    • imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
    • citation information
  3. support the proper assignment of credit (blame) for tree-producers and phylotastic service-providers
  4. support licensing that protects creators, resource-providers and end-users
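
As a rough illustration of priority 1, the sketch below filters a set of source trees by a few of the metadata criteria listed above and then returns the eligible tree with maximal coverage of a species list. It is only a sketch: the `SourceTree` and `Criteria` records, their field names, and the toy trees `tree_a` and `tree_b` are all hypothetical and do not describe any existing Phylotastic or TreeStore interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceTree:
    """Hypothetical metadata record for one source tree."""
    name: str
    tips: frozenset            # species names found at the tips
    year: int                  # publication year
    published: bool
    method: str                # e.g. "parsimony", "supertree"
    has_branch_lengths: bool


@dataclass
class Criteria:
    """Hypothetical user restrictions; unset fields impose no limit."""
    min_year: Optional[int] = None
    published_only: bool = False
    excluded_methods: frozenset = frozenset()
    require_branch_lengths: bool = False


def passes(tree, c):
    """Return True if the tree satisfies every restriction in c."""
    if c.min_year is not None and tree.year < c.min_year:
        return False
    if c.published_only and not tree.published:
        return False
    if tree.method in c.excluded_methods:
        return False
    if c.require_branch_lengths and not tree.has_branch_lengths:
        return False
    return True


def best_tree(trees, species, c):
    """Return the eligible tree covering the most query species, or None."""
    wanted = frozenset(species)
    eligible = [t for t in trees if passes(t, c)]
    if not eligible:
        return None
    return max(eligible, key=lambda t: len(t.tips & wanted))


# Toy data, invented purely for illustration.
trees = [
    SourceTree("tree_a",
               frozenset({"Homo sapiens", "Pan troglodytes", "Mus musculus"}),
               2007, True, "supertree", True),
    SourceTree("tree_b",
               frozenset({"Homo sapiens", "Pan troglodytes"}),
               2012, True, "likelihood", False),
]
query = ["Homo sapiens", "Mus musculus"]

unrestricted = best_tree(trees, query, Criteria())
recent_only = best_tree(trees, query, Criteria(min_year=2010))
```

With no restrictions the tree with the larger overlap is chosen; adding a restriction can change the answer or leave no eligible tree at all, which is why each criterion needs to be recorded in the annotation.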

Resources

MIAPA

Annotation

Getting started

Source trees targeted for annotation

Annotated

  • Tree of Life Web Project Structure, zipped version of proprietary XML format, spans all of life, family level and above (File:TOL.xml.zip)
    • reference is 2007 paper by Maddison, Schulz and Maddison
    • 16K tips
    • needs conversion to Newick or NEXUS format
    • (note: this tree structure in TOL.xml.zip is *old*, it is from October 22, 2006)
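
The conversion to Newick mentioned above could look roughly like the following sketch. The layout of TOL.xml is not described on this page, so the code assumes a simplified, hypothetical XML layout in which each clade is a `node` element with a `name` attribute and nested `node` children; the real file would need its own element and attribute names substituted.

```python
import xml.etree.ElementTree as ET


def to_newick(node):
    """Serialize one <node> element (hypothetical schema) and its descendants."""
    # Newick labels cannot contain unquoted spaces, so use underscores.
    label = node.get("name", "").replace(" ", "_")
    children = [to_newick(child) for child in node.findall("node")]
    if children:
        return "(" + ",".join(children) + ")" + label
    return label


# Tiny stand-in document; NOT the actual TOL.xml structure.
SAMPLE = (
    '<node name="Life">'
    '<node name="Bacteria"/>'
    '<node name="Eukaryotes">'
    '<node name="Fungi"/>'
    '<node name="Animals"/>'
    '</node>'
    '</node>'
)

newick = to_newick(ET.fromstring(SAMPLE)) + ";"
```

The recursion is fine for a small example, but a tree with roughly 16K tips may be deep enough that an iterative traversal is the safer choice.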

Not annotated yet

Hackathon plan

  • develop plan (day 1)
    • revise as needed
    • some work is done in parallel
  • main workflow
  1. identify 10 trees for use as phylotastic source trees
  2. annotate them in free-text form
    • create web form in Google Docs for input of annotations, based on MIAPA draft checklist from TDWG 2011 workshop
    • spreadsheet has pull-down menus, plus options for free-text entries under "other"
  3. transform annotations into formal-language statements in RDF
    • encoding process is iterative with ontology editing
    • Hilmar is working on language support
    • Joachim is working on the technology for getting this into a triplestore
    • Get URI for tree from TreeStore, add annotations to that URI in Protege
  4. Load trees into TreeStore
    • Will need to have trees in the correct format
  5. execute queries to demonstrate success
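
To make step 3 concrete, the sketch below turns a few free-text annotations for one tree into RDF statements in N-Triples syntax, attached to the tree's URI. Everything specific here is an assumption for illustration: the `example.org` tree URI, the vocabulary prefix, and the property names are placeholders, not the terms of the MIAPA ontology or the URIs issued by TreeStore, and the values are stored as plain string literals.

```python
def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a string literal, escaping the characters N-Triples reserves."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return '"' + escaped + '"'


def to_ntriples(tree_uri, annotations, vocab):
    """Emit one triple per annotation, sorted by key for stable output."""
    lines = []
    for key, value in sorted(annotations.items()):
        lines.append(
            iri(tree_uri) + " " + iri(vocab + key) + " " + literal(str(value)) + " ."
        )
    return "\n".join(lines)


# Placeholder identifiers, invented for this example.
TREE_URI = "http://example.org/treestore/tree/1"
VOCAB = "http://example.org/miapa#"

triples = to_ntriples(
    TREE_URI,
    {"rooted": "true", "method": "supertree"},
    VOCAB,
)
```

Statements in this form can be loaded into a triplestore alongside the tree, so that the queries in step 5 can select trees by their annotations.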

Log and accomplishments

  • initial plan (day 1)
  • initial MIAPA checklist-based input form (day 1)
  • revised input form
  • plan for (temporarily) storing trees and matrices (data) separate from metadata

Lessons learned from tree-finding and annotation

  • data frequently is not readily accessible online
    • e.g., for Jetz, one must go to birdtree.org
    • e.g., we obtained the Hymenoptera tree by pers. communication from Peters
    • only one of the studies has trees in TreeBASE (GEBA)
    • another study has a tree in Dryad (Smith)
  • authors often don't provide minimal information explicitly
    • whether or not a tree is rooted is usually not explicit
      • e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
      • e.g., Peters et al. 2011 never invokes the term "root", but we infer rooting from the description of outgroups not present in the final tree
    • the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
  • a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
    • e.g., Goloboff molecules only vs. molecules with morphology
    • e.g., Bininda-Emonds best dates vs min dates vs max dates
  • sometimes a study has many trees that all represent outputs of the same method
    • e.g., Jetz provide a large sample from the posterior distribution
  • process of constructing tree does not follow sequences → alignment → tree
    • e.g., supertree method in Bininda-Emonds
    • e.g., hand-crafted APG, Smith, trees
  • process of constructing tree cannot be condensed easily
    • e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
    • GEBA tree, { missing explanation }
    • partitioned alignment (e.g., Smith et al.), but MIAPA implies "model of evolution" as though there were only one, whereas in Peters et al. there are 2 models for 2 partitions
  • clustering to define orthologs not included in checklist, but seems important
    • Smith phlawd, alignment: no pre-orthology.
  • mixed data is common
    • Goloboff has morpho and molecular
    • multiple studies have DNA (e.g., SSU rDNA) and protein sequences
  • concatenated alignments are common, e.g., multiple proteins
    • this means accession:OTU mapping is not 1:1 but many to one
  • not encountered in our inputs but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
  • many important trees do not have branch lengths
    • e.g., APGIII is a taxonomic framework
    • e.g., some supertrees don't have branch lengths
  • do binomials count as meaningful external identifiers for OTUs?
    • in some cases, the methods make clear that these come from a specific source
      • e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
      • e.g., Bininda-Emonds publication declares that naming authority is Wilson & Reeder (Mammal Species of the World)
      • e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
    • usually the naming authority is not clear
  • was any study straightforward?
  • OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
  • OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
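
Several of these lessons point to the same structural need: an annotation record has to allow several trees per study, one model per alignment partition, features that the authors never stated, and many accessions per OTU. The sketch below is one possible shape for such a record. All class names, field names, and values are hypothetical placeholders, not part of the MIAPA checklist and not taken from any of the studies named above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Partition:
    """One alignment partition with its own evolutionary model."""
    name: str
    model: str


@dataclass
class TreeRecord:
    """One tree from a study; None means the authors did not say."""
    label: str
    method: str
    rooted: Optional[bool] = None
    has_branch_lengths: Optional[bool] = None


@dataclass
class Study:
    citation: str
    trees: list = field(default_factory=list)           # several trees per study
    partitions: list = field(default_factory=list)      # several models per study
    otu_accessions: dict = field(default_factory=dict)  # OTU -> list of accessions


def accession_to_otu(study):
    """Invert the OTU -> accessions map; many accessions map to one OTU."""
    result = {}
    for otu, accessions in study.otu_accessions.items():
        for accession in accessions:
            result[accession] = otu
    return result


# Placeholder values only; none of this is real study data.
study = Study(
    citation="Example study",
    trees=[
        TreeRecord("molecules only", "parsimony"),
        TreeRecord("molecules with morphology", "parsimony", rooted=True),
    ],
    partitions=[Partition("partition_1", "model_A"),
                Partition("partition_2", "model_B")],
    otu_accessions={"Genus_sp": ["ACC0001", "ACC0002"],
                    "Other_sp": ["ACC0003"]},
)

lookup = accession_to_otu(study)
unstated_rooting = [t.label for t in study.trees if t.rooted is None]
```

Keeping "not stated" distinct from "no" matters here, because rooting and branch lengths are often implied rather than declared.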