Advancing MIAPA
Factors shaping our conception of source-tree annotations
- what folks in the evoinfo community believe is the Minimal Information About a Phylogenetic Analysis (MIAPA)
- The current synopsis of this is the MIAPA checklist from the 2011 TDWG meeting.
- need to support assignment of credit (blame) for tree-producers and phylotastic service-providers
- need to support licensing that protects creators, resource-providers and end-users
- need to contribute to a credible provenance report for phylotastic-generated trees
- e.g., a tree might be returned with information as follows (free text form): "This tree was obtained on Jan 29, 2013. An input list of 58 names was submitted to Taxosaurus, resulting in 45 valid species binomials (list). This list was sent to a pruner with instructions to prune out the indicated species from the phylogeny of Bininda-Emonds, et al. 2007. The resulting sub-tree with 40 species was scaled using the DateLife service."
- support the most common user criteria for phylotastic searching
- limits on source trees
- return tree with maximal coverage of list ::= { species }
- specify namespace of <list>
- restrict by publication status (published or not)
- restrict by publication year (e.g., no trees older than 5 years)
- restrict by author (e.g., author = bininda-emonds)
- restrict by other citation information (e.g., jrnl = index fungorum)
- restrict by method
- use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted. . . }
- use (exclude) trees made with evolutionary model = { }
- use (exclude) trees made with software = { RAxML, BEAST, PAUP*, . . . }
- other restrictions (e.g., supertree, . . . )
- restrict by availability of source data such as character matrix
- restrict by minimum number of characters in matrix
- restrict by type of source data = { molecular, morphological, mixed }
- restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
- require feature
- require support values
- require rooting
- require branch lengths
- require fully resolved
- require other features
- limits on phylotastic manipulations
- disallow species substitution
- disallow grafting of source trees
- scaling: provide median age estimates
- scaling: provide only lower age estimates
- TNRS: prohibit fuzzy matching
- TNRS: use matches from source = { NCBI, ITIS, etc. }
- support adequate annotation of the provenance of the resulting phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
- citation information
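To make the search criteria above concrete, here is a minimal sketch of how a few of them (publication status, publication year, method class, required features, maximal coverage of the species list) could be applied to annotated source trees. All class, field, and function names are hypothetical and chosen only to mirror the list above; they do not describe any existing phylotastic service.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout: this only mirrors the criteria listed above.

@dataclass
class SourceTree:
    name: str
    species: frozenset            # names in the namespace of the query list
    year: int                     # publication year
    published: bool = True
    method_class: str = "unknown" # e.g. "parsimony", "supertree"
    has_branch_lengths: bool = False
    is_rooted: bool = False

@dataclass
class Criteria:
    published_only: bool = False
    max_age_years: Optional[int] = None   # e.g. 5 = no trees older than 5 years
    exclude_methods: frozenset = frozenset()
    require_branch_lengths: bool = False
    require_rooting: bool = False

def passes(tree, criteria, current_year):
    """Apply the 'limits on source trees' to one annotated tree."""
    if criteria.published_only and not tree.published:
        return False
    if (criteria.max_age_years is not None
            and current_year - tree.year > criteria.max_age_years):
        return False
    if tree.method_class in criteria.exclude_methods:
        return False
    if criteria.require_branch_lengths and not tree.has_branch_lengths:
        return False
    if criteria.require_rooting and not tree.is_rooted:
        return False
    return True

def best_tree(trees, query_species, criteria, current_year):
    """Return the admissible tree covering the most query species, or None."""
    query = frozenset(query_species)
    admissible = [t for t in trees if passes(t, criteria, current_year)]
    if not admissible:
        return None
    return max(admissible, key=lambda t: len(t.species & query))
```

A filter like this only works if each criterion corresponds to an annotation that is actually present on the source tree, which is why the criteria list doubles as a list of annotation requirements.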
Lessons learned from tree-finding and annotation
- data frequently is not readily accessible online
- e.g., for Jetz one must go to birdtree.org
- e.g., we obtained the Hymenoptera tree by personal communication from Peters
- only one of the studies has trees in TreeBASE (GEBA)
- another study has a tree in Dryad (Smith)
- authors often don't provide minimal information explicitly
- whether or not a tree is rooted is usually not explicit
- e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
- e.g., Peters et al. 2011 never invokes the term "root", but we infer rooting from the description of outgroups not present in the final tree
- the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
- a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
- e.g., Goloboff molecules only vs. molecules with morphology
- e.g., Bininda-Emonds best dates vs min dates vs max dates
- sometimes a study has many trees that all represent outputs of the same method
- e.g., Jetz provide a large sample from the posterior distribution
- the process of constructing a tree does not necessarily follow sequences --> alignment --> tree
- e.g., supertree method in Bininda-Emonds
- e.g., hand-crafted trees (APG, Smith)
- process of constructing tree cannot be condensed easily
- e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
- GEBA tree, { missing explanation }
- partitioned alignments (e.g., Smith et al.): MIAPA implies a "model of evolution" as though there were only one, whereas in Peters et al. there are 2 models for 2 partitions
- clustering to define orthologs not included in checklist, but seems important
- Smith: PHLAWD alignment, no pre-orthology.
- mixed data is common
- Goloboff has morpho and molecular
- multiple studies have DNA (e.g., SSU rDNA) and protein sequences
- concatenated alignments are common, e.g., multiple proteins
- this means the accession:OTU mapping is not 1:1 but many-to-one
- not encountered in our inputs but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
- many important trees do not have branch lengths
- e.g., APGIII is a taxonomic framework
- e.g., some supertrees don't have branch lengths
- do binomials count as meaningful external identifiers for OTUs?
- in some cases, the methods make clear that these come from a specific source
- e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
- e.g., the Bininda-Emonds publication declares that the naming authority is Wilson & Reeder (Mammal Species of the World)
- e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
- usually the naming authority is not clear
- was any study straightforward?
- OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
- OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
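Several of the lessons above are really statements about the shape of an annotation record: rootedness may be stated, inferred, or unknown; a study may have one model per partition rather than a single model; and several accessions may map to one OTU. The sketch below shows one way such a record could be structured. The class and field names are hypothetical, and the values used in the usage note are placeholders, not data from any of the studies mentioned.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record layout, for illustration only.

@dataclass
class Partition:
    name: str
    data_type: str                 # "molecular" or "morphological"
    model: Optional[str] = None    # None when no model applies or none is stated

@dataclass
class Otu:
    label: str                                       # e.g. a binomial
    accessions: list = field(default_factory=list)   # many accessions -> one OTU
    naming_authority: Optional[str] = None           # None when not clear

@dataclass
class TreeAnnotation:
    rooted: Optional[bool] = None       # None = nothing stated or inferred
    rooting_explicit: bool = False      # False = inferred by the annotator
    has_branch_lengths: Optional[bool] = None
    partitions: list = field(default_factory=list)
    otus: list = field(default_factory=list)

    def source_data_type(self):
        """Summarise partitions as 'molecular', 'morphological', 'mixed' or 'none'."""
        kinds = {p.data_type for p in self.partitions}
        if not kinds:
            return "none"
        return kinds.pop() if len(kinds) == 1 else "mixed"

    def models(self):
        """One model per partition instead of a single 'model of evolution'."""
        return {p.name: p.model for p in self.partitions}
```

For example, `TreeAnnotation(rooted=True, rooting_explicit=False, partitions=[Partition("p1", "molecular", "model_1"), Partition("p2", "morphological")])` records an inferred rooting and mixed data; a hand-crafted taxonomic framework would simply leave `partitions` empty.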
MIAPA Resources
- The main MIAPA page. Primarily:
- MIAPA draft checklist from TDWG 2011 workshop.
- Emily McTavish's Tree Annotation Vocabulary as resulting from the Phylotastic I hackathon. Includes a mapping to MIAPA draft checklist attributes.
- Slideshow with some results from the MIAPA community survey orchestrated by the Open Tree of Life project in fall 2012.
- Ontologies
- Development:
- MIAPA repo on Github
- Template development after the TNRS ontology developed at (and after) Phylotastic I.
Annotation
- representation of metadata: NeXML
- some work that Rutger did, mapping the metadata from the ToLWeb XML format onto semantic annotations in NeXML.
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml
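As a rough illustration of what semantic annotations in NeXML look like, the following sketch attaches two literal `meta` annotations to a tree element using Python's standard library. It is not a port of the Perl script linked above. The element ids, the choice of Dublin Core properties, and the placeholder content are assumptions made for the example, and the fragment is not validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
DC = "http://purl.org/dc/elements/1.1/"

# Use readable prefixes instead of ElementTree's generated ns0, ns1, ...
ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)

def add_literal_meta(parent, meta_id, prop, content):
    """Attach a literal (property, content) annotation to a NeXML element."""
    meta = ET.SubElement(parent, f"{{{NEX}}}meta")
    meta.set("id", meta_id)
    meta.set(f"{{{XSI}}}type", "nex:LiteralMeta")
    meta.set("property", prop)     # a CURIE such as dc:creator
    meta.set("content", content)
    return meta

# A bare tree element stands in for a full NeXML document (assumption).
tree = ET.Element(f"{{{NEX}}}tree", {"id": "tree1"})
# The dc prefix appears only inside attribute values, so declare it by hand.
tree.set("xmlns:dc", DC)

add_literal_meta(tree, "meta1", "dc:creator", "placeholder author")
add_literal_meta(tree, "meta2", "dc:date", "2013-01-29")

xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

The same pattern extends to MIAPA attributes once each one is mapped to a term in a suitable vocabulary, with the term's CURIE used as the `property` value.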