Advancing MIAPA

Factors shaping our conception of source-tree annotations

what folks in the evoinfo community believe is the Minimal Information About a Phylogenetic Analysis (MIAPA)
- The current synopsis of this is the MIAPA checklist from the 2011 TDWG meeting.
need to support assignment of credit (blame) for tree-producers and phylotastic service-providers
need to support licensing that protects creators, resource-providers and end-users
need to contribute to a credible provenance report for phylotastic-generated trees
- e.g., a tree might be returned with information as follows (free-text form): "This tree was obtained on Jan 29, 2013. An input list of 58 names was submitted to Taxosaurus, resulting in 45 valid species binomials (list). This list was sent to a pruner with instructions to prune out the indicated species from the phylogeny of Bininda-Emonds, et al. 2007. The resulting sub-tree with 40 species was scaled using the DateLife service."
support the most common user criteria for phylotastic searching
- limits on source trees
  - return tree with maximal coverage of list ::= { species }
    - specify namespace of <list>
  - restrict by publication status (published or not)
  - restrict by publication year (e.g., no trees older than 5 years)
  - restrict by author (e.g., author = bininda-emonds)
  - restrict by other citation information (e.g., journal = Index Fungorum)
  - restrict by method
    - use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted, ... }
    - use (exclude) trees made with evolutionary model = { }
    - use (exclude) trees made with software = { RAxML, BEAST, PAUP*, ... }
    - other restrictions (e.g., supertree, ...)
  - restrict by availability of source data such as character matrix
    - restrict by minimum number of characters in matrix
  - restrict by type of source data = { molecular, morphological, mixed }
  - restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
  - require feature
    - require support values
    - require rooting
    - require branch lengths
    - require fully resolved
    - require other features
- limits on phylotastic manipulations
  - disallow species substitution
  - disallow grafting of source trees
  - scaling: provide median age estimates
  - scaling: provide only lower age estimates
  - TNRS: prohibit fuzzy matching
  - TNRS: use matches from source = { NCBI, ITIS, etc. }
support adequate annotation of the provenance of the resulting phylotastic tree
- imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
- citation information
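
The search criteria listed above could be checked against per-tree annotations roughly as follows. This is a minimal sketch, not an existing phylotastic API: `TreeAnnotation`, `Criteria`, `passes` and `best_tree` are hypothetical names, and only a handful of the criteria (publication status, year, method class, branch lengths, rooting, coverage of the query list) are modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class TreeAnnotation:
    """Hypothetical per-tree metadata record; field names are illustrative."""
    tree_id: str
    species: FrozenSet[str]
    year: Optional[int] = None
    published: bool = False
    method_class: Optional[str] = None   # e.g., "parsimony", "supertree"
    has_branch_lengths: bool = False
    is_rooted: Optional[bool] = None     # None = not stated by the authors


@dataclass(frozen=True)
class Criteria:
    """Hypothetical user request: a query list plus optional restrictions."""
    query: FrozenSet[str]
    published_only: bool = False
    min_year: Optional[int] = None
    excluded_methods: FrozenSet[str] = frozenset()
    require_branch_lengths: bool = False
    require_rooting: bool = False


def passes(tree: TreeAnnotation, c: Criteria) -> bool:
    """Return True if the tree satisfies every restriction in c."""
    if c.published_only and not tree.published:
        return False
    if c.min_year is not None and (tree.year is None or tree.year < c.min_year):
        return False
    if tree.method_class in c.excluded_methods:
        return False
    if c.require_branch_lengths and not tree.has_branch_lengths:
        return False
    # Unstated rooting (None) is treated as failing the requirement.
    if c.require_rooting and tree.is_rooted is not True:
        return False
    return True


def best_tree(trees: List[TreeAnnotation], c: Criteria) -> Optional[TreeAnnotation]:
    """Among trees that pass, return the one covering most of the query list."""
    candidates = [t for t in trees if passes(t, c)]
    if not candidates:
        return None
    return max(candidates, key=lambda t: len(t.species & c.query))
```

Treating an unstated value as a failure (for year and rooting) is a design choice made for this sketch; a service could just as well report such trees separately as "unknown".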

Lessons learned from tree-finding and annotation

data are frequently not readily accessible online
- e.g., for Jetz, one must go to birdtree.org
- e.g., we obtained the hymenoptera tree by pers. communication from Peters
- only one of the studies has trees in TreeBASE (GEBA)
- another study has a tree in Dryad (Smith)
authors often don't provide minimal information explicitly
- whether or not a tree is rooted is usually not explicit
  - e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
  - e.g., Peters, et al. 2011 never invokes the term "root" but we infer rooting from the description of outgroups not present in the final tree
- the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
- e.g., Goloboff molecules only vs. molecules with morphology
- e.g., Bininda-Emonds best dates vs min dates vs max dates
sometimes a study has many trees that all represent outputs of the same method
- e.g., Jetz provide a large sample from the posterior distribution
process of constructing a tree does not always follow sequences --> alignment --> tree
- e.g., supertree method in Bininda-Emonds
- e.g., the hand-crafted APG and Smith trees
process of constructing tree cannot be condensed easily
- e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
- GEBA tree, { missing explanation }
- partitioned alignment (e.g., Smith, et al.), but MIAPA implies "model of evolution" as though there were only one, whereas in Peters, et al. there are 2 models for 2 partitions
clustering to define orthologs not included in checklist, but seems important
- Smith (PHLAWD alignment): no pre-orthology step.
mixed data is common
- Goloboff has morpho and molecular
- multiple studies have DNA (e.g., SSU rDNA) and protein sequences
concatenated alignments are common, e.g., multiple proteins
- this means accession:OTU mapping is not 1:1 but many to one
not encountered in our inputs, but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
many important trees do not have branch lengths
- e.g., APGIII is a taxonomic framework
- e.g., some supertrees don't have branch lengths
do binomials count as meaningful external identifiers for OTUs?
- in some cases, the methods make clear that these come from a specific source
  - e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
  - e.g., the Bininda-Emonds publication declares that the naming authority is Wilson & Reeder (Mammal Species of the World)
  - e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
- usually the naming authority is not clear
was any study straightforward?
OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
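
Several of the lessons above (one model per partition rather than one per study, mixed data, many-to-one accession:OTU mapping, OTUs fused from multiple species) suggest a record shape along these lines. This is only a sketch under assumed names: `Partition`, `Otu` and `AnalysisRecord` are hypothetical, not part of the MIAPA checklist or any existing vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Partition:
    """One alignment partition with its own model (illustrative fields)."""
    name: str
    data_type: str   # e.g., "dna", "protein", "morphology"
    model: str       # model of evolution applied to this partition


@dataclass
class Otu:
    """An OTU that may draw on several accessions and several species."""
    label: str
    accessions: List[str] = field(default_factory=list)
    source_species: List[str] = field(default_factory=list)

    def is_composite(self) -> bool:
        # True when data were fused from more than one species.
        return len(set(self.source_species)) > 1


@dataclass
class AnalysisRecord:
    partitions: List[Partition]
    otus: List[Otu]

    def models(self) -> List[str]:
        # A list rather than a single "model of evolution" value.
        return sorted({p.model for p in self.partitions})

    def data_type(self) -> str:
        kinds = {p.data_type for p in self.partitions}
        if not kinds:
            return "unknown"
        return kinds.pop() if len(kinds) == 1 else "mixed"

    def accession_to_otu(self) -> Dict[str, str]:
        # Many accessions can map to the same OTU label.
        return {acc: otu.label for otu in self.otus for acc in otu.accessions}
```

For hand-crafted trees, supertrees and consensus trees, `partitions` and `accessions` would simply be empty, which matches the observation that those checklist items are not applicable there.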

MIAPA Resources

The main MIAPA page. Primarily:
- MIAPA draft checklist from TDWG 2011 workshop.
- Emily McTavish's Tree Annotation Vocabulary as resulting from the Phylotastic I hackathon. Includes a mapping to MIAPA draft checklist attributes.
- Slideshow with some results from the MIAPA community survey orchestrated by the Open Tree of Life project in fall 2012.
Ontologies
- phylont from Maryam Panahiazar, et al
- CDAO
Development:
- MIAPA repo on Github
- Template development after the TNRS ontology developed at (and after) Phylotastic I.

Annotation

representation of metadata: NeXML
Rutger did some work mapping the metadata from the ToLWeb XML format onto semantic annotations in NeXML:
- ToLWeb XML described here: http://tolweb.org/tree/home.pages/downloadtree.html
- a simple script that does the conversion: https://github.com/ncbnaturalis/bio-phylo/blob/master/experimental/tolconvert.pl
- Here's an example input file: https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol.xml
- Here's the resulting output file (indented): https://raw.github.com/ncbnaturalis/bio-phylo/master/experimental/tol-nexml-pp.xml
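
To give a sense of what a semantic annotation in NeXML looks like, the sketch below emits a single literal `meta` element as a string. It is not taken from the conversion script linked above: `literal_meta` is a hypothetical helper, the property name in the usage line is only an example, and the fragment assumes the enclosing NeXML document declares the `nex`, `xsi` and `xsd` prefixes plus the prefix used in the property name.

```python
from xml.sax.saxutils import quoteattr


def literal_meta(meta_id: str, prop: str, content: str) -> str:
    """Build one NeXML-style literal annotation as an XML fragment.

    Assumes the surrounding document declares the namespace prefixes
    used here (nex, xsi, xsd, and whatever prefix `prop` carries).
    """
    return (
        '<meta xsi:type="nex:LiteralMeta"'
        f" id={quoteattr(meta_id)}"
        f" property={quoteattr(prop)}"
        ' datatype="xsd:string"'
        f" content={quoteattr(content)}/>"
    )


# Example usage with an illustrative Dublin Core property.
print(literal_meta("meta1", "dc:description", "tree obtained by pers. communication"))
```

Attached to a tree or OTU element, an annotation of this kind is one way checklist attributes such as rooting status or naming authority could be carried along with the tree.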
