Advancing MIAPA

From Evolutionary Interoperability and Outreach
Revision as of 23:03, 31 January 2013 by Hilmar

Factors shaping our conception of source-tree annotations

  1. what folks in the evoinfo community believe is the Minimal Information About a Phylogenetic Analysis (MIAPA)
    • The current synopsis of this is the MIAPA checklist from the 2011 TDWG meeting.
  2. need to support assignment of credit (blame) for tree-producers and phylotastic service-providers
  3. need to support licensing that protects creators, resource-providers and end-users
  4. need to contribute to a credible provenance report for phylotastic-generated trees
    • e.g., a tree might be returned with information as follows (free text form): "This tree was obtained on Jan 29, 2013. An input list of 58 names was submitted to Taxosaurus, resulting in 45 valid species binomials (list). This list was sent to a pruner with instructions to prune out the indicated species from the phylogeny of Bininda-Emonds et al. 2007. The resulting sub-tree with 40 species was scaled using the DateLife service."
  5. support the most common user criteria for phylotastic searching
    • limits on source trees
      • return tree with maximal coverage of list ::= { species }
        • specify namespace of <list>
      • restrict by publication status (published or not)
      • restrict by publication year (e.g., no trees older than 5 years)
      • restrict by author (e.g., author = bininda-emonds)
      • restrict by other citation information (e.g., jrnl = index fungorum)
      • restrict by method
        • use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted. . . }
        • use (exclude) trees made with evolutionary model = { }
        • use (exclude) trees made with software = { RAxML, BEAST, PAUP*, . . . }
        • other restrictions (e.g., supertree, . . . )
      • restrict by availability of source data such as character matrix
        • restrict by minimum number of characters in matrix
      • restrict by type of source data = { molecular, morphological, mixed }
      • restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
      • require feature
        • require support values
        • require rooting
        • require branch lengths
        • require fully resolved
        • require other features
    • limits on phylotastic manipulations
      • disallow species substitution
      • disallow grafting of source trees
      • scaling: provide median age estimates
      • scaling: provide only lower age estimates
      • TNRS: prohibit fuzzy matching
      • TNRS: use matches from source = { NCBI, ITIS, etc. }
  6. support adequate annotation of the provenance of the resulting phylotastic tree
    • imagine that you have to publish an analysis using a phylogenetic tree: how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
    • citation information
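
The source-tree limits in item 5 can be read as a filter followed by a coverage ranking. The following is a minimal sketch under stated assumptions, not an actual phylotastic API: the `SourceTree` and `Criteria` records, their field names, and the three example trees are all hypothetical, and only a few of the listed restrictions are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceTree:
    """Hypothetical annotation record for one source tree."""
    label: str
    species: frozenset
    year: int
    published: bool = True
    method_class: str = "unknown"
    rooted: bool = False
    branch_lengths: bool = False


@dataclass(frozen=True)
class Criteria:
    """User limits on source trees; None or False means 'no restriction'."""
    min_year: Optional[int] = None
    published_only: bool = False
    excluded_methods: frozenset = frozenset()
    require_rooting: bool = False
    require_branch_lengths: bool = False


def passes(tree, c):
    """Return True if the tree satisfies every active restriction."""
    if c.min_year is not None and tree.year < c.min_year:
        return False
    if c.published_only and not tree.published:
        return False
    if tree.method_class in c.excluded_methods:
        return False
    if c.require_rooting and not tree.rooted:
        return False
    if c.require_branch_lengths and not tree.branch_lengths:
        return False
    return True


def best_tree(trees, names, criteria):
    """Return the eligible tree covering the most names, or None."""
    eligible = [t for t in trees if passes(t, criteria)]
    if not eligible:
        return None
    return max(eligible, key=lambda t: len(t.species & names))


# Made-up records for illustration; they do not describe any real study.
trees = [
    SourceTree("A", frozenset({"sp1", "sp2", "sp3"}), 2007,
               method_class="supertree", rooted=True, branch_lengths=True),
    SourceTree("B", frozenset({"sp1", "sp2"}), 2012,
               method_class="likelihood", rooted=True, branch_lengths=True),
    SourceTree("C", frozenset({"sp1", "sp2", "sp3", "sp4"}), 2011,
               published=False, method_class="Bayesian"),
]
names = frozenset({"sp1", "sp2", "sp3", "sp4"})

unrestricted = best_tree(trees, names, Criteria())
published = best_tree(trees, names, Criteria(published_only=True))
recent = best_tree(trees, names, Criteria(min_year=2010, published_only=True))
print(unrestricted.label, published.label, recent.label)
```

Each restriction only works if the corresponding annotation exists for every source tree, which is why the search criteria bear directly on what the checklist should require.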

Lessons learned from tree-finding and annotation

  • data are frequently not readily accessible online
    • e.g., for Jetz, one must go to birdtree.org
    • e.g., we obtained the Hymenoptera tree by personal communication from Peters
    • only one of the studies has trees in TreeBASE (GEBA)
    • another study has a tree in Dryad (Smith)
  • authors often don't provide minimal information explicitly
    • whether or not a tree is rooted is usually not explicit
      • e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
      • e.g., Peters et al. 2011 never invokes the term "root" but we infer rooting from the description of outgroups not present in the final tree
    • the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
  • a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
    • e.g., Goloboff molecules only vs. molecules with morphology
    • e.g., Bininda-Emonds best dates vs min dates vs max dates
  • sometimes a study has many trees that all represent outputs of the same method
    • e.g., Jetz provide a large sample from the posterior distribution
  • the process of constructing a tree does not necessarily follow sequences -> alignment -> tree
    • e.g., supertree method in Bininda-Emonds
    • e.g., hand-crafted trees (APG, Smith)
  • process of constructing tree cannot be condensed easily
    • e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
    • GEBA tree, { missing explanation }
    • partitioned alignment (e.g., Smith et al.), but MIAPA implies "model of evolution" as though there were only one, whereas in Peters et al. there are 2 models for 2 partitions
  • clustering to define orthologs not included in checklist, but seems important
    • e.g., Smith: PHLAWD alignment, with no pre-orthology step
  • mixed data is common
    • Goloboff has morpho and molecular
    • multiple studies have DNA (e.g., SSU rDNA) and protein sequences
  • concatenated alignments are common, e.g., multiple proteins
    • this means accession:OTU mapping is not 1:1 but many to one
  • not encountered in our inputs but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
  • many important trees do not have branch lengths
    • e.g., APGIII is a taxonomic framework
    • e.g., some supertrees don't have branch lengths
  • do binomials count as meaningful external identifiers for OTUs?
    • in some cases, the methods make clear that these come from a specific source
      • e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
      • e.g., Bininda-Emonds publication declares that the naming authority is Wilson & Reeder (Mammal Species of the World)
      • e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
    • usually the naming authority is not clear
  • was any study straightforward?
  • OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
  • OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted
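
Two of the lessons above, the many-to-one accession:OTU mapping in concatenated alignments and the need for one model per partition, suggest that an annotation record should hold lists and per-partition maps rather than single values. The following is a rough sketch only: the accession strings, OTU labels, partition names, and model names are placeholders, not data from any of the studies mentioned.

```python
from collections import defaultdict

# Each row links one sequence accession to an OTU and an alignment partition.
# All values are placeholders for illustration.
rows = [
    ("ACC0001", "Genus_a species_a", "partition_1"),
    ("ACC0002", "Genus_a species_a", "partition_2"),
    ("ACC0003", "Genus_b species_b", "partition_1"),
]

# One evolutionary model per partition, instead of a single
# "model of evolution" field for the whole analysis.
partition_models = {
    "partition_1": "model_A",
    "partition_2": "model_B",
}


def accessions_by_otu(rows):
    """Group accessions under their OTU (many accessions to one OTU)."""
    mapping = defaultdict(list)
    for accession, otu, _partition in rows:
        mapping[otu].append(accession)
    return dict(mapping)


def models_used(rows, partition_models):
    """Return the sorted set of models applied to partitions that have data."""
    return sorted({partition_models[partition] for _, _, partition in rows})


otu_map = accessions_by_otu(rows)
print(otu_map["Genus_a species_a"])
print(models_used(rows, partition_models))
```

A record shaped this way still handles the simple case (one accession per OTU, one model) without forcing it.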


MIAPA Resources

Annotation