Advancing MIAPA

From Evolutionary Interoperability and Outreach
Revision as of 00:32, 3 February 2013 by Hilmar (talk | contribs) (Hilmar moved page AdvancingMIAPA to Advancing MIAPA)

Factors shaping our conception of source-tree annotations

  1. what folks in the evoinfo community believe is the Minimal Information About a Phylogenetic Analysis (MIAPA)
    • The current synopsis of this is the MIAPA checklist from the 2011 TDWG meeting.
  2. need to support assignment of credit (blame) for tree-producers and phylotastic service-providers
  3. need to support licensing that protects creators, resource-providers and end-users
  4. need to contribute to a credible provenance report for phylotastic-generated trees
    • e.g., a tree might be returned with information as follows (free text form): "This tree was obtained on Jan 29, 2013. An input list of 58 names was submitted to Taxosaurus, resulting in 45 valid species binomials (list). This list was sent to a pruner with instructions to prune out the indicated species from the phylogeny of Bininda-Emonds et al. 2007. The resulting sub-tree with 40 species was scaled using the DateLife service."
  5. support the most common user criteria for phylotastic searching
    • limits on source trees
      • return tree with maximal coverage of list ::= { species }
        • specify namespace of <list>
      • restrict by publication status (published or not)
      • restrict by publication year (e.g., no trees older than 5 years)
      • restrict by author (e.g., author = bininda-emonds)
      • restrict by other citation information (e.g., jrnl = index fungorum)
      • restrict by method
        • use (exclude) trees made with method class = { parsimony, likelihood, Bayesian, distance, supertree, supermatrix, hand-crafted, . . . }
        • use (exclude) trees made with evolutionary model = { }
        • use (exclude) trees made with software = { RAxML, BEAST, PAUP*, . . . }
        • other restrictions (e.g., supertree, . . . )
      • restrict by availability of source data such as character matrix
        • restrict by minimum number of characters in matrix
      • restrict by type of source data = { molecular, morphological, mixed }
      • restrict by annotation level = { platinum, gold, silver, bronze, polystyrene }
      • require feature
        • require support values
        • require rooting
        • require branch lengths
        • require fully resolved
        • require other features
    • limits on phylotastic manipulations
      • disallow species substitution
      • disallow grafting of source trees
      • scaling: provide median age estimates
      • scaling: provide only lower age estimates
      • TNRS: prohibit fuzzy matching
      • TNRS: use matches from source = { NCBI, ITIS, etc. }
  6. support adequate annotation of the provenance of the resulting phylotastic tree
    • imagine that you have to publish an analysis using a phylogenetic tree-- how do you describe it? "The phylogeny of 40 mammal species was obtained via phylotastic (phylotastic.org
    • citation information
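The source-tree limits in item 5 can be read as a set of optional filters applied before picking the tree with maximal coverage of the input list. The following is a minimal sketch of that reading, not an existing phylotastic API: the record fields, the `Criteria` class, and the function names are all hypothetical, and only a handful of the criteria listed above are covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceTree:
    """Hypothetical annotation record for one source tree."""
    label: str
    species: frozenset                   # names, assumed to share one namespace
    year: Optional[int] = None           # None is taken to mean "unpublished"
    method_class: Optional[str] = None   # e.g. "parsimony", "supertree"
    has_branch_lengths: bool = False
    is_rooted: bool = False


@dataclass
class Criteria:
    """A few of the user criteria from the list above; all are optional."""
    min_year: Optional[int] = None
    exclude_methods: frozenset = frozenset()
    require_branch_lengths: bool = False
    require_rooting: bool = False


def passes(tree, criteria):
    """Return True if the tree satisfies every criterion that is set."""
    if criteria.min_year is not None:
        if tree.year is None or tree.year < criteria.min_year:
            return False
    if tree.method_class in criteria.exclude_methods:
        return False
    if criteria.require_branch_lengths and not tree.has_branch_lengths:
        return False
    if criteria.require_rooting and not tree.is_rooted:
        return False
    return True


def best_tree(trees, query, criteria):
    """Among trees that pass, return the one covering most of the query list."""
    candidates = [t for t in trees if passes(t, criteria)]
    if not candidates:
        return None
    return max(candidates, key=lambda t: len(t.species & query))


# Invented placeholder records, not real annotations.
trees = [
    SourceTree("tree_A", frozenset({"sp1", "sp2", "sp3"}), year=2007,
               method_class="supertree", has_branch_lengths=True, is_rooted=True),
    SourceTree("tree_B", frozenset({"sp1", "sp2", "sp3", "sp4"}), year=2012,
               method_class="parsimony", is_rooted=True),
]
query = frozenset({"sp1", "sp2", "sp4"})
chosen = best_tree(trees, query, Criteria(require_branch_lengths=True))
print(chosen.label if chosen else "no tree satisfies the criteria")
```

One consequence of treating every criterion as a hard filter is that a strict request can return nothing at all, so a real service would probably need to report which criterion eliminated the candidates.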

Suggestions for MIAPA ontology

From annotation session on afternoon of 1/31.

  • It would be good to generate an instance of useMaximumLikelihood ("Maximum Likelihood algorithm") in MIAPA, so we don't have to create one for each annotation. Filed as Issue #8
  • Alternatively, maybe make classes of software (like PhyML or RAxML) implement ML algorithm, rather than having to assert it for each instance we create. Some software can use multiple algorithms, so we can't do this for each case.
    • Note that in OWL, classes cannot be asserted to have property values; only instances can. We can put property restrictions with existential quantification on classes, and an OWL reasoner could then infer that an instance must have at least one such property association (and thus a DL query should in principle return the instance). However, this would not work in an RDF triple store, so we could not actually query for these things in SPARQL.
    • Note also that there can be multiple swo:implements assertions for a software instance, so multiple algorithms can be easily asserted. However, this wouldn't then also say which of those implemented algorithms was the one utilized for the generation of the tree or alignment. The idea is that this would be evident from the miapa:'Parameter specification'.
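The last point can be illustrated with a toy set of triples held as plain Python tuples. This is only a sketch: `swo:implements` is the property named in the notes above, while every `ex:` identifier (including the properties linking an analysis to its software and to its parameter specification) is a hypothetical placeholder, not a term from MIAPA or SWO.

```python
# Asserted triples as (subject, predicate, object) tuples.
# All ex: names are hypothetical placeholders.
TRIPLES = {
    ("ex:software1", "swo:implements", "ex:MaximumLikelihood"),
    ("ex:software1", "swo:implements", "ex:Parsimony"),
    ("ex:analysis1", "ex:usedSoftware", "ex:software1"),
    ("ex:analysis1", "ex:hasParameterSpecification", "ex:params1"),
    ("ex:params1", "ex:specifiesAlgorithm", "ex:MaximumLikelihood"),
}


def objects(triples, subject, predicate):
    """Return all objects asserted for the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def implemented_algorithms(triples, analysis):
    """Algorithms implemented by the software used; may be more than one."""
    found = set()
    for software in objects(triples, analysis, "ex:usedSoftware"):
        found |= objects(triples, software, "swo:implements")
    return found


def algorithms_used(triples, analysis):
    """Algorithms named by the analysis's own parameter specification."""
    found = set()
    for spec in objects(triples, analysis, "ex:hasParameterSpecification"):
        found |= objects(triples, spec, "ex:specifiesAlgorithm")
    return found


print(sorted(implemented_algorithms(TRIPLES, "ex:analysis1")))
print(sorted(algorithms_used(TRIPLES, "ex:analysis1")))
```

The first lookup returns every algorithm the software implements, so it cannot by itself identify the one that produced the tree; the second lookup narrows this down only because the parameter specification carries an explicit assertion. Both lookups match asserted triples only, which mirrors the concern above about class-level restrictions not being visible to a triple-store query.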

Lessons learned from tree-finding and annotation

  • data frequently is not readily accessible online
    • e.g., for Jetz, one must go to birdtree.org
    • e.g., we obtained the Hymenoptera tree by personal communication from Peters
    • only one of the studies has trees in TreeBASE (GEBA)
    • another study has a tree in Dryad (Smith)
  • authors often don't provide minimal information explicitly
    • whether or not a tree is rooted is usually not explicit
      • e.g., taxonomic hierarchies (APG, ToLWeb, NCBI) imply rootedness
      • e.g., Peters et al. 2011 never invokes the term "root", but we infer rooting from the description of outgroups not present in the final tree
    • the methodology is largely unexplained for curated or hand-crafted taxonomic frameworks like APGIII or NCBI
  • a study frequently has several trees by slightly different methods, but the checklist seems to imply one method
    • e.g., Goloboff molecules only vs. molecules with morphology
    • e.g., Bininda-Emonds best dates vs min dates vs max dates
  • sometimes a study has many trees that all represent outputs of the same method
    • e.g., Jetz provide a large sample from the posterior distribution
  • process of constructing tree does not follow sequences --> alignment --> tree
    • e.g., supertree method in Bininda-Emonds
    • e.g., hand-crafted trees (APG, Smith)
  • process of constructing tree cannot be condensed easily
    • e.g., Goloboff, iterative procedure with divide-and-conquer search to find parsimony tree ("tree fuse" & "sector" search repeated)
    • GEBA tree, { missing explanation }
    • partitioned alignment (e.g., Smith et al.), but MIAPA implies "model of evolution" as though there were only one, whereas in Peters et al. there are 2 models for 2 partitions
  • clustering to define orthologs not included in checklist, but seems important
    • Smith phlawd, alignment: no pre-orthology.
  • mixed data is common
    • Goloboff has morpho and molecular
    • multiple studies have DNA (e.g., SSU rDNA) and protein sequences
  • concatenated alignments are common, e.g., multiple proteins
    • this means accession:OTU mapping is not 1:1 but many to one
  • not encountered in our inputs but sometimes the OTU is <genus_sp> and the data are fused from multiple species (this is common in MorphoBank)
  • many important trees do not have branch lengths
    • e.g., APGIII is a taxonomic framework
    • e.g., some supertrees don't have branch lengths
  • do binomials count as meaningful external identifiers for OTUs?
    • in some cases, the methods make clear that these come from a specific source
      • e.g., Goloboff names clearly come directly from NCBI via their bioinfo pipeline
      • e.g., Bininda-Emonds publication declares that naming authority is Wilson & Reeder (Mammal Species of the World)
      • e.g., NCBI taxonomy comes as a database dump with taxids and synonyms, so it represents its own authority
    • usually the naming authority is not clear
  • was any study straightforward?
  • OTUs checklist question may be redundant: why have external identifier and then ask for collections information?
  • OTU external refs in checklist: not applicable for supertree methods, consensus methods, hand-crafted

MIAPA Resources

Annotation