Phylotastic

From Evolutionary Interoperability and Outreach
Revision as of 22:55, 28 January 2013 by Hilmar

Phylotastic is a project to enable convenient, computable, credible access to the Tree of Life comprising expert knowledge of phylogeny: the species tree you want, in ready-to-use form, when you want it. It was started by a NESCent working group called HIP (Hackathons, Interoperability and Phylogenetics). The first Phylotastic hackathon took place at NESCent from June 4-8, 2012. Phylotastic2 is happening at iPlant from January 28 through February 1, 2013.

Before the hackathon

  1. Add your name (and photo!) to the list of participants
  2. Have a suggestion for making Phylotastic better? Suggest a project idea below. Take note of what has already been done; improving is often more efficient than rewriting.
  3. Something you want to learn in order to be more productive at the hackathon? Add an idea to the list of potential boot camps below
  4. Check out the list of resources. Add things you think might be relevant
  5. Review the material from the first hackathon

Subgroups

  1. Tree Annotation

Initial Pitches

  • Tree Annotation (Arlin)
    • What is the provenance of the tree returned by Phylotastic? Metadata: what methods of analysis were used, etc.
      • start with trees, move to freetext annotations from manuscript, then formalize them according to an ontology and feed into TreeStore
  • First draft of MIAPA ontology (Hilmar)
    • follow from Arlin's idea
    • MIAPA = Minimum reporting standard for phylogenetic analysis
    • initial idea from Leebens-Mack
    • on and off ideas - Vocamp in October 2011 produced a first checklist
    • there has been a community survey that is now being processed
  • TNRS - Name Validation (Gaurav)
    • easy to use interface
    • matching with Google Refine
    • not only match and recognize match, but suggest higher taxa
    • process easily long lists of names (1000)
  • Treestore (Ben)
    • implement the queries by Piel et al on the Treestore
  • Architecture and specifications (Rutger)
    • should provide a high level open specification of the interfaces
    • each group (TNRS, etc.) should be able to contribute to the interfaces
    • main goal is to derive the pruner interface in full detail as a demo
    • specification of interface and behavior
    • tie with MIAPA
  • Front End Versatility (Karen)
    • is this PhyloWS?
    • Ensure that Phylotastic can work with many different trees and treestores.
  • Use Cases (Brian S)
    • functional and exciting
    • alternative ways to create a list of input taxa
    • best phylogeny for a node in the tree of life
    • phylogeny for a complete museum
    • geographic search
    • ecological search
    • ...
    • incrementally growing list
    • maybe just get names automatically out of a PDF file
  • Kind of Users (Andrea)
    • what users and what kind of documentation is needed
    • e.g., why it shouldn't be scary to use command-line args
    • focus on documentation development
  • Shiny (Daisy)
    • Tools for presenting extinct taxa
    • Visualizing bootstraps
    • Presenting beautiful, usable and informative trees
    • Links to annotating branches and nodes (Michael) and how to standardize the storage of the annotations
    • Googlemaps API with trees underneath - zooming and visualize
  • Tests for Phylotastic (Greg)
    • metrics for testing and sets of benchmarks
    • automated tests for the components and for the infrastructure
    • link to the interface specification?
  • Connecting trees to tip data (Julie)
    • attach trees to other repositories (e.g., locations, etc.)
    • e.g., get physiological data from Dryad...
  • Common Names for TNRS (Naim)
    • for non scientists using the system
  • Authentication (Scott)

Project ideas

There are some brief ideas and links below. No one person owns these ideas or the content. If you are excited about making a pitch, or if you just have a piece of information or an idea to share, don't hesitate to edit the linked content. This wiki thing doesn't work if you treat it as sacred!

Phylotastic reference guide

Professional incentive (read: ability to cite, to measure impact) is important, otherwise documentation is rarely maintained and becomes stale. Alternatives being considered:

  • Consider the model of "Topic Pages" in PLOS Comp Biol, which upon publication became Wikipedia articles and are further maintained there. Example: Approximate Bayesian Computation, and corresponding PLOS CB article.
  • Collaborative authoring of an eBook, for example on github using Markdown format (or its extensions implemented in Pandoc).

community science strategy

There are many ways that individual scientists can get involved in making phylotastic better:

  • submitting trees to a treestore
  • submitting calibrated trees to DateLife
  • bookmarking or reviewing a tree for quality
  • providing feedback on a service (speed, quality, convenience)

How is that going to happen? When do we start engaging people? Who are our partners? Do we leave the tree submission part to the OToL project?

Phylotastic Alpha (integration challenge)

One proposal we came up with on Wednesday was to push on to the features we wanted on Phylotastic Alpha -- a well-documented, end-to-end system. End-to-end means that we start with a query that the end-user can construct, and end with a result that the end-user can employ, without requiring the user to have any special tools other than what phylotastic provides.

Minimal steps (?) in response to user submitting query consisting of list of species names

  1. system cleans up names
    • sends list to TNRS
    • parses result from TNRS
    • imposes rule to choose matches
    • records metadata on the matches that are used
  2. system finds tree with best coverage (or other desired features)
  3. system executes pruning and grafting with tree
  4. system scales tree using DateLife, if possible (animals only?)
  5. system returns scaled tree with metadata to user
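
The steps above can be sketched as a single function that chains the components and records metadata along the way. This is only an illustration under stated assumptions: `resolve_names`, `choose_tree` and `run_query` are hypothetical names, the exact-match rule stands in for a real TNRS call, and `prune` and `scale` are placeholders for the pruner and DateLife services, supplied by the caller.

```python
def resolve_names(names, namebank):
    """Step 1: clean up names and keep exact matches (stand-in for a TNRS)."""
    matched, log = [], []
    for raw in names:
        # Normalize whitespace and capitalization: "homo  sapiens" -> "Homo sapiens"
        cleaned = " ".join(raw.split()).capitalize()
        if cleaned in namebank:
            matched.append(cleaned)
            log.append({"submitted": raw, "matched": cleaned, "rule": "exact"})
        else:
            log.append({"submitted": raw, "matched": None, "rule": "no match"})
    return matched, log


def choose_tree(matched, tree_taxa):
    """Step 2: pick the source tree that covers the most matched names."""
    wanted = set(matched)
    return max(tree_taxa, key=lambda tree_id: len(wanted & tree_taxa[tree_id]))


def run_query(names, namebank, tree_taxa, prune, scale=None):
    """Steps 1-5: resolve names, choose a tree, prune, optionally scale, report."""
    matched, name_log = resolve_names(names, namebank)
    source = choose_tree(matched, tree_taxa)
    tree = prune(source, matched)               # step 3 (hypothetical service)
    scaled = scale(tree) if scale else None     # step 4, only if available
    return {
        "result": scaled if scaled is not None else tree,
        "metadata": {
            "names": name_log,
            "topology_source": source,
            "scaling_method": "none" if scaled is None else "scaled",
        },
    }
```

Passing `prune` and `scale` in as arguments keeps the sketch independent of how those services are actually invoked, which is exactly the part the architecture discussion still has to settle.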

R wrappers to all phylotastic components

A lot of non-technical users already use R. Creating a set of R packages to hook into Phylotastic APIs wouldn't be hard. TNRS could be hooked into the taxize package I started here. Could include tree store, pruner, and phylomatic in another package to do tree acquisition/manipulation (perhaps wrap treebase into this package too by porting over the treebase package).

Phylotastic Lite

At the first hackathon, we treated Rutger's MapReduce pruner as a stripped-down delivery system for phylogenetic knowledge, and we were able to make cool but highly limited demos including Mesquite-o-tastic and Reconcili-o-Tastic.

The idea here is to make another big jump forward in increasing capacity to handle real-life use-cases, without working out the larger problems associated with a multi-component phylotastic API. Let's take the shortest path to getting something that people actually can use for a wide range of queries. We could start with either phylomatic, or Rutger's MapReduce pruner. We'll debug the current system, load up the back end with 20 big trees that look really useful-- like the 5K-species bird tree that just came out last month, and including the NCBI taxonomy-- and we'll integrate quick-n-dirty fuzzy matching so that folks don't have to get the names perfect. We'll come up with an ad hoc system for annotating output.

This will give us something that is considerably more than a proof of concept:

  • a service to invoke in even cooler demos
  • a testbed to assess phylogenetic coverage
  • a testbed to generate challenges for annotation
  • a testbed to integrate phylotastic functionality for TNRS (name reconciliation) or tree-finding (choosing a source tree when the user doesn't specify it)
  • a source of a wide range of phylogenies for developing reconciliotastic applications
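
The quick-n-dirty fuzzy matching mentioned above could start as small as the following sketch, which uses `difflib` from the Python standard library. The function name `fuzzy_match` and the cutoff of 0.8 are assumptions made for illustration; a real service would tune the cutoff and probably use a matcher designed for taxonomic names.

```python
import difflib


def fuzzy_match(name, known_names, cutoff=0.8):
    """Return (best_match, kind), where kind is 'exact', 'fuzzy' or 'none'."""
    if name in known_names:
        return name, "exact"
    # get_close_matches ranks candidates by similarity ratio, best first
    candidates = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "none"
```

Returning the kind of match alongside the name makes it easy to record in the ad hoc output annotation whether a tip was matched exactly or only approximately.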

MatchMaker

Note: Gaurav has some additions to this based on recent work

An important step in data integration is matching on a key-- species names in our case. In our version of the problem, we have a resource A (the user's input list of species, a PDF, a character matrix file, etc), and we want to match that with some external resources B (a set of source trees or core TNRSs) so as to integrate the two.

But this is part of a much bigger problem that has two parts: visualizing an auto-generated mapping between A and B, and interactive choosing of a mapping by the user. In the case of phylotastic, the auto-generated mapping comes from a TNRS. I think we are going to need at least the first part in order to develop and evaluate the TNRS as a practically useful tool.

If we think about this graphically as a table of pairwise matches, we get an idea code-named MatchMaker, which could be a killer app with uses not just in phylotastic, not just in bioinformatics, but throughout informatics and beyond. How often do people need to reconcile or integrate two resources A and B by finding the best mapping of A names to B names? I have some mock-ups and GUI ideas from a slightly different version of the problem (the "marriage problem" of finding the optimal mapping between 2 sets of N names) here:

 http://dl.dropbox.com/u/7727158/name_matching.pptx

But we also might want to think about the TNRS meta-service, which creates a one-to-many mapping between submitted names and matched names, i.e., an input name in A might match names B1, B2, B3 . . . in different namebanks served by different core TNRSs (aggregated by the meta-TNRS).

In this case, the mapping is naturally a disconnected graph consisting of connected sub-graphs, each with a single A node linked to 0 or more B nodes, with the A-B links representing a matching event (exact, fuzzy, deprecated, etc). So, one way to visualize this mapping is as a graph. Hilmar has constructed an ontology for TNRS language (http://phylotastic.org/terms/tnrs.rdf). A JSON parser can be used to take the JSON output of a TNRS and turn it into RDF; then either visualize the RDF graph with a tool such as Protege, or parse the matches out of the RDF and visualize them with GraphViz. The representation could be improved with some way of graphically encoding information about the match, e.g., blue node = NCBI taxon, thick solid edge = exact match, dotted edge = fuzzy match.
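
As a rough sketch of the GraphViz route, the function below turns TNRS-style JSON directly into GraphViz DOT text, skipping the RDF step. The JSON layout (`submitted`, `matches`, `source`, `name`, `kind`) is invented for illustration and is not the actual TNRS output format; the styling follows the encoding suggested above (blue node for NCBI, bold edge for exact, dotted edge for fuzzy).

```python
import json

# Edge styles keyed by the (hypothetical) match kind
EDGE_STYLE = {"exact": "style=bold", "fuzzy": "style=dotted"}


def matches_to_dot(tnrs_json):
    """Build an undirected DOT graph: one A node per submitted name, linked to its B matches."""
    records = json.loads(tnrs_json)
    lines = ["graph tnrs {"]
    for rec in records:
        a = rec["submitted"]
        lines.append(f'  "{a}" [shape=box];')
        for m in rec["matches"]:
            b = f'{m["source"]}: {m["name"]}'
            if m["source"] == "NCBI":
                lines.append(f'  "{b}" [color=blue];')
            style = EDGE_STYLE.get(m["kind"], "style=solid")
            lines.append(f'  "{a}" -- "{b}" [{style}];')
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and rendered with the `dot` command-line tool. A submitted name with no matches still appears as an isolated box, which makes unmatched names easy to spot.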

A tree in the hand

Quick idea: as you roam a museum, zoo, collection or preserve, you scan QR codes (e.g., on signage) for various species, then press the "get tree" button, and you have a tree for all the critters you have seen. Great for class field trips.

In the more advanced version of this idea, there is some kind of automated or semi-automated species identification where the inputs are not encoded species names, but the user's feature descriptions, sequence samples, or photos.

Rutger

I'm still rather keen to see an attractive handheld app where people can grow their own ToL the way they "check in" on locations, such as on foursquare.
The Netherlands now has a funding mechanism to take proof-of-concept technologies (i.e. phylotastic) to market (i.e. the app store) in public/private partnerships.
Once upon a time I worked at a web development company that specializes in life sciences (biomedia.nl). I think they would be excellent partners to write a proposal with to see if phylotastic technology can be applied to such an app.

From Michael

I'm friendly with the group behind iNaturalist.com, which is a website for recording species observations and citizen science, and once floated the idea of a phylotastic interface. I could approach them again about it.

From Brian S

I think this is a really cool idea, and I would probably use an app like that in my classes if it were available. In particular, I can think of integrating it with one of the field trips in my Systematics of Fishes class. Right now they do a scavenger hunt at the Oregon Coast Aquarium and then assemble a tree of life linking the species that they found by hand. This would be way cooler if they could do it on their iPhones or Androids.

From Dan L

I think this is a great idea. I've been building iPhone apps since 2009, so I have a good handle on what's available in the native SDK. I read about a project with some similar ideas at play: What The Feuille ? - a web app to let you find out what tree or plant a leaf is from. Wrapping something like that in an app to upload pictures and talk to a web API would go a long way.

Others

Possible integration with OneZoom?
A front-end for the general public: a fun way to accumulate a species list (e.g. QR codes in a zoo or a museum) and get an attractive tree with extra information, as a web app that displays nicely in a mobile environment (e.g. could then be wrapped into a thin iOS and/or Android app).


congruifier

Roll Eastman congruifier code into DateLife, so input topologies, rather than just lists of species, can be dated (Brian O'M)

Phylotastic metadata

Problem statement To ensure that phylotastic trees are useful to scientists, develop vocabularies to annotate phylotastic trees, and formats to embed the annotations with phylotastic trees delivered to users.

Background Phylotastic trees need metadata, for several reasons that include debugging the system and ensuring that results are reproducible. Importantly, in the absence of an external standard for truth, scientists judge the quality of trees by assessing the quality of methods used to produce them. Therefore, if a phylotastic system is to be useful to research scientists, the metadata for trees produced by the system must include sources and methods.

A very simple example At the first hackathon, we often demonstrated phylotastic by sending a query like "Homo sapiens, Mus musculus, Pan paniscus" to the MapReduce pruner and getting back a tree of the form "(( Homo sapiens, Pan paniscus ), Mus musculus )" based on pruning from the Bininda-Emonds tree. How would we want to annotate that result so that users can understand how the phylogeny was obtained, so that they can judge how much to trust it? What will we need to know about the source tree, the Bininda-Emonds tree? Here is a quick example of a simple report, in loosely structured text:

date = 9 Jan 2013, 10:27 am
service = MapReduce pruner at http://phylotastic-wg.nescent.org/script/phylotastic.cgi
query = { taxa="Homo sapiens, Mus musculus, Pan paniscus"&tree=mammals }
topology_method = pruning only based on exact match to species binomials
topology_source = { dc:title="The delayed rise of present-day mammals" dc:creator="Bininda-Emonds, O." dcterms:bibliographiccitation="Nature 2007, 446(7135):507-512" and so on }
topology_log = (no errors or warnings)
scaling_method = none
scaling_source = none
scaling_log = none
result = (( Homo sapiens, Pan paniscus ), Mus musculus )
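
To make the example concrete, here is a minimal sketch of pruning plus report generation. It assumes trees are held as nested tuples of tip names, which is not how the MapReduce pruner works internally, and the function names are hypothetical. The report keys mirror the loosely structured text above.

```python
def prune(tree, keep):
    """Return the subtree induced by the tips in `keep`, or None if no tip survives."""
    if isinstance(tree, str):                      # a tip
        return tree if tree in keep else None
    kept = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:                             # collapse a node left with one child
        return kept[0]
    return tuple(kept)


def to_newick(tree):
    """Serialize a nested-tuple tree as a Newick-like string without branch lengths."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def build_report(taxa, source_title, result_tree, date):
    """Assemble a report with the same fields as the text example."""
    return {
        "date": date,
        "query": {"taxa": ", ".join(taxa)},
        "topology_method": "pruning only based on exact match to species binomials",
        "topology_source": {"dc:title": source_title},
        "scaling_method": "none",
        "result": to_newick(result_tree),
    }
```

Keeping the report as a dictionary means the same content can later be serialized to JSON, RDF or NeXML annotations once a vocabulary is agreed on.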

TreeAnnotation

Synopsis Annotate a small set of large trees used as sources.

This is described on a separate TreeAnnotation page.

comment: This potentially links to the social bookmarking / crowd-annotation of trees idea, which will become more critical as the number of available large trees expands.

Tree reconciliation, non-tree-topology presentation

From an email sent to the list: Is it necessary to resolve conflict outright, or is it possible to figure out a way to present and incorporate uncertainty into the interface and presentation? As a systematist, I often *want* to see the conflict and degree of conflict amongst trees, be they from different genes or from different authorities, etc. It seems to me that systematics still doesn't have a good way of presenting uncertainty in trees or in graphs, and that maybe Phylotastic, with its emphasis on good presentation and visuals, would be a great place to hash out a better way of displaying incongruity amongst several trees.

I'm imagining a possible use case like this: I am working on the genus Populus and I want to find out what's been published, tree-wise. I know that I will get back several trees that have pretty significant conflicts, grouping species together in different monophyletic groups. Instead of that resolving to one single "authoritative" tree, it might be more informative to me to see something like a Splitstree network, where the incongruities are visualized in some way, and maybe with branch weightings indicating support amongst published trees for that branch. Then perhaps I could click on the branch and see the breakdown of references that support that branch.

If we could work out a good way to visualize these sorts of incongruities, I imagine it would scale well into other sorts of tree conflicts, like incomplete lineage sorting, hybridization, etc.

Ideas:

Splitstree-like network visualization
Densitree-like presentation, with different levels of support indicated by thicknesses
Network presentations
Reticulate branches in a normal tree presentation
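
One way to get the branch weightings described in the use case is to count how often each clade appears across the set of published trees. The sketch below does this for rooted trees stored as nested tuples of tip labels. That representation and the function names are assumptions, and counting rooted clades is a simplification of the unrooted splits that a Splitstree-style network would use.

```python
from collections import Counter


def clades(tree):
    """Return (tips, clades) for a nested-tuple tree; each clade is a frozenset of tips."""
    if isinstance(tree, str):                  # a tip defines no clade of its own
        return frozenset([tree]), set()
    tips, found = frozenset(), set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found


def clade_support(trees):
    """Map each non-root clade to the fraction of input trees that contain it."""
    counts = Counter()
    for tree in trees:
        all_tips, found = clades(tree)
        found.discard(all_tips)                # the root clade is in every tree
        counts.update(found)
    return {clade: n / len(trees) for clade, n in counts.items()}
```

Conflicting clades show up as overlapping tip sets with support below 1, and keeping a list of source references per clade instead of a bare count would support the click-through to the references behind each branch.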

NeXML -> R converter

From Brian O. There are two main formats in R for phylogenies: ape's phylo format, and phylobase's phylo4d format. Greg Jordan in his ggphylo package also keeps phylogeny data in a separate data.frame that references nodes.

Hangout 1 Ideas

  • Support for common names in TNRS: NCBI provides English, EOL several other languages, wikipedia/ http://wiki.dbpedia.org/
  • Tree store implementation (support for DateLife)
  • Architecture:

From Brian O'Meara's email:

One thing that came up during today's hangout was architecture. At the last hackathon, there was a group dedicated to architecture (how all the components fit together). Some of their work is at http://www.evoio.org/wiki/Phylotastic/Architecture . Some of us on the call thought that this hackathon might work best if the architecture is mostly specified before we meet, with perhaps further improvements once we arrive. It's hard to code things to work together if you are still designing how that will happen. This wouldn't be set in stone, but would help guide development: how will metadata be passed, for example. Much of this work was apparently done by the group at the last hackathon, but we should discuss 1) whether trying to firm this up before the hackathon is a good idea (based on discussion today, I assume most people think yes, but there are probably counterarguments), and 2) how much of the current specification we should use, and what needs to be added/changed/deleted.

Hangout 3 Summary

  • In terms of overall goals for our week together, the people on the hangout were enthused about trying to roll out a full alpha or production version of Phylotastic, specifically a version of the website that uses all the pieces of the workflow in a coherent whole. In other words, people expressed primary interest in refining and connecting the components that we already have, as opposed to creating new components.
  • Several people were interested in ironing out details of architecture and specifications for how components exchange information, and it was suggested that we start an email thread on that topic before meeting in Tucson.
  • People were divided on the mobile app idea. Everyone likes it in concept, but some folks are enthusiastically behind developing it now, while others suggested that development is premature until we have a fully functional basic service in place. We might want to coordinate with others who are already developing similar apps, such as the folks at iNaturalist (http://www.inaturalist.org/).
  • We talked at some length about coordination with Opentree, and generally agreed that this would be a great source of big trees, though likely not the only source. We mentioned Arbor but didn't discuss it extensively.
  • Several people recommended devoting substantial person-hours to documentation, perhaps even having a documentation subgroup. The r-phylo wiki from the first NESCent hackathon might be a good model for how to start doing that. http://www.r-phylo.org/wiki/Main_Page
  • People are very excited about getting together in Tucson, which is great!

Boot camps

  • Overview of Phylotastic 1 (high level – Arlin if present)
  • Architecture and Interfaces – this could also mention some of the data standards in use, like NeXML (Rutger) File:Slides-rutger.pdf
  • TreeStore (Ben and Hilmar)
  • TNRS (User:Gaurav): slides
  • OpenTree and related efforts (Karen)
  • git and GitHub

Subgroups

  • Architastic
    • Ben, Karen, Derrick, Shannon, Naim, Mark, Cody
    • google doc

Phylotastic resources from the first hackathon