Phylotastic
Phylotastic is a project to enable convenient, computable, credible access to the Tree of Life comprising expert knowledge of phylogeny: the species tree you want, in ready-to-use form, when you want it. It is a project started by a NESCent working group called HIP - Hackathons, Interoperability and Phylogenetics. The first Phylotastic hackathon was held at NESCent from June 4-8, 2012. Phylotastic2 is happening at iPlant from January 28 through February 1, 2013.
Before the hackathon
- Add your name (and photo!) to the list of participants
- Have a suggestion for making Phylotastic better? Suggest a project idea below. Take note of what has already been done; improving is often more efficient than rewriting.
- Is there something you want to learn in order to be more productive at the hackathon? Add an idea to the list of potential boot camps below.
- Check out the list of resources. Add things you think might be relevant
- Review the material from the first hackathon
Project ideas
There are some brief ideas and links below. No one person owns these ideas or the content. If you are excited about making a pitch, or if you just have a piece of information or an idea to share, don't hesitate to edit the linked content. This wiki thing doesn't work if you treat it as sacred!
Phylotastic reference guide
Professional incentive (read: ability to cite, to measure impact) is important, otherwise documentation is rarely maintained and becomes stale. Alternatives being considered:
- Consider the model of "Topic Pages" in PLOS Comp Biol, which upon publication became Wikipedia articles and are further maintained there. Example: Approximate Bayesian Computation, and corresponding PLOS CB article.
- Collaborative authoring of an eBook, for example on GitHub using Markdown format (or its extensions implemented in Pandoc).
Community science strategy
There are many ways that individual scientists can get involved in making phylotastic better:
- submitting trees to a treestore
- submitting calibrated trees to DateLife
- bookmarking or reviewing a tree for quality
- providing feedback on a service (speed, quality, convenience)
How is that going to happen? When do we start engaging people? Who are our partners? Do we leave the tree submission part to the OToL project?
Phylotastic Alpha (integration challenge)
One proposal we came up with on Wednesday was to push on to the features we wanted on Phylotastic Alpha -- a well-documented, end-to-end system. End-to-end means that we start with a query that the end-user can construct, and end with a result that the end-user can employ, without requiring the user to have any special tools other than what phylotastic provides.
Minimal steps (?) in response to a user submitting a query consisting of a list of species names:
- system cleans up names
- sends list to TNRS
- parses result from TNRS
- imposes rule to choose matches
- records metadata on the matches that are used
- system finds tree with best coverage (or other desired features)
- system executes pruning and grafting with tree
- system scales tree using DateLife, if possible (animals only?)
- system returns scaled tree with metadata to user
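To make the steps above concrete, here is a minimal sketch of the flow in Python. Everything in it is an assumption for illustration: the TNRS is faked with a dictionary (TOY_TNRS), the tree store is a dictionary of nested tuples (TOY_TREES), the match rule is exact match only, and the DateLife scaling step is skipped. It is not how the real components talk to each other.

```python
import re

# Hypothetical stand-in for a TNRS: cleaned spelling -> accepted name.
TOY_TNRS = {
    "homo sapiens": "Homo sapiens",
    "mus musculus": "Mus musculus",
    "pan paniscus": "Pan paniscus",
    "pan troglodytes": "Pan troglodytes",
}

# Hypothetical tree store: each source tree is a nested tuple of tip names.
TOY_TREES = {
    "mammals": ((("Pan paniscus", "Pan troglodytes"), "Homo sapiens"), "Mus musculus"),
    "rodents": ("Mus musculus", "Rattus norvegicus"),
}


def clean_name(raw):
    """Turn underscores and runs of whitespace into single spaces; lowercase."""
    return re.sub(r"[\s_]+", " ", raw).strip().lower()


def resolve(names):
    """Apply an exact-match rule; return (matched dict, unmatched list)."""
    matched, unmatched = {}, []
    for raw in names:
        accepted = TOY_TNRS.get(clean_name(raw))
        if accepted is None:
            unmatched.append(raw)
        else:
            matched[raw] = accepted
    return matched, unmatched


def tips(tree):
    """Return the set of tip names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(tips(child) for child in tree))


def best_tree(wanted):
    """Pick the name of the source tree that covers the most requested tips."""
    return max(TOY_TREES, key=lambda name: len(tips(TOY_TREES[name]) & wanted))


def prune(tree, keep):
    """Drop tips not in `keep`; collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def to_newick(tree):
    """Serialize a nested-tuple tree as a Newick string without the final ';'."""
    if isinstance(tree, str):
        return tree.replace(" ", "_")
    return "(" + ",".join(to_newick(k) for k in tree) + ")"


def phylotastic(names):
    """Run the toy pipeline and return the tree together with its metadata."""
    matched, unmatched = resolve(names)
    wanted = set(matched.values())
    source = best_tree(wanted)
    pruned = prune(TOY_TREES[source], wanted)
    return {
        "newick": to_newick(pruned) + ";" if pruned is not None else None,
        "source_tree": source,
        "matches": matched,        # metadata on the matches that were used
        "unmatched": unmatched,
        "scaling_method": "none",  # DateLife step not sketched here
    }
```

Calling phylotastic(["Homo  sapiens", "mus_musculus", "Pan paniscus", "Felis catus"]) prunes the toy "mammals" tree to three tips and reports "Felis catus" as unmatched. Grafting and scaling are left out on purpose.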
R wrappers to all phylotastic components
A lot of non-technical users already use R. Creating a set of R packages to hook into Phylotastic APIs wouldn't be hard. TNRS could be hooked into the taxize package I started here. A second package could include the tree store, pruner, and phylomatic to do tree acquisition/manipulation (perhaps wrapping treebase into this package too by porting over the treebase package).
Phylotastic Lite
At the first hackathon, we treated Rutger's MapReduce pruner as a stripped-down version of a delivery system for phylogenetic knowledge, and we were able to make cool but highly limited demos including Mesquite-o-tastic and Reconcili-o-Tastic.
The idea here is to make another big jump forward in increasing capacity to handle real-life use-cases, without working out the larger problems associated with a multi-component phylotastic API. Let's take the shortest path to getting something that people actually can use for a wide range of queries. We could start with either phylomatic, or Rutger's MapReduce pruner. We'll debug the current system, load up the back end with 20 big trees that look really useful-- like the 5K-species bird tree that just came out last month, and including the NCBI taxonomy-- and we'll integrate quick-n-dirty fuzzy matching so that folks don't have to get the names perfect. We'll come up with an ad hoc system for annotating output.
This will give us something that is considerably more than a proof of concept:
- a service to invoke in even cooler demos
- a testbed to assess phylogenetic coverage
- a testbed to generate challenges for annotation
- a testbed to integrate phylotastic functionality for TNRS (name reconciliation) or tree-finding (choosing a source tree when the user doesn't specify it)
- a source of a wide range of phylogenies for developing reconciliotastic applications
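As a rough illustration of the quick-n-dirty fuzzy matching mentioned above, the sketch below uses Python's standard difflib module to map a misspelled name onto the tip labels of a loaded tree. The list TREE_TIPS and the cutoff of 0.8 are assumptions picked for the example, not tuned values.

```python
import difflib

# Hypothetical tip labels from one of the loaded source trees.
TREE_TIPS = ["Homo sapiens", "Mus musculus", "Pan paniscus", "Pan troglodytes"]


def fuzzy_match(query, candidates=TREE_TIPS, cutoff=0.8):
    """Return the closest candidate name, or None if nothing clears the cutoff."""
    # Compare in lowercase, but hand back the original spelling of the tip.
    lowered = {name.lower(): name for name in candidates}
    hits = difflib.get_close_matches(
        query.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None
```

With these toy tips, "Homo sapeins" resolves to "Homo sapiens", while "Felis catus" returns None because no tip is similar enough. A real service would also want to report that a match was fuzzy rather than exact.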
MatchMaker
Note: Gaurav has some additions to this based on recent work
An important step in data integration is matching on a key-- species names in our case. In our version of the problem, we have a resource A (the user's input list of species, a PDF, a character matrix file, etc), and we want to match that with some external resources B (a set of source trees or core TNRSs) so as to integrate the two.
But this is part of a much bigger problem that has two parts: visualizing an auto-generated mapping between A and B, and interactive choosing of a mapping by the user. In the case of phylotastic, the auto-generated mapping comes from a TNRS. I think we are going to need at least the first part in order to develop and evaluate the TNRS as a practically useful tool.
If we think about this graphically as a table of pairwise matches, we get an idea code-named MatchMaker, which could be a killer app with uses not just in phylotastic, not just in bioinformatics, but throughout informatics and beyond. How often do people need to reconcile or integrate two resources A and B by finding the best mapping of A names to B names? I have some mock-ups and GUI ideas from a slightly different version of the problem (the "marriage problem" of finding the optimal mapping between 2 sets of N names) here:
http://dl.dropbox.com/u/7727158/name_matching.pptx
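One way to read the "optimal mapping between 2 sets of N names" version of the problem is as an assignment problem: score every A-B pair, then pick the one-to-one mapping with the highest total score. The sketch below is only an illustration under that assumption. It scores pairs with difflib's string similarity and tries every permutation, so it is usable only for small N.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_mapping(a_names, b_names):
    """Brute-force the one-to-one mapping with the highest total similarity.

    Tries all N! orderings of b_names, so this is for small N only.
    """
    if len(a_names) != len(b_names):
        raise ValueError("both lists must contain the same number of names")
    best = max(
        permutations(b_names),
        key=lambda perm: sum(similarity(a, b) for a, b in zip(a_names, perm)),
    )
    return dict(zip(a_names, best))
```

For example, best_mapping(["H. sapiens", "Mus musculus", "P. paniscus"], ["Pan paniscus", "Homo sapiens", "Mus musculus"]) pairs each abbreviated name with its full binomial. An interactive tool would show the pairwise scores and let the user override the automatic choice.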
But we also might want to think about the TNRS meta-service, which creates a one-to-many mapping between submitted names and matched names, i.e., an input name in A might match names B1, B2, B3 . . . in different namebanks served by different core TNRSs (aggregated by the meta-TNRS).
In this case, the mapping is naturally a disconnected graph consisting of connected sub-graphs, each with a single A node linked to 0 or more B nodes, with each A-B link representing a matching event (exact, fuzzy, deprecated, etc). So, one way to visualize this mapping is as a graph. Hilmar has constructed an ontology for TNRS language (http://phylotastic.org/terms/tnrs.rdf). A JSON parser can be used to take the JSON output of a TNRS and turn it into RDF; the RDF graph can then be visualized with a tool such as Protege, or the matches can be parsed out of the RDF and visualized with GraphViz. The representation could be improved with some way of graphically encoding information about the match, e.g., blue node = NCBI taxon, thick solid edge = exact match, dotted edge = fuzzy match.
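As a sketch of the GraphViz route, the Python below turns a TNRS-style JSON response directly into GraphViz DOT text, skipping the RDF step. The JSON layout and field names (names, submittedName, matches, matchedName, source, matchType) are invented for this example and are not the actual TNRS output format. The styling follows the encoding suggested above.

```python
import json

# Hypothetical TNRS response; the field names are assumptions for illustration.
SAMPLE = """
{"names": [
  {"submittedName": "Homo sapeins",
   "matches": [
     {"matchedName": "Homo sapiens", "source": "NCBI", "matchType": "fuzzy"},
     {"matchedName": "Homo sapiens", "source": "ITIS", "matchType": "fuzzy"}]},
  {"submittedName": "Mus musculus",
   "matches": [
     {"matchedName": "Mus musculus", "source": "NCBI", "matchType": "exact"}]},
  {"submittedName": "Foo bar", "matches": []}
]}
"""

# Thick solid edge = exact match, dotted edge = fuzzy match.
EDGE_STYLE = {"exact": "style=solid, penwidth=3", "fuzzy": "style=dotted"}
# Blue node = NCBI taxon; other sources fall back to black.
NODE_COLOR = {"NCBI": "blue"}


def tnrs_to_dot(tnrs_json):
    """Build an undirected DOT graph: one A node per input, one B node per match."""
    data = json.loads(tnrs_json)
    lines = ["graph tnrs {"]
    for i, entry in enumerate(data["names"]):
        a_id = f"A{i}"
        submitted = entry["submittedName"]
        lines.append(f'  {a_id} [label="{submitted}", shape=box];')
        for j, match in enumerate(entry["matches"]):
            b_id = f"B{i}_{j}"
            label = f'{match["matchedName"]} ({match["source"]})'
            color = NODE_COLOR.get(match["source"], "black")
            style = EDGE_STYLE.get(match["matchType"], "style=dashed")
            lines.append(f'  {b_id} [label="{label}", color={color}];')
            lines.append(f"  {a_id} -- {b_id} [{style}];")
    lines.append("}")
    return "\n".join(lines)


print(tnrs_to_dot(SAMPLE))
```

The printed text can be saved to a file and rendered with GraphViz's dot program. An input with no matches, such as "Foo bar" here, shows up as an isolated A node.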
A tree in the hand
Quick idea: as you roam a museum, zoo, collection or preserve, you scan QR codes (e.g., on signage) for various species, then press the "get tree" button, and you have a tree for all the critters you have seen. Great for class field trips.
In the more advanced version of this idea, there is some kind of automated or semi-automated species identification where the inputs are not encoded species names, but the user's feature descriptions, sequence samples, or photos.
Rutger
- I'm still rather keen to see an attractive handheld app where people can grow their own ToL the way they "check in" on locations, such as on foursquare.
- The Netherlands now has a funding mechanism to take proof-of-concept technologies (i.e. phylotastic) to market (i.e. the app store) in public/private partnerships.
- Once upon a time I worked at a web development company that specializes in life sciences (biomedia.nl). I think they would be excellent partners to write a proposal with to see if phylotastic technology can be applied to such an app.
From Michael
- I'm friendly with the group behind iNaturalist.com, which is a website for recording species observations and citizen science, and once floated the idea of a phylotastic interface. I could approach them again about it.
From Brian S
- I think this is a really cool idea, and I would probably use an app like that in my classes if it were available. In particular, I can think of integrating it with one of the field trips in my Systematics of Fishes class. Right now they do a scavenger hunt at the Oregon Coast Aquarium and then assemble by hand a tree of life linking the species that they found. This would be way cooler if they could do it on their iPhones or Androids.
From Dan L
- I think this is a great idea. I've been building iPhone apps since 2009, so I have a good handle on what's available in the native SDK. I read about a project with some similar ideas at play: What The Feuille ?, a web app that lets you find out what tree or plant a leaf comes from. Wrapping something like that in an app to upload pictures and talk to a web API would go a long way.
Others
- Possible integration with OneZoom?
- A front-end for the general public: a fun way to accumulate a species list (e.g. QR codes in a zoo or a museum) and get an attractive tree with extra information, as a web app that displays nicely in a mobile environment (e.g. could then be wrapped into a thin iOS and/or Android app).
Congruifier
Roll Eastman congruifier code into DateLife, so input topologies, rather than just lists of species, can be dated (Brian O'M)
Phylotastic metadata
Problem statement: To ensure that phylotastic trees are useful to scientists, develop vocabularies to annotate phylotastic trees, and formats to embed the annotations with phylotastic trees delivered to users.
Background: Phylotastic trees need metadata, for several reasons that include debugging the system and ensuring that results are reproducible. Importantly, in the absence of an external standard for truth, scientists judge the quality of trees by assessing the quality of methods used to produce them. Therefore, if a phylotastic system is to be useful to research scientists, the metadata for trees produced by the system must include sources and methods.
A very simple example: At the first hackathon, we often demonstrated phylotastic by sending a query like "Homo sapiens, Mus musculus, Pan paniscus" to the MapReduce pruner and getting back a tree of the form "(( Homo sapiens, Pan paniscus ), Mus musculus )" based on pruning from the Bininda-Emonds tree. How would we want to annotate that result so that users can understand how the phylogeny was obtained, so that they can judge how much to trust it? What will we need to know about the source tree, the Bininda-Emonds tree? Here is a quick example of a simple report, in loosely structured text:
- date = 9 Jan 2013, 10:27 am
- service = MapReduce pruner at http://phylotastic-wg.nescent.org/script/phylotastic.cgi
- query = { taxa="Homo sapiens, Mus musculus, Pan paniscus"&tree=mammals }
- topology_method = pruning only based on exact match to species binomials
- topology_source = { dc:title="The delayed rise of present-day mammals" dc:creator="Bininda-Emonds, O." dcterms:bibliographicCitation="Nature 2007, 446(7135):507-512" and so on }
- topology_log = (no errors or warnings)
- scaling_method = none
- scaling_source = none
- scaling_log = none
- result = (( Homo sapiens, Pan paniscus ), Mus musculus )
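The same report could be carried as a structured record so that downstream tools can check it. The Python sketch below shows one possible shape, serialized as JSON. The choice of JSON, the ISO date format, the use of null and empty lists for "none", and the helper names build_report and missing_fields are all assumptions for illustration; the field names are taken from the report above.

```python
import json
from datetime import datetime

# Field names copied from the loosely structured report above.
REQUIRED_FIELDS = [
    "date", "service", "query",
    "topology_method", "topology_source", "topology_log",
    "scaling_method", "scaling_source", "scaling_log",
    "result",
]


def build_report(taxa, tree, result, when):
    """Assemble a metadata record for one pruning request (hypothetical helper)."""
    return {
        "date": when.isoformat(),
        "service": "MapReduce pruner at "
                   "http://phylotastic-wg.nescent.org/script/phylotastic.cgi",
        "query": {"taxa": taxa, "tree": tree},
        "topology_method": "pruning only based on exact match to species binomials",
        "topology_source": {
            "dc:title": "The delayed rise of present-day mammals",
            "dc:creator": "Bininda-Emonds, O.",
            "dcterms:bibliographicCitation": "Nature 2007, 446(7135):507-512",
        },
        "topology_log": [],      # no errors or warnings
        "scaling_method": None,  # tree was not scaled
        "scaling_source": None,
        "scaling_log": [],
        "result": result,
    }


def missing_fields(report):
    """List required fields that are absent from a report."""
    return [field for field in REQUIRED_FIELDS if field not in report]


report = build_report(
    taxa=["Homo sapiens", "Mus musculus", "Pan paniscus"],
    tree="mammals",
    result="((Homo_sapiens,Pan_paniscus),Mus_musculus);",
    when=datetime(2013, 1, 9, 10, 27),
)
print(json.dumps(report, indent=2))
```

A check like missing_fields could run before a tree is returned, so that a result never leaves the system without its sources and methods attached.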
TreeAnnotation
Synopsis Annotate a small set of large trees used as sources.
This is described on a separate TreeAnnotation page.
comment: This potentially links to the social bookmarking / crowd-annotation of trees idea, which will become more critical as the number of available large trees expands.
Hangout 1 Ideas
- Support for common names in TNRS: NCBI provides English, EOL several other languages, wikipedia/ http://wiki.dbpedia.org/
- Tree store implementation (support for DateLife)
- Architecture:
From Brian O'Meara's email:
One thing that came up during today's hangout was architecture. At the last hackathon, there was a group dedicated to architecture (how all the components fit together). Some of their work is at http://www.evoio.org/wiki/Phylotastic/Architecture . Some of us on the call thought that this hackathon might work best if the architecture is mostly specified before we meet, with perhaps further improvements once we arrive. It's hard to code things to work together if you are still designing how that will happen. This wouldn't be set in stone, but would help guide development: how will metadata be passed, for example. Much of this work was apparently done by the group at the last hackathon, but we should discuss 1) whether trying to firm this up before the hackathon is a good idea (based on discussion today, I assume most people think yes, but there are probably counterarguments), and 2) how much of the current specification we should use, and what needs to be added/changed/deleted.
Hangout 3 Summary
- In terms of overall goals for our week together, the people on the hangout were enthused about trying to roll out a full alpha or production version of Phylotastic, specifically a version of the website that uses all the pieces of the workflow in a coherent whole. In other words, people expressed primary interest in refining and connecting the components that we already have, as opposed to creating new components.
- Several people were interested in ironing out details of architecture and specifications for how components exchange information, and it was suggested that we start an email thread on that topic before meeting in Tucson.
- People were divided on the mobile app idea. Everyone likes it in concept, but some folks are enthusiastically behind developing it now, while others suggested that development is premature until we have a fully functional basic service in place. We might want to coordinate with others who are already developing similar apps, such as the folks at iNaturalist (http://www.inaturalist.org/).
- We talked at some length about coordination with Opentree, and generally agreed that this would be a great source of big trees, though likely not the only source. We mentioned Arbor but didn't discuss it extensively.
- Several people recommended devoting substantial person-hours to documentation, perhaps even having a documentation subgroup. The r-phylo wiki from the first NESCent hackathon might be a good model for how to start doing that. http://www.r-phylo.org/wiki/Main_Page
- People are very excited about getting together in Tucson, which is great!
Boot camps
- git and GitHub
Existing Phylotastic components
- Phylotastic web site
- Phylotastic source code on GitHub
- Phylotastic/Architecture: Architecture and API for Phylotastic
- TNRS: resolving taxonomic names
- Phylotastic/shiny: demos to showcase phylotastic capabilities
- DateLife: getting divergence times for lists of species using existing trees