Phylotastic/Pre-Hackathon Design
Phylotastic design, pre-hackathon
goal statement
Statement of goals.
1. Build phylotastic, a collection of interoperable web services that collectively provide the means to extract a subtree (specified by tips) from any of several large species trees, and to supply branch lengths and provenance annotation.
2. For demonstration purposes, leverage these services within a graphical interface that also integrates the resulting species tree with the user's choice of several high-value types of data. Optionally, this may involve adapting an existing environment (e.g., Galaxy, Taverna) to manage a phylotastic workflow.
inputs and outputs for a simple case
inputs = {
- the user's list of species { S }; # the main input under the control of user
- optionally, the user's character data, one row for each species in { S } ;
- repository of megatrees that we have built for the project ;
- any information on { S } conveniently available online via web services (e.g., NCBI, GBIF)
}
outputs = {
- phylogeny (with branch lengths) including only species in { S }; # main output
- optionally, user's comparative data with tree (NEXUS or NeXML), ready for phylogenetic character analysis;
- optionally, a mash-up with other information on { S } from online resources
}
where this output is presented graphically in some viewer that is relatively adaptable, e.g., Mesquite.
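The core transformation here (megatree plus species list in, pruned phylogeny with branch lengths out) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the toy megatree, its node names and branch lengths, and the function name `prune` are all hypothetical, and the tree is stored as a simple child -> (parent, branch length) table rather than in any format the project has agreed on. Requested names are assumed to be tips.

```python
from typing import Dict, List, Optional, Tuple

# child -> (parent, length of the branch leading to the parent)
Tree = Dict[str, Tuple[Optional[str], float]]

# Hypothetical toy megatree; names and lengths are made up for illustration.
MEGATREE: Tree = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 2.0),
    "Homo sapiens": ("A", 1.0),
    "Pan troglodytes": ("A", 1.0),
    "Mus musculus": ("B", 0.5),
    "Rattus norvegicus": ("B", 0.5),
}


def prune(tree: Tree, species: List[str]) -> Tuple[Tree, List[str]]:
    """Return the subtree induced by the requested tips, plus names not found."""
    found = [s for s in species if s in tree]
    missing = [s for s in species if s not in tree]

    # 1. Keep every node on a path from a requested tip up to the root.
    keep = set()
    for tip in found:
        node: Optional[str] = tip
        while node is not None and node not in keep:
            keep.add(node)
            node = tree[node][0]

    # 2. Count how many kept children each kept node has.
    n_children = {node: 0 for node in keep}
    for node in keep:
        parent = tree[node][0]
        if parent is not None:
            n_children[parent] += 1

    # 3. Drop nodes with exactly one kept child, adding their branch
    #    length to the branch of the node below them.
    pruned: Tree = {}
    for node in keep:
        if n_children[node] == 1:
            continue
        parent, length = tree[node]
        while parent is not None and n_children[parent] == 1:
            length += tree[parent][1]
            parent = tree[parent][0]
        pruned[node] = (parent, length)
    return pruned, missing


subtree, not_found = prune(MEGATREE, ["Homo sapiens", "Mus musculus", "Felis catus"])
```

In this toy run, the single-child nodes A and B are collapsed, so each remaining tip hangs directly off the root with the summed branch length, and the name that is absent from the megatree is reported rather than silently dropped.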
a bit more about the issue of integration and mashups
The main work of this project is to develop the "engine", the stuff that is "under the hood". But if this is going to benefit users all over the world, we need to show what the engine can do. For this reason, a substantial fraction of the energy will be devoted to creating integration tools that combine the phylotastic engine with species information that is easily gathered via existing services, such as:
- images of an individual of the species, collected from EoL or Wikipedia; or silhouettes from PhyloPic
- geographic distribution of the species, from GBIF
- the location of the nearest museum specimen of the species
- whether a genome is available for this species, from NCBI
- the number of protein sequences known for this species, from NCBI
- the rDNA or cytochrome c sequence for this species, if available from NCBI
- the average …
…aps, is just a web form with a place to submit and validate a species list, and a set of check-boxes for which types of information to collect for those species. The user enters the species list, checks the desired information, and then clicks "Go"; the software retrieves the information and the phylogeny and presents them to the user for visualization (e.g., in Mesquite or some other viewer that can be adapted). For an example species mashup, see Rod Page's iSpecies, which creates an on-the-fly web page for a species based on info from NCBI, Google Scholar, etc.
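The web-form interaction described above (validate a species list, then collect only the checked kinds of information) might be handled roughly as follows. This is a sketch, not a proposal for the real interface: the check-box keys, the fetcher functions and `handle_form` are hypothetical names, and the fetchers return placeholder strings instead of calling any real service.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for calls to image or genome services.
def fetch_image(species: str) -> str:
    return f"(image of {species} would go here)"


def fetch_genome_status(species: str) -> str:
    return f"(genome availability for {species} would go here)"


# One entry per check-box on the form (keys are assumptions).
FETCHERS: Dict[str, Callable[[str], str]] = {
    "image": fetch_image,
    "genome": fetch_genome_status,
}


def validate_species_list(raw: str) -> List[str]:
    """One name per line; normalize whitespace, drop blanks and duplicates."""
    seen = set()
    names = []
    for line in raw.splitlines():
        name = " ".join(line.split())
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def handle_form(raw_species: str, checked: List[str]) -> Dict[str, Dict[str, str]]:
    """Collect the checked kinds of information for every submitted species."""
    unknown = [key for key in checked if key not in FETCHERS]
    if unknown:
        raise ValueError(f"unknown check-box keys: {unknown}")
    species = validate_species_list(raw_species)
    return {s: {key: FETCHERS[key](s) for key in checked} for s in species}


result = handle_form("Homo sapiens\n\n  Mus   musculus \nHomo sapiens", ["image"])
```

Keeping the fetchers in a table keyed by check-box name means a new kind of information can be offered by adding one entry, without touching the form-handling logic.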
thinking about phylotastic in an MVC design pattern
background
This is an application of the Model-View-Controller (MVC) design pattern (http://en.wikipedia.org/wiki/Model–view–controller, or see the discussion here: http://msdn.microsoft.com/en-us/library/ff649643.aspx). In the design sketched below, the model (the M in MVC) is precisely the USER's tree. This may sound odd at first, if you've been thinking of "phylotastic" as a centralized resource with back-end megatrees at its heart. The design below gives us considerable freedom (to imagine different kinds of phylotastic implementations) by abstracting the operations away from the model. It also frees us from thinking in terms of a conventional workflow, because many operations can be done asynchronously (e.g., we can decorate OTUs with images before or after getting the topology). Because of this potential for multiple asynchronous operations, it may be helpful to add an "Observer" element to the MVC design.
model
The "model" is the user's tree along with its metadata. Of course, the user typically doesn't begin with a tree, but with a kind of pre-tree. In mathematics it's OK for a "graph" to be a set of unlinked nodes. We'll borrow that way of thinking and imagine that the initial state of the tree is (typically) a list of OTUs that will become the terminal nodes. The final state of the tree typically is a fully connected tree with a topology and branch lengths. The final tree may be missing some nodes that could not be found. Also, there may be annotations of individual nodes, and annotations (metadata) for the tree-as-a-whole (e.g., this tree was assembled on a particular date by a particular service).
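One way to picture this model is as a small data structure that starts out as unlinked OTUs and is filled in over time. The sketch below is illustrative only: the class names `Node` and `TreeModel`, their fields, and the metadata key are assumptions, not an agreed schema. The connectedness check assumes parent references point to nodes in the model and contain no cycles.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    name: str                              # OTU label or internal-node id
    parent: Optional[str] = None           # unset until a topology is supplied
    branch_length: Optional[float] = None  # unset until the tree is scaled
    annotations: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TreeModel:
    nodes: Dict[str, Node] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)  # tree-as-a-whole annotations

    @classmethod
    def from_otus(cls, names: List[str]) -> "TreeModel":
        """Initial state: a 'graph' that is just a set of unlinked nodes."""
        return cls(nodes={name: Node(name) for name in names})

    def is_single_tree(self) -> bool:
        """True when exactly one node (the root) lacks a parent."""
        roots = [n for n in self.nodes.values() if n.parent is None]
        return len(roots) == 1


tree = TreeModel.from_otus(["Homo sapiens", "Pan troglodytes"])
linked_before = tree.is_single_tree()  # two unlinked OTUs, so not yet a tree

# Later, some service links the OTUs under a new internal node.
tree.nodes["n1"] = Node("n1")
for tip in ("Homo sapiens", "Pan troglodytes"):
    tree.nodes[tip].parent = "n1"
tree.metadata["assembled_by"] = "hypothetical topology service"
linked_after = tree.is_single_tree()
```

Because parent and branch length are optional on every node, the same structure can represent the pre-tree, a topology without lengths, and the final scaled tree.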
operations
If that is the model, then here is how we would conceptualize the KINDS of operations that update the model:
- a "TNRS" updates the model by replacing input OTU names, or annotating input names, with qualified OTU names.
- a "topology service" updates the model by linking some or all of the OTUs into a connected graph
- a "scaling service" updates the model by estimating the lengths of branches connecting nodes
- a decorating or annotating service updates the model by adding annotations to nodes or branches, such as
  - collecting images of OTUs
  - gathering fossil-based dates for internal nodes
  - assessing quality or reliability of a node
  - and so on
In addition:
- every service updates the model by adding provenance information (e.g., describing how it has modified the model)
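The idea that every operation both updates the model and records provenance can be sketched with a small wrapper. Everything here is hypothetical: the dictionary layout of the model, the `service` decorator, the toy service functions, and the names and branch lengths in the usage lines are invented for illustration, and real services would call out to remote resources rather than take lookup tables as arguments.

```python
from datetime import datetime, timezone
from functools import wraps
from typing import Any, Callable, Dict

# Assumed layout: {"otus": {...}, "edges": {...}, "provenance": [...]}
Model = Dict[str, Any]


def service(kind: str) -> Callable:
    """Wrap a model-updating function so that it also records provenance."""
    def decorate(fn: Callable[..., str]) -> Callable[..., Model]:
        @wraps(fn)
        def wrapper(model: Model, *args: Any, **kwargs: Any) -> Model:
            note = fn(model, *args, **kwargs)  # the service edits the model in place
            model.setdefault("provenance", []).append({
                "kind": kind,
                "service": fn.__name__,
                "note": note,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
            return model
        return wrapper
    return decorate


@service("TNRS")
def toy_tnrs(model: Model, accepted: Dict[str, str]) -> str:
    """Annotate input names with qualified names from a lookup table."""
    hits = 0
    for name, otu in model["otus"].items():
        if name in accepted:
            otu["qualified_name"] = accepted[name]
            hits += 1
    return f"resolved {hits} of {len(model['otus'])} names"


@service("topology")
def toy_topology(model: Model, parent_of: Dict[str, str]) -> str:
    """Link OTUs into a graph using a supplied child -> parent table."""
    for child, parent in parent_of.items():
        model["edges"][child] = {"parent": parent, "length": None}
    return f"linked {len(parent_of)} nodes"


@service("scaling")
def toy_scaling(model: Model, lengths: Dict[str, float]) -> str:
    """Attach branch lengths to edges that already exist."""
    scaled = 0
    for child, length in lengths.items():
        if child in model["edges"]:
            model["edges"][child]["length"] = length
            scaled += 1
    return f"scaled {scaled} branches"


# Usage with made-up names and arbitrary branch lengths.
model: Model = {"otus": {"Homo sapien": {}, "Pan troglodytes": {}}, "edges": {}}
toy_tnrs(model, {"Homo sapien": "Homo sapiens"})
toy_topology(model, {"Homo sapien": "n1", "Pan troglodytes": "n1"})
toy_scaling(model, {"Homo sapien": 6.5, "Pan troglodytes": 6.5})
```

Putting the provenance step in the wrapper rather than in each service means a service cannot forget to describe how it modified the model.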
controllers and views
The typical view of the model is going to be a phylogeny or an OTU-based table. A controller invokes services to modify the model (the user's tree) in response to user commands. Frequently we have discussed phylotastic in terms of automated controllers, such as workflow engines that manage the inputs and outputs of a series of operations. But we also could think of an interactive controller.
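A controller with an Observer hook, as suggested in the background discussion, might look something like the sketch below. The `Controller` class, the two toy services and the table view are all hypothetical, and the services are called one after another here, whereas a real controller might run them asynchronously. The point is only that views are notified after each update, so the order of operations does not have to be fixed in advance.

```python
from typing import Any, Callable, Dict, List, Tuple

Model = Dict[str, Any]
Service = Callable[[Model], None]
Observer = Callable[[str, Model], None]


class Controller:
    """Owns the user's tree, invokes services on it, and notifies views."""

    def __init__(self, otus: List[str]) -> None:
        self.model: Model = {"otus": {name: {} for name in otus}, "edges": {}}
        self._observers: List[Observer] = []

    def subscribe(self, observer: Observer) -> None:
        self._observers.append(observer)

    def invoke(self, label: str, service: Service) -> None:
        service(self.model)  # the service updates the model in place
        for observer in self._observers:
            observer(label, self.model)


def add_images(model: Model) -> None:
    """Toy decorating service: attach a placeholder image to each OTU."""
    for name, otu in model["otus"].items():
        otu["image"] = f"placeholder image for {name}"


def add_topology(model: Model) -> None:
    """Toy topology service: hang every OTU off a single root."""
    for name in model["otus"]:
        model["edges"][name] = "root"


# A minimal "view": records (event, OTUs with images, number of edges).
events: List[Tuple[str, int, int]] = []


def table_view(label: str, model: Model) -> None:
    with_images = sum(1 for otu in model["otus"].values() if "image" in otu)
    events.append((label, with_images, len(model["edges"])))


controller = Controller(["Homo sapiens", "Mus musculus"])
controller.subscribe(table_view)
controller.invoke("images", add_images)      # decorate before any topology exists
controller.invoke("topology", add_topology)
```

Here the images are attached before the topology, and the view simply reflects whatever state the model is in after each step. An interactive controller could call `invoke` in response to user commands, while an automated one could loop over a list of services.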
Architecture