MIAPA/Demonstration Project
Demonstration Project, Spring 2011
Quick overview
Build a tool that will allow users to create a NeXML file with minimal information to document a phylogenetic analysis.
- start by populating CDAO with a rich set of terms from various sources
- work out NeXML representation of methods concepts using CDAO terms (also OBI?)
- develop a web form that allows users to create annotations, output NeXML
- use natural-language workflow descriptions from papers to guide development and testing
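As a rough illustration of the intended output, here is a minimal Python sketch that writes a NeXML-style document carrying a few methods annotations as literal meta elements. It is only a sketch under stated assumptions: the `terms:` prefix and its `http://example.org/...` URI are placeholders standing in for real CDAO (or OBI) terms, the property names are invented for the example, and the result has not been validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"                  # NeXML namespace
XSI = "http://www.w3.org/2001/XMLSchema-instance"  # standard XSI namespace
TERMS = "http://example.org/methods-terms#"        # placeholder, NOT a real ontology URI


def build_annotated_nexml(annotations):
    """Return a NeXML-style string with one literal meta element per annotation.

    `annotations` maps a prefixed property name (hypothetical, e.g.
    "terms:inference_method") to a plain-text value.
    """
    ET.register_namespace("nex", NEX)
    ET.register_namespace("xsi", XSI)
    root = ET.Element(f"{{{NEX}}}nexml", {"version": "0.9", "generator": "demo-sketch"})
    # Declare the vocabulary prefix by hand: it is used only inside attribute values.
    root.set("xmlns:terms", TERMS)
    for i, (prop, value) in enumerate(sorted(annotations.items()), start=1):
        ET.SubElement(root, f"{{{NEX}}}meta", {
            "id": f"meta{i}",
            f"{{{XSI}}}type": "nex:LiteralMeta",
            "property": prop,
            "content": value,
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_annotated_nexml({
        "terms:inference_method": "maximum likelihood",
        "terms:alignment_software": "MUSCLE",
    }))
```

A web form would collect the dictionary of annotations from the user; swapping the placeholder prefix for actual ontology terms is the part the project still has to work out.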
A big-picture strategy that includes this project
under construction
Here is my vision (AS) for the long-term project to develop MIAPA while building support for data re-use. There are two stages focused on support for archiving. The work on re-use depends on the archiving work, to some extent.
- Archiving, first stage - Demonstration. We build submission and query tools to show what is possible. The resulting tools may not be very useful to users, but they provide a platform for further work.
- proof-of-concept based on phenote (Arlin, 2008)
- a web-based demonstration tool to create an annotated record (Maryam, spring of 2011)
- loads vocabulary terms for sources and methods
- provides term-completion based on the loaded vocabularies
- provides slots for specific types of MIAPA annotations
- provides support for term requests
- outputs in NeXML? (ok for Dryad, but not currently supported for input by TreeBASE)
- a demonstration architecture for annotation and submission (GSOC projects, summer of 2011?)
- web services protocol for submitting a phylogenetic record to an archive
- annotation tool with client capacity to submit record
- modify TreeBASE or Dryad to supply server capacity to receive record
- Archiving, second stage - Build up user base. By responding to user needs and adding intelligence, we create a submission tool that is useful both to archives and to users. Meanwhile, we are using this same tool to harvest information on user needs.
- respond to user needs
- use MIAPA survey to identify key use-cases and annotation needs (MIAPA survey team, spring 2011)
- work with users to build annotation support for key use cases
- lay the foundation for applying intelligent methods (Enrico and Arlin, CREST proposal, 2011)
- build out a formal ontology for methods annotation
- include a high-level concept of workflow
- harvest annotations from submitted records
- apply NLP methods to harvest methods annotations from publications
- incorporate intelligence into submission tool
- extract candidate annotations from Methods text
- use planning concepts to detect errors and gaps, suggest corrections
- Technology to support re-use (Enrico and Arlin, CREST proposal, 2011). The aim of this stage is to develop a system that can compile vague workflow descriptions into executable plans, allowing the user to apply the plan to a custom set of data.
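The term-completion feature listed above for the web-based demonstration tool could work along the following lines. This is a minimal sketch, assuming the vocabulary has already been loaded as a flat list of term labels; the class name `TermCompleter` and the sample terms are invented for illustration and do not reflect any existing code.

```python
from bisect import bisect_left


class TermCompleter:
    """Case-insensitive prefix completion over a loaded vocabulary."""

    def __init__(self, terms):
        # Sort by lowercased form so that terms sharing a prefix are contiguous.
        self._entries = sorted((t.lower(), t) for t in set(terms))
        self._keys = [key for key, _ in self._entries]

    def complete(self, prefix, limit=10):
        """Return up to `limit` vocabulary terms that start with `prefix`."""
        prefix = prefix.lower()
        start = bisect_left(self._keys, prefix)  # first key >= prefix
        matches = []
        for key, term in self._entries[start:]:
            if not key.startswith(prefix) or len(matches) >= limit:
                break
            matches.append(term)
        return matches


if __name__ == "__main__":
    completer = TermCompleter([
        "maximum likelihood", "maximum parsimony",
        "neighbor joining", "Bayesian inference", "bootstrap",
    ])
    print(completer.complete("max"))
```

Binary search over a sorted list keeps lookups fast without extra dependencies; a real tool would probably also match synonyms and follow the term hierarchy.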
Resources
- phylogenetic analysis software & methods
- Joe's extensive online list
- entries from Brian O'Meara's treetapper project
- the table of contents from Felsenstein's _Inferring Phylogenies_
- file formats (see extensive list in BioPerl docs)
- alignment software (list in BioPerl Run modules, amazing Wikipedia list of alignment software)
- services from the mygrid project services ontology (here is the owl file in RDF-XML and here is an image of part of the class hierarchy)
Notes from meetings
May 20, 2011
present: Jim, Maryam, Enrico, Rutger, Arlin
discussed case, and how to handle other cases.
had to use Skype and experienced major problems with it.
April 15, 2011
present: Jim, Maryam, Enrico, Brandon, Arlin
March 18, 2011
present: Arlin, Maryam, Jim, Eric, Rutger
1. review of demo project
- inference methods
- term list from TreeBASE
- Joe Felsenstein's "Inferring Phylogenies"
- get an electronic copy from Joe? Arlin will do this
- search for papers. Arlin will do this
- review project plan -- see 4 March notes
2. iEvoBio presentations
- possible lightning talk on Maryam's demo project
- full talk deadline is next week
- one talk on MIAPA & related projects (Jim leads)
- another talk on tree-publishing practices (Arlin leads)
March 4, 2011
present: Arlin, Jim, Maryam, Enrico, Vivek
Agenda: sort out project ideas for spring (Maryam) and summer (GSOC)
- Maryam's project
- start by populating CDAO with terms
- work out NeXML representation of methods concepts using CDAO terms (also OBI?)
- work on submission form to make NeXML file
- use papers from prior literature to harvest natural-language workflow descriptions
- possible GSOC projects
- graphical UI for constructing workflow descriptions
- see http://exon.niaid.nih.gov/mobyleWorkflow/
- Vivek is willing to mentor
- successful applicant knows Java, ideally GWT (Google Web Toolkit) and Jena
- phylogeny experience not necessary
- use library of papers from previous project
- feedback via informal user testing, comment box
- open issue: integrate with existing codebase (Mesquite? TreeBASE?)
- implement NeXML submission in TreeBASE
- Rutger agreed to be co-mentor
- develop web services protocol for phylo record submission
- maybe preconditions will not be met by this summer
- NLP analysis of methods sections of papers
- ratio of analysis to programming is too high for a GSOC
Action items:
- Vivek, Enrico & Arlin write GSOC description by Mar 11
- Arlin to try to find a co-mentor for the NeXML submission project by Mar 11
- put library of papers issue on agenda for next meeting
- Jim to let Eric know what's happening
February 25, 2011
present: Arlin, Jim, Maryam (10:15)
- deliverables
- search interface for TreeBASE
- submission interface (annotation) for TreeBASE
sub-searches based on methods annotation
- hierarchy, term-completion
- distribution of trees by method
- download linked pubs and collect matching terms to test completion?
- developer access to treebase code
standalone search tool (GSOC)?
- web services API
Submission tool
- making it easy for user
- recognize source data
- paste methods section, match terms, supply to user
- start with templates from existing treebase entries
- following methods from a previous publication
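The "paste methods section, match terms, supply to user" idea could start out as simple phrase matching against the loaded vocabulary. The sketch below assumes a flat list of term labels; the function name `match_terms` and the sample text are invented for illustration, and real methods text would need synonym handling and probably proper NLP.

```python
import re


def match_terms(methods_text, vocabulary):
    """Return vocabulary terms found in `methods_text`, in order of first appearance.

    Matching is case-insensitive and only accepts whole words or phrases.
    """
    hits = []
    for term in vocabulary:
        # Lookarounds prevent matching inside a longer word (e.g. "bootstrapping").
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        found = re.search(pattern, methods_text, flags=re.IGNORECASE)
        if found:
            hits.append((found.start(), term))
    return [term for _, term in sorted(hits)]


if __name__ == "__main__":
    text = ("Sequences were aligned with MUSCLE and trees were inferred "
            "by maximum likelihood with 100 bootstrap replicates.")
    vocab = ["maximum likelihood", "maximum parsimony", "MUSCLE",
             "bootstrap", "neighbor joining"]
    print(match_terms(text, vocab))
```

The matched terms would be offered to the user as candidate annotations to confirm or reject, rather than written into the record automatically.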
Standalone tool (GSOC?)
- create NeXML file
- TB NeXML upload (basic)
- TB NeXML: process methods annotations into text statement
February 18, 2011
present: Arlin, Jim L-M, Maryam (10:20?)
Context:
- Maryam available until mid-May
- project outcome could support ABI proposal in July
- could coordinate with possible GSOC proposal
discussion about ontology development: 2 mistaken presumptions
- encoding domain knowledge of experts is enough (wrong: experts literally don't know what they are talking about when it comes to key philosophical distinctions)
- proper ontology has only context-independent universals (wrong in practice; just creates an elaborate system of pseudo-universals)
driving biological problem or use-case
- pre-condition: all those trees out there
- 1. estimate species tree by combining gene trees (systematics use case)
- 2. identify orthologs or duplication histories using gene tree (mol biology use case)
so, let's imagine a user scenario
- pre-requisite: list of 8 species, user wants species tree with these, possibly some others
- user searches resources with list, gets hits
- subcase1: finds a species tree with all 8 species
- user may wish to prune if there are too many other species
- user is done
- subcase2: finds gene trees with all 8 species
- user may wish to select "best" tree
- user needs to run reconciliation software
- subcase3: finds a set of trees with overlapping sub-sets of species (e.g., ABCDE, CDEFG, EFGHI)
- this case calls for supertree construction
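The three sub-cases in this scenario amount to comparing the user's species list with the taxon set of each hit. Here is a minimal sketch, assuming each hit is a dictionary with the hypothetical keys `id`, `kind` ("species" or "gene") and `taxa`; no existing TreeBASE search API is implied.

```python
def classify_hits(query_taxa, hits):
    """Sort search hits into the sub-cases of the user scenario.

    Each hit is a dict with hypothetical keys "id", "kind" and "taxa".
    """
    query = set(query_taxa)
    result = {"species_tree_full": [], "gene_tree_full": [], "partial": []}
    for hit in hits:
        taxa = set(hit["taxa"])
        if query <= taxa:
            # Sub-cases 1 and 2: the tree contains every requested species.
            key = "species_tree_full" if hit["kind"] == "species" else "gene_tree_full"
            result[key].append(hit["id"])
        elif query & taxa:
            # Sub-case 3: only some requested species are present.
            result["partial"].append(hit["id"])
    return result


def partial_hits_cover_query(query_taxa, hits):
    """Return True if the given trees jointly contain every requested species."""
    query = set(query_taxa)
    covered = set()
    for hit in hits:
        covered |= set(hit["taxa"]) & query
    return covered == query


if __name__ == "__main__":
    query = list("ABCDEFGH")
    hits = [
        {"id": "t1", "kind": "species", "taxa": list("ABCDEFGHXY")},
        {"id": "t2", "kind": "gene", "taxa": list("ABCDEFGH")},
        {"id": "t3", "kind": "gene", "taxa": list("ABCDE")},
        {"id": "t4", "kind": "gene", "taxa": list("CDEFG")},
        {"id": "t5", "kind": "gene", "taxa": list("EFGHI")},
    ]
    print(classify_hits(query, hits))
    print(partial_hits_cover_query(query, hits[2:]))
```

Joint coverage is only a necessary condition for supertree construction, since the trees must also overlap with one another. For a full species-tree hit, the extra species to prune are simply the taxa not in the query.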
but (Jim says), we don't want to get bogged down in reconciliation
but (Arlin says), it may be sufficient (for demonstration purposes) to offer the user
- the right input trees for reconciliation
- a canned workflow description for reconciliation
- a third-party service that will execute the workflow description on the input trees
ok, we decide to pursue a simpler scenario
- pre-requisite: list of 1 gene, E. coli CAP
- user searches for tree with target gene, gets hits
- user chooses by criteria (method, bootstraps, etc.)
- user is done
The above could be done on a resource that aggregates from other resources (TreeBASE, Pandit, TreeFam, etc.). However, an even simpler use-case would be just to provide an interface to whatever useful information is in TreeBASE.
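For the simpler scenario, the "user chooses by criteria" step could be a plain filter over the hit records. This sketch assumes each record exposes the hypothetical fields `method` and `bootstrap_replicates`; what TreeBASE actually records about methods is exactly what the project would need to find out.

```python
def choose_trees(records, method=None, min_bootstrap_replicates=0):
    """Return records matching the criteria, most bootstrap replicates first.

    Records are dicts with hypothetical keys "id", "method" and
    "bootstrap_replicates" (treated as 0 when absent).
    """
    selected = [
        r for r in records
        if (method is None or r["method"].lower() == method.lower())
        and r.get("bootstrap_replicates", 0) >= min_bootstrap_replicates
    ]
    return sorted(selected, key=lambda r: r.get("bootstrap_replicates", 0), reverse=True)


if __name__ == "__main__":
    records = [
        {"id": "T1", "method": "maximum likelihood", "bootstrap_replicates": 100},
        {"id": "T2", "method": "neighbor joining", "bootstrap_replicates": 1000},
        {"id": "T3", "method": "Maximum Likelihood", "bootstrap_replicates": 500},
        {"id": "T4", "method": "maximum likelihood"},
    ]
    for rec in choose_trees(records, method="maximum likelihood", min_bootstrap_replicates=100):
        print(rec["id"], rec["bootstrap_replicates"])
```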
That's where we ran out of time. Next meeting: Friday, Feb 25, 10:00 am EST.