Reuse Cases
Revision as of 13:55, 1 August 2011
This page is for developing a list of use-cases for EvoIO- and MIAPA-relevant project planning. It is *largely incomplete*. We welcome anyone who wants to fill in some more cases, e.g., applications of supertrees in ecosystem analysis.
Scope
The use-cases should focus on re-use, which might mean replication, aggregation, re-purposing, meta-analysis, or integration (see below for a view of what these terms mean).
We want to enable and facilitate re-use of phylogenetic data and metadata, which isn't happening often enough. Because it isn't happening enough, it might be useful for us to consider hypothetical cases of re-use. However, even for hypothetical cases, it's very important to make every effort to document user needs, e.g., as expressed in published papers. For instance, before TimeTree existed, a resource to aggregate phylogenetic dates was a hypothetical re-use case, but the user need for placing dates on nodes of trees was not hypothetical and could be documented easily.
For comparison, some other use-case lists are available:
- A list of phylo-relevant use-cases was developed for the 2006 NESCent hackathon
- A different list oriented toward interoperability was developed for the 2009 hackathon
Some of the above use-cases (or variants of them) might be relevant here.
What constitutes re-use of data?
The primary consumer of a scientific product typically is the primary producer, e.g., an ecologist collects field observations and then uses these new observations to evaluate hypotheses or clarify patterns. "Re-use" refers to the case when there is a secondary consumer.
By "data", we mean information. That is, for the present purposes, data are coded information, as distinct from the material products of research such as specimens and samples. We do not mean data in the more restricted sense of fact or observation, although this type of empirical data may be the most likely to be re-used. Sharing data (unlike sharing materials) is an informatics problem.
The general category of data re-use may cover a large number of diverse cases described with terms such as replication, aggregation, re-purposing, meta-analysis, and integration. These do not seem to be distinct non-overlapping categories, but dimensions or qualities that may interpenetrate. For instance, Yampolsky & Stoltzfus (2005) combined data from 15 studies, comprising nearly 10,000 engineered amino acid exchanges, to generate the "EX" matrix of values representing the mean exchangeability from one amino acid to another. The authors clearly were secondary consumers: each underlying study was performed by the primary producer-consumers in order to map out regions of a protein most susceptible to amino acid changes. The study was described as a meta-analysis (in the sense of combining separate studies to address an issue beyond the scope or power of any individual study), and it clearly involves re-purposing (using results for a different goal), and aggregation (in the sense of combining results from multiple studies).
To reiterate, these terms (aggregation, re-purposing, etc.) do not seem to represent non-overlapping categories, but qualities or aspects that may apply in combinations. Here is one person's (AS) interpretation of the terms (for a different view, see Fig. 1 of Sidlauskas, et al., 2010):
- study replication means verifying results or conclusions of a published study by repeating it. Although the potential for study replication is integral to the self-policing nature of science, it happens only on the rare occasions when the published results of a study are perceived to be fraudulent or artefactual (e.g., in recent memory, the "memory water" and "directed mutations" cases).
- aggregation means gathering large numbers of results of a precisely defined type. Often the aggregator adds value in the process. The Sepkoski marine fossil data set is an example. TimeTree is an example that has more of a focus on making it easy for the user.
- meta-analysis means combining several separate analyses to address issues beyond the scope or power of a single analysis. This sometimes means a meta-statistical analysis (statistical meta-analysis), in which conclusions are based on combining, not the raw data from each study, but summary statistics from each study (e.g., means and variances) in a way that is sensitive to study design. Supertree methods (for assembling composite trees from separate overlapping trees) are analogous, in that they combine trees rather than the underlying character data. Sidlauskas, et al. use "meta-analysis" to refer to two studies that "synthesized the results of hundreds of previous studies" to show conclusively that climate change causes shifts in species distributions (something that individual studies could not establish conclusively).
- re-purposing means using the results of a study for a purpose other than that of the primary consumer.
- integration seems similar to "synthesis" but may have more of an implication of bringing together things that obviously belong together but have been kept separate for arbitrary reasons, e.g., combining data from different domains or different types of studies. This kind of integration depends on integrating variables or keys by which data from separate studies are combined. The integrating variable might be an accession number, a species name, a geographic location, etc.
- synthesis seems similar to "integration" but may have more of an implication of conceptual novelty and creativity, i.e., combining results in ways that were not imagined.
Note that aggregation and meta-analysis combine data from multiple studies of the same type. Synthesis and integration necessarily combine data from studies of different types. Study replication, by definition, deals with a single study.
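To make the idea of statistical meta-analysis concrete, here is a minimal sketch of one common way of combining summary statistics rather than raw data: fixed-effect inverse-variance weighting. It is only an illustration, not a method prescribed by this page. The function name `pool_fixed_effect` and the example numbers are hypothetical, and each study is assumed to report an estimate together with the variance of that estimate.

```python
def pool_fixed_effect(studies):
    """Pool (estimate, variance) pairs by inverse-variance weighting.

    Each study contributes in proportion to its precision (1 / variance),
    so larger or better-designed studies carry more weight.
    """
    weights = [1.0 / variance for _, variance in studies]
    total_weight = sum(weights)
    pooled_estimate = sum(
        weight * estimate for weight, (estimate, _) in zip(weights, studies)
    ) / total_weight
    pooled_variance = 1.0 / total_weight
    return pooled_estimate, pooled_variance


# Made-up summary statistics from three hypothetical studies.
example_studies = [(0.30, 0.04), (0.50, 0.01), (0.10, 0.09)]
pooled_estimate, pooled_variance = pool_fixed_effect(example_studies)
print(f"pooled estimate = {pooled_estimate:.3f}, variance = {pooled_variance:.4f}")
```

Note that only the per-study summaries are needed, which is why this kind of re-use depends on studies reporting estimates and variances in a recoverable form.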
About use-cases
A "use case" is a description, from the perspective of the user (not the developer), of a set of transactions intended to satisfy a particular category of user needs. Here is a formula for a use-case:
- Name and description - brief overview
- Motivation - why do researchers want to do this?
- Ideal procedure
- Preconditions - what does the user need to start with?
- Steps - what are the steps in a typical case?
- Outcomes - what outcomes satisfy user needs?
- Key challenges - what makes it difficult to do this today?
- References - who does this, or wants to do it?
List of use-cases
Supertree research
Name and description
A supertree is defined as an estimate of phylogeny assembled from smaller phylogenies. These partial phylogenies (or source trees) must have some taxa in common, but not necessarily all. Modern supertrees can contain hundreds or thousands of taxa and are constructed from hundreds of source phylogenies requiring the collection of large amounts of data. These phylogenies are of great use in, for example, comparative biology, and macroevolutionary studies (quoted verbatim from Davis & Hill, 2010 [1]).
Motivation
To test hypotheses that require phylogenies of such great scale and breadth that building a similarly sized (taxa-wise) phylogeny by conventional methods would be far too difficult, for a variety of reasons.
Typical procedure
- Preconditions
- 'Source trees': previously published hypotheses of evolutionary relationships for a group (taxa-wise) of choice. Generally only topology data are required, but this can vary depending on the exact method used.
- Steps (after Davis & Hill, 2010)
- 1. Data collection and entry
- 2. Standardisation of terminal taxa
- 3. Ensure source tree independence: Remove redundancy within the [meta]dataset that would otherwise unfairly up-weight data
- 4. Check adequate taxonomic overlap of source trees
- 5. Matrix creation: Create a matrix suitable for analysis
- Outcomes
- ???? A supertree estimate of phylogeny.
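The matrix-creation step can be sketched as follows, assuming the matrix representation with parsimony (MRP) approach named in the title of Beck et al. (2006) in the references below. This is a hedged illustration, not the procedure of Davis & Hill. All function names are hypothetical, the source trees are assumed to be topology-only Newick strings (no branch lengths, no internal node labels, no whitespace inside taxon names), and the all-zero outgroup row that MRP analyses usually add is omitted.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of taxon names."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and trailing ';'
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Return the set of taxon names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree, is_root=True):
    """Yield the leaf set of every internal node except the root."""
    if isinstance(tree, str):
        return
    if not is_root:
        yield leaves(tree)
    for child in tree:
        yield from clades(child, is_root=False)


def mrp_matrix(source_trees):
    """Code each clade of each source tree as one column: 1 = in the clade,
    0 = in the tree but outside the clade, ? = taxon absent from that tree."""
    parsed = [parse_newick(tree) for tree in source_trees]
    all_taxa = sorted(set().union(*(leaves(tree) for tree in parsed)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in parsed:
        sampled = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in sampled:
                    rows[taxon].append("?")
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(codes) for taxon, codes in rows.items()}


# Two made-up source trees that overlap in taxa B and C.
matrix = mrp_matrix(["((A,B),(C,D));", "((B,C),E);"])
for taxon, codes in matrix.items():
    print(taxon, codes)
```

The resulting matrix would then be analysed with a parsimony program to produce the supertree estimate. The "?" entries show why adequate taxonomic overlap (Step 4) matters: a source tree sharing no taxa with the others contributes columns that cannot constrain how its taxa relate to the rest.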
Key challenges
- The lack of digitally-available tree topology-data in a recognised/standardised format (e.g. Newick) for most phylogenetic studies that have ever been published.
- Most tree topologies are published (in their original papers) graphically, which hinders re-use and re-purposing.
- This problem is so widespread that ingenious methods (e.g. TreeSnatcher [2]) have been developed specifically to re-extract topology-data from published papers.
- No standardisation of taxon names between studies, hence Step 2 (above).
- Taxon sampling. Step 3 (above) is necessary because some taxa/groups are extremely 'popular' in phylogenetic studies, whilst others are only very rarely included.
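The last two challenges can be illustrated with a small sketch: mapping the names used in each study onto one accepted name, then counting how many source trees include each taxon to expose uneven sampling. The synonym table, the function names, and the taxon lists are all hypothetical, and real name reconciliation would need a curated taxonomic resource rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical synonym table: name as published -> accepted name.
SYNONYMS = {
    "Felis leo": "Panthera leo",
}


def standardise(name, synonyms=SYNONYMS):
    """Normalise spacing, underscores and capitalisation, then apply synonyms."""
    cleaned = " ".join(name.replace("_", " ").split())
    cleaned = cleaned[:1].upper() + cleaned[1:].lower()
    return synonyms.get(cleaned, cleaned)


def taxon_counts(source_taxon_lists):
    """Count in how many source trees each standardised taxon appears."""
    counts = Counter()
    for names in source_taxon_lists:
        # A set ensures a taxon is counted once per source tree.
        counts.update({standardise(name) for name in names})
    return counts


# Terminal taxa from three made-up source trees.
sources = [
    ["Felis leo", "Canis_lupus"],
    ["panthera_leo", "Ursus arctos"],
    ["Panthera leo"],
]
for taxon, count in taxon_counts(sources).most_common():
    print(taxon, count)
```

Without the standardisation step, the three spellings of the same species would be treated as three different terminals, which both hides the overlap between source trees and distorts the sampling counts.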
References
Beck, R. M. D., Bininda-Emonds, O. R. P., Cardillo, M., Liu, F.-G. R., and Purvis, A. 2006. A higher-level MRP supertree of placental mammals. BMC Evolutionary Biology 6:93. [3]
Bininda-Emonds, O. R. P., Gittleman, J. L., and Purvis, A. 1999. Building large trees by combining phylogenetic information: a complete phylogeny of the extant Carnivora (Mammalia). Biological Reviews 74:143–175. [4]
Cotton, J. and Wilkinson, M. 2009a. Supertrees join the mainstream of phylogenetics. Trends in Ecology & Evolution 24:1–3. [5]
Davies, T. J., Barraclough, T. G., Chase, M. W., Soltis, P. S., Soltis, D. E., and Savolainen, V. 2004. Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proceedings of the National Academy of Sciences of the United States of America 101:1904-1909. [6]
Davis, R. B., Baldauf, S. L., and Mayhew, P. J. 2010. Many hexapod groups originated earlier and withstood extinction events better than previously realized: inferences from supertrees. Proceedings of the Royal Society B: Biological Sciences 277:1597–1606. [7]
Lloyd, G. T., Davis, K. E., Pisani, D., Tarver, J. E., Ruta, M., Sakamoto, M., Hone, D. W. E., Jennings, R., and Benton, M. J. 2008. Dinosaurs and the Cretaceous Terrestrial Revolution. Proceedings of the Royal Society B: Biological Sciences 275:2483-2490. [8]
Ruta, M., Pisani, D., Lloyd, G. T., and Benton, M. J. 2007. A supertree of Temnospondyli: cladogenetic patterns in the most species-rich group of early tetrapods. Proceedings of the Royal Society B: Biological Sciences 274:3087-3095. [9]
Sanderson, M. J., Purvis, A., and Henze, C. 1998. Phylogenetic supertrees: assembling the trees of life. Trends in Ecology and Evolution 13:105-109. [10]
Thomas, G., Wills, M., and Szekely, T. 2004. A supertree approach to shorebird phylogeny. BMC Evolutionary Biology 4:28. [11]
Other meta-analyses (multiple phylodata re-use cases)
Name and description
A 'catch-all' basket group for other non-supertree meta-analyses utilizing many phylogenetic datasets.
Motivation
To test hypotheses over many independent datasets
Typical procedure
- Preconditions
- Published cladistic data matrices in useable electronic formats (e.g. nexus)
- Steps (varies depending on exact case)
- Outcomes - what outcomes satisfy user needs?
- ???? Interesting results
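As a sketch of why the precondition above matters, the following shows how little code is needed to re-use a matrix once it is available in a simple nexus layout. This is an illustration under strong assumptions, not a general nexus parser: the matrix is assumed to be non-interleaved, with `MATRIX` on its own line, one taxon per line, taxon names without spaces, no polymorphism brackets, and the closing `;` on its own line. The function names and the example data are hypothetical.

```python
def read_nexus_matrix(text):
    """Return {taxon: character string} from a simple, non-interleaved MATRIX block."""
    rows = {}
    in_matrix = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_matrix:
            # Start reading on the line after the MATRIX keyword.
            in_matrix = stripped.upper() == "MATRIX"
            continue
        if stripped.startswith(";"):
            break  # end of the MATRIX block
        if stripped:
            name, states = stripped.split(None, 1)
            rows[name] = states.replace(" ", "")
    return rows


def missing_fraction(rows, missing_symbol="?"):
    """Fraction of all cells in the matrix that are coded as missing."""
    cells = "".join(rows.values())
    return cells.count(missing_symbol) / len(cells)


# A made-up three-taxon, four-character matrix.
example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=4;
  FORMAT DATATYPE=STANDARD MISSING=? GAP=-;
  MATRIX
    Taxon_A 0101
    Taxon_B 1?01
    Taxon_C ??10
  ;
END;
"""
matrix = read_nexus_matrix(example)
print(matrix)
print(missing_fraction(matrix))
```

A meta-analysis would repeat a summary like this over many published matrices, which is only practical when the matrices can be obtained electronically rather than re-typed from printed tables.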
Key challenges
The scarcity of morphological cladistic datasets digitally available in standardised formats (e.g. nexus), particularly for non-botanical, non-mycological taxonomic groups. Relevant key challenges, barriers and the scale of the problem were outlined in a recent talk (Mounce, 2010 @ The 12th Young Systematists' Forum [15]). It is hard even to find relevant datasets: speaking from first-hand experience, a literature search for cladistic data returns many false positives (e.g. titles and abstracts that refer to "Systematics of..." yet contain no primary phylogenetic analysis) and false negatives (searching only for "morphological systematics" and "cladist*" will not find everything that is out there).
References
Cobbett, Andrea, Wilkinson, Mark, and Wills, Matthew A. (2007) Fossils impact as hard as living taxa in parsimony analyses of morphology. Systematic Biology 56(5):753-766. doi:10.1080/10635150701627296
Prevosti, Francisco J., Chemisquy, María A. (2010) The impact of missing data on real morphological phylogenies: influence of the number and distribution of missing entries. Cladistics 26(3):326-339. doi:10.1111/j.1096-0031.2009.00289.x
Song, Hojun, Bucheli, Sibyl R. (2010) Comparison of phylogenetic signal between male genitalia and non-genital characters in insect systematics. Cladistics 26(1):23-35 doi:10.1111/j.1096-0031.2009.00273.x
Mounce, R. C. P. and M. A. Wills (2010). Congruence between cranial and postcranial characters in vertebrate systematics. Proceedings of the 29th Annual Meeting of the Willi Hennig Society. http://www.citeulike.org/user/rossmounce/article/7853861
Liow, Lee Hsiang (2007) Lineages with long durations are old and morphologically average: an analysis using multiple datasets. Evolution 61(4):885-901. doi:10.1111/j.1558-5646.2007.00077.x
Interestingly, this last study uses phylogenetic data (cladistic data matrices of morphological characters) to create a 'morphospace' representation. It involves no actual phylogenetic analysis per se, but ample re-use of phylogenetic datasets.
Another use-case
Name and description
brief overview
Motivation
why do researchers want to do this?
Typical procedure
- Preconditions - what does the user need to start with?
- Steps - what are the steps in a typical case?
- Outcomes - what outcomes satisfy user needs?
Key challenges
what makes it difficult to do this today?
References
who does this, or wants to do it?