Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques

Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often...

Bibliographic Details
Main authors:	Gandy, Lisa M., Gumm, Jordan, Fertig, Benjamin, Thessen, Anne, Kennish, Michael J., Chavan, Sameer, Marchionni, Luigi, Xia, Xiaoxin, Shankrit, Shambhavi, Fertig, Elana J.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2017
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402950/ https://www.ncbi.nlm.nih.gov/pubmed/28437440 http://dx.doi.org/10.1371/journal.pone.0175860

_version_	1783231332792401920
author	Gandy, Lisa M. Gumm, Jordan Fertig, Benjamin Thessen, Anne Kennish, Michael J. Chavan, Sameer Marchionni, Luigi Xia, Xiaoxin Shankrit, Shambhavi Fertig, Elana J.
author_facet	Gandy, Lisa M. Gumm, Jordan Fertig, Benjamin Thessen, Anne Kennish, Michael J. Chavan, Sameer Marchionni, Luigi Xia, Xiaoxin Shankrit, Shambhavi Fertig, Elana J.
author_sort	Gandy, Lisa M.
collection	PubMed
description	Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information retrieval inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85–100%). We further implement Synthesize in an open source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy to use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases.
format	Online Article Text
id	pubmed-5402950
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-54029502017-05-12 Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques Gandy, Lisa M. Gumm, Jordan Fertig, Benjamin Thessen, Anne Kennish, Michael J. Chavan, Sameer Marchionni, Luigi Xia, Xiaoxin Shankrit, Shambhavi Fertig, Elana J. PLoS One Research Article Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information retrieval inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Application of the Synthesize algorithm to cancer and ecological datasets had high accuracy (on the order of 85–100%). We further implement Synthesize in an open source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma separated value (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy to use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases. 
Public Library of Science 2017-04-24 /pmc/articles/PMC5402950/ /pubmed/28437440 http://dx.doi.org/10.1371/journal.pone.0175860 Text en © 2017 Gandy et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Gandy, Lisa M. Gumm, Jordan Fertig, Benjamin Thessen, Anne Kennish, Michael J. Chavan, Sameer Marchionni, Luigi Xia, Xiaoxin Shankrit, Shambhavi Fertig, Elana J. Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title	Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title_full	Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title_fullStr	Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title_full_unstemmed	Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title_short	Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques
title_sort	synthesizer: expediting synthesis studies from context-free data with information retrieval techniques
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402950/ https://www.ncbi.nlm.nih.gov/pubmed/28437440 http://dx.doi.org/10.1371/journal.pone.0175860
work_keys_str_mv	AT gandylisam synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT gummjordan synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT fertigbenjamin synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT thessenanne synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT kennishmichaelj synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT chavansameer synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT marchionniluigi synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT xiaxiaoxin synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT shankritshambhavi synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques AT fertigelanaj synthesizerexpeditingsynthesisstudiesfromcontextfreedatawithinformationretrievaltechniques
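
The description field says the algorithm merges unstructured spreadsheets "based on both column labels and values". The sketch below illustrates that general idea only; it is not the Synthesize algorithm from the article. The function names, the equal weighting of label and value similarity, and the 0.5 threshold are all assumptions made for illustration, and the two tiny CSV inputs are invented.

```python
import csv
import io
from difflib import SequenceMatcher


def read_columns(csv_text):
    """Parse CSV text into {column label: list of cell values}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {label: [row[label] for row in rows] for label in rows[0]}


def label_similarity(a, b):
    """Character-level similarity of two column labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_similarity(xs, ys):
    """Jaccard overlap of the distinct (normalized) values in two columns."""
    sx = {x.strip().lower() for x in xs}
    sy = {y.strip().lower() for y in ys}
    if not sx or not sy:
        return 0.0
    return len(sx & sy) / len(sx | sy)


def match_columns(left, right, weight=0.5, threshold=0.5):
    """Map each left column to its best-scoring right column, if any.

    The score is a weighted mix of label and value similarity. Both
    `weight` and `threshold` are hypothetical tuning knobs, not values
    taken from the article. One-to-one matching is not enforced.
    """
    matches = {}
    for l_label, l_vals in left.items():
        best, best_score = None, threshold
        for r_label, r_vals in right.items():
            score = (weight * label_similarity(l_label, r_label)
                     + (1 - weight) * value_similarity(l_vals, r_vals))
            if score >= best_score:
                best, best_score = r_label, score
        if best is not None:
            matches[l_label] = best
    return matches


# Two invented spreadsheets that label the same variables differently.
study_a = "Sex,Tumor Stage\nmale,II\nfemale,III\n"
study_b = "gender,stage\nfemale,III\nmale,II\n"

matches = match_columns(read_columns(study_a), read_columns(study_b))
print(matches)
```

In this toy input, "Sex" and "gender" share few characters, so the pairing depends on the overlap in cell values. That is the kind of case where using values in addition to labels helps. Value overlap of this kind suits categorical columns only; the description notes that merging continuous data is left to future work.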
