
Towards interoperable and reproducible QSAR analyses: Exchange of datasets

BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important...


Bibliographic Details
Main authors: Spjuth, Ola, Willighagen, Egon L, Guha, Rajarshi, Eklund, Martin, Wikberg, Jarl ES
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909924/
https://www.ncbi.nlm.nih.gov/pubmed/20591161
http://dx.doi.org/10.1186/1758-2946-2-5
_version_ 1782184328340439040
author Spjuth, Ola
Willighagen, Egon L
Guha, Rajarshi
Eklund, Martin
Wikberg, Jarl ES
author_sort Spjuth, Ola
collection PubMed
description BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. CONCLUSIONS: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. 
This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
format Text
id pubmed-2909924
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-29099242010-07-27 Towards interoperable and reproducible QSAR analyses: Exchange of datasets Spjuth, Ola Willighagen, Egon L Guha, Rajarshi Eklund, Martin Wikberg, Jarl ES J Cheminform Methodology BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. CONCLUSIONS: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. 
QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community. BioMed Central 2010-06-30 /pmc/articles/PMC2909924/ /pubmed/20591161 http://dx.doi.org/10.1186/1758-2946-2-5 Text en Copyright ©2010 Spjuth et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Towards interoperable and reproducible QSAR analyses: Exchange of datasets
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909924/
https://www.ncbi.nlm.nih.gov/pubmed/20591161
http://dx.doi.org/10.1186/1758-2946-2-5