Towards interoperable and reproducible QSAR analyses: Exchange of datasets
BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important...
Main authors: | Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl ES |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central, 2010 |
Subjects: | Methodology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909924/ https://www.ncbi.nlm.nih.gov/pubmed/20591161 http://dx.doi.org/10.1186/1758-2946-2-5 |
_version_ | 1782184328340439040 |
---|---|
author | Spjuth, Ola Willighagen, Egon L Guha, Rajarshi Eklund, Martin Wikberg, Jarl ES |
author_facet | Spjuth, Ola Willighagen, Egon L Guha, Rajarshi Eklund, Martin Wikberg, Jarl ES |
author_sort | Spjuth, Ola |
collection | PubMed |
description | BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. CONCLUSIONS: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community. |
format | Text |
id | pubmed-2909924 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2909924 2010-07-27 Towards interoperable and reproducible QSAR analyses: Exchange of datasets Spjuth, Ola Willighagen, Egon L Guha, Rajarshi Eklund, Martin Wikberg, Jarl ES J Cheminform Methodology BACKGROUND: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. CONCLUSIONS: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community. BioMed Central 2010-06-30 /pmc/articles/PMC2909924/ /pubmed/20591161 http://dx.doi.org/10.1186/1758-2946-2-5 Text en Copyright ©2010 Spjuth et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Spjuth, Ola Willighagen, Egon L Guha, Rajarshi Eklund, Martin Wikberg, Jarl ES Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title | Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title_full | Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title_fullStr | Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title_full_unstemmed | Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title_short | Towards interoperable and reproducible QSAR analyses: Exchange of datasets |
title_sort | towards interoperable and reproducible qsar analyses: exchange of datasets |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909924/ https://www.ncbi.nlm.nih.gov/pubmed/20591161 http://dx.doi.org/10.1186/1758-2946-2-5 |
work_keys_str_mv | AT spjuthola towardsinteroperableandreproducibleqsaranalysesexchangeofdatasets AT willighagenegonl towardsinteroperableandreproducibleqsaranalysesexchangeofdatasets AT guharajarshi towardsinteroperableandreproducibleqsaranalysesexchangeofdatasets AT eklundmartin towardsinteroperableandreproducibleqsaranalysesexchangeofdatasets AT wikbergjarles towardsinteroperableandreproducibleqsaranalysesexchangeofdatasets |