Observation selection bias in contact prediction and its implications for structural bioinformatics
Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for u...
Main Authors: | Orlando, G., Raimondi, D., Vranken, W. F. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group 2016 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5114557/ https://www.ncbi.nlm.nih.gov/pubmed/27857150 http://dx.doi.org/10.1038/srep36679 |
_version_ | 1782468360000241664 |
---|---|
author | Orlando, G. Raimondi, D. Vranken, W. F. |
author_facet | Orlando, G. Raimondi, D. Vranken, W. F. |
author_sort | Orlando, G. |
collection | PubMed |
description | Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but for exactly those proteins significantly more homologs are available than for less studied sequences randomly extracted from UniProt. Structural bioinformatics methods that were developed this way are thus likely to have over-estimated performance; we demonstrate this for two contact prediction methods, where performance drops by up to 60% when taking into account a more realistic amount of evolutionary information. We provide a bias-free dataset for the validation of contact prediction methods called NOUMENON. |
format | Online Article Text |
id | pubmed-5114557 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-51145572016-11-25 Observation selection bias in contact prediction and its implications for structural bioinformatics Orlando, G. Raimondi, D. Vranken, W. F. Sci Rep Article Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but exactly for those proteins significantly more homologs are available compared to less studied sequences randomly extracted from Uniprot. Structural bioinformatics methods that were developed this way are thus likely to have over-estimated performances; we demonstrate this for two contact prediction methods, where performances drop up to 60% when taking into account a more realistic amount of evolutionary information. We provide a bias-free dataset for the validation for contact prediction methods called NOUMENON. Nature Publishing Group 2016-11-18 /pmc/articles/PMC5114557/ /pubmed/27857150 http://dx.doi.org/10.1038/srep36679 Text en Copyright © 2016, The Author(s) http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. 
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
spellingShingle | Article Orlando, G. Raimondi, D. Vranken, W. F. Observation selection bias in contact prediction and its implications for structural bioinformatics |
title | Observation selection bias in contact prediction and its implications for structural bioinformatics |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5114557/ https://www.ncbi.nlm.nih.gov/pubmed/27857150 http://dx.doi.org/10.1038/srep36679 |