Observation selection bias in contact prediction and its implications for structural bioinformatics

Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for u...

Bibliographic Details
Main authors: Orlando, G., Raimondi, D., Vranken, W. F.
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5114557/
https://www.ncbi.nlm.nih.gov/pubmed/27857150
http://dx.doi.org/10.1038/srep36679
_version_ 1782468360000241664
author Orlando, G.
Raimondi, D.
Vranken, W. F.
author_facet Orlando, G.
Raimondi, D.
Vranken, W. F.
author_sort Orlando, G.
collection PubMed
description Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but it is exactly for those proteins that significantly more homologs are available than for less studied sequences randomly extracted from UniProt. Structural bioinformatics methods that were developed this way are thus likely to have overestimated performance; we demonstrate this for two contact prediction methods, where performance drops by up to 60% when taking into account a more realistic amount of evolutionary information. We provide a bias-free dataset, called NOUMENON, for the validation of contact prediction methods.
format Online
Article
Text
id pubmed-5114557
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-51145572016-11-25 Observation selection bias in contact prediction and its implications for structural bioinformatics Orlando, G. Raimondi, D. Vranken, W. F. Sci Rep Article Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but exactly for those proteins significantly more homologs are available compared to less studied sequences randomly extracted from Uniprot. Structural bioinformatics methods that were developed this way are thus likely to have over-estimated performances; we demonstrate this for two contact prediction methods, where performances drop up to 60% when taking into account a more realistic amount of evolutionary information. We provide a bias-free dataset for the validation for contact prediction methods called NOUMENON. Nature Publishing Group 2016-11-18 /pmc/articles/PMC5114557/ /pubmed/27857150 http://dx.doi.org/10.1038/srep36679 Text en Copyright © 2016, The Author(s) http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. 
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
spellingShingle Article
Orlando, G.
Raimondi, D.
Vranken, W. F.
Observation selection bias in contact prediction and its implications for structural bioinformatics
title Observation selection bias in contact prediction and its implications for structural bioinformatics
title_sort observation selection bias in contact prediction and its implications for structural bioinformatics
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5114557/
https://www.ncbi.nlm.nih.gov/pubmed/27857150
http://dx.doi.org/10.1038/srep36679
work_keys_str_mv AT orlandog observationselectionbiasincontactpredictionanditsimplicationsforstructuralbioinformatics
AT raimondid observationselectionbiasincontactpredictionanditsimplicationsforstructuralbioinformatics
AT vrankenwf observationselectionbiasincontactpredictionanditsimplicationsforstructuralbioinformatics