
Natural language processing in text mining for structural modeling of protein complexes

BACKGROUND: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the bi...


Bibliographic Details
Main Authors:	Badal, Varsha D., Kundrotas, Petras J., Vakser, Ilya A.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838950/ https://www.ncbi.nlm.nih.gov/pubmed/29506465 http://dx.doi.org/10.1186/s12859-018-2079-4

_version_	1783304338251186176
author	Badal, Varsha D. Kundrotas, Petras J. Vakser, Ilya A.
author_facet	Badal, Varsha D. Kundrotas, Petras J. Vakser, Ilya A.
author_sort	Badal, Varsha D.
collection	PubMed
description	BACKGROUND: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking. RESULTS: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP. 
CONCLUSIONS: The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2079-4) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5838950
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5838950 2018-03-09 Natural language processing in text mining for structural modeling of protein complexes Badal, Varsha D. Kundrotas, Petras J. Vakser, Ilya A. BMC Bioinformatics Methodology Article BACKGROUND: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking. RESULTS: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation.
The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP. CONCLUSIONS: The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2079-4) contains supplementary material, which is available to authorized users. BioMed Central 2018-03-05 /pmc/articles/PMC5838950/ /pubmed/29506465 http://dx.doi.org/10.1186/s12859-018-2079-4 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article Badal, Varsha D. Kundrotas, Petras J. Vakser, Ilya A. Natural language processing in text mining for structural modeling of protein complexes
title	Natural language processing in text mining for structural modeling of protein complexes
title_full	Natural language processing in text mining for structural modeling of protein complexes
title_fullStr	Natural language processing in text mining for structural modeling of protein complexes
title_full_unstemmed	Natural language processing in text mining for structural modeling of protein complexes
title_short	Natural language processing in text mining for structural modeling of protein complexes
title_sort	natural language processing in text mining for structural modeling of protein complexes
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838950/ https://www.ncbi.nlm.nih.gov/pubmed/29506465 http://dx.doi.org/10.1186/s12859-018-2079-4
work_keys_str_mv	AT badalvarshad naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes AT kundrotaspetrasj naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes AT vakserilyaa naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes
