Identifying the missing proteins in human proteome by biological language model

BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics research has become a popular field in the post-genomic era. It is necessary to identify all the natively encoded protein sequences for further function and pathway analysis. Toward that end, the Human Proteo...

Bibliographic Details
Main Authors: Dong, Qiwen, Wang, Kai, Liu, Xuan
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259966/
https://www.ncbi.nlm.nih.gov/pubmed/28155671
http://dx.doi.org/10.1186/s12918-016-0352-6
author Dong, Qiwen
Wang, Kai
Liu, Xuan
author_sort Dong, Qiwen
collection PubMed
description BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics research has become a popular field in the post-genomic era. It is necessary to identify all the natively encoded protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Because detecting these missing proteins with wet-lab approaches is complicated, here we use a bioinformatics method to pre-filter the missing proteins. RESULTS: Because biological sequences are analogous to natural language, n-gram models from the field of natural language processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the “uncertain” category of the neXtProt database. The n-gram model identified 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases. CONCLUSION: The analysis shows that 102 proteins may be native gene-coding proteins, and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.
format Online
Article
Text
id pubmed-5259966
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5259966 2017-01-26 Identifying the missing proteins in human proteome by biological language model Dong, Qiwen Wang, Kai Liu, Xuan BMC Syst Biol Research BioMed Central 2016-12-23 /pmc/articles/PMC5259966/ /pubmed/28155671 http://dx.doi.org/10.1186/s12918-016-0352-6 Text en © The Author(s). 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Dong, Qiwen
Wang, Kai
Liu, Xuan
Identifying the missing proteins in human proteome by biological language model
title Identifying the missing proteins in human proteome by biological language model
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259966/
https://www.ncbi.nlm.nih.gov/pubmed/28155671
http://dx.doi.org/10.1186/s12918-016-0352-6
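
The abstract in this record describes filtering candidate proteins with n-gram models borrowed from natural language processing. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes a bigram model with add-one smoothing over the 20 standard amino acids, scores a sequence by its average log-probability, and uses invented toy sequences. The function names, the choice of n = 2, and the toy data are all hypothetical.

```python
import math
from collections import Counter

# Assumption: only the 20 standard amino acids are modelled.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram_model(sequences, n=2):
    """Count n-grams and their (n-1)-length contexts in reference sequences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def average_log_prob(seq, model, n=2):
    """Average add-one-smoothed log-probability per n-gram of `seq`."""
    ngram_counts, context_counts = model
    vocab = len(AMINO_ACIDS)
    total, count = 0.0, 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # Add-one (Laplace) smoothing keeps unseen n-grams at non-zero probability.
        p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab)
        total += math.log(p)
        count += 1
    return total / count if count else float("-inf")


# Toy, invented sequences standing in for a reference set of known proteins.
reference = ["MKVLAAGIVGLLLA", "MKTLLAAGLVAL"]
model = train_ngram_model(reference)

similar = average_log_prob("MKVLAAGL", model)      # resembles the reference set
dissimilar = average_log_prob("WWCCHHWWCC", model)  # shares no bigrams with it
print(similar > dissimilar)
```

In a setting like the one the abstract describes, candidates whose score exceeds some cutoff would be kept as likely native proteins; how that cutoff and the training set are chosen is not stated in this record.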