Identifying the missing proteins in human proteome by biological language model
BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics research has become a prominent field in the post-genomics era. It is necessary to identify all the natively encoded protein sequences for further function and pathway analysis. Toward that end, the Human Proteo...
Main Authors: | Dong, Qiwen; Wang, Kai; Liu, Xuan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259966/ https://www.ncbi.nlm.nih.gov/pubmed/28155671 http://dx.doi.org/10.1186/s12918-016-0352-6 |
_version_ | 1782499313493999616 |
---|---|
author | Dong, Qiwen Wang, Kai Liu, Xuan |
author_facet | Dong, Qiwen Wang, Kai Liu, Xuan |
author_sort | Dong, Qiwen |
collection | PubMed |
description | BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics research has become a prominent field in the post-genomics era. It is necessary to identify all the natively encoded protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks in the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins. RESULTS: Since there is an analogy between biological sequences and natural language, n-gram models from the Natural Language Processing field have been used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the “uncertain” category of the neXtProt database. The n-gram model identifies 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases. CONCLUSION: The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods. |
format | Online Article Text |
id | pubmed-5259966 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5259966 2017-01-26 Identifying the missing proteins in human proteome by biological language model Dong, Qiwen Wang, Kai Liu, Xuan BMC Syst Biol Research BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics research has become a prominent field in the post-genomics era. It is necessary to identify all the natively encoded protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks in the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins. RESULTS: Since there is an analogy between biological sequences and natural language, n-gram models from the Natural Language Processing field have been used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the “uncertain” category of the neXtProt database. The n-gram model identifies 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases. CONCLUSION: The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods. BioMed Central 2016-12-23 /pmc/articles/PMC5259966/ /pubmed/28155671 http://dx.doi.org/10.1186/s12918-016-0352-6 Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Dong, Qiwen Wang, Kai Liu, Xuan Identifying the missing proteins in human proteome by biological language model |
title | Identifying the missing proteins in human proteome by biological language model |
title_full | Identifying the missing proteins in human proteome by biological language model |
title_fullStr | Identifying the missing proteins in human proteome by biological language model |
title_full_unstemmed | Identifying the missing proteins in human proteome by biological language model |
title_short | Identifying the missing proteins in human proteome by biological language model |
title_sort | identifying the missing proteins in human proteome by biological language model |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5259966/ https://www.ncbi.nlm.nih.gov/pubmed/28155671 http://dx.doi.org/10.1186/s12918-016-0352-6 |
work_keys_str_mv | AT dongqiwen identifyingthemissingproteinsinhumanproteomebybiologicallanguagemodel AT wangkai identifyingthemissingproteinsinhumanproteomebybiologicallanguagemodel AT liuxuan identifyingthemissingproteinsinhumanproteomebybiologicallanguagemodel |
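
As an illustration of the n-gram filtering idea summarized in the description field, the following minimal sketch scores candidate sequences with a bigram model over the 20 standard amino acids. It is not the authors' implementation: the model order, the add-one smoothing, the toy training and candidate sequences, and the names `train_bigram` and `avg_log_prob` are all assumptions made for illustration. The record does not state how the study trained its model or chose a threshold.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes); used as the vocabulary size.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_bigram(sequences):
    """Count residue bigrams and their left-hand contexts (hypothetical helper)."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    return pair_counts, context_counts


def avg_log_prob(seq, model, alpha=1.0):
    """Average per-transition log-probability of seq under the bigram model.

    Uses additive (Laplace) smoothing with strength alpha, which is an
    assumption of this sketch, so unseen bigrams get a small nonzero probability.
    """
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pair_counts, context_counts = model
    vocab = len(AMINO_ACIDS)
    total = 0.0
    for left, right in zip(seq, seq[1:]):
        prob = (pair_counts[(left, right)] + alpha) / (
            context_counts[left] + alpha * vocab
        )
        total += math.log(prob)
    return total / (len(seq) - 1)


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    known = ["MKTAYIAKQR", "MKTLLLTLVV", "MKVAYIAKQL"]
    model = train_bigram(known)
    for candidate in ["MKTAYIAK", "WWCCHHWW"]:
        print(candidate, round(avg_log_prob(candidate, model), 3))
```

A candidate whose score is closer to that of known proteins would be treated as more likely to be a native protein. Any cutoff used to separate the two groups would have to be calibrated on real data and is not implied by this sketch.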