Representation learning applications in biological sequence analysis

Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to...

Bibliographic Details
Main authors: Iuchi, Hitoshi, Matsutani, Taro, Yamada, Keisuke, Iwano, Natsuki, Sumi, Shunsuke, Hosoda, Shion, Zhao, Shitao, Fukunaga, Tsukasa, Hamada, Michiaki
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8190442/
https://www.ncbi.nlm.nih.gov/pubmed/34141139
http://dx.doi.org/10.1016/j.csbj.2021.05.039
author Iuchi, Hitoshi
Matsutani, Taro
Yamada, Keisuke
Iwano, Natsuki
Sumi, Shunsuke
Hosoda, Shion
Zhao, Shitao
Fukunaga, Tsukasa
Hamada, Michiaki
collection PubMed
description Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
format Online
Article
Text
id pubmed-8190442
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-8190442 2021-06-16 Representation learning applications in biological sequence analysis Comput Struct Biotechnol J Review Article Research Network of Computational and Structural Biotechnology 2021-05-23 /pmc/articles/PMC8190442/ /pubmed/34141139 http://dx.doi.org/10.1016/j.csbj.2021.05.039 Text en © 2021 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Representation learning applications in biological sequence analysis
topic Review Article