Active learning for efficient analysis of high-throughput nanopore data
MOTIVATION: As a third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning to analyze the enormous volume of data generated by nanopores. Unfortunately, the suc...
Main Authors: | Guan, Xiaoyu; Li, Zhongnian; Zhou, Yueying; Shao, Wei; Zhang, Daoqiang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/ https://www.ncbi.nlm.nih.gov/pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764 |
_version_ | 1784866687983550464 |
---|---|
author | Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang |
author_facet | Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang |
author_sort | Guan, Xiaoyu |
collection | PubMed |
description | MOTIVATION: As a third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning to analyze the enormous volume of data generated by nanopores. Unfortunately, the success of machine learning depends on extensive labeled data, which are often very labor-intensive to produce. Therefore, there is an urgent need for a technology that can not only analyze nanopore data rapidly and at high throughput, but also significantly reduce the cost of labeling. To achieve these goals, we introduce active learning, which alleviates the labor cost by selecting the samples that need to be labeled. This work applies several advanced active learning methods to nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD). Because nanopore data are complex (they contain noise sequences), a bias constraint is introduced to improve the sample selection strategy in active learning. RESULTS: The experimental results show that, for the same performance metric, labeling 50% of the samples is enough to reach the best baseline performance on ONT-BD, while labeling only 15% is enough on RNA-CD. Crucially, the experiments show that active learning can assist experts in labeling samples and significantly reduce the labeling cost. Active learning can greatly ease the difficulty of labeling high-volume nanopore data. We hope active learning can be applied to other problems in nanopore sequence analysis. AVAILABILITY AND IMPLEMENTATION: The main program is available at https://github.com/guanxiaoyu11/AL-for-nanopore. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9825740 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9825740 2023-01-10 Active learning for efficient analysis of high-throughput nanopore data Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang Bioinformatics Original Paper Oxford University Press 2022-11-29 /pmc/articles/PMC9825740/ /pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang Active learning for efficient analysis of high-throughput nanopore data |
title | Active learning for efficient analysis of high-throughput nanopore data |
title_full | Active learning for efficient analysis of high-throughput nanopore data |
title_fullStr | Active learning for efficient analysis of high-throughput nanopore data |
title_full_unstemmed | Active learning for efficient analysis of high-throughput nanopore data |
title_short | Active learning for efficient analysis of high-throughput nanopore data |
title_sort | active learning for efficient analysis of high-throughput nanopore data |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/ https://www.ncbi.nlm.nih.gov/pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764 |
work_keys_str_mv | AT guanxiaoyu activelearningforefficientanalysisofhighthroughputnanoporedata AT lizhongnian activelearningforefficientanalysisofhighthroughputnanoporedata AT zhouyueying activelearningforefficientanalysisofhighthroughputnanoporedata AT shaowei activelearningforefficientanalysisofhighthroughputnanoporedata AT zhangdaoqiang activelearningforefficientanalysisofhighthroughputnanoporedata |
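
The abstract in this record describes active learning as choosing which unlabeled samples an expert should label next. As a rough illustration only, the sketch below shows generic entropy-based uncertainty sampling, a common active learning baseline. It is not the bias-constrained selection strategy from the paper, and the function names, the probability values, and the `budget` parameter are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled samples.

    pool_probs: one class-probability list per unlabeled sample, as produced
    by any probabilistic classifier (hypothetical input, not from the paper).
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy (least confident) first
    )
    return ranked[:budget]

# Made-up predictions for four unlabeled reads over three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, so most uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
picked = select_for_labeling(pool_probs, budget=2)
print(picked)  # indices of the samples to send to the expert
```

In a full loop, the selected samples would be labeled, moved into the training set, and the classifier retrained before the next selection round.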