Active learning for efficient analysis of high-throughput nanopore data

MOTIVATION: As the third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning technology to analyze the enormous data generated by nanopores. Unfortunately, the suc...

Bibliographic Details
Main Authors: Guan, Xiaoyu; Li, Zhongnian; Zhou, Yueying; Shao, Wei; Zhang, Daoqiang
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/
https://www.ncbi.nlm.nih.gov/pubmed/36445037
http://dx.doi.org/10.1093/bioinformatics/btac764
author Guan, Xiaoyu
Li, Zhongnian
Zhou, Yueying
Shao, Wei
Zhang, Daoqiang
author_sort Guan, Xiaoyu
collection PubMed
description MOTIVATION: As the third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning technology to analyze the enormous data generated by nanopores. Unfortunately, the success of this technology is due to the extensive labeled data, which often suffer from enormous labor costs. Therefore, there is an urgent need for a novel technology that can not only rapidly analyze nanopore data with high-throughput, but also significantly reduce the cost of labeling. To achieve the above goals, we introduce active learning to alleviate the enormous labor costs by selecting the samples that need to be labeled. This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD). Due to the complexity of the nanopore data (with noise sequence), the bias constraint is introduced to improve the sample selection strategy in active learning. RESULTS: The experimental results show that for the same performance metric, 50% labeling amount can achieve the best baseline performance for ONT-BD, while only 15% labeling amount can achieve the best baseline performance for RNA-CD. Crucially, the experiments show that active learning technology can assist experts in labeling samples, and significantly reduce the labeling cost. Active learning can greatly reduce the dilemma of difficult labeling of high-capacity nanopore data. We hope active learning can be applied to other problems in nanopore sequence analysis. AVAILABILITY AND IMPLEMENTATION: The main program is available at https://github.com/guanxiaoyu11/AL-for-nanopore. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9825740
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-98257402023-01-10 Active learning for efficient analysis of high-throughput nanopore data Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang Bioinformatics Original Paper MOTIVATION: As the third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning technology to analyze the enormous data generated by nanopores. Unfortunately, the success of this technology is due to the extensive labeled data, which often suffer from enormous labor costs. Therefore, there is an urgent need for a novel technology that can not only rapidly analyze nanopore data with high-throughput, but also significantly reduce the cost of labeling. To achieve the above goals, we introduce active learning to alleviate the enormous labor costs by selecting the samples that need to be labeled. This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD). Due to the complexity of the nanopore data (with noise sequence), the bias constraint is introduced to improve the sample selection strategy in active learning. Results: The experimental results show that for the same performance metric, 50% labeling amount can achieve the best baseline performance for ONT-BD, while only 15% labeling amount can achieve the best baseline performance for RNA-CD. Crucially, the experiments show that active learning technology can assist experts in labeling samples, and significantly reduce the labeling cost. Active learning can greatly reduce the dilemma of difficult labeling of high-capacity nanopore data. We hope active learning can be applied to other problems in nanopore sequence analysis. AVAILABILITY AND IMPLEMENTATION: The main program is available at https://github.com/guanxiaoyu11/AL-for-nanopore. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-11-29 /pmc/articles/PMC9825740/ /pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Active learning for efficient analysis of high-throughput nanopore data
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/
https://www.ncbi.nlm.nih.gov/pubmed/36445037
http://dx.doi.org/10.1093/bioinformatics/btac764