Active learning for efficient analysis of high-throughput nanopore data

MOTIVATION: As a third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning to analyze the enormous volume of data generated by nanopores. Unfortunately, the suc...

Bibliographic Details
Main authors:	Guan, Xiaoyu; Li, Zhongnian; Zhou, Yueying; Shao, Wei; Zhang, Daoqiang
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/ https://www.ncbi.nlm.nih.gov/pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764

_version_	1784866687983550464
author	Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang
author_facet	Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang
author_sort	Guan, Xiaoyu
collection	PubMed
description	MOTIVATION: As a third-generation sequencing technology, nanopore sequencing has been used for high-throughput sequencing of DNA, RNA, and even proteins. Recently, many studies have begun to use machine learning to analyze the enormous volume of data generated by nanopores. Unfortunately, the success of machine learning depends on extensive labeled data, which often incurs enormous labor costs. Therefore, there is an urgent need for a technique that can not only analyze high-throughput nanopore data rapidly, but also significantly reduce the cost of labeling. To achieve these goals, we introduce active learning, which alleviates the labor cost by selecting the samples that need to be labeled. This work applies several advanced active learning techniques to nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD). Because nanopore data are complex (they contain noise sequences), a bias constraint is introduced to improve the sample selection strategy in active learning. RESULTS: The experimental results show that, for the same performance metric, labeling 50% of the samples achieves the best baseline performance on ONT-BD, while labeling only 15% achieves the best baseline performance on RNA-CD. Crucially, the experiments show that active learning can assist experts in labeling samples and significantly reduce the labeling cost. Active learning can greatly ease the difficulty of labeling high-capacity nanopore data. We hope active learning can be applied to other problems in nanopore sequence analysis. AVAILABILITY AND IMPLEMENTATION: The main program is available at https://github.com/guanxiaoyu11/AL-for-nanopore. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9825740
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9825740 2023-01-10 Active learning for efficient analysis of high-throughput nanopore data Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang Bioinformatics Original Paper Oxford University Press 2022-11-29 /pmc/articles/PMC9825740/ /pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Guan, Xiaoyu Li, Zhongnian Zhou, Yueying Shao, Wei Zhang, Daoqiang Active learning for efficient analysis of high-throughput nanopore data
title	Active learning for efficient analysis of high-throughput nanopore data
title_full	Active learning for efficient analysis of high-throughput nanopore data
title_fullStr	Active learning for efficient analysis of high-throughput nanopore data
title_full_unstemmed	Active learning for efficient analysis of high-throughput nanopore data
title_short	Active learning for efficient analysis of high-throughput nanopore data
title_sort	active learning for efficient analysis of high-throughput nanopore data
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825740/ https://www.ncbi.nlm.nih.gov/pubmed/36445037 http://dx.doi.org/10.1093/bioinformatics/btac764
work_keys_str_mv	AT guanxiaoyu activelearningforefficientanalysisofhighthroughputnanoporedata AT lizhongnian activelearningforefficientanalysisofhighthroughputnanoporedata AT zhouyueying activelearningforefficientanalysisofhighthroughputnanoporedata AT shaowei activelearningforefficientanalysisofhighthroughputnanoporedata AT zhangdaoqiang activelearningforefficientanalysisofhighthroughputnanoporedata

Similar items