
A64 Viral sequence classification using deep learning algorithms

Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequ...


Bibliographic Details
Main authors: Nieuwenhuijse, David, Munnink, Bas Oude, Phan, My, Koopmans, Marion
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736044/
http://dx.doi.org/10.1093/ve/vez002.063
_version_ 1783450448617799680
author Nieuwenhuijse, David
Munnink, Bas Oude
Phan, My
Koopmans, Marion
author_facet Nieuwenhuijse, David
Munnink, Bas Oude
Phan, My
Koopmans, Marion
author_sort Nieuwenhuijse, David
collection PubMed
description Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequencing of sewage for surveillance is sequence annotation and interpretation. Especially for high-threat viruses, false positive signals can trigger unnecessary alerts, but true positives should not be missed. Annotation thus requires high sensitivity and specificity. To better interpret annotated reads for high-threat viruses, we attempt to determine how classifiable they are in a background of reads of closely related low-threat viruses. As an example, we attempted to distinguish reads of poliovirus, a virus of high public health importance, from other enterovirus reads. A sequence-based deep learning algorithm was used to classify reads as either polio or non-polio enterovirus. Short reads were generated from 500 polio and 2,000 non-polio enterovirus genomes as a training set. By training the algorithm on this dataset we try to determine, on a single read level, which short reads can reliably be labeled as poliovirus and which cannot. After training the deep learning algorithm on the generated reads we were able to calculate the probability with which a read can be assigned to a poliovirus genome or a non-poliovirus genome. We show that the algorithm succeeds in classifying the reads with high accuracy. The probability of assigning the read to the correct class was related to the location in the genome to which the read mapped, which conformed with our expectations since some regions of the genome are more conserved than others. Classifying short reads of high-threat viral pathogens seems to be a promising application of sequence-based deep learning algorithms. 
Also, recent developments in software and hardware have facilitated the development and training of deep learning algorithms. Further plans for this work are to characterize the hard-to-classify regions of the poliovirus genome, build larger training databases, and extend the current approach to other viruses.
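The abstract does not say how the short reads were generated or encoded, so the following is only a minimal sketch of one plausible way to build such a labeled training set. Everything in it is an assumption for illustration: the function names (sample_reads, one_hot, build_training_set), the read length of 150, the number of reads per genome, uniform random start positions, and one-hot encoding are not taken from the record. The start position of each read is kept so that classification probability could later be compared with genome location, as the abstract describes.

```python
import random

BASES = "ACGT"


def sample_reads(genome, read_len, n_reads, rng):
    """Draw n_reads substrings of length read_len at uniformly random starts.

    Returns (start, read) pairs; returns an empty list if the genome is
    shorter than read_len.
    """
    if len(genome) < read_len:
        return []
    max_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, max_start)  # inclusive on both ends
        reads.append((start, genome[start:start + read_len]))
    return reads


def one_hot(read):
    """Encode a read as rows of 4 values (A, C, G, T).

    Ambiguous bases such as N become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in read.upper()]


def build_training_set(polio_genomes, non_polio_genomes,
                       read_len=150, reads_per_genome=100, seed=0):
    """Return shuffled (encoded_read, label, start) tuples.

    label is 1 for poliovirus and 0 for non-polio enterovirus.
    read_len and reads_per_genome are illustrative defaults, not values
    reported in the abstract.
    """
    rng = random.Random(seed)
    examples = []
    for label, genomes in ((1, polio_genomes), (0, non_polio_genomes)):
        for genome in genomes:
            for start, read in sample_reads(genome, read_len,
                                            reads_per_genome, rng):
                examples.append((one_hot(read), label, start))
    rng.shuffle(examples)
    return examples
```

With real data, polio_genomes and non_polio_genomes would be lists of genome sequence strings (500 and 2,000 genomes in the abstract), and the encoded reads would be passed to whatever sequence classifier is being trained.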
format Online
Article
Text
id pubmed-6736044
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-67360442019-09-16 A64 Viral sequence classification using deep learning algorithms Nieuwenhuijse, David Munnink, Bas Oude Phan, My Koopmans, Marion Virus Evol Abstract Overview Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequencing of sewage for surveillance is sequence annotation and interpretation. Especially for high-threat viruses, false positive signals can trigger unnecessary alerts, but true positives should not be missed. Annotation thus requires high sensitivity and specificity. To better interpret annotated reads for high-threat viruses, we attempt to determine how classifiable they are in a background of reads of closely related low-threat viruses. As an example, we attempted to distinguish reads of poliovirus, a virus of high public health importance, from other enterovirus reads. A sequence-based deep learning algorithm was used to classify reads as either polio or non-polio enterovirus. Short reads were generated from 500 polio and 2,000 non-polio enterovirus genomes as a training set. By training the algorithm on this dataset we try to determine, on a single read level, which short reads can reliably be labeled as poliovirus and which cannot. After training the deep learning algorithm on the generated reads we were able to calculate the probability with which a read can be assigned to a poliovirus genome or a non-poliovirus genome. We show that the algorithm succeeds in classifying the reads with high accuracy. The probability of assigning the read to the correct class was related to the location in the genome to which the read mapped, which conformed with our expectations since some regions of the genome are more conserved than others. 
Classifying short reads of high-threat viral pathogens seems to be a promising application of sequence-based deep learning algorithms. Also, recent developments in software and hardware have facilitated the development and training of deep learning algorithms. Further plans for this work are to characterize the hard-to-classify regions of the poliovirus genome, build larger training databases, and extend the current approach to other viruses. Oxford University Press 2019-08-22 /pmc/articles/PMC6736044/ http://dx.doi.org/10.1093/ve/vez002.063 Text en © Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access publication distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Abstract Overview
Nieuwenhuijse, David
Munnink, Bas Oude
Phan, My
Koopmans, Marion
A64 Viral sequence classification using deep learning algorithms
title A64 Viral sequence classification using deep learning algorithms
title_full A64 Viral sequence classification using deep learning algorithms
title_fullStr A64 Viral sequence classification using deep learning algorithms
title_full_unstemmed A64 Viral sequence classification using deep learning algorithms
title_short A64 Viral sequence classification using deep learning algorithms
title_sort a64 viral sequence classification using deep learning algorithms
topic Abstract Overview
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736044/
http://dx.doi.org/10.1093/ve/vez002.063
work_keys_str_mv AT nieuwenhuijsedavid a64viralsequenceclassificationusingdeeplearningalgorithms
AT munninkbasoude a64viralsequenceclassificationusingdeeplearningalgorithms
AT phanmy a64viralsequenceclassificationusingdeeplearningalgorithms
AT koopmansmarion a64viralsequenceclassificationusingdeeplearningalgorithms