A64 Viral sequence classification using deep learning algorithms
Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequ...
Main authors: | Nieuwenhuijse, David; Munnink, Bas Oude; Phan, My; Koopmans, Marion |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2019 |
Subjects: | Abstract Overview |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736044/ http://dx.doi.org/10.1093/ve/vez002.063 |
_version_ | 1783450448617799680 |
---|---|
author | Nieuwenhuijse, David; Munnink, Bas Oude; Phan, My; Koopmans, Marion |
author_facet | Nieuwenhuijse, David; Munnink, Bas Oude; Phan, My; Koopmans, Marion |
author_sort | Nieuwenhuijse, David |
collection | PubMed |
description | Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequencing of sewage for surveillance is sequence annotation and interpretation. Especially for high-threat viruses, false positive signals can trigger unnecessary alerts, but true positives should not be missed. Annotation thus requires high sensitivity and specificity. To better interpret annotated reads for high-threat viruses, we attempt to determine how classifiable they are in a background of reads of closely related low-threat viruses. As an example, we attempted to distinguish reads of poliovirus, a virus of high public health importance, from reads of other enteroviruses. A sequence-based deep learning algorithm was used to classify reads as either polio or non-polio enterovirus. Short reads were generated from 500 polio and 2,000 non-polio enterovirus genomes as a training set. By training the algorithm on this dataset we try to determine, at the single-read level, which short reads can reliably be labeled as poliovirus and which cannot. After training the deep learning algorithm on the generated reads we were able to calculate the probability with which a read can be assigned to a poliovirus genome or a non-poliovirus genome. We show that the algorithm succeeds in classifying the reads with high accuracy. The probability of assigning the read to the correct class was related to the location in the genome to which the read mapped, which conformed with our expectations since some regions of the genome are more conserved than others. Classifying short reads of high-threat viral pathogens seems to be a promising application of sequence-based deep learning algorithms. Also, recent developments in software and hardware have facilitated the development and training of deep learning algorithms. Future plans for this work are to characterize the hard-to-classify regions of the poliovirus genome, build larger training databases, and extend the current approach to other viruses. |
format | Online Article Text |
id | pubmed-6736044 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6736044 2019-09-16 A64 Viral sequence classification using deep learning algorithms Nieuwenhuijse, David Munnink, Bas Oude Phan, My Koopmans, Marion Virus Evol Abstract Overview Sewage samples have a high potential benefit for surveillance of circulating pathogens because they are easy to obtain and reflect population-wide circulation of pathogens. These types of samples typically contain a great diversity of viruses. Therefore, one of the main challenges of metagenomic sequencing of sewage for surveillance is sequence annotation and interpretation. Especially for high-threat viruses, false positive signals can trigger unnecessary alerts, but true positives should not be missed. Annotation thus requires high sensitivity and specificity. To better interpret annotated reads for high-threat viruses, we attempt to determine how classifiable they are in a background of reads of closely related low-threat viruses. As an example, we attempted to distinguish reads of poliovirus, a virus of high public health importance, from reads of other enteroviruses. A sequence-based deep learning algorithm was used to classify reads as either polio or non-polio enterovirus. Short reads were generated from 500 polio and 2,000 non-polio enterovirus genomes as a training set. By training the algorithm on this dataset we try to determine, at the single-read level, which short reads can reliably be labeled as poliovirus and which cannot. After training the deep learning algorithm on the generated reads we were able to calculate the probability with which a read can be assigned to a poliovirus genome or a non-poliovirus genome. We show that the algorithm succeeds in classifying the reads with high accuracy. The probability of assigning the read to the correct class was related to the location in the genome to which the read mapped, which conformed with our expectations since some regions of the genome are more conserved than others. Classifying short reads of high-threat viral pathogens seems to be a promising application of sequence-based deep learning algorithms. Also, recent developments in software and hardware have facilitated the development and training of deep learning algorithms. Future plans for this work are to characterize the hard-to-classify regions of the poliovirus genome, build larger training databases, and extend the current approach to other viruses. Oxford University Press 2019-08-22 /pmc/articles/PMC6736044/ http://dx.doi.org/10.1093/ve/vez002.063 Text en © Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access publication distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Abstract Overview Nieuwenhuijse, David Munnink, Bas Oude Phan, My Koopmans, Marion A64 Viral sequence classification using deep learning algorithms |
title | A64 Viral sequence classification using deep learning algorithms |
title_full | A64 Viral sequence classification using deep learning algorithms |
title_fullStr | A64 Viral sequence classification using deep learning algorithms |
title_full_unstemmed | A64 Viral sequence classification using deep learning algorithms |
title_short | A64 Viral sequence classification using deep learning algorithms |
title_sort | a64 viral sequence classification using deep learning algorithms |
topic | Abstract Overview |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736044/ http://dx.doi.org/10.1093/ve/vez002.063 |
work_keys_str_mv | AT nieuwenhuijsedavid a64viralsequenceclassificationusingdeeplearningalgorithms AT munninkbasoude a64viralsequenceclassificationusingdeeplearningalgorithms AT phanmy a64viralsequenceclassificationusingdeeplearningalgorithms AT koopmansmarion a64viralsequenceclassificationusingdeeplearningalgorithms |
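
The abstract in this record describes generating short reads from genomes, classifying each read, and relating the probability of a correct assignment to the read's position in the genome. The sketch below illustrates only the data-handling side of such a pipeline and is not the authors' implementation. The function names, the read length of 150, the one-hot encoding, and the bin size are all assumptions made for illustration; the abstract does not state how reads were simulated or encoded, and no classifier is included.

```python
import random
from collections import defaultdict

BASES = "ACGT"


def generate_reads(genome, read_len, n_reads, rng):
    """Sample n_reads substrings of length read_len from genome.

    Returns (start, read) pairs so each read keeps its genome position.
    Uniform sampling without sequencing errors is an assumption.
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the requested read length")
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads


def one_hot(read):
    """Encode a read as one 4-element row per base (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in read.upper()]


def mean_probability_by_bin(starts, probs, bin_size):
    """Average the correct-class probability of reads per genome bin.

    starts: read start positions; probs: the probability a (hypothetical)
    classifier gave to the correct class for each read.
    Returns {bin_start: mean probability}, ordered by position.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start, p in zip(starts, probs):
        b = start // bin_size
        sums[b] += p
        counts[b] += 1
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder random sequence, not a real viral genome.
    toy_genome = "".join(rng.choice(BASES) for _ in range(1000))
    reads = generate_reads(toy_genome, read_len=150, n_reads=5, rng=rng)
    encoded = [one_hot(seq) for _, seq in reads]
    print(len(reads), len(encoded[0]), len(encoded[0][0]))
```

Keeping the start position with every read is what makes the position analysis possible: once a trained model supplies per-read probabilities, `mean_probability_by_bin` would show which stretches of the genome yield reads that are hard to assign, which is the kind of region the abstract proposes to characterize.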