
Analysis of heterogeneous genomic samples using image normalization and machine learning

BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded b...


Bibliographic Details
Main authors: Basodi, Sunitha; Baykal, Pelin Icer; Zelikovsky, Alex; Skums, Pavel; Pan, Yi
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751093/
https://www.ncbi.nlm.nih.gov/pubmed/33349236
http://dx.doi.org/10.1186/s12864-020-6661-6
author Basodi, Sunitha
Baykal, Pelin Icer
Zelikovsky, Alex
Skums, Pavel
Pan, Yi
collection PubMed
description BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges associated with technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures. RESULTS: We propose a novel preprocessing approach to transform irregular genomic data into normalized image data. Such a representation allows the problems of classification and comparison of heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy. CONCLUSIONS: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models.
format Online
Article
Text
id pubmed-7751093
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7751093 2020-12-22 Analysis of heterogeneous genomic samples using image normalization and machine learning Basodi, Sunitha Baykal, Pelin Icer Zelikovsky, Alex Skums, Pavel Pan, Yi BMC Genomics Methodology BioMed Central 2020-12-21 /pmc/articles/PMC7751093/ /pubmed/33349236 http://dx.doi.org/10.1186/s12864-020-6661-6 Text en © The Author(s). 2020 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Analysis of heterogeneous genomic samples using image normalization and machine learning
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751093/
https://www.ncbi.nlm.nih.gov/pubmed/33349236
http://dx.doi.org/10.1186/s12864-020-6661-6