Analysis of heterogeneous genomic samples using image normalization and machine learning
BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded b...
Main authors: | Basodi, Sunitha; Baykal, Pelin Icer; Zelikovsky, Alex; Skums, Pavel; Pan, Yi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Methodology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751093/ https://www.ncbi.nlm.nih.gov/pubmed/33349236 http://dx.doi.org/10.1186/s12864-020-6661-6 |
_version_ | 1783625604585750528 |
---|---|
author | Basodi, Sunitha Baykal, Pelin Icer Zelikovsky, Alex Skums, Pavel Pan, Yi |
author_facet | Basodi, Sunitha Baykal, Pelin Icer Zelikovsky, Alex Skums, Pavel Pan, Yi |
author_sort | Basodi, Sunitha |
collection | PubMed |
description | BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges associated with technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures. RESULTS: We propose a novel preprocessing approach to transform irregular genomic data into normalized image data. Such a representation allows the problems of classification and comparison of heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy. CONCLUSIONS: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models. |
format | Online Article Text |
id | pubmed-7751093 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7751093 2020-12-22 Analysis of heterogeneous genomic samples using image normalization and machine learning Basodi, Sunitha Baykal, Pelin Icer Zelikovsky, Alex Skums, Pavel Pan, Yi BMC Genomics Methodology BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges associated with technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures. RESULTS: We propose a novel preprocessing approach to transform irregular genomic data into normalized image data. Such a representation allows the problems of classification and comparison of heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy. CONCLUSIONS: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models. BioMed Central 2020-12-21 /pmc/articles/PMC7751093/ /pubmed/33349236 http://dx.doi.org/10.1186/s12864-020-6661-6 Text en © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Basodi, Sunitha Baykal, Pelin Icer Zelikovsky, Alex Skums, Pavel Pan, Yi Analysis of heterogeneous genomic samples using image normalization and machine learning |
title | Analysis of heterogeneous genomic samples using image normalization and machine learning |
title_full | Analysis of heterogeneous genomic samples using image normalization and machine learning |
title_fullStr | Analysis of heterogeneous genomic samples using image normalization and machine learning |
title_full_unstemmed | Analysis of heterogeneous genomic samples using image normalization and machine learning |
title_short | Analysis of heterogeneous genomic samples using image normalization and machine learning |
title_sort | analysis of heterogeneous genomic samples using image normalization and machine learning |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751093/ https://www.ncbi.nlm.nih.gov/pubmed/33349236 http://dx.doi.org/10.1186/s12864-020-6661-6 |
work_keys_str_mv | AT basodisunitha analysisofheterogeneousgenomicsamplesusingimagenormalizationandmachinelearning AT baykalpelinicer analysisofheterogeneousgenomicsamplesusingimagenormalizationandmachinelearning AT zelikovskyalex analysisofheterogeneousgenomicsamplesusingimagenormalizationandmachinelearning AT skumspavel analysisofheterogeneousgenomicsamplesusingimagenormalizationandmachinelearning AT panyi analysisofheterogeneousgenomicsamplesusingimagenormalizationandmachinelearning |