Bioinformatics Pipeline for Human Papillomavirus Short Read Genomic Sequences Classification Using Support Vector Machine

We recently developed a test based on the Agilent SureSelect target enrichment system capturing genomic fragments from 191 human papillomavirus (HPV) types for Illumina sequencing. This enriched whole genome sequencing (eWGS) assay provides an approach to identify all HPV types in a sample. Here w...

Bibliographic Details
Main Authors: Lomsadze, Alexandre; Li, Tengguo; Rajeevan, Mangalathu S.; Unger, Elizabeth R.; Borodovsky, Mark
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7412107/
https://www.ncbi.nlm.nih.gov/pubmed/32629900
http://dx.doi.org/10.3390/v12070710
author Lomsadze, Alexandre
Li, Tengguo
Rajeevan, Mangalathu S.
Unger, Elizabeth R.
Borodovsky, Mark
collection PubMed
description We recently developed a test based on the Agilent SureSelect target enrichment system capturing genomic fragments from 191 human papillomavirus (HPV) types for Illumina sequencing. This enriched whole genome sequencing (eWGS) assay provides an approach to identify all HPV types in a sample. Here we present a machine learning algorithm that calls HPV types based on the eWGS output. The algorithm, based on the support vector machine (SVM) technique, was trained on eWGS data from 122 control samples with known HPV types. The new algorithm demonstrated good performance in HPV type detection for designed samples with 25 or more HPV plasmid copies per sample. We compared the HPV typing results produced by the new algorithm for 261 residual epidemiologic samples with the typing results delivered by the standard HPV Linear Array (LA). The agreement between methods (97.4%) was substantial (kappa = 0.783). However, the new algorithm additionally identified 428 instances of HPV types that the LA assay, by design, cannot detect. Overall, we have demonstrated that the bioinformatics pipeline is an accurate tool for calling HPV types by analyzing data generated by eWGS processing of DNA fragments extracted from control and epidemiological samples.
format Online
Article
Text
id pubmed-7412107
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7412107 2020-08-25 Viruses Article MDPI 2020-06-30 /pmc/articles/PMC7412107/ /pubmed/32629900 http://dx.doi.org/10.3390/v12070710 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Bioinformatics Pipeline for Human Papillomavirus Short Read Genomic Sequences Classification Using Support Vector Machine
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7412107/
https://www.ncbi.nlm.nih.gov/pubmed/32629900
http://dx.doi.org/10.3390/v12070710