
A machine learning framework for genotyping the structural variations with copy number variant

BACKGROUND: Genotyping of structural variations is an important computational problem in next-generation sequencing data analysis. However, in cancer genomes, copy number variants (CNVs) often coexist with other types of structural variations, which significantly reduces the accuracy of the existing g...


Bibliographic Details
Main authors:	Zheng, Tian; Zhu, Xiaoyan; Zhang, Xuanping; Zhao, Zhongmeng; Yi, Xin; Wang, Jiayin; Li, Hongle
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7450592/ https://www.ncbi.nlm.nih.gov/pubmed/32854699 http://dx.doi.org/10.1186/s12920-020-00733-w

collection	PubMed
description	BACKGROUND: Genotyping of structural variations is an important computational problem in next-generation sequencing data analysis. However, in cancer genomes, copy number variants (CNVs) often coexist with other types of structural variations, which significantly reduces the accuracy of existing genotyping methods. A CNV region shows bias in sequencing coverage and variant allelic frequency, which leads genotyping approaches to misinterpret a heterozygote as a homozygote. Furthermore, other data signals, such as split-mapped reads and abnormal reads, are also misjudged because of the CNV. Therefore, genotyping structural variations with CNVs is a complicated computational problem that should consider multiple features and their interactions. METHODS: We propose a computational method for genotyping indels in CNV regions, which introduces a machine learning framework to comprehensively incorporate a set of data features and their interactions. We extract fifteen kinds of classification features as input. Unlike in the traditional genotyping problem, here a variant may fall into one of five types: normal homozygote, homozygous variant, heterozygous variant without CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild-type haplotype. The Multiclass Relevance Vector Machine (M-RVM) is used as the machine learning framework, combined with the distribution characteristics of the features. RESULTS: We applied the proposed method to both simulated and real data and compared it with popular existing software, including Gindel, Facets, and GATK, as well as with other machine learning cores: Support Vector Machine, Lagrange-SVM with OVO (one-versus-one) multiclass classification, Naïve Bayes, and BP Neural Network. The results demonstrate that the proposed method outperforms the others in accuracy, stability, and efficiency. CONCLUSION: This work shows that genotyping structural variations in CNV regions cannot be solved as a traditional genotyping problem. More features should be used to complete the five-category task efficiently. According to the results, the proposed method can be a practical algorithm for correctly genotyping structural variations with CNVs in next-generation sequencing data. The source code has been uploaded at https://github.com/TrinaZ/Mixgenotype for academic use only.
format	Online Article Text
id	pubmed-7450592
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7450592 2020-08-28 BMC Med Genomics Research BioMed Central 2020-08-27 /pmc/articles/PMC7450592/ /pubmed/32854699 http://dx.doi.org/10.1186/s12920-020-00733-w Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	A machine learning framework for genotyping the structural variations with copy number variant
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7450592/ https://www.ncbi.nlm.nih.gov/pubmed/32854699 http://dx.doi.org/10.1186/s12920-020-00733-w
