
A distance-type measure approach to the analysis of copy number variation in DNA sequencing data

BACKGROUND: Next-generation sequencing technology allows us to obtain a large number of short DNA sequence (DNA-seq) reads at a genome-wide level. DNA-seq data have been increasingly collected in recent years. Count-type data analysis is a widely used approach for DNA-seq data. However,...


Bibliographic Details
Main Authors:	Biswas, Bipasa; Lai, Yinglei
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456939/ https://www.ncbi.nlm.nih.gov/pubmed/30967117 http://dx.doi.org/10.1186/s12864-019-5491-x

_version_	1783409830948503552
author	Biswas, Bipasa Lai, Yinglei
collection	PubMed
description	BACKGROUND: Next-generation sequencing technology allows us to obtain a large number of short DNA sequence (DNA-seq) reads at a genome-wide level. DNA-seq data have been increasingly collected in recent years. Count-type data analysis is a widely used approach for DNA-seq data. However, the related data pre-processing is based on the moving window method, in which a window size needs to be defined in order to obtain count-type data. Furthermore, useful information can be lost during data pre-processing for count-type data. RESULTS: In this study, we propose to analyze DNA-seq data based on the related distance-type measure. Distances are measured in base pairs (bp) between two adjacent alignments of short reads mapped to a reference genome. Our experimental-data-based simulation study confirms the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we propose artificial censoring for the distance data so that distances larger than a given value are considered potential outliers. Our purpose is to simplify the pre-processing of DNA-seq data. Statistically, we consider a mixture of right-censored geometric distributions to model the distance data. Additionally, to reduce the GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). The model can be estimated by the Newton-Raphson algorithm as well as the Expectation-Maximization (E-M) algorithm. We have conducted simulations to evaluate the performance of our approach. Based on the rank-based inverse normal transformation of distance data, we can obtain the related z-values for a follow-up analysis. As an illustration, an application to the DNA-seq data from a pair of normal and tumor cell lines is presented with a change-point analysis of z-values to detect DNA copy number alterations. CONCLUSION: Our distance-type measure approach is novel. It does not require either a fixed or a sliding window procedure for generating count-type data. Its advantages have been demonstrated by our simulation studies and its practical usefulness has been illustrated by an experimental data application. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5491-x) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6456939
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6456939 2019-04-19 A distance-type measure approach to the analysis of copy number variation in DNA sequencing data Biswas, Bipasa Lai, Yinglei BMC Genomics Research BioMed Central 2019-04-04 /pmc/articles/PMC6456939/ /pubmed/30967117 http://dx.doi.org/10.1186/s12864-019-5491-x Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A distance-type measure approach to the analysis of copy number variation in DNA sequencing data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456939/ https://www.ncbi.nlm.nih.gov/pubmed/30967117 http://dx.doi.org/10.1186/s12864-019-5491-x
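The modeling idea in the abstract (distances between adjacent alignments, artificial right censoring, and a mixture of right-censored geometric distributions fitted by E-M) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of support {0, 1, 2, ...} for the geometric distribution, the convention that a distance d > c is recorded as censored at c, and the closed-form weighted M-step are all assumptions made here. The GC-content GLM extension and the Newton-Raphson variant mentioned in the abstract are not covered.

```python
import math
import random

EPS = 1e-12  # keeps probabilities away from 0 and 1 so logs stay finite


def adjacent_distances(positions):
    """Distances (bp) between consecutive sorted alignment start positions."""
    s = sorted(positions)
    return [b - a for a, b in zip(s, s[1:])]


def censor(distances, c):
    """Artificial right censoring: d > c is stored as (c, True)."""
    return [(min(d, c), d > c) for d in distances]


def log_pmf(d, censored, p, c):
    """Assumed model: P(D = d) = p(1-p)^d for d >= 0; P(D > c) = (1-p)^(c+1)."""
    if censored:
        return (c + 1) * math.log1p(-p)
    return math.log(p) + d * math.log1p(-p)


def em_censored_geometric_mixture(data, c, p_init, n_iter=200):
    """E-M for a k-component mixture; returns (weights, success probabilities)."""
    k = len(p_init)
    p = list(p_init)
    w = [1.0 / k] * k
    for _ in range(n_iter):
        num = [0.0] * k  # weighted count of uncensored observations
        den = [0.0] * k  # weighted "number of trials" per component
        tot = [0.0] * k  # summed responsibilities
        for d, cens in data:
            # E-step: responsibilities via a log-sum-exp for stability
            logs = [math.log(w[j]) + log_pmf(d, cens, p[j], c) for j in range(k)]
            m = max(logs)
            ex = [math.exp(v - m) for v in logs]
            s = sum(ex)
            for j in range(k):
                r = ex[j] / s
                tot[j] += r
                if cens:
                    den[j] += r * (c + 1)
                else:
                    num[j] += r
                    den[j] += r * (d + 1)
        # M-step: closed-form weighted maximum-likelihood updates
        w = [max(t / len(data), EPS) for t in tot]
        p = [min(max(n / dn, EPS), 1.0 - EPS) for n, dn in zip(num, den)]
    return w, p


def simulate_distances(n, weights, ps, rng):
    """Hypothetical helper: draw n distances from a geometric mixture."""
    out = []
    for _ in range(n):
        p = ps[rng.choices(range(len(ps)), weights)[0]]
        u = 1.0 - rng.random()  # uniform on (0, 1]
        out.append(int(math.log(u) / math.log1p(-p)))
    return out
```

A typical call would be `em_censored_geometric_mixture(censor(adjacent_distances(positions), c), c, [0.3, 0.02])`, where `positions` holds alignment start coordinates on one chromosome and the starting values are arbitrary guesses. In this sketch a component with a larger success probability corresponds to shorter distances, that is, denser read coverage.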
