
Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance

BACKGROUND: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (...


Bibliographic Details
Main Authors:	Kuśmirek, Wiktor; Szmurło, Agnieszka; Wiewiórka, Marek; Nowak, Robert; Gambin, Tomasz
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6537193/ https://www.ncbi.nlm.nih.gov/pubmed/31138108 http://dx.doi.org/10.1186/s12859-019-2889-z

author	Kuśmirek, Wiktor; Szmurło, Agnieszka; Wiewiórka, Marek; Nowak, Robert; Gambin, Tomasz
author_sort	Kuśmirek, Wiktor
collection	PubMed
description	BACKGROUND: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance. METHODS: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on the CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as the reference set and random selection) as well as two clustering methods (k-means and k-nearest neighbours (kNN) with a variable number of clusters or group sizes) were evaluated to discover the best performing sample selection method. RESULTS AND CONCLUSIONS: The performed experiments have shown that appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that smart reduction of the reference sample size may significantly increase the algorithms’ precision while having a negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2889-z) contains supplementary material, which is available to authorized users.
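The two selection strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the features or the distance measure, so the use of Euclidean distance on per-target normalized read-depth vectors is an assumption, as are all function names, the deterministic farthest-point initialization, and the toy data.

```python
import math

def distance(a, b):
    """Euclidean distance between two read-depth profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_reference_set(profiles, sample, k):
    """kNN selection: the k samples closest to `sample` form its reference set."""
    others = sorted(name for name in profiles if name != sample)
    others.sort(key=lambda name: distance(profiles[sample], profiles[name]))
    return others[:k]

def kmeans_reference_sets(profiles, n_clusters, n_iter=20):
    """k-means selection: each sample's reference set is the rest of its cluster."""
    names = sorted(profiles)
    # Deterministic farthest-point initialization (a choice made for this sketch).
    centroids = [list(profiles[names[0]])]
    while len(centroids) < n_clusters:
        far = max(names, key=lambda n: min(distance(profiles[n], c) for c in centroids))
        centroids.append(list(profiles[far]))
    cluster_of = {}
    for _ in range(n_iter):
        # Assignment step: attach every sample to its nearest centroid.
        cluster_of = {
            n: min(range(n_clusters), key=lambda c: distance(profiles[n], centroids[c]))
            for n in names
        }
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [profiles[n] for n in names if cluster_of[n] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return {
        n: [m for m in names if m != n and cluster_of[m] == cluster_of[n]]
        for n in names
    }

# Hypothetical toy data: six samples, three targets, two obvious groups.
profiles = {
    "s1": [1.0, 1.1, 0.9], "s2": [1.0, 1.0, 1.0], "s3": [1.1, 1.0, 0.9],
    "s4": [5.0, 5.2, 4.9], "s5": [5.1, 5.0, 5.0], "s6": [4.9, 5.1, 5.1],
}

if __name__ == "__main__":
    print(knn_reference_set(profiles, "s1", k=2))
    print(kmeans_reference_sets(profiles, n_clusters=2)["s4"])
```

In this sketch the kNN variant computes a separate neighbour list for every sample, while the k-means variant clusters once and reuses each cluster for all of its members. That difference is one plausible reading of the runtime advantage reported for k-means, though the abstract does not give the reason.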
format	Online Article Text
id	pubmed-6537193
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6537193 2019-05-30 BMC Bioinformatics Methodology Article BioMed Central 2019-05-28 /pmc/articles/PMC6537193/ /pubmed/31138108 http://dx.doi.org/10.1186/s12859-019-2889-z Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance
topic	Methodology Article
