
Identification of copy number variants in whole-genome data using Reference Coverage Profiles


Bibliographic Details
Main Authors: Glusman, Gustavo; Severson, Alissa; Dhankani, Varsha; Robinson, Max; Farrah, Terry; Mauldin, Denise E.; Stittrich, Anna B.; Ament, Seth A.; Roach, Jared C.; Brunkow, Mary E.; Bodian, Dale L.; Vockley, Joseph G.; Shmulevich, Ilya; Niederhuber, John E.; Hood, Leroy
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2015
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4330915/
https://www.ncbi.nlm.nih.gov/pubmed/25741365
http://dx.doi.org/10.3389/fgene.2015.00045
author Glusman, Gustavo
Severson, Alissa
Dhankani, Varsha
Robinson, Max
Farrah, Terry
Mauldin, Denise E.
Stittrich, Anna B.
Ament, Seth A.
Roach, Jared C.
Brunkow, Mary E.
Bodian, Dale L.
Vockley, Joseph G.
Shmulevich, Ilya
Niederhuber, John E.
Hood, Leroy
author_sort Glusman, Gustavo
collection PubMed
description The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
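The normalization and segmentation steps named in the description can be illustrated with a minimal sketch. It is not the authors' tool: the function names, the three copy-number states with their expected ratios, the Gaussian emission model, and the `sigma` and `stay` parameters are all assumptions chosen for illustration. Binned sample coverage is scaled to a mean of 1, divided by a reference profile bin by bin, and the resulting ratios are segmented with a small hidden Markov model decoded by the Viterbi algorithm.

```python
import math

# Hypothetical copy-number states and the coverage ratio each one is
# expected to show relative to the diploid reference profile (assumed values).
STATES = {"deletion": 0.5, "diploid": 1.0, "duplication": 1.5}


def normalize(coverage, profile):
    """Scale the sample to mean 1, then divide by the reference profile per bin."""
    mean_cov = sum(coverage) / len(coverage)
    return [(c / mean_cov) / p for c, p in zip(coverage, profile)]


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(ratios, sigma=0.15, stay=0.99):
    """Return the most likely state for each bin (sigma and stay are assumed)."""
    names = list(STATES)
    switch = (1.0 - stay) / (len(names) - 1)
    log_trans = {
        a: {b: math.log(stay if a == b else switch) for b in names} for a in names
    }
    # Uniform prior over states for the first bin.
    score = {
        s: math.log(1.0 / len(names)) + log_gauss(ratios[0], STATES[s], sigma)
        for s in names
    }
    back = []  # one back-pointer table per bin after the first
    for x in ratios[1:]:
        prev, score, pointers = score, {}, {}
        for s in names:
            best = max(names, key=lambda a: prev[a] + log_trans[a][s])
            pointers[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_gauss(x, STATES[s], sigma)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(names, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi(normalize(coverage, profile))` returns one state label per bin. Dividing by the profile removes the sequence-specific fluctuations shared by all genomes, so the segmentation only has to explain what remains.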
format Online
Article
Text
id pubmed-4330915
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4330915 2015-03-04
Front Genet
Genetics
Frontiers Media S.A. 2015-02-17
/pmc/articles/PMC4330915/
/pubmed/25741365
http://dx.doi.org/10.3389/fgene.2015.00045
Text en
Copyright © 2015 Glusman, Severson, Dhankani, Robinson, Farrah, Mauldin, Stittrich, Ament, Roach, Brunkow, Bodian, Vockley, Shmulevich, Niederhuber and Hood.
http://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Identification of copy number variants in whole-genome data using Reference Coverage Profiles
topic Genetics