
A Bioinformatics Approach for Determining Sample Identity from Different Lanes of High-Throughput Sequencing Data

The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian sized genome (∼3Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build...


Bibliographic Details
Main authors:	Goldfeder, Rachel L., Parker, Stephen C. J., Ajay, Subramanian S., Ozel Abaan, Hatice, Margulies, Elliott H.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2011
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3157463/ https://www.ncbi.nlm.nih.gov/pubmed/21858207 http://dx.doi.org/10.1371/journal.pone.0023683

author	Goldfeder, Rachel L. Parker, Stephen C. J. Ajay, Subramanian S. Ozel Abaan, Hatice Margulies, Elliott H.
author_sort	Goldfeder, Rachel L.
collection	PubMed
description	The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian sized genome (∼3Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test if any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the gender of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality.
format	Online Article Text
id	pubmed-3157463
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3157463 2011-08-19 PLoS One Research Article Public Library of Science 2011-08-17 /pmc/articles/PMC3157463/ /pubmed/21858207 http://dx.doi.org/10.1371/journal.pone.0023683 Text en This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. https://creativecommons.org/publicdomain/zero/1.0/
title	A Bioinformatics Approach for Determining Sample Identity from Different Lanes of High-Throughput Sequencing Data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3157463/ https://www.ncbi.nlm.nih.gov/pubmed/21858207 http://dx.doi.org/10.1371/journal.pone.0023683
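
The abstract in this record describes comparing genotype concordance rates between lanes to check whether they come from the same sample. The Python sketch below illustrates only that general idea. It is not the authors' pipeline: the site keys, the genotype string encoding, and the names `normalize` and `concordance_rate` are hypothetical, and the record gives no thresholds, so none are assumed here.

```python
from typing import Dict, Tuple

# Hypothetical data layout (not taken from the article):
Site = Tuple[str, int]        # (chromosome, position)
Genotypes = Dict[Site, str]   # site -> diploid genotype call such as "AG"


def normalize(genotype: str) -> str:
    """Sort alleles so that "GA" and "AG" compare as the same call."""
    return "".join(sorted(genotype.upper()))


def concordance_rate(lane_a: Genotypes, lane_b: Genotypes) -> float:
    """Fraction of sites genotyped in both lanes whose calls agree."""
    shared = lane_a.keys() & lane_b.keys()
    if not shared:
        raise ValueError("no sites were genotyped in both lanes")
    matches = sum(
        1 for site in shared
        if normalize(lane_a[site]) == normalize(lane_b[site])
    )
    return matches / len(shared)


# Toy genotype calls for two lanes (made-up values for illustration only).
lane_1 = {("chr1", 100): "AG", ("chr1", 250): "CC", ("chr2", 75): "TT"}
lane_2 = {("chr1", 100): "GA", ("chr1", 250): "CC", ("chr2", 75): "CT",
          ("chr3", 9): "AA"}

rate = concordance_rate(lane_1, lane_2)
print(f"concordance over shared sites: {rate:.3f}")
```

Only sites called in both lanes enter the denominator, so lanes with different read counts (and therefore different numbers of called sites) can still be compared. Following the abstract, rates for lane pairs from the same sample would be expected to fall in a distribution separate from rates for lane pairs from different samples.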
