Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing

Bibliographic Details
Main authors: Sevim, Volkan; Bashir, Ali; Chin, Chen-Shan; Miga, Karen H.
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects: Discovery Note
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4920115/
https://www.ncbi.nlm.nih.gov/pubmed/27153570
http://dx.doi.org/10.1093/bioinformatics/btw101
collection PubMed
description Motivation: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences are a source of repeat-structure diversity that standard genomic tools commonly ignore. Reads shorter than the underlying repeat structure must rely on indirect inference methods such as assembly, whereas long reads allow direct inference of satellite higher-order repeat structure. To automate the characterization of local centromeric tandem repeat sequence variation, we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not affected by rearrangements or sequence underrepresentation caused by misassembly.
Results: We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha-satellite-containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation, and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere higher-order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, so the same repeat-level analysis could readily be applied, using available consensus sequences, to other satellite families in genomes that lack high-quality reference assemblies.
Availability and implementation: Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI.
Contact: ali.bashir@mssm.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
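The per-read summary described in the abstract (repeat-unit organization and shifts in repeat orientation) can be illustrated with a small sketch. This is not Alpha-CENTAURI's implementation. It assumes a hypothetical upstream step has already split a read into monomers and given each one a cluster label and a strand, and every function name below is invented for illustration.

```python
from typing import Dict, List, Optional, Tuple


def infer_hor_period(labels: List[str], min_repeats: int = 2) -> Optional[int]:
    """Return the smallest period p with labels[i] == labels[i + p] for every i.

    Only periods that fit at least `min_repeats` times in the read are tried.
    Returns None when no such exact period exists (an irregular read).
    """
    n = len(labels)
    for p in range(1, n // min_repeats + 1):
        if all(labels[i] == labels[i + p] for i in range(n - p)):
            return p
    return None


def orientation_shifts(strands: List[str]) -> List[int]:
    """Return the monomer indices whose strand differs from the previous monomer."""
    return [i for i in range(1, len(strands)) if strands[i] != strands[i - 1]]


def summarize_read(monomers: List[Tuple[str, str]]) -> Dict[str, object]:
    """Summarize one read given as (cluster_label, strand) pairs in read order."""
    labels = [label for label, _ in monomers]
    strands = [strand for _, strand in monomers]
    period = infer_hor_period(labels)
    return {
        "n_monomers": len(monomers),
        "hor_period": period,          # monomers per higher-order repeat unit
        "regular": period is not None,  # False if the exact periodicity breaks
        "orientation_shifts": orientation_shifts(strands),
    }


# Hypothetical read: three copies of a three-monomer unit, all on one strand.
regular_read = [("A", "+"), ("B", "+"), ("C", "+")] * 3
# Hypothetical read with a deleted monomer and a strand flip.
irregular_read = [("A", "+"), ("B", "+"), ("C", "+"), ("A", "+"),
                  ("B", "-"), ("A", "-"), ("B", "-"), ("C", "-")]

regular_summary = summarize_read(regular_read)
irregular_summary = summarize_read(irregular_read)
```

Exact label matching is a deliberate simplification: on real long reads the labels would come from noisy alignments, so a practical version would need to tolerate mismatches rather than require a perfect period.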
id pubmed-4920115
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-4920115 2016-06-27 Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing Sevim, Volkan Bashir, Ali Chin, Chen-Shan Miga, Karen H. Bioinformatics Discovery Note Oxford University Press 2016-07-01 2016-02-24 /pmc/articles/PMC4920115/ /pubmed/27153570 http://dx.doi.org/10.1093/bioinformatics/btw101 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Discovery Note