
Finding long tandem repeats in long noisy reads

MOTIVATION: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt...


Bibliographic Details
Main Authors:	Morishita, Shinichi; Ichikawa, Kazuki; Myers, Eugene W
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8097686/ https://www.ncbi.nlm.nih.gov/pubmed/33031558 http://dx.doi.org/10.1093/bioinformatics/btaa865

collection	PubMed
description	MOTIVATION: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10–20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats ([Formula: see text] nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time. RESULTS: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity. AVAILABILITY AND IMPLEMENTATION: https://github.com/morisUtokyo/mTR.
id	pubmed-8097686
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-8097686 2021-05-10 Bioinformatics Original Papers Oxford University Press 2020-10-08 /pmc/articles/PMC8097686/ /pubmed/33031558 http://dx.doi.org/10.1093/bioinformatics/btaa865 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
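
The two-stage idea summarized in the description (flag windows whose k-mer frequencies look repetitive, then assemble a consensus unit by a greedy de Bruijn graph walk) can be illustrated with a small Python sketch. This is not the mTR implementation: the function names, the window size, the score (fraction of k-mer occurrences belonging to k-mers seen more than once), and the threshold are all assumptions chosen for illustration, and the walk simply follows the most frequent successor k-mer until it closes a cycle.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_score(window, k):
    """Hypothetical score: fraction of k-mer occurrences in the window
    that belong to k-mers seen more than once."""
    counts = kmer_counts(window, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total


def candidate_windows(read, window, k, threshold):
    """Start positions of non-overlapping fixed-size windows whose
    repeat score reaches the (assumed) threshold."""
    return [
        start
        for start in range(0, len(read) - window + 1, window)
        if repeat_score(read[start:start + window], k) >= threshold
    ]


def greedy_unit(region, k):
    """Greedy walk on the de Bruijn graph of the region's k-mers.

    Start at the most frequent k-mer, always step to the most frequent
    successor (overlap of k-1 characters), and stop when a k-mer is
    revisited. The closed cycle spells a candidate repeat unit.
    Returns "" if the walk hits a dead end before closing a cycle.
    """
    counts = kmer_counts(region, k)
    if not counts:
        return ""
    node = max(counts, key=lambda km: (counts[km], km))
    path = []
    position = {}
    while node not in position:
        position[node] = len(path)
        path.append(node)
        successors = [node[1:] + base for base in "ACGT"]
        node = max(successors, key=lambda km: counts[km])
        if counts[node] == 0:
            return ""  # no observed successor: dead end
    cycle = path[position[node]:]
    # One character per k-mer on the cycle spells the unit.
    return "".join(km[0] for km in cycle)


if __name__ == "__main__":
    read = "ACGTTGCAAGCTTAGGCATC" + "ACGTT" * 8
    starts = candidate_windows(read, window=20, k=4, threshold=0.5)
    print(starts)
    print(greedy_unit(read[starts[0]:], k=4))
```

The unit returned by the walk is a rotation of the true unit, since a cycle has no distinguished starting point. On real reads with 10–20% errors, a practical tool would also need to tolerate erroneous k-mers and ambiguous branches, which this sketch does not attempt.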
