RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

BACKGROUND: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make...

Bibliographic Details
Main authors: Liao, Xingyu, Gao, Xin, Zhang, Xiankai, Wu, Fang-Xiang, Wang, Jianxin
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574428/
https://www.ncbi.nlm.nih.gov/pubmed/33076827
http://dx.doi.org/10.1186/s12859-020-03779-w
_version_ 1783597634926149632
author Liao, Xingyu
Gao, Xin
Zhang, Xiankai
Wu, Fang-Xiang
Wang, Jianxin
author_facet Liao, Xingyu
Gao, Xin
Zhang, Xiankai
Wu, Fang-Xiang
Wang, Jianxin
author_sort Liao, Xingyu
collection PubMed
description BACKGROUND: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. RESULTS: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences. CONCLUSIONS: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
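The first two steps named in the description (finding high-frequency k-mers, then selecting high-frequency reads) can be sketched as below. This is a minimal illustration and not RepAHR's implementation: the abstract does not state the filtering rules, so the `min_count` and `min_fraction` thresholds, the function names, and the toy reads are all assumptions, and reverse complements are ignored for brevity. The selected reads would then be passed to an assembler such as SPAdes, which is not invoked here.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (step size 1)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def high_frequency_reads(reads, k, min_count, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    Assumed rule for illustration only; the abstract says reads are
    filtered "according to certain rules" without giving them.
    """
    frequent = high_frequency_kmers(reads, k, min_count)
    kept = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read shorter than k: no k-mers to score
        hits = sum(1 for kmer in kmers(read, k) if kmer in frequent)
        if hits / total >= min_fraction:
            kept.append(read)
    return kept


# Toy input: three reads drawn from an ACGT tandem repeat, one unique read.
reads = [
    "ACGTACGTAC",
    "CGTACGTACG",
    "GTACGTACGT",
    "TTGACCAGGA",
]
selected = high_frequency_reads(reads, k=4, min_count=3, min_fraction=0.8)
```

With these toy values the three repeat-derived reads are kept and the unique read is dropped. In practice the thresholds would depend on sequencing coverage, which is why they are left as parameters.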
format Online
Article
Text
id pubmed-7574428
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-75744282020-10-20 RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads Liao, Xingyu Gao, Xin Zhang, Xiankai Wu, Fang-Xiang Wang, Jianxin BMC Bioinformatics Methodology Article BACKGROUND: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. RESULTS: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences. CONCLUSIONS: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
BioMed Central 2020-10-19 /pmc/articles/PMC7574428/ /pubmed/33076827 http://dx.doi.org/10.1186/s12859-020-03779-w Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Liao, Xingyu
Gao, Xin
Zhang, Xiankai
Wu, Fang-Xiang
Wang, Jianxin
RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title_full RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title_fullStr RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title_full_unstemmed RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title_short RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
title_sort repahr: an improved approach for de novo repeat identification by assembly of the high-frequency reads
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574428/
https://www.ncbi.nlm.nih.gov/pubmed/33076827
http://dx.doi.org/10.1186/s12859-020-03779-w
work_keys_str_mv AT liaoxingyu repahranimprovedapproachfordenovorepeatidentificationbyassemblyofthehighfrequencyreads
AT gaoxin repahranimprovedapproachfordenovorepeatidentificationbyassemblyofthehighfrequencyreads
AT zhangxiankai repahranimprovedapproachfordenovorepeatidentificationbyassemblyofthehighfrequencyreads
AT wufangxiang repahranimprovedapproachfordenovorepeatidentificationbyassemblyofthehighfrequencyreads
AT wangjianxin repahranimprovedapproachfordenovorepeatidentificationbyassemblyofthehighfrequencyreads