A sensitive repeat identification framework based on short and long reads

Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance on identifying repeats in terms of both accuracy and size, since NGS reads a...

Bibliographic Details
Main Authors: Liao, Xingyu, Li, Min, Hu, Kang, Wu, Fang-Xiang, Gao, Xin, Wang, Jianxin
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8464074/
https://www.ncbi.nlm.nih.gov/pubmed/34214175
http://dx.doi.org/10.1093/nar/gkab563
author Liao, Xingyu
Li, Min
Hu, Kang
Wu, Fang-Xiang
Gao, Xin
Wang, Jianxin
collection PubMed
description Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size, because NGS reads are too short to identify long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment, for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates repeats faster and more accurately; (iii) by using multi-alignment unique k-mers rather than high-frequency k-mers to identify repeats in overlap sequences, it obtains repeats more comprehensively and stably; (iv) by applying a parallel alignment model based on the multi-alignment unique k-mers, it greatly improves the efficiency of data processing; and (v) by applying corresponding identification strategies, it can identify structural variations that occur between repeats. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
format Online
Article
Text
id pubmed-8464074
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8464074 2021-09-27 A sensitive repeat identification framework based on short and long reads Liao, Xingyu Li, Min Hu, Kang Wu, Fang-Xiang Gao, Xin Wang, Jianxin Nucleic Acids Res Methods Online Oxford University Press 2021-07-02 /pmc/articles/PMC8464074/ /pubmed/34214175 http://dx.doi.org/10.1093/nar/gkab563 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title A sensitive repeat identification framework based on short and long reads
topic Methods Online