A sensitive repeat identification framework based on short and long reads
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance on identifying repeats in terms of both accuracy and size, since NGS reads a...
Main Authors: | Liao, Xingyu; Li, Min; Hu, Kang; Wu, Fang-Xiang; Gao, Xin; Wang, Jianxin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Methods Online |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8464074/ https://www.ncbi.nlm.nih.gov/pubmed/34214175 http://dx.doi.org/10.1093/nar/gkab563 |
_version_ | 1784572541555179520 |
---|---|
author | Liao, Xingyu Li, Min Hu, Kang Wu, Fang-Xiang Gao, Xin Wang, Jianxin |
author_sort | Liao, Xingyu |
collection | PubMed |
description | Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance in identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer-based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode-linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates repeats faster and more accurately; (iii) by using multi-alignment unique k-mers rather than high-frequency k-mers to identify repeats in overlap sequences, it can obtain repeats more comprehensively and stably; (iv) by applying a parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly improved; and (v) by applying corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker). |
format | Online Article Text |
id | pubmed-8464074 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8464074 2021-09-27 A sensitive repeat identification framework based on short and long reads Liao, Xingyu Li, Min Hu, Kang Wu, Fang-Xiang Gao, Xin Wang, Jianxin Nucleic Acids Res Methods Online Oxford University Press 2021-07-02 /pmc/articles/PMC8464074/ /pubmed/34214175 http://dx.doi.org/10.1093/nar/gkab563 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | A sensitive repeat identification framework based on short and long reads |
topic | Methods Online |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8464074/ https://www.ncbi.nlm.nih.gov/pubmed/34214175 http://dx.doi.org/10.1093/nar/gkab563 |
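The description contrasts k-mers that align to multiple positions with high-frequency k-mers as signals of repeats. As a rough illustration of the underlying idea only, the following minimal Python sketch marks the regions of a single sequence covered by exact k-mers that occur at two or more positions. It is not LongRepMarker's implementation: the function names `kmer_positions` and `repeat_intervals`, the exact-match criterion, and the toy sequence are all assumptions made for this example, and the real framework additionally relies on assembly and alignment steps that are not shown here.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each distinct k-mer to the list of start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def repeat_intervals(seq, k):
    """Return merged half-open intervals covered by k-mers seen at 2+ positions.

    Hypothetical helper for illustration; exact matching only, no error tolerance.
    """
    # Collect one (start, end) interval per occurrence of every multi-position k-mer.
    covered = sorted(
        (start, start + k)
        for starts in kmer_positions(seq, k).values()
        if len(starts) > 1
        for start in starts
    )
    # Merge overlapping or abutting intervals into maximal candidate repeat regions.
    merged = []
    for start, end in covered:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Toy sequence (assumed): the 7-mer "ACGTGCA" appears twice, separated by "TT".
toy = "ACGTGCATTACGTGCA"
print(repeat_intervals(toy, 4))  # → [(0, 7), (9, 16)]
```

Exact matching is chosen here only to keep the sketch short; with error-prone long reads, as the description notes, a practical method would need alignment that tolerates mismatches and indels.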