
IterCluster: a barcode clustering algorithm for long fragment read analysis

Recent advances in long fragment read (LFR, also known as linked-read technologies or read-cloud) technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case...


Bibliographic Details
Main Authors: Weng, Jiancong; Chen, Tian; Xie, Yinlong; Xu, Xun; Zhang, Gengyun; Peters, Brock A.; Drmanac, Radoje
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2020
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7100596/
https://www.ncbi.nlm.nih.gov/pubmed/32231869
http://dx.doi.org/10.7717/peerj.8431
author Weng, Jiancong
Chen, Tian
Xie, Yinlong
Xu, Xun
Zhang, Gengyun
Peters, Brock A.
Drmanac, Radoje
collection PubMed
description Recent advances in long fragment read (LFR, also known as linked-read technologies or read-cloud) technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case of stLFR and 10X Genomics Chromium reads, the long fragments of a genome are covered sparsely by reads in each barcode and most barcodes are contained in multiple long fragments from different regions, which results in inefficient assembly when using long-range information. Thus, methods to address these shortcomings are vital for capitalizing on the additional information obtained using these technologies. We therefore designed IterCluster, a novel, alignment-free clustering algorithm that can cluster barcodes from the same target region of a genome, using k-mer frequency-based features and a Markov Cluster (MCL) approach to identify enough reads in a target region of a genome to ensure sufficient target genome sequence depth. The IterCluster method was validated using BGI stLFR and 10X Genomics Chromium read datasets. IterCluster had a higher precision and recall rate on BGI stLFR data compared to 10X Genomics Chromium read data. In addition, we demonstrated how IterCluster improves the de novo assembly results when using a divide-and-conquer strategy on a human genome data set (scaffold/contig N50 = 13.2 kbp/7.1 kbp vs. 17.1 kbp/11.9 kbp before and after IterCluster, respectively). IterCluster provides a new way for determining LFR barcode enrichment and a novel approach for de novo assembly using LFR data. IterCluster is open source and available at https://github.com/JianCong-WENG/IterCluster.
format Online
Article
Text
id pubmed-7100596
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7100596 2020-03-30 IterCluster: a barcode clustering algorithm for long fragment read analysis. PeerJ, Bioinformatics. PeerJ Inc. 2020-03-24 /pmc/articles/PMC7100596/ /pubmed/32231869 http://dx.doi.org/10.7717/peerj.8431 Text en ©2020 Weng et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title IterCluster: a barcode clustering algorithm for long fragment read analysis
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7100596/
https://www.ncbi.nlm.nih.gov/pubmed/32231869
http://dx.doi.org/10.7717/peerj.8431