IterCluster: a barcode clustering algorithm for long fragment read analysis
Recent advances in long fragment read (LFR, also known as linked-read technologies or read-cloud) technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case...
Main authors: | Weng, Jiancong; Chen, Tian; Xie, Yinlong; Xu, Xun; Zhang, Gengyun; Peters, Brock A.; Drmanac, Radoje |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2020 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7100596/ https://www.ncbi.nlm.nih.gov/pubmed/32231869 http://dx.doi.org/10.7717/peerj.8431 |
_version_ | 1783511463981219840 |
---|---|
author | Weng, Jiancong Chen, Tian Xie, Yinlong Xu, Xun Zhang, Gengyun Peters, Brock A. Drmanac, Radoje |
author_facet | Weng, Jiancong Chen, Tian Xie, Yinlong Xu, Xun Zhang, Gengyun Peters, Brock A. Drmanac, Radoje |
author_sort | Weng, Jiancong |
collection | PubMed |
description | Recent advances in long fragment read (LFR, also known as linked-read technologies or read-cloud) technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case of stLFR and 10X Genomics Chromium reads, the long fragments of a genome are covered sparsely by reads in each barcode and most barcodes are contained in multiple long fragments from different regions, which results in inefficient assembly when using long-range information. Thus, methods to address these shortcomings are vital for capitalizing on the additional information obtained using these technologies. We therefore designed IterCluster, a novel, alignment-free clustering algorithm that can cluster barcodes from the same target region of a genome, using k-mer frequency-based features and a Markov Cluster (MCL) approach to identify enough reads in a target region of a genome to ensure sufficient target genome sequence depth. The IterCluster method was validated using BGI stLFR and 10X Genomics Chromium read datasets. IterCluster had a higher precision and recall rate on BGI stLFR data compared to 10X Genomics Chromium read data. In addition, we demonstrated how IterCluster improves the de novo assembly results when using a divide-and-conquer strategy on a human genome data set (scaffold/contig N50 = 13.2 kbp/7.1 kbp vs. 17.1 kbp/11.9 kbp before and after IterCluster, respectively). IterCluster provides a new way for determining LFR barcode enrichment and a novel approach for de novo assembly using LFR data. IterCluster is open source and available at https://github.com/JianCong-WENG/IterCluster. |
format | Online Article Text |
id | pubmed-7100596 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-7100596 2020-03-30 IterCluster: a barcode clustering algorithm for long fragment read analysis PeerJ Bioinformatics PeerJ Inc. 2020-03-24 /pmc/articles/PMC7100596/ /pubmed/32231869 http://dx.doi.org/10.7717/peerj.8431 Text en ©2020 Weng et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Weng, Jiancong Chen, Tian Xie, Yinlong Xu, Xun Zhang, Gengyun Peters, Brock A. Drmanac, Radoje IterCluster: a barcode clustering algorithm for long fragment read analysis |
title | IterCluster: a barcode clustering algorithm for long fragment read analysis |
title_sort | itercluster: a barcode clustering algorithm for long fragment read analysis |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7100596/ https://www.ncbi.nlm.nih.gov/pubmed/32231869 http://dx.doi.org/10.7717/peerj.8431 |
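
The description above names two ingredients, k-mer based features per barcode and Markov Cluster (MCL) grouping, without giving details. The following is a minimal sketch of that general idea, not the IterCluster implementation: it assumes Jaccard similarity between per-barcode k-mer sets as the feature (the record only says "k-mer frequency-based features"), a plain fixed-iteration MCL, and self-loops set to each column's largest edge weight. All function names (`kmers`, `jaccard`, `mcl`, `cluster_barcodes`) and parameter defaults are hypothetical, and the pure-Python matrix product is only suitable for toy inputs.

```python
def kmers(seq, k):
    """Return the set of k-mers occurring in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (assumed feature, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Naive dense matrix product; fine for a handful of barcodes only."""
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(sim, inflation=2.0, iterations=50, eps=1e-6):
    """Plain MCL on a symmetric similarity matrix with zero diagonal.

    Returns a set of frozensets of node indices.
    """
    n = len(sim)
    m = [row[:] for row in sim]
    for j in range(n):
        # Self-loop weight: largest edge in the column, or 1.0 if isolated.
        best = max((sim[i][j] for i in range(n) if i != j), default=0.0)
        m[j][j] = best if best > 0 else 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: two-step random walk
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Rows that keep mass are attractors; their nonzero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return clusters


def cluster_barcodes(reads_by_barcode, k=5):
    """Group barcodes whose reads share k-mers; returns sorted lists of barcodes."""
    barcodes = sorted(reads_by_barcode)
    sets = [set().union(*(kmers(r, k) for r in reads_by_barcode[b]))
            for b in barcodes]
    n = len(barcodes)
    sim = [[jaccard(sets[i], sets[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    return sorted(sorted(barcodes[j] for j in c) for c in mcl(sim))
```

In this sketch a barcode is represented only by the union of k-mers in its reads, so two barcodes end up connected whenever their reads overlap the same stretch of sequence; real linked-read data would need a much larger k, k-mer filtering, and a sparse MCL implementation.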