HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C

MOTIVATION: De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve...

Bibliographic Details
Main authors: Nakabayashi, Ryo; Morishita, Shinichi
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672694/
https://www.ncbi.nlm.nih.gov/pubmed/32369554
http://dx.doi.org/10.1093/bioinformatics/btaa288
author Nakabayashi, Ryo
Morishita, Shinichi
author_facet Nakabayashi, Ryo
Morishita, Shinichi
author_sort Nakabayashi, Ryo
collection PubMed
description MOTIVATION: De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of two loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics. RESULTS: To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs. AVAILABILITY AND IMPLEMENTATION: HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.
format Online
Article
Text
id pubmed-7672694
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7672694 2020-11-24 HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C Nakabayashi, Ryo Morishita, Shinichi Bioinformatics Original Papers Oxford University Press 2020-07 2020-05-05 /pmc/articles/PMC7672694/ /pubmed/32369554 http://dx.doi.org/10.1093/bioinformatics/btaa288 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672694/
https://www.ncbi.nlm.nih.gov/pubmed/32369554
http://dx.doi.org/10.1093/bioinformatics/btaa288
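
The description field above states that HiC-Hiker infers the most probable contig orientations via the Viterbi algorithm. As a rough illustration of that idea only, the following Python sketch runs Viterbi over a chain of contigs with two states each (0 = forward, 1 = reverse). It is not the authors' implementation: the function name `viterbi_orientations` and the input `pair_loglik` are hypothetical, and the sketch assumes that only adjacent contigs are scored, whereas the abstract says the actual method also considers long-range information between distal contigs.

```python
from typing import List, Sequence


def viterbi_orientations(
    pair_loglik: Sequence[Sequence[Sequence[float]]],
) -> List[int]:
    """Return the most likely orientation (0 = forward, 1 = reverse) per contig.

    pair_loglik[i][a][b] is an assumed, precomputed log-likelihood of the
    Hi-C contacts observed between contig i (orientation a) and contig i + 1
    (orientation b). How these values are derived is outside this sketch.
    """
    n_contigs = len(pair_loglik) + 1
    # score[s]: best total log-likelihood of a prefix whose last contig is in state s
    score = [0.0, 0.0]
    backpointers = []
    for i in range(n_contigs - 1):
        new_score = [0.0, 0.0]
        ptr = [0, 0]
        for b in (0, 1):
            candidates = [score[a] + pair_loglik[i][a][b] for a in (0, 1)]
            best_a = 0 if candidates[0] >= candidates[1] else 1
            new_score[b] = candidates[best_a]
            ptr[b] = best_a
        score = new_score
        backpointers.append(ptr)

    # Trace back from the best final state to recover the full path.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy scores for three contigs (made-up numbers, for illustration only).
    toy = [
        [[-1.0, -5.0], [-5.0, -5.0]],
        [[-5.0, -1.0], [-5.0, -5.0]],
    ]
    print(viterbi_orientations(toy))
```

Dynamic programming keeps the search linear in the number of contigs, instead of enumerating all 2^n orientation assignments.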