
HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses

Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than next-generation sequencing, can lead to gen...


Bibliographic Details
Main authors: Yu, Runzhou; Abdullah, Syed Muhammad Umer; Sun, Yanni
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Problem Solving Protocol
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516367/
https://www.ncbi.nlm.nih.gov/pubmed/37478372
http://dx.doi.org/10.1093/bib/bbad264
author Yu, Runzhou
Abdullah, Syed Muhammad Umer
Sun, Yanni
collection PubMed
description Access to accurate viral genomes is important for downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite the promising results of available polishing tools, there is still room to improve error correction and thereby enable more accurate genome assembly. Errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses: HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platform (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.
format Online
Article
Text
id pubmed-10516367
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10516367 2023-09-23 HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses Yu, Runzhou Abdullah, Syed Muhammad Umer Sun, Yanni Brief Bioinform Problem Solving Protocol Access to accurate viral genomes is important for downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite the promising results of available polishing tools, there is still room to improve error correction and thereby enable more accurate genome assembly. Errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses: HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platform (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses. Oxford University Press 2023-07-20 /pmc/articles/PMC10516367/ /pubmed/37478372 http://dx.doi.org/10.1093/bib/bbad264 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses
topic Problem Solving Protocol
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516367/
https://www.ncbi.nlm.nih.gov/pubmed/37478372
http://dx.doi.org/10.1093/bib/bbad264