HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses
Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to gen...
Main authors: | Yu, Runzhou; Abdullah, Syed Muhammad Umer; Sun, Yanni |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Problem Solving Protocol |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516367/ https://www.ncbi.nlm.nih.gov/pubmed/37478372 http://dx.doi.org/10.1093/bib/bbad264 |
_version_ | 1785109115140308992 |
---|---|
author | Yu, Runzhou Abdullah, Syed Muhammad Umer Sun, Yanni |
author_facet | Yu, Runzhou Abdullah, Syed Muhammad Umer Sun, Yanni |
author_sort | Yu, Runzhou |
collection | PubMed |
description | Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results of available polishing tools, there is still room to improve the error correction performance to perform more accurate genome assembly. The errors, particularly those in coding regions, can hamper analysis such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses. |
format | Online Article Text |
id | pubmed-10516367 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10516367 2023-09-23 HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses Yu, Runzhou Abdullah, Syed Muhammad Umer Sun, Yanni Brief Bioinform Problem Solving Protocol Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results of available polishing tools, there is still room to improve the error correction performance to perform more accurate genome assembly. The errors, particularly those in coding regions, can hamper analysis such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses. Oxford University Press 2023-07-20 /pmc/articles/PMC10516367/ /pubmed/37478372 http://dx.doi.org/10.1093/bib/bbad264 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Problem Solving Protocol Yu, Runzhou Abdullah, Syed Muhammad Umer Sun, Yanni HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title | HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title_full | HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title_fullStr | HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title_full_unstemmed | HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title_short | HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses |
title_sort | hmmpolish: a coding region polishing tool for tgs-sequenced rna viruses |
topic | Problem Solving Protocol |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516367/ https://www.ncbi.nlm.nih.gov/pubmed/37478372 http://dx.doi.org/10.1093/bib/bbad264 |
work_keys_str_mv | AT yurunzhou hmmpolishacodingregionpolishingtoolfortgssequencedrnaviruses AT abdullahsyedmuhammadumer hmmpolishacodingregionpolishingtoolfortgssequencedrnaviruses AT sunyanni hmmpolishacodingregionpolishingtoolfortgssequencedrnaviruses |