EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality

BACKGROUND: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scor...

Bibliographic Details
Main Authors: MacDonald, Madolyn L., Lee, Kelvin H.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8627028/
https://www.ncbi.nlm.nih.gov/pubmed/34837948
http://dx.doi.org/10.1186/s12859-021-04480-2
_version_ 1784606774245982208
author MacDonald, Madolyn L.
Lee, Kelvin H.
author_facet MacDonald, Madolyn L.
Lee, Kelvin H.
author_sort MacDonald, Madolyn L.
collection PubMed
description BACKGROUND: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment. RESULTS: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study. CONCLUSIONS: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04480-2.
format Online
Article
Text
id pubmed-8627028
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8627028 2021-11-30 EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality MacDonald, Madolyn L. Lee, Kelvin H. BMC Bioinformatics Software BioMed Central 2021-11-27 /pmc/articles/PMC8627028/ /pubmed/34837948 http://dx.doi.org/10.1186/s12859-021-04480-2 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
MacDonald, Madolyn L.
Lee, Kelvin H.
EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality
title EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality
title_sort evaldna: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8627028/
https://www.ncbi.nlm.nih.gov/pubmed/34837948
http://dx.doi.org/10.1186/s12859-021-04480-2
work_keys_str_mv AT macdonaldmadolynl evaldnaamachinelearningbasedtoolforthecomprehensiveevaluationofmammaliangenomeassemblyquality
AT leekelvinh evaldnaamachinelearningbasedtoolforthecomprehensiveevaluationofmammaliangenomeassemblyquality