EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality
BACKGROUND: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scor...
Main authors: | MacDonald, Madolyn L., Lee, Kelvin H. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8627028/ https://www.ncbi.nlm.nih.gov/pubmed/34837948 http://dx.doi.org/10.1186/s12859-021-04480-2 |
_version_ | 1784606774245982208 |
---|---|
author | MacDonald, Madolyn L.; Lee, Kelvin H. |
collection | PubMed |
description | BACKGROUND: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment. RESULTS: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study. CONCLUSIONS: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04480-2. |
format | Online Article Text |
id | pubmed-8627028 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8627028 2021-11-30 EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality MacDonald, Madolyn L. Lee, Kelvin H. BMC Bioinformatics Software BioMed Central 2021-11-27 /pmc/articles/PMC8627028/ /pubmed/34837948 http://dx.doi.org/10.1186/s12859-021-04480-2 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8627028/ https://www.ncbi.nlm.nih.gov/pubmed/34837948 http://dx.doi.org/10.1186/s12859-021-04480-2 |