Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life
BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, stati...
Main authors: | Zhao, Zhengqiao; Cristian, Alexandru; Rosen, Gail |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7507296/ https://www.ncbi.nlm.nih.gov/pubmed/32957925 http://dx.doi.org/10.1186/s12859-020-03744-7 |
_version_ | 1783585199702933504 |
---|---|
author | Zhao, Zhengqiao Cristian, Alexandru Rosen, Gail |
author_facet | Zhao, Zhengqiao Cristian, Alexandru Rosen, Gail |
author_sort | Zhao, Zhengqiao |
collection | PubMed |
description | BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. RESULTS: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in one quarter of the non-incremental time with no accuracy loss. CONCLUSIONS: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need for access to the existing database, and therefore saves storage as well as computational resources. |
format | Online Article Text |
id | pubmed-7507296 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7507296 2020-09-23 Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life Zhao, Zhengqiao Cristian, Alexandru Rosen, Gail BMC Bioinformatics Research Article BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. RESULTS: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in one quarter of the non-incremental time with no accuracy loss. CONCLUSIONS: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need for access to the existing database, and therefore saves storage as well as computational resources.
BioMed Central 2020-09-21 /pmc/articles/PMC7507296/ /pubmed/32957925 http://dx.doi.org/10.1186/s12859-020-03744-7 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Zhao, Zhengqiao Cristian, Alexandru Rosen, Gail Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title_full | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title_fullStr | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title_full_unstemmed | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title_short | Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
title_sort | keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7507296/ https://www.ncbi.nlm.nih.gov/pubmed/32957925 http://dx.doi.org/10.1186/s12859-020-03744-7 |
work_keys_str_mv | AT zhaozhengqiao keepingupwiththegenomesefficientlearningofourincreasingknowledgeofthetreeoflife AT cristianalexandru keepingupwiththegenomesefficientlearningofourincreasingknowledgeofthetreeoflife AT rosengail keepingupwiththegenomesefficientlearningofourincreasingknowledgeofthetreeoflife |
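
The abstract describes updating a naïve Bayes classifier with new reference genomes without reprocessing the existing database. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a k-mer count model with add-one smoothing and uniform class priors, and every name in it (`IncrementalNB`, `update`, `classify`, the toy `K = 3`, and the example sequences) is hypothetical. The point it illustrates is that the model's state is only per-class counts, so adding a genome means adding counts, and earlier sequences are never revisited.

```python
from collections import Counter, defaultdict
import math

K = 3  # toy k-mer length (assumption; chosen small so the example is readable)


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class IncrementalNB:
    """Hypothetical k-mer naive Bayes model whose state is per-class counts."""

    def __init__(self, k=K):
        self.k = k
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total k-mers seen

    def update(self, label, seq):
        """Add one reference sequence; earlier sequences are not reprocessed."""
        kms = kmers(seq, self.k)
        self.counts[label].update(kms)
        self.totals[label] += len(kms)

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood."""
        vocab = 4 ** self.k  # number of possible DNA k-mers
        best, best_score = None, -math.inf
        for label in sorted(self.counts):
            c, n = self.counts[label], self.totals[label]
            # Add-one smoothing; uniform class priors are assumed.
            score = sum(math.log((c[km] + 1) / (n + vocab))
                        for km in kmers(read, self.k))
            if score > best_score:
                best, best_score = label, score
        return best


# Usage: the first "snapshot" contains only genome A.
model = IncrementalNB()
model.update("A", "ACGTACGTAC")
before = model.classify("GGGCCC")   # only A is known, so A is the only answer

# A later snapshot adds genome B; only the new sequence is processed.
model.update("B", "GGGCCCGGGC")
after = model.classify("GGGCCC")
```

In this toy run the read can only be assigned to the one known class before the update and is assigned to the newly added class afterwards, which mirrors the abstract's observation that accuracy grows with the classifier's knowledge of genomes.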