Decontaminating eukaryotic genome assemblies with machine learning
BACKGROUND: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences but in practice DNA extracts are often contaminated with sequences from other organisms. Currently, there are few existing methods for rigorously decontaminating eukary...
Main Authors: | Fierst, Janna L.; Murdock, Duncan A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5709863/ https://www.ncbi.nlm.nih.gov/pubmed/29191179 http://dx.doi.org/10.1186/s12859-017-1941-0 |
_version_ | 1783282857690529792 |
---|---|
author | Fierst, Janna L. Murdock, Duncan A. |
author_facet | Fierst, Janna L. Murdock, Duncan A. |
author_sort | Fierst, Janna L. |
collection | PubMed |
description | BACKGROUND: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences but in practice DNA extracts are often contaminated with sequences from other organisms. Currently, there are few existing methods for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants and risk eliminating sequences from the target organism. RESULTS: We introduce a novel application of an established machine learning method, a decision tree, that can rigorously classify sequences. The major strength of the decision tree is that it can take any measured feature as input and does not require a priori identification of significant descriptors. We use the decision tree to classify de novo assembled sequences and compare the method to published protocols. CONCLUSIONS: A decision tree performs better than existing methods when classifying sequences in eukaryotic de novo assemblies. It is efficient, readily implemented, and accurately identifies target and contaminant sequences. Importantly, a decision tree can be used to classify sequences according to measured descriptors and has potentially many uses in distilling biological datasets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1941-0) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5709863 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5709863 2017-12-06 Decontaminating eukaryotic genome assemblies with machine learning Fierst, Janna L. Murdock, Duncan A. BMC Bioinformatics Methodology Article BioMed Central 2017-12-01 /pmc/articles/PMC5709863/ /pubmed/29191179 http://dx.doi.org/10.1186/s12859-017-1941-0 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Fierst, Janna L. Murdock, Duncan A. Decontaminating eukaryotic genome assemblies with machine learning |
title | Decontaminating eukaryotic genome assemblies with machine learning |
title_full | Decontaminating eukaryotic genome assemblies with machine learning |
title_fullStr | Decontaminating eukaryotic genome assemblies with machine learning |
title_full_unstemmed | Decontaminating eukaryotic genome assemblies with machine learning |
title_short | Decontaminating eukaryotic genome assemblies with machine learning |
title_sort | decontaminating eukaryotic genome assemblies with machine learning |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5709863/ https://www.ncbi.nlm.nih.gov/pubmed/29191179 http://dx.doi.org/10.1186/s12859-017-1941-0 |
work_keys_str_mv | AT fierstjannal decontaminatingeukaryoticgenomeassemblieswithmachinelearning AT murdockduncana decontaminatingeukaryoticgenomeassemblieswithmachinelearning |
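
The description above says a decision tree can classify assembled sequences from any measured feature. The following is a minimal sketch of that general idea, not the authors' implementation: it builds a small Gini-based tree in plain Python. The two features (GC fraction and read coverage), the numeric values, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively build a tree; leaves are labels, nodes are 4-tuples."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-label leaf
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def classify(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training contigs: (GC fraction, mean read coverage). Invented values.
train_rows = [(0.36, 95.0), (0.38, 102.0), (0.35, 88.0), (0.37, 110.0),
              (0.62, 12.0), (0.65, 9.0), (0.60, 15.0), (0.66, 11.0)]
train_labels = ["target"] * 4 + ["contaminant"] * 4

tree = build_tree(train_rows, train_labels)
print(classify(tree, (0.37, 99.0)))
print(classify(tree, (0.63, 10.0)))
```

The tree picks its own split features from whatever columns are supplied, which mirrors the point that no descriptor has to be singled out in advance. Real use would need labelled training sequences and held-out validation.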