
Identifying the minimum amplicon sequence depth to adequately predict classes in eDNA-based marine biomonitoring using supervised machine learning

Environmental DNA metabarcoding is a powerful approach for use in biomonitoring and impact assessments. Amplicon-based eDNA sequence data are characteristically highly divergent in sequencing depth (total reads per sample) as influenced inter alia by the number of samples simultaneously analyzed per...


Bibliographic Details
Main Authors:	Dully, Verena; Wilding, Thomas A.; Mühlhaus, Timo; Stoeck, Thorsten
Format:	Online Article Text
Language:	English
Published:	Research Network of Computational and Structural Biotechnology 2021
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8093828/ https://www.ncbi.nlm.nih.gov/pubmed/33995917 http://dx.doi.org/10.1016/j.csbj.2021.04.005

author	Dully, Verena; Wilding, Thomas A.; Mühlhaus, Timo; Stoeck, Thorsten
collection	PubMed
description	Environmental DNA metabarcoding is a powerful approach for use in biomonitoring and impact assessments. Amplicon-based eDNA sequence data are characteristically highly divergent in sequencing depth (total reads per sample) as influenced inter alia by the number of samples simultaneously analyzed per sequencing run. The random forest (RF) machine learning algorithm has been successfully employed to accurately classify unknown samples into monitoring categories. To employ RF to eDNA data, and avoid sequencing-depth artifacts, sequence data across samples are normalized using rarefaction, a process that inherently loses information. The aim of this study was to inform future sampling designs in terms of the relationship between sampling depth and RF accuracy. We analyzed three published and one new bacterial amplicon datasets, using a RF, based initially on the maximal rarefied data available (minimum mean of > 30,000 reads across all datasets) to give our baseline performance. We then evaluated the RF classification success based on increasingly rarefied datasets. We found that extreme to moderate rarefaction (50–5000 sequences per sample) was sufficient to achieve prediction performance commensurate to the full data, depending on the classification task. We did not find that the number of classification classes, data balance across classes, or the total number of sequences or samples, were associated with predictive accuracy. We identified the ability of the training data to adequately characterize the classes being mapped as the most important criterion and discuss how this finding can inform future sampling design for eDNA based biomonitoring to reduce costs and computation time.
format	Online Article Text
id	pubmed-8093828
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Research Network of Computational and Structural Biotechnology
record_format	MEDLINE/PubMed
spelling	pubmed-8093828 2021-05-14 Comput Struct Biotechnol J Research Article Research Network of Computational and Structural Biotechnology 2021-04-26 /pmc/articles/PMC8093828/ /pubmed/33995917 http://dx.doi.org/10.1016/j.csbj.2021.04.005 Text en © 2021 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title	Identifying the minimum amplicon sequence depth to adequately predict classes in eDNA-based marine biomonitoring using supervised machine learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8093828/ https://www.ncbi.nlm.nih.gov/pubmed/33995917 http://dx.doi.org/10.1016/j.csbj.2021.04.005
