
Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification

Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can implement extensive volumes of barco...


Bibliographic Details
Main Authors: Wang, Chao; Zhang, Ying; Han, Shuguang
Format: Online Article Text
Language: English
Published: Hindawi 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7275950/
https://www.ncbi.nlm.nih.gov/pubmed/32566672
http://dx.doi.org/10.1155/2020/2468789
_version_ 1783542861660160000
author Wang, Chao
Zhang, Ying
Han, Shuguang
author_facet Wang, Chao
Zhang, Ying
Han, Shuguang
author_sort Wang, Chao
collection PubMed
description Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can implement extensive volumes of barcode sequences, especially the internal transcribed spacer (ITS) region, are necessary. However, high variation in the ITS region and computational requirements for processing high-dimensional features remain challenging for existing predictors. In this study, we developed Its2vec, a bioinformatics tool for the classification of fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species in a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent an ITS sequence as a dense low-dimensional vector. A random forest-based classifier was built for species identification. Benchmarking results showed that our model achieved an accuracy comparable to that of several state-of-the-art predictors, and more importantly, it could implement large datasets and greatly reduce dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and in deepening our understanding of their functional mechanisms.
format Online
Article
Text
id pubmed-7275950
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-72759502020-06-20 Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification Wang, Chao Zhang, Ying Han, Shuguang Biomed Res Int Research Article Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can implement extensive volumes of barcode sequences, especially the internal transcribed spacer (ITS) region, are necessary. However, high variation in the ITS region and computational requirements for processing high-dimensional features remain challenging for existing predictors. In this study, we developed Its2vec, a bioinformatics tool for the classification of fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species in a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent an ITS sequence as a dense low-dimensional vector. A random forest-based classifier was built for species identification. Benchmarking results showed that our model achieved an accuracy comparable to that of several state-of-the-art predictors, and more importantly, it could implement large datasets and greatly reduce dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and in deepening our understanding of their functional mechanisms. Hindawi 2020-05-27 /pmc/articles/PMC7275950/ /pubmed/32566672 http://dx.doi.org/10.1155/2020/2468789 Text en Copyright © 2020 Chao Wang et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Wang, Chao
Zhang, Ying
Han, Shuguang
Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title_full Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title_fullStr Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title_full_unstemmed Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title_short Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification
title_sort its2vec: fungal species identification using sequence embedding and random forest classification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7275950/
https://www.ncbi.nlm.nih.gov/pubmed/32566672
http://dx.doi.org/10.1155/2020/2468789
work_keys_str_mv AT wangchao its2vecfungalspeciesidentificationusingsequenceembeddingandrandomforestclassification
AT zhangying its2vecfungalspeciesidentificationusingsequenceembeddingandrandomforestclassification
AT hanshuguang its2vecfungalspeciesidentificationusingsequenceembeddingandrandomforestclassification
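
Illustrative note: the description field above says that an ITS sequence is represented as a dense low-dimensional vector by a word embedding algorithm before random forest classification. The sketch below only illustrates the general idea of turning a sequence into a fixed-length vector. It is not the authors' implementation. The use of overlapping k-mers as "words", the values of k and dim, and every function name are assumptions made for illustration, and placeholder_vector is a hash-based stand-in for a trained embedding lookup, not a real embedding model.

```python
import hashlib
from typing import List


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mer 'words' (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def placeholder_vector(word: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a trained embedding lookup.

    Returns a deterministic pseudo-vector with values in [-1, 1].
    dim must not exceed 32, the length of a SHA-256 digest in bytes.
    """
    if dim > 32:
        raise ValueError("dim must be <= 32 for this placeholder")
    digest = hashlib.sha256(word.encode("ascii")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def embed_sequence(seq: str, k: int = 3, dim: int = 8) -> List[float]:
    """Average the per-k-mer vectors into one fixed-length sequence vector."""
    words = kmers(seq, k)
    if not words:
        # Sequence shorter than k: return a zero vector of the right length.
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        vec = placeholder_vector(word, dim)
        for j in range(dim):
            total[j] += vec[j]
    return [t / len(words) for t in total]


if __name__ == "__main__":
    # Toy sequence, not real ITS data.
    vector = embed_sequence("ACGTACGTTAGC")
    print(len(vector), vector)
```

In a fuller pipeline, vectors like these (produced by a trained embedding model rather than the placeholder) would be the feature rows passed to a random forest classifier, with species labels as the targets.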