
Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting

BACKGROUND: Biomolecular methods for species identification are increasingly being utilised in the study of changing environments, both at the microscopic and macroscopic levels. High-throughput peptide mass fingerprinting has been largely applied to bacterial identification, but increasingly used t...

Bibliographic Details
Main authors: Gu, Muxin; Buckley, Michael
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019507/
https://www.ncbi.nlm.nih.gov/pubmed/29940843
http://dx.doi.org/10.1186/s12859-018-2221-3
_version_ 1783335138623488000
author Gu, Muxin
Buckley, Michael
author_facet Gu, Muxin
Buckley, Michael
author_sort Gu, Muxin
collection PubMed
description BACKGROUND: Biomolecular methods for species identification are increasingly being utilised in the study of changing environments, both at the microscopic and macroscopic levels. High-throughput peptide mass fingerprinting has been largely applied to bacterial identification, but increasingly used to identify archaeological and palaeontological skeletal material to yield information on past environments and human-animal interaction. However, as applications move away from predominantly domesticate and the more abundant wild fauna to a much wider range of less common taxa that do not yet have genetically-derived sequence information, robust methods of species identification and biomarker selection need to be determined. RESULTS: Here we developed a supervised machine learning algorithm for classifying the species of ancient remains based on collagen fingerprinting. The aim was to minimise requirements on prior knowledge of known species while yielding satisfactory sensitivity and specificity. The algorithm uses iterations of a modified random forest classifier with a similarity scoring system to expand its identified samples. We tested it on a set of 6805 spectra and found that a high level of accuracy can be achieved with a training set of five identified specimens per taxon. CONCLUSIONS: This method consistently achieves higher accuracy than two-dimensional principal component analysis and similar accuracy with hierarchical clustering using optimised parameters, which greatly reduces requirements for human input. Within the vertebrata, we demonstrate that this method was able to achieve the taxonomic resolution of family or sub-family level whereas the genus- or species-level identification may require manual interpretation or further experiments. In addition, it also identifies additional species biomarkers than those previously published. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2221-3) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6019507
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-60195072018-07-06 Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting Gu, Muxin Buckley, Michael BMC Bioinformatics Methodology Article BACKGROUND: Biomolecular methods for species identification are increasingly being utilised in the study of changing environments, both at the microscopic and macroscopic levels. High-throughput peptide mass fingerprinting has been largely applied to bacterial identification, but increasingly used to identify archaeological and palaeontological skeletal material to yield information on past environments and human-animal interaction. However, as applications move away from predominantly domesticate and the more abundant wild fauna to a much wider range of less common taxa that do not yet have genetically-derived sequence information, robust methods of species identification and biomarker selection need to be determined. RESULTS: Here we developed a supervised machine learning algorithm for classifying the species of ancient remains based on collagen fingerprinting. The aim was to minimise requirements on prior knowledge of known species while yielding satisfactory sensitivity and specificity. The algorithm uses iterations of a modified random forest classifier with a similarity scoring system to expand its identified samples. We tested it on a set of 6805 spectra and found that a high level of accuracy can be achieved with a training set of five identified specimens per taxon. CONCLUSIONS: This method consistently achieves higher accuracy than two-dimensional principal component analysis and similar accuracy with hierarchical clustering using optimised parameters, which greatly reduces requirements for human input. Within the vertebrata, we demonstrate that this method was able to achieve the taxonomic resolution of family or sub-family level whereas the genus- or species-level identification may require manual interpretation or further experiments. 
In addition, it also identifies additional species biomarkers than those previously published. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2221-3) contains supplementary material, which is available to authorized users. BioMed Central 2018-06-26 /pmc/articles/PMC6019507/ /pubmed/29940843 http://dx.doi.org/10.1186/s12859-018-2221-3 Text en © The Author(s). 2018 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Gu, Muxin
Buckley, Michael
Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title_full Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title_fullStr Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title_full_unstemmed Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title_short Semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
title_sort semi-supervised machine learning for automated species identification by collagen peptide mass fingerprinting
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019507/
https://www.ncbi.nlm.nih.gov/pubmed/29940843
http://dx.doi.org/10.1186/s12859-018-2221-3
work_keys_str_mv AT gumuxin semisupervisedmachinelearningforautomatedspeciesidentificationbycollagenpeptidemassfingerprinting
AT buckleymichael semisupervisedmachinelearningforautomatedspeciesidentificationbycollagenpeptidemassfingerprinting