
Classification of bacterial plasmid and chromosome derived sequences using machine learning

Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine if a...


Bibliographic Details
Main Authors: Zou, Xiaohui; Nguyen, Marcus; Overbeek, Jamie; Cao, Bin; Davis, James J.
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9757591/
https://www.ncbi.nlm.nih.gov/pubmed/36525447
http://dx.doi.org/10.1371/journal.pone.0279280
author Zou, Xiaohui
Nguyen, Marcus
Overbeek, Jamie
Cao, Bin
Davis, James J.
collection PubMed
description Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine if a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models, including random forest, logistic regression, XGBoost, and a neural network, for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Of the methods tested, a neural network model that used nucleotide 6-mers as features and was trained on randomly selected chromosomal and plasmid subsequences 5 kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over 10-fold cross-validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer (including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements) were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-encoding sequences in short read assemblies without the need for sequence alignment-based tools.
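The feature pipeline summarized in the description (random 5 kb subsequences, nucleotide 6-mer features, and a vote across subsequences) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, normalizing counts to frequencies is an assumption, and the simple majority vote stands in for the voting strategy, whose details the record does not give. The classifier itself is omitted; any model that accepts a fixed-length numeric vector could consume these features.

```python
import itertools
import random
from collections import Counter

ALPHABET = "ACGT"

def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed column index."""
    return {"".join(p): i
            for i, p in enumerate(itertools.product(ALPHABET, repeat=k))}

def kmer_frequencies(seq, k=6, index=None):
    """Return a length-4**k list of k-mer frequencies for seq.

    Assumption: counts are normalized to sum to 1; the record does not
    say whether the published model used raw counts or frequencies.
    """
    index = index or kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        col = index.get(seq[i:i + k])  # k-mers containing N etc. are skipped
        if col is not None:
            counts[col] += 1
            total += 1
    return [c / total for c in counts] if total else counts

def sample_subsequences(seq, length=5000, n=5, rng=None):
    """Draw n random windows of the given length.

    Sequences no longer than one window are returned whole (an assumption).
    """
    rng = rng or random.Random(0)
    if len(seq) <= length:
        return [seq]
    starts = [rng.randrange(len(seq) - length + 1) for _ in range(n)]
    return [seq[s:s + length] for s in starts]

def majority_vote(labels):
    """Return the most common per-window label (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]
```

In use, a contig would be split with sample_subsequences, each window converted with kmer_frequencies and passed to a trained classifier, and the per-window labels combined with majority_vote to give one call per contig.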
format Online
Article
Text
id pubmed-9757591
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9757591 2022-12-17 Classification of bacterial plasmid and chromosome derived sequences using machine learning Zou, Xiaohui; Nguyen, Marcus; Overbeek, Jamie; Cao, Bin; Davis, James J. PLoS One Research Article
Public Library of Science 2022-12-16 /pmc/articles/PMC9757591/ /pubmed/36525447 http://dx.doi.org/10.1371/journal.pone.0279280 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title Classification of bacterial plasmid and chromosome derived sequences using machine learning
topic Research Article