K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Every day more plant genomes are available in public databases and additional massive sequencing projects (i.e., that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR ret...

Bibliographic Details
Main authors: Orozco-Arias, Simon; Candamil-Cortés, Mariana S.; Jaimes, Paula A.; Piña, Johan S.; Tabares-Soto, Reinel; Guyot, Romain; Isaza, Gustavo
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8140598/
https://www.ncbi.nlm.nih.gov/pubmed/34055489
http://dx.doi.org/10.7717/peerj.11456
_version_ 1783696219354169344
author Orozco-Arias, Simon
Candamil-Cortés, Mariana S.
Jaimes, Paula A.
Piña, Johan S.
Tabares-Soto, Reinel
Guyot, Romain
Isaza, Gustavo
author_facet Orozco-Arias, Simon
Candamil-Cortés, Mariana S.
Jaimes, Paula A.
Piña, Johan S.
Tabares-Soto, Reinel
Guyot, Romain
Isaza, Gustavo
author_sort Orozco-Arias, Simon
collection PubMed
description Every day more plant genomes are available in public databases and additional massive sequencing projects (i.e., that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used Machine Learning algorithms based on k-mer counts to classify LTR retrotransposons from other genomic sequences and into lineages/families with an F1-Score of 95%, contributing to develop a free-alignment and automatic method to analyze these sequences.
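The abstract above refers to k-mer counts as features and to an F1-Score as the evaluation metric. As a rough illustration only, and not the authors' pipeline, the sketch below shows one common way to turn a DNA sequence into a fixed-length k-mer count vector and to compute an F1 score for one class. The function names `kmer_counts` and `f1_score`, the four-letter alphabet, and the toy labels are assumptions made for this example; they do not come from the record.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are used as features


def kmer_counts(sequence, k):
    """Return k-mer counts as a list ordered lexicographically over ALPHABET."""
    seq = sequence.upper()
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k possible k-mers so every sequence shares one layout;
    # windows containing other symbols (e.g. N) are never looked up.
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


def f1_score(true_labels, predicted_labels, positive):
    """F1 score of the `positive` class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the article.
    features = kmer_counts("ACGTAC", 2)
    print(len(features), sum(features))
    truth = ["LTR", "LTR", "other", "LTR"]
    predicted = ["LTR", "other", "LTR", "LTR"]
    print(f1_score(truth, predicted, positive="LTR"))
```

Vectors built this way could be passed to any standard classifier; which algorithms and which values of k the article actually used is not stated in this record.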
format Online
Article
Text
id pubmed-8140598
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8140598 2021-05-27 K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes Orozco-Arias, Simon Candamil-Cortés, Mariana S. Jaimes, Paula A. Piña, Johan S. Tabares-Soto, Reinel Guyot, Romain Isaza, Gustavo PeerJ Bioinformatics Every day more plant genomes are available in public databases and additional massive sequencing projects (i.e., that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used Machine Learning algorithms based on k-mer counts to classify LTR retrotransposons from other genomic sequences and into lineages/families with an F1-Score of 95%, contributing to develop a free-alignment and automatic method to analyze these sequences. PeerJ Inc. 2021-05-19 /pmc/articles/PMC8140598/ /pubmed/34055489 http://dx.doi.org/10.7717/peerj.11456 Text en © 2021 Orozco-Arias et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle Bioinformatics
Orozco-Arias, Simon
Candamil-Cortés, Mariana S.
Jaimes, Paula A.
Piña, Johan S.
Tabares-Soto, Reinel
Guyot, Romain
Isaza, Gustavo
K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes
title K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8140598/
https://www.ncbi.nlm.nih.gov/pubmed/34055489
http://dx.doi.org/10.7717/peerj.11456