Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning

Bibliographic Details
Main authors: Orozco-Arias, Simon; Candamil-Cortes, Mariana S.; Jaimes, Paula A.; Valencia-Castrillon, Estiven; Tabares-Soto, Reinel; Isaza, Gustavo; Guyot, Romain
Format: Online Article Text
Language: English
Published: De Gruyter 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521825/
https://www.ncbi.nlm.nih.gov/pubmed/35822734
http://dx.doi.org/10.1515/jib-2021-0036
Description
Summary: Transposable elements (TEs) are mobile sequences that can move and insert themselves into chromosomes. They activate under internal or external stimuli, giving the organism the ability to adapt to its environment. Annotating transposable elements in genomic data is currently considered crucial for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created and then curated to eliminate TE fragments and false positives; the curated library is then used to annotate the genome with a homology-based method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics programs. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
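
For readers unfamiliar with the two figures the summary reports, the following Python sketch shows how an F1-score is computed from classification counts and how the quoted runtimes translate into a speedup factor. It is a minimal illustration, not the authors' code: the function names and the confusion-matrix counts are hypothetical, and only the 22.61 s and "approximately 6 h" timings come from the summary.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


# Hypothetical counts for a curated library (NOT taken from the article):
# 90 intact elements kept, 5 false positives kept, 10 intact elements discarded.
score = f1_score(tp=90, fp=5, fn=10)

# Timings quoted in the summary: approximately 6 h versus 22.61 s.
factor = speedup(baseline_seconds=6 * 3600, new_seconds=22.61)

print(f"F1-score: {score:.4f}")
print(f"Speedup:  {factor:.0f}x")
```

Because the 6 h baseline is itself approximate, the resulting speedup should be read as an order of magnitude, not an exact figure.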