
A machine learning based framework to identify and classify long terminal repeat retrotransposons

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs p...


Bibliographic Details
Main authors: Schietgat, Leander; Vens, Celine; Cerri, Ricardo; Fischer, Carlos N.; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M. A.; Blockeel, Hendrik
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5933816/
https://www.ncbi.nlm.nih.gov/pubmed/29684010
http://dx.doi.org/10.1371/journal.pcbi.1006097
_version_ 1783320017711923200
author Schietgat, Leander
Vens, Celine
Cerri, Ricardo
Fischer, Carlos N.
Costa, Eduardo
Ramon, Jan
Carareto, Claudia M. A.
Blockeel, Hendrik
author_facet Schietgat, Leander
Vens, Celine
Cerri, Ricardo
Fischer, Carlos N.
Costa, Eduardo
Ramon, Jan
Carareto, Claudia M. A.
Blockeel, Hendrik
author_sort Schietgat, Leander
collection PubMed
description Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner’s predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
format Online
Article
Text
id pubmed-5933816
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5933816 2018-05-18 A machine learning based framework to identify and classify long terminal repeat retrotransposons Schietgat, Leander Vens, Celine Cerri, Ricardo Fischer, Carlos N. Costa, Eduardo Ramon, Jan Carareto, Claudia M. A. Blockeel, Hendrik PLoS Comput Biol Research Article Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner’s predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
Public Library of Science 2018-04-23 /pmc/articles/PMC5933816/ /pubmed/29684010 http://dx.doi.org/10.1371/journal.pcbi.1006097 Text en © 2018 Schietgat et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Schietgat, Leander
Vens, Celine
Cerri, Ricardo
Fischer, Carlos N.
Costa, Eduardo
Ramon, Jan
Carareto, Claudia M. A.
Blockeel, Hendrik
A machine learning based framework to identify and classify long terminal repeat retrotransposons
title A machine learning based framework to identify and classify long terminal repeat retrotransposons
title_full A machine learning based framework to identify and classify long terminal repeat retrotransposons
title_fullStr A machine learning based framework to identify and classify long terminal repeat retrotransposons
title_full_unstemmed A machine learning based framework to identify and classify long terminal repeat retrotransposons
title_short A machine learning based framework to identify and classify long terminal repeat retrotransposons
title_sort machine learning based framework to identify and classify long terminal repeat retrotransposons
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5933816/
https://www.ncbi.nlm.nih.gov/pubmed/29684010
http://dx.doi.org/10.1371/journal.pcbi.1006097