
CRISPRidentify: identification of CRISPR arrays using machine learning approach

CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript...

Bibliographic Details
Main authors: Mitrofanov, Alexander; Alkhnbashi, Omer S; Shmakov, Sergey A; Makarova, Kira S; Koonin, Eugene V; Backofen, Rolf
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7913763/
https://www.ncbi.nlm.nih.gov/pubmed/33290505
http://dx.doi.org/10.1093/nar/gkaa1158
author Mitrofanov, Alexander
Alkhnbashi, Omer S
Shmakov, Sergey A
Makarova, Kira S
Koonin, Eugene V
Backofen, Rolf
collection PubMed
description CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.
format Online
Article
Text
id pubmed-7913763
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7913763 2021-03-03
CRISPRidentify: identification of CRISPR arrays using machine learning approach
Mitrofanov, Alexander; Alkhnbashi, Omer S; Shmakov, Sergey A; Makarova, Kira S; Koonin, Eugene V; Backofen, Rolf
Nucleic Acids Res
Methods Online
Oxford University Press 2020-12-08
/pmc/articles/PMC7913763/ /pubmed/33290505 http://dx.doi.org/10.1093/nar/gkaa1158
Text en
© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title CRISPRidentify: identification of CRISPR arrays using machine learning approach
topic Methods Online