CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking

Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used templa...

Bibliographic Details
Main Authors: Dhakal, Ashwin; Gyawali, Rajan; Wang, Liguo; Cheng, Jianlin
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980126/
https://www.ncbi.nlm.nih.gov/pubmed/36865277
http://dx.doi.org/10.1101/2023.02.21.529443
author Dhakal, Ashwin
Gyawali, Rajan
Wang, Liguo
Cheng, Jianlin
author_sort Dhakal, Ashwin
collection PubMed
description Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (~300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp
format Online
Article
Text
id pubmed-9980126
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9980126 2023-03-03 CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking Dhakal, Ashwin Gyawali, Rajan Wang, Liguo Cheng, Jianlin bioRxiv Article Cold Spring Harbor Laboratory 2023-02-22 /pmc/articles/PMC9980126/ /pubmed/36865277 http://dx.doi.org/10.1101/2023.02.21.529443 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980126/
https://www.ncbi.nlm.nih.gov/pubmed/36865277
http://dx.doi.org/10.1101/2023.02.21.529443