CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking
Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used templa...
Main authors: | Dhakal, Ashwin; Gyawali, Rajan; Wang, Liguo; Cheng, Jianlin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980126/ https://www.ncbi.nlm.nih.gov/pubmed/36865277 http://dx.doi.org/10.1101/2023.02.21.529443 |
_version_ | 1784899853829013504 |
---|---|
author | Dhakal, Ashwin Gyawali, Rajan Wang, Liguo Cheng, Jianlin |
author_sort | Dhakal, Ashwin |
collection | PubMed |
description | Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (~300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp |
format | Online Article Text |
id | pubmed-9980126 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-9980126 2023-03-03 CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking Dhakal, Ashwin Gyawali, Rajan Wang, Liguo Cheng, Jianlin bioRxiv Article Cold Spring Harbor Laboratory 2023-02-22 /pmc/articles/PMC9980126/ /pubmed/36865277 http://dx.doi.org/10.1101/2023.02.21.529443 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
title | CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9980126/ https://www.ncbi.nlm.nih.gov/pubmed/36865277 http://dx.doi.org/10.1101/2023.02.21.529443 |
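
The description field in this record quotes 9,089 micrographs drawn from 32 EMPIAR datasets and rounds that to "~300 cryo-EM images per EMPIAR dataset". As a quick sanity check of that arithmetic, here is a minimal sketch; the constant and function names are illustrative assumptions and are not part of the CryoPPP scripts.

```python
# Figures quoted in the record's description field.
TOTAL_MICROGRAPHS = 9089
NUM_DATASETS = 32


def mean_per_dataset(total: int, groups: int) -> float:
    """Return the average number of items per group."""
    if groups <= 0:
        raise ValueError("groups must be positive")
    return total / groups


# Hypothetical helper names; only the two numbers come from the record.
average = mean_per_dataset(TOTAL_MICROGRAPHS, NUM_DATASETS)
print(f"{average:.1f} micrographs per dataset on average")
```

The mean comes out slightly below 300, which is consistent with the rounded figure in the abstract; the record does not say how many micrographs each individual dataset contributes.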