CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning

Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training...

Bibliographic Details
Main authors:	Conrad, Ryan; Narayan, Kedar
Format:	Online Article Text
Language:	English
Published:	eLife Sciences Publications, Ltd 2021
Subjects:	Cell Biology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8032397/ https://www.ncbi.nlm.nih.gov/pubmed/33830015 http://dx.doi.org/10.7554/eLife.65894

author	Conrad, Ryan; Narayan, Kedar
collection	PubMed
description	Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10^6 unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
format	Online Article Text
id	pubmed-8032397
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	eLife Sciences Publications, Ltd
record_format	MEDLINE/PubMed
spelling	pubmed-8032397 2021-04-12 eLife eLife Sciences Publications, Ltd 2021-04-08 /pmc/articles/PMC8032397/ /pubmed/33830015 http://dx.doi.org/10.7554/eLife.65894 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication (https://creativecommons.org/publicdomain/zero/1.0/).
title	CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning
topic	Cell Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8032397/ https://www.ncbi.nlm.nih.gov/pubmed/33830015 http://dx.doi.org/10.7554/eLife.65894