Curation of over 10 000 transcriptomic studies to enable data reuse

Bibliographic Details
Main authors: Lim, Nathaniel; Tesar, Stepan; Belmadani, Manuel; Poirier-Morency, Guillaume; Mancarci, Burak Ogan; Sicherman, Jordan; Jacobson, Matthew; Leong, Justin; Tan, Patrick; Pavlidis, Paul
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7904053/
https://www.ncbi.nlm.nih.gov/pubmed/33599246
http://dx.doi.org/10.1093/database/baab006
author Lim, Nathaniel
Tesar, Stepan
Belmadani, Manuel
Poirier-Morency, Guillaume
Mancarci, Burak Ogan
Sicherman, Jordan
Jacobson, Matthew
Leong, Justin
Tan, Patrick
Pavlidis, Paul
collection PubMed
description Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe–gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system was created to help address these issues. Gemma consists of a database of curated transcriptomic datasets, analytical software, a web interface and web services. Here we present an update on Gemma’s holdings, data processing and analysis pipelines, our curation guidelines, and software features. As of June 2020, Gemma contains 10 811 manually curated datasets (primarily human, mouse and rat), over 395 000 samples and hundreds of curated transcriptomic platforms (both microarray and RNA sequencing). Dataset topics were represented with 10 215 distinct terms from 12 ontologies, for a total of 54 316 topic annotations (mean topics/dataset = 5.2). While Gemma has broad coverage of conditions and tissues, it captures a large majority of available brain-related datasets, accounting for 34% of its holdings. Users can access the curated data and differential expression analyses through the Gemma website, RESTful service and an R package. Database URL: https://gemma.msl.ubc.ca/home.html
format Online
Article
Text
id pubmed-7904053
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7904053 2021-03-01
Database (Oxford)
Original Article
Oxford University Press 2021-02-18
/pmc/articles/PMC7904053/
/pubmed/33599246
http://dx.doi.org/10.1093/database/baab006
Text en
© The Author(s) 2021. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Curation of over 10 000 transcriptomic studies to enable data reuse
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7904053/
https://www.ncbi.nlm.nih.gov/pubmed/33599246
http://dx.doi.org/10.1093/database/baab006