GeMI: interactive interface for transformer-based Genomic Metadata Integration

The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence o...

Bibliographic Details
Main authors: Serna Garcia, Giuseppe, Leone, Michele, Bernasconi, Anna, Carman, Mark J
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216561/
https://www.ncbi.nlm.nih.gov/pubmed/35657113
http://dx.doi.org/10.1093/database/baac036
author Serna Garcia, Giuseppe
Leone, Michele
Bernasconi, Anna
Carman, Mark J
collection PubMed
description The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. Database URL http://gmql.eu/gemi/
format Online
Article
Text
id pubmed-9216561
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9216561 2022-06-23 GeMI: interactive interface for transformer-based Genomic Metadata Integration Serna Garcia, Giuseppe Leone, Michele Bernasconi, Anna Carman, Mark J Database (Oxford) Original Article Oxford University Press 2022-06-03 /pmc/articles/PMC9216561/ /pubmed/35657113 http://dx.doi.org/10.1093/database/baac036 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title GeMI: interactive interface for transformer-based Genomic Metadata Integration
topic Original Article
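
The description in this record says that structured metadata is extracted as key-value pairs and can then be indexed for structured search. The following is only a minimal sketch of that downstream step, not GeMI's actual code: the `key: value | key: value` output format, the function names `parse_pairs` and `build_index`, and the sample identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def parse_pairs(generated, pair_sep=" | ", kv_sep=": "):
    """Turn a 'key: value | key: value' string into a dict.

    The separators are hypothetical; chunks without a key-value
    separator, or with an empty key or value, are skipped.
    """
    pairs = {}
    for chunk in generated.split(pair_sep):
        key, sep, value = chunk.partition(kv_sep)
        if not sep:
            continue  # malformed chunk: no key-value separator found
        key, value = key.strip().lower(), value.strip()
        if key and value:
            pairs[key] = value
    return pairs


def build_index(records):
    """Build an inverted index: (key, lowercased value) -> set of sample ids."""
    index = defaultdict(set)
    for sample_id, pairs in records.items():
        for key, value in pairs.items():
            index[(key, value.lower())].add(sample_id)
    return index


# Made-up model outputs for three made-up samples (not real GEO data).
outputs = {
    "sample_A": "sex: female | tissue type: liver | disease: hepatocellular carcinoma",
    "sample_B": "sex: male | tissue type: liver",
    "sample_C": "tissue type: lung | malformed chunk",
}

records = {sample_id: parse_pairs(text) for sample_id, text in outputs.items()}
index = build_index(records)

# Structured query: all samples whose extracted tissue type is "liver".
liver_samples = sorted(index[("tissue type", "liver")])
print(liver_samples)
```

Lowercasing keys and values before indexing is one simple way to reduce the inconsistency the description attributes to free-text metadata; a real system would likely need stronger normalization, such as mapping values to controlled vocabularies.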