GeMI: interactive interface for transformer-based Genomic Metadata Integration
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence o...
Main Authors: | Serna Garcia, Giuseppe; Leone, Michele; Bernasconi, Anna; Carman, Mark J |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216561/ https://www.ncbi.nlm.nih.gov/pubmed/35657113 http://dx.doi.org/10.1093/database/baac036 |
_version_ | 1784731451987591168 |
---|---|
author | Serna Garcia, Giuseppe; Leone, Michele; Bernasconi, Anna; Carman, Mark J |
author_facet | Serna Garcia, Giuseppe; Leone, Michele; Bernasconi, Anna; Carman, Mark J |
author_sort | Serna Garcia, Giuseppe |
collection | PubMed |
description | The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. Database URL http://gmql.eu/gemi/ |
format | Online Article Text |
id | pubmed-9216561 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-92165612022-06-23 GeMI: interactive interface for transformer-based Genomic Metadata Integration Serna Garcia, Giuseppe Leone, Michele Bernasconi, Anna Carman, Mark J Database (Oxford) Original Article The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. 
Database URL http://gmql.eu/gemi/ Oxford University Press 2022-06-03 /pmc/articles/PMC9216561/ /pubmed/35657113 http://dx.doi.org/10.1093/database/baac036 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Serna Garcia, Giuseppe Leone, Michele Bernasconi, Anna Carman, Mark J GeMI: interactive interface for transformer-based Genomic Metadata Integration |
title | GeMI: interactive interface for transformer-based Genomic Metadata Integration |
title_sort | gemi: interactive interface for transformer-based genomic metadata integration |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216561/ https://www.ncbi.nlm.nih.gov/pubmed/35657113 http://dx.doi.org/10.1093/database/baac036 |
work_keys_str_mv | AT sernagarciagiuseppe gemiinteractiveinterfacefortransformerbasedgenomicmetadataintegration AT leonemichele gemiinteractiveinterfacefortransformerbasedgenomicmetadataintegration AT bernasconianna gemiinteractiveinterfacefortransformerbasedgenomicmetadataintegration AT carmanmarkj gemiinteractiveinterfacefortransformerbasedgenomicmetadataintegration |
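
The description field in this record says that GeMI extracts structured metadata as key-value pairs from free-text GEO descriptions and improves through user feedback in an active learning loop. The sketch below illustrates those two ideas in plain Python and is not GeMI's code. It assumes a hypothetical generated string of the form `key: value - key: value` and a hypothetical per-sample confidence score. This record does not state the model's real output format or its selection criterion, and the function names are invented for illustration.

```python
from typing import Dict, List, Tuple


def parse_pairs(generated: str, sep: str = " - ") -> Dict[str, str]:
    """Parse 'key: value' pairs from a generated string.

    The 'key: value - key: value' layout is an assumption made for this
    sketch, not the documented GeMI output format.
    """
    pairs: Dict[str, str] = {}
    for chunk in generated.split(sep):
        key, colon, value = chunk.partition(":")
        # Keep only well-formed pairs with a non-empty key and value.
        if colon and key.strip() and value.strip():
            pairs[key.strip().lower()] = value.strip()
    return pairs


def pick_for_review(scored: List[Tuple[str, float]], budget: int) -> List[str]:
    """Return the ids of the `budget` least confident samples.

    This is plain uncertainty sampling, one common active learning
    strategy; the confidence scores are hypothetical inputs.
    """
    ranked = sorted(scored, key=lambda item: item[1])
    return [sample_id for sample_id, _ in ranked[:budget]]


if __name__ == "__main__":
    print(parse_pairs("Sex: female - Tissue: liver - Disease: "))
    print(pick_for_review([("GSM1", 0.9), ("GSM2", 0.4), ("GSM3", 0.6)], 2))
```

Here the empty `Disease` field is dropped instead of being stored as a blank value, and the two lowest-scoring samples are the ones that would be shown to a user for correction.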