Snorkel: rapid training data creation with weak supervision
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitr...
Main authors: | Ratner, Alexander; Bach, Stephen H.; Ehrenberg, Henry; Fries, Jason; Wu, Sen; Ré, Christopher |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer Berlin Heidelberg 2019 |
Subjects: | Special Issue Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7075849/ https://www.ncbi.nlm.nih.gov/pubmed/32214778 http://dx.doi.org/10.1007/s00778-019-00552-1 |
_version_ | 1783507100237824000 |
---|---|
author | Ratner, Alexander Bach, Stephen H. Ehrenberg, Henry Fries, Jason Wu, Sen Ré, Christopher |
author_sort | Ratner, Alexander |
collection | PubMed |
description | Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models [Formula: see text] faster and increase predictive performance an average [Formula: see text] versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to [Formula: see text] speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides [Formula: see text] average improvements to predictive performance over prior heuristic approaches and comes within an average [Formula: see text] of the predictive performance of large hand-curated training sets. |
format | Online Article Text |
id | pubmed-7075849 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Springer Berlin Heidelberg |
record_format | MEDLINE/PubMed |
spelling | pubmed-7075849 2020-03-23 Snorkel: rapid training data creation with weak supervision Ratner, Alexander Bach, Stephen H. Ehrenberg, Henry Fries, Jason Wu, Sen Ré, Christopher VLDB J Special Issue Paper Springer Berlin Heidelberg 2019-07-15 2020 /pmc/articles/PMC7075849/ /pubmed/32214778 http://dx.doi.org/10.1007/s00778-019-00552-1 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. |
title | Snorkel: rapid training data creation with weak supervision |
topic | Special Issue Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7075849/ https://www.ncbi.nlm.nih.gov/pubmed/32214778 http://dx.doi.org/10.1007/s00778-019-00552-1 |
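
The description field above says that users write labeling functions expressing arbitrary heuristics, and that the system then denoises their outputs without ground truth. The following is a minimal sketch of that idea in plain Python, not Snorkel's actual API. The function names, label constants, and keyword heuristics are hypothetical, and a simple majority vote stands in for the data programming model the abstract refers to, which would instead estimate each function's accuracy and correlations.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical label constants: a labeling function may abstain.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_causes(text: str) -> int:
    """Heuristic: the word 'causes' hints at a positive example."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_contains_not(text: str) -> int:
    """Heuristic: an explicit 'not' token hints at a negative example."""
    return NEGATIVE if "not" in text.lower().split() else ABSTAIN


def lf_contains_induced(text: str) -> int:
    """Heuristic: 'induced' (as in 'drug-induced') hints at a positive example."""
    return POSITIVE if "induced" in text.lower() else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_causes,
    lf_contains_not,
    lf_contains_induced,
]


def apply_lfs(texts: List[str], lfs: List[Callable[[str], int]]) -> List[List[int]]:
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine one row of votes; abstain when nothing fires or the top labels tie.

    This is a stand-in for a learned label model, not the paper's method.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_labels(texts: List[str], lfs: List[Callable[[str], int]] = LABELING_FUNCTIONS) -> List[int]:
    """Return one combined weak label per text."""
    return [majority_vote(row) for row in apply_lfs(texts, lfs)]
```

Texts that receive a non-abstain label would then serve as noisy training data for a downstream model; conflicting or silent heuristics leave a text unlabeled in this sketch.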