Anomaly Detection using the "Isolation Forest" algorithm
Anomaly detection can provide clues about an outlying minority class in your data: hackers in a set of network events, fraudsters in a set of credit card transactions, or exotic particles in a set of high-energy collisions. In this t...
Main author: | Gerster, David |
---|---|
Language: | eng |
Published: | 2015 |
Subjects: | CERN Computing Seminar |
Online access: | http://cds.cern.ch/record/2030962 |
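
The "Isolation Forest" algorithm named in the title rests on one idea: a point with empty space around it can be separated from the rest of the data in fewer random splits than a point inside a dense cluster. The following is a minimal standard-library Python sketch of that idea only. It is not the speaker's implementation and not the full algorithm (it omits the score normalization, for example). The synthetic data, the names `isolation_depth` and `mean_depth`, and all parameter values are illustrative assumptions.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random axis-aligned splits needed to isolate `point` in `data`.

    `point` must be one of the rows of `data`.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth                         # simplification: stop if this feature cannot split
    split = rng.uniform(lo, hi)              # pick a random threshold on that feature
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < split:
        same_side = [row for row in data if row[dim] < split]
    else:
        same_side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    others = [row for row in data if row is not point]
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(others, sample_size - 1) + [point]
        total += isolation_depth(point, sample, rng)
    return total / n_trees


# Hypothetical data: a dense "city" cluster plus one far-away "farm" house.
data_rng = random.Random(42)
city = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
farm = (8.0, 8.0)
data = city + [farm]

# Use the city house closest to the cluster centre as the "normal" example.
city_house = min(city, key=lambda p: p[0] ** 2 + p[1] ** 2)

farm_depth = mean_depth(farm, data)
city_depth = mean_depth(city_house, data)
print(f"average splits to isolate the farm house: {farm_depth:.2f}")
print(f"average splits to isolate the city house: {city_depth:.2f}")
```

On this synthetic data the isolated point should need noticeably fewer splits on average than the central one, which is the signal an isolation-based detector turns into an anomaly score.
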
_version_ | 1780947451554824192 |
---|---|
author | Gerster, David |
author_facet | Gerster, David |
author_sort | Gerster, David |
collection | CERN |
description | <!--HTML--><p style="text-align: justify;">
Anomaly detection can provide clues about an outlying minority class in your data: hackers in a set of network events, fraudsters in a set
of credit card transactions, or exotic particles in a set of high-energy collisions. In this talk, we analyze a real dataset of breast
tissue biopsies, with malignant results forming the minority class.
<p style="text-align: justify;">
The "Isolation Forest" algorithm finds anomalies by deliberately “overfitting” models that memorize each data point. Since outliers have
more empty space around them, they take fewer steps to memorize. Intuitively, a house in the country can be identified simply as “that
house out by the farm”, while a house in the city needs a longer description like “that house in Brooklyn, near Prospect Park, on Union
Street, between the firehouse and the library, not far from the French restaurant”.
<p style="text-align: justify;">
We first use anomaly detection to find outliers in the biopsy data, then apply traditional predictive modeling to discover rules that
separate anomalies from normal data. These rules provide surprisingly strong clues about which biopsies are malignant. Interestingly,
anomaly detection continues to provide strong clues even when fitted to data with only benign biopsies.
</p>
<h4>About the speaker</h4>
<p style="text-align: justify;">
David Gerster is Vice President of Data Science at BigML, where he promotes the idea that data science is easy by speaking at conferences and teaching. Since joining BigML in July 2013, he has spoken at Big Data Spain, Papis.io, DataLead (UC Berkeley), DataBeat
(VentureBeat), and more than a dozen other venues. Recently he taught a two-day class at the Polytechnic University of Valencia that
covered supervised and unsupervised learning.
<p style="text-align: justify;">
At Groupon, he built an elite data science team that trained the first machine-learned models for mobile deal relevance. At Yahoo, he led
the project to collect billions of URL clickstreams in Hadoop and use them to improve web search ranking, resulting in measurable
improvements to Yahoo’s main web search algorithm. He holds an MBA from the University of California at Berkeley and a Bachelor’s degree
from Harvard University.
|
id | cern-2030962 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2015 |
record_format | invenio |
spelling | cern-2030962 2022-11-02T22:28:00Z http://cds.cern.ch/record/2030962 eng Gerster, David Anomaly Detection using the "Isolation Forest" algorithm CERN Computing Seminar oai:cds.cern.ch:2030962 2015 |
spellingShingle | CERN Computing Seminar Gerster, David Anomaly Detection using the "Isolation Forest" algorithm |
title | Anomaly Detection using the "Isolation Forest" algorithm |
title_full | Anomaly Detection using the "Isolation Forest" algorithm |
title_fullStr | Anomaly Detection using the "Isolation Forest" algorithm |
title_full_unstemmed | Anomaly Detection using the "Isolation Forest" algorithm |
title_short | Anomaly Detection using the "Isolation Forest" algorithm |
title_sort | anomaly detection using the "isolation forest" algorithm |
topic | CERN Computing Seminar |
url | http://cds.cern.ch/record/2030962 |
work_keys_str_mv | AT gersterdavid anomalydetectionusingtheisolationforestalgorithm |
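
The workflow in the description (detect anomalies, learn rules that separate them from normal rows, then repeat with a detector fitted to benign rows only) can be sketched with scikit-learn. This is an illustration under assumptions: the record does not name the software or the exact dataset used in the talk, so scikit-learn's `IsolationForest`, its bundled Wisconsin breast cancer dataset, the `contamination=0.1` setting, and the tree depth are all stand-ins chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled dataset stands in for the talk's biopsy data.
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
malignant = y == 0                      # in this dataset, 0 = malignant, 1 = benign

# Step 1: unsupervised anomaly detection; the labels are not used for fitting.
# contamination=0.1 (flag roughly 10% of rows) is an arbitrary illustrative choice.
forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
forest.fit(X)
is_anomaly = forest.predict(X) == -1    # predict() returns -1 for outliers, 1 for inliers

# Step 2: a shallow decision tree learns readable rules that separate
# the flagged anomalies from the remaining rows.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, is_anomaly)
rules = export_text(tree, feature_names=list(dataset.feature_names))
print(rules)

# Compare how often flagged rows are malignant with the overall malignant share.
print("malignant share among anomalies:", malignant[is_anomaly].mean())
print("malignant share overall:        ", malignant.mean())

# Step 3: fit a second detector on benign rows only, then score every row.
# score_samples() returns lower values for more anomalous rows.
benign_forest = IsolationForest(n_estimators=100, random_state=0)
benign_forest.fit(X[~malignant])
scores = benign_forest.score_samples(X)
print("mean score, malignant rows:", scores[malignant].mean())
print("mean score, benign rows:   ", scores[~malignant].mean())
```

The printed shares and mean scores give a rough check of whether the anomaly flags carry information about malignancy on this stand-in dataset. They say nothing about the results reported in the talk itself.
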