Anomaly Detection using the "Isolation Forest" algorithm

Anomaly detection can provide clues about an outlying minority class in your data: hackers in a set of network events, fraudsters in a set of credit card transactions, or exotic particles in a set of high-energy collisions. In this t...

Bibliographic Details
Main author: Gerster, David
Language: eng
Published: 2015
Subjects:
Online access: http://cds.cern.ch/record/2030962
_version_ 1780947451554824192
author Gerster, David
author_facet Gerster, David
author_sort Gerster, David
collection CERN
description Anomaly detection can provide clues about an outlying minority class in your data: hackers in a set of network events, fraudsters in a set of credit card transactions, or exotic particles in a set of high-energy collisions. In this talk, we analyze a real dataset of breast tissue biopsies, with malignant results forming the minority class.

The "Isolation Forest" algorithm finds anomalies by deliberately “overfitting” models that memorize each data point. Since outliers have more empty space around them, they take fewer steps to memorize. Intuitively, a house in the country can be identified simply as “that house out by the farm”, while a house in the city needs a longer description like “that house in Brooklyn, near Prospect Park, on Union Street, between the firehouse and the library, not far from the French restaurant”.

We first use anomaly detection to find outliers in the biopsy data, then apply traditional predictive modeling to discover rules that separate anomalies from normal data. These rules provide surprisingly strong clues about which biopsies are malignant. Interestingly, anomaly detection continues to provide strong clues even when fitted to data with only benign biopsies.

About the speaker

David Gerster is Vice President of Data Science at BigML, where he promotes the idea that data science is easy by speaking at conferences and teaching. Since joining BigML in July 2013, he has spoken at Big Data Spain, Papis.io, DataLead (UC Berkeley), DataBeat (VentureBeat), and more than a dozen other venues. Recently he taught a two-day class at the Polytechnic University of Valencia that covered supervised and unsupervised learning.

At Groupon, he built an elite data science team that trained the first machine-learned models for mobile deal relevance. At Yahoo, he led the project to collect billions of URL clickstreams in Hadoop and use them to improve web search ranking, resulting in measurable improvements to Yahoo’s main web search algorithm. He holds an MBA from the University of California at Berkeley and a Bachelor’s degree from Harvard University.
id cern-2030962
institution European Organization for Nuclear Research
language eng
publishDate 2015
record_format invenio
spelling cern-2030962 2022-11-02T22:28:00Z http://cds.cern.ch/record/2030962 eng Gerster, David Anomaly Detection using the "Isolation Forest" algorithm CERN Computing Seminar oai:cds.cern.ch:2030962 2015
spellingShingle CERN Computing Seminar
Gerster, David
Anomaly Detection using the "Isolation Forest" algorithm
title Anomaly Detection using the "Isolation Forest" algorithm
title_full Anomaly Detection using the "Isolation Forest" algorithm
title_fullStr Anomaly Detection using the "Isolation Forest" algorithm
title_full_unstemmed Anomaly Detection using the "Isolation Forest" algorithm
title_short Anomaly Detection using the "Isolation Forest" algorithm
title_sort anomaly detection using the "isolation forest" algorithm
topic CERN Computing Seminar
url http://cds.cern.ch/record/2030962
work_keys_str_mv AT gersterdavid anomalydetectionusingtheisolationforestalgorithm
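
Illustrative sketch (not part of the catalog record or the talk): the abstract's claim that outliers "take fewer steps to memorize" can be tried out with a few lines of Python. The code below is a simplified, standard-library-only approximation of the isolation idea, not the speaker's or BigML's implementation. It repeatedly picks a random feature and a random threshold, keeps only the rows on the same side as the point of interest, and counts how many splits it takes until that point is alone. The function names, the toy 2-D data, and the parameters `n_trees`, `sample_size`, and `max_depth` are all assumptions made for illustration; the usual score normalization of the full algorithm is omitted.

```python
import random


def path_length(point, sample, rng, depth=0, max_depth=12):
    """Count the random splits needed until `point` is alone in `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random feature
    lo = min(row[dim] for row in sample)
    hi = max(row[dim] for row in sample)
    if lo == hi:                               # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)                # pick a random threshold
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < split:
        same_side = [row for row in sample if row[dim] < split]
    else:
        same_side = [row for row in sample if row[dim] >= split]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        if point not in sample:                # make sure the point is present
            sample.append(point)
        total += path_length(point, sample, rng)
    return total / n_trees


def make_toy_data(seed=1):
    """Hypothetical 2-D data: a dense cluster plus one far-away point."""
    rng = random.Random(seed)
    cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    inlier = (0.0, 0.0)     # the "house in the city"
    outlier = (8.0, 8.0)    # the "house out by the farm"
    return cluster + [inlier, outlier], inlier, outlier


if __name__ == "__main__":
    data, inlier, outlier = make_toy_data()
    print("inlier :", mean_path_length(inlier, data))
    print("outlier:", mean_path_length(outlier, data))
```

On this toy data the far-away point should need noticeably fewer splits on average than the point in the middle of the cluster, which is the sense in which a short path length signals an anomaly.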