
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very c...


Bibliographic Details
Main authors: Weber, I, Marian, L, Henzinger, M, Baykan, E
Language: eng
Published: 2011
Subjects:
XX
Online access: https://dx.doi.org/10.1145/1993053.1993057
http://cds.cern.ch/record/1399741
_version_ 1780923616700923904
author Weber, I
Marian, L
Henzinger, M
Baykan, E
collection CERN
description Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
id cern-1399741
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2011
record_format invenio
title A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
topic XX
url https://dx.doi.org/10.1145/1993053.1993057
http://cds.cern.ch/record/1399741