A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very c...
Main authors: | Weber, I; Marian, L; Henzinger, M; Baykan, E |
---|---|
Language: | eng |
Published: | 2011 |
Subjects: | |
Online access: | https://dx.doi.org/10.1145/1993053.1993057 http://cds.cern.ch/record/1399741 |
_version_ | 1780923616700923904 |
---|---|
author | Weber, I; Marian, L; Henzinger, M; Baykan, E |
author_facet | Weber, I; Marian, L; Henzinger, M; Baykan, E |
author_sort | Weber, I |
collection | CERN |
description | Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40. |
id | cern-1399741 |
institution | European Organization for Nuclear Research (CERN) |
language | eng |
publishDate | 2011 |
record_format | invenio |
spelling | cern-1399741; 2019-09-30T06:29:59Z; doi:10.1145/1993053.1993057; http://cds.cern.ch/record/1399741; eng; Weber, I; Marian, L; Henzinger, M; Baykan, E; A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification; XX; Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs makes this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further over 90 while maintaining a typical level of recall between 30 and 40.; oai:cds.cern.ch:1399741; 2011 |
spellingShingle | XX; Weber, I; Marian, L; Henzinger, M; Baykan, E; A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification |
title | A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification |
title_sort | comprehensive study of features and algorithms for url-based topic classification |
topic | XX |
url | https://dx.doi.org/10.1145/1993053.1993057 http://cds.cern.ch/record/1399741 |
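
The description above mentions feature generation from URLs and reports precision, recall, and F-measure values. The following is a minimal illustrative sketch of what URL-derived features might look like, not the authors' actual method: the stop list `STOP`, the n-gram length of 4, the feature-name prefixes, and the example URL are all hypothetical choices. The `f_measure` helper assumes the balanced (harmonic mean) F-measure, which the record does not specify.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical stop list of tokens that carry little topical signal.
STOP = {"www", "index", "html", "htm", "php"}


def url_tokens(url):
    """Lowercase the URL, drop the scheme, and split on non-letter characters."""
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path + "?" + parts.query
    return [t for t in re.split(r"[^a-z]+", text)
            if len(t) >= 2 and t not in STOP]


def char_ngrams(token, n=4):
    """Return all contiguous character n-grams of a token (empty if too short)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url, n=4):
    """Count token features and character n-gram features for one URL."""
    feats = Counter()
    for tok in url_tokens(url):
        feats["tok=" + tok] += 1
        for gram in char_ngrams(tok, n):
            feats["ng=" + gram] += 1
    return feats


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    example = "http://www.cs.example.edu/courses/algorithms/index.html"
    print(url_tokens(example))
    print(sorted(url_features(example).items())[:5])
    # Under the balanced-F assumption, precision 90 with recall 30 to 40
    # corresponds to an F-measure well below the typical 80 to 85 range.
    print(f_measure(90, 30), f_measure(90, 40))
```

A feature dictionary like this could be fed to any binary classifier per topic. Under the balanced-F assumption, the high-precision setting quoted in the description (precision over 90, recall 30 to 40) trades away a large share of the F-measure in exchange for fewer false positives.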