Cargando…
Tractable near-optimal policies for crawling
The problem of maintaining a local cache of [Formula: see text] constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy th...
Autores principales: | , , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
National Academy of Sciences
2018
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6094111/ https://www.ncbi.nlm.nih.gov/pubmed/30038026 http://dx.doi.org/10.1073/pnas.1801519115 |
_version_ | 1783347765860892672 |
---|---|
author | Azar, Yossi Horvitz, Eric Lubetzky, Eyal Peres, Yuval Shahaf, Dafna |
author_facet | Azar, Yossi Horvitz, Eric Lubetzky, Eyal Peres, Yuval Shahaf, Dafna |
author_sort | Azar, Yossi |
collection | PubMed |
description | The problem of maintaining a local cache of [Formula: see text] constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of [Formula: see text] , but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in [Formula: see text] operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum. |
format | Online Article Text |
id | pubmed-6094111 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-60941112018-08-17 Tractable near-optimal policies for crawling Azar, Yossi Horvitz, Eric Lubetzky, Eyal Peres, Yuval Shahaf, Dafna Proc Natl Acad Sci U S A Physical Sciences The problem of maintaining a local cache of [Formula: see text] constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of [Formula: see text] , but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in [Formula: see text] operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum. National Academy of Sciences 2018-08-07 2018-07-23 /pmc/articles/PMC6094111/ /pubmed/30038026 http://dx.doi.org/10.1073/pnas.1801519115 Text en Copyright © 2018 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/) . |
spellingShingle | Physical Sciences Azar, Yossi Horvitz, Eric Lubetzky, Eyal Peres, Yuval Shahaf, Dafna Tractable near-optimal policies for crawling |
title | Tractable near-optimal policies for crawling |
title_full | Tractable near-optimal policies for crawling |
title_fullStr | Tractable near-optimal policies for crawling |
title_full_unstemmed | Tractable near-optimal policies for crawling |
title_short | Tractable near-optimal policies for crawling |
title_sort | tractable near-optimal policies for crawling |
topic | Physical Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6094111/ https://www.ncbi.nlm.nih.gov/pubmed/30038026 http://dx.doi.org/10.1073/pnas.1801519115 |
work_keys_str_mv | AT azaryossi tractablenearoptimalpoliciesforcrawling AT horvitzeric tractablenearoptimalpoliciesforcrawling AT lubetzkyeyal tractablenearoptimalpoliciesforcrawling AT peresyuval tractablenearoptimalpoliciesforcrawling AT shahafdafna tractablenearoptimalpoliciesforcrawling |