Hashes are not suitable to verify fixity of the public archived web

Bibliographic Details
Main Authors: Aturban, Mohamed, Klein, Martin, Van de Sompel, Herbert, Alam, Sawood, Nelson, Michael L., Weigle, Michele C.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256179/
https://www.ncbi.nlm.nih.gov/pubmed/37294783
http://dx.doi.org/10.1371/journal.pone.0286879
_version_ 1785057047268556800
author Aturban, Mohamed
Klein, Martin
Van de Sompel, Herbert
Alam, Sawood
Nelson, Michael L.
Weigle, Michele C.
author_facet Aturban, Mohamed
Klein, Martin
Van de Sompel, Herbert
Alam, Sawood
Nelson, Michael L.
Weigle, Michele C.
author_sort Aturban, Mohamed
collection PubMed
description Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
format Online
Article
Text
id pubmed-10256179
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-102561792023-06-10 Hashes are not suitable to verify fixity of the public archived web Aturban, Mohamed Klein, Martin Van de Sompel, Herbert Alam, Sawood Nelson, Michael L. Weigle, Michele C. PLoS One Research Article Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages. 
Public Library of Science 2023-06-09 /pmc/articles/PMC10256179/ /pubmed/37294783 http://dx.doi.org/10.1371/journal.pone.0286879 Text en © 2023 Aturban et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Aturban, Mohamed
Klein, Martin
Van de Sompel, Herbert
Alam, Sawood
Nelson, Michael L.
Weigle, Michele C.
Hashes are not suitable to verify fixity of the public archived web
title Hashes are not suitable to verify fixity of the public archived web
title_full Hashes are not suitable to verify fixity of the public archived web
title_fullStr Hashes are not suitable to verify fixity of the public archived web
title_full_unstemmed Hashes are not suitable to verify fixity of the public archived web
title_short Hashes are not suitable to verify fixity of the public archived web
title_sort hashes are not suitable to verify fixity of the public archived web
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256179/
https://www.ncbi.nlm.nih.gov/pubmed/37294783
http://dx.doi.org/10.1371/journal.pone.0286879
work_keys_str_mv AT aturbanmohamed hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
AT kleinmartin hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
AT vandesompelherbert hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
AT alamsawood hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
AT nelsonmichaell hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
AT weiglemichelec hashesarenotsuitabletoverifyfixityofthepublicarchivedweb
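
The description field above outlines a fixity check: hash a memento (base HTML plus embedded resources) after each download and compare the hashes. The following is a minimal sketch of that idea, not the authors' implementation. The record does not say which hash function was used or how resources were combined, so SHA-256, the URL-sorted ordering, and the names memento_hash and fixity_verified are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def memento_hash(base_html: bytes, resources: Dict[str, bytes]) -> str:
    """Return one hex digest covering the base HTML and all embedded resources.

    Assumption: resources are keyed by URL and folded in sorted-URL order so
    that the result does not depend on download order.
    """
    h = hashlib.sha256()
    h.update(hashlib.sha256(base_html).digest())
    for url in sorted(resources):
        h.update(url.encode("utf-8"))
        h.update(hashlib.sha256(resources[url]).digest())
    return h.hexdigest()


def fixity_verified(hashes: List[str]) -> bool:
    """Fixity holds only if every download produced the same hash."""
    return len(set(hashes)) == 1


if __name__ == "__main__":
    html = b"<html><img src='a.png'></html>"
    first = memento_hash(html, {"http://example.org/a.png": b"\x89PNG-v1"})
    second = memento_hash(html, {"http://example.org/a.png": b"\x89PNG-v1"})
    # A replay that serves different bytes for an embedded image.
    third = memento_hash(html, {"http://example.org/a.png": b"\x89PNG-v2"})
    print(fixity_verified([first, second]))         # identical downloads
    print(fixity_verified([first, second, third]))  # one resource changed
```

In this sketch a single changed byte in any embedded resource changes the overall hash, which is the sensitivity the description reports as a problem for replayed mementos.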