Hashes are not suitable to verify fixity of the public archived web
Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the f...
Main authors: | Aturban, Mohamed; Klein, Martin; Van de Sompel, Herbert; Alam, Sawood; Nelson, Michael L.; Weigle, Michele C. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256179/ https://www.ncbi.nlm.nih.gov/pubmed/37294783 http://dx.doi.org/10.1371/journal.pone.0286879 |
_version_ | 1785057047268556800 |
---|---|
author | Aturban, Mohamed Klein, Martin Van de Sompel, Herbert Alam, Sawood Nelson, Michael L. Weigle, Michele C. |
author_facet | Aturban, Mohamed Klein, Martin Van de Sompel, Herbert Alam, Sawood Nelson, Michael L. Weigle, Michele C. |
author_sort | Aturban, Mohamed |
collection | PubMed |
description | Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages. |
format | Online Article Text |
id | pubmed-10256179 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-102561792023-06-10 Hashes are not suitable to verify fixity of the public archived web Aturban, Mohamed Klein, Martin Van de Sompel, Herbert Alam, Sawood Nelson, Michael L. Weigle, Michele C. PLoS One Research Article Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages. 
Public Library of Science 2023-06-09 /pmc/articles/PMC10256179/ /pubmed/37294783 http://dx.doi.org/10.1371/journal.pone.0286879 Text en © 2023 Aturban et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Aturban, Mohamed Klein, Martin Van de Sompel, Herbert Alam, Sawood Nelson, Michael L. Weigle, Michele C. Hashes are not suitable to verify fixity of the public archived web |
title | Hashes are not suitable to verify fixity of the public archived web |
title_full | Hashes are not suitable to verify fixity of the public archived web |
title_fullStr | Hashes are not suitable to verify fixity of the public archived web |
title_full_unstemmed | Hashes are not suitable to verify fixity of the public archived web |
title_short | Hashes are not suitable to verify fixity of the public archived web |
title_sort | hashes are not suitable to verify fixity of the public archived web |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256179/ https://www.ncbi.nlm.nih.gov/pubmed/37294783 http://dx.doi.org/10.1371/journal.pone.0286879 |
work_keys_str_mv | AT aturbanmohamed hashesarenotsuitabletoverifyfixityofthepublicarchivedweb AT kleinmartin hashesarenotsuitabletoverifyfixityofthepublicarchivedweb AT vandesompelherbert hashesarenotsuitabletoverifyfixityofthepublicarchivedweb AT alamsawood hashesarenotsuitabletoverifyfixityofthepublicarchivedweb AT nelsonmichaell hashesarenotsuitabletoverifyfixityofthepublicarchivedweb AT weiglemichelec hashesarenotsuitabletoverifyfixityofthepublicarchivedweb |
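
The description field above outlines the fixity check the study tested: hash a memento (base HTML plus all embedded resources) after each download, then compare the hashes across downloads. The following is a minimal sketch of that idea, not the authors' implementation. The record does not say which hash function or serialization was used, so SHA-256, the sorted-URL ordering, the length prefixes, and the names `memento_hash` and `classify` are all assumptions made for illustration.

```python
import hashlib
from typing import Mapping, Sequence


def memento_hash(base_html: bytes, embedded: Mapping[str, bytes]) -> str:
    """Hash a memento's base HTML together with its embedded resources.

    Assumptions (not from the source record): SHA-256 as the hash function,
    resources keyed by URL and fed in sorted-URL order, and an 8-byte length
    prefix on every part so that part boundaries cannot be confused.
    """
    digest = hashlib.sha256()
    digest.update(len(base_html).to_bytes(8, "big"))
    digest.update(base_html)
    for url in sorted(embedded):
        for part in (url.encode("utf-8"), embedded[url]):
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()


def classify(hashes: Sequence[str]) -> str:
    """Label a memento by how its hashes vary across repeated downloads."""
    if not hashes:
        raise ValueError("at least one hash is required")
    unique = len(set(hashes))
    if unique == 1:
        return "fixed"  # every download produced the same hash
    if unique == len(hashes):
        return "always different"  # no two downloads agreed
    return "sometimes different"  # more than one unique hash, with repeats


if __name__ == "__main__":
    # Hypothetical example data: two downloads of the same memento in which
    # one embedded resource changed between replays.
    first = memento_hash(b"<html>page</html>", {"style.css": b"body{}"})
    second = memento_hash(b"<html>page</html>", {"style.css": b"body{color:red}"})
    print(classify([first, first, second]))
```

Under this scheme a single changed byte in any embedded resource changes the memento's hash, which is consistent with the description's conclusion that conventional hashing is a poor fit for replayed archived pages.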