
NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research

BACKGROUND: A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. While sufficient for single-site and limited-scale studies, the use of common unique study identifiers has several drawbacks for lar...

Bibliographic Details
Main authors:	Zhang, Guo-Qiang; Tao, Shiqiang; Xing, Guangming; Mozes, Jeno; Zonjy, Bilal; Lhatoo, Samden D; Cui, Licong
Format:	Online Article Text
Language:	English
Published:	Gunther Eysenbach 2015
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704892/ https://www.ncbi.nlm.nih.gov/pubmed/26554419 http://dx.doi.org/10.2196/medinform.4959

_version_	1782408928164839424
author	Zhang, Guo-Qiang Tao, Shiqiang Xing, Guangming Mozes, Jeno Zonjy, Bilal Lhatoo, Samden D Cui, Licong
author_facet	Zhang, Guo-Qiang Tao, Shiqiang Xing, Guangming Mozes, Jeno Zonjy, Bilal Lhatoo, Samden D Cui, Licong
author_sort	Zhang, Guo-Qiang
collection	PubMed
description	BACKGROUND: A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. While sufficient for single-site and limited-scale studies, the use of common unique study identifiers has several drawbacks for large multicenter studies, where thousands of research participants may be recruited from multiple sites. An important property of study identifiers is error tolerance (validatability): inadvertent editing mistakes during their transmission and use will most likely result in invalid study identifiers. OBJECTIVE: This paper introduces a novel method called Randomized N-gram Hashing (NHash) for generating unique study identifiers in a distributed and validatable fashion in multicenter research. NHash has a unique set of properties: (1) it is a pseudonym serving the purpose of linking research data about a study participant for research purposes; (2) it can be generated automatically in a completely distributed fashion with virtually no risk of identifier collision; (3) it incorporates a set of cryptographic hash functions based on N-grams, combined with additional encryption techniques such as a shift cipher; (4) it is validatable (error tolerant) in the sense that inadvertent edit errors will mostly result in invalid identifiers. METHODS: NHash consists of 2 phases. In the first phase, an intermediate string is generated using randomized N-gram hashing. This string consists of a collection of N-gram hashes f_1, f_2, ..., f_k. The input for each function f_i has 3 components: a random number r, an integer n, and input data m. The result, f_i(r, n, m), is an n-gram of m with a starting position s, which is computed as (r mod |m|), where |m| represents the length of m. The output of Phase 1 is the concatenation of the sequence f_1(r_1, n_1, m_1), f_2(r_2, n_2, m_2), ..., f_k(r_k, n_k, m_k). In the second phase, the intermediate string generated in Phase 1 is encrypted using techniques such as a shift cipher. The result of the encryption, concatenated with the random number r, is the final NHash study identifier. RESULTS: We performed experiments using a large synthesized dataset comparing NHash with random strings, and demonstrated a negligible probability of collision. We implemented NHash for the Center for SUDEP Research (CSR), a National Institute of Neurological Disorders and Stroke-funded Center Without Walls for Collaborative Research in the Epilepsies. This multicenter collaboration involves 14 institutions across the United States and Europe, bringing together extensive and diverse expertise to understand sudden unexpected death in epilepsy patients (SUDEP). CONCLUSIONS: The CSR Data Repository has successfully used NHash to link deidentified multimodal clinical data collected in participating CSR institutions, meeting all desired objectives of NHash.
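The two phases described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions the abstract does not specify: a single random number r is shared by all N-gram functions, an n-gram wraps around the end of m when s + n exceeds |m|, the shift cipher works over an alphabet of uppercase letters and digits with a shift of 3, and r is appended as a zero-padded 4-digit decimal. The names `ngram_hash`, `shift_cipher`, `nhash`, `new_identifier`, and `is_valid`, as well as the sample field values, are hypothetical.

```python
import secrets
import string

# Assumed identifier alphabet: A-Z followed by 0-9 (36 symbols).
ALPHABET = string.ascii_uppercase + string.digits


def ngram_hash(r: int, n: int, m: str) -> str:
    """Return the n-gram of m starting at position s = r mod |m|.

    Wrapping around the end of m is an assumption of this sketch.
    """
    if not m:
        raise ValueError("input data m must be non-empty")
    s = r % len(m)
    return "".join(m[(s + i) % len(m)] for i in range(n))


def shift_cipher(text: str, shift: int) -> str:
    """Shift each character within ALPHABET; other characters pass through."""
    out = []
    for ch in text:
        idx = ALPHABET.find(ch)
        out.append(ch if idx < 0 else ALPHABET[(idx + shift) % len(ALPHABET)])
    return "".join(out)


def nhash(fields, r: int, shift: int = 3, r_width: int = 4) -> str:
    """Build an identifier from a list of (n, m) pairs and a random number r."""
    # Phase 1: concatenate the n-grams selected by r.
    intermediate = "".join(ngram_hash(r, n, m) for n, m in fields)
    # Phase 2: encrypt the intermediate string, then append r.
    return shift_cipher(intermediate, shift) + f"{r:0{r_width}d}"


def new_identifier(fields, shift: int = 3, r_width: int = 4) -> str:
    """Draw r at random and generate an identifier (no coordination needed)."""
    r = secrets.randbelow(10 ** r_width)
    return nhash(fields, r, shift, r_width)


def is_valid(identifier: str, fields, shift: int = 3, r_width: int = 4) -> bool:
    """Re-derive the identifier from its trailing r and compare (assumed check)."""
    tail = identifier[-r_width:]
    if len(identifier) <= r_width or not tail.isdigit():
        return False
    return nhash(fields, int(tail), shift, r_width) == identifier


# Hypothetical, non-identifying sample inputs.
fields = [(2, "SITEALPHA"), (3, "20151110")]
study_id = nhash(fields, r=7)
print(study_id)                        # KD3530007
print(is_valid(study_id, fields))      # True
print(is_valid("KD3530008", fields))   # False: one edited digit breaks the match
```

Because r determines which n-grams are selected, editing either the encrypted part or the appended r usually yields a string that no longer re-derives from the same inputs, which is the sense of "validatable" illustrated here.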
format	Online Article Text
id	pubmed-4704892
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Gunther Eysenbach
record_format	MEDLINE/PubMed
spelling	pubmed-4704892 2016-01-12 NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research Zhang, Guo-Qiang Tao, Shiqiang Xing, Guangming Mozes, Jeno Zonjy, Bilal Lhatoo, Samden D Cui, Licong JMIR Med Inform Original Paper BACKGROUND: A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. While sufficient for single-site and limited-scale studies, the use of common unique study identifiers has several drawbacks for large multicenter studies, where thousands of research participants may be recruited from multiple sites. An important property of study identifiers is error tolerance (validatability): inadvertent editing mistakes during their transmission and use will most likely result in invalid study identifiers. OBJECTIVE: This paper introduces a novel method called Randomized N-gram Hashing (NHash) for generating unique study identifiers in a distributed and validatable fashion in multicenter research. NHash has a unique set of properties: (1) it is a pseudonym serving the purpose of linking research data about a study participant for research purposes; (2) it can be generated automatically in a completely distributed fashion with virtually no risk of identifier collision; (3) it incorporates a set of cryptographic hash functions based on N-grams, combined with additional encryption techniques such as a shift cipher; (4) it is validatable (error tolerant) in the sense that inadvertent edit errors will mostly result in invalid identifiers. METHODS: NHash consists of 2 phases. In the first phase, an intermediate string is generated using randomized N-gram hashing. This string consists of a collection of N-gram hashes f_1, f_2, ..., f_k. The input for each function f_i has 3 components: a random number r, an integer n, and input data m. The result, f_i(r, n, m), is an n-gram of m with a starting position s, which is computed as (r mod |m|), where |m| represents the length of m. The output of Phase 1 is the concatenation of the sequence f_1(r_1, n_1, m_1), f_2(r_2, n_2, m_2), ..., f_k(r_k, n_k, m_k). In the second phase, the intermediate string generated in Phase 1 is encrypted using techniques such as a shift cipher. The result of the encryption, concatenated with the random number r, is the final NHash study identifier. RESULTS: We performed experiments using a large synthesized dataset comparing NHash with random strings, and demonstrated a negligible probability of collision. We implemented NHash for the Center for SUDEP Research (CSR), a National Institute of Neurological Disorders and Stroke-funded Center Without Walls for Collaborative Research in the Epilepsies. This multicenter collaboration involves 14 institutions across the United States and Europe, bringing together extensive and diverse expertise to understand sudden unexpected death in epilepsy patients (SUDEP). CONCLUSIONS: The CSR Data Repository has successfully used NHash to link deidentified multimodal clinical data collected in participating CSR institutions, meeting all desired objectives of NHash. Gunther Eysenbach 2015-11-10 /pmc/articles/PMC4704892/ /pubmed/26554419 http://dx.doi.org/10.2196/medinform.4959 Text en © Guo-Qiang Zhang, Shiqiang Tao, Guangming Xing, Jeno Mozes, Bilal Zonjy, Samden D Lhatoo, Licong Cui. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 10.11.2015. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Original Paper Zhang, Guo-Qiang Tao, Shiqiang Xing, Guangming Mozes, Jeno Zonjy, Bilal Lhatoo, Samden D Cui, Licong NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title	NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title_full	NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title_fullStr	NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title_full_unstemmed	NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title_short	NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research
title_sort	nhash: randomized n-gram hashing for distributed generation of validatable unique study identifiers in multicenter research
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704892/ https://www.ncbi.nlm.nih.gov/pubmed/26554419 http://dx.doi.org/10.2196/medinform.4959
work_keys_str_mv	AT zhangguoqiang nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT taoshiqiang nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT xingguangming nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT mozesjeno nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT zonjybilal nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT lhatoosamdend nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch AT cuilicong nhashrandomizedngramhashingfordistributedgenerationofvalidatableuniquestudyidentifiersinmulticenterresearch
