A machine learning approach for missing persons cases with high genotyping errors

Estimating the relationships between individuals is one of the fundamental challenges in many fields. In particular, relationship estimation could provide valuable information for missing persons cases. The recently developed investigative genetic genealogy approach uses high-density single nucle...

Bibliographic Details
Main authors: Huang, Meng; Liu, Muyi; Li, Hongmin; King, Jonathan; Smuts, Amy; Budowle, Bruce; Ge, Jianye
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9573995/
https://www.ncbi.nlm.nih.gov/pubmed/36263419
http://dx.doi.org/10.3389/fgene.2022.971242
author Huang, Meng
Liu, Muyi
Li, Hongmin
King, Jonathan
Smuts, Amy
Budowle, Bruce
Ge, Jianye
collection PubMed
description Estimating the relationships between individuals is one of the fundamental challenges in many fields. In particular, relationship estimation could provide valuable information for missing persons cases. The recently developed investigative genetic genealogy approach uses high-density single nucleotide polymorphisms (SNPs) to determine close and more distant relationships, in which hundreds of thousands to tens of millions of SNPs are generated either by microarray genotyping or whole-genome sequencing. Current studies usually assume the SNP profiles were generated with minimal errors. However, in missing persons cases, the DNA samples can be highly degraded, and the SNP profiles generated from these samples usually contain many errors. In this study, a machine learning approach was developed for estimating relationships from high-error SNP profiles. In this approach, a hierarchical classification strategy was employed to classify the relationships first by degree and then by relationship type within each degree separately. For each classification, feature selection was implemented to gain better performance. Both simulated and real data sets with various genotyping error rates were utilized in evaluating this approach, and its accuracies were higher than those of individual measures; namely, this approach was more accurate and robust than the individual measures for SNP profiles with genotyping errors. In addition, the highest accuracy could be obtained by providing the same genotyping error rates in the training and test sets, and thus estimating the genotyping errors of the SNP profiles is critical to obtaining high accuracy in relationship estimation.
format Online
Article
Text
id pubmed-9573995
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9573995 2022-10-18 A machine learning approach for missing persons cases with high genotyping errors Huang, Meng; Liu, Muyi; Li, Hongmin; King, Jonathan; Smuts, Amy; Budowle, Bruce; Ge, Jianye Front Genet Genetics Frontiers Media S.A. 2022-10-03 /pmc/articles/PMC9573995/ /pubmed/36263419 http://dx.doi.org/10.3389/fgene.2022.971242 Text en
Copyright © 2022 Huang, Liu, Li, King, Smuts, Budowle and Ge. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A machine learning approach for missing persons cases with high genotyping errors
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9573995/
https://www.ncbi.nlm.nih.gov/pubmed/36263419
http://dx.doi.org/10.3389/fgene.2022.971242
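
The description field outlines a hierarchical classification strategy: classify a pair of individuals by relationship degree first, then by relationship type within that degree. The following is a minimal Python sketch of that two-stage idea only, not the authors' implementation. The nearest-centroid classifier is a stand-in chosen for brevity, the two feature columns and all numeric values are invented toy data, the class names are hypothetical, and the feature selection step mentioned in the description is omitted.

```python
from collections import defaultdict
from statistics import mean


class NearestCentroid:
    """Stand-in classifier (an assumption): pick the label with the closest mean vector."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # One centroid per label: the column-wise mean of that label's rows.
        self.centroids = {
            label: [mean(col) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class HierarchicalRelationshipClassifier:
    """Stage 1 predicts the degree; stage 2 predicts the type within that degree."""

    def fit(self, X, degrees, types):
        self.degree_model = NearestCentroid().fit(X, degrees)
        # Group the training rows by degree so each degree gets its own type model.
        by_degree = defaultdict(lambda: ([], []))
        for row, degree, rel_type in zip(X, degrees, types):
            by_degree[degree][0].append(row)
            by_degree[degree][1].append(rel_type)
        self.type_models = {
            degree: NearestCentroid().fit(rows, labels)
            for degree, (rows, labels) in by_degree.items()
        }
        return self

    def predict_one(self, row):
        degree = self.degree_model.predict_one(row)
        return degree, self.type_models[degree].predict_one(row)


# Invented toy features for pairs of individuals (not real genetic measurements).
X = [
    [0.50, 0.00], [0.51, 0.01],  # degree 1, parent-child
    [0.50, 0.25], [0.49, 0.24],  # degree 1, full-sibling
    [0.25, 0.00], [0.26, 0.02],  # degree 2, half-sibling
    [0.24, 0.10], [0.25, 0.12],  # degree 2, avuncular
]
degrees = [1, 1, 1, 1, 2, 2, 2, 2]
types = [
    "parent-child", "parent-child", "full-sibling", "full-sibling",
    "half-sibling", "half-sibling", "avuncular", "avuncular",
]

model = HierarchicalRelationshipClassifier().fit(X, degrees, types)
print(model.predict_one([0.48, 0.23]))
print(model.predict_one([0.25, 0.11]))
```

Training one type model per degree keeps each second-stage decision small, which is the point of the hierarchy. Following the description's closing remark, training rows would ideally be generated at a genotyping error rate similar to that of the profiles being tested.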