
A community effort to identify and correct mislabeled samples in proteogenomic studies


Bibliographic Details
Main authors: Yoo, Seungyeul; Shi, Zhiao; Wen, Bo; Kho, SoonJye; Pan, Renke; Feng, Hanying; Chen, Hong; Carlsson, Anders; Edén, Patrik; Ma, Weiping; Raymer, Michael; Maier, Ezekiel J.; Tezak, Zivana; Johanson, Elaine; Hinton, Denise; Rodriguez, Henry; Zhu, Jun; Boja, Emily; Wang, Pei; Zhang, Bing
Format: Online Article Text
Language: English
Published: Elsevier 2021
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8134945/
https://www.ncbi.nlm.nih.gov/pubmed/34036290
http://dx.doi.org/10.1016/j.patter.2021.100245
Description
Summary: Sample mislabeling or misannotation has been a long-standing problem in scientific research, particularly prevalent in large-scale, multi-omic studies due to the complexity of multi-omic workflows. There is an urgent need for quality controls that automatically screen for and correct sample mislabels or misannotations in multi-omic studies. Here, we describe a crowdsourced precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge, which provides a framework for systematic benchmarking and evaluation of mislabel identification and correction methods for integrative proteogenomic studies. The challenge received a large number of submissions from domestic and international data scientists, with highly variable performance observed across the submitted methods. Post-challenge collaboration between the top-performing teams and the challenge organizers has produced an open-source software tool, COSMO, with demonstrated high accuracy and robustness in mislabeling identification and correction in simulated and real multi-omic datasets.