
Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

MOTIVATION: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicat...


Bibliographic Details
Main Authors:	Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4973881/ https://www.ncbi.nlm.nih.gov/pubmed/27489953 http://dx.doi.org/10.1371/journal.pone.0159644

author	Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
author_facet	Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
author_sort	Chen, Qingyu
collection	PubMed
description	MOTIVATION: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. RESULTS: We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.
format	Online Article Text
id	pubmed-4973881
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4973881 2016-08-18 Supervised Learning for Detection of Duplicates in Genomic Sequence Databases Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin PLoS One Research Article Public Library of Science 2016-08-04 /pmc/articles/PMC4973881/ /pubmed/27489953 http://dx.doi.org/10.1371/journal.pone.0159644 Text en © 2016 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Chen, Qingyu Zobel, Justin Zhang, Xiuzhen Verspoor, Karin Supervised Learning for Detection of Duplicates in Genomic Sequence Databases
title	Supervised Learning for Detection of Duplicates in Genomic Sequence Databases
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4973881/ https://www.ncbi.nlm.nih.gov/pubmed/27489953 http://dx.doi.org/10.1371/journal.pone.0159644
