Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses

BACKGROUND: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable resul...

Bibliographic Details
Main authors: Miotto, Olivo; Tan, Tin Wee; Brusic, Vladimir
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2259408/
https://www.ncbi.nlm.nih.gov/pubmed/18315860
http://dx.doi.org/10.1186/1471-2105-9-S1-S7
author Miotto, Olivo
Tan, Tin Wee
Brusic, Vladimir
collection PubMed
description BACKGROUND: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.
RESULTS: Over 40,000 annotated Influenza A protein sequences were collected by combining information from more than 90,000 documents from NCBI public databases. Metadata values were automatically extracted, aggregated and reconciled from several document fields by applying user-defined structural rules. For each property, values were recovered from ≥88.8% of records, with accuracy exceeding 96% in most cases. Because of semantic heterogeneity, each property required up to six different structural rules to be combined. Significant quality differences between databases were found: GenBank documents yield values more reliably than documents extracted from GenPept. Using a simple set of semantic rules and a reasoner, we reconstructed relationships between sequences from the same isolate, thus identifying 7640 isolates. Validation of isolate metadata against a simple ontology highlighted more than 400 inconsistencies, leading to over 3,000 property value corrections.
CONCLUSION: To overcome the quality issues inherent in public databases, automated knowledge aggregation with embedded intelligence is needed for large-scale analyses. Our results show that user-controlled intuitive approaches, based on combination of simple rules, can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records. Emerging semantic technologies possess desirable features to support today's knowledge aggregation tasks, with a potential to bring immediate benefits to this field.
format Text
id pubmed-2259408
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2259408 2008-03-04 Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses Miotto, Olivo Tan, Tin Wee Brusic, Vladimir BMC Bioinformatics Proceedings BioMed Central 2008-02-13 /pmc/articles/PMC2259408/ /pubmed/18315860 http://dx.doi.org/10.1186/1471-2105-9-S1-S7 Text en Copyright © 2008 Miotto et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2259408/
https://www.ncbi.nlm.nih.gov/pubmed/18315860
http://dx.doi.org/10.1186/1471-2105-9-S1-S7
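
The abstract in this record describes recovering each metadata property by combining several user-defined structural rules over different document fields. The sketch below illustrates that general idea only; it is not the authors' implementation. The record layout (a plain dict), the field names (`subtype`, `strain`, `definition`, `collection_date`), the regular expressions, and the rule order are all assumptions made for illustration. It relies on the common influenza strain naming pattern, for example `A/chicken/Hong Kong/220/1997(H5N1)`.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical record layout: field name -> raw text taken from a database document.
Record = Dict[str, str]
Rule = Callable[[Record], Optional[str]]


def field_regex(field: str, pattern: str) -> Rule:
    """Build one structural rule: search a single field and return capture group 1."""
    compiled = re.compile(pattern)

    def rule(record: Record) -> Optional[str]:
        match = compiled.search(record.get(field, ""))
        return match.group(1) if match else None

    return rule


# Assumed rule sets. Rules for a property are tried in order, so the fields
# considered most reliable should come first.
RULES: Dict[str, List[Rule]] = {
    "subtype": [
        field_regex("subtype", r"^(H\d+N\d+)$"),
        field_regex("strain", r"\((H\d+N\d+)\)"),
        field_regex("definition", r"\((H\d+N\d+)\)"),
    ],
    "year": [
        field_regex("collection_date", r"((?:19|20)\d{2})"),
        field_regex("strain", r"/((?:19|20)\d{2})\s*(?:\(|$)"),
    ],
}


def first_match(rules: List[Rule], record: Record) -> Optional[str]:
    """Return the value from the first rule that yields one, else None."""
    for rule in rules:
        value = rule(record)
        if value is not None:
            return value
    return None


def extract(record: Record) -> Dict[str, Optional[str]]:
    """Apply every property's rule list to one record."""
    return {prop: first_match(rules, record) for prop, rules in RULES.items()}


if __name__ == "__main__":
    example = {"strain": "A/chicken/Hong Kong/220/1997(H5N1)"}
    print(extract(example))
```

Keeping each rule as a small independent function makes it straightforward to add another fallback when a property is stored inconsistently across documents. A property that no rule can recover stays `None`, so such records can be counted or routed to manual curation.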