
Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population

The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we compr...


Bibliographic Details
Main authors: Shirota, Matsuyuki; Kinoshita, Kengo
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009343/
https://www.ncbi.nlm.nih.gov/pubmed/27589963
http://dx.doi.org/10.1093/database/baw124
collection PubMed
description The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences.
id pubmed-5009343
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
journal Database (Oxford)
published 2016-09-01
rights © The Author(s) 2016. Published by Oxford University Press.
license This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Article