Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population
The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we compr...
Main authors: | Shirota, Matsuyuki; Kinoshita, Kengo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2016 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009343/ https://www.ncbi.nlm.nih.gov/pubmed/27589963 http://dx.doi.org/10.1093/database/baw124 |
_version_ | 1782451508616364032 |
---|---|
author | Shirota, Matsuyuki Kinoshita, Kengo |
collection | PubMed |
description | The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences. |
format | Online Article Text |
id | pubmed-5009343 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5009343 2016-09-07 Database (Oxford) Original Article Oxford University Press 2016-09-01 /pmc/articles/PMC5009343/ /pubmed/27589963 http://dx.doi.org/10.1093/database/baw124 Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population |
topic | Original Article |
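
The description above says that alternative allele frequencies were assigned to the discordant bases to judge which reference sequence carries the major allele. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes a biallelic site where the RefSeq base is the alternative allele relative to GRCh38; the `DiscordantBase` record, the `major_allele_source` function, and the sample sites are all hypothetical and for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscordantBase:
    """One GRCh38 position where the genome and the RefSeq mRNA disagree (hypothetical record)."""
    chrom: str
    pos: int                 # 1-based GRCh38 coordinate
    grch38_base: str
    refseq_base: str
    alt_allele_freq: float   # assumed: population frequency of the non-GRCh38 (RefSeq) allele


def major_allele_source(site: DiscordantBase) -> str:
    """Report which reference carries the major allele at a biallelic discordant site."""
    if not 0.0 <= site.alt_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    if site.grch38_base == site.refseq_base:
        return "concordant"
    if site.alt_allele_freq > 0.5:
        return "RefSeq"      # the alternative allele is the common one
    if site.alt_allele_freq < 0.5:
        return "GRCh38"      # the genome reference allele is the common one
    return "tie"


# Toy data with made-up coordinates and frequencies, not taken from the paper.
toy_sites = [
    DiscordantBase("chr1", 100, "A", "G", 0.80),
    DiscordantBase("chr2", 200, "C", "T", 0.10),
    DiscordantBase("chr3", 300, "G", "A", 0.95),
]
tally = Counter(major_allele_source(site) for site in toy_sites)
print(dict(tally))
```

Tallying the labels over all discordant sites gives the kind of per-database comparison the description summarizes; a real analysis would also need to handle multiallelic sites and missing frequency data.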