
Improved search heuristics find 20 000 new alignments between human and mouse genomes

Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many v...


Bibliographic Details
Main Authors: Frith, Martin C., Noé, Laurent
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects: Methods Online
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3985675/
https://www.ncbi.nlm.nih.gov/pubmed/24493737
http://dx.doi.org/10.1093/nar/gku104
author Frith, Martin C.
Noé, Laurent
collection PubMed
description Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human–dog and melanogaster–pseudoobscura comparisons, but not for human–mouse, which suggests that we still miss many human–mouse alignments. Our optimized heuristics find ∼20 000 new human–mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
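The seeding idea mentioned in the description can be illustrated with a minimal sketch. It assumes a hypothetical pattern notation in which `1` requires an exact match, `T` accepts a match or a transition (A<->G or C<->T), and `0` accepts anything; the function names and the example pattern are illustrative only and are not taken from the article or from any alignment tool. The sketch covers only the seed-finding step, not the extension step.

```python
from collections import defaultdict

# Purines (A, G) collapse to "R", pyrimidines (C, T) to "Y". Two bases that
# are equal or differ by a transition therefore share the same class.
BASE_CLASS = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def seed_key(pattern, window):
    """Reduce a window to a key; equal keys mean the seed pattern is satisfied.

    Assumed notation: "1" = exact match, "T" = match or transition,
    "0" = ignored position. Returns None if a needed base is not A/C/G/T.
    """
    key = []
    for symbol, base in zip(pattern, window):
        if symbol == "0":
            continue  # don't-care position contributes nothing to the key
        if base not in BASE_CLASS:
            return None  # skip windows with ambiguous bases such as N
        key.append(base if symbol == "1" else BASE_CLASS[base])
    return "".join(key)


def find_seed_hits(pattern, query, target):
    """Return (query_pos, target_pos) pairs whose windows satisfy the pattern."""
    span = len(pattern)
    index = defaultdict(list)
    # Index every target window by its reduced key.
    for j in range(len(target) - span + 1):
        key = seed_key(pattern, target[j:j + span])
        if key is not None:
            index[key].append(j)
    hits = []
    # Look up each query window in the index.
    for i in range(len(query) - span + 1):
        key = seed_key(pattern, query[i:i + span])
        if key is not None:
            hits.extend((i, j) for j in index.get(key, []))
    return hits
```

For example, with the pattern `1T01`, the windows `ACGT` and `ATAT` form a hit (C and T differ by a transition, and the third position is ignored), while `ACGT` and `AGAT` do not (C versus G is a transversion). Using several such patterns at once corresponds loosely to the multiple-seed strategy that the description mentions, at the cost of one index per pattern.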
format Online
Article
Text
id pubmed-3985675
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3985675 2014-04-18 Improved search heuristics find 20 000 new alignments between human and mouse genomes Frith, Martin C. Noé, Laurent Nucleic Acids Res Methods Online Oxford University Press 2014-04 2014-01-31 /pmc/articles/PMC3985675/ /pubmed/24493737 http://dx.doi.org/10.1093/nar/gku104 Text en © The Author(s) 2014. Published by Oxford University Press.
http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Improved search heuristics find 20 000 new alignments between human and mouse genomes
topic Methods Online