Aligner optimization increases accuracy and decreases compute times in multi-species sequence data

As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows–Wheeler aligner-maximum exact matches) relat...

Bibliographic Details
Main authors: Robinson, Kelly M., Hawkins, Aziah S., Santana-Cruz, Ivette, Adkins, Ricky S., Shetty, Amol C., Nagaraj, Sushma, Sadzewicz, Lisa, Tallon, Luke J., Rasko, David A., Fraser, Claire M., Mahurkar, Anup, Silva, Joana C., Dunning Hotopp, Julie C.
Format: Online Article Text
Language: English
Published: Microbiology Society 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5643015/
https://www.ncbi.nlm.nih.gov/pubmed/29114401
http://dx.doi.org/10.1099/mgen.0.000122
_version_ 1783271452734128128
author Robinson, Kelly M.
Hawkins, Aziah S.
Santana-Cruz, Ivette
Adkins, Ricky S.
Shetty, Amol C.
Nagaraj, Sushma
Sadzewicz, Lisa
Tallon, Luke J.
Rasko, David A.
Fraser, Claire M.
Mahurkar, Anup
Silva, Joana C.
Dunning Hotopp, Julie C.
author_sort Robinson, Kelly M.
collection PubMed
description As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows–Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium–human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
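The recommendation in the abstract (one combined reference, 23 nt seed length) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names and file names are hypothetical, and the only external assumption is the standard `bwa index` and `bwa mem -k` command-line usage, where `-k` sets the minimum seed length. The last helper reproduces the abstract's arithmetic for the separate-reference runs, where the summed mapping rate exceeds 100 % because some reads map to both genomes.

```python
from typing import List, Sequence, Tuple


def combine_references(fasta_paths: Sequence[str], combined_path: str) -> None:
    """Concatenate several FASTA files into one combined reference file."""
    with open(combined_path, "wb") as out:
        for path in fasta_paths:
            last_byte = b"\n"
            with open(path, "rb") as src:
                # Stream in 1 MiB chunks so large genomes are not held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next file's first header starts on its own line.
            if last_byte != b"\n":
                out.write(b"\n")


def bwa_commands(combined_path: str, reads_1: str, reads_2: str,
                 seed_length: int = 23) -> List[List[str]]:
    """Build (but do not run) the argv lists for indexing and paired-end alignment.

    `bwa mem` writes SAM to standard output, so a caller would redirect it.
    """
    return [
        ["bwa", "index", combined_path],
        ["bwa", "mem", "-k", str(seed_length), combined_path, reads_1, reads_2],
    ]


def separate_reference_totals(pct_mapped: Sequence[float],
                              cpu_hours: Sequence[float]) -> Tuple[float, float]:
    """Sum per-reference mapping percentages and CPU hours, rounded to 0.1."""
    return round(sum(pct_mapped), 1), round(sum(cpu_hours), 1)
```

With the Plasmodium figures from the abstract, `separate_reference_totals([24.1, 83.6], [1.7, 0.2])` gives the reported 107.7 % and 1.9 CPU hours, against 97.1 % and 0.7 CPU hours for the combined reference. The command lists are returned rather than executed so that the caller decides how to run them and where to send the SAM output.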
format Online
Article
Text
id pubmed-5643015
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-5643015 2017-11-07 Aligner optimization increases accuracy and decreases compute times in multi-species sequence data Microb Genom Research Article Microbiology Society 2017-07-08 /pmc/articles/PMC5643015/ /pubmed/29114401 http://dx.doi.org/10.1099/mgen.0.000122 Text en © 2017 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.
title Aligner optimization increases accuracy and decreases compute times in multi-species sequence data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5643015/
https://www.ncbi.nlm.nih.gov/pubmed/29114401
http://dx.doi.org/10.1099/mgen.0.000122