
Seq: A High-Performance Language for Bioinformatics

The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100—a factor of over 10...

Bibliographic Details
Main Authors:	SHAJII, ARIYA; NUMANAGIĆ, IBRAHIM; BAGHDADI, RIYADH; BERGER, BONNIE; AMARASINGHE, SAMAN
Format:	Online Article Text
Language:	English
Published:	2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9241673/ https://www.ncbi.nlm.nih.gov/pubmed/35775031 http://dx.doi.org/10.1145/3360551

author	SHAJII, ARIYA; NUMANAGIĆ, IBRAHIM; BAGHDADI, RIYADH; BERGER, BONNIE; AMARASINGHE, SAMAN
collection	PubMed
description	The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100, a factor of over 10^6, and the amount of data to be analyzed has increased proportionally. Yet, as Moore’s Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python (and is in many cases a drop-in replacement) yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly optimized bioinformatics software.
format	Online Article Text
id	pubmed-9241673
institution	National Center for Biotechnology Information
language	English
publishDate	2019
record_format	MEDLINE/PubMed
spelling	pubmed-9241673 2022-06-29 Seq: A High-Performance Language for Bioinformatics SHAJII, ARIYA; NUMANAGIĆ, IBRAHIM; BAGHDADI, RIYADH; BERGER, BONNIE; AMARASINGHE, SAMAN Proc ACM Program Lang Article 2019-10 2019-10-10 /pmc/articles/PMC9241673/ /pubmed/35775031 http://dx.doi.org/10.1145/3360551 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/)
title	Seq: A High-Performance Language for Bioinformatics
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9241673/ https://www.ncbi.nlm.nih.gov/pubmed/35775031 http://dx.doi.org/10.1145/3360551
