HPTMT Parallel Operators for High Performance Data Science and Data Engineering

Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of...

Bibliographic Details
Main Authors:	Abeykoon, Vibhatha, Kamburugamuve, Supun, Widanage, Chathura, Perera, Niranda, Uyar, Ahmet, Kanewala, Thejaka Amila, von Laszewski, Gregor, Fox, Geoffrey
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Big Data
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8860100/ https://www.ncbi.nlm.nih.gov/pubmed/35198971 http://dx.doi.org/10.3389/fdata.2021.756041

_version_	1784654597139202048
author	Abeykoon, Vibhatha Kamburugamuve, Supun Widanage, Chathura Perera, Niranda Uyar, Ahmet Kanewala, Thejaka Amila von Laszewski, Gregor Fox, Geoffrey
author_facet	Abeykoon, Vibhatha Kamburugamuve, Supun Widanage, Chathura Perera, Niranda Uyar, Ahmet Kanewala, Thejaka Amila von Laszewski, Gregor Fox, Geoffrey
author_sort	Abeykoon, Vibhatha
collection	PubMed
description	Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often, the lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together. Our analysis shows that the proposed system architecture is better suited for high performance computing environments compared to the current big data processing systems. Furthermore, our proposed system emphasizes the importance of efficient compact data structures such as the Apache Arrow tabular data representation defined for high performance. Thus, the system integration we proposed scales a sequential computation to a distributed computation, retaining optimum performance along with a highly usable application programming interface.
format	Online Article Text
id	pubmed-8860100
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8860100 2022-02-22 HPTMT Parallel Operators for High Performance Data Science and Data Engineering Abeykoon, Vibhatha Kamburugamuve, Supun Widanage, Chathura Perera, Niranda Uyar, Ahmet Kanewala, Thejaka Amila von Laszewski, Gregor Fox, Geoffrey Front Big Data Big Data Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often, the lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together. Our analysis shows that the proposed system architecture is better suited for high performance computing environments compared to the current big data processing systems. Furthermore, our proposed system emphasizes the importance of efficient compact data structures such as the Apache Arrow tabular data representation defined for high performance. Thus, the system integration we proposed scales a sequential computation to a distributed computation, retaining optimum performance along with a highly usable application programming interface. Frontiers Media S.A. 2022-02-07 /pmc/articles/PMC8860100/ /pubmed/35198971 http://dx.doi.org/10.3389/fdata.2021.756041 Text en Copyright © 2022 Abeykoon, Kamburugamuve, Widanage, Perera, Uyar, Kanewala, von Laszewski and Fox. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Big Data Abeykoon, Vibhatha Kamburugamuve, Supun Widanage, Chathura Perera, Niranda Uyar, Ahmet Kanewala, Thejaka Amila von Laszewski, Gregor Fox, Geoffrey HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title	HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title_full	HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title_fullStr	HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title_full_unstemmed	HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title_short	HPTMT Parallel Operators for High Performance Data Science and Data Engineering
title_sort	hptmt parallel operators for high performance data science and data engineering
topic	Big Data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8860100/ https://www.ncbi.nlm.nih.gov/pubmed/35198971 http://dx.doi.org/10.3389/fdata.2021.756041
work_keys_str_mv	AT abeykoonvibhatha hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT kamburugamuvesupun hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT widanagechathura hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT pereraniranda hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT uyarahmet hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT kanewalathejakaamila hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT vonlaszewskigregor hptmtparalleloperatorsforhighperformancedatascienceanddataengineering AT foxgeoffrey hptmtparalleloperatorsforhighperformancedatascienceanddataengineering
