
Classification of truck-involved crash severity: Dealing with missing, imbalanced, and high dimensional safety data

While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is ut...


Bibliographic Details
Main Authors:	Mohammadpour, Seyed Iman, Khedmati, Majid, Zada, Mohammad Javad Hassan
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10032500/ https://www.ncbi.nlm.nih.gov/pubmed/36947539 http://dx.doi.org/10.1371/journal.pone.0281901

_version_	1784910814111596544
author	Mohammadpour, Seyed Iman Khedmati, Majid Zada, Mohammad Javad Hassan
author_sort	Mohammadpour, Seyed Iman
collection	PubMed
description	While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
format	Online Article Text
id	pubmed-10032500
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10032500 2023-03-23 PLoS One Research Article Public Library of Science 2023-03-22 /pmc/articles/PMC10032500/ /pubmed/36947539 http://dx.doi.org/10.1371/journal.pone.0281901 Text en © 2023 Mohammadpour et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Classification of truck-involved crash severity: Dealing with missing, imbalanced, and high dimensional safety data
title_sort	classification of truck-involved crash severity: dealing with missing, imbalanced, and high dimensional safety data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10032500/ https://www.ncbi.nlm.nih.gov/pubmed/36947539 http://dx.doi.org/10.1371/journal.pone.0281901
work_keys_str_mv	AT mohammadpourseyediman classificationoftruckinvolvedcrashseveritydealingwithmissingimbalancedandhighdimensionalsafetydata AT khedmatimajid classificationoftruckinvolvedcrashseveritydealingwithmissingimbalancedandhighdimensionalsafetydata AT zadamohammadjavadhassan classificationoftruckinvolvedcrashseveritydealingwithmissingimbalancedandhighdimensionalsafetydata
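
The description field above compares classifiers by the G-mean performance measure and by worst class recall. The record does not give the paper's exact formulas, so the following is only a minimal sketch under a common assumption: G-mean is taken as the geometric mean of the per-class recalls, and worst class recall as their minimum. The function names and the toy labels are hypothetical and are not drawn from TIFA 2010.

```python
from collections import Counter
from math import prod


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    return {label: hits[label] / n for label, n in support.items()}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G-mean)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return prod(recalls) ** (1.0 / len(recalls))


def worst_class_recall(y_true, y_pred):
    """Smallest per-class recall, i.e. the recall of the worst-served class."""
    return min(per_class_recall(y_true, y_pred).values())


# Hypothetical toy labels, for illustration only.
y_true = ["fatal", "fatal", "fatal", "fatal", "injury", "injury", "none", "none"]
y_pred = ["fatal", "fatal", "fatal", "injury", "injury", "none", "none", "none"]

print(per_class_recall(y_true, y_pred))
print(g_mean(y_true, y_pred))
print(worst_class_recall(y_true, y_pred))
```

Because the geometric mean collapses toward zero when any single class is poorly recalled, a measure of this kind penalizes a model that ignores the minority class, which overall accuracy on imbalanced data does not.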


Similar Items