Introduction
Machine learning (ML) is a valuable and increasingly common tool for modern health care systems. Across many medical areas, ML algorithms have aided health care professionals to provide an accurate diagnosis for a disease, make an appropriate clinical decision for a case, and predict an outcome for a treatment intervention. Reference May1,Reference Obermeyer and Emanuel2 However, many of the applications of ML algorithms have focused on improving the safety and quality of care for patients within the in-hospital setting. Reference Obermeyer and Emanuel2–Reference Chang, Yu and Yoon4 The potential benefits of ML applications in the prehospital setting are not yet well-established. In-depth knowledge regarding ML applications on prehospital patient care is important for identifying effective technologies and providing guidance for ML-driven research in this specific domain.
Personnel from Emergency Medical Services (EMS) often respond to prehospital cases requiring a rapid response time, early recognition of a life-threatening condition, and timely treatments. Across these time-sensitive tasks, there is evidence that ML algorithms have the potential to augment EMS performance. For example, the response time of a dynamic ambulance re-deployment strategy has been reduced by 1.5 minutes using ML algorithms in reference to conventional methods that are manually designed. Reference Ji, Zheng and Wang5 In addition, dispatchers of EMS who used a clinical decision support algorithm to recognize out-of-hospital cardiac arrest (OHCA) were able to capture more cases within a shorter time interval than those who used the conventional “No-No-Go” approach. Reference Blomberg, Christensen and Lippert6 Despite this, according to the 2015-2020 report for approving ML-based medical products in the United States and Europe, none of the EMS-guided ML applications have been translated into a real product. Reference Muehlematter, Daniore and Vokinger7
To date, few reviews have summarized the application of ML in emergency medicine, but none is exclusive to prehospital emergency medical care or ambulance service. Reference Kirubarajan, Taher and Khan8–Reference Kareemi, Vaillancourt and Rosenberg10 While many EMS-guided ML applications have been developed, it is difficult to determine the value of each application in the absence of consolidated evidence. This review, therefore, aimed to systematically review and describe the literature on EMS-guided ML applications in both clinical and operational purposes, and further to describe the characteristics of ML algorithms and their performance.
Methods
The study approach involved conducting a scoping review of ML applications in EMS, complemented by a quantitative synthesis of their performance metrics. This review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses for Scoping Reviews (PRISMA-ScR) guidelines. Reference Tricco, Lillie and Zarin11 The protocol of this review is registered with PROSPERO (ID: CRD42021271256).
Search Strategy
The literature was systematically reviewed from inception until January 9, 2024 across four databases, including: Medline (US National Library of Medicine, National Institutes of Health; Bethesda, Maryland USA); Scopus (Elsevier; Amsterdam, Netherlands); CINAHL (EBSCO Information Services; Ipswich, Massachusetts USA); and Computers and Applied Sciences (EBSCO Information Services; Ipswich, Massachusetts USA). A comprehensive search strategy was created using relevant keywords, phrases, text terms, and subject headings (Supplementary Material S1 Table; available online only). The search strategy was designed to capture all reports related to EMS, ambulance, or paramedic and ML algorithms, deep learning, or neural networks. Search results were imported to the EndNote reference software (Version X9; Clarivate Analytics; Philadelphia, Pennsylvania USA). After removing duplicates, search results were then imported to Covidence software (Covidence; Melbourne, VIC, Australia) for managing and streamlining the review. 12
Study Selection
Two independent reviewers (AA and ZA) initially screened the titles and abstracts for eligibility. At this stage, a study was deemed pertinent if it was written in English; original research; peer-reviewed trials, observational, or experimental studies; and utilized a data-driven methodology involving at least one ML algorithm aimed to classify or predict an EMS-specific outcome related to either clinical or operational applications specific to EMS use.
For a study to be classified under clinical application, the implemented ML algorithm must have been engineered with the aim of augmenting the predictive or classifying capabilities concerning a clinical condition, a clinical intervention, or a potential clinical outcome. For operational applications, the ML algorithm must have been geared towards enhancing the operational efficiency of an EMS system’s resources such as optimizing, streamlining, or improving the use of EMS resources.
Conference proceedings, news, case reports, editorial letters, commentaries, reviews, and non-English reports were excluded. Studies that have used EMS data to aid emergency department and in-hospital outcomes were also excluded. After screening the titles and abstracts, a full-text screening was carried out by the two reviewers (AA and ZA) for final inclusion. Any disagreement was resolved through consensus.
Data Collection
A standardized extraction form was developed to collect relevant data from the included studies. Baseline characteristics such as the first author, year of publication, study objective and design, location, duration, data source, and the number of cases were extracted. Information related to ML algorithms such as the number and types of ML algorithms used, input features, output outcomes, and model performance were also extracted. Additionally, the values of the performance metrics were extracted. This includes the values of the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and positive predictive value. One reviewer extracted the relevant data (ZA), and this was cross-checked by a second reviewer (AA). Disagreements were resolved through consensus.
Data Synthesis
The applications of ML algorithms across EMS studies were dichotomized into clinical and operational domains. Study characteristics such as the year of publication, geographic region, study design, data source, and number of ML algorithms employed across these domains were summarized using descriptive statistics. Numerical variables were, as appropriate, reported as medians and ranges and categorical variables were reported as counts and percentages.
The studies were then classified within the clinical domain into subdomains based on the medical condition, including OHCA, cardiovascular diseases (CVDs), trauma, and others (ie, including unspecified medical conditions). Similarly, the studies were classified within the operational domain into subdomains, and this was based on the operational task, including ambulance allocation (ie, ambulance demand forecasting and relocation), deployment (ie, ambulance dispatching), detection (ie, ambulance recognition by image or siren capturing), route optimization (ie, route selection), and quality assurance (ie, quality indicators or documentation assessment). All subdomains were assessed, using descriptive statistics as described above, against the following variables: number of studies, clinical tasks, number of cases, the number and type of input features, methods of ML algorithm, and performance metrics.
The values of performance metrics were evaluated by estimating the unweighted medians and ranges. The performance metrics estimated included AUC, accuracy, sensitivity, and specificity for the best ML algorithm as declared by individual studies. STATA statistical software, version 16.0 (Stata Corp; College Station, Texas USA) was used to carry out all statistical analyses.
Results
The search strategy identified 6,661 records, of which 1,478 were duplicates and 5,183 were excluded in the title and abstract screening. A total of 647 full-text articles were retrieved and assessed for eligibility. Finally, a total of 164 studies were included in this review (Figure 1).
Characteristics of Included Studies
Table 1 summarizes the characteristics of the included studies by domain. Overall (n = 164), the median year of publication was 2020 ranging from 2005 through 2024; the number of publications exponentially increased after 2017. Figure 2 shows the distribution of the included studies over the years for different characteristics of the included studies. More EMS-guided ML applications were reported in the Asia-Pacific region (37.2%) than those that were reported in Europe (29.3%) and the United States (28.7%); most studies were retrospective in design (94.5%) and the data were mainly derived from the EMS records (45.1%) and registries (40.9%); and 51.2% of the included studies used one ML algorithm, whereas 48.8% of them used more than one. More than two-thirds of the included studies deployed ML for a clinical application (125 studies; 76.2%), while 39 (23.8%) studies applied ML for an operational application. All included studies are listed and described in Table S2 and Table S3 in the Supplementary Material (available online only). Operational studies were reported mostly in the Asia-Pacific (51.3%), whereas the United States and Europe had a similar number of studies, with a predominant focus on clinical research.
Abbreviations: EMS, Emergency Medical Services; ML, machine learning.
a Reported in 125 studies (n = 104 for clinical domain and n = 21 for operational).
Characteristics of ML Algorithms in the Clinical Domain
Table 2 presents the characteristics of EMS-guided ML algorithms applied in the clinical domain studies, categorized by medical conditions, including OHCA (n = 62 studies; 49.6%), CVDs (n = 19 studies; 15.2%), trauma (n = 24 studies; 19.2%), and others (n = 20 studies; 16.0%). The purpose of most studies in this domain was to primarily triage or diagnose a medical condition (50.0%), followed by predicting a clinical outcome (40.3%) and a clinical intervention (9.7%).
Abbreviations: OHCA, out-of-hospital cardiac arrest; CVD, cardiovascular diseases; ECG, electrocardiography; AUC, area under the receiver operating characteristic curve; ML, machine learning.
a Number of cases/instances is missing in 23 studies.
b Number of input features is missing in 54 studies.
Among OHCA studies, the most frequent purposes were predicting a clinical outcome (n = 38; 61.3%) and detecting or triaging OHCA (n = 22; 35.5%). Almost all CVD studies employed ML for triage and diagnosis (n = 18; 94.7%). The purpose of most studies concentrating on trauma was to triage traumatic cases (n = 11; 45.8%) or to predict a clinical outcome (n = 9; 37.5%).
The median number of cases/instances for most studies within the clinical domain was 3,405 (n = 102 studies; range: 37-8,981,181). The sample size was larger among studies focusing on trauma and other medical conditions than studies focusing on OHCA and CVDs. When reported, the median number of input features for studies within the clinical domain was 19 (n = 71 studies; range: 1-886). The number of input features was larger for CVD studies compared to other clinical conditions. Among all clinical domain studies, the most frequent type of input features was clinical (ie, demographics, vitals, and findings of physical examination) obtained from EMS medical records (53.7%), followed by electrocardiograph (ECG) strips (35.0%), text scripts (11.0%), and audio recordings (5.1%). The majority of the studies focused on OHCA involved ECG input features (n = 36; 59.0%), while most studies related to all other conditions involved clinical features.
The most frequently employed ML algorithms were neural network (n = 60 studies; 50.0%), followed by random forest (n = 57 studies; 48.3%), logistic regression (n = 47 studies; 39.8%), and support vector machine (n = 43 studies; 36.4%). This distribution was similar within all clinical conditions. To evaluate the performance of the ML algorithms, the most frequently used performance metric was AUC (n = 76; 60.8%), followed by sensitivity (n = 73; 58.4%) and specificity (n = 67; 53.6%).
Performance of ML Algorithms in the Clinical Domain
In the clinical domain, the median AUC for the best ML algorithms was 85.6% (n = 76 studies; range: 64.0%-99.9%). The median accuracy was 88.1% (n = 40 studies; range: 57.8%-99.2%), sensitivity was 86.0% (n = 73 studies; range: 38.8%-99.7%), and specificity was 86.5% (n = 67 studies; range: 11.0%-99.9%).
Figure 3 depicts the performance of the best ML algorithms employed to triage and diagnose clinical conditions and predict clinical outcomes across the clinical conditions and ML algorithms. Overall, the medians of accuracy, sensitivity, and specificity of the ML algorithms designed to triage and diagnose a clinical condition (≥90.0%) were higher than the algorithms designed to predict a clinical outcome (<90.0%).
The median values of most performance metrics for ML algorithms deployed to triage or diagnose clinical conditions were highest in the context of OHCA and lowest in traumatic emergencies. However, compared to algorithms predicting a clinical outcome after OHCA, algorithms predicting a clinical outcome after traumatic emergencies had a higher median AUC (87.3% versus 90.5%, respectively) and sensitivity (81.7% versus 91.0%), but lower accuracy (86.3% versus 87.6%) and specificity (80.3% versus 61.0%). Neural network, as well as ensemble algorithms, appeared to outperform other ML algorithms across clinical purposes and conditions.
Characteristics of ML Algorithms in the Operational Domain
Table 3 presents the characteristics of EMS-guided ML algorithms applied in the operational domain studies, categorized by the operational task, including ambulance allocation (n = 21 studies), ambulance detection (n = 5 studies), ambulance deployment (n = 5 studies), route optimization (n = 5 studies), and quality assurance (n = 3 studies).
Abbreviations: EMS, Emergency Medical Service; ML, machine learning; AUC, area under the receiver operating characteristic curve.
a Number of cases is missing in 19 studies.
b Number of input features is missing in 29 studies.
The overall median sample size was 11,899 (n = 20 studies; range: 77-7,364,275) and the largest sample sizes were reported in studies focusing on ambulance allocation and deployment. The overall median number of input features was 10 (n = 10 studies; range: 3-9,224) and the largest median was found in ambulance deployment studies (18.0; range: 8-9,224). Among studies within the operational domain, the most frequent types of input features were EMS-related data (n = 17 studies; 43.6%), spatiotemporal (n = 13 studies; 33.3%), clinical findings (n = 10 studies; 25.6%), audio recordings (n = 3 studies; 7.7%), and image (n = 3 studies; 7.7%).
The neural network was the most frequently employed ML algorithm (n = 24 studies; 61.5%) across all operational purposes. This was followed by ensemble algorithms (n = 7 studies; 17.9%), support vector machine (n = 6 studies; 15.4%), random forest (n = 6 studies; 15.4%), and decision tree (n = 4 studies; 10.3%). To evaluate the performance of the ML algorithms, accuracy (n = 17 studies; 43.6%) was the most frequently used performance metric, followed by sensitivity (n = 10 studies; 25.6%) and specificity (n = 6 studies; 15.4%). Notably, a total of 18 (46.2%) studies did not report the performance of their ML models.
Performance of ML Algorithms in the Operational Domain
For the operational domain, the median AUC of the best ML algorithms was 96.1% (n = 6 studies; range: 89.8%-99.0%), accuracy was 90.0% (n = 17; range: 24.5%-98.7%), sensitivity was 94.4% (n = 10; range: 83.0%-99.5%), and specificity was 87.7% (n = 6; range: 66.8%-99.4%).
Figure 4 illustrates the performance of the best ML models categorized by operational tasks and types of ML algorithms. The medians of accuracy metrics across studies related to ambulance detection and deployment were higher than those related to ambulance allocation, route optimization, and quality assurance. The frequently applied algorithm was a neural network which, along with the ensemble, out-performed other algorithms to a certain level.
Discussion
The journey of prehospital emergency care for a patient, including call receipt, dispatch, travel, triage, diagnosis, treatment, and transport to appropriate definitive care, is efficiently managed by multiple operational and medical components of EMS. This review showed that ML algorithms were implemented in almost all EMS components, including dispatch and communication system, first responders, ambulances and paramedics, medical oversight, and quality improvement. In addition, ML has the potential to augment the decision-making capabilities of EMS personnel in both clinical and operational domains or even supplement their judgment in specific aspects of EMS components. However, the integration of ML in EMS is still evolving, and further research, development, and validation may be needed to fully leverage its capabilities in clinical and operational applications. Considerable heterogeneity was observed in the included studies in terms of population characteristics, the ML algorithms employed, input features, and performance metrics which limited a more comprehensive review. The implications and performance of ML should be in-depth evaluated for each clinical condition, outcome, or operational task.
Several studies showed the potential of ML algorithms in improving emergency medical response to OHCA. More than 15 studies concentrated on implementing ML for classifying cardiac arrest ECG rhythms or detecting the return of spontaneous circulation aiming to improve resuscitation performance and provide timely guidance for delivering electrical defibrillation. Multiple ML algorithms exhibit outstanding performance (AUC > 0.90), demonstrating remarkable accuracy and practical applicability in the real-world context. For example, Jaureguibeitia, et al proposed novel deep learning architectures (AUC = 98.6%) for shock decisions using ECG segments as brief as one second, which likely shorten interruptions in resuscitation for rhythm analysis. Reference Jaureguibeitia, Zubia and Irusta13 Elola, et al also implemented deep neural networks for pulse detection using only the ECG (AUC = 86.2%), Reference Elola, Aramendi and Irusta14 while Alonso, et al (AUC = 92.6%) combined ECG and thoracic impedance signals for superior pulse detection. Reference Alonso, Irusta and Aramendi15 Further, deep neural networks have demonstrated the ability to detect OHCA at the time of calling for help, arguably with greater accuracy and speed than the EMS telecommunicators. Reference Blomberg, Christensen and Lippert6,Reference Blomberg, Folke and Ersbøll16,Reference Byrsell, Claesson and Ringh17
Predictive ML models were also developed for three clinical outcomes after OHCA, including the return of spontaneous circulation, survival, and neurological recovery. Among many ML models, deep neural networks achieved the highest performance (AUC = 0.84 to 0.90) in predicting the success of defibrillation using ECG features. Reference He, Lu and Zhang18–Reference Shandilya, Kurz and Ward20 Acknowledging the limited number of cases in most of these studies (n <500 cases), larger and more diverse samples of cases are likely essential to limit the risk of bias and to enhance the performance, reliability, and applicability of ML algorithms in predicting defibrillation success. Reference Yang21
Deep neural networks and ensemble ML algorithms performed exceptionally (AUC = 0.90 to 0.96) among others to predict survival and neurological recovery using prehospital clinical and EMS data. Reference Cheng, Chiu and Zeng22–Reference Kawai, Okuda and Kinoshita24 However, these predictive models exhibit considerable heterogeneity in quality, type of population, input and output features, and algorithm selection, potentially affecting generalizability. Additionally, the complexity and lack of transparency in certain ML algorithms might obstruct their integration into clinical practice. Reference Yang21 Future ML studies should discuss features attribution or explain the ML models output by using several methods such as Shapley Additive Explanation, Local Interpretable Model-agnostic Explanations, and integrated gradients.
Most ML models have also been shown to improve the detection of CVDs in the prehospital setting. A deep learning model achieved an impressive AUC score of 0.997 in detecting ST-elevation myocardial infarction (STEMI). Reference Chen, Wang and Liu25 This model has been deployed in a prehospital 12-lead ECG device and resulted in a shorter median time between first medical contact and hospital arrival (18.5 minutes), compared to the weighted mean estimated by a previous meta-analysis (41 minutes). Reference Alrawashdeh, Nehme and Williams26 Further, Bouzid, et al have developed a fusion model to detect non-STEMI using prehospital 12-lead ECG which significantly outperformed the performance of expert clinicians and commercial software. Reference Bouzid, Faramand and Gregg27 In addition, prehospital stroke diagnostic ML algorithms have been shown to be accurate in identifying and differentiating stroke using prehospital clinical presentation variables. Reference Chen, Zhang and Xu28–Reference Uchida, Kouno and Yoshimura31 Two studies have shown that neural network algorithms out-performed other available prehospital stroke prediction scales, such as the Prehospital Acute Stroke Severity scale and the Cincinnati Prehospital Stroke Severity scale. Reference Chen, Zhang and Xu28,Reference Zhang, Zhou and Zhang29 However, a larger number of parameters (ie, 18 variables) were included in these neural network models, which demand more effort and time to be collected by ambulance clinicians.
For traumatic conditions, ML models have been increasingly applied to primarily improve the triage decision by dispatchers, first responders, and ambulance clinicians in various out-of-hospital contexts (ie, general trauma, road accidents, and military). These models were developed either to classify traumatic patients by their level of severity (ie, Injury Severity Score) Reference Candefjord, Sheikh Muhammad and Bangalore32 or to predict the level of treatment needed (ie, advanced life-saving interventions, trauma team activation, and critical care) Reference Morris, Karam and Zolfaghari33,Reference Kang, Cho and Kwon34 or the clinical outcomes (ie, survival). Reference Kim, You and So35 The input features in most models were relatively limited and mainly consisted of demographics, vital signs, other clinical manifestations, or accident-related variables. Neural networks and ensemble techniques (random forest, XGBoosting) were frequently used and had the best performance compared to other models. Two studies found that binary logistic regression performed better than ML algorithms when classifying the severity of injuries. Reference Candefjord, Sheikh Muhammad and Bangalore32,Reference Larsson, Berg and Gellerfors36 This is likely attributed to the limited number of input features in these studies. Overall, the contribution of ML seems to be promising in this context, but further validation is required to better understand the applicability of ML in supporting triage decisions and optimizing resource allocation.
Several ML models have been implemented to solve operational problems and augment the decision making of EMS personnel in operational tasks. The over-arching aims of the operational studies were to reduce ambulance travel time and deploy the appropriate level of prehospital care to the right patients. A variety of supervised, unsupervised, and reinforcement models have been used to improve EMS demand forecasting in different settings to optimize ambulance allocation strategies, and eventually reduce response times. The characteristics of ML models forecasting EMS demand varied in terms of time horizons (hourly, daily, weekly, and monthly demand) and input features (geospatial, temporal, metrological, and EMS data). Several studies aimed to facilitate ambulance passage through traffic congestion by either audio-based and/or visual-based detection. The audio-based models using random forest, support vector machine, and neural networks achieved AUC scores between 0.97-1.00. Reference Dhakal, Damacharla and Javaid37–Reference Tran and Tsai40 Visual-based models mainly used deep learning with variable performance (AUC = 0.78-1.00) Reference Tran and Tsai40–Reference Roy and Rahman42 while hybrid models incorporated both audio and image inputs and typically relied on deep learning yield strong results (0.86-0.98). Reference Tran and Tsai40,Reference Raman, Kaushik and Rao43 The performance of these models indicates the viability of employing ML to detect ambulances and expedite their passage through intersections and traffic congestion.
Finally, the implementation of the developed ML models in real time was rarely discussed in the included studies. Several challenges such as the generalizability and explainability of these ML models may complicate their implication in clinical and real time. Several studies reported perfect discrimination, especially for operational purposes, which warrant a critical examination of the underlying data and methodologies employed. While perfect discrimination can be attributed to factors like over-fitting, reliance on internal validation, data leakage, or imbalanced datasets, it is imperative to consider the broader context of its applicability, generalizability, and potential limitations in real-world EMS scenarios. For example, the generalization of ML algorithms for ambulance detection using audio or image data may pose significant challenges due to the wide variation in ambulance sirens and morphologies across different EMS systems.
It is also important to acknowledge the complications and limitations associated with the retrospective nature of most included studies. Rich prehospital datasets are still necessary; however, they are likely challenging to obtain, owing to limitations in data quality and availability, patient privacy, and the heterogeneity of EMS system data frameworks. Reference Aggarwal, Sounderajah and Martin44 This necessitates collaborative efforts among researchers, EMS systems, and other health care institutions, as well as policymakers to establish data-sharing frameworks, best practices, and governance models that facilitate the responsible and impactful application of these technologies in the challenging and dynamic landscape of EMS. For better explainability, future ML studies should also discuss features attribution or explain the output of ML models by using available methods such as Shapley additive explanation, local interpretable model-agnostic explanations, and integrated gradients. Reference Petch, Di and Nelson45
Limitations
This review has several limitations. Of note, this review may have some selection bias. It only included English peer-reviewed articles identified by the search strategy. The variation in the terminology of ML and EMS in the literature may also contribute to selection bias. The wide scope of this review may hinder proper summarization of the applications and performance of the ML in specific clinical conditions and operational tasks. For example, the approaches for features selection, model validation, and optimization used in the included studies were not detailed. Future systematic reviews could be done to quantify the use of ML in specific clinical conditions such as detecting OHCA or predicting the clinical outcomes after OHCA.
Also not examined was the risk of bias and quality of the included studies. This is attributed to the large number and the wide variation in the objectives and characteristics of the included studies. Although the performance of the ML algorithms performed well in most applications, the issue of risk of bias should be considered in most ML models, particularly within the analysis domain. Reference Andaur Navarro, Damen and Takada46,Reference Dhiman, Ma and Andaur Navarro47 Several studies had not reported all relevant details such as sampling, input features selection, validation procedures, and model performance, particularly for studies within the operational domain. These missing data also limit the overall analysis and the generalizability of the findings. Finally, this review did not assess whether the ML models deployed in the included studies had obtained regulatory approval from agencies such as the United States Food and Drug Administration (FDA; Silver Spring, Maryland USA) or other relevant authorities. This limitation should be acknowledged when considering the broader applicability of these models in clinical practice.
Conclusion
This systematic review highlights the diverse applications and potential benefits of ML in EMS. Most studies focused on the use of ML algorithms to enhance the clinical domain, particularly in detecting and predicting the severity, treatment, and clinical outcome of patients with OHCA, CVDs, and trauma. The performance of ML models varied widely, indicating the need for further research and standardization of ML algorithms in EMS. Additionally, the review found that ML can improve the operational management of ambulance services, including optimizing ambulance allocations, reducing travel time, and enhancing resource utilization. This review is expected to provide useful benchmarks and implications for possible future research. Further research is needed to investigate the implementation and quality of ML in specific clinical conditions or operational tasks within the EMS.
Conflicts of interest/funding
This work was supported by the Jordan University of Science and Technology under Grant number 20210307. Ziad Nehme is funded by a National Heart Foundation fellowship. The authors report there are no competing interests to declare.
Supplementary Materials
To view supplementary material for this article, please visit https://doi.org/10.1017/S1049023X24000414
Author Contribution
AA and SA conceived the idea; AA, ZA, and NM conducted study selection and data extraction; AA, SA, and ZA synthesized the findings and wrote the first draft. All authors reviewed and confirmed the final manuscript.