Enhancing healthcare data privacy and interoperability with federated learning

PeerJ Computer Science

Introduction

The digitalization of the modern world has led to major technological advances that impact various aspects of human life. Today, many devices are interconnected via the Internet in a network called the Internet of Things (IoT) (Aledhari et al., 2022). Connectivity not only improves functionality but also revolutionizes the way we interact with technology on a daily basis. These networks help solve many problems, with healthcare being a major sector requiring cutting-edge solutions and specific care (Famá, Faria & Portugal, 2022). The digitalization of healthcare opens immense opportunities for improving the quality of life through electronic mobile health applications, medical imaging, medical records, low-cost genetic sequencing, and the diffusion of sensors and wearable devices (Lehne et al., 2019; Gupta & Gupta, 2019). Combined with emerging technologies such as artificial intelligence (AI), big data analytics, and cloud computing, this wealth of digital health data promises to improve the lives of millions of patients worldwide (Lehne et al., 2019). Individuals can take part in this digitalization by using wearable devices that measure weight, calories, sleep, exercise, heart rate, body temperature, and other vital signs (White, Liang & Clarke, 2019; Pardamean et al., 2020; Sharma & Rani, 2021). The wearable technology market is currently booming, and such devices have become very affordable (White, Liang & Clarke, 2019; Pardamean et al., 2020; Akdevelioglu, Hansen & Venkatesh, 2021). This has allowed the many people interested in digitally tracking their health and behavior for personal improvement to form a community of enthusiasts known as the Quantified Self (QS) (White, Liang & Clarke, 2019; Akdevelioglu, Hansen & Venkatesh, 2021; Liang, 2022). The term generally encompasses anything about oneself that can be measured (Sharma & Rani, 2021). Recently, growing academic interest in machine learning has given rise to an interdisciplinary research field called personal informatics, which draws on consumer informatics, health informatics, ubiquitous computing, and human-computer interaction (HCI) (Liang, 2022).

Despite the rapid evolution of connected objects, analyzing large amounts of QS data remains a challenge. QS practice remains primarily focused on data collection, while data analysis is limited to basic visualization and correlation analysis (Liang, 2022). Making more effective use of the large amounts of collected QS data requires machine learning (ML) algorithms (Sharma & Rani, 2021). However, a significant portion of medical data lacks interoperability: it is stored in isolated databases, hosted in incompatible systems, and enmeshed in proprietary software, making it difficult to share, analyze, and interpret. Current healthcare systems use varied data formats, individualized specifications, and unclear semantics, which hinders the development of technologies that rely on this data, such as machine learning, AI, and big data (Lehne et al., 2019; Xu et al., 2021). Therefore, data interoperability is essential to fully harness the capabilities of smart, interconnected healthcare, which has the potential to improve patient outcomes and reduce costs (Seneviratne, 2023).

Syntactic interoperability ensures that data from diverse sources, including sensors and wearable devices, follow a standard format, which is critical for seamless integration and analysis. By adopting standardized data formats through protocols such as Fast Healthcare Interoperability Resources (FHIR), healthcare systems can share and use patient data more effectively. Such standardization allows machine learning models to receive and process precise, well-structured datasets. The benefits extend beyond data integration: standardization also enhances the predictive capabilities of ML algorithms as well as the scalability and reliability of Federated Learning (FL), because FL models can be trained on varied datasets from different institutions without infringing on data privacy. Syntactic interoperability is therefore at the heart of the successful deployment of machine learning and AI in healthcare, serving as the critical bridge between heterogeneous data sources and holistic, actionable intelligence.
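
To make the notion of a standard format concrete, the sketch below shows how a single heart-rate reading from a wearable could be encoded as a FHIR R4 Observation resource, written here as a Python dict mirroring the JSON wire format. The patient reference and timestamp are placeholders; the LOINC code 8867-4 is the standard code for heart rate.

```python
# A wearable heart-rate sample as a FHIR R4 Observation (illustrative).
# Any FHIR-conformant system can parse this without custom adapters.
heart_rate_observation = {
    "resourceType": "Observation",
    "status": "final",
    "category": [{"coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/observation-category",
        "code": "vital-signs"}]}],
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "8867-4", "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example"},   # placeholder patient id
    "effectiveDateTime": "2024-01-15T08:30:00Z",   # placeholder timestamp
    "valueQuantity": {"value": 72, "unit": "beats/minute",
                      "system": "http://unitsofmeasure.org", "code": "/min"},
}
```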

This study compares FL and centralized learning (CL), leading to the development of a web application for predictive analysis of data from sensors and mobile devices, interoperable in a clinical setting. It involves the analysis of multiple datasets through machine learning algorithms trained using federated and centralized methods. The results are evaluated using appropriate metrics to propose a solution that improves data interoperability. The main objective of this research is to strengthen the ability of healthcare systems to effectively manage and analyze data, thereby raising the quality of care and operational efficiency through the integration of advanced AI technologies.

Our proposed system significantly improves the performance and privacy of wearable IoT sensors by adopting federated learning for decentralized processing. This approach avoids the centralization of sensitive data, thus improving privacy and interoperability. The main objective of our approach is to reduce the vulnerabilities associated with centralizing sensitive patient records for analytics. Furthermore, the proposed solution combines FL and FHIR to extract valuable insights from dispersed health data without centralizing all patient information. This is a key distinction from traditional centralized learning and a significant privacy improvement in the healthcare context. Additionally, it supports real-time applications that use both real and synthetic data, ensuring compliance with FHIR standards, which promotes more dynamic and secure interactions within health technology systems.

The main contributions of this study are summarized as follows: We developed and validated an innovative application that uses AutoML to train machine learning models on interoperable healthcare data, using both FL and CL. This advancement improves data privacy and interoperability within healthcare systems. Our study provides an in-depth comparative analysis of FL and CL, evaluating their performance and effectiveness. Furthermore, it presents an intuitive tool for healthcare professionals to optimize and personalize patient care through cutting-edge AI technology.

The remainder of this article is organized into several sections for in-depth analysis. “Related Works” reviews the existing literature, focusing on key subtopics such as connected objects and IoT, Quantified Self, electronic medical records, and federated learning. “Methodology” details the programming stack used and describes the implementation process of the proposed solution. “Simulation Setup and Evaluation” presents the results, including a comparative analysis of the evaluation indicators of models trained with CL and FL. Finally, “Conclusion” summarizes the entire study, highlighting the critical need to address interoperability challenges in the healthcare sector through the application of FL, and also explores the implications for future research in this area.

Related works

This section covers several critical topics, including wearable devices and IoT, Quantified Self, electronic health records, and federated learning. There are a number of studies on distributed intelligence in smart cities (Hashem et al., 2024) and in healthcare using FL and FHIR (Naithani et al., 2024). Hashem et al. (2024) highlight how integrating distributed intelligence into IoT improves the efficiency and functionality of smart cities, covering a wide range of architectures, methodologies, and applications. Similarly, Naithani et al. (2024) highlight the key challenges of implementing FL in healthcare, such as data heterogeneity, computational resources, and regulatory compliance. These technologies are fundamental to a comprehensive understanding of the problems to be solved and crucial to designing effective solutions.

Wearable devices and IoT

Just as the Internet and smartphones have transformed our lives, wearables and IoT devices are swiftly reshaping how we live (Lo, Ip & Yang, 2016). Today, individuals not only share photos of their activities but also detailed data such as heart rate metrics, step counts, and maps of their travel routes, all facilitated by wearable technology. Therefore, understanding the nature of IoT and wearable devices is crucial for exploring the extent of their potential applications.

Wearable biosensors, often referred to as “wearables,” represent a significant interdisciplinary effort within health services to harness mobile health (mHealth) technology for improved data collection, diagnosis, treatment monitoring, and enhanced health insights. These devices, now readily available as consumer products, are increasingly used to collect and analyze basic physiological data such as weight, calorie intake, exercise, sleep patterns, body temperature, and heart rate (White, Liang & Clarke, 2019; Pardamean et al., 2020; Sharma & Rani, 2021), as well as more complex data such as energy expenditure, arrhythmia and other cardiovascular signals, and continuous glucose monitoring for diabetes mellitus (Witt et al., 2019). Due to the complexity of the field, the topics researchers study vary significantly. Below, some of the applications of wearables are discussed.

Wearable health technology has been the focus of research in recent years. Witt et al. (2019) analyze the algorithms used in popular wearable devices from 2014 to 2017, while Dian, Vahidnia & Rahmati (2020) survey recent developments in wearable IoT. Another study (Haleem et al., 2022) introduces Medical 4.0 as a significant advancement in healthcare, utilizing IoT for data collection and promoting patient-centred therapy. Pathak, Mukherjee & Misra (2023) develop SemBox, a wireless system that enhances interoperability among diverse wearable health monitoring devices. Ahmed et al. (2024) discuss the opportunities and challenges of the Internet of Medical Things (IoMT), highlighting the benefits of patient empowerment, healthcare collaboration, and data sharing. Fei & Ur-Rehman (2020) develop a low-power smart wristband that can monitor heart rate, count steps, and detect abnormal hand movements, together with a user-end program for data analysis and presentation. In a study involving 46 participants (de la Casa Pérez et al., 2022), the Xiaomi Mi Band 4 fitness tracker is found to be accurate and precise in tracking step count and heart rate. Another study (Liang & Chapa Martell, 2018) shows that consumer sleep tracking devices such as the Fitbit Charge 2 and the Neuroon eye mask are adequate for non-clinical use but not suitable for diagnosing sleep disorders.

Quantified-Self

Quantified-Self involves using technology to track various aspects of one’s life, such as biology, physical activity, behaviour, or environmental factors (Swan, 2013). The term ‘Quantified-Self’ was coined in 2007 by Wired writers Gary Wolf and Kevin Kelly, Kelly being the founder of the QS movement, a global community focused on self-tracking for gaining self-knowledge through data (Sharma & Rani, 2021; Akdevelioglu, Hansen & Venkatesh, 2021). Self-monitoring, measuring, and recording for self-improvement or reflection have historical roots, but the advent of digital technologies has sparked new interest and expanded the domains and applications of self-tracking. This subsection discusses the practical applications of the QS community and the potential impact it may have on improving mHealth.

Swan (2013) notes the rise of the QS movement, where individuals track various aspects of their well-being; the long-term vision is to optimize personal performance in real time. QS encompasses quantitative and qualitative data and is evolving into a “qualified self” by tracking qualitative aspects and promoting behavior change. Another article discusses the rise of self-tracking applications and devices and how they can be used to improve personal productivity (White, Liang & Clarke, 2019). A netnographic study was conducted on the Fitbit user community to explore wearable tech’s impact on the quantified self (Akdevelioglu, Hansen & Venkatesh, 2021); the study uncovers social engagement mechanisms, such as empowering motivation and friendly rivalry, that are built on this foundation. Billis, Batziakas & Bamidis (2015) explore how seniors have integrated self-tracking and how it can become a habitual practice; they developed a web application to visualize sensor data collected over six months in real-life settings. Almalki, Gray & Sanchez (2015) introduce the concept of Personal Health Information Self-Quantification Systems (PHI-SQS) and systematically review data management processes in eleven self-quantification tools, suggesting that self-quantification for personal health maintenance holds promise but that further research is needed to support its use in this context. Sharon (2017) studies how self-tracking devices can be used for individual health management and raise ethical concerns, and suggests a practice-based theory for revealing diverse enactments of values by self-trackers within the QS community. Erdeniz et al. (2020) introduce three novel recommendation methods for QS applications: Virtual Coach, Virtual Sleep Regulator, and Virtual Nurse. These methods aim to improve individuals’ health status through personalized exercise planning, physical activity recommendations that consider medical history and intended health status, and support for sleep quality issues. Lastly, Pardamean et al. (2020) discuss recent literature on the use of wearable devices to promote physical activity, mental wellness, and health awareness.

Federated learning

Federated learning is a technological innovation in distributed learning, addressing growing concerns about data security in the era of big data. As an alternative to centralized deep learning frameworks, federated learning offers a privacy-friendly approach by avoiding large-scale data centralization. Compared to traditional machine learning solutions, federated learning allows machine learning models to be trained directly on devices. This is achieved through coordination between multiple devices, called FL clients, and a central server, called an FL aggregator, without the exchange of raw data. This approach is particularly crucial for protecting sensitive information (Yang et al., 2023).

Tyagi, Rajput & Pandey (2023) present a survey of FL, a decentralized solution for collaborative model training with data privacy guarantees. Another article (Xu et al., 2021) explains that FL can be an effective way of linking and analyzing private health data from different sources. The method can be used to solve the problems caused by privacy concerns in exchanging electronic health records among different organizations. In their work, Khan et al. (2024) propose Fed-Inforce-Fusion, a privacy-preserving FL-based Intrusion Detection System (IDS) for IoMT. The model utilizes reinforcement learning for finding relationships in medical data and federatively trains a composite IDS model on the nodes of a smart healthcare system (SHS). An FL approach to depression detection with the guarantee of patient data privacy is proposed in Gupta et al. (2022). The cluster-based model is an enhancement over conventional machine learning models and outperforms other collaborative learning approaches.

Interoperability of electronic health records

Electronic medical records (EMRs) have revolutionized the healthcare industry by establishing common representations of health data, making it easier to share. EMRs do not define exchange standards themselves; rather, they structure and standardize health information. Standardization is a prerequisite for interoperability, that is, seamless communication between different applications without loss of data (Bhartiya, Mehrotra & Girdhar, 2016). EMRs consolidate patient data for efficient sharing, in accordance with ISO standards. To bridge the gap in data interoperability, application standards such as OpenEHR and Fast Healthcare Interoperability Resources (FHIR) (Famá, Faria & Portugal, 2022) play a vital role in IoT architectures for digital health. This subsection explores the various applications of EMRs and their importance.

Roehrs et al. (2019) propose a personal health record (PHR) interoperability model named OmniPHR that utilizes a standard ontology and AI with natural language processing (NLP) to achieve interoperability. The model is evaluated using a real database of anonymized patient records, demonstrating the feasibility of harmonizing data from various standards into a unified format. Another semantic ontology-based model that achieves interoperability in EHRs is proposed in Adel et al. (2022). The model unifies data formats, accommodating five healthcare data standards and allowing physicians to interact with diverse systems through a single interface; the integrated ontology facilitates improved patient care. Data from 68 oncology sites using five different EHR vendor products are analyzed in Bernstam et al. (2022) to quantify the interoperability of real-world EHR implementations concerning clinically relevant structured data. The results show that intra-EHR-vendor interoperability is notably higher than intersystem interoperability, highlighting the lack of standardization for clinically relevant data. Bhartiya, Mehrotra & Girdhar (2016) discuss the challenges of accessing and sharing EHRs in team-based healthcare, highlighting issues related to privacy, data security, and interoperability caused by heterogeneity in EHR systems, and identify different approaches and challenges in achieving interoperability in EHR sharing. Evans (2016) analyzes the evolution of EHRs from 1992 to 2015 and speculates on their expected state in the next 25 years. EHRs have been used for a long time, but technical, procedural, social, political, ethical, and security concerns overshadow their usage. Although current EHRs do not entirely meet the evolving needs of the healthcare landscape, emerging EHR technology will help establish international standards for interoperable applications, facilitating precision medicine and a learning health system. Daliya & Ramesh (2019) propose a hybrid approach to handle heterogeneous data from various sources in an IoT-based healthcare system; the approach extracts data meaning from diverse sources while ensuring uniformity in data format, contributing to enhanced data interoperability. Another study (Khalique, Khan & Nosheen, 2019) proposes a framework for managing public healthcare data based on standardized protocols: it utilizes EHRs at basic health facilities, consolidates data from multiple sources, and incorporates contextual information using HL7 as an interoperability standard. Researchers from Italy developed a national IT platform for the interoperability of EHR systems (Silvestri et al., 2019), introducing a Big Data architecture for EHRs that offers valuable insights for healthcare professionals, patients, and decision-makers.

In contrast to earlier research, the current study stands out because it aims to achieve interoperability by employing FL architecture and EHR protocol. The proposed architecture functions not just with refined datasets but also with wearable devices.

Methodology

This section describes the design and implementation of the system, which includes data sources (agents), a FHIR server, a web user interface, and an FL server.

System design: centralized vs federated learning

Traditionally, CL requires all edge devices to share their data with some centralized node or server, which then trains the model on the aggregated data. This approach is efficient but raises privacy concerns, as the data is centralized. Figure 1 shows the design of CL.


Figure 1: Centralized learning design.

FL, on the other hand, allows edge devices to train the model locally and share only the model updates with the FL aggregator (or FL server), as shown in Fig. 2. The FL aggregator takes these model parameters and aggregates them (e.g., by averaging), converging all updates into one global model, which is then redistributed to the clients. This approach is more privacy-preserving, but it is more complex and may require more communication between devices.


Figure 2: Federated learning design.
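
To illustrate the aggregation step, below is a minimal sketch of FedAvg-style weighted averaging, the canonical FL aggregation rule, in Python/NumPy. The toy weights and sample counts are invented for illustration; a production aggregator (e.g., an FL framework's built-in FedAvg strategy) handles this internally.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client model parameters (FedAvg).

    client_weights: one list of np.ndarray layers per client
    client_sizes: number of local training samples per client
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]

# Toy usage: two clients, one weight matrix each
w1 = [np.array([[1.0, 2.0]])]
w2 = [np.array([[3.0, 4.0]])]
global_w = fedavg([w1, w2], client_sizes=[100, 300])
# client 2 holds 3x the data, so the average leans toward its weights:
# global_w == [array([[2.5, 3.5]])]
```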

System implementation

The application adheres to free and open-source software (FOSS) principles, avoiding non-FOSS dependencies, and features a modular architecture with replaceable components. The proof-of-concept solution focuses on Mi Band and CSV file agents, is low-code, configurable, includes an understandable GUI, and supports supervised learning for classification and regression.

Figure 3 illustrates an FL system integrating FHIR for processing healthcare data, highlighting interactions between edge devices, the FL server, and the FHIR server. Each edge device, representing a local system such as a wearable device, holds local data and models and includes an FL client for local model training, an FHIR client for managing data, and a data service (DS) agent for translating and managing data exchange. Currently, the mapping and translation specifications within the DS agents are implemented manually for each type of wearable data we integrate (e.g., steps, heart rate, sleep data). The FHIR server stores healthcare data using FHIR standards, while the FL server aggregates local model parameters from edge devices, trains a global model, and distributes global model parameters back to the edge devices. Data flows involve edge devices sending FHIR data to the FHIR server, using local data for training, and exchanging model parameters to ensure privacy and collective learning. The system ensures syntactic interoperability through FHIR, enabling seamless data exchange, and leverages FL to train models across institutions without sharing sensitive data, maintaining scalability and privacy. The integration of FHIR and FL facilitates effective data utilization, privacy preservation, and improved model accuracy, supported by a graphical user interface (GUI) for user interaction and monitoring. For a more comprehensive analysis, we use classification to determine the activity, such as walking or running, whereas regression quantifies continuous values such as speed and calorie expenditure.


Figure 3: The diagram of the proposed solution.

Methodology for privacy and interoperability

The approach used on our platform to improve the privacy and interoperability of health information uses FL in combination with FHIR. This section describes the integration of the two technologies to achieve these goals.

Privacy preservation with federated learning

FL underpins data privacy in our system. It allows machine learning models to be trained locally on the original data (e.g., from connected objects), so sensitive data is never transmitted to a central server. Instead, only model updates, such as weights or gradients, are sent to the server, which averages them to generate a global model. This decentralized training process inherently eliminates the need for data centralization. Keeping data at its source and never transmitting raw patient records between facilities significantly enhances data privacy.

Data interoperability with FHIR

To address interoperability challenges, our system uses the FHIR standard. FHIR provides a standardized and widely accepted model for the digital representation and transmission of health information. With FHIR, our system enables consistent interpretation and processing of health data from diverse sources, including wearable devices and disparate health systems. Our platform integrates FHIR by mapping all health data in the system to the standardized FHIR model. Data is stored and retrieved via the FHIR protocol, ensuring consistency and interoperability.
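
As a concrete illustration, the following sketch shows how a client could store and retrieve such standardized resources against a FHIR server using FhirPy, the client library adopted in our stack (see “Experiments Setup”). The server URL, patient reference, and values are placeholders.

```python
from fhirpy import SyncFHIRClient

# Placeholder endpoint; in our setup this would point to the HAPI FHIR server
client = SyncFHIRClient("http://localhost:8080/fhir")

# Store a heart-rate reading as a standardized FHIR Observation
observation = client.resource(
    "Observation",
    status="final",
    code={"coding": [{"system": "http://loinc.org",
                      "code": "8867-4", "display": "Heart rate"}]},
    subject={"reference": "Patient/example"},
    effectiveDateTime="2024-01-15T08:30:00Z",
    valueQuantity={"value": 72, "unit": "beats/minute"},
)
observation.save()

# Any FHIR-conformant consumer can retrieve it with a standard search
readings = (client.resources("Observation")
                  .search(subject="Patient/example", code="8867-4")
                  .fetch())
```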

Integration of fast healthcare interoperability resources and federated learning

Our system provides seamless integration of FL and FHIR through a modular design. This integration is achieved through local agents running on devices (e.g., portable devices as FL clients). These agents have the following features:

  • FHIR client: Each agent has a FHIR client module that enforces all data management within the agent to meet FHIR standards.

  • Data service (DS) agent for FHIR translation: The DS agent is arguably the most crucial element, as it converts the device’s local data formats to the standardized FHIR format. Once a mapping specification has been defined for a data type, the translation runs automatically for every record, so individual readings do not need to be translated manually on a case-by-case basis.

This tight integration of FHIR within our federated learning platform enables seamless data exchange and significantly improves interoperability between heterogeneous healthcare systems and wearable devices without losing the intrinsic privacy benefits of federated learning and minimizing the need for manual data intervention.
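
The sketch below illustrates what such a DS-agent translation might look like for a Mi Band step-count sample. The input field names are hypothetical stand-ins for the device's export schema; the output follows the FHIR Observation structure, with LOINC code 55423-8 ("Number of steps").

```python
from datetime import datetime, timezone

def mi_band_to_fhir(sample: dict, patient_id: str) -> dict:
    """Translate a raw Mi Band step-count sample into a FHIR Observation.

    `sample` is assumed to look like {"steps": 8500, "timestamp": "..."};
    these field names are illustrative, not the device's actual schema.
    """
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "activity"}]}],
        "code": {"coding": [{"system": "http://loinc.org",
                             "code": "55423-8",
                             "display": "Number of steps"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": sample.get(
            "timestamp", datetime.now(timezone.utc).isoformat()),
        "valueQuantity": {"value": sample["steps"], "unit": "steps"},
    }
```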

Dataset

Since the solution focuses on two types of machine learning problems, classification and regression, we consider only datasets that are suitable for these tasks and collected by wearable devices.

Classification

For classification tasks, we use well-known datasets for predicting physical activities. Activity recognition is a classic classification task where the data is labeled, allowing for the training and evaluation of machine learning models.

Harvard’s Apple Watch dataset

The dataset (https://www.kaggle.com/datasets/aleespinosa/apple-watch-and-fitbit-data/data) from Harvard University’s research in Fuller et al. (2020) is utilized for solving classification problems. The research aims to investigate if commercial wearable devices can effectively forecast sitting, lying, and varying levels of physical activity. Scientists enlisted 46 volunteers, including 26 women, in a convenience sample to utilize three gadgets: a GENEActiv, an Apple Watch Series 2, and a Fitbit Charge HR2. The research shows that Apple Watch and Fitbit, popular wearable devices, can accurately predict the type of physical activity. The findings endorse the utilization of real-time data from Apple Watch and Fitbit with machine learning methods for large-scale classification of physical activity types among the general population. The Apple Watch dataset exhibits the following characteristics: 18 variables, 3,656 observations, no missing cells (0.0%) or duplicate rows (0.0%), a total size in memory of 514.2 KiB, comprising 16 numeric and two categorical variable types.

HARTH dataset

The HARTH dataset (https://www.kaggle.com/datasets/joebeachcapital/harth-dataset) (Logacjov et al., 2023) comprises 3-axial accelerometer data from 22 participants, with sensors placed on the thigh and lower back, recording acceleration in three dimensions at a high sampling rate for detailed motion tracking. The time-series data is accompanied by annotations of various activities and includes raw signals that allow for custom processing and feature extraction. Additionally, the dataset provides metadata on participant demographics and experiment protocols, supporting comprehensive analysis for human activity recognition research. The HARTH dataset exhibits the following characteristics: seven variables, 110,116 observations, no missing cells (0.0%) but a small number of duplicate rows (618, representing 0.6% of the data), a total size in memory of 5.9 MiB, and comprising seven numeric variable types.

Regression

For regression tasks, we use data for estimating calorie expenditure from physical activity. The targets are continuous values, allowing models to predict numerical outcomes.

Mi band

Compared to classification, regression requires less data preparation: regression models do not need manually labeled target categories, so raw data from the device itself can be used without heavy processing. The solution is tested using activity data from one of the authors’ Mi Band 3, 4, and 7 devices. The models attempt to predict the number of calories burned during activity from features that include step count, walking distance, and running distance. The Mi Band dataset exhibits the following characteristics: five variables, 2,454 observations, no missing cells (0.0%) or duplicate rows (0.0%), a total size in memory of 96.0 KiB, and comprising one DateTime and four numeric variable types.

DAT263x

To demonstrate the robustness of the results, we include another similar dataset, DAT263x. This dataset accompanies the edX course “Microsoft DAT263x: Introduction to Artificial Intelligence (AI)”, is intended for use with Azure, and is available on Kaggle (https://www.kaggle.com/datasets/fmendes/fmendesdat263xdemos). It contains each participant’s gender, age, weight, height, exercise duration, heart rate, body temperature, and calories burned during physical activity. The DAT263x dataset exhibits the following characteristics: eight variables, 12,000 observations, no missing cells (0.0%) but a minimal number of duplicate rows (1, representing less than 0.1% of the data), a total size in memory of 750.1 KiB, and comprising one categorical and seven numeric variable types.

Synthetic dataset

Since real data may be non-independent and identically distributed (non-IID), partitioning can strongly affect the results of CL and FL analyses. Therefore, it was decided to synthesize data via oversampling. Specifically, the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN) were chosen as practical methods to address potential class imbalance, a key characteristic of non-IID data, in our simulated FL environment. As shown in Chua, Sii & Ellyza Nohuddin (2022), these techniques can significantly improve the performance of ML models on fitness data, providing a valuable starting point for mitigating non-IID effects. SMOTE is used to synthesize data for classification tasks: it balances the class distribution within each client’s data partition before FL, thus aiming to create more robust local models despite possible non-IID class distributions across clients. ADASYN is used for regression problems, adaptively generating synthetic samples for underrepresented target values and focusing on hard-to-learn examples. For the Mi Band dataset, ADASYN is applied to address potential imbalances in the distribution of the regression target variable (calorie expenditure) among clients in a non-IID setting. Although SMOTE and ADASYN are not exhaustive solutions to all non-IID challenges, they represent a pragmatic and effective initial strategy for mitigating class imbalance, a common manifestation of non-IID data in wearable sensor applications.

SMOTE is a commonly used method for dealing with imbalanced datasets by generating synthetic samples for the minority class. This technique involves creating new instances by interpolating between existing minority class samples. SMOTE identifies the nearest neighbors of each minority class instance and creates new synthetic examples by interpolating between these neighbors. These synthetic samples are designed to balance the class distribution, thus avoiding overfitting that can occur with standard oversampling methods.
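
A minimal sketch of this oversampling step is shown below, using the SMOTE implementation from the imbalanced-learn library (one common implementation, shown here for illustration); the toy dataset stands in for one client's imbalanced activity labels.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for a client's activity labels
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # roughly {0: 900, 1: 100}

# Interpolate between minority-class neighbors until classes are balanced
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # both classes now equally represented
```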

In our study, we applied SMOTE to the Apple Watch dataset. Figure 4 represents the data distribution in the Age column. As the density plot shows, the synthetic and real data fit almost perfectly.


Figure 4: Classification: SMOTE vs real data distribution.

SMOTE cannot be used for regression problems because the target column is continuous rather than categorical. Therefore, other oversampling techniques were tested; among them, adaptive synthetic sampling (ADASYN) proved the most suitable for the real data, and it is therefore used for the Mi Band dataset.

ADASYN is an algorithm used to mitigate the problem of unbalanced class distributions in machine learning. It works by adaptively generating synthetic samples, primarily for the minority class, focusing on the samples that are most difficult to learn. Unlike simple oversampling techniques that can replicate minority class data, ADASYN improves on this process by creating new synthetic data points that are similar, but not identical, to existing data. The number of synthetic samples generated for each minority class example is weighted based on how difficult that example was to learn, thus promoting better model generalization.

The adaptive nature of ADASYN makes it particularly useful in situations where some minority class examples are more difficult to classify than others. By generating more synthetic data for these complex examples, ADASYN results in a more balanced dataset, contributing to better performance in classification tasks.

The proposed solution utilizes the ADASYN implementation from the ImbalancedLearningRegression library (Wu, Kunz & Branco, 2022), applied to the Mi Band dataset. As shown in Fig. 5, even though the distribution is not identical, the synthetic data closely resembles the original data.


Figure 5: Regression: ADASYN vs real data distribution.
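
A minimal sketch of how this could be invoked is given below; the CSV file name and column names are hypothetical, and the adasyn function's keyword arguments should be checked against the installed version of ImbalancedLearningRegression.

```python
import pandas as pd
import ImbalancedLearningRegression as iblr

# Mi Band-style activity log; file and column names are illustrative
df = pd.read_csv("mi_band_activity.csv")  # steps, walk_dist, run_dist, calories

# Adaptively oversample rare calorie-expenditure values so the regression
# target is better represented before partitioning data across FL clients
df_balanced = iblr.adasyn(data=df, y="calories")
```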

Simulation setup and evaluation

This section describes the simulation environment setup and the methods used to evaluate the performance of FL and CL. The setup details the hardware, software, and settings used, as well as the evaluation metrics applied to assess the model accuracy and efficiency.

Experiments setup

The proof-of-concept uses open-source, low-code technologies. The GUI is built with Streamlit, an open-source Python library that enables straightforward, script-like code and integrates well with popular libraries such as Pandas and PyCaret. PyCaret is the chosen AutoML framework, simplifying model development through a simple API. Flower is used as the FL framework, known for handling large-scale experiments and supporting diverse edge devices. For FHIR, HAPI FHIR is used for data sharing, with FhirPy as the lightweight client for RESTful API access. The source code is available in a GitHub repository (https://gitfront.io/r/weeebdev/gnm1dXbMqiHh/auto-fl-fit) under the MIT license.
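
To show how these pieces fit together, below is a minimal sketch of a Flower client (using the Flower 1.x NumPyClient API) wrapping a scikit-learn model whose coefficients serve as the exchanged parameters. The random data, model choice, and server address are placeholders, not the platform's actual configuration.

```python
import numpy as np
import flwr as fl
from sklearn.linear_model import SGDClassifier

# Stand-in local data; on the platform each edge device loads its own partition
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)
model = SGDClassifier(loss="log_loss")
model.partial_fit(X, y, classes=np.array([0, 1]))  # initialize coef_/intercept_

class WearableClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [model.coef_, model.intercept_]

    def fit(self, parameters, config):
        model.coef_, model.intercept_ = parameters  # load the global model
        model.partial_fit(X, y)                     # one round of local training
        return [model.coef_, model.intercept_], len(X), {}

    def evaluate(self, parameters, config):
        model.coef_, model.intercept_ = parameters
        acc = model.score(X, y)
        return 1.0 - acc, len(X), {"accuracy": float(acc)}

# Connect to a running Flower server (address is a placeholder)
fl.client.start_numpy_client(server_address="127.0.0.1:8080",
                             client=WearableClient())
```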

The experiments were conducted on a MacBook Pro 14 M1. The machine is powered by an 8-core Apple M1 Pro processor, including six performance cores and two efficiency cores, 16 GB of unified memory, and a 14-core GPU. It runs macOS Sonoma 14.3.

To avoid the complexity of network constraints (latency and bandwidth limitations), the heterogeneity of devices with different computing resources, security challenges, and the overhead of a distributed environment, we prioritize local simulation, focused on validating the key features of the proposed approach (i.e., the feasibility of FHIR with FL). Local simulation also ensures the reproducibility of this work. Since experiments are simulated locally using a single dataset at a time, it is crucial to clarify the approach for partitioning the data. A main advantage of using a single dataset is that it allows us to rigorously establish and evaluate the proposed methodology and platform for FL integrated with FHIR. Moreover, a single, well-characterized dataset allows us to systematically study the main performance characteristics of our approach and ensures the robustness of our experimental setup. Assessing the solution requires utilizing various partition proportions and considering different numbers of clients. To guarantee a fair comparison between centralized and FL models, the data is divided into pairs of 60%/40%, 70%/30%, 80%/20%, and 90%/10%, resulting in two distinct partitions. These partitioning proportions are used to assess the sensitivity of model performance to different volumes of training data and to simulate scenarios with varying degrees of data sparsity at each client. This allows us to explore the robustness of federated and centralized learning under different data availability conditions and to observe the effect of the proportions on the comparative performance of the two learning paradigms. Another reason for these proportions is to provide a range of partitionings, from more balanced to highly imbalanced, presenting a range of data distribution challenges. The first partition is used to train a model with traditional centralized techniques. The resultant model, termed the base model, is then tuned and tested with the second partition for the CL evaluation, as shown in Fig. 6.


Figure 6: Evaluation of centralized learning.

As shown in Fig. 7, for the FL evaluation the base model acts as the initial global model and is distributed to k clients. In this study, we set k to 2, 4, and 8 to demonstrate the fundamentals of FL in a simulated environment and to observe performance trends as the number of participating clients increases. The second partition is then split using KFold (StratifiedKFold is excluded because it does not represent realistic scenarios, where data is non-IID and any one client may not reflect the entire dataset) and distributed among these k clients. Each client refines the global model with its own data fold. The results are aggregated and compared to the performance of the optimized centralized models.


Figure 7: Evaluation of federated learning.
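
Under these assumptions, the partitioning procedure can be sketched as follows; df, the split ratio, and k are placeholders matching the configurations described above.

```python
from sklearn.model_selection import KFold, train_test_split

# First/second partitions, e.g. the 80%/20% pair
part1, part2 = train_test_split(df, train_size=0.8, random_state=42)

# part1 -> centralized training of the base model (not shown here)

# Split the second partition among k FL clients. Plain KFold (not stratified)
# is used, so per-client label distributions may differ, mimicking non-IID data.
k = 4
client_folds = [
    part2.iloc[test_idx]
    for _, test_idx in KFold(n_splits=k, shuffle=True,
                             random_state=42).split(part2)
]
```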

Regarding synthetic data, the evaluations are performed exactly as in CL and FL, but the initial dataset followed the synthetic data generation process as shown in Fig. 8.


Figure 8: Generation of synthetic data.

Performance evaluation

Not all models can be trained in a federated manner, since this depends on support for incremental (or online) learning, in which models are continually updated as new data becomes available, without retraining from scratch. PyCaret provides five such models: Extra Trees, LightGBM, Extreme Gradient Boosting, Random Forest, and CatBoost. Thus, the results are obtained using those models. Additionally, to simplify the presentation of results, models are referred to by codenames instead of their full names: et stands for Extra Trees, xgboost for Extreme Gradient Boosting, rf for Random Forest, lightgbm for LightGBM, and catboost for CatBoost. While the experiments are conducted on various data partitions, only the 80%/20% split is discussed here. Note that in the upcoming figures, the x-axis represents the number of clients k and the y-axis represents the metric.
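
A minimal sketch of creating these five models through PyCaret's API is given below; the DataFrame and target column name are placeholders, and the xgboost and catboost codenames additionally require their respective packages to be installed.

```python
from pycaret.classification import setup, create_model

# df holds one client's labeled activity data; "activity" is the target column
setup(data=df, target="activity", session_id=42)

# The five incremental-learning-capable models, by PyCaret codename
models = {code: create_model(code)
          for code in ["et", "rf", "lightgbm", "xgboost", "catboost"]}
```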

Classification

This subsection discusses the simulation results for classification experiments, using the metrics described in Table 1 for evaluation. Accuracy indicates the overall proportion of correct predictions, precision and recall focus on the model’s ability to correctly identify positive cases, the F1 score balances precision and recall, and Kappa and the Matthews correlation coefficient (MCC) measure agreement beyond chance. For a detailed explanation of these evaluation metrics, see Chua, Sii & Ellyza Nohuddin (2022).

Table 1:
Evaluation metrics for classification.
Metric Formula
Accuracy $\frac{TP+TN}{TP+TN+FP+FN}$
Precision $\frac{TP}{TP+FP}$
Recall $\frac{TP}{TP+FN}$
F1 $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$
Log Loss $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$
Kappa $\frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the expected agreement
MCC $\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
DOI: 10.7717/peerj-cs.2870/table-1

Note:

TP, true positive; FP, false positive; TN, true negative; FN, false negative.
Apple Watch dataset

In the analysis of the Apple Watch dataset, the FL approach demonstrated performance comparable to CL across multiple evaluation metrics and data partitions. For instance, in the 80%/20% partition, the FL models consistently achieved higher F1-scores than their CL counterparts, with the Random Forest and CatBoost algorithms exhibiting F1-scores of 0.8168 and 0.76065 for FL vs. 0.7951 and 0.7277 for CL, as shown in Table 2. Note that the values highlighted in bold represent the best results. Additionally, the FL approach exhibited higher MCC scores, as shown by the Extra Trees model’s MCC of 0.8233 for FL, outperforming the CL model’s MCC of 0.8196. Furthermore, the FL models demonstrated an advantage in accuracy, with the Extra Trees algorithm achieving 0.8750 accuracy for FL compared to 0.8503 for CL (k = 8). Lastly, the FL models’ performance matches their CL counterparts in terms of Kappa values, exemplified by the CatBoost model’s Kappa of 0.7699 (FL) vs. 0.6714 (CL).

Table 2:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on Apple Watch dataset with partition 80%/20%.
Type Model k Accuracy F1 Kappa MCC
CL xgboost 2 0.7823 0.7829 0.7371 0.7377
FL xgboost 2 0.7905 0.7801 0.7471 0.7506
CL lightgbm 2 0.7007 0.7029 0.6376 0.6383
FL lightgbm 2 0.7095 0.6967 0.6493 0.6537
CL catboost 2 0.7279 0.7277 0.6714 0.6715
FL catboost 2 0.7635 0.7457 0.7142 0.7195
CL rf 2 0.7959 0.7951 0.7532 0.7535
FL rf 2 0.8176 0.8156 0.7789 0.7825
CL et 2 0.8503 0.8518 0.8189 0.8196
FL et 2 0.8244 0.8134 0.7878 0.7910
CL xgboost 4 0.7823 0.7829 0.7371 0.7377
FL xgboost 4 0.7770 0.7734 0.7293 0.7346
CL lightgbm 4 0.7007 0.7029 0.6376 0.6383
FL lightgbm 4 0.6892 0.6909 0.6215 0.6267
CL catboost 4 0.7279 0.7277 0.6714 0.6715
FL catboost 4 0.7635 0.7607 0.7123 0.7156
CL rf 4 0.7959 0.7951 0.7532 0.7535
FL rf 4 0.8176 0.8168 0.7785 0.7827
CL et 4 0.8503 0.8518 0.8189 0.8196
FL et 4 0.8514 0.852 0.8198 0.8233
CL xgboost 8 0.7823 0.7829 0.7371 0.7377
FL xgboost 8 0.7566 0.7467 0.7044 0.7158
CL lightgbm 8 0.7007 0.7029 0.6376 0.6383
FL lightgbm 8 0.7631 0.7516 0.711 0.7203
CL catboost 8 0.7279 0.7277 0.6714 0.6715
FL catboost 8 0.8092 0.8008 0.7671 0.7775
CL rf 8 0.7959 0.7951 0.7532 0.7535
FL rf 8 0.7961 0.7908 0.7516 0.7588
CL et 8 0.8503 0.8518 0.8189 0.8196
FL et 8 0.875 0.8694 0.8486 0.8571
DOI: 10.7717/peerj-cs.2870/table-2

Note:

Values in bold represent the best results.

Experiments on the Apple Watch dataset reveal that FL outperforms or matches CL in key metrics such as F1 score, MCC, accuracy, and Kappa values across various algorithms and data partitions, demonstrating its viability for processing wearable sensor data.

HARTH dataset

In the next experiment, utilizing the HARTH dataset, FL models are slightly outperformed by CL models in terms of F1-score, with the FL Random Forest and Extra Trees models coming closest, achieving F1-scores of 0.92075 and 0.9222 against the CL models’ 0.9293 and 0.9287, as shown in Table 3. FL models also demonstrated analogous Kappa scores, with the FL Random Forest and Extra Trees models scoring 0.89915 and 0.9028 against the CL models’ 0.9089 and 0.9089. In terms of accuracy, FL models remained comparable to CL, with the FL Extra Trees model reaching an accuracy of 0.93495 compared to the CL model’s 0.9397. Finally, FL models achieved similar MCC values, exemplified by the FL Random Forest model’s MCC of 0.905825, almost equal to the CL model’s 0.9105.

Table 3:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on HARTH dataset with partition 80%/20%.
Type Model k Accuracy F1 Kappa MCC
CL xgboost 2 0.9339 0.9263 0.9009 0.9020
FL xgboost 2 0.9281 0.9203 0.8923 0.8933
CL lightgbm 2 0.8547 0.8521 0.7863 0.7865
FL lightgbm 2 0.7421 0.7450 0.6469 0.6581
CL catboost 2 0.9339 0.9270 0.9011 0.9020
FL catboost 2 0.9294 0.9218 0.8942 0.8952
CL rf 2 0.9395 0.9293 0.9089 0.9105
FL rf 2 0.9314 0.9207 0.8966 0.8982
CL et 2 0.9397 0.9287 0.9089 0.9108
FL et 2 0.9349 0.9222 0.9018 0.9038
CL xgboost 4 0.9339 0.9263 0.9009 0.9020
FL xgboost 4 0.9212 0.9140 0.8821 0.8829
CL lightgbm 4 0.8547 0.8521 0.7863 0.7865
FL lightgbm 4 0.7607 0.7617 0.6510 0.6528
CL catboost 4 0.9339 0.9270 0.9011 0.9020
FL catboost 4 0.9274 0.9197 0.8910 0.8921
CL rf 4 0.9395 0.9293 0.9089 0.9105
FL rf 4 0.9332 0.9214 0.8992 0.901
CL et 4 0.9397 0.9287 0.9089 0.9108
FL et 4 0.9357 0.9242 0.9028 0.9049
CL xgboost 8 0.9339 0.9263 0.9009 0.9020
FL xgboost 8 0.9198 0.9094 0.8791 0.8808
CL lightgbm 8 0.8547 0.8521 0.7863 0.7865
FL lightgbm 8 0.5807 0.5754 0.4213 0.4303
CL catboost 8 0.9339 0.9270 0.9011 0.9020
FL catboost 8 0.9263 0.9169 0.8890 0.8907
CL rf 8 0.9395 0.9293 0.9089 0.9105
FL rf 8 0.9363 0.9262 0.9039 0.9058
CL et 8 0.9397 0.9287 0.9089 0.9108
FL et 8 0.9367 0.9246 0.9042 0.9064
DOI: 10.7717/peerj-cs.2870/table-3

Note:

Values in bold represent the best results.

The experiment with the HARTH dataset shows that while FL slightly trails CL in F1-score, it maintains competitive performance in Kappa, accuracy, and MCC values, demonstrating the robustness and near-equivalence of FL in wearable sensor data analysis.

Regression

This subsection discusses the simulation results for regression experiments, using the metrics described in Table 4 for evaluation. Root mean square error (RMSE) and mean absolute error (MAE) measure the average magnitude of prediction errors, R-squared (R2) indicates the proportion of variance explained by the model, and mean absolute percentage error (MAPE) expresses the error as a percentage.

Table 4:
Evaluation metrics for regression.
Metric Formula
R2 $1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$
MAE $\frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$
MSE $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
RMSE $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
MAPE $\frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$
MPE $\frac{1}{N}\sum_{i=1}^{N}\frac{y_i - \hat{y}_i}{y_i}$
DOI: 10.7717/peerj-cs.2870/table-4
Mi Band dataset

Observing the Mi Band dataset results, FL models are not uniformly comparable with CL models; note that for RMSE, MAE, and MAPE, lower values are better. In the first experiment, with emphasis on RMSE, the FL versions of Random Forest and Extra Trees performed notably better than the CL versions, as shown in Table 5. When comparing MAE, the FL Random Forest model consistently shows better results, with an MAE of 35.833938, much lower than the CL version’s MAE of 58.9688. When evaluating the R-squared metric, the FL version of Random Forest demonstrates a superior R-squared value of 0.84735, suggesting improved model fit and generalization compared to the CL version’s R-squared of 0.5712. Finally, the FL versions of Random Forest and XGBoost surpass the CL versions in terms of MAPE, with scores of 0.162725 and 0.289463 vs. 0.1808 and 0.299, respectively.

Table 5:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on Mi Band dataset with partition 80%/20%.
Type Model k MAE MAPE RMSE R2
CL xgboost 2 80.7234 0.2990 175.7207 0.3732
FL xgboost 2 84.6606 0.2887 190.6013 0.1757
CL lightgbm 2 56.2537 0.2494 102.3229 0.7875
FL lightgbm 2 55.0917 0.2616 104.5867 0.7604
CL catboost 2 59.5378 0.2006 133.4747 0.6384
FL catboost 2 66.4891 0.2168 177.1278 0.2835
CL rf 2 58.9688 0.1808 145.3486 0.5712
FL rf 2 54.4051 0.1673 136.3432 0.5863
CL et 2 60.8821 0.1798 178.1365 0.3559
FL et 2 63.1281 0.1867 167.8168 0.3367
CL xgboost 4 80.7234 0.2990 175.7207 0.3732
FL xgboost 4 80.5563 0.2619 185.8783 0.0215
CL lightgbm 4 56.2537 0.2494 102.3229 0.7875
FL lightgbm 4 68.4518 0.3542 116.4871 0.7082
CL catboost 4 59.5378 0.2006 133.4747 0.6384
FL catboost 4 72.1597 0.2821 141.9439 0.6008
CL rf 4 58.9688 0.1808 145.3486 0.5712
FL rf 4 57.8445 0.1788 112.599 0.7283
CL et 4 60.8821 0.1798 178.1365 0.3559
FL et 4 69.8300 0.2215 131.7045 0.6446
CL xgboost 8 80.7234 0.2990 175.7207 0.3732
FL xgboost 8 68.2938 0.2895 137.1146 0.3150
CL lightgbm 8 56.2537 0.2494 102.3229 0.7875
FL lightgbm 8 45.2727 0.2716 68.7009 0.7929
CL catboost 8 59.5378 0.2006 133.4747 0.6384
FL catboost 8 48.6732 0.1928 87.6797 0.6603
CL rf 8 58.9688 0.1808 145.3486 0.5712
FL rf 8 35.8339 0.1627 62.9584 0.8474
CL et 8 60.8821 0.1798 178.1365 0.3559
FL et 8 44.2294 0.201 82.3557 0.7238
DOI: 10.7717/peerj-cs.2870/table-5

Note:

Values in bold represent the best results.

The analysis of the Mi Band dataset indicates that FL models generally outperform CL models in RMSE, MAE, R-squared, and MAPE metrics, showcasing their superior performance and efficiency in wearable sensor data processing.

DAT263x dataset

Based on the results for the DAT263x dataset, we observe that FL models generally outperform their CL counterparts across different partitions and evaluation metrics. Focusing on RMSE, the FL Random Forest model demonstrates better results than the CL one, as shown in Table 6. For MAE, the FL versions of Random Forest and Extra Trees can outperform CL, while the FL variant of CatBoost achieves an MAE of 0.4802, slightly higher than its CL counterpart’s 0.4466. Assessing the R-squared metric, the FL variants of Random Forest and Extra Trees exhibit impressive R-squared values of 0.9982 and 0.998725, indicating good model fit and generalization compared to their CL counterparts. Lastly, focusing on MAPE, the FL variants of Random Forest sometimes outperform their CL counterparts, achieving lower MAPE values, while for other values of k the results remain roughly equal.

Table 6:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on DAT263x dataset with partition 80%/20%.
Type Model k MAE MAPE RMSE R2
CL xgboost 2 1.6568 0.0295 2.3841 0.9986
FL xgboost 2 1.6566 0.0278 2.4216 0.9985
CL lightgbm 2 1.1604 0.0237 1.6357 0.9993
FL lightgbm 2 1.3611 0.0277 1.9127 0.999
CL catboost 2 0.4466 0.0116 0.5867 0.9999
FL catboost 2 0.4802 0.0126 0.6343 0.9999
CL rf 2 1.7734 0.0285 2.7522 0.9981
FL rf 2 1.703 0.0277 2.6049 0.9983
CL et 2 1.5011 0.0234 2.3334 0.9987
FL et 2 1.5064 0.024 2.3356 0.9986
CL xgboost 4 1.6568 0.0295 2.3841 0.9986
FL xgboost 4 1.7225 0.032 2.4611 0.9985
CL lightgbm 4 1.1604 0.0237 1.6357 0.9993
FL lightgbm 4 1.3370 0.0299 1.8577 0.9992
CL catboost 4 0.4466 0.0116 0.5867 0.9999
FL catboost 4 0.512 0.0159 0.7042 0.9999
CL rf 4 1.7734 0.0285 2.7522 0.9981
FL rf 4 1.7735 0.0312 2.7088 0.9982
CL et 4 1.5011 0.0234 2.3334 0.9987
FL et 4 1.4987 0.0257 2.2598 0.9987
CL xgboost 8 1.6568 0.0295 2.3841 0.9986
FL xgboost 8 1.8606 0.0327 2.6008 0.9983
CL lightgbm 8 1.1604 0.0237 1.6357 0.9993
FL lightgbm 8 1.4603 0.0283 1.9754 0.999
CL catboost 8 0.4466 0.0116 0.5867 0.9999
FL catboost 8 0.5355 0.0132 0.7166 0.9999
CL rf 8 1.7734 0.0285 2.7522 0.9981
FL rf 8 1.6954 0.0264 2.4565 0.9985
CL et 8 1.5011 0.0234 2.3334 0.9987
FL et 8 1.4682 0.0221 2.2063 0.9988
DOI: 10.7717/peerj-cs.2870/table-6

Note:

Values in bold represent the best results.

The results from the DAT263x dataset indicate that FL models tend to outperform CL models on most metrics, such as RMSE, MAE, R-squared, and MAPE, demonstrating their better model fit and generalization performance in processing wearable sensor data.

Synthetic datasets

The evaluation is also performed on synthetic data from the Apple Watch dataset (Table 7) and the Mi Band dataset (Table 8). Performance on these synthetic datasets is broadly similar to that on the real data, albeit with a slight drop-off. This drop-off can be attributed to the inherent limitations of synthetic data, which does not capture the complexities of real-world variability as well. Models trained on synthetic datasets may therefore generalize less well to new real-world settings, leading to issues such as overfitting or poor generalization. This makes it difficult to rely on synthetic data for training robust models and requires great care when simulating data for model testing.

Table 7:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on Apple Watch Synthetic dataset with partition 60%/40%.
Type Model k Accuracy F1 Kappa MCC
CL xgboost 2 0.8624 0.8626 0.8347 0.8357
FL xgboost 2 0.8254 0.8240 0.7897 0.7903
CL lightgbm 2 0.8598 0.8607 0.8316 0.8323
FL lightgbm 2 0.7936 0.7914 0.7516 0.7522
CL catboost 2 0.8704 0.8709 0.8443 0.8449
FL catboost 2 0.8307 0.8304 0.7961 0.7970
CL rf 2 0.8307 0.8304 0.7966 0.7971
FL rf 2 0.8175 0.8163 0.7806 0.7816
CL et 2 0.8333 0.8352 0.7999 0.8003
FL et 2 0.8571 0.8569 0.8283 0.8291
CL xgboost 8 0.8624 0.8626 0.8347 0.8357
FL xgboost 8 0.7786 0.7779 0.7323 0.7364
CL lightgbm 8 0.8598 0.8607 0.8316 0.8323
FL lightgbm 8 0.7266 0.7238 0.6693 0.6722
CL catboost 8 0.8704 0.8709 0.8443 0.8449
FL catboost 8 0.7995 0.7985 0.7574 0.7618
CL rf 8 0.8307 0.8304 0.7966 0.7971
FL rf 8 0.8255 0.8235 0.7897 0.7947
CL et 8 0.8333 0.8352 0.7999 0.8003
FL et 8 0.8281 0.8259 0.7929 0.7964
DOI: 10.7717/peerj-cs.2870/table-7

Note:

Values in bold represent the best results.
Table 8:
Different metrics comparison of CatBoost, Extra Trees, LightGBM, Random Forest, and XGBoost using CL and FL on Mi Band Synthetic dataset with partition 60%/40%.
Type Model k MAE MAPE RMSE R2
CL xgboost 2 61.0546 0.2505 111.8685 0.6926
FL xgboost 2 61.2773 0.2709 113.4315 0.6851
CL lightgbm 2 51.9401 0.2293 92.1887 0.7913
FL lightgbm 2 54.9092 0.2253 96.0003 0.7660
CL catboost 2 47.1213 0.1725 94.2360 0.7819
FL catboost 2 56.4724 0.2162 132.6429 0.5359
CL rf 2 51.3832 0.2004 103.1989 0.7384
FL rf 2 51.8509 0.192 105.7146 0.7183
CL et 2 55.2975 0.2249 105.8004 0.7251
FL et 2 48.528 0.1739 97.7883 0.7513
CL xgboost 8 61.0546 0.2505 111.8685 0.6926
FL xgboost 8 65.8528 0.2936 116.7600 0.6813
CL lightgbm 8 51.9401 0.2293 92.1887 0.7913
FL lightgbm 8 58.0431 0.3915 98.2643 0.7832
CL catboost 8 47.1213 0.1725 94.2360 0.7819
FL catboost 8 49.5009 0.2178 95.9313 0.7838
CL rf 8 51.3832 0.2004 103.1989 0.7384
FL rf 8 48.9094 0.1706 98.8223 0.7783
CL et 8 55.2975 0.2249 105.8004 0.7251
FL et 8 51.9627 0.1757 99.5822 0.7653
DOI: 10.7717/peerj-cs.2870/table-8

Note:

Values in bold represent the best results.

Random Forest and Extra Trees (or Extremely Randomized Trees) are two ensemble learning methods that often produce robust outcomes and are frequently mentioned together due to their similarities and effectiveness. Both methods are based on constructing multiple decision trees, which collectively contribute to more stable and accurate predictions than a single tree could achieve.

The core reason behind their strong performance is their foundational approach, where both use an ensemble of decision trees to perform classification or regression tasks. However, they differ in how they sample the data used to build these trees. Extra Trees introduces additional randomness into the model by selecting random thresholds for each feature rather than searching for the best possible thresholds like Random Forest does. Random Forest, on the other hand, uses bootstrapping to create different subsets of the original data for training each tree, meaning each tree in a Random Forest model learns from a slightly different sample of the data points.

Together, these methods leverage the strengths of decision trees while mitigating their tendency toward overfitting through averaging, resulting in consistent and reliable predictions. This explains why Random Forest and Extra Trees are both popular and effective for a wide range of data science tasks.
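
The difference described above can be seen directly in scikit-learn's standard hyperparameters, as the sketch below illustrates on a toy dataset; both classifiers and their settings are stock scikit-learn components, used here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random Forest: bootstrap resampling + searching for the best split threshold
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0)
# Extra Trees: full sample by default + randomly drawn split thresholds
et = ExtraTreesClassifier(n_estimators=200, bootstrap=False, random_state=0)

for name, clf in [("rf", rf), ("et", et)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```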

Statistical significance testing

We performed Wilcoxon signed-rank tests to rigorously test for performance differences between FL and CL models. This nonparametric test was chosen because it is well suited to comparing paired data without assuming a normal distribution of differences, making it appropriate for our performance metric comparisons. We employed a one-sided test (alternative='greater') to test whether FL demonstrated a statistically significant performance improvement over CL for each measure. Our analysis revealed many cases where FL performed statistically significantly better. In the case of the Apple Watch dataset, using the 60%/40% split and k = 4 clients, FL models performed significantly better on several classification measures, namely recall, precision, F1-score, Kappa, and MCC (p < 0.05). In the case of the HARTH dataset, using the 60%/40% split, FL demonstrated statistically significantly smaller values for Log Loss at both k = 2 and k = 8 (p < 0.05), indicating better-calibrated models. Additionally, for the Mi Band dataset using the 60%/40% split, FL models achieved significantly smaller error rates for MAE, MSE, and RMSE at k = 4, and for MSE at k = 8 (p < 0.05), indicating smaller prediction error. In the case of the DAT263x dataset, using the 70%/30% and 90%/10% splits, FL significantly outperformed CL, with statistically significant decreases in MAE, MSE, RMSE, MAPE, and root mean squared logarithmic error (RMSLE) across all client configurations (p < 0.05). While these specific examples indicate statistically significant gains for FL, for other combinations of measures and configurations the Wilcoxon signed-rank tests did not show statistically significant differences, indicating equivalent performance in those scenarios.
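
A minimal sketch of this test using SciPy is shown below; the paired scores are illustrative values in the style of Table 2, not the exact vectors used in our analysis.

```python
from scipy.stats import wilcoxon

# Paired per-configuration F1-scores for FL and CL (illustrative values)
fl_f1 = [0.8168, 0.7607, 0.8520, 0.8694, 0.7908]
cl_f1 = [0.7951, 0.7277, 0.8518, 0.8518, 0.7951]

# One-sided test: is FL's F1 statistically significantly greater than CL's?
stat, p = wilcoxon(fl_f1, cl_f1, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")  # reject the null hypothesis when p < 0.05
```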

Combining classification and regression algorithms

The integration of classification and regression models provides a sophisticated analytical framework that could significantly enhance the usefulness of wearable technologies in health monitoring. Using classification models to determine the type of physical activity and regression models to quantify specific outcomes such as calorie expenditure creates a multi-layered analytical process that optimizes accuracy and personalization. In other words, classification determines the activity (e.g., walking, running), while regression quantifies aspects such as intensity or duration (e.g., calorie expenditure, speed). Integrating the two allows a more comprehensive and nuanced analysis of the user’s behavior and physiological response, providing richer insights than either task alone. While empirical testing is beyond the scope of this work, this combination represents a logical step toward fully exploiting the analytical potential of our platform and covering a broader spectrum of real-world healthcare monitoring applications.

In the absence of a suitable dataset, we can only hypothesize a potential integration of classification and regression models. One solution would be to use the output of classification models as input to regression models, effectively combining the strengths of both approaches. The pipeline could begin with a classification model that leverages sensor data from wearable devices to identify the type of physical activity being performed, such as walking, running, or cycling. Classification could be based on movement patterns such as speed, frequency, and pace. Once classified, the activity would provide valuable information to the regression model, complemented by other parameters that may influence calorie expenditure, such as activity duration, heart rate, and environmental factors like temperature or altitude. The regression model would then estimate the calories burned from this data. The estimate would be tailored not only to the type of activity but also to individual differences such as the user’s age, weight, and physical condition, making it more personalized and accurate.
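To make this hypothesized pipeline concrete, the sketch below chains a classifier and a regressor: the predicted activity label is appended to the sensor features before the regression step. All data, feature layouts, and model choices here are illustrative assumptions, not part of the platform.

```python
# Hypothetical two-stage pipeline: classify the activity, then feed the
# predicted label (plus the raw sensor features) into a calorie regressor.
# All data and models below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Stage 1: activity classification from sensor features
# (e.g., speed, movement frequency, heart rate, duration).
X_sensor = np.random.rand(1000, 4)             # placeholder sensor matrix
activity = np.random.randint(0, 3, size=1000)  # 0=walk, 1=run, 2=cycle
calories = np.random.rand(1000) * 500          # placeholder target

clf = RandomForestClassifier(random_state=0).fit(X_sensor, activity)
predicted_activity = clf.predict(X_sensor)

# Stage 2: calorie regression, with the predicted activity as an extra feature.
X_aug = np.column_stack([X_sensor, predicted_activity])
reg = RandomForestRegressor(random_state=0).fit(X_aug, calories)

# At inference time the same two steps run in sequence.
new_sample = np.random.rand(1, 4)
new_aug = np.column_stack([new_sample, clf.predict(new_sample)])
print(f"Estimated calories burned: {reg.predict(new_aug)[0]:.1f}")
```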

In addition to individual fitness and health monitoring, this approach can also be used in clinical practice, where accurate monitoring of physical activity and energy expenditure is crucial for patient care, such as in rehabilitation or chronic disease management. It can also be used in sports science, where analyzing individual data is essential for optimizing athletes’ performance and recovery.

Discussion

An intuitive application integrating AutoML technologies was designed and built to facilitate the training of machine learning models on interoperable datasets using both FL and CL methodologies. The application was designed specifically for end users without technical expertise in machine learning who want a user-friendly interface that simplifies model development, improvement, and testing. Through intuitive design principles and automation capabilities, it makes complex machine learning workflows manageable without specialized skills.

The application not only simplifies the complexity of model training but also offers robust support for managing various datasets and model parameters, improving the overall user experience. Thanks to its integration with AutoML, the platform automatically selects the best algorithms and tuning parameters based on the data, significantly reducing the barriers to efficient model development and deployment. This approach allows even novice users to achieve high-quality results, making sophisticated data analysis tools more accessible.
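As an indication of what this AutoML step looks like in practice, the sketch below uses PyCaret's regression module to rank candidate algorithms automatically; the input file and target column are hypothetical placeholders.

```python
# Minimal sketch of an AutoML workflow with PyCaret's regression module.
# "wearable_data.csv" and the "calories" target are hypothetical placeholders.
import pandas as pd
from pycaret.regression import setup, compare_models, tune_model

wearable_df = pd.read_csv("wearable_data.csv")  # hypothetical tabular dataset

# setup() handles preprocessing (imputation, encoding, train/test split).
exp = setup(data=wearable_df, target="calories", session_id=42)

# compare_models() trains and cross-validates the available regressors
# and returns the best one according to the default metric.
best = compare_models()

# Optional: automated hyperparameter tuning of the selected model.
tuned = tune_model(best)
```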

Additionally, the application’s environment encourages experimentation and continuous learning. Users can visualize the effects of their adjustments in real time, fostering a better understanding of machine learning processes and driving innovation. The application also includes features for advanced users, such as options to manually adjust or extend AutoML configurations, providing flexibility and control where needed.

Overall, this application represents a significant step toward democratizing machine learning, making it accessible and manageable for users of varying skill levels while ensuring that the models produced are both powerful and relevant to the user’s specific needs. However, future work could extend this study to additional datasets, a wider range of partition proportions, and a larger number of clients to enhance generalizability and robustness.

Overall, FL offers distinct advantages over CL. This study shows that FL not only matches CL but also outperforms it in some cases. Table 9 compares FL and CL across various features and aspects, highlighting the strengths of each approach. FL excels at preserving data privacy because it avoids central data collection, supports reduced data movement by training models locally, and offers model personalization based on local data characteristics. It also scales well to a large number of users since the learning process is distributed across multiple nodes. Moreover, FL promotes decentralized data repositories and utilizes distributed and parallel computation, which can optimize bandwidth usage by transmitting only model updates instead of raw data.

Table 9:
Comparison between federated and centralized learning.
Feature/Aspect	FL	CL
Data privacy	✓	✗
Reduced data movement	✓	✗
Model personalization	✓	✗
Scalability to large number of users	✓	✗
Data repository is decentralized	✓	✗
Distributed and parallel computation	✓	✗
Bandwidth optimization	✓	✗
Easier to implement and manage	✗	✓
Consistent data quality	✗	✓
Simpler data access patterns	✗	✓
Consistent data aggregation	✗	✓
Streamlined model training process	✗	✓
DOI: 10.7717/peerj-cs.2870/table-9

On the other hand, CL tends to be easier to implement and manage due to its centralized nature. It generally ensures consistent data quality, benefits from simpler access patterns to the centralized data repository, and provides uniform data distribution, which can lead to more stable model training. Additionally, CL can offer higher model training efficiency due to the centralized processing power and streamlined data handling.
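To ground the bandwidth and privacy points above, here is a minimal sketch of an FL client using Flower, in which only model parameters, never raw local records, are exchanged with the server. The model, local data, and server address are hypothetical placeholders, and Flower's actual API offers far more than is shown.

```python
# Minimal sketch of an FL client in Flower: only model parameters are
# exchanged with the server, never the raw local records. The model,
# data, and server address are hypothetical placeholders.
import flwr as fl
import numpy as np
from sklearn.linear_model import SGDRegressor

X_local, y_local = np.random.rand(200, 4), np.random.rand(200)  # local data
model = SGDRegressor(random_state=0)
model.partial_fit(X_local, y_local)  # initializes coef_ and intercept_

class WearableClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [model.coef_, model.intercept_]

    def fit(self, parameters, config):
        model.coef_, model.intercept_ = parameters  # load global weights
        model.partial_fit(X_local, y_local)         # one local training pass
        return self.get_parameters(config), len(X_local), {}

    def evaluate(self, parameters, config):
        model.coef_, model.intercept_ = parameters
        mse = float(np.mean((model.predict(X_local) - y_local) ** 2))
        return mse, len(X_local), {}

# Connect to a (hypothetical) Flower server that aggregates the updates.
fl.client.start_numpy_client(server_address="127.0.0.1:8080",
                             client=WearableClient())
```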

Conclusion

This study demonstrated how FL can significantly improve the privacy and interoperability of wearable sensor data in the healthcare sector. The research focused on integrating FL with the Fast Healthcare Interoperability Resources (FHIR) standard to enable seamless, privacy-compliant data analysis across heterogeneous healthcare systems. A comprehensive comparison showed that FL achieves predictive performance comparable to CL while offering clear advantages in privacy. The web application created with tools such as Streamlit, PyCaret, and Flower effectively illustrates a practical implementation of FL in the healthcare sector.

Although this study demonstrated the effectiveness of FL and FHIR in a local simulation environment, several important limitations need to be addressed by future research. Chief among them is the reliance on a single dataset per experiment and on local execution. Future work could strengthen this approach by integrating multiple heterogeneous datasets in a genuinely distributed environment, subject to network constraints and device heterogeneity; this extension would more realistically capture the challenges and potential of deploying FL in diverse real-world settings. Furthermore, exploring alternative FL approaches, including more advanced cryptographic methods such as homomorphic encryption and differential privacy, would strengthen privacy protection without unnecessarily compromising data utility. Another important avenue of research is the development of adaptive learning algorithms capable of reacting dynamically to shifts in data distribution or device activity, thereby preserving the relevance and accuracy of models over time. Addressing these areas of improvement will be essential to make FL-based healthcare systems more robust, secure, and ultimately feasible for large-scale deployment in heterogeneous medical applications.