The use of wrist-worn fitness trackers has grown rapidly over the last decade (Phaneuf, 2019). People across all demographics use fitness trackers to measure and analyse their physical activity and body functions (Pothitos, 2016). Fitness trackers typically monitor activity level, steps, distance, heart rate (HR), sleep, peripheral capillary oxygen saturation (SpO2), and more as the technology continues to advance (Pothitos, 2016). Based on a recent Australian market analysis by the Statista Global Consumer Survey, revenue from wearables was estimated at approximately $173 million in 2019, and the number of wearable users is expected to increase to 2.1 million by 2023 (Statista, 2020). Consumers across all age groups are expressing rising interest in this field (Statista, 2020). In 2019, 31.9% of users were between 25 and 34 years old, which identifies the target market (Statista, 2020). The market is expanding not only in Australia but globally, with wearable technology topping the worldwide fitness trends since 2016 (Thompson, 2015; Thompson, 2019). The growing popularity can be partly attributed to improvements in technology that make activity trackers more affordable, comfortable and convenient for use in different fitness and environmental applications (Phaneuf, 2019).
In terms of measuring HR, photoplethysmography (PPG) is a relatively new technology utilised in wearables. PPG estimates HR through an optical technique that monitors changes in blood volume beneath the skin (Shin, Shin & Lee, 2011). The ability of these watches to measure important cardiovascular parameters such as SpO2, HR, and heart rate variability (HRV) indicates that thorough research must be conducted on this topic to determine their reliability and validity (Alharbi et al., 2019). It is imperative to inform clinicians who might prescribe these devices and populations with medical conditions on the validity and reliability of these devices. If data from these monitors are to be used to monitor or guide patient activity and therapy, the monitors’ accuracy must be validated (Etiwy et al., 2019).
Currently, peer reviewed research with the aim to assess the validity of these devices to measure HR spans across many brands including multiple models of the Fitbit, Apple Watch, Garmin, Mio FUSE, TomTom and other brands (Etiwy et al., 2019; Abt, Bray & Benson, 2018; Bai et al., 2018; Boudreaux et al., 2018; Cadmus-Bertram et al., 2017; Claes et al., 2017; Collins et al., 2019; Dooley, Golaszewski & Bartholomew, 2017; Gillinov et al., 2017; Gorny et al., 2017; Haghayegh et al., 2019; Hernando et al., 2018; Jo et al., 2016; Khushhal et al., 2017; Lee & Gorelick, 2011; Leth et al., 2017; Nelson & Allen, 2019; Pope et al., 2019; Powierza et al., 2017; Reddy et al., 2018; Shcherbina et al., 2017; Stahl et al., 2016; Stove et al., 2019; Tedesco et al., 2019; Thiebaud et al., 2018; Thomson et al., 2019; Wang & Fu, 2016). Most of those studies involve healthy adults performing a treadmill protocol using an electrocardiogram (ECG) as a reference standard (Cadmus-Bertram et al., 2017; Claes et al., 2017; Gillinov et al., 2017; Lee & Gorelick, 2011; Leth et al., 2017; Nelson & Allen, 2019; Powierza et al., 2017; Shcherbina et al., 2017; Thiebaud et al., 2018; Thomson et al., 2019). Many studies reported good-excellent validity as defined by Intraclass Correlation Coefficients (ICC) > 0.60 or a Pearson Correlation Coefficient value of r > 0.50 (Etiwy et al., 2019; Abt, Bray & Benson, 2018; Bai et al., 2018; Cadmus-Bertram et al., 2017; Claes et al., 2017; Collins et al., 2019; Dooley, Golaszewski & Bartholomew, 2017; Gillinov et al., 2017; Gorny et al., 2017; Hernando et al., 2018; Khushhal et al., 2017; Powierza et al., 2017; Shcherbina et al., 2017; Stahl et al., 2016). For relative measures, the following criteria were used ICC: Poor = ICC < 0.40, Fair = ICC 0.40–0.59, Good = ICC 0.60–0.74, Excellent ≥ 0.75 (Fleiss, Levin & Paik, 2003); for r: negligible r = 0–0.29, low r = 0.30–0.49, moderate r = 0.50–0.69, high r = 0.70–0.89, very high r = 0.90–1 (Mukaka, 2012). 
Measurements of HR tended to be more valid at low treadmill intensities (Boudreaux et al., 2018; Gorny et al., 2017; Jo et al., 2016; Reddy et al., 2018); however, one study demonstrated good validity at high intensities (12.1 km/h) (Stove et al., 2019). Limitations of the current research include variation in the devices tested, exercise protocols utilised, intensities tested and reference standards used, as well as inconsistent statistical analyses. Because of this variation, results on validity cannot be generalised to all fitness trackers.
One of the more recent fitness trackers on the market, released in September 2018, is the Polar Vantage M (PVM) watch. It is advertised as a high performance watch that ‘measures HR from the wrist with the Precision Prime™ sensor fusion technology which combines optical HR measurement with other sensor technologies in order to rule out involuntary movement that might disturb the HR signal and produce unreliable readings’ (Polar USA, 2020). Despite this claim, no peer reviewed studies have been conducted on the validity of the PVM watch to date. Therefore, the aim of this study was to assess the validity of the PVM watch for measuring HR, compared with medical grade ECG, in a healthy population across a range of exercise intensities.
Materials and Methods
The current study was an observational design assessing the validity of measuring HR by the Polar Vantage M watch during incremental treadmill exercise. To assess the validity of measuring HR using the Polar Vantage M watch, participants were asked to attend the human performance laboratory on two occasions separated by 1–2 days.
A convenience sample of 30 healthy and physically active participants (n = 17 males, n = 13 females) were recruited from a university student population through advertised electronic and printed posters on the University campus, as well as through social media. Research has provided evidence that a sample size of at least 15–20 is considered adequate for reliability studies which collect continuous data and therefore the current sample size is justified (Lexell & Downham, 2005). Ethical approval was granted through Bond University Human Research Ethics committee (ML01928). Informed consent was obtained from all participants via a signed consent form prior to participation.
Participants were excluded if they met any of the following criteria: under the age of 18; any pre-existing respiratory, cardiovascular or metabolic condition; taking medications that affected their HR; pregnant; or any musculoskeletal injury, current or within the last 6 weeks, such as low back pain, ankle sprain or osteoarthritis. Participants were also excluded if they refused to give informed consent to perform the protocol or were unable to answer ‘no’ to all six questions in the Exercise and Sport Science Australia (ESSA) Adult Pre-Exercise Screening Form (Essa.org.au, 2019).
HR measurements from the PVM watch (Polar Electro, Kempele, Finland) were compared with the gold standard ECG during treadmill exercise at different intensities. A real-time 7-lead (I, II, III, aVR, aVL, aVF, V5) ECG was recorded with the Mortara XScribe Cardiac Stress Test System (Serial Number: NO8200), which is a regulated medical device intended for use in clinical settings (Welch Allyn, 2018). Five electrodes were placed at standard anatomical landmarks: right arm (RA), left arm (LA), left leg (LL), right leg (RL) and V5 (chest). The PVM watch is a multi-sport watch that tracks HR optically using photoplethysmography (PPG), which identifies volumetric changes in the microvascular structure of tissue; HR is then interpreted and calculated from the measured optical signal (Polar Australia, 2020). During testing, the ‘running’ mode was utilised to help validate the watch’s function and accuracy. The Bruce Protocol was conducted on a Valiant Lode BV Treadmill Type: 932900 (Serial Number: 20060061). The Bruce Protocol (Queensland Health, 2019) was chosen because it is a standard graded exercise test, commonly used for clinical and laboratory-based assessments of cardiorespiratory fitness in athletes and patients. Additionally, the Bruce Protocol has been used in similar validation studies (Bruce, Kusumi & Hosmer, 1973) and allows for easy comparison of results against past and future investigations.
Participants were contacted prior to testing via mobile phone to ensure exclusion criteria did not apply. The participants were required to visit the laboratory on two separate occasions. Each participant completed both tests within 24–48 h and at approximately the same time of day. These requirements aimed to reduce any possible variance caused by change in fitness status and circadian variation.
On the first day of testing, each participant was introduced to the laboratory and required to complete the ESSA Adult Pre-Exercise Screening Tool and consent form. Personal information, including name and age, and anthropometric data, including height and mass, were recorded. Participants were then prepped for the ECG placement (Fig. 1). To ensure consistency, the PVM watch was placed on each participant’s right hand. The device’s display was oriented away from the participant to allow the researcher to easily read the HR. The tightness of the PVM watch was standardised for each participant, with the number of watchband notches recorded and repeated for the second day of testing. Resting HR was recorded from the ECG and PVM watch while the participant was seated quietly for 3 min.
Heart rate readings were taken from both ECG and PVM watch every 30 s throughout the entire treadmill protocol. To obtain resting HR, the protocol started with the participants sitting stationary in a chair for 3 min, followed by 2 min of standing on the treadmill. Next, the standard Bruce Protocol was conducted increasing both treadmill speed (Stage 1, 2.7 km/h; Stage 2, 4.0 km/h; Stage 3, 5.4 km/h; Stage 4, 6.7 km/h; Stage 5, 8.0 km/h) and incline (starting at 10% grade and increasing 2% every Stage) every 3 min until volitional exhaustion (Queensland Health, 2019). Participants were not allowed to hold onto the handrail throughout the duration of the protocol, however, instructions of ‘hand’ were communicated 5 s prior to each 30 s reading to allow accurate data collection by the researcher. When participants indicated they had reached their volitional exhaustion, displayed any criteria for terminating test based upon Queensland Health indications for terminating exercise testing (Queensland Health, 2019), or ECG read the participant had attained their age-predicted maximal HR, participants were instructed to place both hands onto the handrail whilst the treadmill was put into the active cooldown stage.
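The timing described above can be laid out as a simple schedule. The following is a minimal sketch only: the names `Phase`, `build_protocol` and `reading_schedule` are hypothetical, and it assumes that a reading is taken at the end of each 30 s interval, which the text does not state. Speeds, grades and phase durations are taken from the protocol description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    duration_s: int
    speed_kmh: float
    grade_pct: float


# Treadmill speeds for Bruce stages 1-5, as listed in the text.
BRUCE_SPEEDS_KMH = [2.7, 4.0, 5.4, 6.7, 8.0]


def build_protocol():
    """Seated rest (3 min), standing (2 min), then five 3 min Bruce stages."""
    phases = [
        Phase("seated rest", 180, 0.0, 0.0),
        Phase("standing", 120, 0.0, 0.0),
    ]
    for i, speed in enumerate(BRUCE_SPEEDS_KMH):
        # Grade starts at 10% and rises by 2% with every stage.
        phases.append(Phase(f"stage {i + 1}", 180, speed, 10.0 + 2.0 * i))
    return phases


def reading_schedule(phases, interval_s=30):
    """Return (elapsed_seconds, phase_name) for every planned reading.

    Assumption: one reading at the end of each 30 s interval.
    """
    schedule = []
    phase_start = 0
    for phase in phases:
        phase_end = phase_start + phase.duration_s
        t = phase_start + interval_s
        while t <= phase_end:
            schedule.append((t, phase.name))
            t += interval_s
        phase_start = phase_end
    return schedule


protocol = build_protocol()
schedule = reading_schedule(protocol)
print(len(schedule), "planned readings; last:", schedule[-1])
```

Because participants stopped at volitional exhaustion, this full schedule is an upper bound on the number of readings per session rather than the number actually collected.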
Statistical analyses were performed using SPSS (Version 24.0; IBM Corp, Armonk, NY, USA). To determine whether the anthropometric data were normally distributed, histograms and normal Q–Q plots were visually inspected and a Shapiro–Wilk test was conducted; the results determined the appropriate descriptive statistics (mean or median). A scatter plot of combined data for Day 1 and Day 2 was used to depict the strength of the relationship between ECG-measured HR and PVM watch-measured HR across intensities, with the associated r2 value. To assess criterion validity between Polar and ECG HR data for Day 1 and Day 2, intraclass correlation coefficients (ICC) were used. According to Koo & Li (2016), the model, type and definition should be reported when using an ICC. Following these guidelines, the model used was two-way mixed effects, the type was average measures and the definition was absolute agreement. The ICCs were interpreted for validity as follows: an ICC value of less than 0.50 indicated poor validity, ICC values in the range of 0.50–0.65 indicated moderate to good validity, and an ICC value of greater than 0.65 indicated good validity (Fleiss, Levin & Paik, 2003). The level of agreement between the PVM watch and ECG was presented through Bland–Altman plots with the associated 90% limits of agreement. The equation used to determine the limits of agreement (LoA) was:
LoA = mean difference ± 1.65 × (standard deviation of difference) (Bland & Altman, 1986).
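As an illustration of the equation above, the limits can be computed from paired readings as in the sketch below. The function name `limits_of_agreement` is hypothetical, the sign convention (reference minus device) and the use of the sample standard deviation (n − 1) are assumptions, and the readings are made up for the example rather than taken from the study.

```python
from statistics import mean, stdev


def limits_of_agreement(reference, device, z=1.65):
    """Return (bias, lower, upper) for paired readings.

    Differences are taken as reference minus device (an assumed sign
    convention); the sample standard deviation (n - 1) is used.
    z = 1.65 follows the multiplier given in the equation above.
    """
    if len(reference) != len(device):
        raise ValueError("paired readings must have equal length")
    diffs = [r - d for r, d in zip(reference, device)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


# Made-up readings in bpm, for illustration only (not study data).
ecg = [100, 110, 120, 130]
watch = [98, 111, 117, 131]
bias, lower, upper = limits_of_agreement(ecg, watch)
print(f"bias={bias:.2f} bpm, LoA=({lower:.2f}, {upper:.2f}) bpm")
```

The limits are symmetric about the mean difference by construction, so the width of the interval depends only on the standard deviation of the differences.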
In total, 30 participants completed the protocol, with data from 28 participants utilised for statistical analysis (16 males, 14 females, 26.10 ± 3.39 years, height 52.36 m ± 7.40 cm, mass 73.59 ± 11.90 kg). One participant was removed from data analysis because sinus tachycardia, discovered following testing, significantly influenced their HR results. Another participant’s data, an extreme outlier with an error margin of over 40 bpm between ECG and Polar watch HR, were also removed. Average HRmax (HRM), calculated as 220 − age, was 193.93 ± 3.39 bpm, and average resting HR, measured while sitting upright in a chair for 3 min, was 71.49 ± 10.29 bpm. Percent HRmax was calculated for stages 0–5 for Day 1 using the equation: (average HR for stage/total average HRmax) × 100. Percentages of HRmax for stages 0–5 were 43%, 53%, 61%, 79%, 93% and 95%, respectively. Intensities were interpreted following the American College of Sports Medicine (ACSM) guidelines for exercise prescription, where low intensity is defined as HRM < 55%, moderate intensity as 55–75% and high intensity as HRM > 75% (Riebe, Liguori & Magal, 2018). Therefore, Bruce Protocol stages 0–1 were classified as low intensity, stage 2 as moderate intensity and stages 3–5 as high intensity.
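The intensity classification above follows directly from the stated equations and cut-offs, as the sketch below shows. The function names are hypothetical; the 220 − age formula, the percent HRmax equation, the thresholds and the six stage percentages are taken from the text.

```python
def age_predicted_hrmax(age_years):
    """Age-predicted maximal HR (bpm) using 220 - age."""
    return 220 - age_years


def percent_hrmax(stage_hr_bpm, hrmax_bpm):
    """Average stage HR expressed as a percentage of HRmax."""
    return stage_hr_bpm / hrmax_bpm * 100


def classify_intensity(pct_hrmax):
    """Apply the cut-offs quoted in the text: <55 low, 55-75 moderate, >75 high."""
    if pct_hrmax < 55:
        return "low"
    if pct_hrmax <= 75:
        return "moderate"
    return "high"


# Percent HRmax for stages 0-5 as reported in the text.
reported_pct = [43, 53, 61, 79, 93, 95]
labels = [classify_intensity(p) for p in reported_pct]
for stage, (pct, label) in enumerate(zip(reported_pct, labels)):
    print(f"stage {stage}: {pct}% HRmax -> {label}")
```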
Figure 2 depicts the positive linear association between the ECG and PVM watch. Linear regression analysis was performed which calculated the value of r2 = 0.87, which demonstrated that approximately 87% of variation of HR obtained by the PVM watch can be explained by variation in HR obtained from the ECG.
Bland–Altman plots were produced to graphically depict the level of agreement between the ECG and PVM watch in measuring HR throughout the different exercise intensities in the protocol (Figs. 3, 4 and 5).
In Fig. 3, which represents Stage 1 of the Bruce Protocol, the mean difference between ECG and the PVM watch was 0.98 beats (SD 10.04, SE 1.89) and the upper and lower limits of agreement were 17.50 and −15.54 beats, respectively. Figure 3 displays the majority of data points within the 90% limits of agreement, with few outliers. The limits of agreement between the ECG and PVM watch in Stage 1 spanned 33.04 beats.
Stage 3 of the Bruce Protocol is represented by Fig. 4. The mean difference between the ECG and the PVM watch was 12.25 beats (SD 17.95, SE 3.39), with corresponding upper and lower limits of agreement of 41.78 and −17.28 beats. As seen in Fig. 4, this stage shows a less cohesive set of data points than the previous plot, with a wider span between the limits of agreement of 59.06 beats.
Figure 5 graphically displays Stage 5 of the Bruce Protocol. The mean difference between the ECG and PVM watch was 9.97 beats (SD 13.07, SE 2.99) and the upper and lower limits of agreement were 31.47 and −11.53 beats, respectively. Figure 5 shows the majority of the data points within the 90% limits of agreement and close to the mean difference; however, the limits of agreement span 43.00 beats per min.
Intraclass correlation Coefficients were used to assess criterion validity between Polar and ECG HR data for each stage on Day 1 and Day 2. Good validity was seen for Day 1 and Day 2 for stage 0 (ICC = 0.83; 95% CI [0.63–0.92], ICC = 0.74; 95% CI [0.37–0.88]), stage 1 (ICC = 0.78; 95% CI [0.52–0.90], ICC = 0.88; 95% CI [0.74–0.95]), and stage 2 (ICC = 0.88; 95% CI [0.73–0.94], ICC = 0.80; 95% CI [0.40–0.92]). Poor validity was demonstrated on Day 1 and Day 2 for stages 3–5 (ICC < 0.50). Table 1 shows complete results for ICCs.
|Day 1||95% Confidence Interval||Day 2||95% Confidence Interval|
|Stages||ICC||Lower Bound||Upper Bound||Stages||ICC||Lower Bound||Upper Bound|
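For readers who want to see how an ICC of the reported form (two-way model, absolute agreement, average measures) is obtained, the sketch below uses the mean-squares formula commonly given for this form, (MSR − MSE) / (MSR + (MSC − MSE) / n). This is an illustration under stated assumptions and not the SPSS procedure itself: the function name is hypothetical and the data are made up, not study data.

```python
def icc_absolute_agreement_average(rows):
    """ICC for absolute agreement, average of k measures, two-way model.

    rows: one list per participant, one value per method (e.g. [ecg, watch]).
    """
    n = len(rows)      # participants
    k = len(rows[0])   # measurement methods
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Made-up HR pairs (bpm): identical readings vs. a constant 10 bpm offset.
identical = [[60, 60], [80, 80], [100, 100]]
offset = [[60, 70], [80, 90], [100, 110]]
icc_identical = icc_absolute_agreement_average(identical)
icc_offset = icc_absolute_agreement_average(offset)
print(round(icc_identical, 3), round(icc_offset, 3))
```

Unlike a correlation coefficient, this absolute-agreement form is penalised by a systematic offset between methods, which is why it suits a criterion-validity question.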
The overall correlation between HR measurements from the PVM watch and ECG for both days combined was strong; however, the ICCs varied by stage and the Bland–Altman plots showed wide limits of agreement (LoA), which raises concern about the safety of using the PVM watch in a clinical setting where precision is vital.
Our first main finding was that when combining data from Day 1 and 2, there is a strong correlation between HRs measured by the PVM watch and the ECG. This indicated that HR measures from the PVM watch and ECG were related; meaning that the watch can delineate between high and low HRs through a range of treadmill intensities. However, this high correlation does not necessarily mean these two measures agree. As explained by Bland & Altman (1986), correlation measures the strength of the relationship of variables and will be high if the points lie along any straight line. However, to agree, the points must lie along the line of equality (Bland & Altman, 1986). In the current literature, validity is concluded based upon results from correlation coefficients and mean biases of Bland–Altman Plots without including the implications of wide LoA in the conclusion (Etiwy et al., 2019; Jo et al., 2016; Delgado-Gonzalo et al., 2015). This can result in devices being deemed valid, when in fact, there are large differences in HR measurements for the individual (Cadmus-Bertram et al., 2017).
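The distinction between correlation and agreement can be made concrete with a small sketch. It assumes a hypothetical device that always reads exactly 10 bpm below the reference; the readings are made up, and `pearson_r` is a hypothetical helper written out so that the example has no dependencies.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Made-up reference readings (bpm) and a device reading 10 bpm low throughout.
ecg = [80, 100, 120, 140, 160]
watch = [hr - 10 for hr in ecg]

r = pearson_r(ecg, watch)
bias = mean([e - w for e, w in zip(ecg, watch)])
print(f"r = {r:.2f}, mean difference = {bias:.1f} bpm")
```

Here the points lie on a straight line, so the correlation is perfect, yet they do not lie on the line of equality and every reading is wrong by 10 bpm.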
Looking closer at the variability of each stage, ICC values showed varied degrees of agreement between stages. Low to moderate intensities demonstrated moderate to good agreement, while high intensities showed poor to good agreement based upon upper and lower limits. This finding is consistent with other similar research. Boudreaux et al. (2018) determined the validity of eight HR monitors during a cycling regimen (up to 220 watts) and resistance exercise (10 RM) and found that HR measures from wearable devices were more accurate at rest and at lower exercise intensities than at higher intensities. For example, they found ICCs ranging from 0.92 at rest to 0.12 at 200 watts on a cycle ergometer when using a Fitbit. At rest, HR from most of the wearable devices had strong relationships to ECG values (R = 0.76–0.99). When exercise began, as well as during each increase in exercise intensity, ICCs were reduced in most devices (Boudreaux et al., 2018). Similar results were found by Gorny et al. (2017), who investigated the accuracy of the Fitbit Charge HR compared to the Polar H6 HR monitor during free-living conditions. Results demonstrated strong ICCs for low intensity (ICC = 0.77 (95% CI [0.55–0.87])) and moderate values for moderate to vigorous intensities (ICC = 0.56 (95% CI [0–0.79])) (Gorny et al., 2017).
Results from the Bland–Altman plots revealed wide LoA across all intensities, with increasing error as intensity increased. Although the mean differences were fairly small, there was large variation in HR throughout the protocol. Even at a low intensity, the error margin was large. In line with our finding, Wallen et al. (2016) reported LoA ranging between −27.30 bpm and 13.10 bpm in a study investigating the accuracy of four PPG-based watches at rest and during walking/running and cycling. As intensity changed from low to moderate during stage 3, the largest mean difference and largest variation in bpm were observed. This is supported by a study investigating the validity of the Garmin Forerunner at different walking intensities: Bland–Altman plots from Claes et al. (2017) displayed LoA ranging from −32.53 to 29.40 bpm, with the largest error seen at moderate intensities.
The trend of decreasing agreement in HR as intensity increases may be explained by external factors to which PPG is sensitive. As the protocol progressed, the elevation and speed increased, resulting in higher levels of upper limb movement, higher transmitted vibration and a transition to a flexed-elbow arm swing pattern during running, which occurred around stages 3–4 for most participants. This would support the hypothesis of Claes et al. (2017) that running would result in less accurate HR measurements. Other studies have shown that PPG sensors are susceptible to poor accuracy during high intensity exercises that involve repetitive contractions of skeletal muscles, which may result in decreased contact between the sensor and the skin (Rafolt & Gallasch, 2004; Allen, 2007; Spierer et al., 2015). Potential misalignment between the skin and optical sensor, variations in skin colour/skin tone, ambient light and poor tissue perfusion could also explain error (Alzahrani et al., 2015). Another factor which could influence PPG readings is perspiration (Maeda, Sekine & Tamura, 2011); to limit its effect, room temperature was standardised to 21 degrees Celsius.
Strengths and Limitations
To the authors’ knowledge, the PVM watch has not previously been validated against the gold standard of a medical grade ECG for determining rest and exercise HR, which is a strength of this study. All testing was conducted in young, healthy volunteers in a highly controlled, standardised laboratory setting. The study methodology included the use of a standardised treadmill protocol (which included walking, jogging and running) and tightly controlled room temperature, humidity and lighting. The treadmill-stress testing system was stationary and was not moved throughout the testing period. Additionally, our testing protocol was standardised such that four researchers each performed the exact same role(s) during all testing sessions. The PVM watch was placed in a standard position, with the notches being recorded so the watch was worn on the same hand, in the same position and at the same tightness for all testing. Although this study had a sufficient number of participants and included over 350 HR measurements over the 2 days of testing, it has its limitations. The study findings may vary for different categories of individuals such as cardiac patients, geriatric populations or unhealthy populations with other co-morbidities. Although we accounted for participant factors including gender, age and body mass index, the narrow distribution of these characteristics does not enable us to rule out their potential effect on the accuracy of HR measurement. In addition, these results may not be indicative of validity in other modes of exercise that are included on the watch (e.g. biking, swimming, walking, resistance training). Another limitation was the time frame in which participants returned to complete their second round of testing. Many participants noted symptoms of delayed onset muscle soreness, especially in their gastrocnemius and soleus.
This muscle fatigue or soreness may have influenced or skewed the data on the second day of testing. Our participants were somewhat limited in that they were all Caucasian, with skin tones varying only through sun exposure. During our data analysis, we noticed a consistent outlier on the Bland–Altman plots which corresponded to an individual with a darker complexion. Upon removal of the outlier, values for the limits of agreement and mean difference changed. It is hypothesised that PPG signals may vary by skin tone and that failure to code for variation in skin complexion may have impacted the current study results (Fallow, Tarumi & Tanaka, 2013).
Results from our study can only be generalised to a generally healthy, young population due to the demographics of the participants. Given the large LoA and poor ICCs at high intensities, caution is warranted from a clinical perspective if users are relying upon accurate HR readings for monitoring safe exercise intensities. There are no agreed upon standards for the level of error which is acceptable for monitoring clinical populations; the acceptable level depends upon the specific case, given the large spectrum of applications in which HR monitoring can be used. One guideline from the American National Standard of Cardiac Monitors, Heart Rate Meters and Alarms has defined accuracy as ‘more readout error of no greater than ±10% of the input rate or ±5 bpm, whichever is greater’ (American National Standards Institute, 2002). Following these guidelines, our findings indicate the PVM watch is not appropriate for clinical use. Kottner et al. (2011) discuss acceptable limits of difference between two measures and argue that the decision is a clinical rather than a statistical one. For populations with chronic heart conditions or a need for exact HR measurement, underlying pathologies could interfere with the accuracy of HR measurement from the PVM watch. Cardiac patients with arrhythmias could be led astray by the HR readings and misinterpret their exertion levels, which could be detrimental to their condition or pose a risk during physical activity or exercise. The levels of agreement between the ECG and PVM watch do not reach a safe enough clinical level of agreement for the watch to be used outside a healthy, recreational population.
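The accuracy criterion quoted above (±10% of the input rate or ±5 bpm, whichever is greater) can be expressed as a simple check. This is a sketch of that one sentence only, not of the full standard; the function names and the example readings are hypothetical.

```python
def allowable_error_bpm(reference_hr_bpm):
    """Allowed readout error: 10% of the reference rate or 5 bpm, whichever is greater."""
    return max(0.10 * reference_hr_bpm, 5.0)


def within_tolerance(reference_hr_bpm, device_hr_bpm):
    """True if the device reading falls inside the allowed error band."""
    error = abs(device_hr_bpm - reference_hr_bpm)
    return error <= allowable_error_bpm(reference_hr_bpm)


# Made-up (reference, device) pairs in bpm, for illustration only.
examples = [(40, 44), (150, 162), (150, 170)]
for ref, dev in examples:
    print(ref, dev, within_tolerance(ref, dev))
```

At low rates the fixed 5 bpm floor applies, while at exercise rates the 10% band is the wider of the two.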
Where to from here
In order to determine safety for use in clinical populations, studies involving such participants are needed. Larger sample sizes, including both genders with genders stratified, are needed to confirm our finding of large LoA at varying intensities. Other modes of exercise, such as cycle ergometer, rowing or resistance training, need to be tested to investigate the effect of more and less upper limb movement and to confirm the hypothesis that it is a factor that affects HR accuracy. Previous conclusions on the accuracy of other devices cannot be generalised to all PPG devices. As more fitness trackers appear on the market, research needs to be conducted on the accuracy of devices because PPG-based HR monitors may use different sensors, algorithms and infra-red wavelengths to detect HR (El-Amrawy & Nounou, 2015; Zhang, Pi & Liu, 2015). The impact of skin complexion on the accuracy of PPG readings is poorly supported in the literature, and more research to confirm variation in HR detection using this technology among different ethnicities is recommended (El-Amrawy & Nounou, 2015).
Wrist-worn fitness trackers have become increasingly popular throughout the fitness industry. With continual advancement in technology, it is important for researchers to continue to assess the validity of the trackers. This study revealed that the PVM watch has a strong correlation with the ECG throughout the entire Bruce Protocol; however, the LoA were wide, particularly as exercise intensity increased. Due to the large LoA between the ECG and PVM watch, the watch may not be advisable for use in settings where an accurate HR measure is needed. However, the watch could be utilised in a recreational population in which HR measures are not monitored as closely.