A new diagnostic method for chronic obstructive pulmonary disease using the photoplethysmography signal and hybrid artificial intelligence

Engin Melekoglu; Umit Kocabicak; Muhammed Kürşad Uçar; Cahit Bilgin; Mehmet Recep Bozkurt; Mehmet Cunkas

doi:10.7717/peerj-cs.1188

Engin Melekoglu¹, Umit Kocabicak¹, Muhammed Kürşad Uçar², Cahit Bilgin³, Mehmet Recep Bozkurt², Mehmet Cunkas⁴

¹Computer Engineering, Sakarya University, Sakarya, Turkey

²Electrical and Electronics Engineering, Sakarya University, Sakarya, Turkey

³Faculty of Medicine, Sakarya University, Sakarya, Turkey

⁴Electrical and Electronics Engineering, Selcuk University, Konya, Turkey

Published: 2022-12-19
Accepted: 2022-11-22
Received: 2022-10-07

Academic Editor: Muhammad Asif

Subject Areas: Bioinformatics, Artificial Intelligence, Brain-Computer Interface, Data Science
Keywords: Signal processing in biomedical, Photoplethysmography signal, Machine learning algorithm, Chronic obstructive pulmonary disease

Copyright: © 2022 Melekoglu et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Melekoglu E, Kocabicak U, Uçar MK, Bilgin C, Bozkurt MR, Cunkas M. 2022. A new diagnostic method for chronic obstructive pulmonary disease using the photoplethysmography signal and hybrid artificial intelligence. PeerJ Computer Science 8:e1188 https://doi.org/10.7717/peerj-cs.1188

Abstract

Background and Purpose

Chronic obstructive pulmonary disease (COPD) is a major public health problem, both globally and in our country, and it continues to increase because of poor awareness of the disease and a lack of the necessary preventive measures. COPD results from a blockage of the air sacs within the lungs, known as alveoli; it is a persistent disease that causes difficulty in breathing, cough, and shortness of breath. COPD is characterized by respiratory symptoms and airflow limitation caused by abnormalities in the airways and alveoli that result from significant exposure to harmful particles and gases. The spirometry test (breath measurement test) used for diagnosing COPD requires a hospital visit, which is difficult especially for patients with disabilities or advanced disease and for children. To facilitate diagnosis and avoid these problems, it is considered that using the photoplethysmography (PPG) signal in the diagnosis of COPD would simplify and speed up the diagnostic process and make monitoring more convenient. A PPG signal includes numerous components, including volumetric changes in arterial blood that are related to heart activity, fluctuations in venous blood volume that modify the PPG signal, a direct current (DC) component that shows the optical properties of the tissues, and modest energy changes in the body. PPG is typically acquired with a pulse oximeter, which illuminates the skin and measures changes in light absorption. The PPG signal, which occurs with every heartbeat, is easy to measure. In this study, the PPG signal is modeled with machine learning to predict COPD.

Methods

In this study, the PPG signal was cleaned of noise, and three sub-frequency band signals were derived from it. From each of the four signals, 25 features were extracted, for a total of 100 features. Additionally, weight, height, and age were used as features. The Fisher method was employed for feature selection in order to improve performance.

Results

The improved PPG-based prediction models achieved an accuracy of 0.95 for all individuals. Applying the feature selection algorithm before the classification algorithms contributed to the performance increase.

Conclusion

According to the findings, PPG-based COPD prediction models are suitable for usage in practice.

Introduction and literature review

Chronic obstructive pulmonary disease (COPD) is characterized by respiratory symptoms and airflow limitation caused by abnormalities in the airways and the alveoli that result from significant exposure to harmful particles and gases. COPD is a widespread, preventable, and treatable disease (Zubaydi et al., 2017; Melekoğlu et al., 2021; Batum et al., 2015). COPD constitutes a significant portion of chronic respiratory diseases. It is one of the most important causes of mortality and morbidity, and it imposes a sizable and growing financial and social burden (Lopez, 2006; Arslan & Ünsar, 2021). With the expected prolongation of life expectancy and increased exposure worldwide, the burden of COPD is predicted to increase further (Arslan & Ünsar, 2021). According to an investigation carried out by the WHO (López-Campos, Tan & Soriano, 2016), COPD is the fourth leading cause of death worldwide. Every year, 2.9 million people worldwide die from COPD-related diseases.

COPD is a lung disease that prevents comfortable and healthy breathing due to the narrowing of the airways. The most common symptoms of COPD, a progressive chronic disease for which a definitive treatment has not yet been found, are cough with phlegm and shortness of breath. The symptoms differ depending on the stage of the disease; shortness of breath even with light effort indicates that the disease has progressed. One of the most important causes of these symptoms is cigarette smoke. Both active smokers and the nonsmokers around them (passive smokers) are affected.

COPD is a progressive disease that develops due to non-microbial inflammation in the airways caused by prolonged exposure to tobacco smoke, noxious gases, and particles. As a result of this inflammation, while the airways are gradually narrowing, irreversible enlargement and destruction of the air sacs (alveoli) occur in the lung tissue (Amaral et al., 2015).

Because awareness of COPD is insufficient, its diagnosis and treatment are delayed. A specialist physician makes the diagnosis based on measurements from the spirometer, the device used to diagnose the disease. These measurements are taken only in hospitals and are performed by technicians. After diagnosis, it is important to monitor the patient's illness and the damage the disease has caused to the patient's body. Early diagnosis and intervention in COPD can stop or slow the progression of the disease. Because the narrowing of the airways that makes breathing difficult is often permanent and progressive, diagnosing the disease in its early stages can limit the harm to the patient. Monitoring at regular intervals is very important for following the course of the disease. At present, this monitoring can only be performed in hospitals, and it is a complicated and time-consuming process (Zubaydi et al., 2017).

COPD is diagnosed with a spirometer. The spirometer measures the forced vital capacity (FVC) and the volume exhaled within the first second of this maneuver (FEV1), and the FEV1/FVC ratio is calculated from them. A medical professional can make a diagnosis by comparing spirometry measurements with reference values determined by age, height, weight, and BMI. An individual whose FEV1/FVC ratio is less than 70% is considered a COPD patient (Melekoğlu et al., 2021; Isik, Guven & Buyukoglan, 2015; Uçar et al., 2018b). The spirometer can be difficult to use, especially for small children, the disabled, and patients with advanced illnesses. This makes it necessary to shorten and simplify the diagnostic process (Melekoğlu et al., 2021; Er & Temurtas, 2008; Er et al., 2009). Because of these drawbacks, there is a need for methods that are simple to use and follow in order to diagnose COPD more effectively (Uçar et al., 2018b, 2018c). To overcome these problems, make the COPD diagnosis process faster, and make subsequent patient monitoring easier, it is considered that the photoplethysmography (PPG) signal can be beneficial in the diagnosis of COPD (Melekoğlu et al., 2021; Moraes et al., 2018). The PPG is a biological signal that can be measured anywhere near the heart.
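The spirometric criterion described above can be illustrated with a minimal sketch. The function names and the example volumes are hypothetical and chosen only for illustration; the sketch applies the fixed 70% threshold mentioned in the text and ignores the reference-value comparison that a clinician would also perform.

```python
def fev1_fvc_ratio(fev1_l, fvc_l):
    """Return FEV1/FVC as a fraction; both volumes are given in litres."""
    if fvc_l <= 0:
        raise ValueError("FVC must be positive")
    return fev1_l / fvc_l


def below_copd_threshold(fev1_l, fvc_l, threshold=0.70):
    """True when the ratio falls below the fixed 70% threshold from the text."""
    return fev1_fvc_ratio(fev1_l, fvc_l) < threshold


# Hypothetical measurements, not taken from the study data.
print(below_copd_threshold(1.8, 3.0))  # ratio 0.60
print(below_copd_threshold(3.2, 4.0))  # ratio 0.80
```

A ratio check of this kind is only one input to a diagnosis; it is shown here to make the spirometric baseline concrete before the PPG-based alternative is introduced.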

Heart signals convey vital information about the body and illness. For this reason, the PPG signal was evaluated as a candidate for use in the diagnosis of COPD, and a PPG signal-based COPD diagnostic method is proposed in this study. The developed method is also expected to provide a low-cost infrastructure for the production of portable devices for diagnosing the disease.

Studies in the literature show that artificial intelligence algorithms can be used in the detection of asthma and COPD (Joumaa et al., 2022). That study also recommends the use of open source datasets. In addition to machine learning algorithms, deep learning algorithms are increasingly used in the diagnosis of diseases, a trend supported by the development of computing infrastructure. Deep learning has been reported to achieve higher performance than classical machine learning algorithms (Ghorbanzadeh et al., 2019; Sahoo, Pradhan & Das, 2020; Zhang et al., 2017).

In recent years, various studies have examined the usability of new classifiers, decision-making software, and tools in the diagnosis of diseases (de Mesquita et al., 2022; Lazazzera et al., 2021). One of these research areas is artificial intelligence applications (Valente et al., 2016; Rodrigues et al., 2018). Such systems are expected to benefit the medical field by assisting in diagnosis, shortening the diagnosis time, and increasing efficiency and productivity (Filho et al., 2014). This study aims to diagnose COPD with a machine learning algorithm using only the PPG signal of a patient.

One of the overall goals of this study is to facilitate and help confirm the diagnosis of COPD through machine learning. Improving parameters such as diagnosis duration and efficiency is a further objective (Isik, Guven & Buyukoglan, 2015). This study was carried out using the PPG signal in compliance with the principles of GOLD (Global Initiative for Chronic Obstructive Lung Disease).

The aim of this study is to diagnose COPD quickly and reliably from the PPG signal using artificial intelligence. A model that differs from and improves on those in the literature is proposed. PPG records were collected from patients and healthy individuals to build the model. The PPG signals are denoised and split into sub-frequency bands. Then, time-domain features are extracted from each frequency band. A feature selection algorithm is used to improve performance and eliminate unnecessary features. The resulting feature sets are classified with machine learning algorithms. The results showed that the diagnosis can be made with a two-second PPG signal.

Method and material

The study followed the workflow shown in Fig. 1. First, the PPG signal is separated into sub-frequency bands with digital filters. Each signal is then split into two-second epochs, and time-domain features are extracted from each epoch. To increase performance, the best features are selected with a feature selection algorithm. The selected features are classified by the Ensemble Tree (ET) algorithm, the k-nearest neighbor algorithm (kNN), support vector machines (SVMs), and hybrid methods.

Figure 1: Diagram flow.

DOI: 10.7717/peerj-cs.1188/fig-1

Data collection

The data used in the study were obtained from the Sleep Laboratory of Sakarya Hendek State Hospital. Each record was examined by a medical professional according to the criteria for COPD and classified as either diseased or healthy. To carry out the research, ethics committee approval (report number 1614662/050.01.04/70) was obtained from the Dean of the Faculty of Medicine, Sakarya University, and usage permission (number 94556916/904/151.5815) was obtained from the Republic of Turkey Ministry of Health, Turkey Public Hospitals Institution, Sakarya Province Public Hospitals Association General Secretariat. A consent form was obtained from all participants. The data used in the study were collected in 2015–2016.

The study was conducted on fourteen individuals in total: eight COPD patients and six healthy controls, of whom 12 were male and two were female. Demographic information and PPG record durations are given in Table 1.

Table 1:

Distribution of demographic information and records about individuals.

	Female			Male			All individuals
	n1 = 2			n2 = 12			n = n1 + n2 = 14
	Mean		SD	Mean		SD	Mean		SD
Age (year)	55.50	$\pm$	4.95	53.17	$\pm$	9.43	53.50	$\pm$	8.82
Weight (kg)	105.50	$\pm$	6.36	101.92	$\pm$	8.08	102.43	$\pm$	7.75
Height (cm)	170.00	$\pm$	7.07	173.42	$\pm$	6.52	172.93	$\pm$	6.43
BMI (kg/m²)	36.70	$\pm$	5.23	33.75	$\pm$	2.54	34.17	$\pm$	2.96
Photoplethysmography record duration (s)
	Mean		SD	Mean		SD	Mean		SD
COPD group	–	$\pm$	–	28,643.50	$\pm$	11,082.52	28,643.50	$\pm$	11,082.52
Control group	26,041.00	$\pm$	4,963.89	32,611.00	$\pm$	5,351.56	30,421.00	$\pm$	5,798.47

DOI: 10.7717/peerj-cs.1188/table-1

Note:

BMI, Body Mass Index.

Signal pre-processing

Digital filters are applied to the PPG signal to remove noise and to obtain sub-frequency band PPG signals. To eliminate noise, a Chebyshev type II bandpass filter with a passband of 0.1 to 20 Hz was used, followed by a moving average filter, yielding the denoised PPG signal (Şahan et al., 2007). Three sub-frequency bands of the PPG signal were then obtained: the low-frequency (LF) band from 0.04 to 0.15 Hz, the mid-frequency (MF) band from 0.09 to 0.15 Hz, and the high-frequency (HF) band from 0.15 to 6 Hz (Uçar et al., 2018a). After filtering, the four signals (PPG, $PPG_{LF}$, $PPG_{MF}$, $PPG_{HF}$) were split into epochs of $T = 2$ s, and 25 time-domain features were extracted from each epoch. The resulting epoch counts are shown in Table 2.
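The smoothing and epoching steps can be sketched as follows. This is an illustration in Python rather than the MATLAB code used in the study, and it covers only the moving average filter and the split into non-overlapping 2 s epochs; the Chebyshev type II bandpass stage is omitted. The sampling rate `fs_hz` and the window length are assumptions, since neither value is stated in the text.

```python
def moving_average(signal, window):
    """Smooth a signal with a sliding mean; returns len(signal) - window + 1 samples."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    running = sum(signal[:window])
    smoothed = [running / window]
    for i in range(window, len(signal)):
        # Slide the window by one sample: add the new value, drop the oldest.
        running += signal[i] - signal[i - window]
        smoothed.append(running / window)
    return smoothed


def split_into_epochs(signal, fs_hz, epoch_s=2.0):
    """Cut a signal into non-overlapping epochs; a trailing partial epoch is dropped."""
    samples_per_epoch = int(round(fs_hz * epoch_s))
    count = len(signal) // samples_per_epoch
    return [signal[k * samples_per_epoch:(k + 1) * samples_per_epoch]
            for k in range(count)]


# Toy signal with an assumed sampling rate of 2 Hz (4 samples per 2 s epoch).
toy = list(range(10))
print(moving_average([1, 2, 3, 4, 5], window=3))
print(split_into_epochs(toy, fs_hz=2))
```

Non-overlapping epochs are assumed here because the epoch counts in Table 2 are close to half the record durations in seconds reported in Table 1.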

Table 2:

Epoch distribution.

	No	Gender	Epoch count
COPD	1	F	14,323
	2	F	22,213
	3	M	2,248
	4	M	16,228
	5	M	13,978
	6	M	15,673
	7	M	14,428
	8	M	13,093
	Total		112,184
Healthy	9	M	17,263
	10	M	19,393
	11	M	15,463
	12	M	15,463
	13	M	15,463
	14	M	14,773
	Total		97,818

DOI: 10.7717/peerj-cs.1188/table-2

Figure 2 shows the PPG records of the COPD and control groups together with their periodograms computed with the Fast Fourier Transform. As can be seen in the figure, the signal amplitudes of the two groups differ. The graph is included to illustrate this difference visually.

Figure 2: Periodogram graph of the photoplethysmography signal.

DOI: 10.7717/peerj-cs.1188/fig-2

Feature extraction

Four signals were obtained in the preceding step, and 25 features were extracted from each of them. Many features have been used for PPG signals in the literature (Uçar et al., 2021; Uçar et al., 2017). The 25 features extracted in our study are shown in Table 3. The three columns contain the feature number, the feature name, and the formula. The x in the formulas represents the signal. The features were computed using MATLAB (Uçar et al., 2021; Wallisch et al., 2009). In total, 100 features were extracted.

Table 3:

Photoplethysmography properties.

No	Feature name	Formula
1	Kurtosis	$x_{kur} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{4}}{(n - 1) S^{4}}$
2	Skewness	$x_{ske} = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{3}}{(n - 1) S^{3}}$
3	*Interquartile range	$IQR = \mathrm{iqr}(x)$
4	Coefficient of variation	$CV = (S / \bar{x}) \times 100$
5	Geometric average	$G = \sqrt[n]{x_{1} \times \dots \times x_{n}}$
6	Harmonic average	$H = n / (\frac{1}{x_{1}} + \dots + \frac{1}{x_{n}})$
7	Hjorth activity coefficient	$A = S^{2}$
8	Hjorth mobility coefficient	$M = S_{1}^{2} / S^{2}$
9	Hjorth complexity coefficient	$C = \sqrt{(S_{2}^{2} / S_{1}^{2})^{2} - (S_{1}^{2} / S^{2})^{2}}$
10	*Maximum	$x_{max} = \max(x_{i})$
11	Median	$\tilde{x} = \begin{cases} x_{\frac{n + 1}{2}} & n \text{ odd} \\ \frac{1}{2} (x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}) & n \text{ even} \end{cases}$
12	*Median absolute deviation	$MAD = \mathrm{mad}(x)$
13	*Minimum	$x_{min} = \min(x_{i})$
14	*Moment, central moment	$CM = \mathrm{moment}(x, 10)$
15	Average	$\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} = \frac{1}{n} (x_{1} + \dots + x_{n})$
16	Average curve length	$CL = \frac{1}{n} \sum_{i = 2}^{n} | x_{i} - x_{i - 1} |$
17	Average energy	$E = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2}$
18	Root mean square (RMS) value	$X_{rms} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} | x_{i} |^{2}}$
19	Standard error	$S_{\bar{x}} = S / \sqrt{n}$
20	Standard deviation	$S = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}$
21	Shape factor	$SF = X_{rms} / (\frac{1}{n} \sum_{i = 1}^{n} \sqrt{| x_{i} |})$
22	*Singular value decomposition	$SVD = \mathrm{svd}(x)$
23	*25% Trimmed mean value	$T25 = \mathrm{trimmean}(x, 25)$
24	*50% Trimmed mean value	$T50 = \mathrm{trimmean}(x, 50)$
25	Average Teager energy	$TE = \frac{1}{n} \sum_{i = 3}^{n} (x_{i - 1}^{2} - x_{i} x_{i - 2})$

DOI: 10.7717/peerj-cs.1188/table-3

Notes:

* The property was computed using MATLAB.

IQR, Interquartile Range; CV, Coefficient of Variation.

$S^{2}$, variance of the signal x.

$S_{1}^{2}$ , Variance of the 1st derivative of the signal x.

$S_{2}^{2}$ , Variance of the 2nd derivative of the signal x.

After 25 features were extracted from each signal as indicated in the flow diagram, classification was performed in MATLAB using the kNN, SVM, and Ensemble Tree algorithms. Additionally, a hybrid machine learning algorithm was created by combining these algorithms.

Statistical features help to describe samples better. While a single feature may not be informative on its own, several features together can become meaningful when used with artificial intelligence methods. The aim here is therefore to use multiple statistical parameters together with artificial intelligence. These features were preferred because they have been used in different studies in the literature and have proven efficient.
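A few of the time-domain features in Table 3 can be sketched in Python as follows. This is an illustration, not the MATLAB implementation used in the study. The formulas follow the table as printed; in particular, the mobility coefficient is computed as the variance ratio $S_{1}^{2} / S^{2}$, whereas the Hjorth mobility is often defined in the literature as the square root of that ratio. The first difference is used as a stand-in for the first derivative, which is an assumption.

```python
from statistics import variance


def first_difference(x):
    """Discrete approximation of the first derivative (assumed here)."""
    return [b - a for a, b in zip(x, x[1:])]


def hjorth_activity(x):
    """Feature 7: sample variance S^2 of the epoch."""
    return variance(x)


def hjorth_mobility(x):
    """Feature 8 as printed in Table 3: S_1^2 / S^2."""
    return variance(first_difference(x)) / variance(x)


def average_curve_length(x):
    """Feature 16: mean absolute difference between consecutive samples."""
    return sum(abs(d) for d in first_difference(x)) / len(x)


def average_energy(x):
    """Feature 17: mean of the squared samples."""
    return sum(v * v for v in x) / len(x)


def average_teager_energy(x):
    """Feature 25: (1/n) * sum of x[i-1]^2 - x[i] * x[i-2]."""
    return sum(x[i - 1] ** 2 - x[i] * x[i - 2] for i in range(2, len(x))) / len(x)


# Toy epoch for illustration; real epochs hold 2 s of PPG samples.
epoch = [1.0, 2.0, 4.0, 7.0]
features = {
    "activity": hjorth_activity(epoch),
    "mobility": hjorth_mobility(epoch),
    "curve_length": average_curve_length(epoch),
    "energy": average_energy(epoch),
    "teager": average_teager_energy(epoch),
}
print(features)
```

Applying such functions to every epoch of each of the four signals would produce one feature vector per epoch, which is the input to the feature selection step.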

Feature selection

The number of features affects machine learning performance in both positive and negative ways (Uçar et al., 2020). Feature selection is used to isolate the negative effects. The feature selection procedure ranks the features from most relevant to least relevant according to how well each feature predicts the class label. The researcher can then add as many features to the dataset as desired, starting from the most relevant. As a consequence, more detailed findings can be obtained and programs run faster because unnecessary data are not used. Feature selection methods are often used to select a smaller subset of more discriminative features, with the goal of improving classification performance (Kohavi & John, 1997; Isabelle & Elisseeff, 2000; Eskidere, 2012). In this research, Fisher's feature selection algorithm was used due to its high performance (Uçar et al., 2020). The features selected in the study are summarized in Table 4. In the table, F is the feature number and R is the correlation level, which indicates how strongly the feature is associated with the class label. The features are ranked with the best-correlated features at the top.

Table 4:

Feature selection from signals for the entire data set.

S	PPG		PPG LF		PPG MF		PPG HF
No	F	R	F	R	F	R	F	R
1	17	0.082	2	0.027	2	0.029	8	0.081
2	8	0.062	1	0.021	1	0.022	1	0.060
3	25	0.041	11	0.021	6	0.022	25	0.042
4	11	0.039	4	0.021	11	0.022	14	0.042
5	14	0.039	8	0.019	8	0.020	7	0.042
6	22	0.039	16	0.019	25	0.020	3	0.041
7	2	0.039	6	0.018	18	0.019	24	0.040
8	9	0.038	25	0.018	20	0.018	9	0.039
9	3	0.033	22	0.018	14	0.018	18	0.039
10	19	0.032	17	0.003	17	0.007	19	0.037
11	1	0.017	13	0.003	13	0.007	11	0.021
12	21	0.007	15	0.002	15	0.002	4	0.005
13	16	0.005	14	0.002	4	0.002	13	0.005
14	6	0.003	20	0.001	3	0.001	21	0.003
15	4	0.003	3	0.000	16	0.000	6	0.002
16	7	0.001	12	0.000	7	0.000	20	0.002
17	12	0.001	7	0.000	22	0.000	17	0.002
18	13	0.001	21	0.000	12	0.000	12	0.001
19	15	0.001	19	0.000	21	0.000	15	0.001
20	10	0.000	5	0.000	5	0.000	5	0.001
21	20	0.000	18	0.000	19	0.000	16	0.001
22	23	0.000	10	0.000	10	0.000	2	0.000
23	5	0.000	23	0.000	23	0.000	10	0.000
24	18	0.000	24	0.000	24	0.000	22	0.000
25	24	0.000	9	0.000	9	0.000	23	0.000

DOI: 10.7717/peerj-cs.1188/table-4

Note:

S, Signal; F, Feature; R, Correlation coefficient.

Correlation coefficients range from 0 to 1, where 1 indicates the highest correlation. The correlation ranges are interpreted as follows: $0 < R < 0.19$ is a negligible relationship, $0.2 < R < 0.39$ is a weak relationship, $0.4 < R < 0.69$ is a moderate relationship, $0.7 < R < 0.89$ is a strong relationship, and $0.9 < R < 1$ is a very strong relationship.
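The ranking idea behind this section can be sketched with a common textbook form of the two-class Fisher score, $(\mu_{1} - \mu_{0})^{2} / (\sigma_{1}^{2} + \sigma_{0}^{2})$. This is an assumption: the text does not give the exact formula, and the R values in Table 4 are bounded between 0 and 1, so they may have been computed or normalized differently. The feature names and data below are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one feature; labels are 0 (healthy) or 1 (COPD)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        return 0.0  # Degenerate case: no within-class spread to normalize by.
    return (mean(pos) - mean(neg)) ** 2 / spread


def rank_features(columns, labels):
    """Return feature names ordered from highest to lowest Fisher score."""
    scores = {name: fisher_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: f1 separates the classes well, f2 does not.
labels = [0, 0, 1, 1]
columns = {"f1": [1.0, 2.0, 9.0, 10.0], "f2": [1.0, 5.0, 2.0, 6.0]}
print(rank_features(columns, labels))
```

A feature whose class means are far apart relative to the within-class spread receives a high score and appears near the top of the ranking.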

Machine learning

Machine learning is the computer-based modeling of systems that make predictions by drawing inferences from data using mathematics and statistics (Arslankaya & Toprak, 2021). Classification, which has a wide range of uses, is one of the problems that machine learning can solve, and many problems today can be formulated and solved as classification problems. kNN, SVM, and ET models were employed in this work. These methods were chosen for their short training duration and high accuracy rates (Rasool et al., 2019; Uçar et al., 2017).

To reduce errors in the analysis, a hybrid machine learning structure was also created (Aydilek & Arslan, 2013, 2012; Tosunoğlu et al., 2021). The chosen methods are among the most frequently used machine learning algorithms because they have attained successful results in the literature. In addition, these algorithms are suitable for transfer to embedded systems (Roscher et al., 2020; Santos, Moreno & Estombelo-Montesco, 2019; Saguil & Azim, 2019). Of the available data, 50% was used for training the models and 50% for testing.

Hyperparameter optimization was performed for all machine learning algorithms used. Parameters were not changed manually; all parameters were tuned automatically by MATLAB to minimize the 5-fold cross-validation loss.
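The 50/50 division of the data can be sketched as a class-wise (stratified) split, so that each class contributes half of its epochs to training and half to testing. This is an assumption about how the split was made; the function name and the fixed random seed are hypothetical.

```python
import random


def stratified_half_split(labels, seed=0):
    """Split sample indices 50/50 within each class; odd remainders go to training."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        half = (len(indices) + 1) // 2
        train.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(train), sorted(test)


# Toy label vector: four COPD epochs (1) and five healthy epochs (0).
labels = [1] * 4 + [0] * 5
train_idx, test_idx = stratified_half_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the class proportions of the training and test sets nearly identical, which matters when the classes are not exactly balanced.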

Support vector machines algorithm (SVMs)

SVMs are among the best-known supervised learning algorithms (Cortes & Vapnik, 1995; Uçar, 2017). The method is based on finding the decision surface that maximizes the distance to the nearest examples, called support vectors, of two linearly separable classes, and thereby determining the class boundary. Maximizing this distance is written as a constrained quadratic optimization problem and converted to its dual form. Although developed for linear problems, this approach can be generalized to nonlinearly separable problems using kernel transformations (Akben, Subasi & Kiymik, 2010; Fernandes de Mello & Antonelli Ponti, 2018).

When selecting a machine learning algorithm for a classification problem, one of the essential criteria is the generalization performance of the algorithm (Ayhan & Erdoğmuş, 2014). To separate points placed on a plane, a line is drawn (Fig. 3) so that it lies at the maximum distance from the points of both classes. To find this border, two parallel lines are drawn, one along the edge of each class, and the boundary line is placed midway between them. The SVM method is thus based on estimating the most appropriate function for separating the data. SVMs have a simple structure, perform well in practical applications, and are user-friendly. SVMs do not depend strongly on the number of samples, and after training they can classify unseen data well. This generalization ability makes SVMs a good alternative to other techniques (Kecman, 2002).

Figure 3: SVMs algorithm general flow diagram.

DOI: 10.7717/peerj-cs.1188/fig-3

For all classifiers, the ranked features were divided into 10 different feature sets (5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% of the features). Parameter optimization was applied to the SVMs, which improves performance.
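The construction of the ten feature sets can be sketched as below, assuming that each set contains the top-ranked fraction of the features produced by the feature selection step. The text does not state this explicitly, so it should be read as one plausible interpretation; the function names are hypothetical.

```python
PERCENTAGES = list(range(5, 55, 5))  # 5%, 10%, ..., 50%


def top_percent(ranked_features, percent):
    """Keep the best `percent` of a ranked feature list (at least one feature)."""
    keep = max(1, (len(ranked_features) * percent) // 100)
    return ranked_features[:keep]


def build_feature_sets(ranked_features):
    """Map each percentage to the corresponding top-ranked feature subset."""
    return {p: top_percent(ranked_features, p) for p in PERCENTAGES}


# Hypothetical ranking of 100 features identified by number, best first.
ranked = list(range(1, 101))
feature_sets = build_feature_sets(ranked)
print({p: len(s) for p, s in feature_sets.items()})
```

With 100 features, the sets contain 5, 10, and so on up to 50 features, and every larger set includes all features of the smaller ones.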

k-nearest neighbor algorithm (kNN)

kNN is a basic machine learning method that uses the supervised learning approach (Şahan et al., 2007; Uçar, 2017). Although it can be used for both classification and regression problems, it is mostly used for classification.

The kNN algorithm was proposed by Cover & Hart (1967). It operates on a sample set whose classes are known. When a new data point is to be classified, its distance to the available data is calculated, and its k closest neighbors are examined. Three distance functions are generally used for this calculation: (1) the Euclidean distance, (2) the Manhattan distance, and (3) the Minkowski distance.

The similarity of the data to be classified to the data in the learning set is calculated, and a class is assigned according to the threshold value determined from the k data points considered closest (Fig. 4) (Duman et al., 2021). It is important that the characteristics of each class are clearly defined in advance. The performance of the method is affected by the number of nearest neighbors k, the threshold value, the similarity measure, and whether the learning set contains a sufficient number of expected behaviors. The first step in classification with kNN is selecting the value of k. A large k may group dissimilar data together. In studies, k is generally chosen as 3, 5, or 7 (Uçar, 2017; Khan, Ding & Perrizo, 2002).
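A minimal kNN classifier along these lines can be sketched as follows. The Minkowski distance with exponent `p` covers the other two distances as special cases (`p=1` is Manhattan, `p=2` is Euclidean). The class is assigned here by a simple majority vote among the k nearest neighbors, which is the most common variant and an assumption relative to the threshold-based description in the text. The data points are hypothetical.

```python
from collections import Counter


def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["healthy", "healthy", "healthy", "copd", "copd", "copd"]
print(knn_predict(train_x, train_y, (0.2, 0.2), k=3))
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))
```

Choosing an odd k for a two-class problem avoids tied votes, which is one reason values such as 3, 5, or 7 are common.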

Figure 4: kNN algorithm general flow diagram.

DOI: 10.7717/peerj-cs.1188/fig-4

Ensemble Tree (ET)

An ensemble of trees makes predictions by gathering the results from individual decision trees (DT) (Huang, Zhao & Huang, 2021). ET is a machine learning approach that is commonly used for regression and classification problems (Breiman, 2001, 1996). Its basic working principle is to break a classification problem down into multiple steps, each of which is a simple decision (Çölkesen & Kavzoğlu, 2017). Classification algorithms try to predict which class an object belongs to. Many classification approaches select a single method suited to the problem, make the necessary optimizations, and try to achieve high accuracy rates.

Ensemble methods instead combine the predictions of multiple base models to produce more robust and generalizable results than a single model. The success of these methods depends on two criteria: the learning success of the base learners and how much they differ from each other. Combining models does not guarantee an improvement, and performance can sometimes drop.

The Ensemble Tree classifier is a system constructed by merging many classifiers to give more consistent and dependable predictions. The system is built from N classifiers, where N may be odd or even. During classification, the output values produced by each classifier are counted, and the decision of the ensemble classifier is determined by majority vote.

In this study, three classifiers were used: SVMs, kNN, and Ensemble Tree. The study was carried out in the MATLAB environment.

Hybrid artificial intelligence method (HAI)

Today, organizations are increasingly adopting artificial intelligence in place of conventional operational solutions and rapidly integrating it into their business processes (Deliloğlu & Çakmak Pehlivanlı, 2021). The Hybrid Artificial Intelligence method combines the outputs of the individual classifiers and returns the answer given by the majority (Fig. 5). In this way, weaker classifiers are brought together to obtain a more robust one. Model stability tends to increase as the number of classifiers increases.
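The majority-vote combination described above can be sketched as follows. The three prediction lists stand in for the outputs of the SVM, kNN, and Ensemble Tree classifiers on the same epochs; their values are hypothetical. With three classifiers and two classes a tie cannot occur, so no tie-breaking rule is defined here.

```python
from collections import Counter


def majority_vote(per_classifier_predictions):
    """Combine equally long prediction lists into one list by majority vote."""
    combined = []
    for votes in zip(*per_classifier_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions for four epochs (1 = COPD, 0 = healthy).
svm_pred = [1, 0, 1, 1]
knn_pred = [1, 1, 0, 1]
et_pred = [0, 0, 1, 1]
print(majority_vote([svm_pred, knn_pred, et_pred]))
```

Each epoch receives the label that at least two of the three classifiers agree on, so an error made by a single classifier is outvoted.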

Figure 5: Hybrid artificial intelligence model algorithm general flow diagram.

DOI: 10.7717/peerj-cs.1188/fig-5

Performance assessment criteria

Various performance assessment criteria were used to evaluate the proposed systems: specificity, sensitivity, the kappa coefficient, accuracy, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the k-fold cross-validation accuracy (Uçar et al., 2020).

For classification, the feature sets were divided into training (50%) and test (50%) data sets (Table 5). The data obtained from the 14 individuals, including the COPD and control groups, were split into 2-second epochs.

Table 5:

Training and test.

	Percent	COPD	Healthy	Total
Training	50%	56,092	53,408	109,500
Test	50%	56,092	53,407	109,499
Total	100%	112,184	106,815	218,999

DOI: 10.7717/peerj-cs.1188/table-5

Across the training and test sets, the total number of COPD epochs is 112,184 and the total number of healthy epochs is 106,815. The best performance was obtained by combining the classification algorithms in the hybrid artificial intelligence method.

For the calculation of the performance values, the confusion matrix was created, and the performance parameters were calculated (Table 6).

Table 6:

Confusion matrix.

		Predicted
		P	N
Actual situation	P	TP	FN
	N	FP	TN

DOI: 10.7717/peerj-cs.1188/table-6

The kappa value is interpreted according to the ranges in Table 7. According to these ranges, kappa values above 0.81 indicate very good agreement.

Table 7:

Kappa coefficients boundary ranges.

Kappa coefficient	Interpretation
0.81–1.00	Very good agreement
0.61–0.80	Good agreement
0.41–0.60	Moderate agreement
0.21–0.40	Low agreement
0.00–0.20	Poor agreement
<0.00	Very poor agreement

DOI: 10.7717/peerj-cs.1188/table-7
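The criteria derived from the confusion matrix in Table 6 can be sketched as follows, using the standard definitions of accuracy, sensitivity, specificity, and Cohen's kappa. The counts in the example are hypothetical and are not results from this study. Values that fall in the gaps between the ranges of Table 7 (for example, 0.805) are assigned to the lower band, which is an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from a 2x2 matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - chance) / (1 - chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


def interpret_kappa(kappa):
    """Map a kappa value to the bands of Table 7."""
    if kappa < 0:
        return "very poor"
    for lower, label in [(0.80, "very good"), (0.60, "good"),
                         (0.40, "moderate"), (0.20, "low")]:
        if kappa > lower:
            return label
    return "poor"


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
print(metrics, interpret_kappa(metrics["kappa"]))
```

Kappa corrects the observed accuracy for the agreement that would be expected by chance, which makes it more informative than accuracy alone when the classes are unevenly sized.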

Results

The results obtained in the study are presented in this section. The goal of this study is to use artificial intelligence with PPG signals to diagnose chronic obstructive pulmonary disease (COPD). For this purpose, the study was organized as follows. First, the PPG signals received from individuals (“Data collection”) were divided into three sub-frequency bands (“Signal pre-processing”). Then, 25 features were extracted from the photoplethysmography signal and from each of the three sub-frequency bands (“Feature extraction”). In the next step, the feature groups produced by the feature selection algorithm were used to predict whether each individual has COPD (“Feature selection”). Finally, performance assessment criteria were used to evaluate the proposed models (“Performance assessment criteria”).

COPD was first predicted using all the features of the PPG signal and of the three sub-frequency bands, both separately and together, and the resulting models were assessed with the performance evaluation criteria (Table 8). The calculated performance evaluation criteria are very close to 1. The accuracy of the model created with the PPG signal is approximately 95%. The success rates of the models built on the sub-frequency bands of the PPG are over 80%. The sensitivity and specificity values of the models are balanced and above 0.85.

Table 8:

Results by all features for all records.