Electroencephalography (EEG) based epilepsy diagnosis via multiple feature space fusion using shared hidden space-driven multi-view learning

PeerJ Computer Science

Introduction

Epilepsy is a chronic, non-communicable disease of the brain that affects people of all ages and is caused by paroxysmal, abnormally hypersynchronous discharges of brain neurons. It is one of the most common neurological diseases globally. Because its clinical manifestations are diverse and complex, epilepsy is often misdiagnosed or missed. Repeated seizures can have a persistent negative impact on a patient's mental and cognitive functions and can even be life-threatening. The study of epilepsy diagnosis and treatment therefore has important clinical significance.

The electroencephalogram (EEG) is a microvolt-level electrical signal, generated by synchronized neuronal activity in the brain, that is recorded by electrodes placed at specific locations on the scalp. As the most commonly used and least expensive non-invasive method for recording brain activity, EEG has been studied for over 70 years and remains the most effective tool for diagnosing epilepsy-related conditions, such as identifying seizures, predicting their occurrence, and localizing the affected brain regions. With the development of artificial intelligence, machine learning models are now used extensively for automatic epilepsy recognition.

Feature representation is a crucial step in machine learning, and research has shown that EEG signals can be represented by both linear and nonlinear features. Time-domain features are the fundamental features in EEG signal processing; they are extracted by directly observing and computing relevant characteristics of the raw signal, and their advantages are simplicity of computation and ease of interpretation. However, they are easily affected by the non-stationarity of EEG signals, individual differences, and external interference. Frequency-domain features exploit the significant changes in EEG energy during epileptic seizures, under the assumption that the background EEG is approximately stationary. Most frequency-domain features are derived from the signal power spectrum, and various parameter estimation methods can be used to extract spectral features; the accuracy of these estimates in turn affects the quality of the features. In terms of information content, neither pure time-domain nor pure frequency-domain features can comprehensively characterize an EEG signal, and EEG analysis based on the stationarity assumption is not rigorous. Researchers have therefore turned to time-frequency analysis methods, such as time-frequency transformations, to re-represent non-stationary EEG signals and extract corresponding features. In addition to these linear features, many studies treat the brain as a nonlinear system and extract nonlinear features describing changes in complexity, persistence, synchrony, and other system properties. Such features are not affected by the non-stationarity of EEG signals and offer more flexibility for issues such as multi-channel correlation and channel loss.

Based on these linear and nonlinear feature representations, numerous scholars have constructed machine learning models for the automatic diagnosis of epilepsy. For example, Li, Chen & Zhang (2016) employed a dual-tree complex discrete wavelet transform to extract nonlinear features from individual components, used ANOVA to select relevant classification features, including the Hurst parameter and fuzzy entropy, and employed a support vector machine (SVM) for classification. Reddy & Rao (2017) computed the central correlated entropy of wavelet components obtained from the tunable Q-factor wavelet transform and used models such as random forests, logistic regression, and multi-layer perceptrons for epileptic signal recognition. Jaiswal & Banka (2017) proposed a feature extraction method called local gradient pattern transformation and applied classifiers such as k-nearest neighbors, SVM, and decision trees for epilepsy detection.

The aforementioned machine learning-based diagnostic models use a single EEG feature representation, which keeps model complexity low and interpretability high. However, these models rely on expert knowledge, deep features are not easily observed or extracted, and accuracy is consequently limited. Multi-view learning (Zhao et al., 2017; Jiang et al., 2020; Zhang, Chung & Wang, 2018; Yan et al., 2021) improves classification accuracy by exploiting the differences and similarities between multiple views, following the principles of view consistency and complementarity. For example, Tian et al. (2019) used a convolutional neural network (CNN) to extract deep features from EEG signals in the time, frequency, and time-frequency domains; these features were treated as three views, and multi-view learning was conducted with a multi-view Takagi-Sugeno-Kang (TSK) fuzzy system, improving classification and detection performance over any single view. Yuan et al. (2018) achieved multi-view automatic epilepsy diagnosis by using channel-aware techniques and autoencoders (AE) to extract channel characteristics and intra-channel time-frequency features from multi-channel EEG signals. Liu & Li (2019) used a user-sensitive model for channel selection, extracted time-frequency features from each sub-band of the selected channels to form multi-view features, extracted numerical and morphological features with a common spatial projection matrix, and used a maximum mean discrepancy autoencoder to extract inter-channel time-frequency features, enabling multi-view automatic diagnosis of epilepsy. Such effective co-regularization-based models can construct a common feature space for multi-view learning, but they also have limitations. They construct the density distribution of each view solely from that view's observed data, overlooking the correlated information among all views. They also separate the original sample space from the common space obtained through mapping, learning only in the common space and neglecting the discriminative information present in the original space.

To overcome these shortcomings, in this study a shared hidden feature space is constructed using kernel density estimation and combined with the original space to form an expanded space. SVM is then introduced, and a multi-view SVM based on the shared hidden space is proposed that carefully considers the differences and relationships between samples from different views. Experiments on several multi-view datasets confirm the effectiveness of this method in addressing the challenges mentioned above. The contributions of this study are mainly the following:

(1) The kernel density estimation (KDE) technique is used to construct a new shared hidden space, which is combined with the original space to form an expanded space for multi-view learning, effectively addressing the multi-view learning issue described above.

(2) By constructing the expanded space and learning from both the shared hidden space and the original space, the model fully exploits the relevant information of samples within and across views, effectively addressing the problem that samples of the same class from different views can differ more than samples of different classes from the same view.

(3) During optimization, the proposed model is transformed into a classical quadratic programming (QP) problem, so that existing optimization methods, which are both highly efficient and theoretically well understood, can be applied directly.

The remaining sections are organized as follows. 'Data' introduces the EEG data used in this study and the corresponding multiple feature space representations. 'Methodology' presents the proposed model. 'Experimental studies' reports the experimental results, and the final section summarizes the whole study.

Data

The EEG data of epileptic patients used in this study were authorized and provided by the University of Bonn, Germany (Andrzejak et al., 2001), as summarized in Table 1. The dataset is divided into five groups, namely A, B, C, D, and E, each containing 100 single-channel EEG segments of 23.6 s sampled at 173.6 Hz. The signals of groups A and B were collected from healthy volunteers in a relaxed, conscious state, with the volunteers' eyes open during the recording of group A and closed during the recording of group B. The remaining three groups were collected from epileptic patients: group C from the hippocampal formation of the hemisphere opposite the epileptogenic zone, and group D from within the epileptogenic zone. The signals of groups C and D were measured during seizure-free intervals, while group E was recorded during epileptic seizures. Figure 1 shows an example of EEG signals from the five groups.

Table 1:
Basic collection information of epilepsy EEG signals.
Group #Volunteers Collection information
A 100 This group was collected from a group of healthy volunteers who were instructed to keep their eyes open during the recording process. These volunteers did not have any known neurological or psychiatric disorders and were not experiencing any abnormal symptoms at the time of data collection.
B 100 This group was collected from a group of healthy volunteers under conditions where they kept their eyes closed.
C 100 This group was collected from the hippocampal formation of the contralateral hemisphere of the brain during seizure-free intervals. These samples were obtained when the patient was not experiencing any epileptic seizures.
D 100 This group was collected from the epileptogenic zone during periods of seizure freedom. This implies that the recordings were obtained when the patient was not experiencing seizures.
E 100 This group was collected during the seizure activity phase, offering a unique opportunity to study the dynamics and temporal evolution of epileptic seizures and paving the way for the development of more accurate and reliable seizure detection and prediction algorithms.
DOI: 10.7717/peerj-cs.1874/table-1

Figure 1: EEG signals from five groups.

Frequency-domain representation extraction

Frequency-domain feature representation originates from the significant changes in EEG energy during epileptic seizures. To extract the frequency-domain representation from EEG signals, a Daubechies-4 wavelet decomposition is applied to split the original signals into a series of dyadic sub-bands. The frequency band associated with each Daubechies4 wavelet coefficient is given in Table 2. With these settings, the EEG signals are divided into six distinct frequency bands. An illustrative example of the decomposed signals from group E is depicted in Fig. 2.
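For concreteness, the following is a minimal sketch of such a decomposition, assuming a five-level Daubechies-4 transform (six sub-bands, approximating the bands of Table 2) and log sub-band energy as the per-band feature; the helper name and the feature statistic are our illustrative choices, not the authors' exact settings.

```python
# A hedged sketch: five-level db4 decomposition into six dyadic sub-bands,
# summarized by log sub-band energy.
import numpy as np
import pywt

def extract_freq_features(signal, wavelet="db4", level=5):
    # coeffs = [A5, D5, D4, D3, D2, D1]: approximation plus five detail bands
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

fs, duration = 173.6, 23.6                        # Bonn sampling rate, length
segment = np.random.randn(int(fs * duration))     # stand-in for an EEG segment
print(extract_freq_features(segment))             # six-dimensional features
```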

Table 2:
Frequency band of each Daubechies4 wavelet coefficient.
Coefficient Frequency band
Daubechies4 (4, 0) 0–2 Hz
Daubechies4 (4, 5) 2–4 Hz
Daubechies4 (4, 4) 4–8 Hz
Daubechies4 (4, 3) 8–15 Hz
Daubechies4 (4, 2) 16–30 Hz
Daubechies4 (4, 1) 31–60 Hz
DOI: 10.7717/peerj-cs.1874/table-2

Figure 2: Example of frequency-domain representation.

Time-domain feature extraction

Time-domain features are the fundamental features in EEG signal processing, primarily extracted by directly observing and calculating relevant characteristics of the raw signal. Their advantages lie in their simplicity of computation and ease of interpretation. In this study, we apply kernel principal component analysis (KPCA) (Li et al., 2022b) to the raw EEG signals to enable complex nonlinear mapping. Previous research has shown that KPCA features offer discriminative patterns suitable for pattern recognition. An example of KPCA features from group E is shown in Fig. 3.
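A minimal sketch of this step, assuming an RBF kernel; the number of components and the kernel parameter are illustrative choices, as the exact settings are not specified here.

```python
# A hedged sketch of the KPCA-based time-domain representation.
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.randn(500, 4097)                    # stand-in raw EEG segments
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-4)
X_kpca = kpca.fit_transform(X)                    # nonlinear features, 500 x 20
```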


Figure 3: Example of time-domain representation.

Time-frequency representation extraction

Pure time-domain or frequency-domain feature representations alone cannot comprehensively characterize an EEG signal, and EEG analysis based on the assumption of stationarity is not rigorous. Therefore, researchers have turned their attention to time-frequency analysis methods, such as time-frequency transformations, to re-represent non-stationary EEG signals and extract corresponding features. To capture time-frequency representation, researchers often employ the short-time Fourier transform (STFT) (Li et al., 2022a). STFT allows for the analysis of how the frequency content of a signal changes over time. It can be formulated as follows:

$$F_{timefre}(time,fre)=\int_{-\infty}^{+\infty} x(time)\,g(time-u)\,e^{-j2\pi\, fre\cdot time}\,d(time) \tag{1}$$

In the context of EEG signal analysis, Eq. (1) transforms the continuous EEG signal, denoted $x(time)$, into the time-frequency plane using the window function $g(time-u)$, a window of limited width centered around $u$. The transform $F_{timefre}(time,fre)$ provides a means to examine the time-varying nature of the EEG signals, revealing local spectral discrepancies at different time points. To achieve this, the EEG signals are partitioned by the STFT into several segments of locally stationary signal, capturing the time-varying characteristics of the EEG and highlighting variations in the spectrum. Six energy bands are extracted as features using Eq. (1), taking these observed discrepancies into account. A visualization of the six energy bands, exemplified by group E, is shown in Fig. 4.
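The following is a hedged sketch of such band-energy extraction built on Eq. (1), assuming scipy's STFT and the six bands of Table 2; the window length and overlap defaults are illustrative choices rather than the authors' settings.

```python
# A hedged sketch: STFT power summed inside six clinical frequency bands.
import numpy as np
from scipy.signal import stft

fs = 173.6
bands = [(0, 2), (2, 4), (4, 8), (8, 15), (16, 30), (31, 60)]  # Hz, Table 2

def stft_band_energies(signal, fs=fs, nperseg=256):
    f, t, Z = stft(signal, fs=fs, nperseg=nperseg)
    power = np.abs(Z) ** 2                        # time-frequency power
    return np.array([power[(f >= lo) & (f < hi)].sum() for lo, hi in bands])

segment = np.random.randn(int(fs * 23.6))         # stand-in EEG segment
print(stft_band_energies(segment))                # six energy features
```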


Figure 4: Example of time-frequency representation.

Methodology

In this section, we design a shared hidden space-driven multi-view learning method to fuse the time-frequency, frequency-domain, and time-domain representations.

Construction of shared hidden feature space

Suppose that $\Omega\in\mathbb{R}^{r\times d}$ is an orthogonal matrix subject to $\Omega\Omega^T=I\in\mathbb{R}^{r\times r}$, $f_A=\{x_i^A,y_i\mid x_i^A\in\mathbb{R}^d,\ i=1,2,\dots,N\}$ represents one kind of feature space, e.g., the time-domain feature space, and $f_B=\{x_i^B,y_i\mid x_i^B\in\mathbb{R}^d,\ i=1,2,\dots,N\}$ represents another. The hidden feature spaces of $f_A$ and $f_B$ can then be generated by $\Omega x_i^A\in\mathbb{R}^r$ and $\Omega x_i^B\in\mathbb{R}^r$, respectively, where $r$ is the number of hidden features. To obtain a consistent hidden feature space between $\Omega x_i^A$ and $\Omega x_i^B$, the difference between them should be minimized as much as possible. Kernel density estimation (KDE), one of the non-parametric estimation methods in probability theory, is commonly used to estimate an unknown probability density function (Wang, Wang & Chung, 2013). For a training set $X=\{x_i,y_i\mid x_i\in\mathbb{R}^d,\ i=1,2,\dots,N\}$, the corresponding kernel density estimate can be expressed as

$$P(x)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\delta}K\!\left(\frac{x-x_i}{\delta}\right) \tag{2}$$
where $\delta$ is the kernel width and $K(\cdot)$ is the kernel function. If the Gaussian kernel function is adopted, then Eq. (2) becomes $P(x)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\delta\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\left(\frac{x-x_i}{\delta}\right)^2\right)$. Therefore, with the Gaussian kernel, the kernel density estimates of $\Omega x_i^A$ and $\Omega x_i^B$ can be expressed respectively as

$$P_A(\tilde{x})=P_A(\Omega x)=\frac{1}{N\delta\sqrt{2\pi}}\sum_{i=1}^{N}e^{-\frac{\|\Omega x-\Omega x_i^A\|^2}{2\delta^2}} \tag{3}$$

$$P_B(\tilde{x})=P_B(\Omega x)=\frac{1}{N\delta\sqrt{2\pi}}\sum_{i=1}^{N}e^{-\frac{\|\Omega x-\Omega x_i^B\|^2}{2\delta^2}} \tag{4}$$

In this study, the difference between $P_A(\tilde{x})$ and $P_B(\tilde{x})$ is measured by the mean squared error, that is,

$$J=\int\left(P_A(\tilde{x})-P_B(\tilde{x})\right)^2 d\tilde{x} \tag{5}$$

By minimizing $J$, the two-view data $x_i^A$ and $x_i^B$ are made to have maximum commonality in the shared hidden space, which addresses the challenge of excessive variability between samples from different views. To solve Eq. (5), let $G(\Omega x,\Omega x_i,\delta^2)=\frac{1}{\delta\sqrt{2\pi}}e^{-\frac{\|\Omega x-\Omega x_i\|^2}{2\delta^2}}$; then $P_A(\tilde{x})=\frac{1}{N}\sum_{i=1}^{N}G(\Omega x,\Omega x_i^A,\delta^2)$ and $P_B(\tilde{x})=\frac{1}{N}\sum_{i=1}^{N}G(\Omega x,\Omega x_i^B,\delta^2)$. Therefore, Eq. (5) can be computed as $J=\int P_A^2(\tilde{x})d\tilde{x}-2\int P_A(\tilde{x})P_B(\tilde{x})d\tilde{x}+\int P_B^2(\tilde{x})d\tilde{x}$. According to Wang, Wang & Chung (2013) and Hansen, Jaumard & Xiong (1994), $\int G(x,x_i,\delta_1^2)\,G(x,x_j,\delta_2^2)\,dx=G(x_i,x_j,\delta_1^2+\delta_2^2)$. Therefore, we have the following equations:

$$\int P_A^2(\tilde{x})d\tilde{x}=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^A,2\delta^2)=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^A,2\delta^2)\right] \tag{6}$$

$$\int P_B^2(\tilde{x})d\tilde{x}=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G(\tilde{x}_i^B,\tilde{x}_j^B,2\delta^2)=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{N}\sum_{j=1}^{N}G(\tilde{x}_i^B,\tilde{x}_j^B,2\delta^2)\right] \tag{7}$$

$$\int P_A(\tilde{x})P_B(\tilde{x})d\tilde{x}=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^B,2\delta^2) \tag{8}$$
where $\frac{1}{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^A,2\delta^2)$ can be taken as another estimate of $P_A(\tilde{x}_i^A)$. Therefore, $\int P_A^2(\tilde{x})d\tilde{x}$ can be estimated by $\frac{1}{N}\sum_{i=1}^{N}P_A(\tilde{x}_i^A)$, and further by $\frac{1}{N}$. Similarly, $\int P_B^2(\tilde{x})d\tilde{x}$ can be estimated by $\frac{1}{N}$. Thus, we finally have $J\approx\frac{1}{N}+\frac{1}{N}-\frac{2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^B,2\delta^2)$. Therefore, we have the following objective:

$$\arg\min_{\Omega}J\;\approx\;\arg\min_{\Omega}\left(-\sum_{i=1}^{N}\sum_{j=1}^{N}G(\tilde{x}_i^A,\tilde{x}_j^B,2\delta^2)\right),\qquad \text{s.t. }\ \Omega\Omega^T=I_{r\times r} \tag{9}$$

However, it is difficult to solve Eq. (9) directly, so a Taylor expansion is used to obtain an approximate solution. Hence, we have

$$G(\tilde{x}_i^A,\tilde{x}_j^B,2\delta^2)=\frac{1}{2\sqrt{\pi}\,\delta}\,e^{-\frac{\|\Omega x_i^A-\Omega x_j^B\|^2}{4\delta^2}}\approx\frac{1}{2\sqrt{\pi}\,\delta}\left(1-\frac{\|\Omega x_i^A-\Omega x_j^B\|^2}{4\delta^2}\right) \tag{10}$$

Therefore, Eq. (9) can be further updated as

$$\arg\min_{\Omega}\sum_{i=1}^{N}\sum_{j=1}^{N}\|\Omega x_i^A-\Omega x_j^B\|^2,\qquad \text{s.t. }\ \Omega\Omega^T=I_{r\times r} \tag{11}$$
In Eq. (11), the implicit feature transformation matrix $\Omega$ still cannot be solved for directly, but it can be obtained by the gradient descent method. Thus, Eq. (11) can be expanded as

$$J=\arg\min_{\Omega}\sum_{i=1}^{N}\sum_{j=1}^{N}\left((x_i^A)^T\Omega^T\Omega x_i^A+(x_j^B)^T\Omega^T\Omega x_j^B-2(x_i^A)^T\Omega^T\Omega x_j^B\right),\qquad \text{s.t. }\ \Omega\Omega^T=I_{r\times r} \tag{12}$$

The partial derivative of J w.r.t. Ω is

$$\frac{\partial J}{\partial\Omega}=\sum_{i=1}^{N}\sum_{j=1}^{N}\left(2\Omega x_i^A(x_i^A)^T+2\Omega x_j^B(x_j^B)^T-2\Omega\left(x_i^A(x_j^B)^T+x_j^B(x_i^A)^T\right)\right) \tag{13}$$

Then the transformation matrix $\Omega$ can be solved by the gradient descent method, that is,

$$\Omega\leftarrow\Omega-\eta\left(I_{r\times r}-\Omega\Omega^T\right)\frac{\partial J}{\partial\Omega}=\Omega-\eta\,\tilde{\Omega} \tag{14}$$
where $\tilde{\Omega}$ denotes the projected gradient and $\eta$ is the step size, which can be obtained by

$$\eta=\frac{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\left((x_i^A)^T(\Omega^T\tilde{\Omega}+\tilde{\Omega}^T\Omega)x_i^A+(x_j^B)^T(\Omega^T\tilde{\Omega}+\tilde{\Omega}^T\Omega)x_j^B-2(x_i^A)^T(\Omega^T\tilde{\Omega}+\tilde{\Omega}^T\Omega)x_j^B\right)}{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\left(2(x_i^A)^T\tilde{\Omega}^T\tilde{\Omega}x_i^A+2(x_j^B)^T\tilde{\Omega}^T\tilde{\Omega}x_j^B-4(x_i^A)^T\tilde{\Omega}^T\tilde{\Omega}x_j^B\right)} \tag{15}$$

According to the above analysis and derivation, the algorithm for solving the implicit feature transformation matrix $\Omega$ is summarized in Algorithm 1.
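To make Eqs. (11)-(14) concrete, the following is a minimal sketch of Algorithm 1 under stated assumptions: a fixed step size `eta` in place of the line search of Eq. (15), and QR re-orthonormalization in place of the projection in Eq. (14); `shared_hidden_space` is our illustrative name.

```python
# A hedged sketch of Algorithm 1 (shared hidden feature space generation).
import numpy as np

def shared_hidden_space(XA, XB, r, eta=1e-3, iter_max=500, tol=1e-6):
    """XA, XB: (N, d) paired view matrices. Returns Omega of shape (r, d)."""
    N, d = XA.shape
    # J = sum_ij ||Omega (x_i^A - x_j^B)||^2 = tr(Omega M Omega^T), where
    # M = sum_ij d_ij d_ij^T is expanded below without forming all N^2 pairs.
    sA, sB = XA.sum(axis=0), XB.sum(axis=0)
    M = N * (XA.T @ XA) + N * (XB.T @ XB) - np.outer(sA, sB) - np.outer(sB, sA)
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    Omega = Q.T                                      # (r, d), orthonormal rows
    for _ in range(iter_max):
        grad = 2.0 * Omega @ M                       # matrix form of Eq. (13)
        Q, _ = np.linalg.qr((Omega - eta * grad).T)  # step, then restore
        Omega_new = Q.T                              # Omega Omega^T = I_r
        if np.linalg.norm(Omega_new - Omega) <= tol:
            return Omega_new
        Omega = Omega_new
    return Omega
```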

Multi-view learning based on shared hidden feature space

After determining the shared hidden space between two views, the extended space can be generated by combining the original space with the shared hidden space. A multi-view classifier based on SVM is then designed for multi-view data classification in the extended space. Existing multi-view learning mechanisms generally assume that each view can provide a classifier containing view-specific information and that classifiers constructed from different views tend to be consistent. Additionally, since views can provide specific information to each other, the proposed model establishes its objective function by considering the mutual information between the two views. In summary, the proposed model, based on SVM, restructures the slack variables on each view and then narrows the gap between the two views through a corresponding regularization term. The objective function of multi-view learning based on the shared hidden feature space can be formulated as

$$\begin{aligned}\arg\min_{w_A,w_B,v_A,v_B,b_A,b_B}\ &\frac{1}{2}\|w_A\|^2+\frac{1}{2}\|w_B\|^2+\frac{1}{2}\|v_A\|^2+\frac{1}{2}\|v_B\|^2+C_A\sum_{i=1}^{N}\xi_i^A+C_B\sum_{i=1}^{N}\xi_i^B+\lambda\|v_A-v_B\|^2\\ \text{s.t. }\ & y_i\left(w_A^T\phi(x_i^A)+v_A^T\phi(\Omega x_i^A)+b_A\right)\ge 1-\xi_i^A\\ & y_i\left(w_B^T\phi(x_i^B)+v_B^T\phi(\Omega x_i^B)+b_B\right)\ge 1-\xi_i^B\\ & \xi_i^A,\ \xi_i^B\ge 0,\quad i=1,2,\dots,N\end{aligned} \tag{16}$$
where $\lambda$, $C_A$ and $C_B$ are regularization parameters. Observe that Eq. (16) consists of three parts: the first four terms reflect the structural risk in the original feature space and the shared hidden space, respectively; the two slack-variable terms represent the empirical risk; and the last term reflects the difference between the two views in the shared hidden space. The objective function in Eq. (16) strengthens the constraints of the traditional SVM through the implicit mapping, so that the probability distributions of data from different views in the shared hidden space are as consistent as possible, which addresses the problem described at the beginning of this study. To solve Eq. (16) efficiently, the relevant Lagrangian multipliers are introduced according to Lagrangian optimization theory, and Eq. (16) is converted into its dual form as follows. The Lagrangian function corresponding to Eq. (16) is

$$\begin{aligned}L=\ &\frac{1}{2}\|w_A\|^2+\frac{1}{2}\|w_B\|^2+\frac{1}{2}\|v_A\|^2+\frac{1}{2}\|v_B\|^2+C_A\sum_{i=1}^{N}\xi_i^A+C_B\sum_{i=1}^{N}\xi_i^B+\lambda\|v_A-v_B\|^2\\ &+\sum_{i=1}^{N}\alpha_i^A\left(1-\xi_i^A-y_i\left(w_A^T\phi(x_i^A)+v_A^T\phi(\Omega x_i^A)+b_A\right)\right)\\ &+\sum_{i=1}^{N}\alpha_i^B\left(1-\xi_i^B-y_i\left(w_B^T\phi(x_i^B)+v_B^T\phi(\Omega x_i^B)+b_B\right)\right)-\sum_{i=1}^{N}\mu_i^A\xi_i^A-\sum_{i=1}^{N}\mu_i^B\xi_i^B\end{aligned} \tag{17}$$
where $\alpha_i^A\ge 0$, $\alpha_i^B\ge 0$, $\mu_i^A\ge 0$, and $\mu_i^B\ge 0$ are Lagrangian multipliers. Setting the partial derivatives of the Lagrangian function $L$ with respect to $w_A$, $w_B$, $v_A$, $v_B$, $b_A$, $b_B$, $\xi_i^A$, and $\xi_i^B$ to zero gives

$$w_A=\sum_{i=1}^{N}\alpha_i^A y_i\phi(x_i^A),\qquad w_B=\sum_{i=1}^{N}\alpha_i^B y_i\phi(x_i^B) \tag{18}$$

$$v_A=\frac{1+2\lambda}{1+4\lambda}\sum_{i=1}^{N}\alpha_i^A y_i\phi(\Omega x_i^A)+\frac{2\lambda}{1+4\lambda}\sum_{i=1}^{N}\alpha_i^B y_i\phi(\Omega x_i^B) \tag{19}$$

$$v_B=\frac{1+2\lambda}{1+4\lambda}\sum_{i=1}^{N}\alpha_i^B y_i\phi(\Omega x_i^B)+\frac{2\lambda}{1+4\lambda}\sum_{i=1}^{N}\alpha_i^A y_i\phi(\Omega x_i^A) \tag{20}$$

$$\sum_{i=1}^{N}\alpha_i^A y_i=0,\qquad \sum_{i=1}^{N}\alpha_i^B y_i=0 \tag{21}$$

$$C_A=\alpha_i^A+\mu_i^A,\qquad C_B=\alpha_i^B+\mu_i^B \tag{22}$$
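The coefficients in Eqs. (19) and (20) follow from the coupled stationarity conditions in $v_A$ and $v_B$. As a brief check of this step (our reconstruction, with the shorthand $S_A=\sum_{i=1}^{N}\alpha_i^A y_i\phi(\Omega x_i^A)$ and $S_B=\sum_{i=1}^{N}\alpha_i^B y_i\phi(\Omega x_i^B)$), setting $\partial L/\partial v_A=0$ and $\partial L/\partial v_B=0$ gives

$$(1+2\lambda)v_A-2\lambda v_B=S_A,\qquad -2\lambda v_A+(1+2\lambda)v_B=S_B.$$

This linear system has determinant $(1+2\lambda)^2-(2\lambda)^2=1+4\lambda$, and solving it yields exactly Eqs. (19) and (20).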

By substituting Eqs. (18)-(22) into Eq. (17), we obtain the dual problem of Eq. (16), which can be defined as

$$\begin{aligned}\arg\max_{\tilde{\alpha}}\ &-\frac{1}{2}\tilde{\alpha}^T K\tilde{\alpha}+\tilde{\alpha}^T\mathbf{1}\\ \text{s.t. }\ &\tilde{\alpha}^T f=0,\quad f=[y^T,y^T]^T\\ &\tilde{\alpha}_i\ge 0,\ \forall i\end{aligned} \tag{23}$$
where

$$\tilde{\alpha}=[\alpha_1^A,\alpha_2^A,\dots,\alpha_N^A,\alpha_1^B,\alpha_2^B,\dots,\alpha_N^B]^T \tag{24}$$

$$K_A=K(x^A,x^A)\odot yy^T+\frac{1+2\lambda}{1+4\lambda}K(\Omega x^A,\Omega x^A)\odot yy^T \tag{25}$$

$$K_B=K(x^B,x^B)\odot yy^T+\frac{1+2\lambda}{1+4\lambda}K(\Omega x^B,\Omega x^B)\odot yy^T \tag{26}$$

$$K_{AB}=\frac{2\lambda}{1+4\lambda}K(\Omega x^A,\Omega x^B)\odot yy^T \tag{27}$$

$$K=\begin{bmatrix}K_A & K_{AB}\\ K_{AB}^T & K_B\end{bmatrix} \tag{28}$$

$$y=[y_1,y_2,\dots,y_N]^T \tag{29}$$
and $K(\cdot,\cdot)$ is the kernel function, with $\odot$ denoting the element-wise product. It is obvious that the optimization of Eq. (23) is a QP problem, which can be solved according to Deng et al. (2013). The decision function of the proposed model is defined as

$$f(x)=\frac{1}{2}\left(w_A^T\phi(x^A)+v_A^T\phi(\Omega x^A)+b_A+w_B^T\phi(x^B)+v_B^T\phi(\Omega x^B)+b_B\right) \tag{30}$$

The algorithm of multi-view learning based on the shared hidden feature space is summarized in Algorithm 2, whose time complexity is mainly contributed by steps 1, 3 and 4. Step 1 (Algorithm 1) costs O(Nrd + r²), step 3 costs O((r+d)²), and step 4 costs O(N²). Therefore, the overall time complexity of Algorithm 2 is O(Nrd + r² + (r+d)² + N²).

Algorithm 1:
Shared hidden feature space generation.
Input: $x_i^A$, $x_i^B$, and $y=[y_i]_{i=1,2,\dots,N}$
Output: $\Omega$
Procedures:
1. Initialize $\Omega^{(0)}\in\mathbb{R}^{r\times d}$, $t=0$, $iter_{max}$, convergence threshold $\epsilon=10^{-6}$.
2. Repeat:
3. $t=t+1$.
4. Compute $\partial J/\partial\Omega$ and $\eta$ by Eqs. (13) and (15).
5. Update $\Omega^{(t)}$ by Eq. (14).
6. Until $\|\Omega^{(t)}-\Omega^{(t-1)}\|\le\epsilon$ or $t>iter_{max}$
DOI: 10.7717/peerj-cs.1874/table-7
Algorithm 2:
Multi-view learning based on shared hidden feature space.
Input: training samples of view-1: $\{x_i^A,y_i\}$, training samples of view-2: $\{x_i^B,y_i\}$, regularization parameters $C_A$, $C_B$ and $\lambda$
Output: $w_A$, $w_B$, $b_A$, $b_B$, $v_A$ and $v_B$
Procedures:
1. Use Algorithm 1 to obtain $\Omega$.
2. Use $\Omega$ to obtain the shared hidden space.
3. Solve for $\tilde{\alpha}_i$ according to Eq. (23).
4. Solve for $w_A$, $w_B$, $b_A$, $b_B$, $v_A$ and $v_B$ by Eqs. (18)-(22).
5. Construct the decision function from $w_A$, $w_B$, $b_A$, $b_B$, $v_A$ and $v_B$.
DOI: 10.7717/peerj-cs.1874/table-8
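As a hedged illustration of steps 2-3 of Algorithm 2, the following sketch builds the block kernel matrix of Eq. (28) and solves the dual of Eq. (23) as a QP with cvxopt. The Gaussian width `sigma`, the PSD jitter term, and the helper names are our illustrative choices; `shared_hidden_space` refers to the sketch given after Algorithm 1.

```python
# A hedged sketch: block kernel construction (Eqs. (25)-(28)) plus dual QP.
import numpy as np
from cvxopt import matrix, solvers

def gauss_kernel(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def solve_dual(XA, XB, y, Omega, lam=0.5, sigma=1.0):
    N = len(y)
    yy = np.outer(y, y).astype(float)
    HA, HB = XA @ Omega.T, XB @ Omega.T           # images in the hidden space
    c1, c2 = (1 + 2 * lam) / (1 + 4 * lam), 2 * lam / (1 + 4 * lam)
    KA = (gauss_kernel(XA, XA, sigma) + c1 * gauss_kernel(HA, HA, sigma)) * yy
    KB = (gauss_kernel(XB, XB, sigma) + c1 * gauss_kernel(HB, HB, sigma)) * yy
    KAB = c2 * gauss_kernel(HA, HB, sigma) * yy
    K = np.block([[KA, KAB], [KAB.T, KB]])        # Eq. (28)
    # Eq. (23) as a minimization: min 0.5 a^T K a - 1^T a, s.t. f^T a = 0, a >= 0
    # (upper bounds a_i <= C_A, C_B implied by Eq. (22) could be appended to G, h).
    f = np.concatenate([y, y]).astype(float)
    sol = solvers.qp(matrix(K + 1e-8 * np.eye(2 * N)), matrix(-np.ones(2 * N)),
                     matrix(-np.eye(2 * N)), matrix(np.zeros(2 * N)),
                     matrix(f.reshape(1, -1)), matrix(0.0))
    return np.array(sol["x"]).ravel()             # alpha for views A and B
```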

Experimental studies

Settings

To observe the merits of the proposed model, k-nearest neighbors (KNN) (Liu & Liu, 2016), support vector machine (SVM) (Liu & Liu, 2016), SVM-2K (Farquhar et al., 2005), multi-view L2-SVM (MV-L2-SVM) (Huang, Chung & Wang, 2016), and alternative multi-view MED (AMVMED) (Chao & Sun, 2015) are introduced for comparison. Accuracy is used as the evaluation indicator. SVM, SVM-2K, MV-L2-SVM, and the proposed model (denoted 2V-SVM-SH) are all trained with a Gaussian kernel. For all methods, ten-fold cross-validation (CV) is used to determine the optimal parameters; Table 3 lists the specific parameters and ranges for each method. All experiments are conducted on a PC with a 16-core 3.40 GHz CPU and 32 GB of memory, and the programming environment is Matlab R2016a.

Table 3:
Parameter settings.
Method Parameter settings
KNN k ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
SVM C ∈ {2^−8, 2^−7, …, 2^8}, σ ∈ {2^−8, 2^−7, …, 2^8}
SVM-2K C_A ∈ {2^−8, 2^−7, …, 2^8}, C_B ∈ {2^−8, 2^−7, …, 2^8}, D ∈ {2^−5, 2^−4, …, 2^5}, σ ∈ {2^−8, 2^−7, …, 2^8}
MV-L2-SVM C_A ∈ {2^−8, 2^−7, …, 2^8}, C_B ∈ {2^−8, 2^−7, …, 2^8}, σ ∈ {2^−8, 2^−7, …, 2^8}
AMVMED C_A ∈ {2^−8, 2^−7, …, 2^8}, C_B ∈ {2^−8, 2^−7, …, 2^8}, γ ∈ {0.1, 0.2, …, 0.9}
Proposed model C_A ∈ {2^−8, 2^−7, …, 2^8}, C_B ∈ {2^−8, 2^−7, …, 2^8}, σ ∈ {2^−8, 2^−7, …, 2^8}, λ ∈ {0.1, 0.2, …, 0.9, 1}
DOI: 10.7717/peerj-cs.1874/table-3
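For reference, the following is a sketch of the ten-fold CV parameter search over grids like those of Table 3, assuming a hypothetical `train_eval` wrapper that fits the proposed model for one parameter setting and returns test accuracy; in practice the full grid is large, so the loops would typically be parallelized or subsampled.

```python
# A hedged sketch of ten-fold CV grid search; `train_eval` is hypothetical.
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

Cs = [2.0 ** e for e in range(-8, 9)]            # {2^-8, ..., 2^8}
lams = [0.1 * k for k in range(1, 11)]           # {0.1, ..., 1.0}

def cv_select(XA, XB, y, train_eval, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_params, best_acc = None, -np.inf
    for CA, CB, lam in itertools.product(Cs, Cs, lams):
        accs = [train_eval(XA[tr], XB[tr], y[tr], XA[te], XB[te], y[te],
                           CA=CA, CB=CB, lam=lam)
                for tr, te in skf.split(XA, y)]
        if np.mean(accs) > best_acc:
            best_params, best_acc = (CA, CB, lam), np.mean(accs)
    return best_params, best_acc
```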

To construct two-view learning scenarios based on 'Data', three feature extraction methods, namely wavelet packet decomposition (WPD), short-time Fourier transform (STFT), and kernel principal component analysis (KPCA), are adopted to extract frequency-domain, time-frequency, and time-domain features, respectively, from the original EEG signals (see Figs. 2-4). Finally, 12 datasets are constructed, as shown in Table 4.

Table 4:
Two-view learning scenarios.
Datasets Classification tasks Views (view-A, view-B) #Sample size
DS1 AB vs CDE WPD, STFT 500
DS2 AB vs CDE WPD, KPCA 500
DS3 AB vs CDE STFT, KPCA 500
DS4 AB vs CD WPD, STFT 400
DS5 AB vs CD WPD, KPCA 400
DS6 AB vs CD STFT, KPCA 400
DS7 AB vs DE WPD, STFT 400
DS8 AB vs DE WPD, KPCA 400
DS9 AB vs DE STFT, KPCA 400
DS10 AB vs CE WPD, STFT 400
DS11 AB vs CE WPD, KPCA 400
DS12 AB vs CE STFT, KPCA 400
DOI: 10.7717/peerj-cs.1874/table-4
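A sketch of how such two-view scenarios could be assembled is given below, assuming a dict `features[method][group]` holding one (100, d) array per group and extraction method; the loader and names are hypothetical, not from the paper.

```python
# A hedged sketch: assembling one two-view scenario from Table 4.
import numpy as np

def make_scenario(features, groups_pos, groups_neg, view_a, view_b):
    groups = groups_pos + groups_neg
    XA = np.vstack([features[view_a][g] for g in groups])
    XB = np.vstack([features[view_b][g] for g in groups])
    n_pos = 100 * len(groups_pos)                 # 100 segments per group
    y = np.concatenate([np.ones(n_pos), -np.ones(len(XA) - n_pos)])
    return XA, XB, y

# e.g., DS1 (AB vs CDE, view-A = WPD, view-B = STFT):
# XA, XB, y = make_scenario(features, ["A", "B"], ["C", "D", "E"], "WPD", "STFT")
```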

Experimental results and analysis

The experimental results are reported in Table 5, which shows that the proposed model achieves the best performance on most datasets; only on DS5 and DS9 is it outperformed, by SVM-2K and MV-L2-SVM respectively. The advantage of the proposed model indicates the promise of the shared hidden space. These results show that, by constructing the expanded space and learning from both the shared hidden space and the original space, and thereby fully exploiting the relevant information of samples within and across views, the proposed model effectively solves the problem that samples of the same class from different views can differ more than samples of different classes from the same view. The results also indicate the power of the KDE used to construct the shared hidden space.

Table 5:
Classification performance in terms of accuracy on all multi-view learning scenarios.
Datasets KNN_A (KNN on view-A) KNN_B (KNN on view-B) SVM_A (SVM on view-A) SVM_B (SVM on view-B) SVM-2K MV-L2-SVM AMVMED Proposed model
DS1 0.9098 (0.0019) 0.9176 (0.0045) 0.9432 (0.0076) 0.9521 (0.0087) 0.9754 (0.0063) 0.9543 (0.0065) 0.9643 (0.0043) 0.9876 (0.0023)
DS2 0.9213 (0.0032) 0.9098 (0.0021) 0.9583 (0.0065) 0.9321 (0.0087) 0.9654 (0.0063) 0.9431 (0.0065) 0.9546 (0.0043) 0.9768 (0.0023)
DS3 0.9223 (0.0034) 0.9098 (0.0021) 0.9345 (0.0022) 0.9321 (0.0087) 0.9654 (0.0023) 0.9437 (0.0013) 0.9554 (0.0063) 0.9764 (0.0034)
DS4 0.9214 (0.0034) 0.9097 (0.0011) 0.9067 (0.0073) 0.9164 (0.0027) 0.9567 (0.0032) 0.9511 (0.0023) 0.9598 (0.0044) 0.9690 (0.0036)
DS5 0.9214 (0.0034) 0.9481 (0.0023) 0.9875 (0.0046) 0.9467 (0.0056) 0.9892 (0.0017) 0.9564 (0.0054) 0.9578 (0.0023) 0.9743 (0.0045)
DS6 0.9324 (0.0052) 0.9481 (0.0023) 0.9875 (0.0046) 0.9467 (0.0056) 0.9653 (0.0018) 0.9511 (0.0034) 0.9587 (0.0033) 0.9811 (0.0056)
DS7 0.9331 (0.0026) 0.9325 (0.0026) 0.9481 (0.0017) 0.9435 (0.0037) 0.9563 (0.0032) 0.9673 (0.0026) 0.9543 (0.0046) 0.9781 (0.0015)
DS8 0.9331 (0.0026) 0.9221 (0.0025) 0.9481 (0.0017) 0.9387 (0.0026) 0.9612 (0.0018) 0.9671 (0.0056) 0.9409 (0.0055) 0.9812 (0.0035)
DS9 0.9631 (0.0015) 0.9221 (0.0025) 0.9511 (0.0090) 0.9387 (0.0026) 0.9654 (0.0143) 0.9786 (0.0087) 0.9765 (0.0049) 0.9760 (0.0054)
DS10 0.9318 (0.0079) 0.9543 (0.0056) 0.9345 (0.0054) 0.9245 (0.0064) 0.9534 (0.0048) 0.9501 (0.0047) 0.9534 (0.0019) 0.9756 (0.0087)
DS11 0.9134 (0.0078) 0.9215 (0.0056) 0.9381 (0.0054) 0.9275 (0.0034) 0.9452 (0.0036) 0.9517 (0.0045) 0.9732 (0.0017) 0.9789 (0.0087)
DS12 0.9532 (0.0035) 0.9378 (0.0043) 0.9785 (0.0038) 0.9634 (0.0014) 0.9763 (0.0013) 0.9587 (0.0054) 0.9661 (0.0064) 0.9898 (0.0034)
Average 0.9311 0.9333 0.9472 0.9434 0.9646 0.9561 0.9596 0.9787
DOI: 10.7717/peerj-cs.1874/table-5

Note:

Bold entries indicate the best performance achieved by the corresponding method.

Statistical analysis

We use the Friedman test (Zimmerman & Zumbo, 1993; Sakamoto et al., 2015) to conduct a statistical analysis of the experimental results of all methods across all datasets. The Friedman test is a non-parametric method for analyzing whether there are significant differences in performance among multiple methods over multiple datasets. It first computes the average rank of each method's performance over all datasets and then compares these ranks: if they are the same, all methods perform equally; otherwise, there are significant differences among the methods. When significant differences exist, we further use a Holm post-hoc test to analyze which methods differ significantly from the proposed algorithm. From Fig. 5, we see that 2V-SVM-SH achieves the best ranking, and the p-value embedded in Fig. 5, computed by the Friedman test, indicates that there are significant differences among the models. From Table 6, it can be seen that all hypotheses are rejected except the proposed model vs AMVMED and the proposed model vs SVM-2K. These results indicate that the proposed model performs significantly better than KNN-A, KNN-B, SVM-B, SVM-A, and MV-L2-SVM. Although the hypotheses for the proposed model vs AMVMED and vs SVM-2K are not rejected, their low p-values still indicate the competitiveness of the proposed model.
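As a brief illustration, the Friedman test can be reproduced with scipy on the accuracy columns of Table 5 (shown here for three of the eight methods; the Holm post-hoc step is a separate procedure).

```python
# A small sketch of the Friedman test over per-dataset accuracies (Table 5).
from scipy.stats import friedmanchisquare

acc_proposed = [0.9876, 0.9768, 0.9764, 0.9690, 0.9743, 0.9811,
                0.9781, 0.9812, 0.9760, 0.9756, 0.9789, 0.9898]
acc_svm2k = [0.9754, 0.9654, 0.9654, 0.9567, 0.9892, 0.9653,
             0.9563, 0.9612, 0.9654, 0.9534, 0.9452, 0.9763]
acc_mvl2svm = [0.9543, 0.9431, 0.9437, 0.9511, 0.9564, 0.9511,
               0.9673, 0.9671, 0.9786, 0.9501, 0.9517, 0.9587]
stat, p = friedmanchisquare(acc_proposed, acc_svm2k, acc_mvl2svm)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")  # small p => differ
```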


Figure 5: Friedman rankings of all models.

Table 6:
Holm test results with α = 0.05.
i Algorithm $z=(R_0-R_i)/SE$ p Holm = α/i Hypothesis
7 KNN-A 5.583333 0 0.007143 Rejected
6 KNN-B 5.25 0 0.008333 Rejected
5 SVM-B 4.166667 0.000031 0.01 Rejected
4 SVM-A 3.666667 0.000246 0.0125 Rejected
3 MV-L2-SVM 2.5 0.012419 0.016667 Rejected
2 AMVMED 2.125 0.033587 0.025 Not rejected
1 SVM-2K 1.375 0.169131 0.05 Not rejected
DOI: 10.7717/peerj-cs.1874/table-6

Conclusions

In this study, a multi-view support vector machine based on a shared hidden space is constructed using kernel density estimation. The method is designed to address the drop in recognition performance caused by differences in sample characteristics between views in multi-view learning. By incorporating SVM into the shared hidden space, the resulting optimization can be solved efficiently as a classical QP problem. Experimental results on EEG-based epilepsy diagnosis demonstrate that the proposed method extracts complementary information between views better than the competing methods.

In practical applications, annotating training samples is often a time-consuming task. Therefore, in subsequent research, we intend to extend the multi-view algorithm proposed in this article to transfer learning scenarios, aiming to reduce the reliance on labeled samples.

Supplemental Information

EEG datasets of the five groups.

DOI: 10.7717/peerj-cs.1874/supp-2