Dynamic guided metric representation learning for multi-view clustering

PeerJ Computer Science

Introduction

Multi-view data are extremely common in many applications, and each individual view, as well as the correlations among views, carries properties specific to a particular knowledge discovery task. Multi-view data often contain diverse as well as consistent information that should be exploited and fused. Therefore, considering the uniqueness, complementarity, and correlation of the views, it is essential to study how to fuse multi-view data efficiently. Recently, increasing research effort has been devoted to multi-view learning, in which multi-view clustering (MVC) is a mainstream task that aims to divide subjects into meaningful groups from different perspectives by learning from the multi-view information (e.g., Yang & Wang, 2018). Moreover, a better clustering representation can be learned by simultaneously analyzing the discriminant characteristics of each single view and the correlations among multiple views. Therefore, a priority for most MVC methods is to find a feasible and direct way to explore the latent fusion representation underlying the data clusters across multiple views, so as to obtain the final clustering results.

Much research on MVC has been conducted in recent decades, most of it aiming to exploit the rich information contained in multiple views effectively (e.g., Wang, Yang & Liu, 2020). For instance, co-training (e.g., Xia, Yang & Yu, 2020) and co-regularized (e.g., Kang, Shi & Huang, 2020) spectral clustering have been proposed to minimize the disagreement between each pair of views (e.g., Chen, Huang & Wang, 2020). However, the clustering performance of this kind of method is easily degraded by poor-quality original views. Multi-view subspace clustering uses a unified shared feature representation of the views to obtain consistent clustering results (e.g., Huang, Kang & Xu, 2020; Zhao, Ding & Fu, 2017a; Zhao, Ding & Fu, 2017b). Typical models also include subspace learning-based and non-negative matrix factorization (NMF)-based models. In the past decade, numerous machine learning technologies have been investigated to determine how best to combine multiple views. However, these methods lack the ability to mine a latent unified representation and to learn the non-linear correlations among views. To address the second limitation, many researchers have proposed multi-kernel and CCA-based clustering methods. The multi-kernel approach uses predefined kernels for each view and combines these kernels in a linear or non-linear fashion. Nevertheless, the complex relationships among views make such data fusion difficult to represent. To alleviate this problem, canonical correlation analysis (CCA) (e.g., Haldorai & Ramu, 2020; Hotelling, 1936) and kernel CCA (KCCA) are commonly used in multi-view clustering. Rasiwasia (Rasiwasia, Mahajan & Mahadevan, 2014) proposed mean CCA and cluster CCA. Blaschko (Blaschko & Lampert, 2008) projected the data across different views by KCCA and used k-means to cluster the projected samples. However, the computational cost of the CCA and KCCA models is high (e.g., Wang, Li & Huang, 2017), and existing CCA-based clustering methods focus only on mining linear correlations. With the wide application of deep learning, deep networks have been incorporated into the CCA model, for example Deep CCA (DCCA) (e.g., Andrew, Arora & Bilmes, 2013) and Deep Generalized CCA (DGCCA) (e.g., Benton, Khayrallah & Gujral, 2017). However, the DCCA model can only learn from two views. Although DGCCA overcame this limitation of DCCA, it ignores the specific characteristics of the individual views. Subsequently, the multi-view representation learning model (MRL) (e.g., Zheng, Ge & Li, 2020), based on DGCCA, was proposed; it focuses on mining the specific characteristics within each view and then learning the fusion representation based on the maximum correlation of multiple views. Recently, clustering-oriented multi-view representation has been viewed as the problem of learning a meaningful representation of the data. Therefore, how to design such a model for multi-view clustering is an intriguing direction.

In summary, an effective mechanism is to learn a comprehensive, meaningful representation that captures both the discriminant characteristics within each view and the correlations among views. DCCA-based multi-view clustering methods are still rare, and they leave room for improvement in using deep networks to mine non-linear correlations and high-level fusion representations. To address the limitations of DCCA-based MVC methods, this paper proposes a novel framework called Dynamic Guided Metric Representation Learning for Multi-View Clustering (DGMRL-MVC).

The main contributions of this work can be summarized as follows:

• A multi-step enhanced representation is proposed for multi-view clustering, consisting of inter-intra learning, deep learning, and latent space mapping. The proposed model can jointly learn a latent discriminated embedding.

• On the basis of learning intra-view class separability and inter-view consistency through Fisher Discriminant Analysis with the Hilbert–Schmidt Independence Criterion (FDA-HSIC) metric learning, a dynamic guided deep learning method is used that introduces location or direction information to enhance each single-view representation.

• An ultimate common representation is obtained with a generalized canonical correlation analysis (GCCA) model over multiple views.

• Experiments on four real-world multi-view datasets have validated the effectiveness of the proposed method for clustering tasks.

The rest of this paper is organized as follows: ‘Related work’ presents a review of related work. The next section introduces the proposed DGMRL-MVC model. ‘Experiments’ presents the datasets, experimental settings, and experimental results. Finally, ‘Conclusions’ draws conclusions.

Related work

Related work on multi-view clustering methods can be divided into two categories: the common matrix framework (spectral clustering, subspace clustering, and non-negative matrix factorization clustering) and view fusion methods (multi-kernel clustering and DCCA-based methods). This section will review multi-view clustering methods from these two technological categories.

Common matrix framework

Methods of this type have in common that they share a similar structure to combine multiple views.

Multi-view spectral clustering

Multi-view spectral clustering assumes that all the views share the same or similar eigenvector matrices; co-training spectral clustering and co-regularized spectral clustering are the two representative methods.

(1) Co-training spectral clustering. These algorithms are built on the assumption of consensus among multiple views and are trained alternately to maximize the consistency of two distinct views. Three main assumptions are made: (1) each view is sufficient for the learning task; (2) the views are conditionally independent given the class labels; and (3) with high probability, the objective functions output the same predictions for co-occurring features in both views. Overall, most co-training methods are semi-supervised learning methods.

(2) Co-regularized spectral clustering. Zhu (Zhu, Zhang & He, 2019) proposed a co-regularized approach. The main idea of this kind of method is to minimize the disagreement between the predictor functions of two views, with this disagreement acting as one part of an objective function. Graph Laplacian eigenvectors play a role much like that of predictor functions in a semi-supervised learning scenario. Inspired by this previous work (e.g., Zhu, Zhang & He, 2019; Chen, Huang & Wang, 2020), Ye, Liu & Yin (2016) automatically learned the weights of the different views from the data.

Multi-view subspace clustering

In practice, multi-view data can be sampled from multiple subspaces. A subspace learning model can learn a new, unified representation or a latent space for multi-view data, which can then be used directly for the clustering task. In addition, a subspace clustering model can deal with high-dimensional data: the approach is to find the underlying subspaces and then to cluster within them. Wang (Wang, Lin & Wu, 2015) proposed a model to measure correlation consensus across multiple views. Unlike Wang, Zhao (Zhao, Ding & Fu, 2017a; Zhao, Ding & Fu, 2017b) used deep semi-nonnegative matrix factorization to perform clustering. This kind of method focuses on mining the inherent structure of the view subspaces, and its clustering performance depends heavily on the affinity matrix. Therefore, some works have used deep networks to learn view-specific features on top of classical subspace clustering methods. The multi-view representation learning model (MRL) (e.g., Zheng, Ge & Li, 2020) focuses on mining the specific characteristics within each view and then learning a fusion representation based on the maximum correlation of multiple views. In all, how to mine the hidden differences among views has been an active research topic in recent years.

Multi-view non-negative matrix factorization clustering

Given a non-negative matrix $X \in \mathbb{R}^{d \times n}$, non-negative matrix factorization (NMF) seeks two non-negative matrices $W \in \mathbb{R}^{d \times p}$ and $V \in \mathbb{R}^{p \times n}$ whose product approximates the original matrix $X$:
$$X \approx WV,$$

where W is the basis matrix and V the indicator matrix.

Due to its non-negativity constraints, NMF has emerged as a latent feature learning method. To combine multi-view information in an NMF framework, many variants of NMF have been proposed. Kalayeh (Kalayeh, Idrees & Shah, 2014) presented a weighted extension of multi-view NMF to address the aforementioned drawbacks. Liu (Liu, Wang & Lu, 2020) used a cross-entropy loss to better constrain the objective function.
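As a concrete illustration, the following is a minimal sketch of NMF-based latent feature learning for a single view using scikit-learn (not the multi-view variants above). Note that scikit-learn factorizes an $n \times d$ matrix, the transpose of the convention used here; the data and parameter values are toy placeholders.

```python
# A minimal single-view NMF sketch. The paper's convention is
# X (d x n) ~ W (d x p) V (p x n); scikit-learn factorizes the transposed
# matrix X.T (n x d) ~ A @ B, so W = B.T and V = A.T.
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.rand(50, 100))            # toy view: d=50 features, n=100 samples

model = NMF(n_components=10, init="nndsvd", max_iter=500, random_state=0)
A = model.fit_transform(X.T)                   # (n, p) coefficients
B = model.components_                          # (p, d) basis

W, V = B.T, A.T                                # back to the paper's convention
print(np.linalg.norm(X - W @ V))               # reconstruction error ||X - WV||_F
```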

View fusion methods

View fusion methods are also commonly used to learn from multi-view information. Their learning objectives combine the views directly for clustering, through different fusion modes; hence, they focus on how to combine multiple views.

Multi-view multi-kernel clustering

Multi-kernel learning was originally used to combine views by means of kernels and has been widely used to deal with multi-view data. The common approach is to define a kernel for each view and then combine these kernels in a convex combination (e.g., Sellami & Alaya, 2021; Lu, Liu & Wei, 2020; Nithya, Appathurai & Venkatadri, 2020). Therefore, one main problem of this approach is choosing the kernel functions and their optimal combination. Cui, Fern & Dy (2007) performed a k-means analysis in kernel space that simultaneously found the best cluster labels, the cluster kernels, and the optimal combination of multiple kernels. Liu (Liu, Ji & Glänzel, 2013) extended k-means clustering into Hilbert space, representing the data matrices as kernel matrices that were then combined for fusion. In addition, to account for the differences among views, methods with weighted combinations of kernels have also been studied.

However, multi-view data are often incomplete. To address this issue, researchers have integrated kernel imputation and clustering into a unified learning procedure (e.g., Monney, Zhan & Zhen, 2020). Besides, other methods use a direct combination of features to perform multi-view clustering, like those in Xu, Han & Nie (2016) and Nie, Li & Li (2017).
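The following hedged sketch illustrates the convex kernel combination idea with scikit-learn: one RBF kernel per view, fixed hand-chosen weights (multi-kernel methods learn these), and spectral clustering on the combined kernel as a common kernel k-means surrogate. The data, weights, and cluster count are assumptions for illustration only.

```python
# Convex combination of per-view kernels, then clustering on the result.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

views = [np.random.rand(200, 30), np.random.rand(200, 64)]  # toy 2-view data, n=200
weights = [0.6, 0.4]                                        # w_v >= 0, sum w_v = 1

# Combined kernel K = sum_v w_v * K_v, shape (n, n)
K = sum(w * rbf_kernel(Xv) for w, Xv in zip(weights, views))

labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(K)
```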

Multi-view CCA-based methods clustering

For multi-view data clustering, it is natural to combine the data directly. However, different data types are hard to fuse, and high dimensionality and noise are difficult to handle. CCA can be used to combine the views directly, and CCA-based techniques have been developed for multi-view clustering fusion. Chaudhuri (Chaudhuri, Kakade & Livescu, 2009) projected the data into a lower-dimensional space using CCA and then clustered them. Subsequently, further CCA-based methods were proposed, such as KCCA (e.g., Wang, Li & Huang, 2017), DCCA (e.g., Andrew, Arora & Bilmes, 2013), DGCCA (e.g., Benton, Khayrallah & Gujral, 2017), and MRL (e.g., Zheng, Ge & Li, 2020). Overall, although DCCA-based methods are now used for multi-view representation, multi-view clustering with DCCA-based methods remains a direction worthy of further study. For these methods, the quality of the learned representation directly influences effectiveness on the clustering task. Thus, building on existing multi-view clustering learning methods, the intention here is to design a novel model that learns a comprehensive, meaningful data representation with the discriminant characteristics of each single view and the correlations of all views.

In short, existing methods still face the following challenges: (1) Common matrix methods focus on learning a shared similar structure and ignore the hidden differences among views. (2) View fusion methods consider the unique features of each view and the correlations among all views, but most follow a two-step strategy: the first step learns a self-representation of each view independently, and the second step focuses on fusing the views. Such two-step collaboration is not well matched to the clustering task.

The proposed model

Motivation

Existing CCA-based multi-view clustering approaches learn the correlation of two views and perform a clustering task at the same time. Despite appealing performance, they still have some limitations. First, the quality of the data representation is the key issue in MVC, and in real-world applications, directly fusing multi-view data is difficult and often yields poor results. Second, the discriminant characteristics within each single view and the correlations among multiple views should all be considered. With this in mind, a novel framework called Dynamic Guided Metric Representation Learning for Multi-View Clustering (DGMRL-MVC) is proposed in this paper, which clusters multi-view data in a learned latent discriminated embedding space. Specifically, within the framework, the data representation is enhanced in multiple steps. The model combines the effectiveness of FDA-HSIC metric learning with dynamic routing learning of the views embedded in the model. Hence, it can mine the class separability of each single view, the consistency among different views, the discriminant characteristics within each view, and a latent discriminated embedding for the clustering task.

Problem formulation

Given multi-view data observations, let $X_j \in \mathbb{R}^{d_j \times N}$ denote the $j$th view, where $d_j$ is the feature dimensionality of the $j$th view and $N$ is the number of samples.

Definition 1 (Mahalanobis distance). The Mahalanobis distance between two samples $x_1$ and $x_2$ is defined as
$$d_M^2(x_1, x_2) = \|x_1 - x_2\|_M^2 = (x_1 - x_2)^T M (x_1 - x_2),$$
where the Mahalanobis matrix $M$ is constrained to be symmetric positive-definite to ensure its validity.
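As a quick numerical check of Definition 1 (a sketch with toy data, assuming the factorization $M = P^T P$ introduced in the Framework section below), the Mahalanobis distance equals the squared Euclidean distance between the projected samples $z = Px$:

```python
# Mahalanobis distance with M = P^T P equals Euclidean distance after
# projecting with P.
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 5
P = rng.standard_normal((k, d))      # projection defining the metric
M = P.T @ P                          # symmetric PSD Mahalanobis matrix

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
diff = x1 - x2

d_M2 = diff @ M @ diff               # (x1 - x2)^T M (x1 - x2)
d_proj2 = np.sum((P @ x1 - P @ x2) ** 2)
assert np.isclose(d_M2, d_proj2)
```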

Framework

Network architecture

Figure 1 illustrates the architecture of the proposed DGMRL-MVC model for multi-view clustering. This architecture effectively improves the fusion representation and overcomes the limitations of traditional DCCA-based multi-view clustering learning. The model consists of three modules: inter-intra representation based on Fisher discriminant analysis and the Hilbert–Schmidt independence criterion, together called FDA-HSIC metric learning; deep representation based on dynamic guided deep learning; and shared representation based on deep generalized canonical correlation analysis. The first module learns intra-view separability and inter-view consistency. The second, dynamic guided deep learning, introduces location or direction information to enhance each single-view representation. In the last sub-model, the fusion representation $G$ of the multiple views is created for the clustering task.

Figure 1: Illustration of the proposed DGMRL-MVC model.

1) Inter-intra representation based on FDA-HSIC

The Fisher-HSIC Multi-View Metric Learning (FISH-MML) method was proposed by Zhang, Liu & Liu (2018). It is based on FDA and HSIC, which are simple yet rather effective. Inspired by Zhang's work, FDA-HSIC is incorporated into the proposed framework for multi-view clustering representation learning.

Intra-view separability. The starting point is the definitions of the between-class and total scatter matrices in the $v$th view:
$$S_b^{(v)} = \frac{1}{n}\sum_{j=1}^{m} n_j \left(\mu_j^{(v)} - \mu^{(v)}\right)\left(\mu_j^{(v)} - \mu^{(v)}\right)^T, \quad S_t^{(v)} = \frac{1}{n}\sum_{i=1}^{n} \left(z_i^{(v)} - \mu^{(v)}\right)\left(z_i^{(v)} - \mu^{(v)}\right)^T,$$
$$\mu_j^{(v)} = \frac{1}{n_j}\sum_{i=1}^{n_j} z_{ij}^{(v)}, \quad \mu^{(v)} = \frac{1}{n}\sum_{i=1}^{n} z_i^{(v)},$$
where $b$ and $t$ abbreviate "between" and "total", $m$ is the number of classes, $n_j$ is the number of samples belonging to class $C_j$, $\mu_j^{(v)}$ is the sample mean of the $v$th view for class $C_j$, and $\mu^{(v)}$ is the sample mean of the $v$th view over all samples. $z_{ij}^{(v)}$ is the projected feature vector corresponding to $x_{ij}^{(v)}$ and is defined as
$$z_{ij}^{(v)} = P^{(v)} x_{ij}^{(v)},$$

where the symmetric positive semi-definite matrix $M = P^T P$ and $P \in \mathbb{R}^{k \times d}$ ($k \geq \operatorname{rank}(M)$) defines the new space.

The trace operators for $S_b^{(v)}$ and $S_t^{(v)}$ conditioned on $P^{(v)}$ are then optimized with the following objective:
$$\max_{\{P^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \left[\operatorname{Tr}\left(S_b^{(v)}; P^{(v)}\right) - \gamma \operatorname{Tr}\left(S_t^{(v)}; P^{(v)}\right)\right],$$

where $\gamma$ is a tunable parameter that balances the two terms. This optimization therefore seeks metrics that jointly maximize separability across all views.
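The following toy NumPy sketch (with hypothetical variable names mirroring the equations above) computes the between-class and total scatter of the projected features for one view and evaluates the trace objective for a given projection:

```python
# Between-class scatter S_b, total scatter S_t, and Tr(S_b) - gamma * Tr(S_t)
# for a single view under projection P.
import numpy as np

def fda_objective(X, y, P, gamma=0.5):
    """X: (d, n) view matrix, y: (n,) class labels, P: (k, d) projection."""
    Z = P @ X                                  # projected features, shape (k, n)
    n = Z.shape[1]
    mu = Z.mean(axis=1, keepdims=True)         # overall mean of the view
    S_t = (Z - mu) @ (Z - mu).T / n            # total scatter
    S_b = np.zeros((Z.shape[0], Z.shape[0]))
    for c in np.unique(y):
        Zc = Z[:, y == c]
        mu_c = Zc.mean(axis=1, keepdims=True)
        S_b += Zc.shape[1] / n * (mu_c - mu) @ (mu_c - mu).T
    return np.trace(S_b) - gamma * np.trace(S_t)
```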

Inter-view consistency. The next step is to explore the complementary information across views using HSIC (e.g., Gretton, Bousquet & Smola, 2005). Based on HSIC, the function can be defined as
$$\operatorname{HSIC}\left(Z^{(v)}, Z^{(w)}\right) = (n-1)^{-2}\, \operatorname{tr}\left(K^{(v)} H K^{(w)} H\right),$$
$$K^{(v)} = Z^{(v)T} Z^{(v)} = X^{(v)T} P^{(v)T} P^{(v)} X^{(v)}, \quad K^{(w)} = Z^{(w)T} Z^{(w)} = X^{(w)T} P^{(w)T} P^{(w)} X^{(w)},$$
where $K^{(v)}$ and $K^{(w)}$ are the Gram matrices of the two views parameterized by the projections $P^{(v)}$ and $P^{(w)}$. In the model, the dependency between $K^{(v)}$ and $K^{(w)}$ is enhanced by maximizing the HSIC function, and $H$ is the centering matrix that ensures zero mean in the feature space.
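A minimal sketch of this linear-kernel HSIC term, assuming two projected views with $n$ aligned samples:

```python
# Linear-kernel HSIC between two projected views Z_v and Z_w.
import numpy as np

def hsic(Zv, Zw):
    """Zv, Zw: (k, n) projected features of two views, n aligned samples."""
    n = Zv.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kv = Zv.T @ Zv                             # linear-kernel Gram matrix, (n, n)
    Kw = Zw.T @ Zw
    return np.trace(Kv @ H @ Kw @ H) / (n - 1) ** 2
```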

Optimization. The FDA-HSIC optimization problem can be written as
$$\max_{\{P^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V}\left[\operatorname{Tr}\left(S_b^{(v)}; P^{(v)}\right) + \lambda_1 \operatorname{Tr}\left(S_t^{(v)}; P^{(v)}\right)\right] + \lambda_2 \sum_{v \neq w} \operatorname{HSIC}\left(P^{(v)}X^{(v)}, P^{(w)}X^{(w)}\right)$$
$$= \max_{\{P^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \operatorname{Tr}\left(P^{(v)}\left(A + \lambda_1 B + \lambda_2 C\right)P^{(v)T}\right) = \max_{\{P^{(v)}\}_{v=1}^{V}} \sum_{v=1}^{V} \operatorname{tr}\left(P^{(v)} D P^{(v)T}\right)$$
$$\text{s.t. } P^{(v)} P^{(v)T} = I, \quad v = 1, 2, \ldots, V,$$

where λ1 > 0 and λ2 > 0.

2) Deep representation based on dynamic guided learning

The output $P^{(v)}$ of the first submodel, the representation of the $v$th view, is fed into this submodel. In this process, the learning of each view is independent. Inspired by Sabour's work (e.g., Sabour, Frosst & Hinton, 2017), this submodel adopts a dynamic guided (routing) mechanism to learn the location or direction information of each view. Taking the first view $P^{(1)}$ as an example, $P^{(1)}$ is first divided into several capsules $u_1, \ldots, u_{k_1}$. Each capsule $u_{k_1}$ is multiplied by a weight matrix $W_{k_1 k_2}$ to produce a prediction vector $\hat{o}_{k_2|k_1}$, which can be viewed as a deep learning mapping. Next, to let the length of the output vector represent a probability, a non-linear squashing function is applied to $s_{k_2}$, where $s_{k_2}$ is a weighted sum over the prediction vectors with coupling coefficients $c_{k_1 k_2}$:
$$\hat{o}_{k_2|k_1} = W_{k_1 k_2} u_{k_1}, \quad s_{k_2} = \sum_{k_1} c_{k_1 k_2}\, \hat{o}_{k_2|k_1}, \quad c_{k_1 k_2} = \operatorname{softmax}\left(p_{k_1 k_2}\right) = \frac{\exp\left(p_{k_1 k_2}\right)}{\sum_{k} \exp\left(p_{k_1 k}\right)},$$
where the initial logits $p_{k_1 k_2}$ are the log prior probabilities that capsule $k_1$ is coupled to capsule $k_2$. A sketch of this routing step is given below.
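The following is a compact NumPy sketch of the routing procedure (after Sabour, Frosst & Hinton, 2017), not the authors' exact implementation; the squashing function it calls is defined in the next equation, and the array shapes and iteration count are assumptions:

```python
# Dynamic routing: combine prediction vectors with softmax coupling
# coefficients, squash, and raise the logits by agreement.
import numpy as np

def squash(s):
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def dynamic_routing(o_hat, n_iters=3):
    """o_hat: (k1, k2, dim) prediction vectors o_hat_{k2|k1} = W_{k1k2} u_{k1}."""
    k1, k2, _ = o_hat.shape
    p = np.zeros((k1, k2))                                    # initial routing logits
    for _ in range(n_iters):
        c = np.exp(p) / np.exp(p).sum(axis=1, keepdims=True)  # coupling coefficients
        s = np.einsum("ij,ijd->jd", c, o_hat)                 # s_{k2} = sum_{k1} c_{k1k2} o_hat
        v = squash(s)                                         # output capsules v_{k2}
        p += np.einsum("ijd,jd->ij", o_hat, v)                # agreement update
    return v
```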

Finally, the output of the first view is $P^{*(1)} = [v_1, v_2, \ldots, v_{k_2}]$, where each $v_{k_2} = [v_1, \ldots, v_j]$. The discriminant characteristics of the single view are learned through the squashing function:
$$v_{k_2} = \operatorname{squash}\left(s_{k_2}\right) = \frac{\|s_{k_2}\|^2}{1 + \|s_{k_2}\|^2} \frac{s_{k_2}}{\|s_{k_2}\|}.$$

3) Shared representation based on deep generalized canonical correlation analysis

This sub-model learns the ultimate common fusion representation $G$ from all views by minimizing the following loss:
$$\underset{U^{(v)} \in \mathbb{R}^{d_v \times r},\; G \in \mathbb{R}^{r \times N}}{\text{minimize}} \sum_{v=1}^{V} \left\|G - U^{(v)T} P^{*(v)}\right\|_F^2, \quad \text{subject to } GG^T = I_r,$$
where $P^{*(v)} \in \mathbb{R}^{d_v \times N}$ is the latent discriminated embedding of the $v$th view. All the $P^{*(v)}$ are then fused into the common matrix $G$.

Training

In this model, the scaled empirical covariance matrix $C_{vv}$ of the $v$th network output is calculated as $C_{vv} = P^{*(v)} \left(P^{*(v)}\right)^T \in \mathbb{R}^{d_v \times d_v}$, and $\tilde{P}^{(v)} = P^{*(v)T} C_{vv}^{-1} P^{*(v)} \in \mathbb{R}^{N \times N}$ is the corresponding projection matrix. In the optimization process, the sum of the correlations between the fusion representation $G$ and each single view is maximized. The reconstruction error is
$$\sum_{v=1}^{V} \left\|G - U^{(v)T} P^{*(v)}\right\|_F^2 = \sum_{v=1}^{V} \left\|G - G \tilde{P}^{(v)}\right\|_F^2 = rV - \operatorname{Tr}\left(G M G^T\right),$$
where $M = \sum_{v=1}^{V} \tilde{P}^{(v)}$ is positive semi-definite and the rows of $G$ are the top $r$ eigenvectors of $M$. Maximizing $\operatorname{Tr}(GMG^T)$ jointly captures the correlation of every pair of views.

The derivative of the loss function with respect to each view's output is
$$\frac{\partial L}{\partial P^{*(v)}} = 2 U^{(v)} G - 2 U^{(v)} U^{(v)T} P^{*(v)},$$

where $L = rV - \sum_{i=1}^{r} \lambda_i(M)$ and $\lambda_i(M)$ denotes the $i$th largest eigenvalue of $M$.
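The shared representation can be obtained in closed form, as in DGCCA-style training. The following sketch (with an assumed small regularizer `eps` added for invertibility) builds $M$ from the view outputs and takes its top-$r$ eigenvectors as $G$:

```python
# Closed-form shared representation: M = sum_v P^T Cvv^{-1} P, G = top-r
# eigenvectors of M (as rows).
import numpy as np

def gcca_shared(views, r, eps=1e-6):
    """views: list of (d_v, N) matrices P*(v); returns G of shape (r, N)."""
    N = views[0].shape[1]
    M = np.zeros((N, N))
    for Pv in views:
        Cvv = Pv @ Pv.T + eps * np.eye(Pv.shape[0])   # regularized covariance
        M += Pv.T @ np.linalg.solve(Cvv, Pv)          # projection matrix P^T Cvv^{-1} P
    eigvals, eigvecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    return eigvecs[:, -r:].T                          # top-r eigenvectors as rows of G
```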

Experiments

Datasets

Four real-world datasets were selected to verify the model. Their views include network, text, ID, and image data.

Football. This dataset was collected as previously described in Greene & Cunningham (2013). It consists of Twitter activity data for 248 English Premier League football players and clubs. The dataset contains nine views and 20 class labels.

Wikipedia. This dataset was collected as previously described in Rasiwasia, Mahajan & Mahadevan (2014) and Pereira et al. (2014). It comes from Wikipedia's featured articles. The dataset contains two views and 10 categories.

Handwritten. This dataset was collected from https://archive.ics.uci.edu/ml/datasets/Multiple+Features. It contains six views and 10 categories.

Pascal. This dataset was collected as previously described in Rashtchian et al. (2010). It contains six views and 20 categories.

Experimental settings

In the experiments, 80% of the data were randomly chosen for training and the remaining 20% for testing. For FDA-HSIC, the parameters were tuned as in Zhang's work (e.g., Zhang, Liu & Liu, 2018). The maximum number of training iterations and the learning rate η were set as in Benton, Khayrallah & Gujral (2017) and Chandar, Khapra & Larochelle (2015).

Evaluation metrics

The model is evaluated against the baselines in terms of clustering precision, recall, F1, Rand index (RI), normalized mutual information (NMI), and Silhouette_score. These measures are defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2PR}{P + R}, \quad RI = \frac{TP + TN}{TP + FP + FN + TN},$$
where $TP$ and $FP$ are the numbers of positive and negative samples predicted as positive, and $FN$ and $TN$ are the numbers of positive and negative samples predicted as negative;
$$NMI(X; Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)},$$
where $I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$ and $H(X) = -\sum_{i} p(x_i) \log p(x_i)$.

The Silhouette_score for a sample is
$$\text{Silhouette\_score} = \frac{b - a}{\max(a, b)},$$
where $a$ is the mean distance between the sample and the other members of its own cluster, and $b$ is the mean distance between the sample and the members of the nearest cluster of which it is not a member. Values close to 1 indicate that a sample is well matched to its own cluster and poorly matched to neighboring clusters. A sketch of the metric computations follows.
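The following sketch shows one way these metrics can be computed with scikit-learn and NumPy; the pairwise precision/recall/F1/RI are derived from the pair confusion matrix, and the labels here are toy placeholders:

```python
# Pairwise clustering metrics from pair counts, plus NMI; silhouette comes
# directly from sklearn on the learned embedding.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, silhouette_score
from sklearn.metrics.cluster import pair_confusion_matrix

def pairwise_scores(y_true, y_pred):
    (TN, FP), (FN, TP) = pair_confusion_matrix(y_true, y_pred)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    ri = (TP + TN) / (TP + TN + FP + FN)
    return precision, recall, f1, ri

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(pairwise_scores(y_true, y_pred))
print(normalized_mutual_info_score(y_true, y_pred))
# silhouette_score(G.T, y_pred) would score a learned embedding G (r x N)
```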

Baselines

The following methods were compared in the experiments described in this section:

DCCA: This method applies deep learning to CCA. However, DCCA can only learn non-linear mappings of two views (e.g., Andrew, Arora & Bilmes, 2013).

DGCCA: Although this model overcame the two-view limitation of DCCA, it ignores the specific characteristics of the individual views (e.g., Benton, Khayrallah & Gujral, 2017).

FISH-MML: A Fisher-HSIC multi-view metric learning method, simple yet rather effective, which can learn intra-view separability and inter-view correlation (e.g., Zhang, Liu & Liu, 2018).

Dynamic guided representation-MVC: To better verify the contribution of FDA-HSIC to the full model, this baseline adds only the dynamic guided representation to the fusion model, without FDA-HSIC (e.g., Zheng, Ge & Li, 2020).

Table 1 summarizes the differences among these methods. In summary, the effectiveness of the proposed DGMRL-MVC model was directly evaluated by comparing it with the newest DCCA-based multi-view learning methods. The significance of the dynamic guided representation mechanism in the proposed model was further verified by comparison with FISH-MML. To provide a further comprehensive analysis of the stability and performance of the proposed multi-view representation learning model, dynamic guided representation-MVC was also chosen as a baseline.

Results

The experimental results answered the following questions:

(RQ1) How good is the performance improvement of the proposed model with dynamic guided representation? (DGMRL-MVC vs. FISH-MML)

FISH-MML is a strong multi-view metric learning baseline built on FDA-HSIC, so the proposed DGMRL-MVC model was first compared with it. The results showed that DGMRL-MVC significantly outperformed FISH-MML on all four datasets for the multi-view clustering task. Table 2 presents the comprehensive comparison results: the proposed model achieved the best performance on nearly all datasets under all metrics. Taking the Handwritten dataset with six views as an example, the improvement of DGMRL-MVC over FISH-MML was about 42.2%, 60.4%, 46%, 60.3%, and 19.5% in terms of precision, recall, F1, RI, and NMI, respectively. This shows that the dynamic guided representation used in the proposed method is effective for multi-view clustering. The model is enhanced in two respects: inter-intra relations and dynamic routing learning embedded in each view's self-representation, and fusion representation with maximum correlation.

Table 1:
The differences among baselines.
| Method | Type | Dynamic guided representation | FDA-HSIC | Optimization |
|---|---|---|---|---|
| DCCA | Deep, 2-view | × | × | $(\theta_1^*, \theta_2^*) = \arg\max_{\theta_1, \theta_2} \operatorname{corr}\left(f_1(X_1; \theta_1), f_2(X_2; \theta_2)\right)$ |
| DGCCA | Deep, multi-view | × | × | $\min_{U_j \in \mathbb{R}^{d_j \times r},\, G \in \mathbb{R}^{r \times N}} \sum_{j=1}^{J} \left\|G - U_j^T f_j(X_j)\right\|_F^2$ |
| FISH-MML | Deep, multi-view | × | ✓ | $\max_{\{M^{(v)}\}_{v=1}^{V}} S\left(\{M^{(v)}\}_{v=1}^{V}\right) + \lambda\, C\left(\{M^{(v)}\}_{v=1}^{V}\right)$ |
| Dynamic guided representation-MVC | Deep, multi-view | ✓ | × | $\min_{U_j \in \mathbb{R}^{d_j \times r},\, G \in \mathbb{R}^{r \times N}} \sum_{j=1}^{J} \left\|G - U_j^T O_j\right\|_F^2$ |
| DGMRL-MVC | Deep, multi-view | ✓ | ✓ | $\min_{U^{(v)} \in \mathbb{R}^{d_v \times r},\, G \in \mathbb{R}^{r \times N}} \sum_{v=1}^{V} \left\|G - U^{(v)T} P^{*(v)}\right\|_F^2$ |
DOI: 10.7717/peerjcs.922/table-1
Table 2:
Improvement performance of dynamic guided representation on different views.
The best results are highlighted in bold.
| Dataset | Method | Number of views | Precision | Recall | F1 | RI | NMI |
|---|---|---|---|---|---|---|---|
| Handwritten | FISH-MML | 6 (full views) | 0.09961 | 0.11099 | 0.10493 | 0.11100 | 0.09643 |
| | DGMRL-MVC | | **0.14166** | **0.17800** | **0.15315** | **0.17800** | **0.11521** |
| | FISH-MML | 2 views | 0.17843 | 0.20600 | 0.18586 | 0.20600 | 0.21906 |
| | DGMRL-MVC | | **0.25699** | **0.22100** | **0.22521** | **0.22100** | **0.32698** |
| | FISH-MML | 3 views | 0.09092 | 0.10100 | 0.09569 | 0.10100 | 0.25037 |
| | DGMRL-MVC | | **0.17813** | **0.14300** | **0.13452** | **0.14300** | **0.37442** |
| Wikipedia | FISH-MML | 2 (full views) | 0.08957 | 0.10149 | 0.09313 | 0.10032 | 0.51780 |
| | DGMRL-MVC | | **0.10491** | **0.12321** | **0.10398** | **0.13438** | **0.53230** |
| Football | FISH-MML | 9 (full views) | 0.05679 | 0.05885 | 0.01761 | 0.06048 | 0.22956 |
| | DGMRL-MVC | | **0.05900** | **0.06248** | **0.05575** | **0.06452** | **0.28367** |
| | FISH-MML | 6 views | **0.11914** | 0.04903 | 0.03745 | 0.05645 | **0.28422** |
| | DGMRL-MVC | | 0.05619 | **0.06506** | **0.05848** | **0.06452** | 0.27343 |
| Pascal | FISH-MML | 6 (full views) | 0.07616 | 0.06700 | 0.06606 | 0.06700 | 0.23351 |
| | DGMRL-MVC | | **0.08315** | **0.07500** | **0.07342** | **0.07500** | **0.48847** |
| | FISH-MML | 2 views | 0.06266 | 0.05600 | 0.05094 | 0.05600 | 0.18011 |
| | DGMRL-MVC | | **0.08313** | **0.06800** | **0.06448** | **0.06800** | **0.45506** |
DOI: 10.7717/peerjcs.922/table-2

(RQ2) How good is the performance of DGMRL-MVC compared with other DCCA-based methods?

The latent DCCA-based representations were compared with the proposed algorithm on a clustering task, with DCCA and DGCCA chosen as baselines. In the dynamic guided representation-MVC model, FDA-HSIC was absent; only the dynamic guided representation and generalized canonical correlation analysis were used. As shown in Table 3, DGMRL-MVC significantly outperformed DGCCA and DCCA on all datasets. In addition, performance with the proposed discriminated representation was generally better than with dynamic guided representation-MVC. Overall, our model achieved comprehensive results because it adds the FDA-HSIC and dynamic guided modules, which distill more separable intra-view and specific intrinsic features, resulting in a high-quality latent discriminated embedding.

Table 3:
Comparison with latest methods with multiple views.
The best results are highlighted in bold.
| Dataset | Method | Number of views | Precision | Recall | F1 | RI | NMI |
|---|---|---|---|---|---|---|---|
| Handwritten | DCCA | 6 (full views) | 0.10111 | 0.10000 | 0.10047 | 0.10000 | **0.57354** |
| | DGCCA | | 0.08222 | 0.05400 | 0.06244 | 0.10800 | 0.54150 |
| | FISH-MML | | 0.09961 | 0.11099 | 0.10493 | 0.11100 | 0.09643 |
| | Dynamic guided representation-MVC | | 0.09835 | 0.11650 | 0.10402 | 0.11650 | 0.52245 |
| | DGMRL-MVC | | **0.14166** | **0.17800** | **0.15315** | **0.17800** | 0.11521 |
| Wikipedia | DCCA | 2 (full views) | 0.09245 | 0.08467 | 0.08289 | 0.09090 | 0.50819 |
| | DGCCA | | 0.10329 | 0.10567 | 0.10267 | 0.10400 | 0.54222 |
| | FISH-MML | | 0.08957 | 0.10149 | 0.09313 | 0.10032 | 0.51780 |
| | Dynamic guided representation-MVC | | 0.16876 | 0.14224 | 0.13550 | 0.17027 | 0.59470 |
| | DGMRL-MVC | | **0.19480** | **0.18393** | **0.22795** | **0.18110** | **0.59851** |
| Football | DCCA | 9 (full views) | 0.05625 | 0.04909 | 0.05200 | 0.05241 | 0.21037 |
| | DGCCA | | 0.01404 | 0.01885 | 0.01603 | 0.02016 | 0.21256 |
| | FISH-MML | | 0.05679 | 0.05885 | 0.01761 | 0.06048 | 0.22956 |
| | Dynamic guided representation-MVC | | **0.10516** | 0.06449 | 0.05430 | 0.06451 | 0.26470 |
| | DGMRL-MVC | | 0.07540 | **0.06867** | **0.06647** | **0.07258** | **0.28367** |
| Pascal | DCCA | 6 (full views) | 0.02123 | 0.04200 | 0.02814 | 0.04200 | 0.03363 |
| | DGCCA | | 0.04716 | 0.06600 | 0.05351 | 0.06600 | 0.09261 |
| | FISH-MML | | 0.07616 | 0.06700 | 0.06606 | 0.06700 | 0.23351 |
| | Dynamic guided representation-MVC | | 0.06619 | 0.07200 | 0.04782 | 0.07200 | 0.15842 |
| | DGMRL-MVC | | **0.08315** | **0.07500** | **0.07342** | **0.07500** | **0.48847** |
DOI: 10.7717/peerjcs.922/table-3

(RQ3) Is the discriminated fusion representation good on the clustering task?

As shown in Tables 4 and 5, the visualizations are consistent with the clustering results. In this experiment, the Silhouette_score was also chosen as an evaluation indicator. Tables 4 and 5 reveal the advantage of the proposed model when the dynamic guided and FDA-HSIC modules are used simultaneously: the Silhouette_score for clustering was higher than that of the FISH-MML model. Besides, the Silhouette_score of the full-view method was higher than that of the 2-view method, which demonstrates the value of the multi-view fusion representation. The model learns better with full views and can effectively mine the features of multiple views. The results empirically show that clustering with full views is more robust than with two views.

(RQ4) How does performance vary w.r.t. the parameter of n_clusters?

Parameter analysis was conducted on the Silhouette_score by varying the number of clusters in [2, 3, 5, 7, 8, 9, 10] (Handwritten and Wikipedia datasets) and [2, 5, 8, 10, 13, 15, 18, 20] (Football and Pascal datasets). Figures 2, 3, 4 and 5 plot the Silhouette_score for different numbers of clusters on the four datasets. The proposed DGMRL-MVC showed the best performance on three of the datasets but was slightly worse than the other methods on the Handwritten dataset. Combined with the previous experimental results, the reason for the result in Fig. 2 can be analyzed as follows. This dataset differs from the others: its views are six types of features extracted from the same image, i.e., Fourier coefficients of the character shapes, profile correlations, Karhunen–Loève coefficients, pixel averages in 2 × 3 windows, Zernike moments, and morphological features. Because the original views differ greatly and are only weakly correlated, our model weakens the original relationships among the views. The Silhouette_score reflects the distances of samples within the same cluster; thus, although the clustering evaluation metrics of our model are higher than those of the baselines, samples lie closer to the cluster boundaries on this kind of dataset.

Table 4:
Visualization of clustering result with FISH-MML on the Pascal dataset.
The best results are highlighted in bold.
DOI: 10.7717/peerjcs.922/table-4
Table 5:
Visualization of clustering result with DGMRL-MVC (ours) on the Pascal dataset.
The best results are highlighted in bold.
DOI: 10.7717/peerjcs.922/table-5
Figure 2: Parameter analysis on n_clusters in terms of Silhouette_score on the Handwritten dataset.
Figure 3: Parameter analysis on n_clusters in terms of Silhouette_score on the Wikipedia dataset.
Figure 4: Parameter analysis on n_clusters in terms of Silhouette_score on the Football dataset.
Figure 5: Parameter analysis on n_clusters in terms of Silhouette_score on the Pascal dataset.

(RQ5) Ablation experiment

Finally, ablation experiments were used to further verify the effectiveness of the model. We compared the learning performance of the model under three configurations: Step-A is the inter-intra representation alone, Step-B is direct fusion representation, and Step-C is dynamic routing plus fusion representation. From the results in Fig. 6, the performance of Step-B is lower than that of Step-C. The main reason is that fusion learning targets only the maximum correlation between views and lacks difference learning among views, which is not conducive to feature expression for clustering tasks. At the same time, the performance difference between Step-B and Step-A is small, with only some indicators slightly higher for Step-B, indicating that in the fusion learning process, the learning of specific intrinsic features with dynamic routing plays the most prominent role. Our full method outperforms the other sub-models, which shows that, on the basis of strengthened learning of view relationships, adding the subspace fusion characterization can effectively improve the clustering performance of the model.

Figure 6: Ablation experiments on four multi-view datasets.

Conclusions

This paper has proposed a novel framework called Dynamic Guided Metric Representation Learning for Multi-View Clustering (DGMRL-MVC). The proposed method clusters multi-view data in a latent discriminated embedding space learned through multi-step enhanced representation. Within this framework, intra-view class separability and the complex correlations among different views are explored using FDA-HSIC. Then, location or direction information is added to enhance each single-view representation through the dynamic routing deep learning mechanism. Finally, the common discriminated representation is obtained by generalized canonical correlation analysis over all views, which directly improves MVC performance. Experiments on four multi-view datasets have demonstrated the effectiveness of the proposed method for clustering tasks.

Multi-view data exist everywhere, and many challenging problems remain, such as data alignment from heterogeneous sources (e.g., Wang, Zheng & Li, 2017) and the design and optimization of unified multi-view fusion models (e.g., Tang et al., 2020), among others. In the future, we will do more work on multi-view data fusion.

Supplemental Information

Handwritten dataset

DOI: 10.7717/peerj-cs.922/supp-1

Wikipedia dataset

DOI: 10.7717/peerj-cs.922/supp-3

Background information for each dataset

DOI: 10.7717/peerj-cs.922/supp-4