Disentangled self-attention neural network based on information sharing for click-through rate prediction

PeerJ Computer Science


Introduction

  • We introduce a disentangled self-attention mechanism and define pairwise terms and unary terms. This allows the model to consider feature representations from a global perspective while focusing on the most important features (a minimal sketch of this decomposition follows this list).

  • This article introduces two modules in the sharing interaction layer to strengthen the interaction signals between the parallel networks. One module (the decomposition module) distinguishes the feature distributions, and the other (the sharing module) fuses the features of the parallel networks.

  • Extensive experiments on two datasets demonstrate that the proposed method achieves higher accuracy and lower loss than existing prediction methods.
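
The first contribution rests on splitting attention into a pairwise term and a unary term. As a rough illustration only, the NumPy sketch below shows one common formulation of such a decomposition (whitened pairwise attention plus a per-key unary score, in the spirit of DESTINE); it is not the paper's actual layer, and the names disentangled_attention, Wq, Wk, Wv, and u are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(E, Wq, Wk, Wv, u):
    """E: (M, d) field embeddings; Wq/Wk/Wv: (d, d) projections; u: (d,) unary vector."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    # Pairwise term: whitened dot-product attention (per-field means removed from Q and K).
    Qw = Q - Q.mean(axis=0, keepdims=True)
    Kw = K - K.mean(axis=0, keepdims=True)
    pairwise = softmax(Qw @ Kw.T / np.sqrt(d), axis=-1)   # (M, M) pairwise interaction weights
    # Unary term: per-feature importance shared by all queries.
    unary = softmax(K @ u / np.sqrt(d), axis=-1)          # (M,) global feature importance
    scores = pairwise + unary[None, :]                    # disentangled attention scores
    return scores @ V                                     # (M, d) attended representations

# Toy example: 4 feature fields with embedding size 8
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = disentangled_attention(E, Wq, Wk, Wv, rng.normal(size=8))
print(out.shape)  # (4, 8)
```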

Methods

Problem definition

Embedding layer

Disentangled self-attention layer

Sharing interaction layer

Decomposition module.

Sharing module.

  1. We assume that $C^{l}$ and $D^{l}$ are two feature representations in the parallel network architecture. The first feature fusion, denoted $F_{HP}$, can be expressed as $F_{HP} = C^{l} \circ D^{l}$, where $\circ$ is the Hadamard product, i.e., the element-wise product.

  2. The Hadamard product or the inner product alone may not model sparse feature interactions effectively. Therefore, we combine the inner product and the Hadamard product to learn feature interactions. We name the second feature fusion $F_{IH}$. This interaction function is denoted as $F_{IH} = a_{i}^{l} \cdot (C^{l} \circ D^{l})$, where $a_{i}^{l}$ are the shared parameters in the $l$-th layer and $\cdot$ denotes the regular inner product.

  3. The third feature fusion concatenates the two vectors. To keep the output dimension at $M \times d$, we design a feedforward layer with an activation function. The formula can be expressed as $F_{CN} = \mathrm{ReLU}(w_{k}^{T}[C^{l}; D^{l}] + b_{k})$, where $w_{k}$ and $b_{k}$ are the weight and bias parameters of the $l$-th layer, respectively (an illustrative sketch of the three fusion functions follows this list).
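
To make the tensor shapes concrete, the NumPy sketch below implements the three fusion functions under the assumption that both representations $C^{l}$ and $D^{l}$ have shape (M, d). Reading $F_{IH}$ as a shared-parameter reweighting of the Hadamard product is an interpretation of the formula above, and all variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_hadamard(C_l, D_l):
    # F_HP = C^l ∘ D^l : element-wise (Hadamard) product; output stays (M, d).
    return C_l * D_l

def fuse_inner_hadamard(C_l, D_l, a_l):
    # F_IH = a^l · (C^l ∘ D^l): the Hadamard product reweighted by the shared
    # parameters a^l of the l-th layer (one plausible reading of the formula above).
    return (C_l * D_l) * a_l          # a_l broadcasts over the field dimension

def fuse_concat_ffn(C_l, D_l, W_k, b_k):
    # F_CN = ReLU(w_k^T [C^l ; D^l] + b_k): concatenate along the embedding
    # dimension, then a feedforward layer maps 2d back to d, keeping (M, d).
    concat = np.concatenate([C_l, D_l], axis=-1)   # (M, 2d)
    return relu(concat @ W_k + b_k)                # W_k: (2d, d), b_k: (d,)

# Toy example with M = 4 fields and embedding size d = 8
rng = np.random.default_rng(0)
M, d = 4, 8
C_l, D_l = rng.normal(size=(M, d)), rng.normal(size=(M, d))
print(fuse_hadamard(C_l, D_l).shape)                                              # (4, 8)
print(fuse_inner_hadamard(C_l, D_l, rng.normal(size=d)).shape)                    # (4, 8)
print(fuse_concat_ffn(C_l, D_l, rng.normal(size=(2 * d, d)), np.zeros(d)).shape)  # (4, 8)
```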

Output layer

Experiment and analysis

Experimental settings

Datasets.

Data preprocessing.

Experimental details.

Evaluation metrics.

Baseline models.

  • LR (Richardson, Dominowska & Ragno, 2007): It learns only first-order (linear) feature weights and therefore cannot represent interactions between features.

  • FM (Rendle, 2010): It handles high-dimensional sparse features effectively by modeling pairwise feature interactions through factorized parameters, which gives it good generalization capability. However, it cannot model higher-order feature interactions (a sketch of its scoring function follows this list).

  • Wide & Deep (WDL) (Cheng et al., 2016): It is a hybrid architecture that combines the power of deep neural networks for learning intricate patterns with the memorization capability of a wide linear model, enabling accurate click-through rate prediction by capturing both generalization and specific feature interactions.

  • NFM (He & Chua, 2017): It is a neural network-based extension of Factorization Machines that leverages deep learning techniques to capture feature interactions and improve click-through rate prediction in various applications.

  • Deep & Cross (DCN) (Wang et al., 2017): It is a neural network architecture incorporating cross-network operations to capture high-order feature interactions (He et al., 2016), enabling accurate click-through rate prediction by balancing depth-wise representation learning and explicit feature interactions.

  • DeepFM (Guo et al., 2017): It is a hybrid approach combining deep neural networks and factorization machines, leveraging their complementary strengths to capture both low-order and high-order feature interactions, enabling accurate click-through rate prediction in large-scale recommendation systems.

  • xDeepFM (Lian et al., 2018): It is an advanced deep learning architecture integrating cross-network operations and deep neural networks, effectively capturing intricate feature interactions and hierarchical representations.

  • FiBiNet (Huang, Zhang & Zhang, 2019): It introduces two modules: a Bilinear-Interaction layer and SENet. SENet is a powerful mechanism that selectively recalibrates feature representations by learning channel-wise attention weights. The Bilinear-Interaction layer performs element-wise product and linear transformation operations to capture intricate feature interactions.

  • InterHAt (Li et al., 2020): It incorporates hierarchical self-attention mechanisms to capture feature interactions at different levels, improving click-through rate prediction by effectively modeling the importance and dependencies among features in recommendation systems.

  • DESTINE (Xu et al., 2021): It stacks multiple disentangled self-attention mechanisms to model the interaction of higher-order features, which decouples the unary feature importance calculation from the second-order feature interactions.

  • DeepLight (Deng et al., 2021): It introduces a parallel network architecture that combines a DNN and FwFM, and compares it with DeepFM and xDeepFM. This approach aims to address the challenges of increased service latency and high memory usage when delivering real-time services in production.

  • CowClip (Zheng et al., 2022): It develops adaptive column-wise gradient clipping to speed up training, reducing the training time from 12 hours to 10 minutes on a single V100 GPU.

  • FinalMLP (Mao et al., 2023): It integrates feature gating and interaction aggregation layers into an enhanced two-stream MLP model; combining the two MLP streams yields improved performance.
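
For reference on the FM baseline above, the sketch below computes an FM prediction for a single sample using the standard O(kn) reformulation of the second-order term. The variable names are illustrative and not tied to any particular FM implementation.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM prediction for one sample.
    x: (n,) feature vector; w0: global bias; w: (n,) first-order weights;
    V: (n, k) latent factors, one k-dimensional vector per feature."""
    linear = w0 + w @ x
    s = V.T @ x                      # (k,)  sum_i v_i * x_i
    s_sq = (V ** 2).T @ (x ** 2)     # (k,)  sum_i v_i^2 * x_i^2
    # Second-order term: 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    second_order = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + second_order

# Toy example: 10 features with factor size k = 4
rng = np.random.default_rng(0)
n, k = 10, 4
x = rng.integers(0, 2, size=n).astype(float)   # sparse binary features
print(fm_score(x, 0.0, rng.normal(size=n), rng.normal(size=(n, k))))
```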

Performance comparison

  • In the “Previously Reported” column, the performance of InterHAt is inferior to LR on both datasets. This could be attributed to differences in how the models preprocess the datasets and in the splitting ratios. For comparability, this article evaluates all models using the same evaluation protocol and preprocessing methods. In addition, we perform a paired t-test to verify the statistical significance of DSAN's relative improvement (an illustrative sketch of such a test follows this list).

  • The performance obtained on the Avazu dataset is essentially the same as, or even higher than, the reported results. For example, InterHAt and DCN achieve better results on the Avazu dataset. The experimental results show that the shallow models LR and FM perform worse because they cannot capture complex higher-order feature interactions. Deep models are more effective at capturing such interactions and usually perform better; accordingly, models such as WDL, NFM, and DeepFM outperform the shallow models. After tuning the parameters, we find that the differences between most models become minimal; the discrepancies between DCN and DeepFM on the Avazu dataset are negligible. During training, we noticed that most models exhibit one-epoch overfitting, meaning that training for just one epoch typically suffices to achieve optimal performance. The relevant literature (Zhang et al., 2022) points out that feature sparsity is the cause of this one-epoch phenomenon, so improving model performance by reducing data sparsity may be a worthwhile research topic.

  • The performance on the Criteo dataset is generally in line with the reported results, though slightly lower in some cases. Because the original articles do not provide detailed hyperparameter information, we could not obtain the best combination of hyperparameter tuning and feature engineering. However, the InterHAt and NFM models perform better on the Criteo dataset, underscoring the effectiveness of our data preprocessing and hyperparameter optimization. We also observe that the Criteo dataset yields better feature representations at an embedding dimension of 16, while the Avazu dataset obtains better results at higher embedding dimensions. This difference may stem from the complexity and characteristics of the datasets themselves, leading to variations in the optimal embedding size.

  • Tuning model parameters is one of the crucial steps in deep learning. We tune the models with respect to embedding size, number of attention heads, and number of network layers. For both datasets, we vary the embedding size over {16, 24, 32, 40, 48} to investigate its effect on the model, and the number of network layers is tuned between 0 and 4. To further illustrate the need for tuning, Table 3 shows the results of three benchmark models before and after tuning: “Reported” reflects the findings of existing research, “Rerun” denotes the outcome before tuning, and “Retuned” the outcome after tuning. We observe that while DeepLight and FinalMLP did not reach the reported performance on the Criteo dataset, they slightly surpassed the previously reported performance on the Avazu dataset. This variance might arise from the distinct characteristics and complex relationships among features in these datasets. It indicates that the characteristics of the data influence a model's performance across datasets, and that adjustments tailored to each dataset are required to achieve better performance.
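
To make the significance check mentioned in the first bullet concrete, the snippet below shows how a paired t-test over per-run AUC scores could be carried out with SciPy. The listed scores are made-up placeholders, not results from this article.

```python
from scipy import stats

# Placeholder per-run AUC scores on identical data splits (illustrative values only).
dsan_auc     = [0.8123, 0.8119, 0.8127, 0.8121, 0.8125]
baseline_auc = [0.8098, 0.8101, 0.8095, 0.8103, 0.8099]

t_stat, p_value = stats.ttest_rel(dsan_auc, baseline_auc)   # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant improvement
```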

Performance analysis

Disentangled self-attention layer performance analysis

Different variant of sharing module

Ablation study

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Yingqi Wang conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Huiqin Ji conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Xin He analyzed the data, prepared figures and/or tables, and approved the final draft.

Junyang Yu performed the experiments, prepared figures and/or tables, and approved the final draft.

Hongyu Han conceived and designed the experiments, prepared figures and/or tables, and approved the final draft.

Rui Zhai performed the experiments, prepared figures and/or tables, and approved the final draft.

Longge Wang analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data and models are available at Zenodo: haha. (2023). jihuiqin2/interaction_ctr: v0.1 (interaction). Zenodo. https://doi.org/10.5281/zenodo.8119714.

Funding

The research received funding from the Key Research and Promotion Projects of Henan Province under Grant Agreement Nos. 222102210034, 222102210178, 222102210229, and 232102210031, and the Key Research Projects of Henan Higher Education Institutions under Grant Agreement No. 22A520020. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
