Spatio temporal attention mechanism for real time cellular traffic prediction

Dharani sabari Samudrala; Rajiv Senapati

doi:10.7717/peerj-cs.3571

Department of CSE, SRM University Andhra Pradesh, Guntur, Andhra Pradesh, India

Published: 2026-02-03
Accepted: 2025-12-15
Received: 2025-08-01

Academic Editor: Davide Chicco

Subject Areas: Algorithms and Analysis of Algorithms, Artificial Intelligence, Computer Networks and Communications, Data Mining and Machine Learning, Neural Networks
Keywords: Hybrid attention, Spatio temporal attention, Traffic prediction, Lightweight convolution, Cellular networks

Copyright: © 2026 Samudrala and Senapati
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Samudrala Ds, Senapati R. 2026. Spatio temporal attention mechanism for real time cellular traffic prediction. PeerJ Computer Science 12:e3571 https://doi.org/10.7717/peerj-cs.3571

The authors have chosen to make the review history of this article public.

Abstract

Accurate prediction of cellular network traffic is a fundamental requirement for ensuring efficient network performance in the context of increasing mobile data demands. Existing models such as Convolutional Neural Network-Long Short Term Memory (CNN-LSTM), Temporal Fusion Transformer (TFT), and Reslearn often lack the capacity to capture the complex spatio and temporal patterns and variability inherent in real-world traffic, thereby limiting their effectiveness in practical deployments. This study proposes a lightweight hybrid framework incorporating a Spatio Temporal model with Attention Mechanism (STAM) to address these limitations and enhance predictive performance. The proposed model is trained on real-world cellular network data and is designed to capture both short-term fluctuations and long-term temporal dependencies. The attention mechanism embedded within the architecture allows the model to selectively focus on salient temporal features, improving its ability to learn meaningful traffic patterns while maintaining computational efficiency. Evaluation with standard metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), and R² demonstrates improved prediction accuracy compared to traditional baseline models. The resulting predictions provide actionable insights for dynamic resource allocation and informed network planning. These capabilities support reduced latency, improved traffic distribution, and efficient bandwidth utilization, thereby contributing to enhanced Quality of Service (QoS), Spectrum Efficiency (SE), and Network Utility (NU) within next-generation cellular systems.

Introduction

With the rapid growth of digital technologies in both everyday life and industrial settings, cellular networks are facing increasingly complex demands. Modern services such as video streaming, online gaming, remote healthcare, smart transportation, and automation require networks that can deliver reliable performance despite constantly changing traffic conditions (Saad, Bennis & Chen, 2019). Recent studies emphasize the importance of attention-based and context-aware deep learning models for intelligent traffic management in 6G networks (Das et al., 2025). These growing and diverse requirements are placing significant pressure on current network systems, revealing limitations in how well they can manage fluctuating data loads. One of the major challenges in this context is accurately predicting network traffic. Without dependable traffic prediction, it becomes difficult for network providers to manage resources efficiently, ensure consistent service quality, and prevent congestion. Accurate traffic prediction plays a key role in making networks more responsive in real time, improving the overall user experience, and supporting the effective use of limited infrastructure, especially as we move toward more advanced, next-generation network technologies.

Conventional traffic prediction has often relied on statistical models such as the AutoRegressive (AR) model and the AutoRegressive Integrated Moving Average (ARIMA) model (Moayedi & Masnadi-Shirazi, 2008), which use historical data to predict future trends. While effective in stable settings, these methods struggle with the nonlinearity, noise, and variability found in real-world network traffic (Box et al., 2015). Some enhancements, such as decomposing traffic into trends and abrupt changes, have been explored but still lack adaptability and generalization. Recently, deep learning has shown promise by capturing complex temporal patterns from large datasets. However, many models face challenges in efficiently learning relevant features, resulting in limited accuracy and high computational cost, both of which are barriers to real-time, practical deployment. Existing spatio-temporal models such as Convolutional Neural Network-Long Short Term Memory (CNN-LSTM), Gated Recurrent Unit (GRU)-based, and hybrid attention networks often achieve high accuracy but suffer from high computational overhead, limited adaptability, and weak generalization across varying traffic patterns.

To address the limitations of existing approaches, this study introduces a lightweight deep learning framework that combines a hybrid attention mechanism with a Spatio Temporal model with Attention Mechanism (STAM). The proposed approach is designed for both accuracy and efficiency, capable of capturing short-term variations and long-term patterns in network traffic while keeping computational demands low. Its core components include an Efficient Hybrid Attention (EHA) module for improved feature extraction, Depthwise Separable Convolution (DSC) to reduce complexity, and an STAM to focus on the most relevant time- and location-based information. This approach is not only a technical advancement but also a response to the growing need for intelligent, adaptive network management. By enhancing prediction accuracy, the model may support more effective load balancing, dynamic bandwidth allocation, and proactive congestion control. These improvements are key to strengthening Quality of Service (QoS), Network Utility (NU), and Spectrum Efficiency (SE), all critical factors for the evolution of next-generation cellular networks. The main contributions of this work are outlined below.

We propose a unified framework that efficiently learns spatial and temporal dependencies within cellular traffic data, enabling accurate real-time traffic volume prediction with reduced computational cost.
Our proposed framework incorporates an adaptive attention strategy that jointly captures spatial and temporal correlations, allowing the model to focus on the most influential regions and time intervals under dynamic network conditions.
Further, we develop a lightweight feature extraction strategy that forms the architectural foundation of STAM, achieving robust predictive performance while significantly minimizing model complexity and training overhead.
Through comprehensive experimentation, the proposed model consistently outperforms baseline methods on key evaluation metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), and $R^{2}$ .

In the rest of this article, related work is reviewed in ‘Related Work’, the dataset description and data preprocessing are presented in ‘Dataset Description and Preprocessing’, the proposed methodology is presented in ‘Proposed Methodology’, experimental results are discussed in ‘Results and Analysis’, and finally ‘Conclusion’ concludes this article.

Related work

In recent years, traffic forecasting has advanced from basic statistical models to deep learning approaches that handle complex spatial and temporal patterns. Early models like ARIMA and Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) captured linear and seasonal trends (Zhou et al., 2005; Yu et al., 2010), but failed to represent nonlinear dynamics (Li et al., 2017). Machine learning methods such as Support Vector Regression (SVR) and Gaussian Processes addressed nonlinearity but faced scalability issues on large data (Cortes & Vapnik, 1995). Deep learning models including CNN, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and GRU achieved better temporal learning (LeCun et al., 1998; Bai, Kolter & Koltun, 2018). The work reported in LeCun et al. (1998) demonstrated the power of convolutional networks for spatial feature extraction, inspiring their use in traffic prediction. However, most models still emphasize temporal aspects and neglect spatial relationships, leading to less accurate spatio-temporal forecasting. Hybrid models, such as CNN-RNN combinations and attention-integrated architectures, address this by modeling both spatial and temporal patterns. Dense CNNs improve accuracy but increase computational load (Zhang et al., 2018), while Hybrid Spatio Temporal Network (HSTNet) balances efficiency with accuracy using attention and deformable convolutions (Zhang et al., 2020). Lightweight models like Fully Connected Sequential Network (FCSN) and 1 Dimensional-Convolutional Neural Network (1D-CNN) achieve lower MAE with reduced complexity and faster execution (Mohseni, Nikan & Shami, 2022). Table 1 summarizes the state-of-the-art literature and existing baseline models commonly used for comparison in spatio-temporal prediction, where the notation “S” represents spatial components, “T” denotes temporal components, and “S, T” indicates the integration of both spatial and temporal components.

Table 1:

State-of-the-art literature.

Method	Dataset	S	T	S,T	Contributions	Limitations	Metrics
HA (Qiu et al., 2018)	City cellular (Azari et al., 2019)		T		Past average baseline	Ignores spatial + trends	RMSE
ARIMA (Moayedi & Masnadi-Shirazi, 2008)	City cellular (Azari et al., 2019)		T		Linear time series modeling	No spatial, non linear limits	MAE, RMSE
RNN (Qiu et al., 2018)	City cellular (Azari et al., 2019)		T		Sequential modelling	Weak long term memory	MAE, RMSE
LSTM (Wang et al., 2017)	Shanghai (Wang et al., 2015)		T		Captures long dependencies	Ignores spatial, costly	RMSE, R²
STDenseNet (Zhang et al., 2018)	Milan (Barlacchi et al., 2015)	S	T	ST	Dense CNN fusion	Overfit, compute intensive	RMSE, MAE
HSTNet (Zhang et al., 2020)	Milan (Barlacchi et al., 2015)	S	T	ST	Joint spatial temporal modeling	Complex tuning needed	RMSE, MAE, R²
DCGMAM (Xiao et al., 2025)	Milan (Barlacchi et al., 2015)	S	T	ST	Diffusion GRU + attention	High overhead, scaling issues	RMSE, MAE, R²
STMP (Gong et al., 2025)	Shanghai, Nanjing (Wang et al., 2015; Gong et al., 2024)	S	T	ST	Cross attn + spatial encoding	Difficult on large networks	RMSE, MAE
TFT (Kougioumtzidis et al., 2025)	Milan (Barlacchi et al., 2015)		T		Attn based temporal learning	No spatial, memory heavy	RMSE, MAE, R²
Proposed model (STAM)	Telecom Italia (Telecom Italia, 2015)	S	T	ST	Conv + EHA + temporal attention	Some tuning needed	RMSE, MAE, SMAPE, $R^{2}$

DOI: 10.7717/peerj-cs.3571/table-1

Recent innovations in attention-based models have advanced the modeling of dynamic spatial and temporal dependencies. Models like Non-Local Graph Neural Networks with Non-Local Attention Mechanism (NLG-NLAM) effectively capture evolving spatial correlations while highlighting critical temporal features (Rao et al., 2022). Lightweight techniques, such as DSC and $1 \times 1$ convolution, reduce computational complexity without compromising feature quality (Howard et al., 2017). The integration of DenseNet, derived from Residual Network (ResNet), enhances feature reuse and information flow (Huang et al., 2017). Attention modules like Convolution Block Attention Module (CBAM) further improve performance by combining spatial and channel attention (Woo et al., 2018). Optimizers such as Adaptive Moment Estimation with Weight Decay (AdamW) also contribute to faster and more stable convergence (Loshchilov & Hutter, 2017). Hybrid models, including correlation-based Convolutional Long Short-Term Memory (ConvLSTM) and three-dimensional residual convolutional networks (ResConv3D), enable efficient spatio temporal learning (Ma et al., 2023; Wang & Wong, 2022). Attention-based Multi-Scale Spatial-Temporal Cross Network (Att-MCSTCNet) leverages ConvLSTM and Convolutional Gated Recurrent Unit (ConvGRU) for enhanced temporal embedding and cross-domain fusion (Zeng et al., 2021). Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) based frameworks, integrating Temporal Convolutional Network (TCN), GRU, and attention, effectively capture both short and long-term patterns (Wang, Bao & Wang, 2023). Transformer-based networks, like Spatial-Temporal Decomposable Network (STD-Net), model complex spatial-temporal interactions in mobile traffic (Hu et al., 2022). 
Federated learning strategies have also emerged, such as the multi-objective federated traffic prediction model for vehicular networks, enabling collaborative, privacy-preserving, and context-aware traffic forecasting in intelligent transportation systems (Aalavanthar et al., 2025). Finally, the Diffusion Convolutional GRU-Multi head Attention Mechanism (DCG-MAM) integrates a Diffusion Convolutional GRU with Multi-Head Attention, enabling localized spatial learning and key temporal feature extraction. The Spatiotemporal Transformer Framework (STMP) model (Gong et al., 2025) integrates temporal cross-attention and hierarchical spatial encoding to jointly predict traffic and user density, effectively capturing semantic relationships and inter-station interactions. The TFT (Kougioumtzidis et al., 2025) enhances multivariate prediction through interpretable attention-based temporal fusion, showing adaptability to dynamic scenarios like next-generation cellular networks. These approaches highlight the value of hybrid attention, spatial encoding, and temporal fusion in improving prediction accuracy. While deep learning models outperform traditional methods, they often incur higher computational costs.

Most existing traffic prediction models have certain limitations. Traditional models like ARIMA and LSTM mainly focus on time-based changes while ignoring spatial relationships. On the other hand, models like GCN or CNN-LSTM include spatial information but often use fixed structures and require heavy computation, making them slow for real-time use. These approaches fail to fully capture the dynamic relationship between spatial and temporal changes. To overcome these issues, the proposed model combines Lightweight Convolution (LC), EHA, and STAM in a unified attention framework. The LC part captures important local spatial patterns with fewer parameters, reducing computational cost. The EHA part helps the model focus on the most relevant spatial and temporal information, avoiding unnecessary processing. Finally, STAM integrates both spatial and temporal learning in a balanced way, allowing it to adapt to changing traffic conditions more effectively. The design therefore targets the gaps left by earlier methods: speed, accuracy, and suitability for large-scale, real-time cellular traffic prediction.

Dataset description and preprocessing

In this study, we use the Telecom Italia (Telecom Italia, 2015) dataset, which contains detailed data on mobile network activity in Milan. The city is divided into a $100 \times 100$ grid, with each cell representing a specific geographic area. For each time interval and cell, the dataset reports the total number of incoming and outgoing SMS, calls, and internet usage. It also includes timestamps and a unique CellID to identify each location.

The dataset is represented as $Z_{f, m}$ , where each instance is a four-dimensional tensor with shape $[f, m, T, D]$ . Here, $f$ denotes the type of traffic data, which includes SMS, call, and internet usage. The variable $m$ indicates the temporal index, where $m \in \{0, 1, 2, \dots, H\}$ , and $H$ represents the final time step. The geographical area is partitioned into a grid of $T \times D$ cells, with $T$ and $D$ indicating the number of rows and columns, respectively. Each element in $Z_{f, m}$ is a real-valued entry, i.e., $Z_{f, m} \in R^{f \times m \times T \times D}$ , as shown in Eq. (1).

(1) $Z_{f, m} = \left[\begin{matrix} z_{f, m}^{(1, 1)} & \dots & z_{f, m}^{(1, d)} & \dots & z_{f, m}^{(1, D)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ z_{f, m}^{(t, 1)} & \dots & z_{f, m}^{(t, d)} & \dots & z_{f, m}^{(t, D)} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ z_{f, m}^{(T, 1)} & \dots & z_{f, m}^{(T, d)} & \dots & z_{f, m}^{(T, D)} \end{matrix}\right] .$

Let $z_{f, m}^{(t, d)}$ denote the network traffic value at time interval $m$ for the spatial location identified by the cell at row $t$ and column $d$ .

Cellular network traffic is highly dynamic and nonlinear, making feature extraction and accurate prediction challenging. To address this, both spatial and temporal characteristics must be thoroughly analyzed. Using the Telecom Italia dataset, which records SMS, call, and internet usage at 10-min intervals, we examined traffic patterns in a selected area.

Figures 1, 2, and 3 illustrate temporal trends, showing periodic user behavior over time. Identifying these patterns is crucial for building reliable predictive models, while accounting for sudden traffic spikes further improves accuracy. Spatial analysis in Figs. 4, 5, and 6 reveals higher activity in central urban regions compared to peripheral areas. The 3D plots show clear peaks in mobile usage for specific cells, highlighting the need to incorporate spatial features into predictive models for better performance.

Figure 1: Time domain distribution of call data.

DOI: 10.7717/peerj-cs.3571/fig-1

Figure 2: Time domain distribution of SMS data.

DOI: 10.7717/peerj-cs.3571/fig-2

Figure 3: Time domain distribution of Internet data.

DOI: 10.7717/peerj-cs.3571/fig-3

Figure 4: Spatial domain distribution of Call data.

DOI: 10.7717/peerj-cs.3571/fig-4

Figure 5: Spatial domain distribution of SMS data.

DOI: 10.7717/peerj-cs.3571/fig-5

Figure 6: Spatial domain distribution of Internet data.

DOI: 10.7717/peerj-cs.3571/fig-6

Data preprocessing

Preprocessing the data is an essential step to ensure that input features are reliable and consistent for model training. To preserve the temporal structure of the dataset, the original time intervals were retained. Any missing entries were handled using forward and backward fill techniques, maintaining continuity within the time series. Core traffic indicators such as SMS, call, and internet usage were normalized using Min-Max scaling, bringing all values into the [0, 1] range to support effective model learning. The dataset is then divided chronologically into training, validation, and testing subsets in a 70%:15%:15% ratio, enabling a robust and temporally consistent evaluation of model performance. This research focuses on short-term forecasting, predicting the traffic load for the next 10-min interval using the previous 60 min of observations. Each model was trained on aggregated features including smsin, smsout, callin, callout, and internet usage to capture temporal dependencies in multi-service network behavior. This prediction window was chosen based on operational relevance in real-time network traffic optimization, enabling timely decisions for congestion control and resource allocation. The notations used throughout this article are summarized in Table 2.
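The preprocessing steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' pipeline: the function names, the synthetic input array, and the choice to scale over the full series before splitting are assumptions made for the example. With 10-min sampling, a 60-min history corresponds to a lookback of six steps.

```python
import numpy as np

def fill_missing(series):
    """Forward-fill, then backward-fill NaNs along time (axis 0) of a (time, features) array."""
    filled = series.copy()
    for t in range(1, len(filled)):                      # forward fill
        filled[t] = np.where(np.isnan(filled[t]), filled[t - 1], filled[t])
    for t in range(len(filled) - 2, -1, -1):             # backward fill for leading gaps
        filled[t] = np.where(np.isnan(filled[t]), filled[t + 1], filled[t])
    return filled

def min_max_scale(x):
    """Scale each feature column into [0, 1]; constant columns map to 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

def chronological_split(x, train_pct=70, val_pct=15):
    """Split along time without shuffling (70%:15%:15% by default)."""
    n = len(x)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return x[:i], x[i:j], x[j:]

def make_windows(x, lookback=6):
    """Pair each run of `lookback` steps (6 x 10 min = 60 min) with the next step."""
    inputs = np.stack([x[k:k + lookback] for k in range(len(x) - lookback)])
    targets = x[lookback:]
    return inputs, targets

# Hypothetical data: 200 time steps, 5 features (smsin, smsout, callin, callout, internet).
rng = np.random.default_rng(0)
raw = rng.random((200, 5))
raw[3, 1] = np.nan                                       # simulate a missing entry
scaled = min_max_scale(fill_missing(raw))
train, val, test = chronological_split(scaled)
X_train, y_train = make_windows(train)
print(X_train.shape, y_train.shape)
```

Windows are built inside each subset so that no training window overlaps the validation or test period.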

Table 2:

Notations used throughout this article.

Symbol	Description
$R$	Set of real-valued continuous numbers
$Z_{f, m}$	2D traffic grid for feature $f$ at time $m$ with size $T \times D$
$z_{f, m}^{(t, d)}$	Traffic volume at cell $(t, d)$ for feature $f$ at time $m$
$f$	Feature index $\in \{\text{SMS}, \text{Call}, \text{Internet}\}$
$m, H$	Time index; $H$ is the final time step
T, D	Number of rows and columns in spatial grid
$(t, d)$	Spatial location with row $t$ and column $d$
$A_{g}$	Number of input features (channels)
$T_{g}, D_{g}$	Height and width of spatial grid input
$Q_{g}$	Total number of time steps
E	Convolution kernel size ( $E \times E$ )
$A_{o}$	Number of output channels after convolution
$T_{o}, D_{o}$	Spatial dimensions of convolution output
$b, e$	Mini-batch size and total training epochs
Z	A mini-batch sample from training data
$R_{a}$	Parameters in standard convolution layer
$A_{f}$	Computation cost of standard convolution
$R_{p}$	Parameters in depthwise convolution
$A_{p}$	Cost of depthwise convolution
$R_{i}$	Parameters in pointwise ( $1 \times 1$ ) convolution
$A_{i}$	Cost of pointwise convolution
$R_{s}$	Total parameters in depthwise separable convolution
$A_{v}$	Total computation in separable convolution
ReLU $(x)$	Activation function: $\max\{0, x\}$
$X_{g}^{(c, t)}$	Input to attention block at channel $c$ , time $t$
$X_{i}^{(c, t)}$	Batch-normalized and activated input
$D_{f}^{(c, t)}$	Channel attention weights (ECA)
$X_{f}^{(c, t)}$	Output after applying channel attention
$D_{l}^{(c, t)}$	Spatial attention weights via pooling + convolution
$X_{s}^{(c, t)}$	Refined spatial feature after attention
$X_{b}^{(c, t)}$	Intermediate map before SE block
$X_{v}^{(c, t)}$	Activated feature after SE normalization
$D_{o}^{(c, t)}$	SE block output channel-wise weights
$Q_{s}^{(c, t)}$	Final attention-refined output
B	STAM input tensor: $R^{Q_{g} \times T_{g} \times D_{g} \times A_{g}}$
$B^{'}$	Reshaped tensor: $R^{Q_{g} \times (T_{g} \cdot D_{g}) \times A_{g}}$
$A_{att}$	Learnable temporal attention projection matrix
$α$	Temporal attention scores (softmax-normalized)
$γ$	Weighted temporal output via attention
$\bar{R}$	Concatenated vector of raw and attention features
G	Weight matrix for final prediction layer
$\bar{P}$	Final prediction output after transformation

DOI: 10.7717/peerj-cs.3571/table-2

Proposed methodology

In this section, we present the proposed traffic prediction model based on STAM. The model begins with a Lightweight Convolution (LC) module that uses $1 \times 1$ convolution and DSC for efficient feature extraction. An EHA module then combines spatial and channel attention using Efficient Channel Attention (ECA) and a Squeeze and Excitation Block (SE Block) to enhance feature quality. A fusion strategy incorporates temporal attention to capture time-dependent patterns. Finally, STAM is used to capture spatial and temporal dependencies for accurate traffic volume prediction. The overall architecture is presented in Fig. 7, and the detailed procedure of the proposed traffic prediction framework is outlined in Algorithm 1. The model is evaluated on the Telecom Italia (Telecom Italia, 2015) dataset. The components of the proposed model are presented in the following subsections.

Figure 7: Spatio Temporal Attention Mechanism based Model.

DOI: 10.7717/peerj-cs.3571/fig-7

Algorithm 1:

Traffic prediction using spatio-temporal attention mechanism.

Input: Preprocessed spatio-temporal traffic data $Z_{A_{g}, t} \in R^{A_{g} \times t \times T_{g} \times D_{g}}$ ; batch size $b$ ; number of epochs $e$

Output: Predicted traffic map $\bar{P}$ at time $t$

1 Initialize model weights for LC, EHA, SE, and STAM blocks;
2 Initialize optimizer and define loss functions (RMSE and MAE);
3 for epoch $= 1$ to $e$ do
4     Shuffle and split dataset into mini-batches of size $b$ ;
5     for each mini-batch $Z \in Z_{A_{g}, t}$ do
          // Step 1: Lightweight Convolution (LC) Block
6         Apply Batch Normalization and ReLU activation to $Z$ ;
7         Apply $1 \times 1$ convolution to reduce channel depth;
8         Apply depthwise separable convolution to extract spatial features;
9         Obtain spatial feature map $A_{v} \in R^{A_{g} \times T_{g} \times D_{g}}$ ;
          // Step 2: Efficient Hybrid Attention (EHA) Block
10        Apply global average pooling on $D_{f}$ to obtain channel context;
11        Use 1D convolution and sigmoid activation to compute channel attention $D_{f}^{(c, t)}$ ;
12        Refine channels: $X_{f}^{(c, t)} = D_{f}^{(c, t)} \cdot X_{i}^{(c, t)}$ ;
13        Apply average and max pooling across channels, followed by $7 \times 7$ convolution;
14        Compute spatial attention map $D_{l}^{(c, t)}$ and apply: $X_{s}^{(c, t)} = D_{l}^{(c, t)} \cdot X_{f}^{(c, t)}$ ;
          // Step 3: Squeeze-and-Excitation (SE) Block
15        Perform global average pooling on spatial features;
16        Pass through two dense layers with ReLU and sigmoid to generate weights $D_{o}^{(c, t)}$ ;
17        Enhance spatial-temporal features: $Q_{s}^{(c, t)} = D_{o}^{(c, t)} \cdot X_{v}^{(c, t)}$ ;
          // Step 4: Spatio Temporal Attention Mechanism (STAM)
18        For each time step $t$ , compute attention score $α_{t}$ from $Q_{s}$ ;
19        Normalize scores using softmax: $α = \mathrm{softmax}({(B^{'} \cdot A_{att})}^{Q_{g}})$ ;
20        Compute weighted temporal context: $\bar{R} = B^{'} \oplus γ$ ;
          // Step 5: Output and Learning
21        Concatenate spatial and temporal features if needed;
22        Pass $\bar{R}$ to the prediction layer to obtain $\bar{P}$ ;
23        Backpropagate and update model parameters;
24 return Trained traffic prediction model;

DOI: 10.7717/peerj-cs.3571/table-8

Lightweight convolution module

In the proposed traffic prediction model, the LC module plays a crucial role in enhancing computational efficiency while maintaining high-quality spatial feature extraction. The input data is represented in a structured spatio-temporal format with dimensions $T_{g} \times D_{g} \times A_{g}$ , where $T_{g}$ and $D_{g}$ correspond to the spatial height and width of the grid (e.g., $100 \times 100$ ), and $A_{g}$ denotes the number of input channels, typically normalized traffic features such as SMS, call volume, and internet usage.

To extract spatial features, a convolutional kernel of size $E \times E$ is applied, resulting in an output feature map of shape $T_{o} \times D_{o} \times A_{o}$ . This operation is performed with stride 1, no padding, and zero bias, ensuring that spatial locality is preserved without increasing the input dimensions. To reduce computational overhead, we employ DSC instead of standard convolution. The total number of trainable parameters $R_{a}$ in a conventional convolution is given by Eq. (2).

(2) $R_{a} = E \times E \times A_{g} \times A_{o}$

where $E \times E$ is the kernel size and $A_{o}$ is the number of output channels. The corresponding computational cost $A_{f}$ is expressed by Eq. (3).

(3) $A_{f} = E \times E \times A_{g} \times D_{o} \times T_{o} \times A_{o} .$

This can be computationally expensive for large spatial grids and deep networks. To address this, we use DSC, which breaks the process into two lightweight stages. First, a depthwise convolution applies one $E \times E$ filter per input channel without combining them, resulting in the parameter count $R_{p}$ given by Eq. (4).

(4) $R_{p} = E \times E \times A_{g} .$

The corresponding computational cost $A_{p}$ is given by Eq. (5).

(5) $A_{p} = E \times E \times A_{g} \times D_{o} \times T_{o} .$

Subsequently, a pointwise convolution using $1 \times 1$ kernels combines channel-wise information. This operation adjusts the channel dimension and has a parameter count $R_{i}$ given by Eq. (6).

(6) $R_{i} = 1 \times 1 \times A_{g} \times A_{o}$

with its computational cost $A_{i}$ defined in Eq. (7).

(7) $A_{i} = 1 \times 1 \times D_{o} \times T_{o} \times A_{g} \times A_{o} .$

Thus, the total number of parameters $R_{s}$ and overall computation $A_{v}$ for the DSC are then significantly reduced, as defined by Eqs. (8) and (9).

(8) $R_{s} = E \times E \times A_{g} + 1 \times 1 \times A_{g} \times A_{o}$

(9) $A_{v} = E \times E \times A_{g} \times D_{o} \times T_{o} + 1 \times 1 \times D_{o} \times T_{o} \times A_{g} \times A_{o} .$
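As a quick check on the size of the saving, dividing Eq. (8) by Eq. (2) gives the parameter ratio. The same algebra applied to Eqs. (9) and (3) gives an identical ratio for the computational cost, because the common factor $D_{o} \times T_{o}$ cancels. This short derivation is added for illustration and uses only the symbols defined above.

$\frac{R_{s}}{R_{a}} = \frac{E^{2} A_{g} + A_{g} A_{o}}{E^{2} A_{g} A_{o}} = \frac{1}{A_{o}} + \frac{1}{E^{2}} , \qquad \frac{A_{v}}{A_{f}} = \frac{1}{A_{o}} + \frac{1}{E^{2}} .$

For a $3 \times 3$ kernel, the ratio therefore stays above $1/9$ and approaches it as $A_{o}$ grows.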

For example, with a kernel size $E = 3$ , number of input channels $A_{g} = 3$ , and output channels $A_{o} = 64$ , the standard convolution requires

$3 \times 3 \times 3 \times 64 = 1{,}728 \text{ parameters}.$

In contrast, the DSC only needs

$3 \times 3 \times 3 + 1 \times 1 \times 3 \times 64 = 27 + 192 = 219 \text{ parameters}.$

This represents a significant reduction in parameter count and computational complexity.
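The counts in Eqs. (2), (3), (8), and (9) are easy to evaluate programmatically. The sketch below reproduces the worked example and also evaluates the multiply-accumulate cost for a $100 \times 100$ grid. The function names are illustrative, and the output size $98 \times 98$ is an assumption that follows from stride 1 and no padding with $E = 3$.

```python
def standard_conv(E, A_g, A_o, T_o, D_o):
    """Parameter count (Eq. 2) and computational cost (Eq. 3) of a standard convolution."""
    params = E * E * A_g * A_o
    cost = E * E * A_g * D_o * T_o * A_o
    return params, cost

def separable_conv(E, A_g, A_o, T_o, D_o):
    """Parameter count (Eq. 8) and computational cost (Eq. 9) of a depthwise separable convolution."""
    params = E * E * A_g + 1 * 1 * A_g * A_o
    cost = E * E * A_g * D_o * T_o + 1 * 1 * D_o * T_o * A_g * A_o
    return params, cost

E, A_g, A_o = 3, 3, 64
T_o = D_o = 100 - E + 1          # stride 1, no padding: 100 -> 98 (assumed grid size)

std_params, std_cost = standard_conv(E, A_g, A_o, T_o, D_o)
sep_params, sep_cost = separable_conv(E, A_g, A_o, T_o, D_o)

print("standard :", std_params, "parameters,", std_cost, "multiply-accumulates")
print("separable:", sep_params, "parameters,", sep_cost, "multiply-accumulates")
print("parameter ratio:", round(sep_params / std_params, 4))
```

Because the spatial factor $D_{o} \times T_{o}$ appears in every cost term, the cost ratio equals the parameter ratio regardless of grid size.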

Before feature extraction, the input is normalized using a Batch Normalization (BN) layer, and a Rectified Linear Unit (ReLU) activation function is applied to introduce non-linearity, as defined in Eq. (10).

(10) $\mathrm{ReLU}(x) = \max\{0, x\} .$

Additionally, a preliminary $1 \times 1$ convolution is used to reduce the number of channels before the depthwise convolution, further minimizing parameter count. Together, these steps enable the LC module to efficiently extract spatial features from cellular traffic data while keeping the model lightweight and suitable for large-scale real-time applications.

Efficient hybrid attention module

To further enhance spatial and temporal feature representation while maintaining computational efficiency, the proposed model incorporates an EHA mechanism. This module focuses the model’s attention on the most informative features across both Channel Attention (CA) and Spatial Attention (SA), ensuring that important traffic patterns are effectively captured without introducing significant computational overhead. The process begins by taking the output $X_{g}^{(c, t)}$ from the LC layer, where $c$ and $t$ refer to the channel and time step, respectively. First, the input undergoes a normalization operation, followed by the application of the ReLU activation function to produce the feature map $X_{i}^{(c, t)}$ . This process can be mathematically expressed as shown in Eq. (11).

(11) $X_{i}^{(c, t)} = \mathrm{ReLU}(\mathrm{BN}(X_{g}^{(c, t)})) .$

Next, the feature map $X_{i}^{(c, t)}$ is refined using the ECA mechanism. Global average pooling is first applied, followed by a 1D convolution and sigmoid activation to generate channel attention weights $D_{f}^{(c, t)}$ . These weights are then multiplied element-wise with the input to obtain the output $X_{f}^{(c, t)}$ , as shown in Eq. (12). This lightweight approach highlights important feature channels while keeping parameter count low.

(12) $X_{f}^{(c, t)} = D_{f}^{(c, t)} \cdot X_{i}^{(c, t)} .$
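The channel-attention step of Eq. (12) can be illustrated with a small NumPy sketch. It is a simplified, assumption-laden version: the 1D kernel length (3), the random kernel weights, zero padding at the channel boundaries, and the `(C, H, W)` layout are all chosen for the example and are not specified in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca(x, kernel):
    """Efficient Channel Attention on a feature map x of shape (C, H, W).

    kernel: 1D weights of odd length k, shared across all channels (hypothetical values).
    Returns the re-weighted map X_f and the channel weights D_f.
    """
    context = x.mean(axis=(1, 2))                    # global average pooling -> (C,)
    k = len(kernel)
    padded = np.pad(context, k // 2)                 # zero padding keeps the output length at C
    # 1D convolution (cross-correlation) across neighbouring channels
    mixed = np.array([padded[c:c + k] @ kernel for c in range(len(context))])
    weights = sigmoid(mixed)                         # D_f, each entry in (0, 1)
    return weights[:, None, None] * x, weights       # Eq. (12): X_f = D_f * X_i

rng = np.random.default_rng(0)
x_i = rng.random((8, 10, 10))                        # 8 channels on a 10x10 grid (toy size)
x_f, d_f = eca(x_i, kernel=rng.normal(size=3))
print(x_f.shape, d_f.shape)
```

The only learnable quantity here is the `k`-element kernel, which is why this form of channel attention adds very few parameters.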

Following channel attention, an SA mechanism is applied to focus on the most informative regions in the spatial domain. Max pooling and average pooling operations are performed along the channel axis, and their results are concatenated and passed through a convolutional layer with a $7 \times 7$ kernel. A sigmoid function is then used to generate the spatial attention weights $D_{l}^{(c, t)}$ , which are multiplied element-wise with the input to produce the output feature map $X_{s}^{(c, t)}$ , as defined in Eq. (13).

(13) $X_{s}^{(c, t)} = D_{l}^{(c, t)} \cdot X_{f}^{(c, t)} .$
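A compact sketch of the spatial-attention step in Eq. (13) follows. It assumes a `(C, H, W)` layout, zero ("same") padding so the attention map matches the input grid, and randomly initialized $7 \times 7$ weights; these choices are illustrative, and the text does not state the padding scheme.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def spatial_attention(x, kernel):
    """Spatial attention on a feature map x of shape (C, H, W).

    kernel: weights of shape (2, 7, 7) applied to the stacked [average, max] maps
    (hypothetical values). Returns the re-weighted map X_s and the weights D_l.
    """
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])           # pool along channels -> (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))    # zero padding keeps H x W
    windows = sliding_window_view(padded, kernel.shape[1:], axis=(1, 2))  # (2, H, W, 7, 7)
    logits = np.einsum("chwij,cij->hw", windows, kernel)         # 7x7 convolution, one output map
    weights = 1.0 / (1.0 + np.exp(-logits))                      # sigmoid -> D_l
    return weights[None] * x, weights                            # Eq. (13): X_s = D_l * X_f

rng = np.random.default_rng(0)
x_f = rng.random((8, 10, 10))
x_s, d_l = spatial_attention(x_f, kernel=rng.normal(scale=0.1, size=(2, 7, 7)))
print(x_s.shape, d_l.shape)
```

A single attention map is shared by all channels, so the cost of this step does not grow with the channel count beyond the two pooling operations.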

To further refine the learned representation, the feature map $X_{b}^{(c, t)}$ is processed with another batch normalization and ReLU activation, yielding $X_{v}^{(c, t)}$ as shown in Eq. (14).

(14) $X_{v}^{(c, t)} = \mathrm{ReLU}(\mathrm{BN}(X_{b}^{(c, t)})) .$

Finally, an SE Block refines channel importance. Average pooling is applied, followed by two linear layers with ReLU and sigmoid activations to generate attention weights $D_{o}^{(c, t)}$ . These are multiplied with the input $X_{v}^{(c, t)}$ to produce the final output $Q_{s}^{(c, t)}$ , as defined in Eq. (15).

(15) $Q_{s}^{(c, t)} = D_{o}^{(c, t)} \cdot X_{v}^{(c, t)} .$
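The squeeze-and-excitation step of Eq. (15) can be sketched in the same style. The weight matrices `w1` and `w2`, the omission of biases, and the pooling over the time axis are assumptions made for illustration; the source does not give layer sizes.

```python
import math

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a feature map x[channel][time].

    w1 has shape (reduced, channels) and w2 has shape (channels, reduced).
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: average pooling over time gives one value per channel.
    squeezed = [sum(row) / len(row) for row in x]

    # First linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]

    # Second linear layer followed by sigmoid gives the weights D_o.
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]

    # Eq. (15): rescale each channel by its attention weight.
    out = [[weights[c] * v for v in x[c]] for c in range(len(x))]
    return weights, out

# Four channels, two time steps, reduction from 4 to 2 (made-up numbers).
x_v = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
w1 = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]
d_o, q_s = se_block(x_v, w1, w2)
```

Compared with the 1D convolution used for ECA, the two linear layers let every channel influence every other channel, at the cost of more parameters.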

In summary, the EHA module combines lightweight channel and spatial attention mechanisms to selectively emphasize critical features in both domains. This results in improved model focus, higher prediction accuracy, and reduced computational burden, making it particularly well suited for large-scale and real-time cellular traffic forecasting.

Spatio temporal model with attention mechanism

To enhance the precision of cellular traffic forecasting, the final stage of the proposed architecture introduces a spatio-temporal model with attention mechanism. This module is designed to capture complex interactions between spatial locations and their evolving temporal patterns. While the preceding LC and EHA layers focus on extracting spatial features and emphasizing salient channels or regions, the STAM module complements this by enabling the model to attend to significant time intervals and spatial regions dynamically. Let the input to the STAM be a feature map denoted as $B \in R^{Q_{g} \times T_{g} \times D_{g} \times A_{g}}$ , where $Q_{g}$ is the number of time intervals (e.g., 144 for a full day), $T_{g}$ and $D_{g}$ represent the height and width of the spatial grid, and $A_{g}$ denotes the number of feature channels (e.g., SMS, calls, internet).

To model temporal dependencies for each spatial location, we reshape the input tensor into $B^{'} \in R^{Q_{g} \times (T_{g} \cdot D_{g}) \times A_{g}}$ , treating each spatial location independently across time. A learnable attention matrix $A_{a t t} \in R^{A_{g} \times A}$ is used to project the features to a lower-dimensional space. The attention weights $α$ are then computed using Eq. (16).

(16) $α = s o f t m a x ({(B^{'} \cdot A_{a t t})}^{Q_{g}}) .$

These attention scores determine the importance of each time interval at every spatial location. The refined temporal representation $γ$ is calculated using Eq. (17).

(17) $γ = α \cdot B^{'} .$

For feature fusion and output prediction, we retain both the original and the temporally refined representations by concatenating them along the feature axis, as in Eq. (18).

(18) $\bar{R} = B^{'} \oplus γ .$

The concatenated features $\bar{R}$ are then passed through a fully connected layer with a weight matrix G, followed by a ReLU activation to produce the final prediction $\bar{P}$ by Eq. (19).

(19) $\bar{P} = R e L U (\bar{R} \cdot G) .$
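Equations (16) to (19) can be traced for a single spatial location with the sketch below. It is one possible reading rather than the authors' code: the attention projection is simplified to a vector `a_att` that yields one score per time step, the softmax is taken over time, and all numeric values and the function name `temporal_attention_fusion` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_fusion(b, a_att, g):
    """b: [time][feature] for one location, a_att: [feature],
    g: [2 * feature][outputs] (weights of the fully connected layer)."""
    # Eq. (16), simplified: one score per time step, normalized over time.
    scores = [sum(f * w for f, w in zip(step, a_att)) for step in b]
    alpha = softmax(scores)

    # Eq. (17): re-weight the features of each time step.
    gamma = [[alpha[t] * f for f in step] for t, step in enumerate(b)]

    # Eq. (18): concatenate original and refined features per time step.
    fused = [b[t] + gamma[t] for t in range(len(b))]

    # Eq. (19): fully connected layer followed by ReLU.
    n_out = len(g[0])
    pred = [[max(0.0, sum(row[i] * g[i][j] for i in range(len(row))))
             for j in range(n_out)] for row in fused]
    return alpha, pred

# Three time steps with (SMS, call, internet) features; made-up numbers.
b_prime = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.9], [0.2, 0.2, 0.2]]
a_att = [1.0, 0.5, 2.0]
g = [[0.5], [0.5], [0.5], [1.0], [1.0], [1.0]]
alpha, p_bar = temporal_attention_fusion(b_prime, a_att, g)
```

In this toy input the second time step carries the largest internet value and therefore receives the largest attention weight.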

In practical terms, the STAM functions like a dynamic reviewer that learns to re-focus on the most informative time intervals and spatial locations in the data. This mirrors human reasoning, where recent spikes in traffic or historically congested zones are prioritized for decision-making. By attending to these patterns adaptively, the model generates more accurate and context-aware traffic predictions.

The novelty of STAM lies in its adaptive integration of LC and EHA for joint spatial-temporal learning. Unlike CNN-LSTM or ConvLSTM models that rely on deep, sequential layers, STAM captures essential spatial and temporal features simultaneously with minimal parameters. This design achieves superior efficiency, faster inference, and better interpretability, making it a practical and scalable solution for real-time cellular traffic prediction.

Results and analysis

Environment setup

The experiments were conducted on a workstation equipped with an Intel Core i9-12900K CPU (3.2 GHz, 16 cores), 64 GB DDR5 RAM, and an NVIDIA RTX 3090 GPU (24 GB VRAM), running Ubuntu 22.04 LTS. The models were implemented using Python 3.10 with the PyTorch 2.0.1, PyTorch Lightning 1.7.7, and PyTorch Forecasting 0.10.3 frameworks. Supporting libraries include NumPy 1.24.2, Pandas 1.5.3, Scikit-learn 1.2.2, and Matplotlib 3.7.1. To maintain experimental integrity, each experiment was repeated three times, and the mean values of RMSE, MAE, and SMAPE are reported for performance evaluation.

Evaluation indicators

The proposed model is evaluated using the RMSE, MAE, and SMAPE metrics. RMSE is a commonly used statistic for evaluating regression models. It measures the difference between the observed values and the predicted values in a continuous numerical prediction task; its expression is given in Eq. (20).

(20) $R M S E = \sqrt{\frac{1}{X \times Y} \sum_{x = 1}^{X} \sum_{y = 1}^{Y} {({\hat{P}}^{(x, y)} - P^{(x, y)})}^{2}} .$

MAE is another commonly used metric for assessing regression models. Unlike RMSE, MAE involves no squaring; instead, it takes the mean of the absolute differences between the actual and predicted values. Because large errors are not squared, MAE is less sensitive to outliers than RMSE. The MAE formula is given in Eq. (21).

(21) $M A E = \frac{1}{X \times Y} \sum_{x = 1}^{X} \sum_{y = 1}^{Y} | {\hat{P}}^{(x, y)} - P^{(x, y)} |$ where ${\hat{P}}^{(x, y)}$ represents the model’s predicted value and $P^{(x, y)}$ the actual value.
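As a worked illustration of Eqs. (20) and (21), the following sketch evaluates both metrics on a small 2 x 2 grid in which a single cell is badly mispredicted. The grid values are made up for the example.

```python
import math

def rmse(pred, actual):
    """Eq. (20): root of the mean squared error over an X-by-Y grid."""
    errors = [p - a for p_row, a_row in zip(pred, actual)
              for p, a in zip(p_row, a_row)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(pred, actual):
    """Eq. (21): mean of the absolute errors over an X-by-Y grid."""
    errors = [p - a for p_row, a_row in zip(pred, actual)
              for p, a in zip(p_row, a_row)]
    return sum(abs(e) for e in errors) / len(errors)

actual = [[1.0, 2.0], [3.0, 4.0]]
pred = [[1.0, 2.0], [3.0, 8.0]]   # one cell is off by 4, the rest are exact

rmse_value = rmse(pred, actual)   # sqrt(16 / 4) = 2.0
mae_value = mae(pred, actual)     # 4 / 4 = 1.0
```

The single large error doubles RMSE relative to MAE, which is the sensitivity to outliers discussed above.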

SMAPE is used to evaluate how close the predicted values are to the actual values. It calculates the error as a percentage, giving a fair measure of both overestimation and underestimation. Unlike traditional error metrics that depend on data scale, SMAPE adjusts the error by considering both the predicted and actual values, making it effective for comparing model accuracy across different data ranges. Its mathematical expression is presented in Eq. (22).

(22) $S M A P E = \frac{1}{N} \sum_{i = 1}^{N} \frac{| {\hat{P}}_{i} - P_{i} |}{(| P_{i} | + | {\hat{P}}_{i} |) / 2} .$
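A short sketch of Eq. (22) follows. As written, the expression yields a fraction between 0 and 2; reporting it on a 0 to 200 scale requires multiplying by 100, which is assumed here to be the convention behind percentage-style values. The treatment of a zero denominator is also an assumption, since the equation leaves that case undefined.

```python
def smape(pred, actual):
    """Eq. (22): symmetric mean absolute percentage error as a fraction."""
    terms = []
    for p, a in zip(pred, actual):
        denom = (abs(a) + abs(p)) / 2
        # Assumed convention: a term with zero denominator (both values
        # zero) contributes no error.
        terms.append(0.0 if denom == 0 else abs(p - a) / denom)
    return sum(terms) / len(terms)

actual = [100.0, 200.0]
pred = [110.0, 180.0]

smape_fraction = smape(pred, actual)
smape_percent = 100.0 * smape_fraction

# Upper bound: predicting zero for a non-zero target gives the maximum of 2.
worst_case = smape([0.0], [5.0])
```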

A lower SMAPE value indicates a higher level of forecasting precision and model robustness. To assess the model’s performance, we analyze the error loss between predicted and actual values across multiple iterations to confirm convergence. The model is trained and tested on three traffic types: SMS, calls, and internet usage.

The prediction ability of the models used in this research was also assessed in terms of the coefficient of determination, $R^{2}$ , a popular regression measure that estimates how closely the model’s predictions match the observed values. The $R^{2}$ score shows how much of the variance of the target variable a model captures. Its mathematical expression is presented in Eq. (23).

(23) $R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}$ where $y_{i}$ is the actual value, ${\hat{y}}_{i}$ is the predicted value, and $\bar{y}$ is the mean of the actual values.
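The behaviour of Eq. (23) is easiest to see on a small example. The sketch below uses made-up values and shows that $R^{2}$ equals zero for a model that always predicts the mean and becomes negative for a model that does worse than that.

```python
def r2_score(actual, pred):
    """Eq. (23): coefficient of determination."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, pred))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

r2_good = r2_score(actual, [1.1, 1.9, 3.2, 3.8])   # close to 1
r2_mean = r2_score(actual, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
r2_bad = r2_score(actual, [4.0, 3.0, 2.0, 1.0])    # trend reversed
```

Here `r2_good` is 0.98, `r2_mean` is 0, and `r2_bad` is -3, so a negative score signals a model that explains less variance than a constant predictor.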

Accurate traffic prediction is critical for optimizing network efficiency in next-generation cellular systems. This study introduces a framework that integrates lightweight convolution, an efficient hybrid attention mechanism, and a temporal attention module. The proposed model is designed to enhance key performance indicators such as QoS, SE, and NU, with each metric mathematically defined to ensure robust performance in 5G environments.

Comparison with baseline models

In this section, we compare the proposed model with baseline models in terms of RMSE, MAE, and SMAPE for SMS, call, and internet traffic. To ensure a fair comparison, all baseline models were implemented and tuned under consistent experimental conditions using the same training, validation, and testing splits. The hyperparameter tuning details of the baseline models are presented in Table 3.

Table 3:

Hyperparameter tuning details of baseline and proposed model.

Model	Tuning method	Optimized hyperparameters	Evaluation metric
ARIMA	Grid search	$p = 2$ , $d = 1$ , $q = 2$	RMSE, MAE
HA	Manual	Window size = 12	RMSE, MAE, $R^{2}$
LSTM	Bayesian optimization	Learning rate = 0.001, Hidden units = 64, Dropout = 0.2	RMSE, MAE
RNN	Grid search	Learning rate = 0.001, Hidden units = 128, Dropout = 0.3	RMSE, MAE
ST-DenseNet	Bayesian optimization	Dense blocks = 3, Growth rate = 32, Dropout = 0.2	RMSE, MAE
CNN-LSTM	Bayesian optimization	Conv filters = 64, LSTM units = 128, Kernel size = 3, Dropout = 0.3	RMSE, MAE
ResLearn	Grid search	Residual blocks = 4, Learning rate = 0.0005, Batch size = 32	RMSE, MAE
STEHA	Bayesian optimization	Spatial filters = 32, Temporal heads = 4, Dropout = 0.2	RMSE, MAE
HSTNet	Bayesian optimization	Learning rate = 0.001, Hybrid blocks = 3, Attention heads = 4	RMSE, MAE
TFT	Bayesian optimization	Learning rate = 0.001, Hidden size = 64, Dropout = 0.1, Attention heads = 4	RMSE, MAE
STMP	Bayesian optimization	Graph layers = 2, Hidden units = 64, Dropout = 0.3	RMSE, MAE
DCG-MAM	Bayesian optimization	Learning rate = 0.001, Diffusion steps = 2, Attention heads = 4, GRU units = 64	RMSE, MAE
STAM (Proposed)	Bayesian optimization	Learning rate = 0.001, Conv filters = 32, Efficient Hybrid heads = 4, Temporal heads = 4, Dropout = 0.2	RMSE, MAE, SMAPE, $R^{2}$

DOI: 10.7717/peerj-cs.3571/table-3

The comparison of model performance is given in Table 4. Traditional models such as Historical Average (HA) and ARIMA were limited in capturing dynamic and spatial traffic patterns. Deep learning models such as RNN and LSTM improved temporal learning but lacked spatial awareness, reducing their effectiveness on regionally distributed data.

Table 4:

Comparison of model performance for Call, SMS, and Internet traffic.

Approach	Call				SMS				Internet
	RMSE	MAE	SMAPE	$R^{2}$	RMSE	MAE	SMAPE	$R^{2}$	RMSE	MAE	SMAPE	$R^{2}$
ARIMA (Moayedi & Masnadi-Shirazi, 2008)	15.406	4.461	120.72	−0.1196	17.935	7.456	126.46	−0.1692	338.93	154.13	166.470	−0.0017
HA (Qiu et al., 2018)	14.776	6.482	148.99	−4.5409	17.687	8.747	131.60	−2.0903	338.54	163.590	166.989	−2.2326
LSTM (Graves & Graves, 2012)	12.823	4.623	136.17	0.5879	15.975	8.278	127.49	0.2982	5.3801	4.3504	110.920	0.2932
RNN (Qiu et al., 2018)	12.803	4.653	174.14	0.1637	15.961	6.959	125.27	0.1140	3.038	2.391	129.306	0.3649
ST-DenseNet (Zhang et al., 2018)	12.798	4.870	160.94	0.1129	15.160	6.272	191.11	0.1122	2.947	2.256	168.212	0.3600
ResLearn (Manjunath et al., 2025)	13.208	5.953	139.46	0.1504	16.596	6.405	119.02	0.2008	265.74	119.05	94.7114	0.9956
CNN-LSTM (Wang et al., 2024)	13.187	6.315	142.32	0.1698	16.646	6.622	116.88	0.0598	267.81	115.99	153.200	0.3685
STEHA (Su et al., 2024)	12.838	5.537	127.35	0.1689	0.019	0.007	114.14	0.0861	3.149	2.440	161.20	0.3798
HSTNet (Zhang et al., 2020)	12.801	5.790	144.08	0.1665	0.011	0.081	116.61	0.1080	2.937	2.240	160.96	0.1683
TFT (Kougioumtzidis et al., 2025)	0.6840	0.527	150.19	0.1698	0.869	0.685	132.68	0.0612	0.735	0.439	162.52	0.3765
STMP (Gong et al., 2025)	0.275	0.230	151.66	0.1705	0.236	0.162	129.89	0.1007	0.297	0.239	173.03	0.3910
DCG-MAM (Xiao et al., 2025)	0.033	0.023	146.55	0.1739	0.034	0.024	113.76	0.1076	0.031	0.021	161.08	0.4040
STAM (proposed)	0.018	0.006	108.53	0.1751	0.010	0.006	113.53	0.3183	0.012	0.006	93.020	0.9999

DOI: 10.7717/peerj-cs.3571/table-4

The proposed STAM model shows strong accuracy while staying lightweight and efficient. Unlike models like DCG-MAM, STMP, and TFT, which need heavy computation or extra data, STAM uses simple yet powerful layers to detect local trends, connect past and future patterns, and understand how traffic changes across time and locations. While previous models such as STEHA performed well, they were often slow or required high-end devices. STAM balances performance and efficiency, making it a better fit for real-time use, especially in future 6G networks and systems with limited resources.

Runtime efficiency

The runtime efficiency of the proposed model is compared with that of the baseline models in Table 5. The experimental outcomes show that the proposed STAM attains the highest runtime efficiency among all compared models, with the fewest parameters and the shortest runtime. This highlights STAM’s ability to achieve accurate predictions with minimal computational resources and reduced parameter complexity. In contrast, architectures such as CNN-LSTM, STEHA, TFT, DCG-MAM, STMP, and ResLearn exhibit lower efficiency due to their larger model size and longer execution times.

Table 5:

Runtime efficiency comparison across baseline and proposed models.

Model	Params	RunTime (ms)	Efficiency
CNN-LSTM	25,130	3.257	0.00001222
ResLearn	37,241	3.889	0.0000069
STEHA	28,915	2.661	0.000013
TFT	33,857	1.624	0.00001819
STMP	40,112	2.018	0.00001235
DCG-MAM	5,381	0.9	0.00020649
STAM (Proposed)	4,780	0.751	0.00027857

DOI: 10.7717/peerj-cs.3571/table-5
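The source does not state how the Efficiency column of Table 5 is computed. The reported values appear consistent with the reciprocal of the product of parameter count and runtime; the sketch below checks that inferred definition against every row. The definition is therefore an assumption, and only the tabulated numbers come from the source.

```python
# (parameters, runtime in ms, efficiency as reported), copied from Table 5.
TABLE_5 = {
    "CNN-LSTM": (25130, 3.257, 0.00001222),
    "ResLearn": (37241, 3.889, 0.0000069),
    "STEHA": (28915, 2.661, 0.000013),
    "TFT": (33857, 1.624, 0.00001819),
    "STMP": (40112, 2.018, 0.00001235),
    "DCG-MAM": (5381, 0.9, 0.00020649),
    "STAM (Proposed)": (4780, 0.751, 0.00027857),
}

def efficiency(params, runtime_ms):
    """Assumed definition: 1 / (parameter count * runtime in ms)."""
    return 1.0 / (params * runtime_ms)

recomputed = {name: efficiency(p, rt) for name, (p, rt, _) in TABLE_5.items()}

# Largest gap between recomputed and reported values (rounding only).
max_gap = max(abs(recomputed[name] - TABLE_5[name][2]) for name in TABLE_5)

# Model with the highest efficiency under the assumed definition.
best = max(recomputed, key=recomputed.get)
```

Under this reading, a model scores higher when it is both smaller and faster, so the two factors are weighted equally.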

Ablation study

To understand the importance of each module in the proposed framework, an ablation study was conducted. The complete model integrates LC, EHA, and STAM. For fair comparison, each module was removed individually while keeping the others intact. This allows us to isolate the contribution of each component and observe its effect on overall performance. The results, presented in Table 6, show that each module contributes to the overall framework: the full model performs slightly better than the variants without LC, EHA, or STAM.

Table 6:

Ablation study results isolating the contribution of each module.

Model	RMSE	MAE
w/o LC	0.31238	0.212613
w/o EHA	0.31374	0.209736
w/o STAM	0.31359	0.212262
Full Model	0.31228	0.209692

DOI: 10.7717/peerj-cs.3571/table-6
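To make the size of each module's contribution in Table 6 explicit, the relative increase in error when a module is removed can be computed as follows. The helper name `relative_increase` is hypothetical; the numbers are taken directly from the table.

```python
# (RMSE, MAE) for each ablated variant, copied from Table 6.
ABLATION = {
    "w/o LC": (0.31238, 0.212613),
    "w/o EHA": (0.31374, 0.209736),
    "w/o STAM": (0.31359, 0.212262),
}
FULL_MODEL = (0.31228, 0.209692)

def relative_increase(ablated, full):
    """Percentage by which the error grows when a module is removed."""
    return (ablated - full) / full * 100.0

impact = {
    name: (relative_increase(rmse, FULL_MODEL[0]),
           relative_increase(mae, FULL_MODEL[1]))
    for name, (rmse, mae) in ABLATION.items()
}

for name, (rmse_pct, mae_pct) in impact.items():
    print(f"{name}: RMSE +{rmse_pct:.3f}%, MAE +{mae_pct:.3f}%")
```

Every variant has a higher error than the full model, with RMSE increases below 0.5% and MAE increases below 1.5%.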

Comparison with related work

From the results of our experiments, it is observed that traffic prediction is influenced by several components, namely LC, EHA, and STAM. Our proposed STAM model is compared with existing models such as ARIMA, LSTM, DCG-MAM, STMP, and TFT. While these models have their own advantages, they often miss complex time- and location-based traffic patterns. STAM handles these better, giving more accurate results with fewer errors across different types of network traffic. The following observations are drawn from this study.

The proposed STAM model, which integrates LC, EHA, and STAM, shows significant improvements in traffic prediction accuracy.
Compared to the classical ARIMA (Moayedi & Masnadi-Shirazi, 2008) model, STAM achieves nearly 99.8% reduction in error across all traffic types, proving that traditional linear models are less effective for complex network data.
When compared with the LSTM (Graves & Graves, 2012) model, which is commonly used for time series prediction, STAM reduces both average and peak errors by over 99%, demonstrating the benefits of attention-based architectures over recurrent ones.
Spatio Temporal Efficient Hybrid Attention (STEHA) (Su et al., 2024) models spatial and temporal features separately, while STAM outperforms it by up to 98% and unifies lightweight convolution with spatio temporal attention, enabling joint learning of local, temporal, and spatial patterns. This unified design yields lower prediction errors.
In comparison with the TFT (Kougioumtzidis et al., 2025), a widely used attention-driven forecasting model, STAM outperforms it by up to 97%, showing its ability to offer accurate, lightweight, and efficient predictions across various Telecom datasets.
Compared to STMP (Gong et al., 2025), STAM provides more than 93% improvement, indicating better handling of both spatial and temporal dependencies.
Against the hybrid DCG-MAM (Xiao et al., 2025) model, which uses graph-based learning and attention mechanisms, STAM still achieves up to 45% lower errors, highlighting its superior learning capacity and robustness.
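The percentages quoted in the observations above can be related to Table 4 with a short calculation. The sketch assumes the improvements are relative RMSE reductions on Call traffic, a reading that reproduces the TFT, STMP, and DCG-MAM figures; the source does not say which column was used.

```python
# Call RMSE values copied from Table 4.
CALL_RMSE = {
    "ARIMA": 15.406,
    "LSTM": 12.823,
    "STEHA": 12.838,
    "TFT": 0.6840,
    "STMP": 0.275,
    "DCG-MAM": 0.033,
}
STAM_CALL_RMSE = 0.018

def error_reduction(baseline, proposed):
    """Relative reduction in error, in percent."""
    return (baseline - proposed) / baseline * 100.0

reductions = {name: error_reduction(value, STAM_CALL_RMSE)
              for name, value in CALL_RMSE.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower Call RMSE with STAM")
```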

Additionally, we conducted a comparative analysis to evaluate the effectiveness of our proposed model against baseline models such as STEHA, TFT, STMP, and DCG-MAM. Figures 8, 9, and 10 illustrate the prediction results for the three services (call, SMS, and internet), comparing our proposed model with the baselines. The $x$ axis represents time, while the $y$ axis shows the normalized traffic volume for each service. The results show that the predictions generated by our model align more closely with the actual values than those of the baseline models. This demonstrates the proposed model’s capability to capture essential patterns in cellular networks.

Figure 8: Comparison of prediction results for Call service.

DOI: 10.7717/peerj-cs.3571/fig-8

Figure 9: Comparison of prediction results for SMS service.

DOI: 10.7717/peerj-cs.3571/fig-9

Figure 10: Comparison of prediction results for Internet service.

DOI: 10.7717/peerj-cs.3571/fig-10

Table 7 highlights the performance of the proposed STAM model alongside results reported on several datasets, including Telecom Italia, the Big Data Challenge, and cellular traffic data from the Milan area. Unlike previous methods that often showed inconsistent performance across different datasets and traffic types, STAM delivers consistently low error rates for Call, SMS, and Internet traffic. The results demonstrate the superior performance of the STAM model, which achieves RMSE values of 0.018 for call, 0.010 for SMS, and 0.012 for internet, substantially outperforming earlier approaches and highlighting the model’s effectiveness in capturing spatio temporal patterns in cellular traffic data. This consistent accuracy demonstrates the adaptability and generalization capability of the STAM model, making it highly effective for real-world cellular traffic prediction scenarios.

Table 7:

Model performance comparison while using different datasets.

Dataset	Approach	Call		SMS		Internet
		RMSE	MAE	RMSE	MAE	RMSE	MAE
Telecom Italia	2D-CNN Model (Gu et al., 2023)	20.67	14.36	25.98	18.47	–	–
Big Data Challenge	MLHN (Zeng et al., 2020)	–	–	–	–	127.65	47.58
Wireless cellular traffic data of Milan area	STC-N (Supriya & Chandrakala, 2022)	33.81	16.53	52.74	28.32	171.78	99.67
Telecom Cellular traffic dataset of Milan	GLSTTN (Zhan et al., 2021)	24.44	13.59	41.24	22.65	147.90	87.09
Telecom Italia	STAM (proposed)	0.018	0.006	0.010	0.006	0.012	0.006

DOI: 10.7717/peerj-cs.3571/table-7

STAM stands out by combining three streamlined yet powerful methods: lightweight convolutions for efficient local pattern detection, hybrid attention that intelligently maps past data to future predictions, and spatio temporal attention to model how usage shifts across both time and regions. Unlike earlier models, which either relied on computationally heavy architectures (like 2D-CNN or hybrid networks) or separated spatial and temporal processes, STAM handles everything in a unified and resource-efficient manner. This cohesive design enables STAM to reduce average and extreme forecasting errors, measured through RMSE and MAE, by over 99% across three different telecom datasets, significantly outperforming traditional solutions. Its success reflects recent findings in spatio temporal attention research that emphasize the benefits of integrating attention mechanisms with efficient convolutional structures.

Estimation of QoS, network utility, and spectrum efficiency

In addition to forecasting cellular traffic, the proposed lightweight deep learning model is extended to support the estimation of key performance indicators including QoS, NU, and SE. These indicators are vital for maintaining reliable, fair, and efficient network operations, especially in dense and dynamic urban cellular environments.

Estimation of quality of service

QoS evaluation has become a crucial field of research due to the growing need for dependable and reliable telecommunications services. QoS is a measure of a network’s performance as seen by its users, usually impacted by packet loss, latency, and throughput. A mathematical methodology for assessing QoS is presented in this research utilizing an actual dataset of telecommunications activities, such as voice calls, SMS, and internet usage. The suggested method computes an overall QoS score by normalizing these metrics, combining them into a single throughput measure, and correcting for the negative effects of packet loss and latency. The proposed traffic prediction model can improve QoS by supporting dependable and consistent network performance. The model’s ability to predict traffic patterns accurately can help minimize congestion, which lowers latency and improves service quality (Mazhar et al., 2023). QoS is commonly described by the metrics latency $L_{P}$ , throughput $T_{P}$ , and packet loss rate $P_{L}$ , and can be expressed as in Eq. (24).

(24) $Q o S = \frac{T_{P}}{L_{n o r m} + P_{n o r m}} .$

Throughput $(T_{P})$ is computed by aggregating normalized and weighted components from SMS, call, and internet usage data. Equations (25) and (26) compute the weighted contributions of in and out SMS, and call volumes respectively, using their normalized values. Equation (27) accounts for normalized internet activity, also weighted accordingly. Finally, Eq. (28) combines these partial components to form the complete traffic input value $(T_{P})$ , capturing the composite activity across all modalities.

(25) $T_{P1} = w_{sms\_in} \cdot \frac{\mathrm{SMS}_{in}}{\mathrm{SMS}_{in,max}} + w_{sms\_out} \cdot \frac{\mathrm{SMS}_{out}}{\mathrm{SMS}_{out,max}}$

(26) $T_{P2} = w_{call\_in} \cdot \frac{\mathrm{Call}_{in}}{\mathrm{Call}_{in,max}} + w_{call\_out} \cdot \frac{\mathrm{Call}_{out}}{\mathrm{Call}_{out,max}}$

(27) $T_{P3} = w_{internet} \cdot \frac{\mathrm{Internet}}{\mathrm{Internet}_{max}}$

(28) $T_{P} = T_{P1} + T_{P2} + T_{P3}$ where $w_{sms\_in}$ , $w_{sms\_out}$ , $w_{call\_in}$ , $w_{call\_out}$ , and $w_{internet}$ represent the weights assigned to each metric based on its relative importance to network performance. The terms $\mathrm{SMS}_{in}$ , $\mathrm{SMS}_{out}$ , $\mathrm{Call}_{in}$ , $\mathrm{Call}_{out}$ , and $\mathrm{Internet}$ denote the observed traffic values, while the terms $\mathrm{SMS}_{in,max}$ , $\mathrm{SMS}_{out,max}$ , $\mathrm{Call}_{in,max}$ , $\mathrm{Call}_{out,max}$ , and $\mathrm{Internet}_{max}$ denote the maximum values used to normalize the corresponding traffic features. Latency $L_{P}$ refers to the delay in data transmission over the network. In this framework, latency is normalized using Eq. (29).

(29) $L_{n o r m} = \frac{L_{P}}{L_{m a x}}$ where, $L_{P}$ represents the observed latency (in milliseconds). $L_{m a x}$ is the network’s maximum allowable latency threshold, such as 100 ms. Packet loss $P_{L}$ measures the percentage of data packets lost during transmission. It is normalized as expressed in the following Eq. (30).

(30) $P_{n o r m} = \frac{P_{L}}{P_{m a x}} .$

Let $P_{L}$ represent the observed packet loss as a percentage, and let $P_{m a x}$ denote the maximum acceptable packet loss threshold. The QoS score is calculated by combining the throughput with the effects of latency and packet loss. A penalty function is used to account for these impairments. The formula for QoS is in the Eq. (31).

(31) $Q o S = \frac{T_{P}}{L_{n o r m} + P_{n o r m} + ε} \cdot 100.$

Let $T_{P}$ denote the throughput, as previously defined. The normalized latency is represented as $L_{n o r m}$ , while the normalized packet loss is denoted by $P_{n o r m}$ . A small constant, ε, typically set to $10^{- 5}$ , is used to avoid division by zero. To ensure interpretability, the QoS score is expressed as a percentage and is capped at 100%. The result is shown in Fig. 11.

Figure 11: Quality of service.

DOI: 10.7717/peerj-cs.3571/fig-11
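The QoS computation of Eqs. (25) to (31) can be summarized in a few lines. In the sketch below the equal weights, the traffic values, and the 5% packet loss threshold are hypothetical; only the 100 ms latency threshold, the value of ε, and the cap at 100% follow the text.

```python
def throughput(traffic, maxima, weights):
    """Eqs. (25)-(28): weighted sum of max-normalized traffic features."""
    return sum(weights[k] * traffic[k] / maxima[k] for k in weights)

def qos_score(tp, latency_ms, loss_pct, l_max=100.0, p_max=5.0, eps=1e-5):
    """Eqs. (29)-(31): throughput penalized by latency and packet loss."""
    l_norm = latency_ms / l_max          # Eq. (29)
    p_norm = loss_pct / p_max            # Eq. (30)
    score = tp / (l_norm + p_norm + eps) * 100.0   # Eq. (31)
    return min(score, 100.0)             # capped at 100%

# Hypothetical observation for one grid cell and time interval.
traffic = {"sms_in": 50, "sms_out": 30, "call_in": 20, "call_out": 10,
           "internet": 500}
maxima = {"sms_in": 100, "sms_out": 60, "call_in": 80, "call_out": 40,
          "internet": 1000}
weights = {key: 0.2 for key in traffic}  # equal weights, an assumption

tp = throughput(traffic, maxima, weights)
score = qos_score(tp, latency_ms=40.0, loss_pct=1.0)
ideal = qos_score(tp, latency_ms=0.0, loss_pct=0.0)
```

With zero latency and zero packet loss the denominator reduces to ε, so the uncapped score would be very large; the cap keeps the result interpretable as a percentage.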

A crucial component of network performance, QoS indicates the network’s capacity to provide dependable and consistent services. For applications such as virtual reality and driverless cars that need high bandwidth and low latency, QoS is crucial in the context of next generation cellular networks. Efficient resource allocation is made possible by accurate traffic prediction, which is essential to preserving QoS. The network can efficiently distribute bandwidth during peak hours by accurately predicting traffic loads. This reduces congestion and supports steady service quality, both of which are essential for improving user satisfaction and experience.

Estimation of spectrum efficiency

To enhance resource utilization, the model leverages spatio temporal attention mechanisms. By accurately predicting traffic demand patterns, it enables more efficient spectrum allocation and better use of available bandwidth. This capability is particularly crucial for cellular networks, where the growing number of connected devices demands effective management of limited spectrum resources. The throughput $(T P_{t})$ at time $t$ is calculated as the average of the actual throughput $(A T_{t})$ and the predicted throughput $(P T_{t})$ , as given in Eq. (32).

(32) $T P_{t} = \frac{A T_{t} + P T_{t}}{2}$ where $A T_{t}$ denotes the actual throughput at time $t$ (in bits per second), $P T_{t}$ represents the predicted throughput at the same time, and $T P_{t}$ is the average observed throughput at time $t$ , also measured in bits per second.

Spectrum efficiency is calculated as the ratio of throughput to the total bandwidth (BW). The spectrum efficiency $(S E_{t})$ at time $(t)$ is expressed in Eq. (33).

(33) $S E_{t} = \frac{T P_{t}}{B W} .$

The total bandwidth (in Hz) is represented by BW. Here, $BW = 20\,\mathrm{MHz} = 20 \times 10^{6}\,\mathrm{Hz}$ , and $S E_{t}$ represents the spectrum efficiency at time $t$ (in bps/Hz). The total throughput $T P_{t o t a l}$ is the sum of the throughputs over all time periods, as given in Eq. (34).

(34) $T P_{t o t a l} = \sum_{t = 1}^{T} T P_{t}$ where $T$ is the total number of time periods and $T P_{t}$ is the throughput at time $t$ . The overall spectrum efficiency $S E_{t o t a l}$ is calculated by dividing the total throughput by the bandwidth, as in Eq. (35).

(35) $S E_{t o t a l} = \frac{T P_{t o t a l}}{B W}$ where $T P_{t o t a l}$ is the total throughput (in bits per second), i.e., the cumulative data rate over all time periods, and BW is the bandwidth (in Hz), i.e., the range of frequencies available for transmission. The overall spectrum efficiency $S E_{t o t a l}$ is expressed in bits per second per hertz (bps/Hz) and indicates how well data is transmitted using the available bandwidth. The result is shown in Fig. 12.
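A minimal numerical sketch of Eqs. (32) to (35) is given below. The throughput values are hypothetical; the 20 MHz bandwidth follows the text. The per-period mean at the end is an addition for comparison and is not part of the source equations.

```python
BW_HZ = 20e6  # 20 MHz, as stated in the text

def average_throughput(actual, predicted):
    """Eq. (32): mean of actual and predicted throughput per time step."""
    return [(a + p) / 2 for a, p in zip(actual, predicted)]

def spectrum_efficiency(throughput_bps, bandwidth_hz=BW_HZ):
    """Eqs. (33) and (35): throughput divided by bandwidth, in bps/Hz."""
    return throughput_bps / bandwidth_hz

# Hypothetical throughputs in bits per second for three time periods.
actual = [40e6, 44e6, 38e6]
predicted = [42e6, 42e6, 40e6]

tp = average_throughput(actual, predicted)
se_t = [spectrum_efficiency(v) for v in tp]    # Eq. (33), per period
tp_total = sum(tp)                             # Eq. (34)
se_total = spectrum_efficiency(tp_total)       # Eq. (35)

# Not in the source: the per-period mean, which does not grow with T.
se_mean = se_total / len(tp)
```

Because Eq. (34) sums over all periods, `se_total` scales with the number of periods considered, whereas `se_mean` stays comparable across observation windows of different length.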

This value reflects the total amount of data transmitted during the observed time period. However, its magnitude alone does not provide insight into the model’s effectiveness unless it is evaluated in the context of available bandwidth and network traffic demands. The overall spectrum efficiency achieved is 2.1188 bps/Hz, which serves as a key measure of how efficiently the bandwidth is utilized for data transmission. It represents the ratio of throughput to the bandwidth in use. To enhance spectrum efficiency and enable the network to manage fluctuating traffic loads while minimizing congestion (Hu & Qian, 2014), the proposed model employs advanced techniques. Achieving higher spectrum efficiency is vital for the optimal utilization of limited frequency resources, especially in next generation cellular systems. This strategy not only improves the quality of service but also reduces operational costs, ultimately boosting the performance and economic viability of the network infrastructure.

Estimation of network utility

The model improves network utility by accurately predicting traffic, enabling proactive resource allocation and reducing congestion risks. Its use of temporal attention and lightweight convolution ensures efficient, adaptive management of network loads. This leads to more stable performance and supports high demand applications like autonomous systems and smart cities.

Network utility is a key performance indicator used to assess the overall quality of a network. It provides a quantitative measure of how well the network performs by considering multiple factors such as throughput, latency, and resource allocation. A higher network utility score generally suggests better performance, with more efficient resource usage and optimized network conditions. The calculation of NU integrates three fundamental components: efficiency, penalty, and QoS. Traffic Demand (TD), denoted by $T D_{t}$ , is calculated as the sum of incoming and outgoing calls and SMS, as in Eq. (36).

(36) $T D_{t} = \mathrm{callin}_{t} + \mathrm{callout}_{t} + \mathrm{smsin}_{t} + \mathrm{smsout}_{t} .$

Allocated Resources (AR) at time index $t$ , denoted by $A R_{t}$ , are set to 1.1 times the traffic demand, as given in Eq. (37).

(37) $A R_{t} = 1.1 \times T D_{t} .$

Throughput $T P_{t}$ is defined as the average of the actual throughput and the predicted throughput, as shown in Eq. (38).

(38) $T P_{t} = \frac{\mathrm{actual\_throughput}_{t} + \mathrm{predicted\_throughput}_{t}}{2} .$

Efficiency $E_{t}$ quantifies how effectively the allocated resources meet the traffic demand, as defined in Eq. (39).

(39) $E_{t} = min (\frac{A R_{t}}{T D_{t}}, 1) .$

Congestion Penalty $P_{t}$ is applied when traffic demand exceeds allocated resources in Eq. (40).

(40) $P_{t} = \begin{cases} k \times {(T D_{t} - A R_{t})}^{2} & \text{if } T D_{t} > A R_{t}, \\ 0 & \text{otherwise}. \end{cases}$

$Q o S_{t}$ is calculated as a weighted combination of throughput and latency in Eq. (41).

(41) $Q o S_{t} = α \times T P_{t} - β \times {l a t e n c y}_{t}$ where $α$ is the weight for throughput and $β$ is the weight for latency. NU is the summation of utility scores over all time intervals, as given in Eq. (42).

(42) $N U = \sum_{t = 1}^{T} (E_{t} - P_{t} + Q o S_{t})$ where $E_{t}$ is the efficiency, $P_{t}$ is the congestion penalty, $Q o S_{t}$ is the quality of service, and $T$ is the total number of time intervals. Network utility quantifies performance by assessing how well resource allocation meets traffic demands. Efficiency captures the alignment between allocated resources and user needs, while penalties account for congestion when demand exceeds capacity. To reflect user experience, the metric integrates QoS parameters: higher throughput and lower latency increase utility. This combined approach offers a holistic measure for optimizing resource allocation under varying network conditions.
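Equations (36) to (42) can be combined into a single routine, sketched below. The record fields, the weights `alpha` and `beta`, the penalty factor `k`, and all numeric values are hypothetical. The `headroom` parameter generalizes the factor 1.1 of Eq. (37) so that an under-provisioned case can also be shown.

```python
def network_utility(records, alpha, beta, k, headroom=1.1):
    """Sum of efficiency - penalty + QoS over all time intervals."""
    total = 0.0
    for r in records:
        # Eq. (36): traffic demand from calls and SMS.
        td = r["call_in"] + r["call_out"] + r["sms_in"] + r["sms_out"]
        # Eq. (37): allocated resources (1.1 * TD by default).
        ar = headroom * td
        # Eq. (38): mean of actual and predicted throughput.
        tp = (r["actual_tp"] + r["predicted_tp"]) / 2
        # Eq. (39): efficiency, capped at 1 (taken as 1 when demand is zero).
        eff = min(ar / td, 1.0) if td > 0 else 1.0
        # Eq. (40): quadratic penalty only when demand exceeds allocation.
        pen = k * (td - ar) ** 2 if td > ar else 0.0
        # Eq. (41): weighted throughput minus weighted latency.
        qos = alpha * tp - beta * r["latency"]
        # Eq. (42): accumulate the utility of this interval.
        total += eff - pen + qos
    return total

records = [
    {"call_in": 10, "call_out": 5, "sms_in": 3, "sms_out": 2,
     "actual_tp": 1.0, "predicted_tp": 0.8, "latency": 20.0},
    {"call_in": 10, "call_out": 10, "sms_in": 5, "sms_out": 5,
     "actual_tp": 0.6, "predicted_tp": 0.6, "latency": 30.0},
]

nu = network_utility(records, alpha=0.5, beta=0.01, k=0.1)
nu_under = network_utility(records, alpha=0.5, beta=0.01, k=0.1, headroom=0.9)
```

With the default allocation of 1.1 times the demand, the efficiency term equals 1 and the penalty is 0 in every interval, so differences in NU come from the QoS term alone; the `headroom=0.9` call shows how the penalty and efficiency terms behave when resources fall short of demand.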

The model shows different learning behaviors for SMS, call, and internet data. SMS data converges quickly and stabilizes early, while call data takes longer, indicating higher complexity. Internet data shows minor fluctuations before rapid convergence. Figure 13 illustrates these trends, helping assess learning speed, stability, and overfitting. This analysis supports early detection of training issues and guides parameter tuning. Model convergence is verified by comparing predicted and actual errors across iterations using each data type.

Figure 13: Training and testing loss for the flow of data.

DOI: 10.7717/peerj-cs.3571/fig-13

Conclusion

This study introduces a lightweight deep learning model for network traffic prediction in cellular environments, utilizing a hybrid attention mechanism that combines temporal, spatial, and channel attention to enhance feature learning. To ensure computational efficiency, the model incorporates depthwise separable convolutions, significantly reducing processing overhead without compromising performance. Evaluations on real world cellular traffic datasets demonstrate that the proposed model outperforms existing methods in both prediction accuracy and computational cost. Its effective resource prediction supports intelligent spectrum management and communication optimization, making it especially suitable for next generation networks where efficient bandwidth usage and low latency communication are critical.

Limitations and future scope

The proposed framework focuses on short-term forecasting. Hence, its performance over longer horizons may require further validation.
Considerable computational effort is required when applying the proposed framework to large-scale datasets for long-term traffic forecasting.
The work reported in this study can be extended to resource allocation and optimization in 5G and beyond cellular networks.

Supplemental Information

Model code.

DOI: 10.7717/peerj-cs.3571/supp-1

