When normalization hurts: An empirical study on layer stacking in graph neural networks
Abstract
Normalization layers are widely used to stabilize and accelerate the training of deep learning models, yet their influence on Graph Neural Networks (GNNs) is not uniformly beneficial. This work presents a seed-controlled, depth-swept empirical study of when normalization helps and when it hurts. Across six public node-classification benchmarks and depths ranging from 1 to 10 layers, we evaluate six normalization schemes: NoNorm, BatchNorm, GraphNorm, LayerNorm, NodeNorm, and PairNorm. Each configuration is trained under five random seeds, totaling 1,400 runs. Our analysis yields three key findings: (i) normalization does not consistently improve accuracy in shallow models, though it often reduces seed-to-seed variance, indicating improved stability rather than a systematic gain; (ii) as depth increases, this stabilizing effect weakens, and several normalization methods exacerbate the performance decline rather than mitigating it; (iii) the onset and severity of this deterioration vary across datasets, reflecting differences in graph density and feature distribution. Overall, our results reveal a clear depth-dependent trade-off: normalization can stabilize training at shallow depths but may distort representation statistics and degrade accuracy when layers are stacked too deeply.
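To make the experimental setup concrete, the sketch below shows a depth-configurable GCN whose per-layer normalization can be swapped among the six schemes compared in the study. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes PyTorch Geometric's GCNConv, BatchNorm, GraphNorm, LayerNorm, and PairNorm modules, and substitutes a simplified per-node standardization for NodeNorm; all module and parameter names here are illustrative.

```python
# Minimal sketch (not the paper's code): a GCN with a configurable number of
# stacked layers and a pluggable normalization module after each hidden layer.
# Assumes PyTorch and PyTorch Geometric; "SimpleNodeNorm" is a simplified
# stand-in for NodeNorm (per-node standardization), not the original method.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, BatchNorm, GraphNorm, LayerNorm, PairNorm


class SimpleNodeNorm(nn.Module):
    """Simplified NodeNorm stand-in: scale each node's features by its own std."""
    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        return x / (x.std(dim=-1, keepdim=True) + self.eps)


def make_norm(name: str, hidden: int) -> nn.Module:
    """Return one of the six normalization options by name ('none' = NoNorm)."""
    return {
        "none": nn.Identity(),
        "batch": BatchNorm(hidden),
        "graph": GraphNorm(hidden),
        "layer": LayerNorm(hidden),
        "node": SimpleNodeNorm(),
        "pair": PairNorm(),
    }[name]


class DeepGCN(nn.Module):
    """GCN whose depth is a hyperparameter (swept from 1 to 10 in the study)."""
    def __init__(self, in_dim, hidden, out_dim, depth, norm="none", dropout=0.5):
        super().__init__()
        dims = [in_dim] + [hidden] * (depth - 1) + [out_dim]
        self.convs = nn.ModuleList(
            [GCNConv(dims[i], dims[i + 1]) for i in range(depth)]
        )
        # One normalization module per hidden layer; none after the output layer.
        self.norms = nn.ModuleList(
            [make_norm(norm, dims[i + 1]) for i in range(depth - 1)]
        )
        self.dropout = dropout

    def forward(self, x, edge_index):
        for conv, norm in zip(self.convs[:-1], self.norms):
            x = conv(x, edge_index)
            x = norm(x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        return self.convs[-1](x, edge_index)
```

In a sweep of the kind the abstract describes, one would instantiate a `DeepGCN` for every (dataset, normalization, depth, seed) combination, calling `torch.manual_seed(seed)` before each run so that seed-to-seed variance can be compared across configurations.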