Leveraging video analysis for early detection of psychosocial distress in educational settings
Abstract
Early identification of psychosocial distress in students is essential for creating supportive educational environments and promoting mental well-being. Traditional methods, which rely on subjective self-reports or periodic assessments, often lack temporal granularity and are susceptible to underreporting, limiting timely intervention. Video analysis offers a non-intrusive means of monitoring behavioral cues linked to psychosocial distress, but existing approaches overemphasize coarse-grained action classification or frame-level features, overlooking the intricate temporal dynamics and subtle postural variations present in real-world settings. To overcome these limitations, we propose a framework that integrates structured video analysis with symbolic formalization, representation learning, and cross-sequence temporal reasoning. The Temporal Attentive Pose Embedding Network (TAPEN), an end-to-end trainable architecture, captures posture-level semantics and temporal continuity through a dual-stream design combining self-attention and recurrent layers. Complementing TAPEN, the Cross-Sequence Temporal Reconciliation (CSTR) strategy synthesizes multiple video streams into a unified behavioral timeline, enabling accurate prediction of distress signals across diverse contexts. Experimental results show that the proposed system identifies psychosocial cues with high accuracy under sparse annotations and varying environmental conditions, contributing to AI-driven systems that promote psychosocial health through adaptive, context-aware behavioral analysis.
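The dual-stream posture embedding described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only, not the paper's actual TAPEN implementation: the single-head self-attention, the plain tanh recurrence, the concatenation-plus-mean-pooling fusion, and all tensor shapes (e.g. 17 two-dimensional keypoints per frame) are hypothetical choices.

```python
import numpy as np

def self_attention(X):
    # Single-head scaled dot-product self-attention over a pose sequence.
    # X: (T, D) matrix of per-frame pose feature vectors.
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # (T, T) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
    return weights @ X                            # (T, D) context-mixed features

def recurrent_stream(X, W_h, W_x):
    # Plain tanh RNN modeling frame-to-frame temporal continuity.
    T, _ = X.shape
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ X[t])
        states.append(h)
    return np.stack(states)                       # (T, H) hidden states

def tapen_embedding(X, W_h, W_x):
    # Hypothetical dual-stream fusion: run both streams, concatenate
    # per frame, then mean-pool over time into one posture embedding.
    att = self_attention(X)                       # (T, D)
    rec = recurrent_stream(X, W_h, W_x)           # (T, H)
    fused = np.concatenate([att, rec], axis=1)    # (T, D + H)
    return fused.mean(axis=0)                     # (D + H,) sequence embedding

rng = np.random.default_rng(0)
T, D, H = 16, 34, 8   # e.g. 16 frames of 17 2-D keypoints, hidden size 8
X = rng.normal(size=(T, D))
W_h = rng.normal(size=(H, H)) * 0.1
W_x = rng.normal(size=(H, D)) * 0.1
emb = tapen_embedding(X, W_h, W_x)
print(emb.shape)  # → (42,)
```

In a trained system the weights would be learned end-to-end and each stream would be deeper, but the sketch shows how attention (global context within a sequence) and recurrence (local temporal continuity) yield complementary views of the same pose sequence before fusion.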