Background: Graph representation learning (GRL) has become a powerful paradigm for modeling the relational structure of visual data in computer vision tasks. Unlike traditional grid-based representations, GRL offers a flexible way to model non-Euclidean structures, capturing long-range dependencies while facilitating structured reasoning. Existing surveys focus primarily on individual tasks or data modalities, leaving the field without a unified conceptual framework for understanding how different graph schemas interact to shape visual reasoning.
Methodology: Following standard literature search guidelines, we collected studies published between 2017 and 2025 from major CV/ML conferences and journals using targeted GRL-related keywords. From this corpus, we identified representative works covering seven graph schemas and extracted their graph designs, learning mechanisms, and task-level findings to build our taxonomy.
Results: Our analysis identifies seven fundamental graph schemas: pixel, point, skeleton, object, region, image, and label graphs, forming a hierarchy that ranges from low-level spatial detail to high-level semantic abstraction. For each schema, we review representative models, discuss graph construction strategies, and compare learning mechanisms, including graph convolutional networks, graph attention networks, spatiotemporal GNNs, and graph transformers. Common challenges across schemas include reliance on predefined topologies, uniform neighbor aggregation, difficulty modeling long-range dependencies, and scalability limitations on large graphs. Emerging solutions include graph structure learning, attention-based and adaptive message passing, hierarchical and multiscale graph designs, and the integration of graph reasoning into transformer architectures.
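As context for the mechanisms named above, the minimal sketch below illustrates the neighborhood-aggregation primitive they all build on, using the standard GCN propagation rule H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W). The toy graph, feature sizes, and function name are illustrative assumptions, not drawn from any surveyed model.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer (Kipf & Welling propagation rule):
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))      # D^{-1/2} of the self-loop graph
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)             # aggregate neighbors, transform, ReLU

# Toy example: 4 nodes (e.g., four image regions) with 3-dim features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)              # illustrative adjacency matrix
rng = np.random.default_rng(0)
H = rng.random((4, 3))                                 # node features
W = rng.random((3, 2))                                 # weights (learned in practice, fixed here)
print(gcn_layer(A, H, W).shape)                        # (4, 2): updated node embeddings
```

Attention-based and adaptive message passing, named above as an emerging solution, replaces this fixed symmetric normalization with edge weights computed from the node features themselves.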
Conclusions: This survey presents a unified taxonomy of graph representation learning in computer vision, synthesizing methodological trends across diverse graph schemas. By clarifying the structural roles of graphs from pixels to labels, we provide a coherent framework for understanding how graph-based reasoning advances visual recognition, geometric understanding, and scene-level interpretation. Open challenges, such as adaptive graph construction, efficient reasoning on large graphs, and standardized evaluation, highlight important directions for future research. The insights presented here aim to guide the development of more scalable, interpretable, and context-aware graph-based solutions.