A robust detect and describe framework for object recognition in early childhood education

PeerJ Computer Science

Introduction

Nursery education is a crucial stage in a child’s lifelong cognitive development. Elementary education aims to mold the thinking and socio-emotional abilities of the learner through understanding and interaction with objects. Traditionally, teachers relied mostly on static tools, such as flashcards, magazines, and textbooks, which were inadequate for fully engaging learners. The static instructional material used in conventional teaching not only encourages rote learning but also limits creativity (Gyekye-Ampofo, Opoku-Asare & Andoh, 2023). Additionally, the development of holistic cognitive ability was always hindered due to a lack of sensory-rich resources (Laghari, 2024). Computer-based technologies provide a more dynamic and interactive learning experience. Moreover, educational software and multimedia resources are well-suited for the deeper engagement of young learners (Haugland, 2000; Wang et al., 2022), whereas digital storytelling is effective in enhancing cognitive development (Xiong, Liu & Huang, 2022). Furthermore, such tools help instructors offer content tailored to the child’s preference and pace (Plowman & Stephen, 2003).

In preschool pedagogical software, digital images containing letters, shapes, and objects are analyzed. Hence, in the teaching-learning process, the detection, recognition, and textual representation of objects are necessary. Although algorithms have been proposed in the literature for detecting and labeling objects in input images, challenges such as occlusion and spatial hierarchies still need to be overcome. With the emergence of deep learning technologies, multi-class object recognition with promising accuracy has become possible. A deep learning-based image processing system is therefore an ideal solution for robustly recognizing objects and accurately presenting image content, which is the focus of this research. Learning and development in thinking, feelings, and social behavior begin in preschool. At this stage, children lay the groundwork for abilities in speaking, problem-solving, and spatial understanding. A powerful way to support these abilities is learning that involves children seeing and interacting with objects, shapes, and letters. As technology advances, AI is showing encouraging ways to influence education and transform traditional classrooms. In particular, object detection and semantic description systems can help enhance the quality of education for preschool learners by offering engaging, personalized, and interactive ways to learn.

This study presents the design and implementation of a practical framework for accurately detecting objects in input images and precisely presenting their contents in textual form. The system operates in two phases: the object extraction phase (OEP) and the object description phase (ODP). In the OEP, objects and intra-objects are detected in input images using the You Only Look Once (YOLOv8) model (Chen et al., 2025). In the second phase, the ODP, object recognition is performed by VGG16 (Simonyan & Zisserman, 2014), whereas the semantic description in textual form is generated by a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997). A custom dataset, PreEduDS (Preschool Education Dataset), comprising 730 images, is created and used for training, validation, and testing of the models. The images of the dataset, containing objects and nested objects, are gathered from elementary school books available in both soft and hard formats. Image augmentation is performed to increase the variation of images, followed by proper annotation. The YOLOv8 model in the OEP exploits a single convolutional neural network (CNN) to split the image into a grid of cells. The class probability, in addition to the likelihood of a distinct object, is predicted for each cell. As a post-processing step, non-maximum suppression (NMS) is performed to prevent overlapping detections. In the ODP, VGG16 is employed as an encoder to extract features of objects from the pixel values, which are normalized in the range of 0 to 1. The recurrent neural network, LSTM, is used as a decoder to generate descriptive text based on the features from the encoder. For easy comprehension of the objects contained in an image, object names are displayed at the detection stage alongside the descriptive text. The working schematic of the framework is shown in Fig. 1.


Figure 1: Schematic of the working of the D&D framework.

The framework is assessed from various perspectives, such as object detection in the OEP and the generation of seamless sequences after object recognition in the ODP. Almost all objects and nested objects are accurately detected by the YOLOv8 model in the first phase, with an accuracy of 96.4% and an F1-score of 0.95 obtained in this phase. However, a comparatively lower accuracy of 92% with an F1-score of 0.93 is reported when evaluating the performance of VGG16 in the second phase. As a whole, the framework applies to computer-based training (CBT) in general and to the emerging computer-based early years education (CBEY) in particular.

The remaining article is organized into four sections. “Literature Review” discusses related studies, and “Methodology” presents the methodology. Evaluation and result analysis are presented in “Implementation and Evaluation”, while the conclusion with future direction is covered in “Conclusion and Future Work”.

Literature review

Elementary education is of utmost importance in shaping the cognitive and psychomotor skills of young learners. It is evident from the literature that elementary education enhances an individual’s readiness for formal schooling and sharpens their basic skills in problem-solving and language learning (Liao et al., 2024). During this formative period, emotional and cognitive skills are molded; therefore, systematic teaching is required to enhance the learning outcomes (Shonkoff & Phillips, 2000). With the advancement of computing technology, systems have been devised to improve engagement among children. Interactive games, visuals, and animations are designed to support the harmonious growth of preschool children (Shonkoff & Phillips, 2000). To stimulate curiosity and sharpen critical thinking, virtual reality and augmented reality (AR)-based immersive learning platforms have also been proposed (Yi, Liu & Lan, 2024). Such technologies are beneficial for customized learning, particularly in addressing the issues of pace and style for each learner (Abrar et al., 2019).

The emerging technology of deep learning, particularly text generation, has revolutionized the domain of education (Ye et al., 2024; Liu et al., 2024). To better associate objects and recognize their real-world counterparts, computer vision projects play a crucial role (Simonyan & Zisserman, 2015). To better understand everyday objects, systems for object detection and recognition have been proposed (Redmon et al., 2016). For language development in early years, researchers are suggesting a text-generation system (Ye et al., 2024). To foster cognition, systems for detecting objects and generating descriptive captions are in use (Liu et al., 2016). Stance, an artificial intelligence (AI)-based system (Liu & Brailsford, 2023), interactively engages children with visual tools, whereas Chung et al. (2025) have successfully created a personalized learning environment using AI gaming. However, as stated in Chung et al. (2025), Ghandi, Pourreza & Mahyar (2023), and Lin et al. (2017), several issues arise in the detection and prediction of precise captioning in such a system. For accurate labeling, the pre-trained masked language models are proposed in Hossain et al. (2019), Shi, Dao & Cai (2025), Wang et al. (2025). However, these systems lack task-specific adaptation.

Recent studies indicate that technology is growing in importance for early learning. AR and virtual reality (VR) are being integrated into preschool classrooms to provide students with engaging experiences. For instance, AR is being explored to engage young children in learning about space and problem-solving, while VR is being utilized to support social and emotional skills in early education. These technologies help create better learning experiences for children by letting them explore educational content interactively.

Additionally, deep learning technologies are proving very helpful for recognizing objects and generating text, helping to advance education at the beginning of a child’s learning journey. New developments in AI for personalized learning in preschool demonstrate that AI systems can tailor learning content to meet each child’s individual needs. Because CNNs and recurrent neural networks (RNNs) are designed for image and text processing, they are well-suited to object recognition and the generation of descriptive text in learning software (Shi, Hayat & Cai, 2024; Liu et al., 2024; Li et al., 2025). There is therefore a pressing need for a robust system that not only effectively detects objects but also presents the contents of images in a simple, textual form. Hence, this research work proposes an applicable framework that detects, recognizes, and describes the contents of images.

Methodology

Needless to say, preschool education has a substantial impact on an individual’s cognitive development. Learning through visual aids is a widely accepted early education method to enhance children’s observational and problem-solving skills. The detection of nested objects in images is an active area of research. This study aims to design a framework that utilizes cutting-edge deep learning and transfer learning approaches to accurately detect and describe objects in an input image. The system not only helps students learn objects, digits, and alphabets easily but also enables teachers to make pedagogy more engaging and effective. The PreEduDS dataset was created by collecting 730 images from common online teaching materials and textbooks used for preschool learning. The images are drawings of objects, letters, digits, and shapes that are usually introduced in preschool. These resources were chosen because they cover the fundamental topics and educational themes that learners explore at this level. The data used in the dataset was obtained from both digital and printed learning materials.

Figure 1 lays out the main stages of the proposed framework. Before proceeding, the dataset is resized, normalized, augmented, annotated, and tokenized, ensuring the images are ready for object recognition and text generation. Resizing brings all images to a uniform size, normalization maps pixel values to a common range, augmentation adds variety to the data, annotation labels the objects in the images, and tokenization converts the annotated text into input tokens for text generation. In the next stage, the OEP, objects within the images are identified, and overlapping detections are removed to prevent repeated findings. Once the detections pass this suppression stage, they are sent to the ODP, where the recognized objects are further processed and textual descriptions are generated. At the output stage, the detected objects are displayed together with their descriptions, or tags, such as ‘A robot and apple.’ This sequence turns raw image data into meaningful output combining recognition and text. Details about the framework are presented in the following subsections.

Preprocessing

The PreEduDS dataset, comprising 730 images, is preprocessed for practical training. The standard Roboflow web app (Liu et al., 2025) is utilized to annotate the images. The annotations are saved in YOLOv8 format after normalizing the coordinates into the range of 0 to 1, and a classes.names file, containing the names of all object classes, is created and made part of the dataset. The dataset was annotated to include the relevant object classes for detection and recognition. While the dataset is diverse in terms of object types, potential biases may exist due to the sources of the images, as they primarily come from digital and physical resources commonly used in educational settings. These resources may not fully represent all cultural contexts or image variations found in a broader global context. Therefore, the dataset may have a bias toward the specific teaching materials available in certain educational systems. The steps performed for preprocessing are as follows.

Image resizing

To bring all PreEduDS images to a fixed standard size, image resizing is performed while preserving pixel information. A standard scaling factor S is applied to image I of height h and width w to obtain a resized height h′ and width w′,

$S = \dfrac{h'/w'}{h/w}.$

Resizing is followed by bicubic interpolation ($I_{bicubic}$) to obtain accurate pixel intensities. Figure 2 shows this resizing process.

$I' = I_{bicubic}(I, h', w').$

Figure 2: Image resizing followed by bicubic interpolation.
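As a hedged illustration of this step, the snippet below resizes an image to a fixed size with bicubic interpolation using OpenCV; the 224×224 target (matching the VGG16 input size mentioned later) and the file path are assumptions, not values prescribed here.

```python
# Minimal resizing sketch, assuming OpenCV; target size and path are illustrative.
import cv2

TARGET_W, TARGET_H = 224, 224  # assumed fixed standard size

def resize_bicubic(path):
    img = cv2.imread(path)  # original image I of height h and width w
    # bicubic interpolation (I_bicubic) preserves pixel detail better than nearest/linear
    return cv2.resize(img, (TARGET_W, TARGET_H), interpolation=cv2.INTER_CUBIC)

resized = resize_bicubic("PreEduDS/images/sample_001.jpg")
```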

Normalization

To bring the pixel values of images onto a standard scale, z-score normalization is employed. A normalized image $I_N$ is obtained from an input image I with c channels as,

$I_N = \dfrac{I_c - \mu}{\sigma},$ where

$\mu = \dfrac{1}{N} \sum_{c} I_c$

$\sigma = \sqrt{\dfrac{1}{N} \sum_{c} (I_c - \mu)^2}.$
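A minimal sketch of this normalization, assuming images are held as NumPy arrays of shape (H, W, C); the small epsilon is added only to avoid division by zero and is not part of the formulation above.

```python
import numpy as np

def z_score_normalize(img: np.ndarray) -> np.ndarray:
    img = img.astype(np.float32)
    mu = img.mean(axis=(0, 1), keepdims=True)    # per-channel mean
    sigma = img.std(axis=(0, 1), keepdims=True)  # per-channel standard deviation
    return (img - mu) / (sigma + 1e-8)           # I_N = (I_c - mu) / sigma
```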

Augmentation

To synthetically expand the variability of the images, augmentation is performed over the dataset. Two standard transformations, rotation ($\tau_R$) and flipping ($\tau_F$), are performed to obtain the augmented images $I_{Aug}$. The original labels of the images are retained during the production of augmented images.

$I_{Aug} = \tau_R(I, \theta)$

$I_{Aug} = \tau_F(I, d_r),$

where $\theta$ is the rotation angle and $d_r$ the direction of flipping (vertical or horizontal). The augmentation of an input image is shown in Fig. 3. The training data was expanded using image augmentation to avoid overfitting. A portion of the images was rotated, typically by ±30 degrees, to enable the model to detect items at various angles, and some images were flipped so that the model can cope with mirrored versions of objects. The added variety helps the model generalize to unseen orientations and placements of objects.

Figure 3: Input image (Orig.) with its augmentation versions.
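The sketch below mirrors the rotation and flipping transformations described above, assuming OpenCV; the ±30 degree range follows the text, while the 50% flip probability is an illustrative choice.

```python
import random
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    h, w = img.shape[:2]
    theta = random.uniform(-30, 30)                           # rotation angle (tau_R)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), theta, 1.0)
    out = cv2.warpAffine(img, M, (w, h))
    if random.random() < 0.5:                                 # flipping (tau_F)
        out = cv2.flip(out, random.choice([0, 1]))            # 0: vertical, 1: horizontal
    return out
```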

Annotation

Annotation is performed to label the objects in the dataset. For the n objects of an image I, the object $O_m$ is annotated as,

$O_m = \{C_l, B_x, B_y, B_w, B_h\},$

where $C_l$ represents the class label, $B_x$, $B_y$ the coordinates of the bounding box around the object, and $B_w$, $B_h$ the width and height of the box. The YOLOv8-format text file of the image $I_f$ containing n objects is presented as:

$I_f = \begin{bmatrix} C_{l_1} & B_{x_1} & B_{y_1} & B_{w_1} & B_{h_1} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ C_{l_n} & B_{x_n} & B_{y_n} & B_{w_n} & B_{h_n} \end{bmatrix}.$

The process of annotation is performed for each $I \in$ PreEduDS.Train with the Roboflow app. Once annotation is completed, the labels with caption files are integrated into PreEduDS, as shown in Fig. 4.

Figure 4: The process of annotation and updation of the PreEduDS.
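To make the label layout concrete, the snippet below writes one object in the YOLOv8 text format (class index followed by the normalized box centre and size); the class index, coordinates, and file path are hypothetical.

```python
def yolo_label_line(cls_id, bx, by, bw, bh):
    # all box values are normalized to [0, 1] relative to the image width/height
    return f"{cls_id} {bx:.6f} {by:.6f} {bw:.6f} {bh:.6f}"

with open("PreEduDS/labels/sample_001.txt", "w") as f:
    f.write(yolo_label_line(3, 0.512, 0.430, 0.220, 0.310) + "\n")  # one object per line
```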

Tokenization

The neural network models encode each image $I \in$ PreEduDS into a d-dimensional feature vector $F_v$:

$F_v = \text{Encoder}(I), \quad F_v \in \mathbb{R}^d.$

The label L of an object O in image I, consisting of the words $w_1$ to $w_n$, is given as:

$L = \{w_1, w_2, \ldots, w_n\}.$

To mark the start and end of each label, a starting word Str and a trailing word End are added:

$L = \{Str, w_1, w_2, \ldots, w_n, End\}.$

With a tokenizer $T_r$, each word $w_i$ is mapped to a token $t_i$ in the range 1 to the vocabulary size $|V|$, giving the tokenized label $L_T$:

$t_i = T_r(w_i), \quad t_i \in \{1, 2, \ldots, |V|\}$

$L_T = \{t_1, t_2, \ldots, t_n\}, \quad L_T \in \mathbb{Z}^n.$

With an embedding matrix M, each token $t_i$ is embedded into a dense vector:

$e_i = M[t_i], \quad e_i \in \mathbb{R}^k.$

The label is thus represented as a sequence of dense embedding vectors of dimension k for onward processing:

$M_L = \{e_1, e_2, \ldots, e_n\}, \quad e_i \in \mathbb{R}^k.$

Serving as a decoder, the LSTM generates a predicted token $P_t$ from the feature vector at hidden state $h_{t_i}$ and time $t_i$:

$P_t = \text{Decoder}(t_i, F_v, h_{t_i}).$
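A hedged sketch of the tokenization and embedding steps using the Keras preprocessing utilities (TensorFlow is among the libraries used for implementation); the example caption, sequence length, and embedding size are illustrative choices rather than the exact settings used here.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

captions = ["<start> two cats with digit 2 <end>"]  # label L with Str/End markers

tokenizer = Tokenizer(filters="", lower=True)        # keep <start>/<end> intact
tokenizer.fit_on_texts(captions)
seqs = tokenizer.texts_to_sequences(captions)        # L_T: words mapped to tokens t_i
seqs = pad_sequences(seqs, maxlen=10, padding="post")

embedding = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=256)  # matrix M
dense_vectors = embedding(seqs)                      # e_i vectors, shape (1, 10, 256)
```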

OEP

In the first phase, objects and intra-objects are detected in the input image by the prominent YOLOv8 neural network. Unlike CNN models that require multiple passes over an image for object identification, YOLO identifies and classifies objects in a single pass by treating the input image as a grid of cells. The OEP comes first and feeds the information it gathers into the ODP: YOLOv8 detects and locates objects, and the resulting bounding boxes are transmitted to the ODP, where VGG16 recognizes the objects and the LSTM model generates descriptions based on the features extracted by VGG16. Because detection leads directly to description, objects are not only recognized but also accompanied by meaningful, context-specific text. By examining which parts of an object fall in which cells, multiple objects are identified simultaneously. A feature pyramid network of convolutional layers processes the input image at multiple resolutions; using a top-down pathway with lateral connections, it assembles multi-level features, allowing the network to handle objects of any size. For this reason, YOLOv8 performs well on objects that appear either small or large in an input image.

The LSTM model describes the features identified by the object recognition process. YOLOv8 first detects objects, VGG16 then extracts meaningful features from those objects, and the LSTM uses these features as input. Working as a decoder, the LSTM converts the sequence of features from VGG16 into an output sequence of words, producing a detailed text description of the objects found in the image. When object recognition and the LSTM are combined, the framework can both identify and describe objects, providing a comprehensive explanation of what is in the image. In the feature pyramid ($\rho_x$), fine details of simple and small objects are extracted by $\rho_1$, whereas semantic features are captured by the larger receptive fields ($\rho_2$ to $\rho_5$). If $F_i$ and $F_o$ are the input and output features, w the convolutional weights, and b the bias, the process is represented as:

$F_o = w * F_i + b.$

In the neck of YOLOv8, feature refinement and upsampling (U) are performed using single (C) and double convolutional layers (C2f), as shown in Fig. 5, which illustrates the YOLOv8 architecture. The goal is to enhance the spatial resolution of the feature maps and to harmonize the features of the deeper and shallower layers. For bilinear upsampling, the interpolation is performed as:

$F_u(x, y) = \sum_{k=0}^{1} \sum_{l=0}^{1} IW_{k,l}\, F_i(k, l),$

where $IW$ is the interpolation weight.


Figure 5: The YOLOv8 architecture, where Pn represents the feature pyramid levels, C a single convolutional layer, C2F two convolutional layers with feature fusion, and U the upsampling operation.

The head of YOLOv8 predicts each object's class using softmax after locating it in the image map, with the probability $C_i$ of the $i$th class given as:

$C_i = \dfrac{e^{x_i}}{\sum_{l=1}^{C} e^{x_l}},$

where $x_i$ and $x_l$ represent the logits (raw scores) of the $i$th and $l$th classes.

The binary cross-entropy loss ($E_L$) is computed for each prediction to determine whether a grid cell contains an object, where y is the ground-truth label and p the predicted probability:

$E_L = -[y \cdot \log(p) + (1 - y) \cdot \log(1 - p)].$
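As a small illustration of this objectness loss, the PyTorch call below computes the mean binary cross-entropy over a few grid cells; the probabilities and labels are toy values.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.7])    # predicted probability that each cell holds an object
y = torch.tensor([1.0, 0.0, 1.0])    # ground-truth presence per cell
loss = F.binary_cross_entropy(p, y)  # mean of -[y*log(p) + (1-y)*log(1-p)]
```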

To avoid overlapping and redundant cell detections, non-maximum suppression (NMS) is performed as a post-prediction operation. NMS refines the raw predictions and ensures accurate detection of objects: it removes the extra boxes that appear for the same object by keeping only the one with the top score, and boxes that share large regions are marked for removal. By filtering out false detections and retaining only the important ones, NMS leads to a more accurate, non-repeated detection outcome. For every cell, the intersection over union ($I_U$) with the highest-scoring cell is computed from the area of overlap ($A_O$) and the area of union ($A_U$), as shown in Fig. 6:

$I_U = \dfrac{A_O}{A_U}.$

If $I_U > 0.5$, the overlapping cells are deemed redundant and excluded from onward processing.


Figure 6: Object detection with NMS to avoid false detection and overlapping.
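The following sketch shows IoU-based suppression in the spirit of the description above, with boxes given as (x1, y1, x2, y2) corners and the 0.5 threshold taken from the text; it is a simplified stand-in for the suppression built into YOLOv8.

```python
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    overlap = max(0, x2 - x1) * max(0, y2 - y1)               # area of overlap A_O
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - overlap)       # area of union A_U
    return overlap / union if union > 0 else 0.0

def nms(boxes, scores, thr=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:                                           # highest score first
        if all(iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)                                    # drop boxes overlapping a kept one
    return keep
```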

In our framework, the interaction between the YOLOv8, VGG16, and LSTM models occurs in two main phases. In the first phase, YOLOv8 is used for object detection, where it identifies and locates objects and intra-objects in the input image. YOLOv8 works by dividing the image into a grid, and each grid cell predicts class probabilities and bounding box coordinates. Once objects are detected, their bounding boxes are passed on to the second phase. In the second phase, the VGG16 model utilizes the detected objects for feature extraction. It processes the images of detected objects, extracting relevant features from the pixel values. These extracted features are then fed into the LSTM model. The LSTM serves as a decoder, generating a semantic description of the objects in the image. The LSTM model uses the feature representations from VGG16 to predict a sequence of words, effectively generating a textual description of the objects present. This interaction between detection, feature extraction, and text generation enables our framework to not only detect objects but also describe them in a meaningful way.
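A hedged end-to-end sketch of this interaction: YOLOv8 detects objects, each detected region is encoded by VGG16, and a decoder produces the caption. The weight path and the describe() callback (standing in for the trained LSTM decoder) are hypothetical; only the Ultralytics and Keras calls shown are standard APIs.

```python
import cv2
import numpy as np
from ultralytics import YOLO
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

detector = YOLO("runs/detect/train/weights/best.pt")        # assumed trained OEP weights
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")

def detect_and_describe(image_path, describe):
    """describe(feature_vector) -> caption; placeholder for the trained LSTM decoder."""
    img = cv2.imread(image_path)
    result = detector(img)[0]                                # OEP: single forward pass
    captions = []
    for box in result.boxes.xyxy.cpu().numpy().astype(int):
        x1, y1, x2, y2 = box
        crop = cv2.resize(img[y1:y2, x1:x2], (224, 224))     # VGG16 input size
        feats = encoder.predict(preprocess_input(crop[np.newaxis].astype("float32")))
        captions.append(describe(feats[0]))                  # ODP: feature -> text
    return captions
```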

ODP

In the second phase, a meaningful descriptive text about an input image is generated after recognizing the objects within the image. The CNN-based deep learning model VGG16 is utilized for object recognition, and the LSTM is employed for text generation. The classes of objects predicted by YOLOv8 in the OEP are fed to the ODP for accurate recognition. If $D_o = \{O_1, O_2, \ldots, O_n\}$ is the set of detected objects for an image, VGG16 is trained to recognize each $O_x \in D_o$, while the LSTM extracts the exact label $l_x \in L_o$, where $L_o = \{l_1, l_2, \ldots, l_n\}$. In the decoding process, the annotation $A = \{D_o, L_o\}$ is used for predicting labels for the detected objects and generating meaningful descriptive text. The working of the phase is presented in Fig. 7.


Figure 7: Schematic of the ODP from preprocessing to generation of semantic caption.

Encoding in ODP

For encoding, the VGG16 model introduced by Simonyan & Zisserman (2014) is exploited. The model comprises 13 convolutional layers and three fully connected layers. In these layers, 3×3 filters are applied to capture the details of an input image. Moreover, the feature map is reduced by using repeated stacks of convolutional layers followed by max-pooling layers. If m×m is the filter size with m = 3, r and s the spatial position in an input image I, w the weights, and b the bias, the convolution operation ($\gamma$) of VGG16 is given as:

$\gamma(r, s) = \sum_{i=1}^{m} \sum_{j=1}^{m} I(r+i, s+j) \cdot w_{i,j} + b.$

Taking the maximum value within the pooling window, the max-pooling (MP) operation of VGG16, with i, j being the offsets within the pooling window, is given as:

$MP(r, s) = \max_{i, j \in \{0, 1\}} I(r+i, s+j).$

The feature map $F_m = \text{VGG16}(I)$ is employed for onward processing. With a sliding window over $F_m$, region proposals $R_p = \{r_1, r_2, \ldots, r_k\}$ are generated, and the probability P that a region contains an object or part of an object is given as:

$P(r_i \mid F_m) = \sigma(w_t \cdot F_m + b_r),$

where $\sigma$ is the sigmoid function and $w_t$, $b_r$ are the trainable weights and bias. With the softmax function, a relevant class c is assigned to region $r_i$ as:

$P(c \mid r_i) = \dfrac{e^{s_c}}{\sum_{i=1}^{C} e^{s_i}},$

where $s_c$ is the score of class c and C is the number of classes.

The predicted class $C_p$ for $r_i$ is the one with the maximum probability:

$C_p = \arg\max_{c} P(c \mid r_i).$

For nested objects, the attention weight $AT_{ij}$ for regions i and j is computed from the corresponding feature maps ($F_{m_i}$, $F_{m_j}$) as:

$AT_{ij} = \dfrac{\exp(F_{m_i} \cdot F_{m_j})}{\sum_{m=1}^{P} \exp(F_{m_i} \cdot F_{m_j})}.$

In the proposed system, a resized image I of dimension 224×224×3 is fed to VGG16. The image pixels are normalized to [0, 1] before training and, excluding the fully connected layers, I is passed through the convolutional layers of VGG16 to extract $F_m$ of dimension d, $F_m \in \mathbb{R}^d$.
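A minimal Keras sketch of the encoder described above: the convolutional part of VGG16 yields the feature representation F_m, and two small heads, added here for illustration only, give the object probability (sigmoid) and class distribution (softmax). The number of classes is an assumption.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, Model

NUM_CLASSES = 40  # assumed number of PreEduDS object classes

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
f_m = layers.GlobalAveragePooling2D()(base.output)                   # pooled feature map F_m (d = 512)
objectness = layers.Dense(1, activation="sigmoid")(f_m)              # P(r_i | F_m)
class_probs = layers.Dense(NUM_CLASSES, activation="softmax")(f_m)   # P(c | r_i)

encoder = Model(inputs=base.input, outputs=[f_m, objectness, class_probs])
encoder.summary()
```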

Decoding in ODP

As the LSTM is an improved RNN trained with backpropagation (Hochreiter & Schmidhuber, 1997), the model is utilized for generating descriptive text for the detected objects. In the hidden layer, the LSTM consists of interconnected memory cells. An input gate controls the input to a memory cell, while an output gate manages the output from a memory cell to the network. The model maps an input sequence $X = (x_1, x_2, \ldots, x_n)$ to an output sequence $Y = (y_1, y_2, \ldots, y_n)$ by iteratively applying the unit activations in the following equations.

$I_t = \chi(\omega_i[\tau_{t-1}, x_t] + \beta_i)$

$F_t = \chi(\omega_f[\tau_{t-1}, x_t] + \beta_f)$

$O_t = \chi(\omega_O[\tau_{t-1}, x_t] + \beta_O)$

Here $I_t$, $F_t$, and $O_t$ are the input, forget, and output gates; $\omega$ denotes the weight matrices (e.g., $\omega_i$ for the input gate and $\omega_f$, $\omega_O$ for the forget and output gates, respectively); $\tau_{t-1}$ represents the previous LSTM state at time step $t-1$; and $\beta_x$ is the bias of the respective gate. The gate activation function (sigmoid) is represented by $\chi$. Although the LSTM has several variants, its basic architecture consists of a memory cell and three gates (input, output, and forget) that regulate the flow of information. The basic structure of a simple LSTM is shown in Fig. 8.

Figure 8: The standard structure of the LSTM model.

In the ODP, the LSTM is trained on the image captions together with the object class names. The model serves as a decoder, generating semantic text based on the features extracted by the encoder. At the text level, captions are tokenized into subwords, and the vocabulary is transformed into sequences of integer indices. The object label $L = \{w_1, w_2, \ldots, w_n\}$ is tokenized and then embedded for the LSTM operation.

$L_T = \{t_1, t_2, \ldots, t_n\}, \quad L_T \in \mathbb{Z}^n$

The tokens are embedded into dense vectors $D_v$ of dimension k,

$D_v = \{e_1, e_2, \ldots, e_n\}, \quad e_i \in \mathbb{R}^k$

Thus, the training data consists of image features, tokenized sequences, and embedded captions, as shown in Fig. 9.

The softmax activation function of the LSTM computes the probability distribution $P_d$ for predicting a word $w_{x_i}$ from the un-normalized log probabilities U as:

$P_d(y = w_{x_i} \mid U) = \dfrac{e^{U_i}}{\sum_c e^{U_c}},$

where U is the network’s output prior to applying softmax and c indexes the classes. The temperature ($T_{mp}$) of the LSTM is kept moderate to avoid distortions in the generated text. With the inclusion of $T_{mp}$, the softmax function is altered as:

$P_d(y = w_{x_i} \mid U) = \dfrac{e^{U_i / T_{mp}}}{\sum_c e^{U_c / T_{mp}}}.$

A 10% portion of the dataset was reserved as a validation split while the models were being trained. Such splitting means that the models learn from the training data without overfitting, so they generalize well to unseen data. Every time a model is trained on the training data (80% of the set), it is tested on the validation data to check its accuracy, precision, recall, and F1-score. The model that performs best on the validation set is stored and tested further. Additionally, k-fold cross-validation ensures that the model is robust and the outcomes are not biased by a single data partition. After training, the models are tested to determine if any updates are needed, and this process is repeated for 120 epochs. During text prediction from the feature map $F_m$ and the sequence of tokens L, the LSTM generates a term $w_x$ at each hidden state ($h_s$). Each next term $w_{x+1}$ is appended to the predicted text, and the process is repeated until a meaningful caption for the image is generated (see Fig. 9). The equations governing $h_s$ and the next term are:

$h_s = \text{LSTM}(h_{s-1}, [w_x, F_m])$

$w_{x+1} = \arg\max P(w_{x+1} \mid h_s).$

Figure 9: Image features with tokenized and embedded captions.
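The greedy decoding loop implied by these equations can be sketched as follows, including the temperature-scaled softmax from the previous subsection. The decoder_step callable, vocabulary dictionaries, and maximum length are hypothetical placeholders for the trained LSTM and its tokenizer.

```python
import numpy as np

def softmax_with_temperature(logits, temp=1.0):
    z = logits / temp
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate_caption(decoder_step, f_v, word2id, id2word, max_len=20, temp=1.0):
    state = None
    token = word2id["<start>"]
    words = []
    for _ in range(max_len):
        logits, state = decoder_step(token, f_v, state)   # h_s = LSTM(h_{s-1}, [w_x, F_m])
        probs = softmax_with_temperature(logits, temp)
        token = int(np.argmax(probs))                     # w_{x+1} = argmax P(. | h_s)
        if id2word[token] == "<end>":
            break
        words.append(id2word[token])
    return " ".join(words)
```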

Implementation and evaluation

The proposed framework is implemented in Python using the Ultralytics package in Colab. The PyTorch and TensorFlow libraries are used for training, and NLTK is used for text tokenization and preprocessing. Matplotlib and OpenCV are used for visualization and image manipulation.

Dataset

A custom dataset, PreEduDS, comprising 730 images, is created locally from online teaching resources and textbooks. Each of the images contains objects, letters, digits, and/or shapes normally taught at the preschool level. Besides the images folder, the captions are stored in a separate CSV file. The file is used for tokenization (using NLTK Punkt) and one-hot encoding. The W2V model is trained using the vocabulary dictionary with starting and trailing words as [‘<start>’, ‘Two’, ‘cats’, ‘with’, ‘digit’, ‘2’, ‘<end>’]. Besides the images and caption.csv, the labels and class names are also included in the dataset, as shown in Fig. 10.


Figure 10: Illustration of object recognition and textual description in ODP.

Training with validation

The system is trained using 80% of the images from the PreEduDS dataset, whereas 10% each is used for validation and testing. In the training of YOLOv8, the images and label files are used along with the classes.names file, whereas the annotated images and captions are used for training VGG16 and the LSTM. For effective training, validation is performed in parallel, with epochs set to 120 and a batch size of 8 in each. The history of the first 200 epochs is presented in Fig. 11.


Figure 11: Structure of the PreEduDS custom dataset.
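A hedged sketch of the 80/10/10 split and the YOLOv8 training call with the settings quoted above (120 epochs, batch size 8); the data.yaml path and the pretrained checkpoint used as a starting point are assumptions.

```python
import random
from ultralytics import YOLO

image_ids = list(range(730))                      # PreEduDS image indices
random.seed(0)
random.shuffle(image_ids)
n_train, n_val = int(0.8 * 730), int(0.1 * 730)
train_ids = image_ids[:n_train]
val_ids = image_ids[n_train:n_train + n_val]      # 10% validation
test_ids = image_ids[n_train + n_val:]            # remaining 10% for testing

model = YOLO("yolov8n.pt")                        # assumed pretrained starting checkpoint
model.train(data="PreEduDS/data.yaml", epochs=120, batch=8)
```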

Evaluation of the OEP

The first phase is evaluated by assessing how accurately YOLOv8 detects objects in both training images and unseen images from the test split. As in most preschool education software, teaching is based on pre-defined, built-in image datasets, which is why the train split is also included in the evaluation. YOLOv8 achieves a promising accuracy of 99.2% on the training set and 93.7% on the test split, resulting in an average accuracy of 96.4%. The precision, recall, and F1-score obtained for the model are shown in Table 1, whereas the area under the curve (AUC) is shown in Fig. 12. The formulas for precision (P), recall (R), and F1-score are given in the following equations.

$P = \dfrac{TP}{TP + FP} \qquad R = \dfrac{TP}{TP + FN} \qquad F1 = \dfrac{2 \times P \times R}{P + R}.$
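These metrics follow directly from the confusion-matrix counts; the helper below computes them, with the counts shown being placeholders rather than the paper's actual values.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=96, fp=2, fn=4)   # illustrative counts only
```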

Table 1:
Outcomes of the evaluation of OEP.
Image source AUC Precision Recall Accuracy F1-score
Training 0.97 0.98 0.96 0.992 0.97
Testing 0.92 0.93 0.91 0.937 0.94
DOI: 10.7717/peerj-cs.3080/table-1

Figure 12: The cumulative accuracy-vs-epochs plot of the models.

If α represents the number of correct classifications and β the total number of predicted classes, accuracy A is given as:

$A = \dfrac{\alpha}{\beta}.$

Similarly, if p and q are the coordinates of the points on the curve, then AUC is represented as:

$AUC = \sum_{j=1}^{J-1} \dfrac{(p_{j+1} - p_j)}{2} \times (q_j + q_{j+1}).$
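The trapezoidal sum above can be computed as follows, assuming p holds the x-coordinates of the curve points (e.g., false-positive rates) and q the y-coordinates (true-positive rates); the sample points are a toy curve.

```python
import numpy as np

def auc_trapezoid(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p[1:] - p[:-1]) * (q[:-1] + q[1:]) / 2.0))

fpr = [0.0, 0.1, 0.3, 1.0]      # toy curve points
tpr = [0.0, 0.7, 0.9, 1.0]
print(auc_trapezoid(fpr, tpr))  # about 0.86 for this toy curve
```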

As is clear from Fig. 12, the quick rise toward the top-left indicates the efficient performance of YOLOv8.

Evaluation of the ODP

The performance of the two models, VGG16 and LSTM, is assessed in terms of speed and accuracy using the same dataset splits. Although VGG16 is reported to have comparatively low accuracy for object recognition, LSTM generates captions with a promising accuracy of 96.5%. The AUCs of VGG16 and LSTM are shown in Fig. 13, while their precision, recall, and F1-scores are presented in Table 2.


Figure 13: The area under ROC curve about the performance of YOLOv8.

Table 2:
Outcomes of the evaluation of ODP.
Model Image source AUC Precision Recall Accuracy F1-score
VGG16 Training 0.94 0.9 0.91 0.96 0.96
Testing 0.88 0.87 0.85 0.89 0.90
LSTM Training 0.98 0.93 0.97 0.98 0.94
Testing 0.94 0.91 0.93 0.95 0.92
DOI: 10.7717/peerj-cs.3080/table-2

The performance of the models is also cross-checked against other state-of-the-art models (Li et al., 2025; Kriouile et al., 2024; He et al., 2017; see Table 3). As shown in Fig. 14, a promising accuracy can be achieved when the number of epochs reaches 200. Therefore, setting the number of epochs below 500 may further improve training efficiency in addition to the other metrics. To assess the reliability of the results, 95% confidence intervals were computed for object recognition and text generation. Objects were correctly identified 94.4% of the time, with the confidence interval extending up to 95.6%. Text generation reached an accuracy of 96.5%, with a 95% confidence interval of 95.2% to 97.8%. A paired t-test was applied to determine whether the differences were statistically significant; results with p-values below 0.05 were considered significant.

Table 3:
Comparison of the three models with other standard models.
Model Inference time (ms) Params (M) Memory (MB) FLOPs MACs
Mask region-based CNN (Mask R-CNN) (Kriouile et al., 2024) 99 41 168 341 170
Faster region-based CNN (Faster R-CNN) (He et al., 2017) 82 31 121 230 115
Feature pyramid network (FPN) (Wang et al., 2025) 181 72 289 571 285
YOLOv8 (Chen et al., 2025) 15 7.8 31 17.1 77
VGG16 (Simonyan & Zisserman, 2014) 93 121 328 13.2 90
LSTM (Hochreiter & Schmidhuber, 1997) 12 1.9 7 1.4 13
DOI: 10.7717/peerj-cs.3080/table-3

Figure 14: The area under ROC curve of (A) VGG16 and (B) LSTM.

Conclusion and future work

Elementary and early childhood education has a significant role in the lifelong cognitive and socio-emotional development of an individual. Visuals and images illustrating shapes and objects play a crucial role in childhood education, not only helping young learners understand patterns and spatial relationships but also providing a strong foundation for math and problem-solving skills. Dedicated preschool illustrative software enables instructors to present content in a preferred manner and at a suitable pace. Moreover, young learners are engaged in an immersive way that fosters curiosity and creativity in learning. In all such visually engaging software, proper detection and recognition of objects are needed. Although several AI-based gamified software tools have been proposed, little attention has been paid to the precise detection and description of objects. As emerging deep learning technologies possess immense potential to accurately recognize and precisely present the objects in input images, they are well-suited for this purpose. This research work presents a practical framework for accurately detecting objects and nested objects in an input image and precisely describing their contents. YOLOv8 and VGG16 are utilized for object detection and recognition, whereas the efficient LSTM model is used for predicting descriptive text. The framework is implemented in Colab using the Ultralytics, PyTorch, OpenCV, and NLTK packages. A custom dataset, PreEduDS, containing 730 images, is used for training and testing. Accuracy scores of 94.4% and 96.5% are achieved for object detection and descriptive text generation, respectively. The comparative analysis, focusing on processing costs and resource utilization, demonstrates that the framework is applicable to emerging preschool tutoring software. The framework can be applied in interactive preschool apps, enhancing vocabulary learning by recognizing objects and generating descriptive text. It can also be integrated into AR tools, allowing children to interact with real-world objects. Additionally, it could assist educators by automatically labeling and describing objects in classroom materials. As part of our future work, the framework will be enhanced to pronounce the detected objects and to present the descriptive text in various natural languages.

Supplemental Information