Intelligent toy tracking trajectory design based on mobile cloud terminal deployment and depth-first search algorithm

PeerJ Computer Science

Introduction

With the booming development of the smart toy market, trajectory tracking technology has become a crucial innovation point that captures consumers' attention. This technology enables realtime tracking of the movement trajectory of smart toys, providing users with a more intelligent, personalized, and highly interactive gaming experience. Consequently, it significantly enhances user satisfaction and deepens their reliance and trust in smart toys. More importantly, by leveraging trajectory tracking technology, we can develop intelligent toys that greatly benefit children's learning and growth. These toys not only entertain but also possess educational value, and thus have great practical application significance. The trajectory tracking of intelligent toys has distinct technical characteristics (Akdeniz & Ozdinc, 2021; Moradi, Amiri & Ghanavi, 2017). First, it requires realtime acquisition, processing, and display of toy location information to support realtime monitoring and an interactive gaming experience. Second, accurate positioning information is the basis of intelligent toy trajectory tracking, so the positioning technology must provide high-precision location data both indoors and outdoors (Druga, Williams & Park, 2018). Third, because toys must remain portable, trajectory tracking technology must rely on lightweight, small, and simple sensors and algorithms that can be easily integrated into various toys (Luo, 2023). Finally, smart toys may be used in different environments, including indoor and outdoor settings and varied terrain, so the trajectory tracking technology should be adaptable and stable (Wang, Yin & Zhang, 2021; McStay & Rosner, 2021; Chen, Wang & Huang, 2006). Therefore, factors such as high-precision positioning, changing scenes, and complex path movement are the main difficulties in tracking intelligent toys (Lemma, Celik & Katzenbeisser, 2008).

To deal with the above difficulties, many researchers have focused on single-object tracking technology for intelligent toys (Qian, Li & Xue, 2023; Yang, Lu & Wu, 2018; Delprino, Piva & Tommasi, 2018). Frossard & Urtasun (2018) designed an end-to-end tracking method that uses different network structures to process point cloud and image data, realizing realtime target detection, information matching, and linear optimization, with both the detection and matching modules built on deep neural networks (Zhang, Zhou & Sun, 2019). Luiten, Fischer & Leibe (2020) employed 3D reconstruction to handle occlusion and improve tracking; their MOTSFusion framework consists of two stages. Cao et al. (2023) designed a tiny-object detection and tracking method, DT, based on the SiamFC tracking network with improved HOG and Harris algorithms; experiments show that the model achieves faster tracking and higher accuracy. Pang et al. (2023) proposed a target-tracking model based on image block matching to overcome the inaccurate tracking caused by blurred and noisy images of high-speed moving targets, and experiments show that this method performs better in high-speed motion tracking. In addition, some scholars (Zhang, Wang & Zhang, 2023) improve the tracking effect by enhancing the object detection algorithm.

However, the current approaches are predominantly terminal-based model architectures, which struggle to meet the realtime and lightweight requirements essential for smart toys. Furthermore, these existing methods fail to achieve multi-scene trajectory tracking and realtime positioning in unknown environments. To address these limitations, this article proposes a novel smart toy trajectory tracking method, TTNet, based on a mobile cloud terminal and depth-first search algorithms. TTNet leverages the computing power and storage capabilities of mobile cloud terminals to ensure realtime performance and a lightweight implementation. By offloading complex computations to the cloud, TTNet maintains rapid response times even for resource-intensive tracking algorithms, allowing seamless tracking of smart toy movements and a smooth, engaging user experience.

Moreover, TTNet incorporates depth-first search algorithms to enable multi-scene trajectory tracking. By utilizing the depth-first search paradigm, TTNet can effectively explore and map unknown environments, enabling accurate realtime positioning even in unexplored scenarios. This capability significantly enhances the adaptability and versatility of the trajectory tracking system, making it suitable for a wide range of real-world applications. The main contributions are as follows:

  • 1) We propose a Transformer-based intelligent toy detection model that achieves precise positioning of toys by designing an adaptive boundary regression model.

  • 2) We propose an intelligent toy trajectory tracking method based on depth-first search, which constructs inter-frame relationships to achieve realtime tracking.

  • 3) We propose a lightweight model deployment method based on mobile cloud terminals to enable mobile applications.

Related works

Object detection technology

Object detection refers to a class of algorithms that extract key targets from a region of interest in images, videos, or point clouds. Obtaining reliable and accurate detection results is of great significance to subsequent trajectory tracking, so the first prerequisite for researching trajectory tracking methods is an accurate target detection algorithm.

In 2012, Krizhevsky, Sutskever & Hinton (2012) proposed the AlexNet model. Since then, detection methods have been divided into two-stage and single-stage approaches. Two-stage methods complete image target detection by first obtaining candidate regions and then performing classification, where the regions to be detected can be obtained through selective search or a region proposal network; Faster R-CNN (Ren, He & Girshick, 2015) is the most representative method. Single-stage algorithms achieve detection with only one feature extraction pass, with YOLO (Lan, Dang & Wang, 2018) as the representative single-stage detector. Ma et al. (2018) designed a target detection model with full anchor training for road target detection; test results on the KITTI dataset demonstrate that the model identifies small and medium objects and occluded objects on the road well. Meng, Rice & Wang (2018) proposed an adaptive candidate region adjustment layer and an adaptive confidence threshold selection method, which reduced the computational parameters of Faster R-CNN and improved detection accuracy. Chen, Yang & Kong (2017) proposed an obstacle-detection algorithm for road scenes; by replacing the VGG16 backbone of the original network with ResNet and improving the region proposal network, the detection accuracy of the original algorithm is effectively improved.

3D data is widely used in environment-sensing tasks because it provides richer scene information. Yang et al. (2020) combine distance-based and feature-based farthest point sampling (FPS); the sampled features are fed into a candidate generation layer, and a 3D detection box is produced by an anchor-free detection head. Graph-based methods realize target detection on 3D point cloud data through graph neural networks: Shi & Rajkumar (2020) represented the point cloud as a graph, performed feature extraction through multi-layer perceptrons, and predicted detection boxes from the iteratively aggregated point features. Some scholars have also proposed target detection with the original point cloud as input by focusing on point cloud temporal information and adding a multi-stage feature extraction network (Miao et al., 2021; Noh, Lee & Ham, 2021).

Target trajectory tracking technology

Multi-target tracking addresses how to associate the same target across different frames. First, a target detector detects objects in each frame of the video, and the detection results are sent to the tracker; the tracker assigns a unique ID to each detected target, and targets with the same ID in different frames represent the same object. Bewley et al. (2016) first employ a Kalman filter to predict the object's region in the next frame and then calculate the IoU between the predicted result and the detections in that frame; if a detection has no corresponding prediction, a new ID is assigned to it. However, if an object is occluded, the algorithm loses it. Wojke, Bewley & Paulus (2017) introduced a feature extraction network to compute appearance features for each detection and then jointly used these features and IoU to match detection boxes. Based on CenterNet, Zhou, Koltun & Krahenbuhl (2020) take the current frame, the previous frame, and the previous heatmap as input, downsample and sum them, and then produce the current frame's heatmap, together with confidence, displacement, and other outputs, through upsampling, convolution, and batch normalization (BN) layers. Based on CenterTrack, Zhang, Wang & Wang (2021) added a Re-ID branch to make the model capable of both detection and tracking: for each pixel, it predicts whether that pixel is an object's center and the Re-ID feature of the image region centered on it.

Intelligent toy track tracking algorithm based on mobile cloud deployment and depth-first search

To address the problems of background interference and concurrent multi-target tracking in intelligent toy tracking, we propose an intelligent toy tracking method based on a mobile cloud terminal and depth-first search (DFS). A Transformer, a long short-term memory network (LSTM), and a depth-first search algorithm are combined to develop an intelligent toy trajectory tracking method that supports intelligent toys across multiple scenes.

Intelligent toy detection model based on Transformer

First, we use the Transformer-based intelligent toy detection model as the toy target locator. Its encoder-decoder structure provides a strong receptive field; the network is shown in Fig. 1. The Transformer model exhibits significant advantages in object detection tasks. With its self-attention mechanism, the Transformer can establish direct connections between elements at different positions in the input sequence, effectively capturing long-range dependencies in images. This capability enables the Transformer to identify object features more precisely, especially when there are subtle differences between objects and their backgrounds and interrelationships between multiple objects.


Figure 1: Intelligent toy detection based on Transformer.

Furthermore, the Transformer's encoder-decoder architecture allows it to map image features directly to object bounding boxes and categories, achieving end-to-end object detection. This concise and efficient approach simplifies the object detection pipeline, improving detection efficiency and accuracy. Notably, the Transformer can process larger images and more objects in a single computation due to its ability to handle large amounts of input data and parameters. This gives the Transformer a significant advantage in handling complex scenes, especially in applications that require detecting multiple objects and processing high-resolution images.

The input image is passed through a convolutional neural network (CNN) backbone to produce a multi-dimensional feature map, which is then flattened into a one-dimensional feature sequence. Combined with the image position coding, the Transformer encoder outputs fixed-length vectors. The decoder is fed with target queries, and the adaptive boundary regression model processes the decoder output to obtain the associated target region and the intelligent toy region.
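For concreteness, the following is a minimal DETR-style sketch of such a pipeline in PyTorch. The backbone choice (ResNet-50), the learned position codes, the query count, and the head sizes are illustrative assumptions, not the exact TTNet configuration.

```python
import torch
import torch.nn as nn
import torchvision

class ToyDetector(nn.Module):
    """Illustrative Transformer detection pipeline: CNN backbone -> 1-D tokens ->
    encoder-decoder -> classification and boundary-point heads."""
    def __init__(self, d_model=512, num_queries=50, num_points=4, max_tokens=400):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])      # CNN feature maps
        self.proj = nn.Conv2d(2048, d_model, 1)                             # channel reduction
        self.pos_embed = nn.Parameter(torch.randn(1, max_tokens, d_model))  # learned position codes
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)               # target queries for decoder
        self.cls_head = nn.Linear(d_model, 2)                               # toy / non-toy logits
        self.point_head = nn.Linear(d_model, num_points * 2)                # boundary points (x, y)

    def forward(self, images):
        feat = self.proj(self.backbone(images))                             # (B, d_model, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).permute(0, 2, 1)                           # 1-D sequence (B, HW, d_model)
        tokens = tokens + self.pos_embed[:, :h * w]                         # add image position coding
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)    # (B, num_queries, d_model)
        hs = self.transformer(tokens, queries)                              # encoder-decoder output
        return self.cls_head(hs), self.point_head(hs).sigmoid()             # class scores, boundary points

# usage sketch: scores, points = ToyDetector()(torch.randn(1, 3, 420, 420))
```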

The adaptive boundary representation of the intelligent toy positioning region is based on the prior shape of the toy, and the accurate toy positioning region is obtained by fitting the coordinates of its boundary points through the adaptive boundary regression model (ABRM), shown in Fig. 2. A boundary proposal box for the target region is obtained through the RPN and ROI layers, and the proposal box is then refined to obtain an accurate boundary box for the specific associated target and intelligent toy in the scene. The boundary points of the proposal box are expressed as (x_1, y_1), (x_2, y_2), …, (x_i, y_i), (x_{i+1}, y_{i+1}), …. In the inference stage, the network adaptively outputs the optimal number of boundary points according to the prior shape of the toy. By incorporating this refined region-adaptive boundary representation, the model can accurately capture the position and boundaries of toys in images, thereby achieving efficient localization of smart toys.


Figure 2: Adaptive boundary regression model.

The loss of each proposal box is defined as the sum of a classification loss, a border regression loss, and a boundary point regression loss:

L = L_{cls}(p, t) + t \sum_{i \in \{x_i, y_i, x_{i+1}, y_{i+1}\}} L_{reg}(v_i, v_i^{*}) + (1 - t) \sum_{i \in \{x_i, y_i, x_{i+1}, y_{i+1}\}} L_{reg}(u_i, u_i^{*})

where the toy area loss is L_{cls}(p, t) = -\log p_t. Here t is the classification label: t = 1 denotes a toy area and t = 0 a non-toy area. The argument p = (p_0, p_1) is the confidence of the toy and non-toy regions computed by softmax.

The border regression loss and the boundary point regression loss can be expressed in a uniform form L_{reg}(w, w^{*}), defined as follows:

L_{reg}(w, w^{*}) = \text{smooth}_{L_1}(w - w^{*})
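A hedged sketch of how this per-proposal loss could be computed in PyTorch follows. The shapes, the reduction, and the decision to gate both regression terms on toy proposals (t = 1) are illustrative assumptions rather than the exact weighting used in the formula above.

```python
import torch
import torch.nn.functional as F

def proposal_loss(cls_logits, box_pred, box_gt, pts_pred, pts_gt, t):
    """Sketch of the proposal loss: classification, border regression and
    boundary-point regression.  Gating both regression terms on t = 1 is an
    assumption for illustration."""
    cls_loss = F.cross_entropy(cls_logits, t)                    # L_cls(p, t) = -log p_t via softmax
    toy = (t == 1).float().unsqueeze(-1)                          # regress only for toy proposals
    border_loss = (toy * F.smooth_l1_loss(box_pred, box_gt, reduction="none")).sum(-1).mean()
    point_loss = (toy * F.smooth_l1_loss(pts_pred, pts_gt, reduction="none")).sum(-1).mean()
    return cls_loss + border_loss + point_loss

# usage sketch (shapes are illustrative, t holds 0/1 labels for 8 proposals):
# loss = proposal_loss(torch.randn(8, 2), torch.randn(8, 4), torch.randn(8, 4),
#                      torch.randn(8, 8), torch.randn(8, 8), torch.randint(0, 2, (8,)))
```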

Intelligent toy trajectory tracking based on depth-first search

After obtaining the toy location area of each frame image, we build the interrelation between frames. An LSTM is used to determine the correlation of toy targets between different frames. The LSTM mainly consists of a memory unit and three gating structures; through the cooperative calculation of the three gates, the model can store and update sequence information in the memory unit over long horizons. A single LSTM cell is shown in Fig. 3. During learning, the state that the LSTM updates over time is the memory cell c_t. At time t, with the input and output of the LSTM denoted a_t and h_t, the input gate i_t is computed as:

i_t = \sigma(W_{ia} a_t + W_{ih} h_{t-1} + b)

where W_{ia} and W_{ih} are linear transformation matrices. The input gate controls how a_t updates the memory cell c_t. The forget gate is the most characteristic structure of the LSTM: it controls which information the memory cell should forget, reducing the probability of interference from unnecessary information at later time steps, and it can also play a dimensionality-reduction role while discarding memory information. Through the forget gate, the memory cell c_{t-1} of the previous moment is updated to c_t, as shown in the formulas:

f_t = \sigma(W_{fa} a_t + W_{fh} h_{t-1} + b)

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{ca} a_t + W_{ch} h_{t-1} + b)


Figure 3: The diagram of LSTM basic structure.

Through the updated memory cell c_t, the output gate of the current moment t is constructed. The output gate calculates the influence of c_t on the output value h_t, as shown in the formulas:

o_t = \sigma(W_{oa} a_t + W_{oh} h_{t-1} + b)

h_t = o_t \odot \tanh(c_t)

where W_{oa} and W_{oh} are linear mapping matrices.
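The gate equations above can be transcribed directly; the sketch below is an illustrative PyTorch cell (in practice nn.LSTMCell implements the same computation, and the dimension names are assumptions).

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Direct transcription of the gate equations above."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_i = nn.Linear(input_dim + hidden_dim, hidden_dim)   # input gate
        self.W_f = nn.Linear(input_dim + hidden_dim, hidden_dim)   # forget gate
        self.W_c = nn.Linear(input_dim + hidden_dim, hidden_dim)   # candidate memory
        self.W_o = nn.Linear(input_dim + hidden_dim, hidden_dim)   # output gate

    def forward(self, a_t, h_prev, c_prev):
        x = torch.cat([a_t, h_prev], dim=-1)
        i_t = torch.sigmoid(self.W_i(x))                     # i_t = sigma(W_ia a_t + W_ih h_{t-1} + b)
        f_t = torch.sigmoid(self.W_f(x))                     # f_t = sigma(W_fa a_t + W_fh h_{t-1} + b)
        c_t = f_t * c_prev + i_t * torch.tanh(self.W_c(x))   # memory-cell update
        o_t = torch.sigmoid(self.W_o(x))                     # o_t = sigma(W_oa a_t + W_oh h_{t-1} + b)
        h_t = o_t * torch.tanh(c_t)                          # h_t = o_t * tanh(c_t)
        return h_t, c_t
```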

However, the association between toys constructed by the LSTM is limited to adjacent frames, which makes multi-frame, multi-toy target tracking difficult. To solve this problem, we embed a depth-first search algorithm in the recurrent loop of the LSTM to ensure the uniqueness of multiple targets across different frames.

First, we represent the toy targets across frames as a deep search graph (DS-Graph). The DS-Graph is divided into two layers: a global relational graph and local relational graphs G_k. Subsequently, a depth-first search tree is constructed to facilitate cross-frame target search. The core idea is to establish an overall relationship between the subgraphs within each site based on the global S-Graph Σ. For each distributed site, we use recursive calls to preprocess the data subgraph G_k (1 ≤ k ≤ p, k ≠ i) within that site; this preprocessing computes the region's local S*-Graph and determines its T_i components. We then apply a merge algorithm to obtain the DFS tree of G_k and identify its leaf nodes. Using these leaf nodes, we search the next G_k, repeating the process with unsearched leaf nodes as roots until no more leaf nodes can serve as roots, as illustrated in Fig. 4. Finally, we establish target chains using the relationships between the searched leaf nodes and root nodes, enabling the tracking of multiple targets across multiple frames.


Figure 4: The process of the deep search graph.
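The following sketch illustrates the DS-Graph idea in a simplified form: detections are graph nodes, edges connect sufficiently similar detections in adjacent frames, and a depth-first traversal produces per-target chains. The similarity predicate and the single-pass construction are assumptions; the paper's global/local graph merge is more elaborate.

```python
def build_target_chains(frames, similar):
    """frames: list of per-frame detection lists; similar: assumed predicate
    (e.g. appearance or IoU match) deciding whether two detections are the same toy."""
    # adjacency: node = (frame_index, detection_index)
    graph = {}
    for k in range(len(frames) - 1):
        for i, det in enumerate(frames[k]):
            for j, nxt in enumerate(frames[k + 1]):
                if similar(det, nxt):
                    graph.setdefault((k, i), []).append((k + 1, j))

    chains, visited = [], set()
    for i, _ in enumerate(frames[0]):                 # roots start in the first frame
        root = (0, i)
        if root in visited:
            continue
        stack, chain = [root], []
        while stack:                                   # iterative depth-first search
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            chain.append(node)
            stack.extend(graph.get(node, []))          # descend to matched detections
        chains.append(chain)
    return chains
```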

In toy tracking, LSTM can be used to process time series data related to toy motion. For example, by analyzing the historical position data of the toy, LSTM can learn the motion pattern of the toy and predict its future position. This predictive ability can be used to improve the accuracy and efficiency of toy tracking. At the same time, the depth-first search algorithm can be used to optimize graph search or path planning problems related to toy tracking. For example, when constructing a path map of possible movements for the toy, DFS can be used to search for all possible paths. By integrating the DFS strategy with the outputs of deep learning networks, we can significantly enhance the performance of target tracking. The deep learning network not only provides precise feature representations and initial position predictions of the target but also, combined with the DFS strategy, ensures that all potential target locations or states are thoroughly explored, leading to the most accurate tracking results. During the tracking process, we start from the initial frame of the video and utilize the target feature representations extracted by the deep learning network to search for the most similar location or state of the target in each subsequent frame. If a satisfactory result is not found in a particular frame, we leverage the backtracking capability of DFS to return to previous frames and explore other possible paths. Through this continuous process of searching and backtracking, we can gradually narrow down the search range and precisely locate the accurate position or state of the target across different frames. This approach improves the accuracy of target tracking and enhances its robustness in complex scenarios.
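A minimal sketch of this search-and-backtrack procedure is given below; the candidate structure, the similarity function, and the acceptance threshold are illustrative assumptions rather than the exact procedure used in TTNet.

```python
def dfs_track(candidates, query_feat, similarity, threshold=0.5):
    """candidates: per-frame lists of dicts with 'feat' (deep feature) and 'box';
    query_feat: feature of the toy in the initial frame; similarity: assumed
    scoring function in [0, 1]."""
    n_frames = len(candidates)

    def search(frame_idx, prev_feat, path):
        if frame_idx == n_frames:                      # every frame matched: accept this trajectory
            return path
        ranked = sorted(candidates[frame_idx],
                        key=lambda c: similarity(prev_feat, c["feat"]), reverse=True)
        for cand in ranked:
            if similarity(prev_feat, cand["feat"]) < threshold:
                break                                  # no acceptable match left in this frame
            result = search(frame_idx + 1, cand["feat"], path + [cand["box"]])
            if result is not None:
                return result                          # complete trajectory found
            # otherwise backtrack and try the next-best candidate in this frame
        return None

    return search(0, query_feat, [])
```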

Model lightweight deployment based on mobile cloud terminal

To deploy the proposed method on intelligent toy terminals, we must ensure its realtime performance and portability. We therefore propose a lightweight model deployment method based on mobile cloud terminals. Mobile cloud terminals provide great convenience for information resource access and collaborative work, but their computing performance and power supply cannot meet the demands of heavy rendering and complex computing tasks for smart toys. Mobile terminals formulate computation migration strategies according to network delay, bandwidth, and energy consumption: tasks with strict latency requirements are processed locally to preserve the user experience, while tasks that consume excessive computing resources are migrated to cloud platform services.
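As a simple illustration of such a migration policy, the function below chooses an execution site from latency, bandwidth, and energy inputs; the thresholds and inputs are assumed values for demonstration, not measured system parameters.

```python
def choose_execution_site(latency_ms, bandwidth_mbps, battery_pct,
                          task_compute_cost, latency_sensitive,
                          latency_budget_ms=50, cost_budget=1.0):
    """Illustrative offloading policy: latency-critical tasks stay local,
    compute-heavy tasks go to the cloud when the network and battery allow."""
    if latency_sensitive and latency_ms > latency_budget_ms:
        return "local"            # a network round-trip would break realtime interaction
    if task_compute_cost > cost_budget and bandwidth_mbps > 5 and battery_pct > 20:
        return "cloud"            # heavy computation offloaded to the cloud platform
    return "local"
```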

When the mobile cloud terminal assigns a task to the cloud for processing, the model does not need to be lightweight; when the task is processed locally, we lighten the proposed model. The Transformer-based smart toy detection model is the most computationally intensive part of the tracking approach, so we reduce and optimize its structure. For the individual attention modules, we adopt a Transformer architecture similar to Segformer (Xie, Wang & Yu, 2021). Reducing the key-value sequence at a single scale is not conducive to segmenting small objects, so we use shunted self-attention (SSA) to improve the results. SSA captures the contextual interrelationships of the target at multiple scales by reducing the key-value sequence at multiple rates. SSA groups the attention heads, and the key-value sequences of different heads are scaled differently. Let i be the head index; SSA can be represented as:

Q_i = X_k W_i^{Q}

K_i = DS(X_k, r_i) W_i^{K}

V_i = DS(X_k, r_i) W_i^{V}

V_i = V_i + LE(V_i)

where X_k represents the input sequence, and Q_i, K_i, and V_i refer to the query, key, and value features mapped from the i-th head, respectively. All W are trainable linear transformation matrices, DS(·, r_i) denotes the downsampling of the i-th head with rate r_i, and LE(·) refers to the enhancement operation on V_i. Self-attention is lightened by this method, and the rest of the computation is unchanged.
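The sketch below illustrates this multi-rate key-value reduction in PyTorch. The per-head rates, the average-pooling downsampler DS(·), and the linear stand-in for LE(·) are assumptions based on the description above rather than the exact SSA implementation.

```python
import torch
import torch.nn as nn

class ShuntedSelfAttention(nn.Module):
    """Simplified shunted self-attention: each head sees keys/values down-sampled
    at a different rate r_i, so attention mixes several context scales."""
    def __init__(self, dim=256, heads=4, rates=(1, 2, 4, 8)):
        super().__init__()
        assert dim % heads == 0 and len(rates) == heads
        self.heads, self.dk, self.rates = heads, dim // heads, rates
        self.q = nn.Linear(dim, dim)
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * self.dk) for _ in rates])
        self.le = nn.ModuleList([nn.Linear(self.dk, self.dk) for _ in rates])  # stand-in for LE(V)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, dim) token sequence
        B, N, _ = x.shape
        q = self.q(x).view(B, N, self.heads, self.dk).transpose(1, 2)       # (B, heads, N, dk)
        outs = []
        for i, r in enumerate(self.rates):
            xr = nn.functional.avg_pool1d(x.transpose(1, 2), kernel_size=r, stride=r)
            xr = xr.transpose(1, 2)                                          # DS(X, r_i): (B, N/r, dim)
            k, v = self.kv[i](xr).chunk(2, dim=-1)                           # K_i, V_i
            v = v + self.le[i](v)                                            # V_i = V_i + LE(V_i)
            attn = (q[:, i] @ k.transpose(-2, -1)) / self.dk ** 0.5
            outs.append(attn.softmax(-1) @ v)                                # (B, N, dk)
        return self.out(torch.cat(outs, dim=-1))                             # merge heads
```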

The lightweight deployment of the smart toy tracking model enables it to run more efficiently on mobile devices, reducing delays and stuttering and providing users with a smoother and more natural interactive experience. The model undergoes lightweight processing to reduce data volume and computational complexity, lowering the hardware requirements for mobile devices. The smart toy tracking model requires realtime tracking of the toy’s position and status to allow users to control and interact accurately. Lightweight deployment reduces the time cost of model operation and improves realtime performance while maintaining high tracking accuracy, ensuring that users receive timely and accurate information feedback.

In summary, the scalability issue has been effectively addressed through comprehensive optimization measures such as algorithm optimization, model lightweighting, data processing, system architecture improvements, and scalability testing. These measures enable the system to maintain efficient and stable operation in large-scale data and complex environments.

Experiment and analysis

Dataset and implementation details

We use the video from the Trajectory Robot Dataset (https://www.zenodo.org/record/6337847, 10.5281/zenodo.6337847) to test the intelligent toy trajectory tracking algorithm based on mobile cloud deployment and depth-first search. The dataset comprises color and depth videos capturing the movements of the Panda robot, along with their respective joint and Cartesian trajectories, and it also includes the trajectories of a receiver robot involved in object handover. Each motion instance contains six files: RGB video, depth video, and four trajectories (pertaining to the giver and receiver) in a time series format. The dataset contains a total of 38,393 motion samples, with video data captured in various real-world scenarios covering diverse lighting conditions, weather situations, and object movement patterns. It offers training data for multi-object tracking, assisting robots in learning how to manage tracking tasks for multiple targets efficiently. By training on this diverse dataset, video trajectory robot models can better adapt to various real-world scenarios, enhancing their generalization capabilities.

Since the Transformer model requires extensive computational resources for training on large datasets, this study leverages the robust computing capabilities of a supercomputer center equipped with CPUs (Xeon(R) E5-2640 v4) and GPUs (4 × Nvidia Tesla V100) to set up the environment and train the model. PyTorch is used as the deep learning framework. To adapt to the cross-modal training task of the Transformer, we pre-trained Faster R-CNN. Unlike image grid features, our method needs accurate target category and location information as input; therefore, the Objects365, MSCOCO, OpenImages, and Visual Genome datasets were employed to pre-train the object detection model. Experimental parameters are presented in Table 1.

Table 1:
Implementation details.
Parameter Value
Initial learning rate 4 × 10−4
Epochs 40
Batch size 10
Decay 0.9
Optimizer Adam
Image input size 420 × 420
Image feature dimension 512
DOI: 10.7717/peerj-cs.2187/table-1
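For illustration, the Table 1 settings could be wired into a PyTorch training setup as follows; the model object is a placeholder and the learning rate is read as 4 × 10−4.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                                # placeholder for the network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)                 # initial learning rate
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay 0.9
EPOCHS, BATCH_SIZE = 40, 10
INPUT_SIZE, FEATURE_DIM = (420, 420), 512
```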

Considering that intelligent toy trajectory tracking is a task judged over successive frames, we employ mean average precision (mAP) and the F-value as the evaluation criteria, defined as follows:

Precision = \frac{V(gt \cap pr)}{V(pr)}

Recall = \frac{V(gt \cap pr)}{V(gt)}

F = \frac{2 \times Precision \times Recall}{Precision + Recall}

mAP = \frac{1}{N} \sum_{n} Precision_n \times Recall_n

where pr denotes the result predicted by the model, gt refers to the ground truth, and V(·) denotes the area of a region. In the context of smart toy tracking, the rationale for selecting mAP and the F-value as evaluation metrics lies in their ability to comprehensively and accurately reflect the precision and recall of the model in detecting multiple target toys, thereby effectively assessing the model's performance.
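A small sketch of these metrics is given below, treating V(·) as the area of a region and averaging precision × recall over N samples; the exact averaging used in the experiments may differ.

```python
def region_metrics(gt_area, pr_area, inter_area):
    """Frame-level precision, recall and F-value from region areas,
    following the formulas above."""
    precision = inter_area / pr_area if pr_area else 0.0
    recall = inter_area / gt_area if gt_area else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def mean_ap(per_sample_pr):
    """per_sample_pr: list of (precision_n, recall_n) pairs; returns (1/N) * sum(P_n * R_n)."""
    n = len(per_sample_pr)
    return sum(p * r for p, r in per_sample_pr) / n if n else 0.0
```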

Compare our detection method with other methods

We conducted experiments with our method on the Trajectory Robot Dataset. We selected several strong feature models, namely SVM (Chauhan, Dahiya & Sharma, 2019), Faster R-CNN, Transformer (Vaswani, Shazeer & Parmar, 2017), Bert (Deepa, 2021), Vit (Khan, Naseer & Hayat, 2022), Deit (Touvron, Cord & Douze, 2021), and SwinTransformer (Liu, Lin & Cao, 2021), and compared their performance. The results are presented in Fig. 5 and Table 2. Compared with the other target localization methods, our method achieves the highest values on all evaluation indexes, namely 0.873 recall, 0.883 precision, 0.879 F-value, and 0.858 mAP. Compared with SVM, our method increases the mAP value by more than 27%, mainly because SVM struggles to fit large-scale datasets. Compared to Faster R-CNN, our method leads by more than 12%: unlike SVM, Faster R-CNN can achieve rapid convergence but cannot completely avoid the interference caused by complex backgrounds. The self-attention-based Transformer and Bert can handle detection tasks in complex environments, yet our approach still improves the mAP score by more than 5.5%. Vit and Deit are advanced detection methods that achieve mAP values above 80% thanks to excellent model performance; our method still leads by more than 4%. Finally, our approach also outperforms Swin Transformer across all metrics. Our approach uses an adaptive candidate box mechanism that helps the model overcome erroneous detections caused by complex scenes and thus obtains more reliable detection results.


Figure 5: Compare our detection method with other methods.

Table 2:
Compare the proposed method with others.
Methods Recall Precision F mAP
SVM 0.548 0.517 0.523 0.584
Faster R-CNN 0.763 0.734 0.753 0.732
Transformer 0.786 0.814 0.799 0.783
Bert 0.804 0.834 0.816 0.795
Vit 0.821 0.843 0.832 0.801
Deit 0.837 0.856 0.845 0.812
SwinTransformer 0.854 0.876 0.867 0.834
Ours 0.873 0.883 0.879 0.858
DOI: 10.7717/peerj-cs.2187/table-2

In addition, we compare our training process with the other models; the loss curves are shown in Fig. 6. Our model converges at around epoch 35, whereas the other models typically converge at around epoch 40. Moreover, compared with Bert, Vit, and Swin Transformer, our model converges earlier and its training process is smoother.


Figure 6: The loss of our model compared with others.

Compare our tracking method with other methods

Having validated our object detection method, we evaluate the performance of our trajectory tracking model with Expected Average Overlap (EAO) and Multiple Object Tracking Precision (MOTP). We compare our method with several strong models: DeepSORT (Veeramani, Raymond & Chanda, 2018), FlowTrack (Zhu, Wu & Zou, 2018), CenterTrack (Zhou, Koltun & Krahenbuhl, 2020), and the spatio-temporal Transformer (Yan, Peng & Fu, 2021). As shown in Table 3 and Figs. 7 and 8, our method demonstrates outstanding performance compared with the other tracking algorithms, achieving the highest scores across all evaluation metrics: an EAO score of 0.351, a MOTP score of 0.885, and a MOTA score of 0.916. Compared to DeepSORT, our method improves the EAO score by 3.9% and the MOTP score by 6.2%. Compared to FlowTrack, we achieve a lead of over 5% in the MOTP score, albeit with slightly higher inference time and model parameters; this minor cost yields significant performance gains. Compared to CenterTrack, our method takes the lead across all metrics while requiring fewer model parameters and a shorter inference time. This comprehensive advantage is primarily attributed to the integrated deep search mechanism, which significantly enhances the performance of our method. Finally, compared with the spatio-temporal Transformer, our method achieves improvements of 1.3% in EAO, 1.8% in MOTP, and 1.4% in MOTA. This is because we build the tracking method on a structurally simple LSTM and further enhance its performance with the depth-first search algorithm, which allows us to maintain performance advantages while reducing the model's size. By leveraging deep learning networks and incorporating the concept of depth-first search, the model can explore potential locations or states of a target across different video frames and combine this information with the outputs of the neural network. Experiments show that through continuous searching and backtracking, we can gradually narrow down the search space and ultimately identify the accurate position or state of the target in different frames.

Table 3:
Compare our tracking method with other methods.
Methods EAO MOTP MOTA
DeepSORT 0.312 0.823 0.867
FlowTrack 0.345 0.831 0.879
CenterTrack 0.324 0.856 0.896
Spatio-temporal transformer 0.338 0.867 0.902
Ours 0.351 0.885 0.916
DOI: 10.7717/peerj-cs.2187/table-3

Figure 7: Compare our tracking method with other methods.


Figure 8: Model efficiency comparison with other methods.

Ablation experiments

We conduct ablation experiments on a dataset focusing on three sub-modules: Transformer, LSTM, and DFS, to evaluate their impacts on model performance. Initially, we observe the performance of the baseline model, which achieves an EAO (Expected Average Overlap) score of 0.256 and a MOTP (Multiple Object Tracking Precision) score of 0.792, as seen in Table 4. Subsequently, we embed the Transformer, LSTM, and DFS modules into the baseline model separately to explore their contributions to improving the model’s performance. The experimental results show that embedding the Transformer increases the EAO score to 0.298, embedding the LSTM raises it to 0.304, and embedding the DFS further enhances it to 0.319. These results strongly demonstrate the effectiveness of these three modules in enhancing model performance. To further investigate the combined effects of these modules, we conduct ablation experiments with pairwise combinations. The experimental results reveal that when combining Transformer and LSTM, the model achieves a MOTP score of 0.859; when combining Transformer and DFS, the MOTP score reaches 0.865; and when combining LSTM and DFS, the MOTP score attains 0.868. These findings further illustrate the complementarity between different modules and their synergistic effects on different tasks. Finally, we simultaneously embed all three modules—Transformer, LSTM, and DFS—into the baseline model to explore their optimal performance when combined. The experimental results indicate that the combination of these three modules achieves an outstanding MOTP score of 0.885. This result validates the respective advantages of these three modules and showcases the powerful synergistic effect they can produce when combined.

Table 4:
Ablation experiments.
Transformer LSTM DFS EAO MOTP
– – – 0.256 0.792
O – – 0.298 0.831
– O – 0.304 0.829
– – O 0.319 0.811
O O – 0.338 0.859
O – O 0.328 0.865
– O O 0.341 0.868
O O O 0.351 0.885
DOI: 10.7717/peerj-cs.2187/table-4

Conclusion

To improve the playability and portability of smart toys, we propose an intelligent toy tracking model based on mobile cloud terminal deployment and a depth-first search algorithm. A Transformer-based intelligent toy detection model is proposed to improve single-frame toy detection. An LSTM with an embedded depth-first search mechanism is used to realize the trajectory tracking of intelligent toys across successive frames. In addition, to improve the portability of our method, a lightweight model deployment method based on mobile cloud terminals is proposed to enable the multi-functional application of smart toys. Experiments demonstrate that our toy positioning method obtains a mAP value of 0.858 and achieves accurate detection within a single frame, and our continuous-frame toy tracking method obtains a MOTA value of 0.916, realizing realtime tracking of intelligent toys.
