Adaptive resilient containment control using reinforcement learning for nonlinear stochastic multi-agent systems under sensor faults

PeerJ Computer Science

Introduction

Multi-agent systems (MASs) have garnered considerable attention due to their ability to organize vast and intricate systems into smaller, intercommunicating, easily coordinated, and manageable subsystems. Currently, MASs find widespread applications in various domains such as aircraft formation, sensor networks, data fusion, parallel computing and cooperative control of multiple robots (Antonio et al., 2021; Tang et al., 2016; Liu et al., 2020; Zhao et al., 2023; De Sá & Neto, 2023). As a classical problem in cooperative control, containment control guarantees the convergence of all followers to a dynamic convex hull formed by multiple leaders. Numerous findings on containment control have been documented in the last decade (Li et al., 2022; Li, Pan & Ma, 2022; Li et al., 2023; Liang et al., 2021).

It is noteworthy that optimal control, formally introduced by Bellman (1957) and Pontryagin et al. (1962) half a century ago, has become the foundation and prevailing design paradigm of modern control systems. The key to solving the optimal control problem lies in solving the Hamilton–Jacobi–Bellman (HJB) equation. Theoretically, solving optimal control based on the HJB equation is nearly impossible using analytical methods due to its strong nonlinearity (Beard, Saridis & Wen, 1996). Fortunately, Werbos (1992) introduced the approximation technique referred to as adaptive dynamic programming (ADP) or reinforcement learning (RL), providing an effective method for solving the HJB equation. To date, this technique has witnessed significant development and achievements, as seen in Wen, Xu & Li (2023), Chen, Dai & Dong (2022), Gao & Jiang (2018), Zargarzadeh, Dierks & Jagannathan (2012), Zargarzadeh, Dierks & Jagannathan (2015), Li, Sun & Tong (2019), Song & Dyke (2013), Hu & Zhu (2015), and Rajagopal, Balakrishnan & Busemeyer (2017). In Wen, Xu & Li (2023), RL was combined with backstepping to design actual and virtual controls, optimizing the overall control of high-order systems. In Chen, Dai & Dong (2022), this technique was applied to underactuated surface vessels, ensuring optimal tracking performance for ship control. Gao & Jiang (2018) addressed the computation problem of adaptive nearly optimal trackers without prior knowledge of system dynamics. Zargarzadeh, Dierks & Jagannathan (2012) investigated neural network-based adaptive optimal control for nonlinear continuous-time systems with known dynamics in strict-feedback form, and Zargarzadeh, Dierks & Jagannathan (2015) extended this work to nonlinear continuous-time systems with uncertain dynamics in strict-feedback form by adapting the standard backstepping technique, transforming the optimal tracking problem into an equivalent optimal control problem and generating adaptive control inputs. Li, Sun & Tong (2019) presented a data-driven robust approximate optimal tracking scheme for a class of strict-feedback single-input, single-output nonlinear systems characterized by unknown non-affine nonlinear faults and unmeasured states. In addition to deterministic nonlinear systems, various optimal control methods have been explored for stochastic systems in the past decade. The numerical techniques proposed by Song & Dyke (2013) aimed to reduce system responses under extreme loading conditions with stochastic excitations. Hu & Zhu (2015) introduced a stochastic optimization-based bounded control strategy for multi-degree-of-freedom strongly nonlinear systems. In Rajagopal, Balakrishnan & Busemeyer (2017), an offline ADP method based on neural networks was developed to address finite-time stochastic optimal control problems. Specifically, Wen, Xu & Li (2023) applied the RL strategy with the actor-critic architecture to stochastic nonlinear strict-feedback systems. However, for more complex nonlinear stochastic MASs, the above methods have not been fully studied. The challenge lies in the stability analysis, where the quadratic form of the Lyapunov function is no longer applicable and system stability must be re-proved. Furthermore, in contrast to the single-agent stochastic strict-feedback system discussed in Wen, Xu & Li (2023), this paper considers complex multi-agent systems.
Many practical multi-agent systems, especially in areas like intelligent transportation and smart grids, tackle complex large-scale problems that surpass the capabilities of individual nonlinear systems. Therefore, research on nonlinear multi-agent systems is more meaningful.

Furthermore, in real-world scenarios, MASs comprise numerous actuators and sensors. Faults in some actuators or sensors can lead to deviation from the global control objectives. Therefore, investigating fault-tolerant control for MASs can enhance their safety and reliability. For instance, Ding et al. (2018) applied a region-based segmentation analysis to overcome the difficulties caused by multiple sensor faults in strict-feedback systems. Wang et al. (2018) introduced a fault model to achieve fault-tolerant consensus for a multi-vehicle wireless network system with different actuator faults. Cao et al. (2021) fully considered consensus problems in MASs with sensor faults, utilizing neural networks not only for identifying unknown nonlinearities but also for designing adaptive compensatory controllers. Although there have been studies related to sensor faults, the conclusions of the above research cannot be directly applied to stochastic systems whose randomness carries statistical characteristics.

Inspired by the discussions above, this paper presents an enhanced backstepping control method tailored for a class of nonlinear stochastic strict-feedback MASs experiencing sensor faults. The primary contributions are summarized as follows:

(1) In this article, the optimal backstepping (OB) control method is extended to the nonlinear stochastic MASs with multiple leaders, which is more general than the consensus control results of MASs and can solve the optimal containment control problem.

(2) Suppressing sensor faults is important to enhance the system’s safety and reliability. To tackle the challenge posed by sensor faults in stochastic MASs, consideration is given to an adaptive neural network (NN) compensation control method. This method is designed to alleviate the adverse effects of sensor faults on the MASs.

(3) The proposed adaptive control scheme successfully solves the containment control problem with sensor faults, and the designed RL optimization method can optimize the control of unknown or uncertain stochastic dynamic systems.

Preliminaries and Problem Formulation

Graph theory

In the context of a group of N + M agents, the associated directed graph $\mathfrak{G}$ can be described by $\mathfrak{G} = (\mathfrak{V}, \mathfrak{E}, \Lambda)$, where $\mathfrak{V} = \{1, 2, \ldots, N, \ldots, N+M\}$ constitutes a set of nodes and $\mathfrak{E} = \{(j, i)\} \subseteq \mathfrak{V} \times \mathfrak{V}$ represents a set of edges. The adjacency matrix is $\Lambda = [a_{ij}] \in \mathbb{R}^{(N+M) \times (N+M)}$; an edge $(j, i) \in \mathfrak{E}$ implies that nodes $j$ and $i$ can share information with one another. The entry $a_{ij}$ is defined as
$$a_{ij} = \begin{cases} 1, & \text{if } (j, i) \in \mathfrak{E} \\ 0, & \text{if } (j, i) \notin \mathfrak{E} \end{cases}$$

where the set of neighbors of node $i$ is denoted by $\mathfrak{N}_i = \{j \in \mathfrak{V} : (j, i) \in \mathfrak{E}\}$. The Laplacian matrix $\mathbb{L} = [l_{ij}]_{(M+N) \times (M+N)} = \mathfrak{D} - \Lambda \in \mathbb{R}^{(N+M) \times (N+M)}$ is defined as
$$l_{ij} = \begin{cases} -a_{ij}, & \text{if } i \neq j \\ \sum_{j \in \mathfrak{N}_i} a_{ij}, & \text{if } i = j \end{cases}$$

where $\mathfrak{D} = \mathrm{diag}\{d_1, \ldots, d_N\}$ represents the degree matrix and $d_i = \sum_{j \in \mathfrak{N}_i} a_{ij}$. In this paper, the focus is on N + M agents, comprising N followers and M leaders, within a directed graph topology. It is assumed that each follower has at least one neighbor. Consequently, one can observe
$$\mathbb{L} = \begin{bmatrix} \mathbb{L}_1 & \mathbb{L}_2 \\ 0_{M \times N} & 0_{M \times M} \end{bmatrix}$$

where $\mathbb{L}_1 \in \mathbb{R}^{N \times N}$ and $\mathbb{L}_2 \in \mathbb{R}^{N \times M}$.

Assumption 1: Each follower is connected to a minimum of one leader through a directed path, while leaders themselves lack neighboring nodes.

Lemma 1: Under Assumption 1, the matrix $\mathbb{L}_1$ is symmetric and positive definite, each element of $-\mathbb{L}_1^{-1}\mathbb{L}_2$ is a nonnegative scalar, and all row sums of $-\mathbb{L}_1^{-1}\mathbb{L}_2$ equal 1.

Assumption 2 (Yoo, 2013): The multiple leaders' outputs $y_{ld}$, $l \in \{N+1, \ldots, N+M\}$, and their derivatives $\dot{y}_{ld}, \ddot{y}_{ld}, \ldots, y_{ld}^{(n)}$ are bounded.

Lemma 2 (Tong et al., 2011a): Suppose there exists a continuously differentiable function $V(t, x) \in \mathbb{R}^+$ that satisfies
$$\nu_1(\|x\|) \leq V(t, x) \leq \nu_2(\|x\|), \qquad \mathcal{L}V(t, x) \leq -aV(t, x) + c$$

where $a > 0$, $c > 0$ are constants and $\nu_1(\cdot)$, $\nu_2(\cdot)$ are $\mathcal{K}_\infty$ functions. Then the stochastic differential equation admits a unique strong solution, and the subsequent inequality is satisfied:
$$E[V(t, x)] \leq e^{-at}V(0, x_0) + \frac{c}{a}.$$

This inequality signifies that the solution $x(t)$ is SGUUB in the sense of expectation.

Lemma 3 (Wang, Wang & Peng, 2015): Defining $s_{*1} = [s_{11}, s_{21}, \ldots, s_{N1}]^T$ and $y_i = [y_1, y_2, \ldots, y_N]^T$, we have $s_{*1} = \mathbb{L}_1 y_i + \mathbb{L}_2 y_{ld}$. Then the following inequality holds:
$$\left\|y_i + \mathbb{L}_1^{-1}\mathbb{L}_2 y_{ld}\right\| \leq \frac{\|s_{*1}\|}{\bar{\eta}(\mathbb{L}_1)}$$

where $\bar{\eta}(\mathbb{L}_1)$ is the minimum singular value of $\mathbb{L}_1$.

Lemma 4 (Young's inequality (Tong et al., 2011)): For all $x, y \in \mathbb{R}^+$, the subsequent inequality holds:
$$xy \leq \frac{1}{p}x^p + \frac{1}{q}y^q$$

where $p > 1$, $q > 1$, and $1/p + 1/q = 1$.

Stochastic systems statement

Consider a group of nonlinear stochastic MASs described as follows:
$$\begin{cases} dx_{im} = \left(x_{i,m+1} + f_{im}(\bar{x}_{im})\right)dt + \psi_{im}^T(\bar{x}_{im})\,dw, & m = 1, \ldots, n-1 \\ dx_{in} = \left(u_i + f_{in}(\bar{x}_{in})\right)dt + \psi_{in}^T(\bar{x}_{in})\,dw \\ y_i = h(x_{i1}) \end{cases}$$

where $\bar{x}_{im} = [x_{i1}, \ldots, x_{im}]^T \in \mathbb{R}^m$ represents the state vector, $u_i \in \mathbb{R}$ denotes the control input, and $y_i \in \mathbb{R}$ is the system output. $h(x_{i1}) = k_i(t)x_{i1} + \rho_i(t)$, where $k_i(t)$ and $\rho_i(t)$ denote the parameters of sensor faults. $f_{im}(\cdot) \in \mathbb{R}$ and $\psi_{im}(\cdot) \in \mathbb{R}^r$ depict uncertain smooth functions, and $w \in \mathbb{R}^r$ denotes an independent r-dimensional standard Brownian motion.
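To make these dynamics concrete, the following minimal Python sketch integrates one second-order agent of this class with the Euler–Maruyama scheme. The drift and diffusion terms, the placeholder feedback `u`, and the fault parameters are illustrative assumptions, not the paper's design:

```python
import numpy as np

def simulate_agent(T=10.0, dt=1e-3, seed=0):
    """Euler-Maruyama integration of one 2nd-order agent of the stated class:
       dx1 = (x2 + f1(x1)) dt + psi1(x1) dw,
       dx2 = (u + f2(x1)) dt + psi2(x1) dw,
       y   = k(t) * x1 + rho(t)   (sensor-fault output).
    """
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    x = np.zeros((steps + 1, 2))
    y = np.zeros(steps + 1)
    for k in range(steps):
        x1, x2 = x[k]
        u = -2.0 * x1 - 1.5 * x2                  # placeholder feedback standing in for u_i
        dw = rng.normal(0.0, np.sqrt(dt))         # Brownian increment ~ N(0, dt)
        x[k + 1, 0] = x1 + (x2 - 0.5 * x1**2) * dt + 0.3 * np.sin(x1) * dw
        x[k + 1, 1] = x2 + (u + 0.9 * np.sin(x1)) * dt + 0.01 * np.sin(x1) * dw
        y[k + 1] = 0.8 * x[k + 1, 0]              # loss-of-effectiveness fault: k = 0.8, rho = 0
    return x, y
```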

Neural network approximation

It has been shown that a neural network (NN) can approximate any continuous function $F(x): \mathbb{R}^n \to \mathbb{R}^m$ to a desired accuracy within a specified compact set $\Omega_x \subset \mathbb{R}^n$. The NN approximation function can be represented as follows:
$$F_{NN}(x) = W^T S(x)$$

where $W \in \mathbb{R}^{q \times m}$ is the weight matrix, $q$ is the number of neurons, and $S(x) = [s_1(x), \ldots, s_q(x)]^T \in \mathbb{R}^q$ is the Gaussian basis function vector with
$$s_i(x) = \exp\left(-\frac{(x - \nu_i)^T (x - \nu_i)}{\varphi_i^2}\right) \in \mathbb{R},$$
where $\nu_i = [\nu_{i1}, \ldots, \nu_{in}]^T \in \mathbb{R}^n$ represents the center of the receptive field and $\varphi_i$ is the width of the Gaussian function.

To fulfill Eq. (10), there must exist an ideal weight $W^*$, and the function $F(x)$ can be rewritten as
$$F(x) = W^{*T} S(x) + \varepsilon(x)$$

where $\varepsilon(x) \in \mathbb{R}^m$ is the approximation error, required to satisfy $\|\varepsilon(x)\| \leq \delta$ with $\delta$ a positive constant.

The ideal weight matrix $W^*$ is given by
$$W^* = \arg\min_{W \in \mathbb{R}^{q \times m}} \left\{ \sup_{x \in \Omega_x} \left\|F(x) - W^T S(x)\right\| \right\}.$$

Eq. (12) implies that the NN approximation error in Eq. (11) represents the minimum achievable deviation between $F(x)$ and $W^{*T} S(x)$.
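As a quick illustration of this approximation property, the following sketch fits the Gaussian-basis network $W^T S(x)$ to a scalar function by least squares over a compact set; the target function, centers, and width are arbitrary demonstration choices:

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """S(x) = [exp(-(x - nu_i)^2 / phi^2)] for scalar x (vectorized over samples)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / width ** 2)

# Target F(x) to approximate on the compact set [-2, 2].
x = np.linspace(-2.0, 2.0, 400)
F = np.sin(2.0 * x) + 0.5 * x**2

centers = np.linspace(-2.0, 2.0, 15)   # receptive-field centers nu_i
width = 0.5                            # Gaussian width phi_i
S = gaussian_basis(x, centers, width)  # (400, 15) basis matrix

# Least-squares weights approximate the ideal W* on the sampled set.
W, *_ = np.linalg.lstsq(S, F, rcond=None)
print("max |F - W^T S|:", np.abs(F - S @ W).max())
```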

Sensor faults

Within the sensor fault model (Bounemeur, Chemachema & Essounbouli, 2018), the unspecified parameters adhere to $0 < \bar{k}_{i\min} \leq k_i(t) \leq 1$ and $\underline{\rho}_i \leq \rho_i(t) \leq \bar{\rho}_i$, where $\bar{k}_{i\min} > 0$ represents the minimum sensor effectiveness, and $\underline{\rho}_i$, $\bar{\rho}_i$ are the lower and upper bounds, respectively. The parameters of the sensor fault models can be summarized as below:

  1. If $k_i(t) = 1$ and $\rho_i(t)$ is a constant, the sensor exhibits a bias fault.

  2. If $k_i(t) = 1$ and $|\rho_i(t)| = \iota t$ with $0 < \iota \ll 1$, the sensor experiences a drift fault.

  3. If $k_i(t) = 1$, $|\rho_i(t)| < \bar{\rho}_i$, and $\rho_i(t) \neq 0$, the sensor has incurred a loss of accuracy.

  4. If $0 < \bar{k}_{i\min} \leq k_i(t) < 1$ and $\rho_i(t) = 0$, the sensor has undergone a loss of effectiveness.

Denote $f_{si} = (k_i(t) - 1)x_{i1} + \rho_i(t)$. Then $y_i$ can be reformulated as $y_i = x_{i1} + f_{si}$, and its derivative can be rewritten as $\dot{y}_i = \dot{x}_{i1} + f_{psi}$, where $f_{psi} = \dot{f}_{si}$.
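The four fault modes translate directly into time profiles for $k_i(t)$ and $\rho_i(t)$. A small sketch generating faulty outputs $y_i = k_i(t)x_{i1} + \rho_i(t)$ for each mode follows; the specific constants are assumptions for illustration:

```python
import numpy as np

def faulty_output(x1, t, mode):
    """Apply the sensor fault model y = k(t)*x1 + rho(t) for one fault mode."""
    if mode == "bias":                     # k = 1, rho a nonzero constant
        k, rho = 1.0, 0.2
    elif mode == "drift":                  # k = 1, |rho| = iota * t, 0 < iota << 1
        k, rho = 1.0, 0.01 * t
    elif mode == "loss_of_accuracy":       # k = 1, rho bounded and nonzero
        k, rho = 1.0, 0.1 * np.sin(t)
    elif mode == "loss_of_effectiveness":  # k_min <= k < 1, rho = 0
        k, rho = 0.6, 0.0
    else:                                  # healthy sensor
        k, rho = 1.0, 0.0
    return k * x1 + rho

t = np.linspace(0.0, 10.0, 1001)
x1 = np.cos(t)                             # example true state trajectory
y_drift = faulty_output(x1, t, "drift")
```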

Operator 𝔏

For a function $V(t, x)$, its differential operator $\mathcal{L}$ is calculated as (Mao, 2006)
$$\mathcal{L}V = \left(\frac{\partial V}{\partial x}\right)^T \left(f(x) + g(x)u(x)\right) + \frac{1}{2}\mathrm{Tr}\left\{\psi^T \frac{\partial^2 V}{\partial x \partial x^T} \psi\right\}$$

where $\mathrm{Tr}\{\cdot\}$ signifies the matrix trace.
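For intuition, $\mathcal{L}V$ can be evaluated symbolically. The sketch below computes it for the scalar choice $V = \frac{1}{4}x^4$ (the quartic form used throughout the design) with illustrative drift and diffusion terms, which are assumptions for demonstration:

```python
import sympy as sp

x, u = sp.symbols("x u", real=True)
V = x**4 / 4                           # quartic Lyapunov candidate
f = -x + sp.sin(x)                     # illustrative drift f(x), with g(x) = 1
psi = sp.Rational(3, 10) * sp.sin(x)   # illustrative diffusion psi(x)

# Scalar case of the operator: LV = V_x * (f + g*u) + (1/2) * psi^2 * V_xx
LV = sp.diff(V, x) * (f + u) + sp.Rational(1, 2) * psi**2 * sp.diff(V, x, 2)
print(sp.simplify(LV))                 # symbolic expression for LV
```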

Distributed Adaptive Optimal Containment Control

The backstepping technique is employed for controller design. Before we begin, to clearly demonstrate our ideas and process, let’s provide a brief overview using Fig. 1.

Figure 1: OB design in the i th agent, i = 1, …,n.

Figure 1 illustrates the application process of RL in the design of optimized backstepping control. This process employs a critic-actor architecture to address the containment control problem for nonlinear MASs. Within this method, the actor network is responsible for generating control actions, while the critic network evaluates the performance of the current control strategy. By iterating between these two networks, the RL algorithm can learn a control strategy that optimizes the performance of the entire system.

Specifically, the optimal control problem is transformed into solving the HJB equation. However, due to the nonlinearity of the HJB equation, solving it directly is very challenging. To overcome this difficulty, a neural network-based RL method is proposed. This method derives the RL update rules from the negative gradient of a simple positive function, thereby avoiding the direct handling of multiple nonlinear terms in the HJB equation. This not only simplifies the algorithm but also relaxes the requirements for known system dynamics and persistent excitation.

During the RL learning process, the critic network first evaluates the performance of the current control strategy and provides it as feedback to the actor network. The actor network then adjusts its control actions based on this feedback, with the expectation of improving the system’s performance. In this way, the RL algorithm can continuously learn and optimize the control strategy through iteration until the optimal solution is found.

To start with, the i-th subsystem's distributed containment errors are defined as
$$s_{i1} = \sum_{j=1}^{N} a_{ij}\left(y_i - y_j\right) + \sum_{l=N+1}^{N+M} a_{il}\left(y_i - y_{ld}\right), \qquad s_{im} = x_{im} - \alpha_{i,m-1}, \quad m = 2, \ldots, n$$

where $\alpha_{i,m-1}$ denotes the virtual controller. The OB control method is designed as follows.
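Under the stated graph conventions, the first-step errors $s_{i1}$ follow directly from the adjacency weights. A minimal sketch (the adjacency rows below correspond to the communication graph used later in the simulation example; the output values are placeholders):

```python
import numpy as np

def containment_errors(A, y, y_ld):
    """s_{i1} = sum_j a_ij (y_i - y_j) + sum_l a_il (y_i - y_ld).
    A: (N, N+M) adjacency rows of the N followers over all agents;
    y: follower outputs (N,); y_ld: leader outputs (M,)."""
    out = np.concatenate([y, y_ld])          # outputs of followers, then leaders
    s1 = np.zeros(len(y))
    for i in range(len(y)):
        s1[i] = np.sum(A[i] * (y[i] - out))  # weighted disagreement with all neighbors
    return s1

A = np.array([[0., 1., 0., 0., 1., 0.],      # follower 1 hears follower 2 and leader 5
              [0., 0., 0., 0., 1., 1.],
              [0., 1., 0., 1., 0., 1.],
              [1., 0., 0., 0., 0., 1.]])
y = np.array([0.1, -0.2, 0.05, 0.3])
y_ld = np.array([0.0, 0.2])
print(containment_errors(A, y, y_ld))
```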

Step 1: With Eq. (14) and the Itô formula, the containment error dynamics can be calculated as follows:
$$ds_{i1} = \left[d_i\left(x_{i2} + f_{psi} + f_{i1}(x_{i1})\right) - \sum_{j=1}^{N} a_{ij}\left(\dot{x}_{j1} + \dot{f}_{sj}\right) - \sum_{l=N+1}^{N+M} a_{il}\dot{y}_{ld}\right]dt + \left[d_i\psi_{i1}(x_{i1}) - \sum_{j=1}^{N} a_{ij}\psi_{j1}(x_{j1})\right]dw = \left[d_i x_{i2} - \sum_{j=1}^{N} a_{ij}x_{j2} + F_{i1}\right]dt + \Psi_{i1}\,dw$$

where
$$F_{i1} = d_i\left(f_{psi} + f_{i1}(x_{i1})\right) - \sum_{j=1}^{N} a_{ij}\left(f_{psj} + f_{j1}(x_{j1})\right) - \sum_{l=N+1}^{N+M} a_{il}\dot{y}_{ld}, \qquad \Psi_{i1} = d_i\psi_{i1}(x_{i1}) - \sum_{j=1}^{N} a_{ij}\psi_{j1}(x_{j1})$$

Representing the virtual control by $\alpha_{i1}$, the performance index function is formulated as
$$J_{i1}(s_{i1}) = \int_t^\infty c_{i1}\left(s_{i1}(s), \alpha_{i1}(s_{i1}(s))\right)ds$$

where $c_{i1}(s_{i1}, \alpha_{i1}) = s_{i1}^2 + \alpha_{i1}^2$ is the cost function.

Replacing $\alpha_{i1}$ with $\alpha_{i1}^*$ (the optimal virtual control) in Eq. (16), the optimal performance index function is obtained as
$$J_{i1}^*(s_{i1}) = \int_t^\infty c_{i1}\left(s_{i1}(s), \alpha_{i1}^*(s_{i1}(s))\right)ds$$

According to the previous introduction, the function satisfies
$$E\left[J_{i1}^*(s_{i1})\right] = \min_{\alpha_{i1} \in \Psi(\Omega)} E\left[\int_t^\infty c_{i1}(s_{i1}, \alpha_{i1})\,ds\right]$$

where $\Omega$ is a predefined compact set containing the origin. By viewing $x_{i2}$ as the optimal control $\alpha_{i1}^*$, the HJB equation linked with Eqs. (15) and (17) can be rewritten as
$$H_{i1}\left(s_{i1}, \alpha_{i1}^*, \frac{dJ_{i1}^*(s_{i1})}{ds_{i1}}\right) = s_{i1}^2 + \alpha_{i1}^{*2} + \frac{dJ_{i1}^*}{ds_{i1}}\left(d_i\alpha_{i1}^* + F_{i1} - \sum_{j=1}^{N} a_{ij}x_{j2}\right) + \frac{1}{2}\frac{d^2 J_{i1}^*}{ds_{i1}^2}\Psi_{i1}^T\Psi_{i1} = 0$$

The optimal virtual controller $\alpha_{i1}^*$ can be derived by solving $\partial H_{i1}/\partial \alpha_{i1}^* = 0$ as
$$\alpha_{i1}^* = -\frac{1}{2}\frac{dJ_{i1}^*(s_{i1})}{ds_{i1}}$$

To attain the tracking control, the term $\frac{dJ_{i1}^*(s_{i1})}{ds_{i1}}$ is partitioned as
$$\frac{dJ_{i1}^*(s_{i1})}{ds_{i1}} = \frac{1}{d_i}\left(2\gamma_{i1}s_{i1} + \frac{1}{2}\beta_{i1}s_{i1}^3 + 2h_{i1}(x_{i1}, s_{i1}) + J_{i1}^0(x_{i1}, s_{i1})\right)$$

where $\gamma_{i1} > 0$, $\beta_{i1} > 0$ are two designed constants, $h_{i1}(x_{i1}, s_{i1}) = F_{i1} + s_{i1}\|\Psi_{i1}\|^4$, and $J_{i1}^0(x_{i1}, s_{i1}) = -2\gamma_{i1}s_{i1} - \frac{1}{2}\beta_{i1}s_{i1}^3 - 2h_{i1}(x_{i1}, s_{i1}) + d_i\frac{dJ_{i1}^*(s_{i1})}{ds_{i1}} \in \mathbb{R}$. Substituting Eq. (21) into Eq. (20) yields
$$\alpha_{i1}^* = -\frac{1}{d_i}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + h_{i1}(x_{i1}, s_{i1}) + \frac{1}{2}J_{i1}^0(x_{i1}, s_{i1})\right)$$

Since the two functions $h_{i1}(x_{i1}, s_{i1})$ and $J_{i1}^0(x_{i1}, s_{i1})$ are uncertain yet continuous, they can be approximated by NNs as
$$h_{i1}(x_{i1}, s_{i1}) = W_{hi1}^T S_{hi1}(x_{i1}, s_{i1}) + \varepsilon_{hi1}(x_{i1}, s_{i1})$$
$$J_{i1}^0(x_{i1}, s_{i1}) = W_{Ji1}^T S_{Ji1}(x_{i1}, s_{i1}) + \varepsilon_{Ji1}(x_{i1}, s_{i1})$$

where $W_{hi1} \in \mathbb{R}^{p_1}$ and $W_{Ji1} \in \mathbb{R}^{q_1}$ are the ideal NN weights, $S_{hi1}(x_{i1}, s_{i1}) \in \mathbb{R}^{p_1}$ and $S_{Ji1}(x_{i1}, s_{i1}) \in \mathbb{R}^{q_1}$ are basis function vectors, and $\varepsilon_{hi1}(x_{i1}, s_{i1}) \in \mathbb{R}$, $\varepsilon_{Ji1}(x_{i1}, s_{i1}) \in \mathbb{R}$ denote approximation errors. Substituting Eqs. (23) and (24) into Eqs. (21) and (22), respectively, gives
$$\frac{dJ_{i1}^*(s_{i1})}{ds_{i1}} = \frac{1}{d_i}\left(2\gamma_{i1}s_{i1} + \frac{1}{2}\beta_{i1}s_{i1}^3 + 2W_{hi1}^T S_{hi1}(x_{i1}, s_{i1}) + W_{Ji1}^T S_{Ji1}(x_{i1}, s_{i1}) + \varepsilon_{i1}\right)$$
$$\alpha_{i1}^* = -\frac{1}{d_i}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + W_{hi1}^T S_{hi1}(x_{i1}, s_{i1}) + \frac{1}{2}W_{Ji1}^T S_{Ji1}(x_{i1}, s_{i1}) + \frac{1}{2}\varepsilon_{i1}\right)$$

where $\varepsilon_{i1} = 2\varepsilon_{hi1} + \varepsilon_{Ji1}$. The optimal control Eq. (26) is unattainable because the two ideal weights $W_{hi1}$ and $W_{Ji1}$ are uncertain constant vectors.

To acquire an effective optimized virtual control, RL is implemented through the identifier-critic-actor architecture using NNs. The adaptive identifier for the uncertain function $h_{i1}(x_{i1}, s_{i1})$ is constructed as
$$\hat{h}_{i1}(x_{i1}, s_{i1}) = \hat{W}_{hi1}^T(t) S_{hi1}(x_{i1}, s_{i1})$$

where $\hat{h}_{i1}(x_{i1}, s_{i1})$ is the identifier output and $\hat{W}_{hi1}(t) \in \mathbb{R}^{p_1}$ is the NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{hi1}(t) = \Gamma_{i1}\left(S_{hi1}(x_{i1}, s_{i1})s_{i1}^3 - \sigma_{i1}\hat{W}_{hi1}(t)\right)$$

where $\Gamma_{i1}$ is a positive-definite constant matrix and $\sigma_{i1} > 0$ is a constant. The critic, designed to evaluate the control performance in line with Eq. (25), is
$$\frac{d\hat{J}_{i1}(s_{i1})}{ds_{i1}} = \frac{1}{d_i}\left(2\gamma_{i1}s_{i1} + \frac{1}{2}\beta_{i1}s_{i1}^3 + 2\hat{W}_{hi1}^T(t)S_{hi1}(x_{i1}, s_{i1}) + \hat{W}_{ci1}^T(t)S_{Ji1}(x_{i1}, s_{i1})\right)$$

where $d\hat{J}_{i1}(s_{i1})/ds_{i1} \in \mathbb{R}$ is the estimate of $dJ_{i1}^*(s_{i1})/ds_{i1}$ and $\hat{W}_{ci1}(t) \in \mathbb{R}^{q_1}$ is the critic NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{ci1}(t) = -\gamma_{ci1}S_{Ji1}(x_{i1}, s_{i1})S_{Ji1}^T(x_{i1}, s_{i1})\hat{W}_{ci1}(t)$$

where $\gamma_{ci1} > 0$ is a constant. The actor, responsible for executing the control action, corresponds to Eq. (25) as follows:
$$\hat{\alpha}_{i1} = -\frac{1}{d_i}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + \hat{W}_{hi1}^T(t)S_{hi1}(x_{i1}, s_{i1}) + \frac{1}{2}\hat{W}_{ai1}^T(t)S_{Ji1}(x_{i1}, s_{i1})\right)$$

where $\hat{\alpha}_{i1}$ is the optimized virtual control and $\hat{W}_{ai1}(t) \in \mathbb{R}^{q_1}$ is the actor NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{ai1}(t) = -S_{Ji1}(x_{i1}, s_{i1})S_{Ji1}^T(x_{i1}, s_{i1})\left(\gamma_{ai1}\left(\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t)\right) + \gamma_{ci1}\hat{W}_{ci1}(t)\right)$$

where $\gamma_{ai1} > 0$ is a constant. The design parameters $\beta_{i1}$, $\gamma_{i1}$, $\gamma_{ci1}$, and $\gamma_{ai1}$ are selected to satisfy
$$\beta_{i1} > 0, \quad \gamma_{i1} > 3, \quad \gamma_{ai1} > \frac{\beta_{i1}}{2}, \quad \gamma_{ai1} > \gamma_{ci1} > \frac{\gamma_{ai1}}{2}$$
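The identifier, critic, and actor laws are ordinary differential equations in the weights and can be integrated alongside the plant. Below is a minimal Euler-discretized sketch of the Step 1 updates, in which the gains, the basis construction, and the sharing of one basis between $h_{i1}$ and $J_{i1}^0$ are simplifying assumptions:

```python
import numpy as np

def rbf(z, centers, width=1.0):
    """Gaussian basis vector S(z) evaluated at the stacked input z = [x1, s1]."""
    return np.exp(-np.sum((z - centers) ** 2, axis=1) / width ** 2)

def step1_updates(W_h, W_c, W_a, x1, s1, d_i, gains, centers, dt=1e-3):
    """One Euler step of the identifier/critic/actor laws of Step 1."""
    g_i1, b_i1, Gam, sig, g_c, g_a = gains
    S = rbf(np.array([x1, s1]), centers)   # one shared basis for h and J (a simplification)
    # Identifier: dW_h/dt = Gamma * (S * s1^3 - sigma * W_h)
    W_h = W_h + dt * Gam * (S * s1**3 - sig * W_h)
    # Critic: dW_c/dt = -gamma_c * S S^T W_c
    W_c = W_c + dt * (-g_c * S * (S @ W_c))
    # Actor: dW_a/dt = -S S^T (gamma_a (W_a - W_c) + gamma_c W_c)
    W_a = W_a + dt * (-S * (S @ (g_a * (W_a - W_c) + g_c * W_c)))
    # Optimized virtual control alpha_hat_i1
    alpha = -(g_i1 * s1 + 0.25 * b_i1 * s1**3 + W_h @ S + 0.5 * W_a @ S) / d_i
    return W_h, W_c, W_a, alpha
```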

According to Eqs. (19), (29) and (31), the HJB equation is calculated as
$$\begin{aligned} H_{i1}\left(s_{i1}, \hat{\alpha}_{i1}, \frac{d\hat{J}_{i1}}{ds_{i1}}\right) ={}& s_{i1}^2 + \frac{1}{d_i^2}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + \hat{W}_{hi1}^T(t)S_{hi1}(x_{i1}, s_{i1}) + \frac{1}{2}\hat{W}_{ai1}^T(t)S_{Ji1}(x_{i1}, s_{i1})\right)^2 \\ &+ \frac{1}{d_i}\left(2\gamma_{i1}s_{i1} + \frac{1}{2}\beta_{i1}s_{i1}^3 + 2\hat{W}_{hi1}^T(t)S_{hi1}(x_{i1}, s_{i1}) + \hat{W}_{ci1}^T(t)S_{Ji1}(x_{i1}, s_{i1})\right) \\ &\times\left(-\frac{1}{d_i}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + \hat{W}_{hi1}^T(t)S_{hi1}(x_{i1}, s_{i1}) + \frac{1}{2}\hat{W}_{ai1}^T(t)S_{Ji1}(x_{i1}, s_{i1})\right) + f_{i1}(x_{i1}) + \psi_{i1}^T(x_{i1})\frac{dw}{dt} - \dot{y}_d\right) \\ &+ \frac{1}{2}\frac{d^2\hat{J}_{i1}}{ds_{i1}^2}\left\|\psi_{i1}(x_{i1})\right\|^2 \end{aligned}$$

Building upon the preceding analysis, the optimized control $\hat{\alpha}_{i1}$ is foreseen as the sole solution achieving $H_{i1}(s_{i1}, \hat{\alpha}_{i1}, d\hat{J}_{i1}/ds_{i1}) \equiv 0$. Assuming that $H_{i1}(s_{i1}, \hat{\alpha}_{i1}, d\hat{J}_{i1}/ds_{i1}) = 0$ exists and has a unique solution, it is equivalent to the following equation:
$$\frac{\partial H_{i1}\left(s_{i1}, \hat{\alpha}_{i1}, d\hat{J}_{i1}/ds_{i1}\right)}{\partial \hat{W}_{ai1}} = \frac{1}{2}S_{Ji1}(x_{i1}, s_{i1})S_{Ji1}^T(x_{i1}, s_{i1})\left(\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t)\right) = 0$$

Define the positive function $P_{i1}(t)$ as
$$P_{i1}(t) = \left(\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t)\right)^T\left(\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t)\right)$$

It is evident that Eq. (35) is equivalent to $P_{i1}(t) = 0$. Given that $\partial P_{i1}(t)/\partial\hat{W}_{ai1}(t) = -\partial P_{i1}(t)/\partial\hat{W}_{ci1}(t) = 2(\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t))$, the time derivative of $P_{i1}(t)$ along the updating laws Eqs. (30) and (32) is
$$\frac{dP_{i1}}{dt} = \left(\frac{\partial P_{i1}}{\partial\hat{W}_{ai1}}\right)^T\dot{\hat{W}}_{ai1} + \left(\frac{\partial P_{i1}}{\partial\hat{W}_{ci1}}\right)^T\dot{\hat{W}}_{ci1} = -\gamma_{ai1}\left(\frac{\partial P_{i1}}{\partial\hat{W}_{ai1}}\right)^T S_{Ji1}S_{Ji1}^T\left(\hat{W}_{ai1} - \hat{W}_{ci1}\right) = -\frac{\gamma_{ai1}}{2}\left(\frac{\partial P_{i1}}{\partial\hat{W}_{ai1}}\right)^T S_{Ji1}S_{Ji1}^T\frac{\partial P_{i1}}{\partial\hat{W}_{ai1}} \leq 0$$

The inequality Eq. (37) suggests that the updating laws Eqs. (30) and (32) drive $\hat{W}_{ai1}(t) - \hat{W}_{ci1}(t)$ toward zero eventually. The key benefits of the RL design approach are: (1) the optimized control algorithm has a substantially simpler structure than existing optimal methods, such as Vamvoudakis & Lewis (2010), Liu et al. (2013), and Wen, Ge & Tu (2018); (2) it alleviates the need for persistent excitation, a requirement prevalent in many optimal control methods. Replacing $x_{i2}$ with $\hat{\alpha}_{i1} + s_{i2}$ in the dynamics Eq. (14) yields
$$ds_{i1} = \left(d_i\left(\hat{\alpha}_{i1} + s_{i2}\right) + F_{i1} - \sum_{j=1}^{N} a_{ij}x_{j2}\right)dt + \Psi_{i1}\,dw$$
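The negative-gradient property derived above can also be checked numerically: integrating the critic and actor laws from arbitrary initial weights, $P_{i1}(t)$ is nonincreasing. A small sketch with a frozen basis vector (an assumption; in closed loop $S_{Ji1}$ varies with the state):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=4)            # frozen basis vector S_Ji1 (assumption)
W_c = rng.normal(size=4)          # critic weight estimate
W_a = rng.normal(size=4)          # actor weight estimate
g_c, g_a, dt = 1.0, 1.8, 1e-3     # gains chosen so that gamma_a > gamma_c > gamma_a / 2

P = []
for _ in range(5000):
    W_c = W_c + dt * (-g_c * S * (S @ W_c))
    W_a = W_a + dt * (-S * (S @ (g_a * (W_a - W_c) + g_c * W_c)))
    P.append((W_a - W_c) @ (W_a - W_c))

assert all(P[k + 1] <= P[k] + 1e-12 for k in range(len(P) - 1))  # P is nonincreasing
```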

The Lyapunov function candidate is designed as
$$L_{i1} = \frac{1}{4}s_{i1}^4 + \frac{1}{2}\tilde{W}_{hi1}^T\Gamma_{i1}^{-1}\tilde{W}_{hi1} + \frac{1}{2}\tilde{W}_{ci1}^T\tilde{W}_{ci1} + \frac{1}{2}\tilde{W}_{ai1}^T\tilde{W}_{ai1}$$

where $\tilde{W}_{hi1}(t) = \hat{W}_{hi1}(t) - W_{hi1}$, $\tilde{W}_{ci1}(t) = \hat{W}_{ci1}(t) - W_{Ji1}$, and $\tilde{W}_{ai1}(t) = \hat{W}_{ai1}(t) - W_{Ji1}$ represent the corresponding errors. Computing $\mathcal{L}$ of $L_{i1}$, along with Eqs. (28), (30), (32) and (39), yields
$$\begin{aligned} \mathcal{L}L_{i1} ={}& s_{i1}^3\left(d_i\left(\hat{\alpha}_{i1} + s_{i2}\right) + F_{i1} - \sum_{j=1}^{N} a_{ij}x_{j2}\right) + \frac{3}{2}s_{i1}^2\|\Psi_{i1}\|^2 + \tilde{W}_{hi1}^T\left(S_{hi1}s_{i1}^3 - \sigma_{i1}\hat{W}_{hi1}\right) \\ &- \gamma_{ci1}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} - \tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\left(\gamma_{ai1}\left(\hat{W}_{ai1} - \hat{W}_{ci1}\right) + \gamma_{ci1}\hat{W}_{ci1}\right) \end{aligned}$$

The optimal virtual controller is designed as
$$\hat{\alpha}_{i1} = -\frac{1}{d_i}\left(\gamma_{i1}s_{i1} + \frac{1}{4}\beta_{i1}s_{i1}^3 + \hat{W}_{hi1}^TS_{hi1} + \frac{1}{2}\hat{W}_{ai1}^TS_{Ji1}\right)$$

and then $\mathcal{L}L_{i1}$ becomes
$$\begin{aligned} \mathcal{L}L_{i1} ={}& s_{i1}^3\left(-\gamma_{i1}s_{i1} - \frac{1}{4}\beta_{i1}s_{i1}^3 - \hat{W}_{hi1}^TS_{hi1} - \frac{1}{2}\hat{W}_{ai1}^TS_{Ji1} + s_{i2}d_i + F_{i1} - \sum_{j=1}^{N} a_{ij}x_{j2}\right) + \frac{3}{2}s_{i1}^2\|\Psi_{i1}\|^2 \\ &+ \tilde{W}_{hi1}^T\left(S_{hi1}s_{i1}^3 - \sigma_{i1}\hat{W}_{hi1}\right) - \gamma_{ci1}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} - \gamma_{ai1}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} \\ &+ \left(\gamma_{ai1} - \gamma_{ci1}\right)\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} \end{aligned}$$

With Young's inequality Eq. (8), the following results hold:
$$d_is_{i1}^3s_{i2} \leq \frac{3}{4}d_is_{i1}^4 + \frac{1}{4}d_is_{i2}^4$$
$$-s_{i1}^3\sum_{j=1}^{N} a_{ij}x_{j2} \leq \frac{3}{4}s_{i1}^4 + \frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4$$
$$\frac{3}{2}s_{i1}^2\|\Psi_{i1}\|^2 \leq s_{i1}^4\|\Psi_{i1}\|^4 + \frac{9}{16}$$
$$-\frac{1}{2}s_{i1}^3\hat{W}_{ai1}^TS_{Ji1} \leq \frac{1}{4\beta_{i1}}s_{i1}^6 + \frac{\beta_{i1}}{4}\hat{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1}$$

Substituting inequalities Eqs. (43)–(46) into Eq. (42) gives
$$\begin{aligned} \mathcal{L}L_{i1} \leq{}& -\left(\gamma_{i1} - \frac{3}{4}d_i - \frac{3}{4}\right)s_{i1}^4 - s_{i1}^3\left(\hat{W}_{hi1}^TS_{hi1} - h_{i1}\right) + \tilde{W}_{hi1}^T\left(S_{hi1}s_{i1}^3 - \sigma_{i1}\hat{W}_{hi1}\right) \\ &- \gamma_{ci1}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} - \gamma_{ai1}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} + \left(\gamma_{ai1} - \gamma_{ci1}\right)\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} \\ &+ \frac{\beta_{i1}}{4}\hat{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} + \frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4 + \frac{9}{16} + \frac{1}{4}d_is_{i2}^4 \end{aligned}$$

where $h_{i1} = F_{i1} + s_{i1}\|\Psi_{i1}\|^4$. Substituting Eq. (23) into Eq. (47) results in the following inequality:
$$\begin{aligned} \mathcal{L}L_{i1} \leq{}& -\left(\gamma_{i1} - \frac{3}{4}d_i - \frac{3}{4}\right)s_{i1}^4 + s_{i1}^3\varepsilon_{hi1} - \sigma_{i1}\tilde{W}_{hi1}^T\hat{W}_{hi1} - \gamma_{ci1}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} \\ &- \gamma_{ai1}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} + \frac{\beta_{i1}}{4}\hat{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} + \left(\gamma_{ai1} - \gamma_{ci1}\right)\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} \\ &+ \frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4 + \frac{1}{4}d_is_{i2}^4 + \frac{9}{16} \end{aligned}$$

From the facts $\tilde{W}_{hi1}(t) = \hat{W}_{hi1}(t) - W_{hi1}$, $\tilde{W}_{ci1}(t) = \hat{W}_{ci1}(t) - W_{Ji1}$ and $\tilde{W}_{ai1}(t) = \hat{W}_{ai1}(t) - W_{Ji1}$, the following equations can be derived:
$$\tilde{W}_{hi1}^T\hat{W}_{hi1} = \frac{1}{2}\tilde{W}_{hi1}^T\tilde{W}_{hi1} + \frac{1}{2}\hat{W}_{hi1}^T\hat{W}_{hi1} - \frac{1}{2}W_{hi1}^TW_{hi1}$$
$$\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} = \frac{1}{2}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\tilde{W}_{ci1} + \frac{1}{2}\hat{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} - \frac{1}{2}W_{Ji1}^TS_{Ji1}S_{Ji1}^TW_{Ji1}$$
$$\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} = \frac{1}{2}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\tilde{W}_{ai1} + \frac{1}{2}\hat{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ai1} - \frac{1}{2}W_{Ji1}^TS_{Ji1}S_{Ji1}^TW_{Ji1}$$

With Young's inequality Eq. (8) and the limitation of Eq. (33), the subsequent inequalities are obtained:
$$s_{i1}^3\varepsilon_{hi1} \leq \frac{3}{4}s_{i1}^4 + \frac{1}{4}\varepsilon_{hi1}^4$$
$$\left(\gamma_{ai1} - \gamma_{ci1}\right)\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1} \leq \frac{\gamma_{ai1} - \gamma_{ci1}}{2}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\tilde{W}_{ai1} + \frac{\gamma_{ai1} - \gamma_{ci1}}{2}\hat{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\hat{W}_{ci1}$$

Substituting Eqs. (49)–(53) into Eq. (48) yields
$$\begin{aligned} \mathcal{L}L_{i1} \leq{}& -\left(\gamma_{i1} - \frac{3}{2} - \frac{3}{4}d_i\right)s_{i1}^4 - \frac{\sigma_{i1}}{2}\tilde{W}_{hi1}^T\tilde{W}_{hi1} - \frac{\gamma_{ci1}}{2}\tilde{W}_{ci1}^TS_{Ji1}S_{Ji1}^T\tilde{W}_{ci1} - \frac{\gamma_{ci1}}{2}\tilde{W}_{ai1}^TS_{Ji1}S_{Ji1}^T\tilde{W}_{ai1} \\ &- \left(\gamma_{ci1} - \frac{\gamma_{ai1}}{2}\right)\left(\hat{W}_{ci1}^TS_{Ji1}\right)^2 - \left(\frac{\gamma_{ai1}}{2} - \frac{\beta_{i1}}{4}\right)\left(\hat{W}_{ai1}^TS_{Ji1}\right)^2 + B_{i1} + \frac{d_i}{4}s_{i2}^4 + \frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4 \end{aligned}$$

where $B_{i1}(t) = \left(\frac{\gamma_{ci1}}{2} + \frac{\gamma_{ai1}}{2}\right)\left(W_{Ji1}^TS_{Ji1}\right)^2 + \frac{\sigma_{i1}}{2}\|W_{hi1}\|^2 + \frac{1}{4}\varepsilon_{hi1}^4 + \frac{9}{16}$ and $|B_{i1}(t)| \leq b_{i1}$, because all its terms are bounded; the term $\frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4$ will be handled in Step 2's $h_{i2}(\bar{x}_{i2}, s_{i2})$.

Step m ($2 \leq m \leq n-1$): Define the containment error as $s_{im} = x_{im} - \hat{\alpha}_{i,m-1}$. According to Eq. (9), the error dynamics, along with Eq. (13), are
$$ds_{im} = \left(x_{i,m+1} + f_{im}(\bar{x}_{im}) - \mathcal{L}\hat{\alpha}_{i,m-1}\right)dt + \Psi_{im}\,dw$$

where $\Psi_{im} = \psi_{im}(\bar{x}_{im}) - \sum_{j=1}^{m-1}\frac{\partial\hat{\alpha}_{i,m-1}}{\partial x_{ij}}\psi_{ij}$. Letting $\alpha_{im}$ denote the virtual controller, the performance index function can be defined as
$$J_{im}(s_{im}) = \int_t^\infty c_{im}\left(s_{im}(s), \alpha_{im}(s_{im}(s))\right)ds$$

where $c_{im}(s_{im}, \alpha_{im}) = s_{im}^2 + \alpha_{im}^2$ is the cost function. Denoting $\alpha_{im}^*$ as the optimal virtual controller and substituting $\alpha_{im}^*$ into Eq. (56), the function can be rewritten as
$$J_{im}^*(s_{im}) = \int_t^\infty c_{im}\left(s_{im}(s), \alpha_{im}^*(s_{im}(s))\right)ds.$$

Similar to Step 1, Eq. (57) manifests the subsequent characteristic:
$$E\left[J_{im}^*(s_{im})\right] = \min_{\alpha_{im} \in \Psi(\Omega)} E\left[J_{im}(s_{im})\right].$$

By viewing $x_{i,m+1}(t)$ as the optimal control $\alpha_{im}^*$, the HJB equation related to Eqs. (55) and (57) is
$$H_{im}\left(s_{im}, \alpha_{im}^*, \frac{dJ_{im}^*}{ds_{im}}\right) = s_{im}^2 + \alpha_{im}^{*2} + \frac{dJ_{im}^*}{ds_{im}}\left(\alpha_{im}^* + f_{im} + \Psi_{im}\frac{dw}{dt} - \mathcal{L}\hat{\alpha}_{i,m-1}\right) + \frac{1}{2}\frac{d^2J_{im}^*}{ds_{im}^2}\Psi_{im}^T\Psi_{im} = 0$$

where $dw/dt$ represents the white noise. Besides, $\alpha_{im}^*$ is obtained by solving $\partial H_{im}/\partial\alpha_{im}^* = 0$ as
$$\alpha_{im}^* = -\frac{1}{2}\frac{dJ_{im}^*}{ds_{im}}$$

To attain the containment control, the term $dJ_{im}^*(s_{im})/ds_{im}$ is segmented as
$$\frac{dJ_{im}^*}{ds_{im}} = 2\gamma_{im}s_{im} + \frac{1}{2}\beta_{im}s_{im}^3 + 2h_{im} + J_{im}^0$$

where $\gamma_{im} > 0$ and $\beta_{im} > 0$ are two designed constants, $h_{i2} = f_{i2} + s_{i2}\|\Psi_{i2}\|^4 + \frac{1}{4}\left(\sum_{j=1}^{N} a_{ij}x_{j2}\right)^4 \in \mathbb{R}$, $h_{im} = f_{im} + s_{im}\|\Psi_{im}\|^4 \in \mathbb{R}$ ($m \geq 3$), and $J_{im}^0 = -2\gamma_{im}s_{im} - \frac{1}{2}\beta_{im}s_{im}^3 - 2h_{im} + \frac{dJ_{im}^*}{ds_{im}} \in \mathbb{R}$. By substituting Eq. (61) into Eq. (60), the optimal control transforms into
$$\alpha_{im}^* = -\gamma_{im}s_{im} - \frac{1}{4}\beta_{im}s_{im}^3 - h_{im} - \frac{1}{2}J_{im}^0$$

Since the two functions $h_{im}(\bar{x}_{im}, s_{im})$ and $J_{im}^0(\bar{x}_{im}, s_{im})$ are uncertain yet continuous, they can be approximated by NNs as
$$h_{im}(\bar{x}_{im}, s_{im}) = W_{him}^TS_{him}(\bar{x}_{im}, s_{im}) + \varepsilon_{him}(\bar{x}_{im}, s_{im})$$
$$J_{im}^0(\bar{x}_{im}, s_{im}) = W_{Jim}^TS_{Jim}(\bar{x}_{im}, s_{im}) + \varepsilon_{Jim}(\bar{x}_{im}, s_{im})$$

where $W_{him} \in \mathbb{R}^{p_m}$ and $W_{Jim} \in \mathbb{R}^{q_m}$ are the ideal NN weights, $S_{him}(\bar{x}_{im}, s_{im}) \in \mathbb{R}^{p_m}$, $S_{Jim}(\bar{x}_{im}, s_{im}) \in \mathbb{R}^{q_m}$ are basis vectors, and $\varepsilon_{him}(\bar{x}_{im}, s_{im}) \in \mathbb{R}$, $\varepsilon_{Jim}(\bar{x}_{im}, s_{im}) \in \mathbb{R}$ are bounded approximation errors. Substituting Eqs. (63) and (64) into Eqs. (61) and (62) gives
$$\frac{dJ_{im}^*(s_{im})}{ds_{im}} = 2\gamma_{im}s_{im} + \frac{1}{2}\beta_{im}s_{im}^3 + 2W_{him}^TS_{him}(\bar{x}_{im}, s_{im}) + W_{Jim}^TS_{Jim}(\bar{x}_{im}, s_{im}) + \varepsilon_{im}$$
$$\alpha_{im}^* = -\gamma_{im}s_{im} - \frac{1}{4}\beta_{im}s_{im}^3 - W_{him}^TS_{him} - \frac{1}{2}W_{Jim}^TS_{Jim} - \frac{1}{2}\varepsilon_{im}$$

where $\varepsilon_{im} = 2\varepsilon_{him} + \varepsilon_{Jim}$. The optimal control Eq. (66) is impractical because the two ideal weights $W_{him}$ and $W_{Jim}$ are uncertain. To obtain a practical optimized control, RL is constructed based on Eqs. (65) and (66) as follows. The adaptive identifier is formulated as
$$\hat{h}_{im}(\bar{x}_{im}, s_{im}) = \hat{W}_{him}^T S_{him}(\bar{x}_{im}, s_{im})$$

where $\hat{h}_{im}(\bar{x}_{im}, s_{im})$ is the identifier output and $\hat{W}_{him}(t) \in \mathbb{R}^{p_m}$ is the NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{him} = \Gamma_{im}\left(S_{him}(\bar{x}_{im}, s_{im})s_{im}^3 - \sigma_{im}\hat{W}_{him}\right)$$

where $\Gamma_{im}$ is a positive-definite constant matrix and $\sigma_{im} > 0$ is a constant. The critic is designed as
$$\frac{d\hat{J}_{im}(s_{im})}{ds_{im}} = 2\gamma_{im}s_{im} + \frac{1}{2}\beta_{im}s_{im}^3 + 2\hat{W}_{him}^TS_{him} + \hat{W}_{cim}^TS_{Jim}$$

where $d\hat{J}_{im}(s_{im})/ds_{im} \in \mathbb{R}$ is the estimate of $dJ_{im}^*(s_{im})/ds_{im}$ and $\hat{W}_{cim}(t) \in \mathbb{R}^{q_m}$ is the critic NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{cim} = -\gamma_{cim}S_{Jim}S_{Jim}^T\hat{W}_{cim}$$

where $\gamma_{cim} > 0$ is a constant. The actor is designed as
$$\hat{\alpha}_{im} = -\gamma_{im}s_{im} - \frac{1}{4}\beta_{im}s_{im}^3 - \hat{W}_{him}^TS_{him} - \frac{1}{2}\hat{W}_{aim}^TS_{Jim}$$

where $\hat{\alpha}_{im}$ is the optimized virtual control and $\hat{W}_{aim}(t) \in \mathbb{R}^{q_m}$ is the actor NN weight, which is updated by the following law:
$$\dot{\hat{W}}_{aim} = -S_{Jim}S_{Jim}^T\left(\gamma_{aim}\left(\hat{W}_{aim} - \hat{W}_{cim}\right) + \gamma_{cim}\hat{W}_{cim}\right)$$

where $\gamma_{aim} > 0$ is a constant. The design parameters $\beta_{im}$, $\gamma_{im}$, $\gamma_{cim}$ and $\gamma_{aim}$ satisfy the following conditions:

$$\beta_{im} > 0, \quad \gamma_{im} > 4, \quad \gamma_{aim} > \frac{\beta_{im}}{2}, \quad \gamma_{aim} > \gamma_{cim} > \frac{\gamma_{aim}}{2}.$$
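A small helper can verify a candidate gain set against this constraint chain before running the controller (a convenience sketch, not part of the paper's design):

```python
def check_step_gains(beta, gamma, gamma_a, gamma_c, first_step=False):
    """Validate the OB design-parameter conditions of one backstepping step.
    Step 1 requires gamma > 3; steps m >= 2 require gamma > 4."""
    return (beta > 0
            and gamma > (3 if first_step else 4)
            and gamma_a > beta / 2
            and gamma_a > gamma_c > gamma_a / 2)

# Example: the simulation section's step-1 gains (gamma_i1=12, beta_i1=5, ...)
print(check_step_gains(beta=5, gamma=12, gamma_a=20, gamma_c=14, first_step=True))  # True
```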

Define the containment error of step m+1 as $s_{i,m+1} = x_{i,m+1} - \hat{\alpha}_{im}$. Replacing $x_{i,m+1}$ with $\hat{\alpha}_{im} + s_{i,m+1}$ in the dynamics Eq. (55) yields
$$ds_{im} = \left(\hat{\alpha}_{im} + s_{i,m+1} + f_{im} - \mathcal{L}\hat{\alpha}_{i,m-1}\right)dt + \Psi_{im}\,dw$$

Select the Lyapunov function candidate
$$L_{im} = \sum_{j=1}^{m-1}L_{ij} + \frac{1}{4}s_{im}^4 + \frac{1}{2}\tilde{W}_{him}^T\Gamma_{im}^{-1}\tilde{W}_{him} + \frac{1}{2}\tilde{W}_{cim}^T\tilde{W}_{cim} + \frac{1}{2}\tilde{W}_{aim}^T\tilde{W}_{aim}$$

where $L_{ij} = \frac{1}{4}s_{ij}^4 + \frac{1}{2}\tilde{W}_{hij}^T\Gamma_{ij}^{-1}\tilde{W}_{hij} + \frac{1}{2}\tilde{W}_{cij}^T\tilde{W}_{cij} + \frac{1}{2}\tilde{W}_{aij}^T\tilde{W}_{aij}$, and $\tilde{W}_{him}(t) = \hat{W}_{him}(t) - W_{him}$, $\tilde{W}_{cim}(t) = \hat{W}_{cim}(t) - W_{Jim}$, $\tilde{W}_{aim}(t) = \hat{W}_{aim}(t) - W_{Jim}$. Computing the infinitesimal generator $\mathcal{L}$ of $L_{im}$, along with Eqs. (68), (70), (72) and (74), gives
$$\begin{aligned} \mathcal{L}L_{im} ={}& \sum_{j=1}^{m-1}\mathcal{L}L_{ij} + s_{im}^3\left(\hat{\alpha}_{im} + s_{i,m+1} + f_{im} - \mathcal{L}\hat{\alpha}_{i,m-1}\right) + \frac{3}{2}s_{im}^2\|\Psi_{im}\|^2 + \tilde{W}_{him}^T\left(S_{him}s_{im}^3 - \sigma_{im}\hat{W}_{him}\right) \\ &- \gamma_{cim}\tilde{W}_{cim}^TS_{Jim}S_{Jim}^T\hat{W}_{cim} - \tilde{W}_{aim}^TS_{Jim}S_{Jim}^T\left(\gamma_{aim}\left(\hat{W}_{aim} - \hat{W}_{cim}\right) + \gamma_{cim}\hat{W}_{cim}\right) \end{aligned}$$

Substituting the virtual control Eq. (71) into Eq. (76) gives
$$\begin{aligned} \mathcal{L}L_{im} ={}& \sum_{j=1}^{m-1}\mathcal{L}L_{ij} + s_{im}^3\left(-\gamma_{im}s_{im} - \frac{1}{4}\beta_{im}s_{im}^3 - \hat{W}_{him}^TS_{him} - \frac{1}{2}\hat{W}_{aim}^TS_{Jim} + s_{i,m+1} + f_{im} - \mathcal{L}\hat{\alpha}_{i,m-1}\right) \\ &+ \frac{3}{2}s_{im}^2\|\Psi_{im}\|^2 + \tilde{W}_{him}^T\left(S_{him}s_{im}^3 - \sigma_{im}\hat{W}_{him}\right) - \gamma_{cim}\tilde{W}_{cim}^TS_{Jim}S_{Jim}^T\hat{W}_{cim} \\ &- \tilde{W}_{aim}^TS_{Jim}S_{Jim}^T\left(\gamma_{aim}\left(\hat{W}_{aim} - \hat{W}_{cim}\right) + \gamma_{cim}\hat{W}_{cim}\right) \\ \leq{}& \sum_{j=1}^{m-1}\mathcal{L}L_{ij} - \left(\gamma_{im} - \frac{3}{2}\right)s_{im}^4 + \frac{\beta_{im}}{4}\hat{W}_{aim}^TS_{Jim}S_{Jim}^T\hat{W}_{aim} + \frac{1}{4}s_{i,m+1}^4 + \frac{1}{4}\left(\mathcal{L}\hat{\alpha}_{i,m-1}\right)^4 + \frac{9}{16} \\ &- s_{im}^3\left(\hat{W}_{him}^TS_{him} - h_{im}\right) + \tilde{W}_{him}^T\left(S_{him}s_{im}^3 - \sigma_{im}\hat{W}_{him}\right) - \gamma_{cim}\tilde{W}_{cim}^TS_{Jim}S_{Jim}^T\hat{W}_{cim} \\ &- \gamma_{aim}\tilde{W}_{aim}^TS_{Jim}S_{Jim}^T\hat{W}_{aim} + \left(\gamma_{aim} - \gamma_{cim}\right)\tilde{W}_{aim}^TS_{Jim}S_{Jim}^T\hat{W}_{cim} \end{aligned}$$

From the fact $s_{im}^3\mathcal{L}\hat{\alpha}_{i,m-1} \leq \frac{3}{4}s_{im}^4 + \frac{1}{4}\left(\mathcal{L}\hat{\alpha}_{i,m-1}\right)^4$ and the previous results, following numerous operations resembling those in Eqs. (43)–(54) of Step 1, Eq. (77) can be expressed as
$$\begin{aligned} \mathcal{L}L_{im} \leq{}& \sum_{j=1}^{m-1}\left(-a_{ij}L_{ij} + b_{ij}\right) - \left(\gamma_{im} - 4\right)s_{im}^4 - \frac{\sigma_{im}}{2\lambda_{\Gamma_{im}^{-1}}^{max}}\tilde{W}_{him}^T\Gamma_{im}^{-1}\tilde{W}_{him} \\ &- \frac{\gamma_{cim}}{2}\lambda_{S_{Jim}}^{min}\tilde{W}_{cim}^T\tilde{W}_{cim} - \frac{\gamma_{cim}}{2}\lambda_{S_{Jim}}^{min}\tilde{W}_{aim}^T\tilde{W}_{aim} + B_{im} + \frac{1}{4}s_{i,m+1}^4 \end{aligned}$$

where $\lambda_{\Gamma_{im}^{-1}}^{max}$ is the maximal eigenvalue of $\Gamma_{im}^{-1}$ and $\lambda_{S_{Jim}}^{min}$ is the minimal eigenvalue of $S_{Jim}S_{Jim}^T$.

Moreover, $B_{im} = \left(\frac{\gamma_{cim}}{2} + \frac{\gamma_{aim}}{2}\right)\left(W_{Jim}^TS_{Jim}\right)^2 + \frac{\sigma_{im}}{2}\|W_{him}\|^2 + \frac{1}{4}\left(\mathcal{L}\hat{\alpha}_{i,m-1}\right)^4 + \frac{1}{4}\varepsilon_{him}^4 + \frac{9}{16}$, which satisfies $|B_{im}| \leq b_{im}$. Define $a_{im} = \min\left\{4\left(\gamma_{im} - 4\right), \frac{\sigma_{im}}{\lambda_{\Gamma_{im}^{-1}}^{max}}, \gamma_{cim}\lambda_{S_{Jim}}^{min}\right\}$, and then Eq. (78) becomes
$$\mathcal{L}L_{im} \leq \sum_{j=1}^{m}\left(-a_{ij}L_{ij} + b_{ij}\right) + \frac{1}{4}s_{i,m+1}^4$$

Step n: The optimized actual control $u_i$ is obtained here. Based on Eq. (9), the error $s_{in} = x_{in} - \hat{\alpha}_{i,n-1}$ can be derived from Eq. (13), giving
$$ds_{in} = \left(u_i + f_{in}(\bar{x}_{in}) - \mathcal{L}\hat{\alpha}_{i,n-1}\right)dt + \Psi_{in}\,dw$$

where $\Psi_{in} = \psi_{in} - \sum_{j=1}^{n-1}\frac{\partial\hat{\alpha}_{i,n-1}}{\partial x_{ij}}\psi_{ij}$. The performance index function related to Eq. (80) can be written as
$$J_{in}(s_{in}) = \int_t^\infty c_{in}\left(s_{in}(s), u_i(s_{in}(s))\right)ds$$

where $c_{in}(s_{in}, u_i) = s_{in}^2 + u_i^2$ is the cost function. Denoting $u_i^*$ as the optimal control, the function can be rewritten as
$$J_{in}^*(s_{in}) = \int_t^\infty c_{in}\left(s_{in}(s), u_i^*(s_{in}(s))\right)ds$$

The function Eq. (82) implies the following property:
$$E\left[J_{in}^*(s_{in})\right] = \min_{u_i \in \Psi(\Omega)} E\left[J_{in}(s_{in})\right]$$

The HJB equation related to Eqs. (80) and (82) is
$$H_{in}\left(s_{in}, u_i^*, \frac{dJ_{in}^*}{ds_{in}}\right) = s_{in}^2 + u_i^{*2} + \frac{dJ_{in}^*}{ds_{in}}\left(u_i^* + f_{in} - \mathcal{L}\hat{\alpha}_{i,n-1} + \Psi_{in}\frac{dw}{dt}\right) + \frac{1}{2}\frac{d^2J_{in}^*}{ds_{in}^2}\Psi_{in}^T\Psi_{in} = 0$$

Solving $\partial H_{in}/\partial u_i^* = 0$ yields
$$u_i^* = -\frac{1}{2}\frac{dJ_{in}^*(s_{in})}{ds_{in}}$$

Split the term $dJ_{in}^*/ds_{in}$ as
$$\frac{dJ_{in}^*}{ds_{in}} = 2\gamma_{in}s_{in} + \frac{1}{2}\beta_{in}s_{in}^3 + 2h_{in} + J_{in}^0$$

where $\gamma_{in} > 0$ and $\beta_{in} > 0$ are two designed constants, $h_{in} = f_{in} + s_{in}\|\Psi_{in}\|^4 \in \mathbb{R}$, and $J_{in}^0 = -2\gamma_{in}s_{in} - \frac{1}{2}\beta_{in}s_{in}^3 - 2h_{in} + \frac{dJ_{in}^*}{ds_{in}} \in \mathbb{R}$. Substituting Eq. (86) into Eq. (85) gives
$$u_i^* = -\gamma_{in}s_{in} - \frac{1}{4}\beta_{in}s_{in}^3 - h_{in} - \frac{1}{2}J_{in}^0$$

Since the unknown functions $h_{in}(\bar{x}_{in}, s_{in})$ and $J_{in}^0(\bar{x}_{in}, s_{in})$ are continuous, they can be approximated by NNs as
$$h_{in}(\bar{x}_{in}, s_{in}) = W_{hin}^TS_{hin}(\bar{x}_{in}, s_{in}) + \varepsilon_{hin}(\bar{x}_{in}, s_{in})$$
$$J_{in}^0(\bar{x}_{in}, s_{in}) = W_{Jin}^TS_{Jin}(\bar{x}_{in}, s_{in}) + \varepsilon_{Jin}(\bar{x}_{in}, s_{in})$$

where $W_{hin} \in \mathbb{R}^{p_n}$ and $W_{Jin} \in \mathbb{R}^{q_n}$ are the ideal NN weights, $S_{hin}(\bar{x}_{in}, s_{in}) \in \mathbb{R}^{p_n}$ and $S_{Jin}(\bar{x}_{in}, s_{in}) \in \mathbb{R}^{q_n}$ are the basis function vectors, and $\varepsilon_{hin}(\bar{x}_{in}, s_{in}) \in \mathbb{R}$, $\varepsilon_{Jin}(\bar{x}_{in}, s_{in}) \in \mathbb{R}$ are the bounded approximation errors. Substituting Eqs. (88) and (89) into Eqs. (86) and (87) yields
$$\frac{dJ_{in}^*}{ds_{in}} = 2\gamma_{in}s_{in} + \frac{1}{2}\beta_{in}s_{in}^3 + 2W_{hin}^TS_{hin} + W_{Jin}^TS_{Jin} + \varepsilon_{in}$$
$$u_i^* = -\gamma_{in}s_{in} - \frac{1}{4}\beta_{in}s_{in}^3 - W_{hin}^TS_{hin} - \frac{1}{2}W_{Jin}^TS_{Jin} - \frac{1}{2}\varepsilon_{in}$$

where $\varepsilon_{in} = 2\varepsilon_{hin} + \varepsilon_{Jin}$. The adaptive identifier is formulated as
$$\hat{h}_{in}(\bar{x}_{in}, s_{in}) = \hat{W}_{hin}^TS_{hin}(\bar{x}_{in}, s_{in})$$

where $\hat{h}_{in}(\bar{x}_{in}, s_{in})$ is the identifier output and $\hat{W}_{hin}(t) \in \mathbb{R}^{p_n}$ is the identifier NN weight.

The weight is updated by the following law:
$$\dot{\hat{W}}_{hin} = \Gamma_{in}\left(S_{hin}(\bar{x}_{in}, s_{in})s_{in}^3 - \sigma_{in}\hat{W}_{hin}\right)$$

where $\Gamma_{in}$ is a positive-definite constant matrix and $\sigma_{in} > 0$ is a constant. The critic is
$$\frac{d\hat{J}_{in}(s_{in})}{ds_{in}} = 2\gamma_{in}s_{in} + \frac{1}{2}\beta_{in}s_{in}^3 + 2\hat{W}_{hin}^TS_{hin}(\bar{x}_{in}, s_{in}) + \hat{W}_{cin}^TS_{Jin}(\bar{x}_{in}, s_{in})$$

The critic weight is updated by the following law:
$$\dot{\hat{W}}_{cin} = -\gamma_{cin}S_{Jin}(\bar{x}_{in}, s_{in})S_{Jin}^T(\bar{x}_{in}, s_{in})\hat{W}_{cin}$$

where $\gamma_{cin}$ is a positive constant. The actor is
$$\hat{u}_i = -\gamma_{in}s_{in} - \frac{1}{4}\beta_{in}s_{in}^3 - \hat{W}_{hin}^T(t)S_{hin}(\bar{x}_{in}, s_{in}) - \frac{1}{2}\hat{W}_{ain}^TS_{Jin}(\bar{x}_{in}, s_{in})$$

The actor weight is updated by the following law:
$$\dot{\hat{W}}_{ain} = -S_{Jin}(\bar{x}_{in}, s_{in})S_{Jin}^T(\bar{x}_{in}, s_{in})\left(\gamma_{ain}\left(\hat{W}_{ain} - \hat{W}_{cin}\right) + \gamma_{cin}\hat{W}_{cin}\right)$$

These parameters are required to meet the following limitations:
$$\beta_{in} > 0, \quad \gamma_{in} > 4, \quad \gamma_{ain} > \frac{\beta_{in}}{2}, \quad \gamma_{ain} > \gamma_{cin} > \frac{\gamma_{ain}}{2}.$$

Select the Lyapunov function candidate for the overall backstepping control as
$$L_{in} = \sum_{j=1}^{n-1}L_{ij} + \frac{1}{4}s_{in}^4 + \frac{1}{2}\tilde{W}_{hin}^T\Gamma_{in}^{-1}\tilde{W}_{hin} + \frac{1}{2}\tilde{W}_{cin}^T\tilde{W}_{cin} + \frac{1}{2}\tilde{W}_{ain}^T\tilde{W}_{ain}$$

where $\tilde{W}_{hin}(t) = \hat{W}_{hin}(t) - W_{hin}$, $\tilde{W}_{cin}(t) = \hat{W}_{cin}(t) - W_{Jin}$, $\tilde{W}_{ain}(t) = \hat{W}_{ain}(t) - W_{Jin}$. Computing $\mathcal{L}$ of $L_{in}$, along with Eqs. (80), (93), (95) and (97), and then applying Eq. (96), results in the following:
$$\begin{aligned} \mathcal{L}L_{in} ={}& \sum_{j=1}^{n-1}\mathcal{L}L_{ij} + s_{in}^3\left(-\gamma_{in}s_{in} - \frac{1}{4}\beta_{in}s_{in}^3 - \hat{W}_{hin}^TS_{hin} - \frac{1}{2}\hat{W}_{ain}^TS_{Jin} + f_{in} - \mathcal{L}\hat{\alpha}_{i,n-1}\right) + \frac{3}{2}s_{in}^2\|\Psi_{in}\|^2 \\ &+ \tilde{W}_{hin}^T\left(S_{hin}s_{in}^3 - \sigma_{in}\hat{W}_{hin}\right) - \gamma_{cin}\tilde{W}_{cin}^TS_{Jin}S_{Jin}^T\hat{W}_{cin} - \tilde{W}_{ain}^TS_{Jin}S_{Jin}^T\left(\gamma_{ain}\left(\hat{W}_{ain} - \hat{W}_{cin}\right) + \gamma_{cin}\hat{W}_{cin}\right) \end{aligned}$$

The following expression is derived from Eq. (100):
$$\begin{aligned} \mathcal{L}L_{in} \leq{}& \sum_{j=1}^{n-1}\left(-a_{ij}L_{ij} + b_{ij}\right) - \left(\gamma_{in} - 4\right)s_{in}^4 - \frac{\sigma_{in}}{2\lambda_{\Gamma_{in}^{-1}}^{max}}\tilde{W}_{hin}^T\Gamma_{in}^{-1}\tilde{W}_{hin} \\ &- \frac{\gamma_{cin}}{2}\lambda_{S_{Jin}}^{min}\tilde{W}_{cin}^T\tilde{W}_{cin} - \frac{\gamma_{cin}}{2}\lambda_{S_{Jin}}^{min}\tilde{W}_{ain}^T\tilde{W}_{ain} + B_{in} \end{aligned}$$

where $\lambda_{\Gamma_{in}^{-1}}^{max}$ is the maximal eigenvalue of $\Gamma_{in}^{-1}$ and $\lambda_{S_{Jin}}^{min}$ is the minimal eigenvalue of $S_{Jin}S_{Jin}^T$. Moreover, $B_{in} = \left(\frac{\gamma_{cin}}{2} + \frac{\gamma_{ain}}{2}\right)\left(W_{Jin}^TS_{Jin}\right)^2 + \frac{\sigma_{in}}{2}\|W_{hin}\|^2 + \frac{1}{4}\left(\mathcal{L}\hat{\alpha}_{i,n-1}\right)^4 + \frac{1}{4}\varepsilon_{hin}^4 + \frac{9}{16}$, which satisfies $|B_{in}| \leq b_{in}$. Let $a_{in} = \min\left\{4\left(\gamma_{in} - 4\right), \frac{\sigma_{in}}{\lambda_{\Gamma_{in}^{-1}}^{max}}, \gamma_{cin}\lambda_{S_{Jin}}^{min}\right\}$, and then Eq. (101) becomes
$$\mathcal{L}L_{in} \leq \sum_{j=1}^{n}\left(-a_{ij}L_{ij} + b_{ij}\right).$$

Stability Analysis

Theorem 1: Consider the MASs described by Eq. (13), subject to Assumptions 1–2, operating within a directed graph and employing the adaptive laws Eqs. (32), (72) and (97), together with the virtual controllers Eqs. (31) and (71) and the actual controller Eq. (96). The containment control protocol guarantees the SGUUB of all signals within the closed-loop system. Furthermore, for any $t > 0$, tuning the design parameters leads the containment error to converge within an arbitrarily small neighborhood, as expressed by
$$\left\|y_i + \mathbb{L}_1^{-1}\mathbb{L}_2 y_d\right\| \leq \bar{\varepsilon}$$

Proof: Consider the overall Lyapunov function $L$ given by
$$L = \sum_{i=1}^{N}\sum_{j=1}^{n}L_{ij}$$

Define $a_i = \min\{a_{i1}, a_{i2}, \ldots, a_{in}\}$ and $b_i = \sum_{j=1}^{n}b_{ij}$. Subsequently, Eq. (104) can be expressed as
$$\mathcal{L}L \leq -a_iL + b_i$$

Based on Lemma 2, the following inequalities are deduced from Eq. (105):
$$E[L] \leq e^{-a_it}L(0) + \frac{b_i}{a_i}, \qquad E[L] \leq E[L(0)] + \frac{b_i}{a_i}.$$

For $s_{*1} = [s_{11}, s_{21}, \ldots, s_{N1}]^T$, based on the definition of $L_{in}$ and Eq. (99),
$$E\left[\|s_{*1}\|^4\right] = E\left[\left(s_{11}^2 + s_{21}^2 + \cdots + s_{N1}^2\right)^2\right] \leq E\left[N\left(s_{11}^4 + s_{21}^4 + \cdots + s_{N1}^4\right)\right] \leq 4N\left(E[L(0)] + \frac{b_i}{a_i}\right)$$

where $N$ denotes the number of follower agents. With Eq. (99), for any $\bar{\varepsilon} > 0$ the design parameters can be tuned such that
$$4N\left(E[L(0)] + \frac{b_i}{a_i}\right) \leq \bar{\varepsilon}\,\bar{\eta}(\mathbb{L}_1)^4$$

Taking Eq. (109) and Lemma 3 into account, one obtains
$$E\left[\left\|y_i + \mathbb{L}_1^{-1}\mathbb{L}_2 y_d\right\|^4\right] \leq \frac{E\left[\|s_{*1}\|^4\right]}{\bar{\eta}(\mathbb{L}_1)^4} \leq \bar{\varepsilon}$$

The proof is completed and the RL control strategy process diagram is illustrated in Fig. 2.

Figure 2: The RL control scheme.

Simulation Example

In this section, the effectiveness of the OB, RL and containment control scheme is illustrated by a numerical example. For the nonlinear stochastic MASs consisting of four followers and two leaders, the following system dynamics are considered:
$$\begin{cases} dx_{i1} = \left(0.9x_{i2} - 0.8x_{i1}^2\sin x_{i2}\right)dt + \psi_{i1}(\bar{x}_{i1})\,dw \\ dx_{i2} = \left(u_i + 0.9\sin x_{i1}\right)dt + \psi_{i2}(\bar{x}_{i2})\,dw \end{cases}$$

where $x_{i1}, x_{i2} \in \mathbb{R}$, $u_i \in \mathbb{R}$ is the control input, $\psi_{i1}(\bar{x}_{i1}) = 0.3\sin x_{i1}$, and $\psi_{i2}(\bar{x}_{i2}) = 0.01\sin(0.1\sin x_{i1})$. The leaders are defined as
$$y_{5r} = 0.1\sin(2t) - 0.1, \qquad y_{6r} = 0.45 - 0.5e^{-(t+2)}$$

The communication graph that we used in the simulation is visualized in Fig. 3.

Figure 3: Communication graph.

According to Fig. 3, the Laplacian matrix is
$$\mathbb{L} = \begin{bmatrix} 2 & -1 & 0 & 0 & -1 & 0 \\ 0 & 2 & 0 & 0 & -1 & -1 \\ 0 & -1 & 3 & -1 & 0 & -1 \\ -1 & 0 & 0 & 2 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}.$$
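The structure of Eq. (2) can be verified numerically: extracting $\mathbb{L}_1$ and $\mathbb{L}_2$ from this matrix and forming $-\mathbb{L}_1^{-1}\mathbb{L}_2$ should yield nonnegative entries with unit row sums, as stated in Lemma 1. A short sketch:

```python
import numpy as np

L = np.array([[ 2., -1.,  0.,  0., -1.,  0.],
              [ 0.,  2.,  0.,  0., -1., -1.],
              [ 0., -1.,  3., -1.,  0., -1.],
              [-1.,  0.,  0.,  2.,  0., -1.],
              [ 0.,  0.,  0.,  0.,  0.,  0.],
              [ 0.,  0.,  0.,  0.,  0.,  0.]])

N, M = 4, 2
L1, L2 = L[:N, :N], L[:N, N:]          # follower-follower and follower-leader blocks
W = -np.linalg.inv(L1) @ L2            # convex-combination weights of the leaders
print(W.min() >= 0, np.allclose(W.sum(axis=1), 1.0))  # True True
```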

The NN update parameters are designed as $\gamma_{ai1} = 20$, $\gamma_{ai2} = 15$, $\gamma_{ci1} = 14$, $\gamma_{ci2} = 14$, $\sigma_{i1} = 14$. The design parameters for the optimized virtual control action $\hat{\alpha}_{i1}$ corresponding to Eq. (41) are $\gamma_{i1} = 12$, $\beta_{i1} = 5$. The parameters of the optimized actual control action corresponding to Eq. (42) are set as $\gamma_{i2} = 5$, $\beta_{i2} = 2$.

The simulation results, illustrating the application of the proposed OB method to stochastic nonlinear MASs, are presented in Figs. 4–12. Figures 4–6 depict the boundedness of the actor, identifier, and critic NN weights. The optimized virtual control action $\hat{\alpha}_{i1}$ and the optimized actual control $\hat{u}_i$ are illustrated in Figs. 7 and 8. Figure 9 displays the trajectories of leaders and followers, demonstrating the asymptotic convergence of all followers to the convex hull formed by the leaders. The distributed containment errors are shown in Figs. 10 and 11. The results verify that all closed-loop system signals are SGUUB and that the OB method used in MASs can achieve the desired control performance. Besides, Fig. 12 shows the error curve without the adaptive compensation scheme of this paper. Comparing the simulation results, it can be seen that through RL, adjusting the adaptive rate accelerates the convergence speed of the optimization algorithm, allowing sensor errors to converge more quickly.

Figure 4: The actor NN weight a in step m.

Figure 5: The identifier NN weight h in step m.

Figure 6: The critic NN weight c in step m.

Figure 7: The optimized virtual control action in step 1.

Figure 8: The optimized actual control action in step 2.

Figure 9: The trajectories of four followers and two leaders.

Figure 10: The distributed containment errors s in step 1.

Figure 11: The distributed containment errors s in step 2.

Figure 12: The distributed containment errors q in step 2.

Conclusion

This article introduces an optimized backstepping control based on RL, developed and applied to a class of nonlinear stochastic strict-feedback MASs experiencing sensor faults. By crafting virtual and actual controls as optimized solutions for their respective subsystems, an overall optimization of the backstepping control has been achieved. To address sensor faults, an adaptive neural network compensation control method has been constructed. Utilizing the RL framework based on neural network approximation, the RL updating rules have been deduced from the negative gradient of a simple positive function linked to the HJB equation. In comparison with existing methods, this approach not only significantly simplifies the RL algorithm but also relaxes the requirements for known dynamics and persistent excitation. Additionally, the proposed control scheme ensures that the outputs of all followers converge to the dynamic convex hull formed by the leaders.

Supplemental Information

Simulation Code

The reinforcement learning (RL) framework based on neural network approximation is employed, deriving RL update rules from the negative gradient of a simple positive function correlated with the Hamilton–Jacobi–Bellman (HJB) equation. This significantly simplifies the RL algorithm while relaxing the constraints of known dynamics and persistent excitation.

DOI: 10.7717/peerj-cs.2126/supp-1