Optimizing implementations of linear layers using two and higher input XOR gates

PeerJ Computer Science


In recent years, lightweight cryptography has gained increasing attention due to the usage of resource-constrained devices like smart devices, wearable devices, and Internet of Things (IoT) devices. Since these small, constrained devices can manipulate private data, designing novel, lighter cryptographic primitives with low implementation costs is crucial (Duval & Leurent, 2018).

A block cipher basically consists of three components: a non-linear layer, a linear (or diffusion) layer, and a key schedule. The branch number measures the diffusion rate of a linear layer: a diffusion layer that uses an n×n diffusion matrix has a branch number of at most n+1, and a layer attaining this maximum ensures maximum diffusion. Matrices that satisfy this condition, namely maximum distance separable (MDS) matrices, are commonly utilized in block ciphers (Rijmen et al., 1996; Daemen, Knudsen & Rijmen, 1997; Daemen & Rijmen, 2002; Schneier et al., 1998). Furthermore, MDS matrices are used in stream ciphers (Watanabe et al., 2002) and hash functions (Barreto, Rijmen & Nv, 2000; Choy et al., 2012; Gazzoni Filho, Barreto & Rijmen, 2006; Gauravaram et al., 2009; Guo, Peyrin & Poschmann, 2011). In the literature, there are different methods to construct efficient MDS matrices, such as direct construction (Cui, Jin & Kong, 2015; Sajadieh et al., 2012; Gupta & Ray, 2013; Güzel et al., 2019), search-based construction (Wu, Wang & Wu, 2013; Chand Gupta & Ghosh Ray, 2014; Li & Wang, 2016; Sarkar & Syed, 2016, 2017; Sakalli et al., 2020), and hybrid construction (Pehlivanoğlu et al., 2018).

Despite having the maximum diffusion property, MDS matrices have high implementation costs. Finding lightweight MDS diffusion layers, especially involutory ones, with minimized hardware requirements is therefore a challenging task (Toh et al., 2018). The cost of implementation can be quantified using two key metrics: XOR count and circuit depth. The XOR count comes in three types, namely direct-XOR (d-XOR), sequential-XOR (s-XOR), and general-XOR (g-XOR) count (see “Definition and Notations” section for more details). XOR count represents the number of XOR gates required for the circuit implementation of the diffusion matrix, while circuit depth refers to the number of layers of gates required to implement the linear layer in hardware. Minimizing the number of gates, particularly expensive ones such as XOR gates, ensures low-cost and efficient circuit implementations of the diffusion matrix. Similarly, minimizing circuit depth reduces latency. In addition, Gate Equivalent (GE) is a metric used to compare the area of logic gates. Banik, Funabiki & Isobe (2019) observed that using higher-input XOR gates may result in a reduced area (i.e., GE) in ASIC libraries. This idea opened new directions in research on optimized circuit implementations; researchers started to search for circuits not only with the minimum number of XOR gates and circuit depth but also with reduced GE cost by considering multiple-input XOR gates (e.g., Baksi et al., 2021; Banik, Funabiki & Isobe, 2019; Liu et al., 2022b).

Related work

To address the challenge of finding efficient circuit implementations of a given linear layer, a variety of local optimization techniques (e.g., Gupta & Ray, 2013; Sim et al., 2015; Beierle, Kranz & Leander, 2016; Sarkar & Sim, 2016; Sarkar & Syed, 2017; Pehlivanoğlu et al., 2018) were initially proposed in order to reduce the XOR count. Local optimization means selecting the coefficients of the matrix with minimum XOR counts, but it does not guarantee efficient circuits, because the fixed cost of connecting the entries is left unoptimized in local optimization methods.

Researchers then started to address the task of globally optimizing linear layers. This involves estimating the hardware cost of a linear layer by identifying a straight-line program (SLP) that corresponds to it (Li et al., 2019). Two such algorithms are the Paar1 and Paar2 heuristics (Paar, 1997), which generate cancellation-free SLPs, i.e., SLPs in which the two operands of an XOR never share a variable. Although Paar’s heuristics do not necessarily yield optimal circuit implementations, the Paar1 heuristic is easy to implement and produces results quickly, even for large matrix sizes. Boyar-Peralta’s heuristic (Boyar & Peralta, 2010) and its variant (Boyar, Matthews & Peralta, 2012) inspired new global optimization heuristics. In Boyar, Find & Peralta (2017), the authors modified Paar’s heuristic by using preprocessing steps and allowing cancellations. The same article also addressed minimizing the number of AND gates in addition to XOR gates. In Duval & Leurent (2018), the authors proposed a new approach based on searching the circuit space to find optimal circuits of MDS matrices by using the tree-based Dijkstra searching technique. In Li et al. (2019), the authors modified Boyar-Peralta’s heuristic (Boyar, Matthews & Peralta, 2012) by considering the circuit depth metric when determining circuit implementations. Boyar, Find & Peralta (2019) proposed a new heuristic that creates smaller linear and nonlinear circuits for a given circuit depth bound. Tan & Peyrin (2019) proposed the Randomized Normal Boyar Peralta (RNBP) heuristic and two non-deterministic algorithms, A1 and A2. All these heuristics focus on reducing the XOR count by using temporary intermediate signals (gates) to determine globally optimized implementations of a diffusion matrix. In Banik, Funabiki & Isobe (2021), the authors obtained lower circuit depth implementations by adding randomness to the tweaked algorithm given in Li et al. (2019) (this new version is simply called the BFI heuristic). In Liu et al. (2022a), with a particular focus on the low-latency criterion, the authors proposed a new framework based on forward and backward search strategies that can find solutions with a minimized circuit depth. Pehlivanoglu & Demir (2022) designed a new framework that combines some of these recently proposed global optimization heuristics to find better circuit implementations. It should be noted that all the global optimization methods given above generate circuit implementations with two-input XOR gates under the g-XOR metric.

The idea given in Banik, Funabiki & Isobe (2019) has opened up new directions for research on the usage of multiple-input XOR gates in SLP. In the same article, Banik, Funabiki & Isobe (2019) designed a graph-based heuristic to explore circuits featuring both two-input and three-input XOR gates. Specifically, they converted circuits constructed using only two-input gates into new ones with a combination of two-input and three-input XOR gates. Then, Baksi et al. (2021) introduced enhanced versions of BP heuristic (originally presented in Boyar & Peralta (2010) and Tan & Peyrin (2019)), simply called BDKCI. These improved versions support two-input, three-input, and four-input XOR gates. Recently, Liu et al. (2022b) proposed two algorithms: the transform algorithm and the graph extending algorithm. By combining these two algorithms, they generated better circuit implementations.

In the literature, there are few heuristics based on optimizing implementations of diffusion matrices under only the s-XOR metric without using any temporary intermediate signals. Optimizing a diffusion matrix under the s-XOR metric is based on the problem of optimal pivoting in the Gauss-Jordan elimination (Kölsch, 2019). In Jean et al. (2017), the authors proposed an exhaustive search algorithm to find out the optimal circuit implementations for small matrix sizes such as 4×4 and 8×8 under s-XOR metric. Xiang et al. (2020) proposed a new heuristic, called XZLBZ, that was capable of reducing the implementation cost (in terms of s-XOR count) of 16×16 and 32×32 involutory/non-involutory binary MDS matrices. More recently, Yang, Zeng & Wang (2021) proposed a new heuristic, called IX algorithm, which was an improved variant of the heuristic given in Xiang et al. (2020). IX heuristic found better circuit implementation under the same run-time with higher accuracy. However, the circuit depth metric is not taken into consideration in all of these heuristics designed to decrease the s-XOR count.

Motivation and our contribution

In this article, we focus on two challenging research questions: (1) how to improve BP heuristic by considering the circuits using two-input XOR gates with low latency criteria (especially for depth 3), and (2) how to enhance BDKCI heuristic by incorporating circuit depth awareness. To address the first research question, we propose a new heuristic, called SBP, that is the improved version of Boyar-Peralta’s heuristic (Boyar, Matthews & Peralta, 2012) by considering low latency criteria. We introduce a new randomized way of choosing actions that would lead to better circuit solutions, especially with minimum circuit depth 3. To address the second research question, we give the enhanced (depth-bounded) version of BDKCI heuristic that is capable of producing depth-limited circuits.

The main contributions of this article can be given as follows:

  • We give a new 4×4 involutory MDS linear layer over $\mathbb{F}_{2^4}$ whose circuit implementation requires the lowest number of XORs (i.e., 41 g-XORs, saving four compared with the previous best result (Liu et al., 2022a)) with the minimum depth 3.

  • We apply our new heuristic SBP to the previously given 4×4 linear layers and find many low-latency circuits which are better than the other best previous results given in Liu et al. (2022a).

  • For further improvement, we enhance the recently proposed BDKCI heuristic algorithm by incorporating circuit depth awareness, which limits the depth of the circuits created.

  • By using the proposed circuit depth-bounded version of BDKCI, we present better circuit implementations of linear layers of block ciphers than those given in the literature. Moreover, the given circuit for the AES MixColumn matrix only requires 44 XOR gates/depth 3/240.95 GE in ASIC4 library, while the previous best-known result is 55 XOR gates/depth 5/243.00 GE.

  • Our new 4×4 involutory MDS matrix requires only 19 XOR gates with depth 3 by using the circuit depth-bounded version of BDKCI. In the ASIC1, ASIC2 (STM 65 nm), ASIC3 (TSMC 65 nm), and ASIC4 libraries, the circuit costs are 79.75, 88.486, 100.65, and 101.84 GE, respectively. These results not only outperform the state-of-the-art but also demonstrate that our circuit has the lowest cost.

All the source codes (SBP heuristic, depth-bounded version of BDKCI) and experimental results are available at https://github.com/demirmehmet0/SBP.


This article is organized as follows: In “Definition and Notations” section, we give some basic notations and definitions. In “SBP Heuristic” section, we propose a new heuristic SBP for global optimization to generate low-latency circuit implementations of linear layers. Next, in “Depth-bounded version of BDKCI Heuristic” section, we present the depth-bounded version of BDKCI heuristic and some good experimental results. Finally, we conclude and highlight some possible future works for further results in “Conclusion and Future Works” section.

Definition and notations

This section reviews the fundamental mathematical principles concerning finite fields and MDS matrices. In this context, definitions and notations are introduced.

The finite field $\mathbb{F}_{2^m}$ is defined by an irreducible polynomial $p(x)$ of degree $m$ over $\mathbb{F}_2$ and can be denoted as $\mathbb{F}_2[x]/(p(x))$. Each element of $\mathbb{F}_{2^m}$ can be represented as $\sum_{i=0}^{m-1} b_i \alpha^i$, where $b_i \in \mathbb{F}_2$ and $\alpha$ is a root of $p(x)$. For simplicity’s sake, hexadecimal notation is used to represent the elements of $\mathbb{F}_{2^m}$ and the irreducible polynomial $p(x)$, e.g., the irreducible polynomial $p(x) = x^4 + x + 1$ can be denoted as 0x13.
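The hexadecimal encoding can be made concrete with a minimal sketch of field multiplication. The function name `gf_mul` and its parameters are illustrative choices, not part of the article; elements are encoded as integers whose bit i holds the coefficient $b_i$.

```python
def gf_mul(a: int, b: int, poly: int = 0x13, m: int = 4) -> int:
    """Multiply a and b in F_2[x]/(p(x)); poly encodes p(x), e.g. 0x13 = x^4 + x + 1."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a        # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                # multiply a by x
        if a & (1 << m):
            a ^= poly          # reduce modulo p(x)
    return result


# x * x^3 = x^4, which reduces to x + 1 (0x3) under p(x) = 0x13
print(hex(gf_mul(0x2, 0x8)))          # 0x3
# the same product under p(x) = 0x19 = x^4 + x^3 + 1 reduces to x^3 + 1 (0x9)
print(hex(gf_mul(0x2, 0x8, 0x19)))    # 0x9
```

The two calls show that the same pair of encoded elements has a different product under a different irreducible polynomial, which is why the polynomial is always stated alongside the field.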

An n×n matrix over the finite field $\mathbb{F}_{2^m}$ can be represented as $M_n(\mathbb{F}_{2^m})$, and the binary representation of the same n×n matrix (each entry of which is an m×m invertible binary matrix) over the same finite field can be denoted as $M_n(GL(m, \mathbb{F}_2))$.

Definition 1. (MDS Matrix) Let $C$ be an $[n, k, d]$ code and $G = [I \mid A]$ be a generator matrix of $C$, where $A$ is a $k \times (n-k)$ matrix. $A$ is an MDS matrix if and only if every square sub-matrix of $A$ is nonsingular. If $A$ also satisfies $A = A^{-1}$, $A$ is an involutory MDS matrix.
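Definition 1 translates directly into a brute-force check for small matrices. The following is a minimal sketch, not the authors' code: all function names are illustrative, and the integer encodings of the field elements are computed here from the primitive polynomial 0x19 rather than taken from the article. The test matrix is the Hadamard matrix had(1, α^5, α^14, α^7) that the article uses later in Example 6.

```python
from itertools import combinations

POLY, M = 0x19, 4  # F_{2^4} with p(x) = x^4 + x^3 + 1


def gf_mul(a, b):
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def alpha_pow(e):
    """Integer encoding of alpha^e, where alpha = x (encoded as 2)."""
    v = 1
    for _ in range(e % ((1 << M) - 1)):
        v = gf_mul(v, 2)
    return v


def det(mat):
    """Laplace expansion; in characteristic 2 every cofactor sign is +."""
    if len(mat) == 1:
        return mat[0][0]
    total = 0
    for c, entry in enumerate(mat[0]):
        minor = [row[:c] + row[c + 1:] for row in mat[1:]]
        total ^= gf_mul(entry, det(minor))
    return total


def is_mds(mat):
    """True if every square sub-matrix is nonsingular."""
    n = len(mat)
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for cols in combinations(range(n), k):
                if det([[mat[r][c] for c in cols] for r in rows]) == 0:
                    return False
    return True


def is_involutory(mat):
    """True if mat * mat equals the identity matrix."""
    n = len(mat)
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s ^= gf_mul(mat[i][k], mat[k][j])
            if s != (1 if i == j else 0):
                return False
    return True


def hadamard(a):
    """Finite field Hadamard matrix: entry (i, j) is a[i XOR j]."""
    return [[a[i ^ j] for j in range(len(a))] for i in range(len(a))]


H1 = hadamard([alpha_pow(0), alpha_pow(5), alpha_pow(14), alpha_pow(7)])
print(is_mds(H1), is_involutory(H1))
```

The exhaustive minor check is only practical for small n; for a 4×4 matrix it examines 69 sub-matrices.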

The GHadamard matrix form proposed in Pehlivanoğlu et al. (2018) is a hybrid construction method for (involutory) MDS matrices. A k×k GHadamard matrix GH is generated by using the non-zero parameters $b_i$ and their inverses together with a k×k Finite Field Hadamard (simply Hadamard) matrix H over $\mathbb{F}_{2^m}$. A 4×4 GHadamard matrix GH can be denoted as follows:

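Definition 2. (GHadamard Matrix) Let

$$H = \begin{bmatrix} a_0 & a_1 & a_2 & a_3 \\ a_1 & a_0 & a_3 & a_2 \\ a_2 & a_3 & a_0 & a_1 \\ a_3 & a_2 & a_1 & a_0 \end{bmatrix}$$

be a 4×4 Hadamard matrix. The 4×4 GHadamard matrix form $GH = Ghad(a_0, a_1; b_1, a_2; b_2, a_3; b_3)$, where $b_1, b_2, b_3 \in \mathbb{F}_{2^m} \setminus \{0\}$, can be shown as follows:

$$GH = \begin{bmatrix} a_0 & a_1 b_1 & a_2 b_2 & a_3 b_3 \\ a_1 b_1^{-1} & a_0 & a_3 b_1^{-1} b_2 & a_2 b_1^{-1} b_3 \\ a_2 b_2^{-1} & a_3 b_2^{-1} b_1 & a_0 & a_1 b_2^{-1} b_3 \\ a_3 b_3^{-1} & a_2 b_3^{-1} b_1 & a_1 b_3^{-1} b_2 & a_0 \end{bmatrix}$$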
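A minimal sketch of the construction follows, under the assumption that the GHadamard form is the diagonal conjugation $D^{-1} H D$ with $D = \mathrm{diag}(1, b_1, b_2, b_3)$. This assumption reproduces the matrix GH1 of Example 6 entry by entry; the function names and the integer encodings (computed from the primitive polynomial 0x19) are illustrative and not taken from the article.

```python
POLY, M = 0x19, 4  # F_{2^4} with p(x) = x^4 + x^3 + 1


def gf_mul(a, b):
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def alpha_pow(e):
    """Integer encoding of alpha^e, where alpha = x (encoded as 2)."""
    v = 1
    for _ in range(e % ((1 << M) - 1)):
        v = gf_mul(v, 2)
    return v


def gf_inv(a):
    """Inverse of a nonzero element: a^(2^m - 2)."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r


def ghad(a, b):
    """Assumed form: entry (i, j) = d_i^{-1} * a[i XOR j] * d_j with d = (1, b1, b2, b3)."""
    d = [1] + list(b)
    n = len(a)
    return [[gf_mul(gf_mul(gf_inv(d[i]), a[i ^ j]), d[j]) for j in range(n)]
            for i in range(n)]


# Parameters of Example 6: H1 = had(1, a^5, a^14, a^7), b1 = a^9, b2 = a^2, b3 = a^9
GH1 = ghad([alpha_pow(e) for e in (0, 5, 14, 7)],
           [alpha_pow(e) for e in (9, 2, 9)])

# Exponents of alpha read from the matrix GH1 printed in Example 6
EXPECTED_EXPONENTS = [[0, 14, 1, 1],
                      [11, 0, 0, 14],
                      [12, 14, 0, 12],
                      [13, 14, 13, 0]]
print(GH1 == [[alpha_pow(e) for e in row] for row in EXPECTED_EXPONENTS])  # True
```

Because the construction is a conjugation by a diagonal matrix, it preserves both the involutory property and the nonsingularity of every sub-matrix of H.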



To compute the hardware implementation cost of a diffusion matrix in terms of XOR count, there are three important metrics: direct XOR (d-XOR) count (Khoo et al., 2014), sequential XOR (s-XOR) count (Jean et al., 2017), and general-XOR (g-XOR) count (we used the same abbreviation given in Xiang et al. (2020)).

Definition 3. The d-XOR count is defined as the Hamming weight (sum of the nonzero elements) of the n×n invertible binary matrix minus n.
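As a worked instance of Definition 3, the sketch below computes the d-XOR count of the matrix GH1 from Example 6. It assumes the usual binary representation in which each field element β is replaced by the m×m matrix of multiplication by β (column k holds the bits of β·α^k); the helper names and the exponent table (read from Example 6) are illustrative.

```python
POLY, M = 0x19, 4  # F_{2^4} with p(x) = x^4 + x^3 + 1


def gf_mul(a, b):
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def alpha_pow(e):
    v = 1
    for _ in range(e % ((1 << M) - 1)):
        v = gf_mul(v, 2)
    return v


def mul_matrix_weight(beta):
    """Hamming weight of the m x m binary matrix of multiplication by beta."""
    return sum(bin(gf_mul(beta, 1 << k)).count("1") for k in range(M))


def d_xor(mat):
    """Hamming weight of the (n*m) x (n*m) binary matrix minus n*m."""
    n = len(mat)
    weight = sum(mul_matrix_weight(e) for row in mat for e in row)
    return weight - n * M


# GH1 from Example 6, written as exponents of alpha
GH1 = [[alpha_pow(e) for e in row] for row in
       [[0, 14, 1, 1], [11, 0, 0, 14], [12, 14, 0, 12], [13, 14, 13, 0]]]

entry_cost = sum(mul_matrix_weight(e) - M for row in GH1 for e in row)
print(entry_cost, d_xor(GH1))  # 21 69
```

The first number is the sum of the XOR counts of the 16 entries, and the difference 69 − 21 = 48 = 4×3×4 is the fixed cost of adding the entries of each row, which matches the decomposition quoted in Example 6.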

Definition 4. The s-XOR count is defined as the minimum number of XOR operations necessary to implement an n×n invertible binary matrix using in-place operations. In other words, given the input vectors $\{x_0, x_1, \ldots, x_{n-1}\}$ of the n×n invertible binary matrix, the output vectors $\{y_0, y_1, \ldots, y_{n-1}\}$ are calculated using in-place operations $x_i \leftarrow x_i \oplus x_j$, where $0 \le i, j \le n-1$.

Definition 5. The g-XOR count is defined as the minimum number of required operations $x_i \leftarrow x_{j_1} \oplus x_{j_2}$, where $0 \le j_1, j_2 < i$.

Under the d-XOR metric, some intermediate values may be computed repeatedly, so the metric overestimates the cost of the final circuit (Yang, Zeng & Wang, 2021). Therefore, the s-XOR and g-XOR metrics are used for further evaluation. However, computing the s-XOR count has a high computational cost, especially when optimizing full MDS matrices (Duval & Leurent, 2018).
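The overestimation can be seen on a toy 3×3 binary matrix. The sketch below is illustrative only (the matrix and helper names are not from the article): it applies a sequence of in-place operations as in Definition 4 and compares the sequence length with the d-XOR count of the resulting matrix.

```python
def apply_in_place(n, ops):
    """Start from the unit vectors x_0..x_{n-1} (as bitmasks) and apply x_i <- x_i XOR x_j."""
    rows = [1 << i for i in range(n)]
    for i, j in ops:
        rows[i] ^= rows[j]
    return rows


def d_xor(rows):
    """Hamming weight of the binary matrix minus its dimension."""
    return sum(bin(r).count("1") for r in rows) - len(rows)


# x1 <- x1 + x2, then x0 <- x0 + x1 (which now already contains x2)
ops = [(1, 2), (0, 1)]
rows = apply_in_place(3, ops)

# rows encode y0 = x0+x1+x2, y1 = x1+x2, y2 = x2
print([bin(r) for r in rows])    # ['0b111', '0b110', '0b100']
print(d_xor(rows), len(ops))     # 3 2
```

The d-XOR count charges three XORs because it recomputes $x_1 \oplus x_2$ for the first output, whereas the in-place sequence reuses it and needs only two, so the s-XOR count of this matrix is at most 2.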

SBP heuristic

The SBP heuristic starts from Boyar-Peralta’s heuristic but uses a different structure for choosing new base elements. SBP uses a threshold value that fixes how many of the best candidate pairs are retained, where the best pairs are those that minimize the sum of distances and, in case of a tie, maximize the Euclidean norm. It then performs a randomization step that picks one of these pairs by using the uniform integer distribution function. This function produces integer values in the range [0, threshold value] according to a uniform discrete distribution. In our initial experiments, we tested several random number distributions based on the Mersenne Twister algorithm (Matsumoto & Nishimura, 1998), such as uniform, normal, and sampling distributions. The findings showed that the uniform integer distribution was the most effective (in terms of the XOR count of the generated circuit) for our research problem. Therefore, we chose it for further experiments.

We present all the details in Algorithm 1. In Algorithm 1, S denotes the sequence of input signals (i.e., the $x_i$'s), D keeps track of the circuit depth of the elements of S, and Δ is the distance vector, where $\delta_H(S, y_i)$ represents the Hamming distance from S to the output signal $y_i$. SBP picks signal pairs that maximize the Euclidean norm of the updated distance vector Δ while respecting the circuit depth limit. Here, however, the algorithm keeps a specified number of candidate pairs (determined by the chosenParam parameter) and then applies the uniform discrete distribution function to select the new base element among them. It repeats these steps until all elements of Δ are equal to zero. The idea behind SBP is to pair up the input signals that minimize the target values in the distance vector.

The chosenParam parameter defines the size of the candidate space. If this size equals the maximum number of selectable elements, SBP yields outcomes equal to those of other optimization algorithms. By restricting the number of elements in this space (through the chosenParam value), SBP selects only among the better candidates. The chosenParam value can be selected based on (1) the size of the matrix and (2) the runtime of other optimization algorithms. For small matrix sizes, the value should be lower than for larger sizes. Regarding the second criterion, if generating the circuit for the same matrix takes a long time with other optimization algorithms, it is advisable to keep chosenParam low; if it takes a short time, a higher threshold is recommended. However, when chosenParam is set excessively high, the algorithm may enter an infinite loop, since it becomes difficult to select among the elements or to find any suitable element at all. The upper bound for the threshold is governed by the number of elements the algorithm places in the candidate list (i.e., the allElement array) during each element selection. For example, if there are 10 elements in the allElement array within one iteration, the threshold value should not exceed 10. Since this number varies with each iteration of the algorithm, a precise threshold value cannot be computed; an average threshold value can be used instead.
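The selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: candidates are assumed to be tuples `(total_distance, norm, i, j)`, the pool is obtained by sorting rather than by the buffer logic of Algorithm 1, and the pool is clamped to the number of available candidates, which sidesteps the problem of a threshold larger than the candidate list. Python's `random` module is based on the Mersenne Twister, and `randrange` draws from a uniform discrete distribution.

```python
import random


def pick_candidate(candidates, chosen_param, rng):
    """Pick uniformly at random among the `chosen_param` best candidates.

    candidates: list of (total_distance, norm, i, j) tuples (hypothetical layout).
    A lower total distance is better; ties are broken by a larger Euclidean norm.
    """
    ranked = sorted(candidates, key=lambda c: (c[0], -c[1]))
    pool = ranked[:max(1, chosen_param)]      # never larger than the candidate list
    return pool[rng.randrange(len(pool))]     # uniform discrete choice


rng = random.Random(2024)                      # fixed seed for reproducibility
candidates = [
    (10, 5.0, 0, 1),
    (9, 4.0, 0, 2),
    (9, 6.0, 1, 2),
    (12, 1.0, 2, 3),
]
print(pick_candidate(candidates, 1, rng))      # always the single best pair: (9, 6.0, 1, 2)
print(pick_candidate(candidates, 2, rng))      # one of the two best pairs
```

With a pool size of 1 the choice is deterministic and greedy, and larger pool sizes trade greediness for diversity across repeated runs.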

Algorithm 1:
SBP algorithm.
Input: (depthLimit, n, m, M)                /* M: an (n×m) binary matrix */
Output: S                                   /* an optimized decomposition of M */
Initialization
S ← [x1, x2, ..., xn]                       /* The input signals */
D ← [0, 0, ..., 0]                          /* Keeps track of the circuit depth of S; initialized to zero */
Δ ← [δH(S, y1), ..., δH(S, ym)]             /* The distances; the initial distance equals the Hamming weight of the row minus one */
j ← n
while Δ ≠ 0 do
  besti ← 0
  bestj ← 0
  bestDist ← [0, 0, ..., 0]
  counter ← 0
  for i ← 0 to BaseSize do
    if depth[i]+1 ≥ depthLimit then
      continue
    end if
    for j ← i+1 to BaseSize do
      if depth[j]+1 ≥ depthLimit then
        continue
      end if
      depthNewBase ← pow(2, Max(i,j)+1)
      thisDist ← totalDistance()
      if thisDist ≤ minDistance then
        thisNorm ← Update(thisNorm)
        if thisDist < minDistance then
          if depth[i]+1 ≥ depthLimit then
            continue
          end if
          thisNorm ← Update(thisNorm)
          if thisDist < minDistance || thisNorm > oldNorm then
            if counter > chosenParam then   /* chosenParam: defines the threshold value */
              counter ← 0
            end if
            allElement[counter] ← dist&i&j
            counter++
          end if
        end if
      end if
    end for
  end for
  number ← uniformIntDistribution(0, counter)   /* Randomization */
  bestDist&besti&bestj ← allElement[number]
  update(Base)                              /* update base */
  update(D)                                 /* update depth */
  update(Δ)                                 /* update distance */
end while
return S
DOI: 10.7717/peerj-cs.1820/table-9

While SBP shares its foundational traits with A1, A2, and RNBP, its performance can be attributed to its approach to element storage. Other optimization algorithms explore all possibilities during element storage, which significantly expands the search space and often generates numerous divergent paths. SBP takes a more controlled approach: it stores the elements acquired through the element selection process of the BP algorithm only up to a specified limit. This strategy ensures that the highest-quality elements remain readily accessible within the stored values, and selecting from this pool of top-tier elements focuses the search on better results. Consequently, this approach narrows down the search space and leads to lower XOR counts.

Better circuit implementations for 4 × 4 low-latency involutory MDS matrices by using SBP heuristic

In this subsection, we apply our new heuristic SBP to existing and new linear layers and find numerous low-latency candidates for circuit implementations. Notably, we give a new 4×4 involutory MDS matrix over $\mathbb{F}_{2^4}$/0x19 which can be implemented with only 41 g-XOR gates and depth 3 by applying the SBP global optimization method, while the previous best result requires 45 g-XOR gates (Liu et al., 2022a) at the same depth.

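Example 6. Let $\mathbb{F}_{2^4}$ be generated by the primitive element $\alpha$, which is a root of the primitive polynomial 0x19. Consider the 4×4 Hadamard involutory MDS matrix $H_1 = had(1, \alpha^5, \alpha^{14}, \alpha^7)$ over $\mathbb{F}_{2^4}$/0x19. Then, the GHadamard matrix $GH_1 = Ghad(1, \alpha^5; \alpha^9, \alpha^{14}; \alpha^2, \alpha^7; \alpha^9)$ corresponding to $H_1$ with parameters $b_1 = \alpha^9$, $b_2 = \alpha^2$, and $b_3 = \alpha^9$ is given as follows:

$$GH_1 = \begin{bmatrix} 1 & \alpha^{14} & \alpha & \alpha \\ \alpha^{11} & 1 & 1 & \alpha^{14} \\ \alpha^{12} & \alpha^{14} & 1 & \alpha^{12} \\ \alpha^{13} & \alpha^{14} & \alpha^{13} & 1 \end{bmatrix}$$

which is an involutory MDS matrix with a d-XOR count of 69 (= 21 + 4×3×4). After applying the SBP heuristic to the matrix $GH_1$, we find a circuit with 41 g-XORs at depth 3.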

In Table 1, we provide the circuit implementation and computation sequence of the matrix GH1 by applying SBP heuristic with threshold value 7. Moreover, we look for more efficient low latency circuit implementations of the matrix GH1, so we compare our obtained implementation with the results from different state-of-the-art heuristics. We ran all the algorithms for eight hours for the matrix GH1 by taking the number of XOR gates into account with respect to the minimum depth, then we present all the implementation costs in Table 2. As shown in Table 2, our proposed heuristic leads to better circuit results in terms of circuit depth (not only depth 3 but also different depths) than the other heuristics given in the literature.

Table 1:
The global optimization result of GH1 with 41 g-XORs and depth 3, where $x_i \in (x_0, x_1, \ldots, x_{15})$, $y_j \in (y_0, y_1, \ldots, y_{15})$, and $t_k \in (t_1, t_2, \ldots, t_{41})$ refer to the input signals, output signals, and temporary intermediate signals, respectively; the values given in parentheses refer to circuit depth.
Iter. New base element New distance vector Δ
1 t1=x1+x9 (1) [3,3,4,5,4,3,6,4,5,4,6,3,3,5,5,2]
2 t2=x0+x8 (1) [3,3,4,5,3,3,6,4,4,4,6,3,3,4,4,2]
3 t3=x2+x14 (1) [3,3,4,5,3,2,6,4,4,4,5,2,3,4,4,2]
4 t4=x4+x13 (1) [3,3,3,5,2,2,6,4,4,4,4,2,3,4,4,2]
5 t5=x0+x12 (1) [3,3,3,5,2,2,5,3,4,3,4,2,3,4,4,2]
6 t6=x6+t5 (2) [3,2,3,5,2,2,4,3,4,2,4,2,3,4,4,2]
7 t7=x1+t2 (2) [3,1,3,5,1,2,4,3,4,2,4,2,3,4,4,2]
8 t8=t6+t7 [y1] (3) [3,0,3,5,1,2,4,3,4,2,4,2,3,4,4,2]
9 t9=t4+t7 [y4] (3) [3,0,3,5,0,2,4,3,4,2,4,2,3,4,4,2]
10 t10=x2+x10 (1) [3,0,3,5,0,2,3,3,4,2,4,2,2,4,4,2]
11 t11=x3+x11 (1) [3,0,3,4,0,2,3,2,4,2,4,2,2,3,4,2]
12 t12=x3+x15 (1) [3,0,3,4,0,2,2,2,3,2,4,2,2,3,4,2]
13 t13=x4+x11 (1) [3,0,3,3,0,2,2,2,3,2,4,1,2,3,4,2]
14 t14=t3+t13 [y11] (2) [3,0,3,2,0,2,2,2,3,2,4,0,2,3,4,2]
15 t15=x5+x12 (1) [3,0,3,2,0,2,2,2,2,2,4,0,1,3,4,2]
16 t16=t10+t15 [y12] (2) [3,0,3,2,0,2,2,2,2,2,4,0,0,3,4,2]
17 t17=t5+t11 (2) [2,0,3,2,0,2,2,1,2,2,4,0,0,3,4,2]
18 t18=x7+t17 [y7] (3) [2,0,3,2,0,2,2,0,2,2,4,0,0,3,4,2]
19 t19=x4+t1 (2) [2,0,3,2,0,2,2,0,2,2,4,0,0,3,3,1]
20 t20=x15+t19 [y15] (3) [2,0,3,2,0,2,2,0,2,2,4,0,0,3,3,0]
21 t21=x7+x14 (1) [2,0,3,2,0,2,2,0,2,2,3,0,0,3,2,0]
22 t22=t10+t12 (2) [2,0,3,1,0,2,1,0,2,2,3,0,0,3,2,0]
23 t23=t14+t22 [y3] (3) [2,0,3,0,0,2,1,0,2,2,3,0,0,3,2,0]
24 t24=t6+t22 [y6] (3) [2,0,3,0,0,2,0,0,2,2,3,0,0,3,2,0]
25 t25=t12+t15 (2) [1,0,3,0,0,2,0,0,1,2,3,0,0,3,2,0]
26 t26=t17+t25 [y0] (3) [0,0,3,0,0,2,0,0,1,2,3,0,0,3,2,0]
27 t27=t2+t25 [y8] (3) [0,0,3,0,0,2,0,0,0,2,3,0,0,3,2,0]
28 t28=x5+t1 (2) [0,0,3,0,0,1,0,0,0,2,3,0,0,3,2,0]
29 t29=t3+t28 [y5] (3) [0,0,3,0,0,0,0,0,0,2,3,0,0,3,2,0]
30 t30=x13+t1 (2) [0,0,3,0,0,0,0,0,0,1,3,0,0,3,2,0]
31 t31=t6+t30 [y9] (3) [0,0,3,0,0,0,0,0,0,0,3,0,0,3,2,0]
32 t32=t2+t21 (2) [0,0,3,0,0,0,0,0,0,0,3,0,0,3,1,0]
33 t33=t19+t32 [y14] (3) [0,0,3,0,0,0,0,0,0,0,3,0,0,3,0,0]
34 t34=t4+t21 (2) [0,0,2,0,0,0,0,0,0,0,2,0,0,3,0,0]
35 t35=x1+t10 (2) [0,0,2,0,0,0,0,0,0,0,1,0,0,3,0,0]
36 t36=t34+t35 [y10] (3) [0,0,2,0,0,0,0,0,0,0,0,0,0,3,0,0]
37 t37=x9+t3 (2) [0,0,1,0,0,0,0,0,0,0,0,0,0,3,0,0]
38 t38=t34+t37 [y2] (3) [0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0]
39 t39=x6+x13 (1) [0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0]
40 t40=t2+t11 (2) [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]
41 t41=t39+t40 [y13] (3) [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
DOI: 10.7717/peerj-cs.1820/table-1
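The computation sequence in Table 1 can be checked mechanically. The sketch below is illustrative (the parsing format and variable names are ours): it replays the 41 gates on symbolic inputs encoded as bitmasks, recomputes the depth of every gate, and derives the d-XOR count of the binary matrix that the circuit implements.

```python
# Gates transcribed from Table 1; each gate is a two-input XOR.
GATES = [
    "t1=x1+x9", "t2=x0+x8", "t3=x2+x14", "t4=x4+x13", "t5=x0+x12",
    "t6=x6+t5", "t7=x1+t2", "t8=t6+t7", "t9=t4+t7", "t10=x2+x10",
    "t11=x3+x11", "t12=x3+x15", "t13=x4+x11", "t14=t3+t13", "t15=x5+x12",
    "t16=t10+t15", "t17=t5+t11", "t18=x7+t17", "t19=x4+t1", "t20=x15+t19",
    "t21=x7+x14", "t22=t10+t12", "t23=t14+t22", "t24=t6+t22", "t25=t12+t15",
    "t26=t17+t25", "t27=t2+t25", "t28=x5+t1", "t29=t3+t28", "t30=x13+t1",
    "t31=t6+t30", "t32=t2+t21", "t33=t19+t32", "t34=t4+t21", "t35=x1+t10",
    "t36=t34+t35", "t37=x9+t3", "t38=t34+t37", "t39=x6+x13", "t40=t2+t11",
    "t41=t39+t40",
]
# Output labels [y_j] from Table 1: gate name -> output index j
OUTPUTS = {"t8": 1, "t9": 4, "t14": 11, "t16": 12, "t18": 7, "t20": 15,
           "t23": 3, "t24": 6, "t26": 0, "t27": 8, "t29": 5, "t31": 9,
           "t33": 14, "t36": 10, "t38": 2, "t41": 13}

value = {f"x{i}": 1 << i for i in range(16)}   # input x_i as a unit bitmask
depth = {f"x{i}": 0 for i in range(16)}        # inputs have depth 0

for gate in GATES:
    name, rhs = gate.split("=")
    a, b = rhs.split("+")
    value[name] = value[a] ^ value[b]           # XOR of the two operand expressions
    depth[name] = max(depth[a], depth[b]) + 1   # one more gate layer than the deeper operand

rows = [0] * 16
for name, j in OUTPUTS.items():
    rows[j] = value[name]                       # row j of the implemented binary matrix

xor_count = len(GATES)
circuit_depth = max(depth[name] for name in OUTPUTS)
d_xor = sum(bin(r).count("1") for r in rows) - 16

print(xor_count, circuit_depth, d_xor)          # 41 3 69
```

The recomputed depth of 3 and the d-XOR count of 69 agree with the values stated for GH1 in Example 6. The sketch does not fix a bit-ordering convention for mapping field elements to the signals $x_i$, so it does not by itself compare the rows against a particular binary expansion of GH1.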
Table 2:
Circuit cost (XOR count/depth) of GH1 under several global optimization algorithms.

| Matrix | Paar1 (Paar, 1997) | RPaar1 (Lin et al., 2021) | BP (Li et al., 2019) | A1 (Tan & Peyrin, 2019) | A2 (Tan & Peyrin, 2019) | RNBP (Tan & Peyrin, 2019) | Liu et al. (2022a) | SBP |
|---|---|---|---|---|---|---|---|---|
| GH1 | 37/8 | 37/5 | 44/3* | 41/5* (37/6) | 39/5* (37/5) | 41/4* (37/5) | 44/3* | (41/3, 39/4) |
DOI: 10.7717/peerj-cs.1820/table-2


Extracted from its original algorithm.
Extracted from the framework given in Lin et al. (2021).

Bold values indicate the best results.

Furthermore, in Table 3, we consider several 4×4 linear layers given in the literature and extract their low-latency circuits. For several of these matrices, the SBP heuristic improves on the other heuristics with respect to the minimum circuit depth metric. Note that the implementation of the 4×4 involutory MDS matrix given in Sarkar & Syed (2016) requires only 44 g-XORs at depth 3 and 40 g-XORs at depth 4. These new records beat all previous best-known results for this matrix. Even so, the circuit of GH1 (see Table 1) beats all the records for low-latency implementations of 4×4 involutory MDS linear layers.

Table 3:
Comparison of circuit cost (XOR count/depth) of binary matrices of size 16 × 16 under several global optimization algorithms.

| Matrix | Kranz et al. (2017) | Xiang et al. (2020) | Lin et al. (2021) | Li et al. (2019) | Banik, Funabiki & Isobe (2021) | Liu et al. (2022a) | SBP |
|---|---|---|---|---|---|---|---|
| SMALLSCALE AES (Cid, Murphy & Robshaw, 2005) | 47/7 | 43/5 | 43/5 | 49/3 | 49/3 | 47/3 | (48/3, 47/4) |
| JOLTIK (Jean, Nikolic & Peyrin, 2015) | 48/4 | 44/7 | 43/8 | 51/3 | 50/3 | 48/3 | 49/3 |
| MIDORI (Banik et al., 2015) | 24/4 | 24/3 | 24/3 | 24/2 | 24/2 | 24/2 | 24/2 |
| PRINCE M0 (Borghoff et al., 2012) | 24/4 | 24/6 | 24/6 | 24/2 | 24/2 | 24/2 | 24/2 |
| PRINCE M1 (Borghoff et al., 2012) | 24/4 | 24/6 | 24/6 | 24/2 | 24/2 | 24/2 | 24/2 |
| PRIDE L0, L3 (Albrecht et al., 2014) | 24/4 | 24/3 | 24/3 | 24/2 | 24/2 | 24/2 | 24/2 |
| PRIDE L1, L2 (Albrecht et al., 2014) | 24/3 | 24/3 | 24/3 | 24/2 | 24/2 | 24/2 | 24/2 |
| QARMA64 (Avanzi, 2017) | 24/3 | 24/5 | 24/5 | 24/2 | 24/2 | 24/2 | 24/2 |
| SKINNY (Beierle et al., 2016) | 12/2 | 12/2 | 12/2 | 12/2 | 12/2 | 12/2 | 12/2 |
| Non involutory MDS matrices | | | | | | | |
| Sim et al. (2015) (Hadamard) | 48/3 | 44/7 | 44/7 | 51/3 | 50/3 | 49/3 | 48/3 |
| Liu & Sim (2016) (Circulant) | 44/3 | 44/6 | 43/4 | 47/3 | 44/3 | 44/3 | 47/3 |
| Li & Wang (2016) (Circulant) | 44/5 | 44/8 | 43/4 | 47/3 | 44/3 | 44/3 | 47/3 |
| Beierle, Kranz & Leander (2016) (Circulant) | 42/5 | 41/6 | 40/5 | 47/3 | 43/3 | 45/3 | 46/3 |
| Sarkar & Syed (2016) (Toeplitz) | 43/5 | 41/7 | 40/7 | 44/3 | 43/3 | 45/3 | 43/3 |
| Jean et al. (2017) | 43/5 | 41/6 | 40/6 | 45/3 | 45/3 | 45/3 | (45/3, 43/4) |
| Involutory MDS matrices | | | | | | | |
| Sim et al. (2015) | 48/4 | 44/8 | 43/8 | 51/3 | 49/3 | 48/3 | (49/3, 47/4) |
| Li & Wang (2016) | 48/4 | 44/6 | 43/8 | 51/3 | 49/3 | 48/3 | (49/3, 48/4) |
| Sarkar & Syed (2016) | 42/4 | 38/8 | 37/7 | 48/3 | 46/3 | 45/3 | (44/3, 40/4) |
| Jean et al. (2017) | 47/7 | 41/6 | 41/10 | 47/3 | 47/3 | 47/3 | (45/4) |
DOI: 10.7717/peerj-cs.1820/table-3


Bold values indicate the best results.

Depth-bounded version of BDKCI heuristic

The BDKCI algorithm generates circuits without any limitation on circuit depth. In this article, we improve this heuristic by introducing circuit depth awareness. We present algorithms only for the two modified functions of the original BDKCI heuristic (Algorithms 2 and 3). Note that we have altered not only these two functions but also other functions called within them.

Algorithm 2 represents the Main function, which begins by importing a target matrix. It then adds XOR gates using the SLP method until all rows of the target matrix have been generated. This process is repeated over multiple iterations in order to improve the parameters of the XOR circuit. At the end of each iteration, the best XOR circuit parameters, including relevant information such as cost and depth, are recorded in a log file. Algorithm 3 represents the PickNewBaseElementXOR3 function. In this function, an element (the chosen value) is selected randomly from the element array to generate a circuit gate, and the depth value of the selected element is appended to the depth array. In the original BDKCI version, the A1 and A2 algorithms can be used in addition to RNBP to determine the chosen value. In our proposed depth-bounded version, we utilize only the RNBP heuristic.

Algorithm 2:
Main function.
depths ← array of size 1,000
DepthLimit ← 5
function MAIN
  ReadTargetMatrix                          /* Read TargetMatrix, then construct the Target and Dist arrays */
  while iterations > 0 do
    BestCount, BestCost1, BestCost3, BestCost4, BestDepth ← LARGE
    XorCount, Xor2Count, Xor3Count, Xor4Count ← 0
    XorCost1, XorCost2, XorCost3, XorCost4 ← 0
    refreshDistAndTarget                    /* Update the Target and Dist arrays by randomly shuffling TargetMatrix */
    InitBase                                /* Set initial values */
    _returnVal2 ← 0                         /* Set initial value for _returnVal2 */
    while TargetsFound < NumTargets do
      _returnVal ← EasyMoveXOR3             /* Search for targets with a distance of 1 */
      if _returnVal = 0 then
        PickNewBaseElementXOR3              /* Select new elements to create circuits with 3-input and 4-input XOR gates */
      else if _returnVal = 2 then
        _returnVal2 ← 2
        break
      end if
      if not EasyMove then
        PickNewBaseElement                  /* Select new elements to create circuits with just 2-input XOR gates */
      end if
      if the difference between any BestCost and XorCost is greater than 0.001 then
        if _returnVal2 ≠ 2 then
          logs ← trialNo
        end if
        depth ← max_element(depth_map.begin(), depth_map.end())
        if TargetsFound = NumTargets then
          if IWSEC then
            /* If all targets are found, the depth is calculated */
          end if
        end if
        t ← current time
        if IWSEC then
          if _returnVal2 then
          end if
        end if
        /* Based on the _returnVal2 value, checks are made and results are written to the log. */
      end if
      logs.close()
    end while
  end while
end function
DOI: 10.7717/peerj-cs.1820/table-10
Algorithm 3:
PickNewBaseElementXOR3 function.
function PickNewBaseElementXOR3
  AllElements ← allocate space for an array of size BaseSize×(BaseSize−1)×(BaseSize×BaseSize−4×BaseSize+5)
  counter ← 0
  DepthLimit ← chosencircuitdepthlimit      /* Depending on the selected circuit depth limit, the chosencircuitdepthlimit value can be adjusted, e.g., 3, 4, etc. */
  for i ∈ [0, BaseSize) do
    for j ∈ [i+1, BaseSize) do
      if depths[i]+1 > DepthLimit or depths[j]+1 > DepthLimit then
        continue
      end if
      NewBase ← Base[i] ⊕ Base[j]
      TotalDistanceXOR3(Gate::XOR2)         /* Store results of a 2-input XOR operation with distances and parent indices */
      for k ∈ [0, NumTargets) do
        AllElements[counter].newDist[k] ← NDist[k]
      end for
      AllElements[counter].parent_i ← i
      AllElements[counter].parent_j ← j
      AllElements[counter].gate ← Gate::XOR2
      counter ← counter+1
      for k ∈ [j+1, BaseSize) do
        if depths[i]+1 > DepthLimit or depths[j]+1 > 4 or depths[k]+1 > DepthLimit then
          continue
        end if
        NewBase ← Base[i] ⊕ Base[j] ⊕ Base[k]
        TotalDistanceXOR3(Gate::XOR3)       /* Store results of a 3-input XOR operation with distances and parent indices */
        for l ∈ [0, NumTargets) do
          AllElements[counter].newDist[l] ← NDist[l]
        end for
        AllElements[counter].parent_i ← i
        AllElements[counter].parent_j ← j
        AllElements[counter].parent_k ← k
        AllElements[counter].gate ← Gate::XOR3
        counter ← counter+1
        if XOR4 is defined then
          for l ∈ [k+1, BaseSize) do
            NewBase ← Base[i] ⊕ Base[j] ⊕ Base[k] ⊕ Base[l]
            TotalDistanceXOR3(Gate::XOR4)   /* Store results of a 4-input XOR operation with distances and parent indices */
          end for
        end if
      end for
    end for
  end for
  chosen ← RNBP(AllElements, counter)       /* The RNBP algorithm selects one of the elements of the AllElements array and returns its index, which is assigned to the chosen variable. */
  /* The remaining portion of the algorithm includes tasks such as updating the base, computing costs, and releasing memory resources. */
end function
DOI: 10.7717/peerj-cs.1820/table-11
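
The depth filter in Algorithm 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `enumerate_candidates`, the bitmask representation of base elements, and the tuple layout of a candidate are assumptions made for illustration, and the distance computation (TotalDistanceXOR3) and the RNBP selection step are omitted. The sketch only shows how combinations whose resulting gate would exceed the depth limit are skipped.

```python
from itertools import combinations


def enumerate_candidates(base, depths, depth_limit, max_fan_in=3):
    """List candidate gates that respect the depth limit.

    base        -- base elements as integer bitmasks over the inputs (assumed encoding)
    depths      -- depths[i] is the circuit depth of base[i] (inputs have depth 0)
    depth_limit -- largest depth a new gate may have
    max_fan_in  -- 3 allows XOR2/XOR3 gates, 4 also allows XOR4 gates
    Returns a list of (value, parents, depth) tuples (hypothetical layout).
    """
    candidates = []
    for fan_in in range(2, max_fan_in + 1):
        for parents in combinations(range(len(base)), fan_in):
            # A new gate sits one level above its deepest parent.
            new_depth = max(depths[p] for p in parents) + 1
            if new_depth > depth_limit:
                continue  # same effect as the "depths[i]+1 > DepthLimit" checks
            value = 0
            for p in parents:
                value ^= base[p]  # XOR of the selected base elements
            candidates.append((value, parents, new_depth))
    return candidates


if __name__ == "__main__":
    # Four inputs x0..x3 plus one existing gate x0 XOR x1 at depth 1.
    base = [0b0001, 0b0010, 0b0100, 0b1000, 0b0011]
    depths = [0, 0, 0, 0, 1]
    print(len(enumerate_candidates(base, depths, depth_limit=1)))
    print(len(enumerate_candidates(base, depths, depth_limit=2)))
```

With a limit of 1, only combinations of primary inputs survive; raising the limit to 2 also admits combinations that reuse the depth-1 gate. The sketch groups candidates by fan-in, so their order differs from the nested loops of Algorithm 3.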

The following is a brief overview of the changes made to the original BDKCI heuristic:

• Within the Main function, the “BestDepth” variable is declared as a large data type, enabling it to store the minimum depth value identified during the algorithm’s execution. In the same function, we have introduced the “depths” array to retain the depth of each gate. These values are needed to identify the minimum depth attained throughout the algorithm’s execution.

• The return type of the EasyMoveXOR3 function has been changed to an integer, allowing us to decide whether to print the results based on the function’s return value. Moreover, inside the same function, we have made the following modifications, which record the depth of two-input, three-input, and four-input XOR gates, respectively:

depths[BaseSize] = max(depth_map[a], depth_map[b]) + 1,
depths[BaseSize] = max(depth_map[a], depth_map[b], depth_map[c]) + 1,
depths[BaseSize] = max(depth_map[a], depth_map[b], depth_map[c], depth_map[d]) + 1.

Furthermore, within the EasyMoveXOR3 function, a boolean variable named “foundone” has been defined to monitor whether the circuit depth exceeds the specified threshold, which determines whether the current round of the algorithm continues or ends.

• In the function PickNewBaseElementXOR3, we have defined the “DepthLimit” variable, which allows us to generate circuits with the chosen circuit depth. Moreover, the condition “if (depths[i]+1 > DepthLimit || depths[j]+1 > DepthLimit)” compares the depth of each element of the pair that is eligible for selection in the current round with the depth limit. If the depth limit is exceeded, this pair of elements is not selected, and the loop continues with a new pair of elements.

Better circuit implementations by using the depth-bounded version of the BDKCI heuristic

In this subsection, we present improved circuit implementations for the linear layers of several block ciphers, obtained with the circuit depth-bounded version of the BDKCI heuristic proposed in this study. For the AES MixColumn matrix, we obtained a circuit that costs 240.95 GE in the ASIC4 library (see Table 4). This circuit uses five XOR2 gates, seven XOR3 gates, and 32 XOR4 gates with depth 3, improving on the previous best result of 243 GE with depth 5 (Liu et al., 2022b). Here, XOR2, XOR3, and XOR4 denote two-input, three-input, and four-input XOR gates, respectively.

Table 4:
The global optimization result of AES MixColumn matrix with 44 XORs, depth 3, and 240.95 GE in ASIC4 by using the depth-bounded version of BDKCI.
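No  Operation                     No  Operation
1   t0 = x15 ⊕ x23                23  y25 = x25 ⊕ t4 ⊕ t8 ⊕ t20
2   y31 = x6 ⊕ x7 ⊕ x30 ⊕ t0      24  y2 = x2 ⊕ x18 ⊕ y18 ⊕ t20
3   y7 = x6 ⊕ x14 ⊕ x31 ⊕ t0      25  t24 = x11 ⊕ x20 ⊕ x28
4   y8 = x0 ⊕ x16 ⊕ x24 ⊕ t0      26  y12 = x4 ⊕ x19 ⊕ t0 ⊕ t24
5   t4 = x7 ⊕ x15                 27  y4 = x3 ⊕ x12 ⊕ t4 ⊕ t24
6   y15 = x6 ⊕ x22 ⊕ y7 ⊕ t4      28  t27 = x0 ⊕ x8 ⊕ t20
7   y0 = x8 ⊕ x16 ⊕ x24 ⊕ t4      29  y17 = x16 ⊕ x17 ⊕ y16 ⊕ t27
8   y23 = x22 ⊕ x31 ⊕ x30 ⊕ t4    30  y1 = x1 ⊕ t4 ⊕ t27
9   t8 = x0 ⊕ x15 ⊕ x24 ⊕ x31     31  y9 = x9 ⊕ x24 ⊕ y8 ⊕ t27
10  y24 = y0 ⊕ t8                 32  t31 = x6 ⊕ x14 ⊕ x29
11  y16 = x8 ⊕ t0 ⊕ t8            33  y22 = x21 ⊕ x30 ⊕ t31
12  t11 = x3 ⊕ x11 ⊕ x26 ⊕ x31    34  y30 = x5 ⊕ x22 ⊕ t31
13  y27 = x2 ⊕ x7 ⊕ x19 ⊕ t11     35  t34 = x5 ⊕ x13 ⊕ x21 ⊕ x29
14  y19 = x18 ⊕ x23 ⊕ x27 ⊕ t11   36  y14 = x14 ⊕ x30 ⊕ y30 ⊕ t34
15  t14 = x4 ⊕ x12 ⊕ x27 ⊕ x31    37  y21 = x11 ⊕ x21 ⊕ t24 ⊕ t34
16  y20 = x19 ⊕ x23 ⊕ x28 ⊕ t14   38  y13 = x13 ⊕ x12 ⊕ x20 ⊕ t34
17  y28 = x3 ⊕ x7 ⊕ x20 ⊕ t14     39  y6 = x6 ⊕ x22 ⊕ y22 ⊕ t34
18  t17 = x2 ⊕ x10                40  y5 = x4 ⊕ x5 ⊕ x12 ⊕ t34
19  y26 = x1 ⊕ x18 ⊕ x25 ⊕ t17    41  y29 = x4 ⊕ x29 ⊕ x28 ⊕ t34
20  y18 = x17 ⊕ x25 ⊕ x26 ⊕ t17   42  t41 = t4 ⊕ t17
21  t20 = x1 ⊕ x9 ⊕ x17 ⊕ x25     43  y11 = x3 ⊕ y27 ⊕ y19 ⊕ t41
22  y10 = x10 ⊕ x26 ⊕ y26 ⊕ t20   44  y3 = x11 ⊕ x27 ⊕ x19 ⊕ t41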
DOI: 10.7717/peerj-cs.1820/table-4

The binary matrix of AES MixColumn is taken directly from the repository given in Baksi et al. (2021). Table 5 provides an overview of recent implementations of AES MixColumn, including our own result. Additionally, we have improved the previous implementations of the linear layers of ANUBIS and CLEFIA M0. For TWOFISH, we found a circuit that matches the best previously known result. Table 6 compares the results for these and other diffusion layers. Moreover, we globally optimized GH1 by using the depth-bounded version of BDKCI. The optimized circuit implementation of GH1 is given in Table 7. It requires only one XOR2 gate, seven XOR3 gates, and 11 XOR4 gates with depth 3. We have also compared our result with those of other 4×4 involutory MDS matrices over F_{2^4} for the ASIC1, ASIC2, ASIC3, and ASIC4 libraries. The results presented in Table 8 indicate that our matrix has the smallest GE values in the ASIC1, ASIC2, and ASIC3 libraries and matches the best value (101.84 GE) in ASIC4.

Table 5:
A brief overview of recent implementation costs of the AES MixColumn matrix in ASIC4.
Ref.                            #XOR2  #XOR3  #XOR4  GC  Depth  GE
Banik, Funabiki & Isobe (2019)  95     -      -      95  6      316.35
Banik, Funabiki & Isobe (2019)  39     28     -      67  6      260.35
Tan & Peyrin (2019)             94     -      -      94  9      313.02
Maximov (2019)                  92     -      -      92  6      306.36
Xiang et al. (2020)             92     -      -      92  6      306.36
Lin et al. (2021)               91     -      -      91  7      303.03
Baksi et al. (2021)             12     47     -      59  4      258.98
Liu et al. (2022b)              22     21     12     55  5      243.0
This article                    5      7      32     44  3      240.95
DOI: 10.7717/peerj-cs.1820/table-5


The notations “#XOR2, #XOR3, #XOR4” indicate the number of two-input XOR gates, three-input XOR gates, and four-input XOR gates needed, respectively. Bold values indicate the best results.
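
The GE column of Table 5 can be reproduced from the gate counts with a short Python sketch. The per-gate costs below (3.33, 4.66, and 5.99 GE for XOR2, XOR3, and XOR4 in ASIC4) are assumptions: they are the values that are consistent with every row of Table 5, not figures quoted from this section. The function name `gate_equivalents` is likewise hypothetical.

```python
# Assumed ASIC4 costs in gate equivalents, keyed by XOR fan-in.
# These values are inferred from the rows of Table 5, not quoted from the text.
ASIC4_GE = {2: 3.33, 3: 4.66, 4: 5.99}


def gate_equivalents(xor2=0, xor3=0, xor4=0, costs=ASIC4_GE):
    """Total area in GE of a circuit with the given numbers of XOR gates."""
    return xor2 * costs[2] + xor3 * costs[3] + xor4 * costs[4]


if __name__ == "__main__":
    rows = {
        "Banik, Funabiki & Isobe (2019), XOR2 only": (95, 0, 0),
        "Baksi et al. (2021)": (12, 47, 0),
        "Liu et al. (2022b)": (22, 21, 12),
        "This article": (5, 7, 32),
    }
    for name, counts in rows.items():
        print(f"{name}: {gate_equivalents(*counts):.2f} GE")
```

Under these assumed costs, a four-input XOR gate is cheaper than the two or three smaller gates it replaces, which is why a circuit with fewer but wider gates can have a smaller area.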

Table 6:
Summary of implementation costs of linear layers of various block ciphers in ASIC4 library.
Matrix  XZLBZ (Xiang et al., 2020)  BDKCI (Baksi et al., 2021)  (Banik, Funabiki & Isobe, 2021)  (Liu et al., 2022b)  (Liu et al., 2022b)  (Liu et al., 2022b)  This article
ANUBIS (Barreto & Rijmen, 2000) 329.6 274.2 293.0 270.3 270.3 253.6 251.61
CLEFIA M0 (Shirai et al., 2007) 326.3 271.63 293.0 276.3 270.9 258.9 256.27
CLEFIA M1 (Shirai et al., 2007) 342.9 298.9 294.3 292.9 283.6 270.2 286.88
JOLTIK 146.5 122.5 127.8 126.5 123.8 115.8 117.14
MIDORI 79.9 74.5 71.9 71.9 71.9 71.9 74.56
PRINCE M0, M1 79.9 74.5 71.9 71.9 71.9 71.9 74.56
PRIDE L0, L3 79.9 74.5 71.9 71.9 71.9 71.9 74.56
QARMA128 (Avanzi, 2017) 159.8 145.8 145.8 144.5 144.5 149.12
QARMA64 79.9 74.5 71.9 71.9 71.9 71.9 74.56
SMALLSCALE AES 143.1 111.8 123.8 123.8 121.8 118.4 115.82
TWOFISH (Schneier et al., 1998) 369.6 317.5 338.9 312.9 306.9 293.5 293.53
DOI: 10.7717/peerj-cs.1820/table-6


Bold values indicate the best results.

Table 7:
The global optimization result of GH1 with 19 XORs, depth 3, and 101.84 GE in ASIC4 by using the depth-bounded version of BDKCI.
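No  Operation                   No  Operation
1   y1 = x1 ⊕ x6 ⊕ x8 ⊕ x12     11  y6 = x1 ⊕ y1 ⊕ y8 ⊕ y12
2   y12 = x2 ⊕ x5 ⊕ x10 ⊕ x12   12  t11 = x0 ⊕ x6 ⊕ x12 ⊕ x13
3   y0 = x0 ⊕ x5 ⊕ x11 ⊕ x15    13  y13 = y0 ⊕ y8 ⊕ t11
4   y11 = x2 ⊕ x4 ⊕ x11 ⊕ x14   14  y9 = x1 ⊕ x9 ⊕ t11
5   y15 = x1 ⊕ x4 ⊕ x9 ⊕ x15    15  y4 = x4 ⊕ y1 ⊕ t11
6   y5 = x0 ⊕ y0 ⊕ y11 ⊕ y15    16  t15 = x7 ⊕ x9 ⊕ x14 ⊕ x13
7   t6 = x3 ⊕ x5 ⊕ x12 ⊕ x15    17  y2 = x11 ⊕ y11 ⊕ t15
8   y3 = y11 ⊕ y12 ⊕ t6         18  y10 = x12 ⊕ y2 ⊕ y5 ⊕ y12
9   y7 = x7 ⊕ y0 ⊕ t6           19  y14 = y4 ⊕ t15
10  y8 = x0 ⊕ x8 ⊕ t6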
DOI: 10.7717/peerj-cs.1820/table-7
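
The gate counts and depth stated for the GH1 circuit can be checked mechanically from Table 7. The following Python sketch is only an illustration: the textual encoding of the circuit and the helper names `parse_circuit` and `analyse` are assumptions. It applies the depth rule used in the modified EasyMoveXOR3 function, where a gate is one level deeper than its deepest operand, and counts gates by fan-in.

```python
from collections import Counter

# The 19 operations of Table 7, written as "output: operands" (assumed encoding).
GH1_TEXT = """
y1: x1 x6 x8 x12
y12: x2 x5 x10 x12
y0: x0 x5 x11 x15
y11: x2 x4 x11 x14
y15: x1 x4 x9 x15
y5: x0 y0 y11 y15
t6: x3 x5 x12 x15
y3: y11 y12 t6
y7: x7 y0 t6
y8: x0 x8 t6
y6: x1 y1 y8 y12
t11: x0 x6 x12 x13
y13: y0 y8 t11
y9: x1 x9 t11
y4: x4 y1 t11
t15: x7 x9 x14 x13
y2: x11 y11 t15
y10: x12 y2 y5 y12
y14: y4 t15
"""


def parse_circuit(text):
    """Turn the textual listing into a list of (output, operands) pairs."""
    circuit = []
    for line in text.strip().splitlines():
        out, operands = line.split(":")
        circuit.append((out.strip(), operands.split()))
    return circuit


def analyse(circuit, num_inputs=16):
    """Return (gate counts keyed by fan-in, circuit depth)."""
    depth = {f"x{i}": 0 for i in range(num_inputs)}  # primary inputs have depth 0
    fan_in = Counter()
    for out, operands in circuit:
        depth[out] = max(depth[v] for v in operands) + 1
        fan_in[len(operands)] += 1
    return dict(fan_in), max(depth.values())


if __name__ == "__main__":
    counts, depth = analyse(parse_circuit(GH1_TEXT))
    print(counts, depth)
```

The check relies on the operations being listed in an order where every operand is defined before it is used, as is the case in Table 7.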
Table 8:
The global optimization results of 4 × 4 involutory and MDS matrices over F_{2^4} by using the depth-bounded version of BDKCI.
Ref.                  Type                   #XOR2  #XOR3  #XOR4  GC  ASIC1 (GE)  ASIC2 (GE)  ASIC3 (GE)  ASIC4 (GE)  Depth
Sim et al. (2015)     Hadamard, Involutory   -      -      20     20  100         110         125         119.8       3
Li & Wang (2016)      Hadamard, Involutory   -      -      20     20  100         110         125         119.8       3
Sarkar & Syed (2016)  Involutory             2      5      12     19  80.25       88.537      101         101.84      3
Jean et al. (2017)    Involutory             1      4      15     20  90          99.341      113.05      111.82      4
GH1                   GHadamard, Involutory  1      7      11     19  79.75       88.486      100.65      101.84      3
DOI: 10.7717/peerj-cs.1820/table-8


Bold values indicate the best results.

Conclusion and future works

In this article, we present a new heuristic, SBP, to search for efficient circuit implementations of a given linear layer. With low latency as the criterion, our heuristic achieves better results under the minimum circuit depth metric for 16×16 binary matrices than various global optimization algorithms. In this respect, considering low-latency and low-cost circuits of 4×4 involutory MDS matrices over F_{2^4}, we give a new lightest record, which can be implemented with only 41 g-XORs at depth 3. Additionally, to further optimize the results, we incorporate a circuit depth limit into the BDKCI algorithm. The proposed depth-bounded version of BDKCI has allowed us to achieve even better results. In particular, we give a circuit of AES MixColumn with 240.95 GE in the ASIC4 library, which is the best result achieved thus far. Moreover, our new 4×4 involutory MDS matrix requires 79.75, 88.486, 100.65, and 101.84 GE in the ASIC1, ASIC2, ASIC3, and ASIC4 libraries, respectively. These values are the lightest among the matrices compared in Table 8 for ASIC1, ASIC2, and ASIC3, and they match the best value for ASIC4. It should be noted that further runs of our depth-bounded BDKCI implementation could still improve the circuit results given in this article.

Future works

Future research directions include optimizing the SBP heuristic for larger matrices. It would also be interesting to extend SBP to a multiple-input XOR gate version for improved results.
