Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions

The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields. The High Performance Computing (HPC) community has therefore continuously invested a lot of effort to provide an efficient SpMV kernel on modern CPU architectures. Although it has been shown that block-based kernels help to achieve high performance, they are difficult to use in practice because of the zero padding they require. In the current paper, we propose new kernels using the AVX-512 instruction set, which makes it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels highly optimized in assembly language. Considering that the optimal blocking size depends on the matrix, we also provide a method to predict the best kernel to be used utilizing a simple interpolation of results from previous executions. We compare the performance of our approach to that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, both for sequential and for parallel executions. Finally, we provide the corresponding code in an open source library, called SPC5.


Introduction
The sparse matrix-vector product (SpMV) is an important operation in many applications, which often needs to be performed multiple times in the course of the algorithm.It is often the case, that no matter how sophisticated the particular algorithm is, most of the CPU time is actually spent in matrix-vector product evaluations.The prominent examples are iterative solvers based on Krylov subspaces, such as the popular CG method.Here the solution vector is found after multiple matrix-vector multiplications with the same matrix.Since in many scientific applications a large part of the CPU time is spent in the solution of the resulting linear system and the matrix is stored in a sparse manner, improving the efficiency of the SpMV on modern hardware could potentially leverage the performance of a wide range of codes.
One of the possible approaches towards the improvement of the SpMV is to take advantage of a specific sparsity pattern.For example, the diagonal storage (DIA) [1] or jagged diagonal storage (JAD) [2] are designed for matrices that are mostly diagonal or band-diagonal.However, unless ensured by the matrix producing method design, it is not straightforward to evaluate in advance when a given matrix's structure is well suited for a specific format.This has been the main motivation to provide general SpMV like block-based schemes, as described in [3,4,5,6,7].This type of kernels makes it possible to use the SIMD/vectorization capability of the CPU, but they also are required to pad the blocks with zeros to avoid a transformation of the loaded values from the memory before computation.The extra memory usage and the ensuing transfers drastically reduce the effective performance.Moreover, it has been shown that there is no ideal block size that works well for all matrices.
In the current study, we attempt to address the mentioned issues, namely problems caused by zero padding and optimal block selection.The recent AVX-512 instruction set provides the possibility to load fewer values than a vector can contain and to expand them inside the vector (dispatching them in order).This feature allows for fully vectorized block-based kernels without any zero padding in the matrix storage.Additionally, we provide a method to select the block size that is most likely to provide the best performance analyzing the execution times of the previous runs.
The contributions of the study are the following: • study block based SpMV without padding with AVX-512, • describe efficient implementation targeting then next generation of Intel's HPC architecture, • provide a record-based strategy to choose the most appropriate kernel (using polynomial interpolation in sequential, and linear regression or neural network for shared-memory parallel approaches), • introduce the SPC5 package that includes the source code related to the present study 1 .
The rest of the paper is organized as follows: Section 2 gives background information related to vectorization and SpMV.We then describe our method in Section 3 and show its performance on selected matrices of various properties in Section 4. Comparison of various variants of our method using different blocking is provided.In Section 5 we propose a systematic approach towards selection of an optimal kernel based on easily obtained properties of the given matrix.Finally, we draw conclusions in Section 6.

Background
In this section we recall a few rather well known facts regarding the vectorization in modern processor architectures and about existing sparse matrix vector product implementations.

Vectorization
The growth of computing power with new generations of processors has been advancing for decades.One of its manifestations, which is of particular interest for scientific computing, is the steady growth of peak performance in terms of the number of floating point operations per second.What has changed recently, however, is the way that the hardware manufacturers sustain this growth.With the effective halt in the growth of processor frequencies, most of the increase of computing power is achieved through increasing the number of cores and the ability of each individual core to perform each operation on a vector of certain length using one instruction only.This capability is named vectorization or single instruction on multiple data (SIMD).
The AVX-512 [8] is an extension to the x86 instruction set dedicated to vectorization.It supports a 512 bits vector length, which corresponds to 16 single precision or 8 double precision floating point values, and we use the term VEC SIZE to refer to the number of values inside a vector.This instruction set provides new operations such as load/store operations to move data between the main memory and the CPU's registers.One of these new instructions is the vexpandpd(mask,ptr), where mask is an unsigned integer of VEC SIZE bits and ptr a pointer to an array of single or double precision floating point values.This instruction loads one value for each bit that is set to one in the mask, and move them to the corresponding position in the output vector.For example, in double precision, vexpandpd(10001011b,ptr) returns a vector equal to [ptr[0], ptr [1], 0, ptr [2], 0, 0, 0, ptr [3]], where ptr[i] refers to the values at position i in the array of address ptr, and considering that the mask is written from right to left.

Sparse Matrix Vector Product (SpMV)
We refer to the definition from [9] to classify a matrix as sparse: "The matrix may be sparse, either with the non-zero elements concentrated on a narrow band centered on the diagonal, or alternatively they may be distributed in a less systematic manner.We shall refer to a matrix as dense if the percentage of zero elements or its distribution is such as to make it uneconomic to take advantage of their presence".In other words, if there is any advantage of exploiting the zeros, for example, by saving time or memory, then the matrix should be considered as sparse.However, removing the zeros from the matrix leads to new storage and new computational kernels and while the gain of using a sparse matrix instead of a dense one can be huge in terms of memory occupancy and speed; the effective Flop rate of a sparse kernel generally remains low compared to its dense counterpart.In fact, in a sparse matrix storage, we provide a way to know the respective column and row of each non-zero value (NNZ).Therefore, the general SpMV is a bandwidth/memory bound operation because it pays the price of this extra storage and causes to have a low ratio of Flop to perform against data occupancy.
The compressed row storage (CRS), also known as the compress sparse row (CSR) storage, is a well-known storage and is used as a de-facto standard in SpMV studies.Its main idea is to avoid storing individual row indexes for each NNZ value.Instead of that, it counts the number of values that each row contains.Figure 1 presents an example of the CRS storage.The NNZ values of the original matrix are stored in a values array in row major (one row after the other) and in column ascending order.In a secondary array colidx we store the column indexes of the NNZ values in the same order.Finally, rowptr contains the positions of the NNZ in values for each row: the row i has NNZ from index rowptr[i] to rowptr[i+1]-1.During the computation, the values are read one row after the other, making the access to the result vector linear and potentially unique.Moreover, the input vector is read from left to right at each row computation.The data occupancy is given by S CRS = N N N Z × (S integer + S f loat ) + S integer × (N rows + 1), with N rows the number of rows in the original matrix, and S f loat and S integer the sizes of a floating point value and an integer, respectively.The performance and data locality has been studied in [10], where also the compressed column storage (CCS) variant has been proposed.Various papers have shown that there is a need for register blocking, cache blocking, and if possible, multiplication by multiple vectors in [3,11].In [12], the authors introduced the fixed-size block storage (FSB) which is one of first block-based matrix formats.The key idea is to extract contiguous blocks of values and to process them differently and more efficiently.This makes it possible to take advantage of the contiguous values.The pre-processing of the matrix, that is, the transformation of a matrix from COO or CSR to FSB, is costly but can be beneficial after one or several SpMVs.However, the corresponding SpMV is penalized by the memory accesses because it has to iterate over the vectors several times, cancelling the possible memory reuse when all the values are computed together.Subsequently, the block compressed sparse row storage (BCSR) was proposed in [13], and it has been extended with larger blocks of variable dimension in [3,4].Blocks of values are extracted from the sparse matrix but they had to be filled with zeros to become dense.For these formats, the blocks were aligned (the upper-left corner of the blocks start at a position multiple of the block size).The unaligned block compressed sparse row (UBCSR) has been proposed in [5] where the blocks can start at any row or column.Choosing the block size is not straightforward and as such, some work has been done to provide a mechanism to find it [6,7].
The main drawback of the compressed sparse matrix format is that the data locality is not preserved and it is thus more difficult to vectorize the operations.A possible attempt to solve this problem is by combining a sparse and dense approach: a certain block size is selected and the matrix is covered by blocks of this size so that all non-zero elements of the matrix belong to some block.The positions of the blocks are then stored in sparse fashion (using row pointers and column indices), while each block is stored as dense, effectively by storing all elements belonging to the block explicitly, including zeros.This leads to padding the non-zero values in the values array by zeros and thus increases memory requirements.The immense padding implied by their design caused that these methods were never adopted in real-life calculations.
More recent work has been done, pushed by the research on GPUs and the availability of manycore architectures.In [14], the authors extend the ELLAPACK format that became popular due to its high performance on GPUs, and adapt it to the Intel KNC.They provide some metrics to estimate if the computation of a matrix is likely to be memory or computation bounded and conclude that block-based schemes are not expected to be efficient because the average number of NNZ per block can be low.As we will show in the current study, it is only partially true because block-based approaches require less transformation of the input matrix and in extreme cases it is possible to use the block mask to avoid useless memory load.The authors of [14] also propose an auto-tuning mechanism to balance the work between threads, and their approach appears efficient on the KNC.The authors of [15] defined the SELL-C-σ format as a variant of Sliced ELLPACK.The key idea of their proposal is to provide a single matrix format that can be used on all common HPC architectures that are regular CPUs, manycore CPUs and GPUs.We consider that focusing on CPUs only could lead to better specific matrix storage.In [16] the authors also target CPUs and GPUs and introduce a new matrix format, called CSR5, and a corresponding source code is freely available online.We include their code in our performance benchmark.

Matrix Permutation/Reordering
Permutation of the rows and/or columns of a sparse matrix can improve the memory access pattern and the storage.A well known technique called Cuthill-McKee from [17] tries to make a matrix bandwidth by applying a breadth first algorithm on a graph which represents the matrix structure such that the resulting matrices have good properties for LU decomposition.However, the aim of this algorithm is not to improve the SpMV performance even though the generated matrices may have better data locality.
In [13] a method is proposed to have specifically more contiguous values in rows or columns.The idea is to create a graph from a matrix where each column is a vertex and by connecting all the vertices with weighted edges.The weights come from different formulations, but they represent the interest of putting two columns contiguously.Then a permutation is found by solving the travelling salesman problem (TSP) to obtain a path that goes through all the nodes but only once and that minimizes the cost of the total weight of the path.Therefore, a path in the graph represents a permutation; when we add a node to a path it means that we aggregate a column to a matrix in construction.The method has been updated in [5,18,19] with different formulas.
The permutation of the matrices has been left aside from the current study but as in most other approaches, any improvement to the shape of the matrix will certainly improve efficiency of our kernels by reducing the number of blocks.

Design of Block-based SpMV without Padding
In this section, we will elaborate on an alternative approach to the existing block-based storage by exploiting the mask features from the AVX-512 instruction set.Instead of padding the nonzero values with zeros to fill the whole blocks, we store an array of masks where each mask corresponds to one block and describes how many non-zeros there are and on which positions within the block.We also describe the corresponding SpMV kernels and discuss their optimization and parallelization.

Block-based Storage Without Zero Padding
We refer to β(r, c) as the matrix storage that has blocks of size r × c, where r is the number of rows and c the number of columns.Figure 2 shows the same matrix as in Figure 1 in the formats β(1, 4) and β(2, 2), see Figure 2a and Figure 2b, respectively.In our approach the blocks are row-aligned i.e. the upper left corner of a block starts at a row of index multiple of r, but at any column.To describe the sparse matrix, we need four different arrays as shown in the figures.In the values array, we store the NNZ values in block order and in row major inside the blocks.This array remains unchanged compared to the CSR format if we have one row per block (r = 1).The block colidx array contains the column indexes of the upper left values of each block.We store in block rowptr the number of blocks per row interval (r consecutive rows).Finally, the block masks array provides one mask of r × c bits per block to describe the sparsity structure inside each block.We note N blocks (r, c) the number of blocks of size r ×c obtained from a given matrix.The average number of NNZ per block is then Avg(r, c) = N N N Z /N blocks (r, c).The memory occupancy is given by (in bytes) : with S f loat and S integer the sizes of a floating point value and an integer, respectively.After putting two terms together and substituting for N blocks (r, c) = N N N Z /Avg(r, c) we get the total occupancy Let us now compare with the memory occupancy of the CSR format, which is The first term is the same for both storages.This is thanks to the fact that we do not use zero padding in the values array, as it has been discussed before.The second term (which is only relevant for very sparse matrices, otherwise N N N Z N rows ) is clearly either the same (if r = 1) or smaller for our storage whenever r > 1.The last term is smaller for our storage if which should be usually true, unless the blocks are very poorly filled.If we consider the usual size of the integer S integer = 4, we need average filling of at least 1 + 1 4 for β(1, 8), 1 + 1 2 for β(2, 8) and β(4, 4), and 2 for β(4, 8) and β(8, 4), respectively.
Memory occupancy is an important measure, because the SpMV is usually a memory bound operation and therefore reducing the total amount of memory to perform the same number of floating point operations (Flop) is expected to be beneficial.

SpMV Kernels for β(r, c) Storages
We provide the SpMV kernel that works for any block size in Algorithm 1.The computation iterates over the rows with a step r since the blocks are r -aligned.Then, it iterates over all the blocks inside a row interval, from left to right, starting from the block of index mat.blockrowptr[idxRow/r] and ending before the block of index mat.blockrowptr[idxRow/r+1] (line 7).The column index for each block is given by mat.block colidx and stored in idxCol (line 8).To access the values that correspond to a block, we must use a dedicated variable idxVal, which is incremented by the number of bits set to 1 in the masks.The scalar algorithm relies on an inner loop of index k to iterate over the bits of the masks.This loop can be vectorized in AVX-512 using the vexpand instruction such that the arithmetic operations between y, x and mat.values are vectorized.
In Algorithm 1, all the blocks from a matrix are computed similarly regardless whether their masks contain all ones or all zeros.This can become unfavorable in the case of extremely sparse matrices: most of the blocks will then contain a single value and only one bit of the mask will be equal to 1.This implies two possible overheads, first we load the values from the matrix using the vexpand instruction instead of a scalar move, and second, we load a full vector from x and use vector based arithmetic instruction.This is why we propose an alternative approach shown in Algorithm 2 for β(1, V EC SIZE).The idea is to use two separate inner loops to proceed differently the blocks depending if they contain only one value or more than one value.However, having a test inside the inner loop could kill the instruction pipeline and the speculation from the CPU.Instead of that, we use two separate inner loops and jump from one loop to the other one when needed.Indeed, using goto command might seem strange, but it is justified by performance considerations and, since the algorithm is implemented in assembly, it is a rather natural choice.Regarding the performance, we expect that for most matrices the algorithm would stay in one of the modes (scalar or vector) for several blocks and thus the CPU is more likely to predict to stay inside the loop, avoiding the performance penalty.Therefore, the maximum overhead is met if the blocks' kinds alternate such that the algorithm jumps from one loop to the other at each block.Still, this approach can be significantly beneficial in terms of data transfer especially if the structure of the matrix is chaotic.

Relation Between Matrix Shape and Number of Blocks
The number of blocks and the average number of values inside the blocks for a particular block size provide only limited information about structure of a given matrix.For example, having a high filling of the blocks means that locally the non-zero values are close to each other, but it says nothing of the possible memory jump in the x array from one block to the next one.If, however, this information is known for several block sizes, more can be deduced about the global structure of the given matrix.Having small blocks largely filled and large blocks poorly filled suggests that there is hyper-concentration, but still gaps between the blocks.On the other hand, the opposite would mean that the NNZ are not far from each other, but not close enough to fill small blocks.In the present study, we try to predict the performance of our different kernels (using ALGORITHM 1: SpMV for a matrix mat in format β(r, c).The lines in blue • are to compute in scalar and have to be replaced by the line in green • to have the vectorized equivalent.
Input: x : vector to multiply with the matrix.mat : a matrix in the block format β(r, c).r, c : the size of the blocks.Output: y : the result of the product. 1 function spmv(x, mat, r, c, y)

Optimized Kernel Implementation
In Section 3.2 we described generic kernels, which can be used with any block sizes.For the most useful block sizes, which we consider to be β(1, 8), β(2, 4), β(2, 8), β(4, 4), β(4, 8) and β(8, 4), we decided to develop highly optimized routines in assembly to further reduce the run-time.In this section we describe some of the technical considerations leading to the optimized kernels speed-up.
In AVX-512, V EC SIZE is equal to 8, so the formats that have blocks with 4 columns load only half a vector from x into registers.In addition, the vexpand instruction loads values for 2 consecutive block rows.Consequently, we have to decide in the implementation between expanding the half vector from x into a full AVX-512 register or splitting the values vector into two AVX-2 registers.We have made the choice of the second option.
We have decided to implement our kernels in assembly language.By employing register oriented programming, we intend to reduce the number of instructions (compared to a C/C++ compiled code) and to minimize the access to the cache, leading to higher performance despite its extreme speed.Thus we achieve some kind of non-temporal usage of all the arrays except for x and y (that is, we access each element only once to put it into a register, and no more).In addition, we are able to apply software pipelining techniques, ALGORITHM 2: Scalar SpMV for a matrix mat in format β(1, V EC SIZE) with test (vectorized in the second idxBlock loop).
Input: x : vector to multiply with the matrix.mat : a matrix in the block format.
Output: y : the result of the product. 1 function spmv scalar 1 VECSIZE(x, mat, y) even though it is difficult to really figure out how the hardware is helped by this strategy.We have compared our implementation to an intrinsic-based C++ equivalent and got up to 10% difference (comparison not included in the current study).By implementing the kernels in assembly language and employing register oriented programming, we intend to reduce the number of instructions (compared to a compiled code written in C/C++ with intrinsics) and to minimize the access to the cache which, even though it is much faster then the main memory, can still lead to higher performance.Thus, we achieve some kind of non-temporal usage of all the arrays except for x and y (that is, we access each element only once to put it into a register, and no more).In addition, we are able to apply software pipelining techniques, even though it is difficult to really figure out how the hardware is helped by this strategy.We have compared our implementation to an intrinsic-based C++ equivalent and got up to 10% gain (comparison not included in the current study due to space limitations).We provide a simplified source code of the β(1, 8) kernel in Code 1.

Parallelization
We parallelize our kernels with a static workload division among the threads.Our objective is to have approximately the same number of blocks per thread, which is N b/t = N blocks (r, c)/N threads in an ideal case,  5 // (nbRows r d i , rowsSizes r s i , headers rdx , values rcx , x r8 , y r9 ) 6 // save some r e g i s t e r s 7 ( commented o u t ) . . .8 x o r q %r12 , %r 1 2 ; // valIdx = 0 9 // i f no rows in the matrix , jump to end 10 t e s t %r d i , %r d i ;  but without distributing one row to multiple threads.We create the row intervals by iterating over the blocks in rows order with a step r and by deciding whether all of the blocks inside the next r rows should be added to the current interval.We add the next r rows if the following test is true: absolute value((thread id + 1) × N b/t − N blocks (r, c)[row idx]) is lower than absolute value((thread id + 1) × N b/t − N blocks (r, c)[row idx + 1]), where thread id is the index of the current interval that will be assigned to the thread of the same id for computation.We obtain one-row interval per thread, and we pre-allocate a working vector of the same size.After all threads finish their respective parts of the calculation, all the partial results are merged into the final vector of size Dim.This operation is done without synchronization between the threads since there is no overlap between the rows assigned to each of them and it is thus possible for each thread to copy its working vector into the global one directly.So even in the case of slight work-load imbalance, after a thread has finished, it does not wait for the others but starts directly adding its results.
We attempt to reduce the NUMA effects by splitting the matrix's arrays values, block colidx, block rowptr and block masks, to allocate sub-arrays for each thread in the memory node that corresponds to the core where it is pinned.There are, however, conceptual disadvantages to this approach since it duplicates the matrix, at least temporarily, in memory during the copy and it ties the data structure and memory distribution to the number of threads.The vectors x and y are still allocated by the master thread, and x is accessed during the computation while y is only accessed during the merge.

Performance
In the previous section a family of sparse matrix formats and their corresponding SpMV kernels has been described.In this section, we show their performance using selected matrices from the SuiteSparse Matrix Collection, formerly known as the University of Florida Sparse Matrix Collection [20].We describe our methodology and provide comparisons between our kernels of different block sizes, and also comparisons with MKL and CSR5 libraries.

Hardware/Software
We used a compute node with two Intel Xeon Platinum 8170 (Skylake) CPUs at 2.10GHz and 26 cores each, with caches of sizes 32K, 1024K and 36608K, and the GNU compiler 6.3.We bind the memory allocation using numaclt -localalloc, and we bind the processes by setting OMP PROC BIND=true.As references, we use Intel MKL 2017.2 and the CSR5 package taken from the bhSPARSE repository accessed on the 11th of September 20172 .We obtain a floating point operation per second (FLOPS) measure using the formula 2 × N N N Z /T , where T is the execution time in seconds.The execution time is measured as an average of 16 runs.

Test Matrices
We took matrices from the University of Florida sparse matrix collection.We selected matrices that were used in the study [21] and added few more to obtain a very diverse set.The matrices labeled Set-A, see Table 1, are used in the computation benchmark.Their execution times are also used in our prediction system, which will be introduced in Section 5.
In the ideal case, the user of our library would create the matrix directly using one of our block-based schemes, choosing the most appropriate depending of the expected matrix structure.This might, however, imply some changes in the application code and thus we have to consider another use-case, where the user wants to convert the matrix from a standard CSR format to one of our formats.The time taken to convert any of the matrices form the Set-A from the CSR format to one of ours is around twice the time of a single SpMV in sequential.For example, for the atmosmodd and bone010, the conversion takes approximately 0.4 and 4 seconds, and the sequential execution around 0.2 and 1.5 seconds, respectively.In a typical scenario, however, many matrix-vector multiplications with the same matrix are performed and thus the cost of the matrix conversion can be covered.When the conversion is required, it is worth noting that it is easiest to convert to β(1, 8) format, since each block is part of a single row and thus the values array remains intact and only the index arrays have to be altered.On the other hand, the β(1, 8) kernel might not be the best performing one, so the optimal decision has to be taken depending on a particular setting.The detailed performance analysis of matrix conversions is not included in the current study.

Sequential SpMV Performance
Figure 3 shows the performance of our kernels in sequential for the matrices from Set-A.We can see significant speed-up in most cases, often up to 50%.We observe that we obtain a speedup even with the β(1, 8) formats compared to the CSR and CSR5 in many cases (and as we mentioned in the previous section, this is of particular importance due to easy conversion from CSR format).However, for all the matrices that have an average number of non-zero values per block of corresponding storage below 2, the β(1, 8) is more likely to be slower than the CSR (with some exceptions, e.g. for the mixtank new the average is 2.5 and β(1, 8) is still slower).For the other blocks sizes, we obtain a similar behavior, i.e., if there is not enough values per block the performance decreases.The worse case is for the ns3Da and kron g500-logn21 matrices, and we see from Table 1 that the blocks remain unfilled for all the considered storages.That hurts the performance of our block-based kernels, since 8 or 4 values have still to be loaded from x, even though only one value is useful, and large width arithmetic operations (multiplication and addition on vectors) still have to be called.
For the β(2, 4) storage, the test-based scheme provides a speedup only for rajat31.For other formats we see that the performance is mainly a matter of average number of values per block (so related to the matrix sparsity pattern), in fact, if we look to the Dense-8000 matrix where all the blocks are completely filled, the performance is not very different from one kernel to the other.For instance there are more values per block in the β (8,4)

Parallel SpMV Performance
Figure 4 shows the performance of our kernels in parallel using 52 cores for the matrices from the Set-A.The MKL is not able to take advantage of this number of threads, and it is faster only for kron g500-logn21 matrix.The CSR5 package is efficient especially when the matrix's structure makes it difficult to achieve high flop-rate, in these cases it has a performance similar or higher than our block-based kernels.All our kernels have very simmilar performance.By comparing Figure 4 with Table 1 we can observe that the NUMA optimization gives noticeable speedup for large matrices (shown as the dark part of bars in Figure 4. Indeed, when a matrix is allocated on a single memory socket, any access by threads on a different socket is very expensive.This might not be a severe problem if the matrix(or at least the part used by the threads) fits in the L3 cache.Then only the first access is costly, especially when the data is read only.On the other hand, if the matrix does not fit in the L3 cache, multiple expensive memory transfers will take place during the computation without any possibility for the CPU to hide them.

Performance Prediction and Optimal Kernel Selection
In the previous section, we compared our method with MKL and CSR5 and showed significant speed-up for most matrices using any block size.The question arises, however, what block size should the user choose to (8,4) bundle adj 513,351   3 and 4, the performance of individual kernels for different block sizes varies and the best option depends on the matrix.
If the user applies presented library on many matrices arising from a particular method (e.g., certain discretization scheme of a certain operator), it is quite likely that the matrices will have similar structure and thus an optimal kernel can be found in just several trials and then applied for all other runs.If, however, the user does not have any idea about the structure of the given matrix and no experience with execution times of individual kernels, some kind of decision making process should be supplied.There are many possible approaches to address this challenging problem.If it should be ussable, it needs to be computationally cheap, without conversion of the matrix, yet still able to advise the user about the block size to take.
In the following we describe several possible approaches towards this problem.The matrices from the Set-A listed in Table 1 are used to fine-tune the prediction method.To asses each of the methods, we introduce new, independent set of matrices labeled Set-B and listed in Table 2.

Performance Polynomial Interpolation (Sequential)
Figure 5 shows dependence of kernel performance on an average number of NNZ per block.Each kernelmatrix combination is plotted as a single dot.One can clearly see a correlation between the two quantities, slightly different for each kernel.Polynomial interpolation has been done for each kernel using the matrices from Set-A and its result is also shown in Figure 5.
The Avg.NNZ/blocks numbers can be obtained without converting the matrices into a block-based storage.This can be used to roughly estimate the performance of each individual kernel.It is clear, however, that even within the matrix set used for interpolation, there are matrices with performance significantly different from the one suggested by the interpolation curve.This difference illustrates the fact that the Avg.NNZ/block metric is a very high-level information that hides all the memory accesses patterns, that can be different depending on the positions of the blocks and their individual structure.If the matrix is mall, for example, it can even fit into the cache, which would influence the performance significantly.Much less obvious but equally important is the influence of block distribution on the access patterns of the x array.These effects cannot be determined from simple knowledge of Avg.NNZ/blocks and therefore we do not expect too accurate results.On the other hand, this approach is very cheap, can be incorporated in our package very easily and can provide some basic guideline.
To select a kernel, we perform the following steps.From a given matrix, we first compute the Avg.NNZ/blocks for the sizes that correspond to our formats.Then, we use the estimation formulas plotted in Figure 5 to have an estimation of the performance for each kernel.Finally, we select the kernel with the highest estimated performance as the one to use.
We can assess a quality of the prediction system using data shown in Table 3.For each matrix we provide the objectively best kernel and its speed, the recommended kernel and its estimated and real speed and, finally, the difference between the best and recommended kernels speed.If the selected kernel is the optimal one, then Best kernel has the same value as Selected kernel and Speed difference is zero.We see that the method selects the best kernels or kernels that have very close performance in most cases, even though the performance estimations as such are not allways completely accurate.In other words, values in columns Best kernel speed and Selected kernel predicted speed are in most cases quite similar, but there are generally differences between values in columns Selected kernel predicted speed and Real speed of selected kernel.
For some matrices the reason simply is, that all kernels have very simmilar performance and thus even if the prediction is not accurate and random kernel is selected, its performance is actually not far from the optimal one.For other matrices it seems that if they have a special structure that makes the performance far from the one given by the interpolation curves, this is the case for all kernels and so the kernel recommendation is finally good.Of course, one could design a specific sparsity pattern in order to make the prediction fail, but it seems that this simple and cheap prediction system works reasonably well for real-world matrices.

Parallel Performance Estimation
Similarly as we have done for the sequential kernels in the last section, we attempt to estimate the performance of parallel kernels in order to advise the user about the best block size to choose for a given matrix before the matrix is converted into a block-based format.The situation is, however, more complicated, since the performance of the kernels will depend on another parameter: the number of threads, which will be used.Thus we perform a non-linear 2D regression of performance based on two parameters: number of threads and average values per block.We use performance results obtained for matrices from Set-A using 1, 4, 16, 32 and 52 threads for each kernel.
The results are then used to estimate (by interpolation) the performance of individual kernels for a given setup.The results of interpolation used on both Set-A and independent Set-B are shown in Figure 6.We show in Figure 6a when this strategy leads to an optimal kernel selection, in Figure 6b we show what is the real performance difference between the kernel advised and the one that is actually the fastest, and in Figure 6c we show the prediction difference for the selected kernel/block size.We observe that the approach does not select the optimal kernels in most cases, but we can also see that the performance provided by the selected kernels are close to the optimal ones (less than 10 percent difference in most cases).Similarly  shows the performance difference between the selected kernel and the best one (if it is 0, the algorithm selected the optimal kernel -Best kernel and Selected kernel is the same).as for the polynomial interpolation in sequential, this is true despite the performance estimation not being accurate.
Even though the presented prediction system works reasonably well, the results are of course not fully satisfactory.Before we describe few other ideas, which could improve its performance, we want to stress out again, that in an ideal case no such system is needed at all, since the user can choose the most suitable kernel based on his or her knowledge of the matrices used.Having that said, we have experimented with prediction system based on neural network (NN) to expand the abilities of the non-linear regression.The advantage of the NN approach is that it is easy to use more parameters to train the network.We thus provided the average NNZ per block for all considered block-sizes (6 parameters) and number of threads to the NN and trained it using the Set-A to predict the performance of all kernels.The results of performance prediction (not included in the study) were generally better than for non-linear interpolation, but, on the other hand, the final kernel recommendation success rate, which is the ultimate goal, was very similar.We thus opted for the interpolation, since it is much simpler to include it as a part of the package.The potential strength of the NN would be in the possibility of extending the parameter set of the NN.It is quite obvious, that much more information about the matrix is needed to estimate its structure and thus the optimal kernel, including the hardware properties such as the cache sizes, the memory bandwidth, etc, but also more detailed statistics on block filling.Developing such prediction system using much richer set of parameters will require training the NN on large test suites and will be part of our future work.

Figure 2 :
Figure 2: SPC5 format examples.The masks are written in conventional order (greater/right, lower/left).

1
extern "C" void c o r e S P C 5 1 r V c S p m v a s m d o u b l e ( c o n s t long i n t nbRows , c o n s t i n t * r o w s S i z e s , 2 c o n s t u n s i g n e d char * h e a d e r s , c o n s t double * v a l u e s , 3 c o n s t double * x , double * y ) ;

17 /
11 j z c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e o u t e x p ; 12 x o r q %r10 , %r 1 0 ; // idxRow/r10 = 0 13 c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e l o o p e x p : r s i ,% r10 , 4 ) , %r 1 1 d ; // nbBlocks = rowsSizes [ idxRow+1]−rowsSizes [ idxRow ] / i f no blocks f o r t h i s row , jump to next i n t e r v a l 18 j z c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e o u t r o w e x p ; p u t e S p m v 5 1 2 a v x a s m d o u b l e i n r o w e x p : 22 movslq 0(% rdx ) , %r 1 3 ; // colIdx = * rdx or * headers 23 movzbl 4(% rdx ) , %r 1 4 d ; // mask = * (rdx+4) or * ( headers+4) ,% r12 , 8 ) , %zmm1{%k1 }{ z } ; // values ( only some o f them) 26 vfmadd231pd (%r8 ,% r13 , 8 ) , %zmm1 , %zmm0 ; // mul add to sum 27 28 popcntw %r14w , %r14w ; // count the number o f b i t s in the mask 29 addq %r14 , %r 1 2 ; // valIdx += number o f b i t s (mask) −=1, i f equal zero go to end o f i n t e r v a l 32 j n z c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e i n r o w e x p ; 33 34 c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e i n r o w e x p s t o p : 35 // Horizontal sum from ymm0 to xmm0 36 ( commented o u t ) . . .

37/c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e o u t r o w e x p :
/ add to y , r9 [ r10 ] = > y [ idxRow ] 38 vaddsd (%r9 ,% r10 , 8 ) , %xmm0, %xmm0 ; i , %r 1 0 ; // idxRow == nbRows 44 j n e c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e l o o p e x p ; // i f nbBlocks != 0 go to beginning 45 c o m p u t e S p m v 5 1 2 a v x a s m d o u b l e o u t e x p : Code 1: β(1, 8) kernel in assembly language.

Figure 3 :
Figure 3: Performance in Giga Flop per second for sequential computation in double precision for the MKL CSR, the CSR5 and our SPC5 kernels.Speedup of SPC5 against MKL CSR is shown above the bars.

Figure 4 :
Figure 4: Performance in Giga Flop per second in double precision (double) for the MKL CSR, the CSR5 and our SPC5 kernels for the parallel implementation using 52 threads.Each bar shows the performance without NUMA optimization (light) and with NUMA optimization (dark).Speedup of SPC5 with NUMA optimizations against MKL CSR is shown above the bars.

Figure 5 :
Figure 5: Polynomial interpolation of the performance in Giga-Flop per second vs. the average number of values per blocks.The dots are the matrices from Set-1 used to obtain the polynomial weights.
Selection of the optimal kernels.Green means success, red failure.

%
From the best execution (b) Performance difference between the selected kernels and the best ones.

%
From the best execution (c) Difference between the real and estimated performance of the selected kernels.

Figure 6 :
Figure 6: Selection of kernels for matrices of Set-A and Set-B by non-linear interpolation.Matrices from Set-A are used to find the interpolation coefficients.Interpolation is then applied for both Set-A and independent Set-B (matrices labeled with ).

Table 1 :
Matrix set for computation and performance analysis.Refered to as Set-A.
then in the β(4, 8) storage only for indochina-2004 matrix, which means that it did not help to capture 8 rows per block instead of 8 columns.

Table 2 :
Set-B : Matrix set for prediction maximize the performance.As it can be seen in Figures

Table 3 :
Performance estimation and kernel selection for matrices from Set-A and Set-B.The selection is done by having a polynomial interpolation based on the results of matrices from Set-A.Speed difference