Sparse matrix-vector multiplication (SpMV) is an essential operation in solving linear systems and eigenvalue problems. For many iterative methods, the fraction of the execution time of SpMV may be more than 80% in the total time, so the study of its performance has attracted a lot of attention. Right now, the GPU has been from a graphics accelerator to a computing device with a broad spectrum of purposes, due to the characteristics of the multi-thread, high memory bandwidth. It can solve massively parallel problems and obtain very high performance. However, how to predict the execution time of SpMV on GPUs accurately is still a big challenge.
In 2003, Bolz et al.  first implemented conjugate gradient (CG) method on GPU which contains SpMV. Bell and Garland  proposed SpMV CUDA kernels for some well-known sparse matrix storage formats, such as compressed sparse row (CSR), Ellpack-Itpack (ELL), Coordinate (COO), Diagonal (DIA) and Hybrid (HYB). The jagged diagonal format (JAD) was used to implement the SpMV kernel in  , and they also realized the GPU-accelerated preconditioned CG and GMRES methods. In this paper, we utilize the SpMV kernels in  : CSR-V, CSR-S, ELL and JAD.
In order to improve the performance of SpMV on GPUs, Vazquez et al.  presented a new sparse storage format, termed ELLR-T, which is improved from ELL format, to achieve higher performance. Monakov et al.  put forward a sliced ELL format and used auto-tuning to find the optimal configuration for batter performance. Zheng and Gu  proposed bisection ELL (BiELL) and bisection JAD (BiJAD) format based on ELL and JAD format for optimizing SpMV on GPUs. Especially for irregular matrices, BiELL and BiJAD format can get greater load balance and higher performance by adjusting the elements storage location of matrix and make the zero-padding less. Meanwhile, they realized CG and GMRES method with BiELL and BiJAD format on GPU  . Choi et al.  designed blocked ELL format and select proper parameters for better performance. Guo et al.  proposed an auto-tuning framework that can select the parameters of SpMV kernels to obtain the optimal performance on GPUs.
Besides studying how to improve the performance of SpMV on the GPUs, there are also many performance models focusing on performance prediction. Resios  proposed a parameterized analytical model to estimate execution time and identify potential bottlenecks in programs. Dinkins  put forward a model for predicting SpMV performance using memory bandwidth requirements and data locality. Werkhoven et al.  gave an analytical performance model that includes PCIe transfers and overlapping computation and communication to predict the execution time for CPU-GPU data transfers. Baghsorkhi et al.  presented a model to predict the performance of GPU applications based on the string and work flow graph. Guo et al.  showed a performance modeling and optimizing analysis to predict and optimize SpMV performance on GPUs. A simple analytical GPU model to predict the execution time of massively parallel programs was given by Hong et al.  . Schaa et al.  presented a model to accurately estimate the execution time of GPU applications by varying the configurations. In a word, most of the performance prediction researches analyze and predict the execution time from the perspective of the machine itself, which focus on the physical properties and parameters of GPUs. There has few performance prediction models were established from the view of mathematics.
In this paper, we present two new improved models based on  and get better prediction accuracy. Our models mainly consist of two phases: generating parameters and fitting for prediction. In the first phase, some benchmark matrices will be generated according to a GPU’s architecture features and four different sparse matrix storage formats, then SpMV with these benchmark matrices are implemented on the GPU to obtain the execution times. We will establish two parametric models according to the results of the benchmark matrices and predict the execution time of the SpMV kernels with a given target matrix on the GPU by our models in second phase.
The performance prediction models are essentially at the statistical point of view to predict the execution time of different SpMV kernels on GPUs. Firstly, the execution time of the benchmark matrices with different parameters is required, and then fitting the prediction functions according to the execution time of the benchmark matrices and two parameters. Finally, the estimated execution time of a target matrix will be got after putting two parameters into the prediction functions.
Compared with  ’s model, the important difference of this paper is generating benchmark matrices. In  , the number of non-zero elements per row ( ) of the benchmark matrices is set to a fixed value. While in many practical problems, such as in computer science or mathematics, there are many matrices are in irregular structure, which is different largely. So if of benchmark matrices generated is fixed, the performance prediction model will go awry to some extent and then the accuracy of prediction will be decreased. Keeping this in mind that we generated two kinds benchmark matrices whose in the uniform distribution or normal distribution, respectively, and then establish its own performance prediction model. In the numerical experiments, we used our model to estimate the execution time of different SpMV kernels with 30 matrices, these matrices are from the University of Florida Sparse matrix collection  . The experimental results show that the average prediction error of our models are two to three times lower than  at least, some matrix even tens of times lower.
The remainder of this paper is organized as follows: Section 2 gives some preliminaries and Section 3 shows the details of the performance prediction model. Experimental results and analyses are reported in Section 4. Finally, some conclusions and future works are stated in Section 5.
Firstly, we state in brief the GPU architecture and CUDA (Compute Unified Device Architecture) programming model. Traditionally, GPUs have been especially designed to handle the computation for computer graphics in real-time. Today, they are increasingly being exploited as general-purpose attached processor to speed-up computations in image processing, physical simulations, data mining, linear algebra, etc.  . It is suitable for processing massively parallel tasks, which have high density and simple branching logic. CUDA introduced by NVIDIA is similar in style to a single program multiple data (SIMD) software model  . CUDA programs on the host (CPU) invoke a kernel which runs on the device (GPU). All threads within a block are executed concurrently on a architecture  . In addition, when a multiprocessor is given one or more thread blocks to execute, it partitions them into groups of 32 parallel threads termed warp.
Four sparse matrix storage formats used in our model are described below. The CSR is probably the most popular format for storing general sparse matrices  . To parallelize the SpMV for a matrix in the CSR format, a simple scheme called CSR-S (CSR scalar) kernel  is to assign each thread by one row. The main drawback of this scheme is that the pattern of memory access is un-coalesced, so it shows not very efficient. A better approach, termed CSR-V (CSR vector) is proposed in  and modified in  to realize the memory access contiguously.  assign a warp (32 threads) to each row, while  assign each row a half-warp (16 threads). In this approach, since all threads within a warp or a half-warp access non-zero elements of one row, these accesses are more likely to belong to the same memory segment, so the chance of coalescing should be higher. In addition, CSR-V kernel in  will be used in our model. ELL format is efficient if the maximum number of non-zeros per row is not substantially different from the average. However, when the number of non-zeros varies largely between rows, excess padded zeros decrease the performance of SpMV. The JAD can be viewed as a generalization of ELL format which removes the assumption on the fixed-length rows  . Compared to the ELL format, there are fewer zero-padding in the JAD format, so the performance of JAD will be better than ELL when implement SpMV on GPUs. Other details of these sparse matrix storage formats can be consulted in  .
3. The Performance Prediction Model for SpMV
The work-flow of our model is similar to  , which contains two phases. The main difference is the criteria for generating the benchmark matrices, which we described in follows.
3.1. Phase One: The Establishment of Fitting Function for Performance Prediction
Firstly, we give the definition of matrix strip. The strip of a matrix is a maximum sub-matrix that can be handled by a GPU with a full load of thread blocks within one iteration. Let be the number of streaming multiprocessors for a NVIDIA GPU, be the number of half-warps per multiprocessor and denote the number of threads per multiprocessor. Then, the size of strip for CSR-V, CSR-S, ELL and JAD format can be computed as follows:
Secondly, we state the criteria for generating benchmark matrices.
・ The number of rows ( ):
where is a positive integer, is the size of strip defined for four formats (in different sub-index) as above.
・ The number of non-zero elements per row ( ):
In  ’s model, each row has the same for the benchmark matrices. However, will not be same exactly for a target matrix in practical situations. So we combine with the characteristics of the matrix structure and let to meet a certain distribution. Two distributions will be adopted: normal and uniform. In addition, the mean of the distribution will be treated as . The benchmark matrices meet two kinds of distribution can be generated by Matlab. In addition, we assume that then on-zero elements are in single precision (float).
・ The number of columns ( ):
For the sake of simplicity, the benchmark matrices generated in our numerical experiments will be square. Obviously, it should be assumed that .
Thirdly, we set parameters of benchmark matrices.
In order to get more accurate fitting functions in our models, a series of benchmark matrices will be generated according to the above criteria. A benchmark matrix is only determined by and . Since , where S is fixed for a certain sparse matrix format, we just need change the value of to get different benchmark matrices. Due to in the benchmark matrices follows two kinds of distributions, so it is determined by the mean of each distribution based on the distribution density . Then combine the value of and , we can obtain a benchmark matrix.
・ The number of strips ( ):
➢ CSR-V: Let
In our experimental platform, the size of matrix strip for CSR-V format is fewer than other formats, in order to predict the performance accurately, we increase the value of to 50.
➢ CSR-S, ELL, JAD: Let
・ The distribution density ( ):
➢ CSR-V: Let
Because of out of memory occurred in Matlab when is too large, so we make as the maximum value, but it does not affect the accuracy of the performance prediction.
➢ CSR-S, ELL, JAD: Let
Finally, the formula for calculating average the execution time of benchmark matrices is same as that of  .
where denotes a benchmark matrix of dimension ; is a random vector of length ; and are the number of executions and . is the execution time for each time the benchmark matrix be executed. For a target matrix with rows and non-zero elements, the number of strips and the number of non-zero elements per row with four formats can be compute as follows:
Let be the set consisting of the number of non-zero elements in each row of the target matrix. Then is set to be mode of for CSR-V matrix, while it is the maximum value of for CSR-S, ELL, JAD matrices.
3.2. Phase Two: Prediction Based on the Performance Model
According to the statistics methods, we fit the performance function of SpMV for different storage formats, which based on three parameters of benchmark matrices: , and . After the performance function obtained, we can estimate the execution time of SpMV for a target matrix by substituting two parameters and of the target matrix into it.
3.2.1. CSR-V Format
After a large amount of experiments and fitting, we found that for CSR-V matrices, the relationship between and is different when is smaller or larger than the number of maximum threads per block (1024 for GeForce GTX 540 M). Therefore, the performance fitting function is obtained by the following method.
・ Establish the function
For the benchmark matrices with the same number of strips, we establish the relationship between and the execution time for SpMV. The fitting functions of two distributions for the number of strips 40 are shown in Figure 1. As can be seen, the relationship between and is approximately linear, so we make .
・ Establish the function
For the benchmark matrices with same , we establish the relationship between and the execution time (i.e. in the above) of the benchmark matrices for SpMV. The fitting functions of two distributions for and 2048 are shown in Figure 2. The relation between and is also approximately linear, so we make .
・ Estimate the execution time of a target matrix
For a target matrix, we need to calculate two parameters according to the Equation (6) and : the number of non-zero elements per row and the number of strips , then derive and from above functions, respectively. In order to combat the effects of the difference about functions when the number of non-zero elements per row is smaller or larger than 1024, another execution time of any previously tested benchmark matrix whose is set to be the number of non-zero elements per row in . At this point, estimated execution time of the target matrix in CSR-V format is
Figure 1. The fitting functions of two distributions for ((a) and (b): , (c) and (d): ).
Figure 2. The fitting functions of two distributions for and 2048 ((a) and (b): , (c) and (d): ).
3.2.2. CSR-S Format
・ Establish the function
For the benchmark matrices with the same , we establish the relationship between and for SpMV. The fitting functions of two distributions for are shown in Figure 3. As showed in the Figure, the relationship between and is approximately linear, so we make ( can be any arbitrary value within the range of ).
・ Establish the function
For sets of benchmark matrices with different number of strips, we establish the relationship between the number of strips and the coefficient of the linear functions in . The fitting functions of two distribution are shown in Figure 4. In Figure 4, it is approximately linear between and , so we make .
・ Establish the function
For Like the fitting function , we establish the relationship between the number of strips and the execution time (i.e. in the above) of the benchmark matrices with the same number of non-zero elements per row
Figure 3. The fitting functions of two distributions for .
Figure 4. The fitting functions for versus .
, which can be any arbitrary value within the range defined . The fitting functions of two distributions for are shown in Figure 5. We can see that the relationship between and is approximately linear, so we make , and the intercept function .
・ Estimate the execution time of a target matrix
Given a target matrix, we need to calculate two parameters according to the Equation (6) and : the number of non-zero elements per row and the number of strips , then derive and from above functions, respectively. At this moment, estimated execution time of the target matrix in CSR-S format is .
After getting the performance function of CSR-S format, we find that the relationship between dependent variables and two variables , is saddle surface in the functional image, that is to say, when is fixed, the relationship between and is linearity and vice versa, which coincided with the 3D fitting image we get in Matlab, as shown in Figure 6, which the darker of the colors means the smaller of the values.
3.2.3. ELL and JAD Formats
Note that, the granularity of ELL and JAD format is the same as CSR-S format, which assigns one thread to each row to implement SpMV on GPUs. Therefore, fitting the performance function of ELL and JAD format is done in a similar way with CSR-S format, except the functional expressions. In addition, the 3D images obtained by the Matlab can be fitted with the performance functions and need not be repeated here.
4. Experimental Results and Analyses
The experiments are performed on NVIDIA GeForce GTX 540 M with 1 GB global memory, the operating system is a 64-bit Linux with CUDA 6.5 driver. We evaluated our performance prediction model on 30 matrices with each sparse matrix storage format, respectively. These matrices are square real matrices from the University of Florida Sparse matrix collection  . Moreover,
Figure 5. The fitting functions of two distributions for .
Figure 6. The 3D fitting images of two distributions for .
in order to compare with  ’s model, we also implement the model in  whose is fixed in benchmark matrices. The estimated time and the performance difference rate of the 30 target matrices in four sparse matrix storage formats with three models are given, which is convenient to compare with each other.
We define the performance difference rate for different model as
For CSR-V format, the performance difference rate of SpMV in three models of 30 matrices is shown in Figure 7. We can see that in  ’s model is 5.05% on average, while it is 2.42% and 2.56% for our normal and uniform model, respectively. Compared with the  ’s model, the prediction accuracy of the normal model and uniform model are improved by 2.08 times and 1.97 times on average, respectively.
Furthermore, there are three cases for the prediction accuracy of  ’s model is higher than that of uniform model (such cavity05), and that of normal model (such as bcsstk04). The prediction accuracy of normal distribution model is higher than uniform distribution model for bp_1600 and other 17 matrices.
When implement SpMV on GPU with CSR-S format, the performance difference rate in three models of 30 matrices is shown in Figure 8. The average in  ’s model, our uniform and normal model is 5.44%, 3.21% and 3.52%, respectively. The average factor for the prediction accuracy of uniform model higher than that of  ’s model is 1.69, while for normal model, it is 1.54.
In addition, the prediction accuracy of  ’s model is higher than that of normal model on 15 matrices (such as bcsstk16), while for uniform model the better number is 12 (such as bips98_1142). The better number for normal model vs. uniform model is 13 (such as bayer09) in 30 cases.
Figure 7. The performance difference rate of SpMV on CSR-V matrices.
The similar results for ELL matrices are given in Figure 9. It says the average of three models are 5.79%, 3.53% and 3.26%, respectively. The average
Figure 8. The performance difference rate of SpMV on CSR-S matrices.
Figure 9. The performance difference rate of SpMV on ELL matrices.
better number for the factor of normal and uniform model vs.  ’s model are 1.77, 1.64 respectively. The better number for  ’s model vs. normal model and uniform model is the same 7:23, while for normal model vs. uniform model it is 20:10.
Figure 10 give the results for JAD matrices. of three models are 5.97%, 3.94% and 4.28% on average, respectively. The average improved factor for normal and uniform model over  ’s model are 1.46 and 1.35. The better number for  ’s model vs. normal model is 9:21, while for uniform model it is 11:19, while for normal model vs. uniform model it is 16:14.
The execution time of four SpMV kernels in three model on 30 matrices is shown in Figures 11-14. There is large difference in the execution time for all matrices in different storage formats. So we put the execution time into two figures: the shorter and the longer in (a) and (b), respectively. Almost all of the estimated time of the matrices in four different storage formats is greater than the actual measured time. The possible reasons are that we take the number of strips by rounding up to an integer.
5. Conclusion and Future Work
Aiming at the better performance model of SpMV on GPU based on statistics and  , we have presented two new models, which consider the structure of matrices. We predict four SpMV CUDA kernels: CSR-V, CSR-S, ELL and JAD. The numerical result shows that the prediction accuracy of our models is higher than that of  ’s model.
Figure 10. The performance difference rate of SpMV on JAD matrices.
Figure 11. The comparison of estimated and measured time on CSR-V matrices.
Figure 12. The comparison of estimated and measured time on CSR-S matrices.
Figure 13. The comparison of estimated and measured time on ELL matrices.
Figure 14. The comparison of estimated and measured time on JAD matrices.
In the future, we will extend our performance prediction model to other SpMV with different storage formats on different kinds of GPUs. In addition, we will propose a new performance model to predict the execution time of a class of iterative methods on heterogeneous parallel machines.
*The project is partly supported by the NSF of China (61472462, 11671049) and foundation of Key Laboratory of Computational Physics.
 Bell, N. and Garland, M. (2009) Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, 14-20 November 2009, Article No. 18.
 Vazquez, F., Ortega, G., Fernandez, J.J. and Garzon, E.M. (2010) Improving the Performance of the Sparse Matrix Vector Product with GPUs. Proceedings of the 10th IEEE International Conference on Computer and Information Technology, IEEE Computer Society, Bradford, 29 June-1 July 2010, 1146-1151.
 Monakov, A., Lokhmotov, A. and Avetisyan, A. (2010) Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures. In: Patt, Y.N., Foglia, P., Duesterwald, E., Faraboschi, P. and Martorell, X., Eds., High Performance Embedded Architectures and Compilers. HiPEAC 2010. Lecture Notes in Computer Science, Vol. 5952. Springer, Berlin, Heidelberg, 111-125.
 Zheng, C., Gu, S., Gu, T.-X., Yang, B. and Liu, X.-P. (2014) BiELL: A Bisection ELLPACK Based Storage Format for Optimizing SpMV on GPUs. Journal of Parallel and Distributed Computing, 74, 2639-2647.
 Gu, T.-X., Zheng, C., Gu, S. and Liu, X.-P. (2014) Solving Sparse Linear Systems on GPUs Based on the BiELL Storage Format. Proceedings of International Conference on Parallel, Distributed Systems and Software Engineering, Singapore, 96-107.
 Guo, P., Wang, L.-Q. and Chen, P. (2014) A Performance Modeling and Optimization Analysis Tool for Sparse Matrix Vector Multiplication on GPUs. IEEE Transactions on Parallel and Distributed Systems, 25, 1112-1123.
 Dinkins, S. (2012) A Model for Predicting the Performance of Sparse Matrix Vector Multiply (SpMV) Using Memory Bandwidth Requirements and Data Locality. Master’s Thesis, Colorado State University, Fort Collins.
 van Werkhoven, B., Maassen, J., Seinstra, F.J. and Bal, H.E. (2014) Performance Model for CPU-GPU Data Transfers. 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, 26-29 May 2014, 11-20.
 Baghsorkhi, S.S., Delahaye, M., Gropp, W.D. and Hwu, W.-M.W. (2012) Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications. Workshop on Exploiting Parallelism Using GPUs and Other Hardware-Assisted Methods (EPHAM’09), in conjunction with the 2009 International Symposium on Code Generation and Optimization (CGO), Seattle, Washington DC.
 Hong, S. and Kim, H. (2009) An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness. Proceedings of the 36th Annual International Symposium on Computer Architecture, Austin, 20-24 June 2009, 152-163.
 Schaa, D. and Kaeli, D. (2009) Exploring the Multiple-GPU Design Space. Proceedings of the 2009 IEEE International Parallel & Distributed Processing Symposium, Rome, 23-29 May 2009, 1-12.
 Guo, P. and Wang, L.-Q. (2012) Accurate CUDA Performance Modeling for Sparse Matrix-Vector Multiplication. Proceeding of the 2012 International Conference on High Performance Computing and Simulation (HPCS), Madrid, 2-6 July 2012, 496-502.