The Better Accuracy of Strassen-Winograd Algorithms (FastMMW)

ABSTRACT

The first error theory and bounds for fast matrix multiplication based on the Strassen-Winograd algorithms (FastMMW) were formulated in the 1970s. That theory introduced what is now known as weakly stable error analysis, in which the error bounds must use matrix norms instead of component-wise bounds. Although the theory debunked the instability myth through matrix scaling and a clean, simple analysis, its bounds are available only as properties of the whole matrices; they are coarse and pessimistic, are at times used to suggest instability, and are not used for algorithm optimization. We build on the original theory to reformulate the bounds: we show that tighter norm-wise and component-wise bounds are achievable by orthogonal algorithm optimizations. To discriminate even better and to circumvent the use of norm bounds, we develop an error theory based on concepts from communication and statistics: we investigate lower and upper bounds, we estimate the practical bounds, and we investigate the algorithmic nature of the error for the class of random matrices. The theory and tools are not limited to random matrices, and we foresee further investigations extending them to different matrix classes and algorithms. We propose new and more accurate algorithms. We show, theoretically and empirically, that we can improve the maximum absolute error of any FastMMW algorithm by 10% to 20% per recursion (halving the error over 4 recursions). Our theory and practice will, in turn, kick-start the development of hybrid algorithms that are as accurate as the vendor GEMM implementation and, in certain cases, even more accurate for random matrices.
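As a rough illustration of the quantity discussed in the abstract, the following minimal sketch measures the maximum absolute error of a textbook Winograd variant of Strassen's algorithm on random matrices, using exact rational arithmetic as the reference. It is not the paper's optimized algorithm or experimental setup: the function names (`classic`, `winograd`, `max_abs_error`), the `leaf` cutoff, the matrix size, and the uniform distribution on [-1, 1] are all assumptions chosen for the example, and the sketch handles square matrices only.

```python
import random
from fractions import Fraction

def classic(A, B):
    """Row-by-column product; works for ints, floats, or Fractions."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Return the four quadrants X11, X12, X21, X22 of a square matrix."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def winograd(A, B, leaf=4):
    """Winograd variant of Strassen: 7 recursive products, 15 additions."""
    n = len(A)
    if n <= leaf or n % 2:          # fall back to the classic product
        return classic(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    S1 = add(A21, A22); S2 = sub(S1, A11)
    S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1)
    T3 = sub(B22, B12); T4 = sub(T2, B21)
    M1 = winograd(A11, B11, leaf); M2 = winograd(A12, B21, leaf)
    M3 = winograd(S4, B22, leaf);  M4 = winograd(A22, T4, leaf)
    M5 = winograd(S1, T1, leaf);   M6 = winograd(S2, T2, leaf)
    M7 = winograd(S3, T3, leaf)
    U2 = add(M1, M6); U3 = add(U2, M7); U4 = add(U2, M5)
    C11 = add(M1, M2); C12 = add(U4, M3)
    C21 = sub(U3, M4); C22 = add(U3, M5)
    top = [l + r for l, r in zip(C11, C12)]
    bottom = [l + r for l, r in zip(C21, C22)]
    return top + bottom

def max_abs_error(C, exact):
    """Largest |C[i][j] - exact[i][j]|, evaluated in exact arithmetic."""
    return float(max(abs(Fraction(c) - e)
                     for rc, re in zip(C, exact) for c, e in zip(rc, re)))

# Assumed experiment: 16 x 16 matrices with entries uniform in [-1, 1].
random.seed(0)
n = 16
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
exact = classic([[Fraction(x) for x in r] for r in A],
                [[Fraction(x) for x in r] for r in B])
err_classic = max_abs_error(classic(A, B), exact)
err_winograd = max_abs_error(winograd(A, B), exact)
print("classic :", err_classic)
print("winograd:", err_winograd)
```

Lowering `leaf` adds recursion levels, which is one way to observe how the maximum absolute error changes per recursion on a given input; the numbers depend on the seed, size, and distribution assumed here.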

Cite this paper

D’Alberto, P. (2014) The Better Accuracy of Strassen-Winograd Algorithms (FastMMW). *Advances in Linear Algebra & Matrix Theory*, **4**, 9-39. doi: 10.4236/alamt.2014.41002.


