Hi Jerry,

> Yes, OK, however, have you been able to test performance. I am only
> curious. There was a test program we used back when this code was first
> implemented in bugzilla. I do not remember the PR number off hand.

as you mentioned in a private mail, it was PR51119, and the timing program

  https://gcc.gnu.org/bugzilla/attachment.cgi?id=40039

I needed to fix the source code slightly to make it work with current
gfortran, by replacing the subroutine dummy with

  subroutine dummy(a,b)
    integer, parameter :: wp = selected_real_kind(4), &
                          dp = selected_real_kind(8)
    real(dp), intent(in), dimension(1) :: a
    real(dp), intent(inout), dimension(1) :: b
  end subroutine dummy

Testing it on my notebook with an Intel i5-8250U which has avx2, I found
no significant differences between the current master and the version
with the patch when compiling with

  % gfc-11 -static -O2 -march=native -finline-matmul-limit=0 compare.f90

E.g. gcc-11 with patch to libfortran:

========================================================
================           MEASURED GIGAFLOPS          =
========================================================
                 Matmul                           Matmul
                  fixed                Matmul   variable
 Size  Loops   explicit  refMatmul    assumed   explicit
========================================================
    2   2000      0.025      0.139      0.025      0.026
    4   2000      0.191      0.799      0.743      0.741
    8   2000      3.272      2.437      3.280      3.311
   16   2000      7.615      2.768      8.405      7.572
   32   2000      8.492      3.063      9.733      9.521
   64   2000     14.137      3.299     14.118     14.295
  128   2000     18.838      3.128     19.149     18.893
  256    477     17.214      3.256     17.293     17.255
  512     59     17.940      3.316     17.986     17.985
 1024      7     17.672      2.665     17.691     17.698
 2048      1     17.571      2.595     17.559     17.170

With unmodified gcc-11:

========================================================
================           MEASURED GIGAFLOPS          =
========================================================
                 Matmul                           Matmul
                  fixed                Matmul   variable
 Size  Loops   explicit  refMatmul    assumed   explicit
========================================================
    2   2000      0.024      0.194      0.025      0.025
    4   2000      0.231      1.641      0.718      0.716
    8   2000      3.424      2.445      3.198      3.435
   16   2000      7.715      2.718      7.615      7.845
   32   2000      8.696      3.088      9.728      9.772
   64   2000     14.171      3.275     13.995     14.447
  128   2000     18.931      3.127     18.942     19.019
  256    477     17.239      3.232     17.267     17.291
  512     59     17.938      3.315     17.967     17.996
 1024      7     17.674      2.632     17.673     17.711
 2048      1     17.579      2.581     17.552     17.587

give or take. (For those too lazy to check: refMatmul is just the
naive explicit matmul).

However, when comparing with older gccs I got better numbers!

E.g. gcc-7:

========================================================
================           MEASURED GIGAFLOPS          =
========================================================
                 Matmul                           Matmul
                  fixed                Matmul   variable
 Size  Loops   explicit  refMatmul    assumed   explicit
========================================================
    2   2000      0.113      0.199      0.126      0.150
    4   2000      0.866      0.865      0.766      0.881
    8   2000      3.551      2.750      3.371      3.852
   16   2000      7.826      3.517      7.489      7.464
   32   2000      9.989      3.859     11.811     11.903
   64   2000     16.218      4.213     16.501     16.687
  128   2000     19.971      4.006     20.070     20.049
  256    477     22.804      4.139     22.949     22.894
  512     59     23.637      4.047     23.800     23.765
 1024      7     23.051      3.065     23.177     23.152
 2048      1     22.953      2.784     22.946     22.960

So if I were worried that there is a performance penalty by my patch,
I'd look for other places, too.

Cheers,
Harald