Re: [PATCH] Improve sha*sum speed

Pádraig Brady Tue, 13 Sep 2011 05:12:29 -0700

On 09/12/2011 03:49 PM, Loïc Le Loarer wrote:
> Hi,
> 
> Here are my latest results and patch. Please find the patches to
> sha1.c, sha256.c and sha512.c attached and the "time" of the resulting
> binaries in sha_benchs.log. For all binaries, in 64 and 32 bit modes
> (.m32), I ran the command "\time sha*sum zero1G" 3 times, where zero1G
> is a 10^9 byte file created by the command:
> dd if=/dev/zero of=zero1G count=1 bs=1 seek=$(( 1000 * 1000 * 1000 - 1 ))


Note that using a sparse file should eliminate
some I/O overhead and caching issues.
I'm using: truncate -s1G 1G
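
As a rough sketch of how the two test files compare, the script below
recreates both in a temporary directory and prints their apparent and
allocated sizes. It assumes GNU coreutils (truncate, stat -c, du); the helper
names apparent_bytes and allocated_kib are made up for this illustration.
Note that GNU truncate treats "1G" as 1024^3 bytes, so the two files are not
the same length, and whether either is stored sparsely depends on the
filesystem in use.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names), assuming GNU stat and du.
apparent_bytes() { stat -c %s "$1"; }        # length as seen by a reader
allocated_kib()  { du -k "$1" | cut -f1; }   # space actually allocated

dir=$(mktemp -d) || exit 1

# 10^9 bytes: seek to the last byte and write a single zero there.
dd if=/dev/zero of="$dir/zero1G" count=1 bs=1 \
   seek=$(( 1000 * 1000 * 1000 - 1 )) 2>/dev/null

# 2^30 bytes: extend an empty file without writing any data.
truncate -s1G "$dir/1G"

for f in "$dir/zero1G" "$dir/1G"; do
  printf '%s: %s bytes, %s KiB allocated\n' \
         "${f##*/}" "$(apparent_bytes "$f")" "$(allocated_kib "$f")"
done

rm -rf "$dir"
```

The allocated figure is the one to check: if it is close to the apparent
size, the filesystem did not keep the file sparse.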

> 
> The compilation of coreutils was done using the command
> make CFLAGS="-O3"

I used -O2 -march=corei7-avx

> for 64 bit version and
> make CFLAGS="-m32 -O3"
> for 32 bit version.
> 
> gcc is version 4.4.5 (Ubuntu 10.10)

gcc version 4.6.0 20110603 (Red Hat 4.6.0-10)

> My CPU is a Sandy Bridge @2.5GHz.

Sandy Bridge i3-2310M CPU @ 2.10GHz

> 
> For sha1, the result is very close to Linus' version for git.
> 
> I think it could be a good idea to include those patches to improve
> the C versions; this is probably close to the best that can be done in
> "pure" C.
> 
> To improve further, assembly with or without SSE could be done in a second 
> pass.
> 
> What do you think of that?
> 
> I don't have a GCC farm access yet, so I can only test on my system for now.

Just summarising your results for 1G of data

sha1  \  orig    new
32 bit | 5.15s   2.93s
64 bit | 3.54s   2.59s

I'm not seeing any improvement on my Sandy Bridge system, though:

sha1  \  orig    new
64 bit | 5.5s    5.5s

Is the newer GCC perhaps better able to handle the old code?
Though you said you tried both gcc-4.6.1 and gcc-4.4.5 with
no significant difference (maybe Red Hat have tweaks to their GCC?).

I am seeing a halving of the branch instructions though,
which should help a lot on Intel P4 CPUs for example
(see the attached perf output, obtained using the attached perf-hw script).
Actually, with GCC -O3 rather than -O2, there is the same
halving of branch instructions with either the new or the old code.
I'd like to find out why your Sandy Bridge system
is giving double the performance.
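
Since the two test files differ in length (10^9 bytes for zero1G, and 2^30
bytes for 1G assuming it came from truncate -s1G as above), a per-byte
comparison is a little fairer than comparing raw seconds. A minimal sketch,
with mb_per_s as a made-up helper name and the timings taken from the 64 bit
figures above:

```shell
#!/bin/sh
# mb_per_s BYTES SECONDS: print throughput in MB/s, where 1 MB = 10^6 bytes
# (hypothetical helper).
mb_per_s() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

mb_per_s 1000000000 2.59   # zero1G, reported "new" time: prints 386.1
mb_per_s 1073741824 5.5    # 1G, time measured here:      prints 195.2
```

That still leaves roughly a factor of two between the two systems.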

cheers,
Pádraig.

2a492f15396a6768bcbca016993f4b4c8b0b5307  1G

 Performance counter stats for './sha1sum 1G':

    11,486,302,464 cpu-cycles                #    0.000 GHz                         [14.81%]
     <not counted> stalled-cycles-frontend
     <not counted> stalled-cycles-backend
    23,607,008,727 instructions              #    2.06  insns per cycle             [18.51%]
        32,295,529 cache-references                                                 [18.51%]
        15,620,660 cache-misses              #   48.368 % of all cache refs         [18.51%]
       467,602,257 branch-instructions                                              [18.51%]
           397,166 branch-misses             #    0.08% of all branches             [18.51%]
       547,720,718 bus-cycles                                                       [14.82%]
     4,903,619,143 L1-dcache-loads                                                  [14.87%]
        74,511,949 L1-dcache-load-misses     #    1.52% of all L1-dcache hits       [14.87%]
     2,291,883,132 L1-dcache-stores                                                 [14.87%]
        37,281,285 L1-dcache-store-misses                                           [14.86%]
     <not counted> L1-dcache-prefetches
        49,635,435 L1-dcache-prefetch-misses                                        [14.86%]
     <not counted> L1-icache-loads
         4,461,435 L1-icache-load-misses     #    0.00% of all L1-icache hits       [14.86%]
     <not counted> L1-icache-prefetches
     <not counted> L1-icache-prefetch-misses
                 0 LLC-loads                                                        [14.86%]
                 0 LLC-load-misses           #    0.00% of all LL-cache hits        [14.85%]
                 0 LLC-stores                                                       [ 7.43%]
                 0 LLC-store-misses                                                 [ 7.42%]
                 0 LLC-prefetches                                                   [ 7.42%]
                 0 LLC-prefetch-misses                                              [ 7.42%]
                27 dTLB-loads                                                       [11.13%]
         1,462,186 dTLB-load-misses          #  5415503.70% of all dTLB cache hits  [14.83%]
                 0 dTLB-stores                                                      [14.83%]
           181,369 dTLB-store-misses                                                [14.83%]
     <not counted> dTLB-prefetches
     <not counted> dTLB-prefetch-misses
            19,525 iTLB-loads                                                       [14.83%]
             7,151 iTLB-load-misses          #   36.62% of all iTLB cache hits      [14.82%]
       468,827,726 branch-loads                                                     [14.82%]
           362,910 branch-load-misses                                               [14.82%]

       5.535355729 seconds time elapsed

2a492f15396a6768bcbca016993f4b4c8b0b5307  1G

 Performance counter stats for './sha1sum 1G':

    11,375,498,564 cpu-cycles                #    0.000 GHz                         [14.81%]
     <not counted> stalled-cycles-frontend
     <not counted> stalled-cycles-backend
    22,033,897,866 instructions              #    1.94  insns per cycle             [18.52%]
        32,681,805 cache-references                                                 [18.54%]
        15,632,851 cache-misses              #   47.833 % of all cache refs         [18.55%]
       221,107,498 branch-instructions                                              [18.57%]
           446,103 branch-misses             #    0.20% of all branches             [18.58%]
       543,002,709 bus-cycles                                                       [14.87%]
     4,376,525,868 L1-dcache-loads                                                  [14.87%]
        75,073,302 L1-dcache-load-misses     #    1.72% of all L1-dcache hits       [14.87%]
     1,879,261,476 L1-dcache-stores                                                 [14.86%]
        37,334,903 L1-dcache-store-misses                                           [14.86%]
     <not counted> L1-dcache-prefetches
        61,566,282 L1-dcache-prefetch-misses                                        [14.86%]
     <not counted> L1-icache-loads
         4,512,297 L1-icache-load-misses     #    0.00% of all L1-icache hits       [14.84%]
     <not counted> L1-icache-prefetches
     <not counted> L1-icache-prefetch-misses
                 0 LLC-loads                                                        [14.82%]
                 0 LLC-load-misses           #    0.00% of all LL-cache hits        [14.81%]
                 0 LLC-stores                                                       [ 7.40%]
                 0 LLC-store-misses                                                 [ 7.43%]
                 0 LLC-prefetches                                                   [ 7.43%]
                 0 LLC-prefetch-misses                                              [ 7.42%]
                27 dTLB-loads                                                       [11.13%]
         1,543,156 dTLB-load-misses          #  5715392.59% of all dTLB cache hits  [14.84%]
                 0 dTLB-stores                                                      [14.83%]
           209,807 dTLB-store-misses                                                [14.83%]
     <not counted> dTLB-prefetches
     <not counted> dTLB-prefetch-misses
            22,465 iTLB-loads                                                       [14.83%]
            11,811 iTLB-load-misses          #   52.58% of all iTLB cache hits      [14.82%]
       222,518,023 branch-loads                                                     [14.82%]
           407,963 branch-load-misses                                               [14.82%]

       5.492283549 seconds time elapsed

2a492f15396a6768bcbca016993f4b4c8b0b5307  1G

 Performance counter stats for 'sha1sum 1G':

    11,373,486,440 cpu-cycles                #    0.000 GHz                         [14.81%]
     <not counted> stalled-cycles-frontend
     <not counted> stalled-cycles-backend
    23,729,513,137 instructions              #    2.09  insns per cycle             [18.53%]
        33,044,435 cache-references                                                 [18.54%]
        15,737,392 cache-misses              #   47.625 % of all cache refs         [18.56%]
       490,380,296 branch-instructions                                              [18.58%]
           430,390 branch-misses             #    0.09% of all branches             [18.59%]
       543,035,710 bus-cycles                                                       [14.87%]
     4,658,001,297 L1-dcache-loads                                                  [14.87%]
        75,287,410 L1-dcache-load-misses     #    1.62% of all L1-dcache hits       [14.86%]
     2,185,819,952 L1-dcache-stores                                                 [14.86%]
        37,483,100 L1-dcache-store-misses                                           [14.86%]
     <not counted> L1-dcache-prefetches
        36,350,891 L1-dcache-prefetch-misses                                        [14.86%]
     <not counted> L1-icache-loads
         4,902,310 L1-icache-load-misses     #    0.00% of all L1-icache hits       [14.86%]
     <not counted> L1-icache-prefetches
     <not counted> L1-icache-prefetch-misses
                 0 LLC-loads                                                        [14.85%]
                 0 LLC-load-misses           #    0.00% of all LL-cache hits        [14.85%]
                 0 LLC-stores                                                       [ 7.41%]
                 0 LLC-store-misses                                                 [ 7.39%]
                 0 LLC-prefetches                                                   [ 7.42%]
                 0 LLC-prefetch-misses                                              [ 7.42%]
                27 dTLB-loads                                                       [11.13%]
         1,869,064 dTLB-load-misses          #  6922459.26% of all dTLB cache hits  [14.83%]
                 0 dTLB-stores                                                      [14.83%]
           226,914 dTLB-store-misses                                                [14.83%]
     <not counted> dTLB-prefetches
     <not counted> dTLB-prefetch-misses
            27,737 iTLB-loads                                                       [14.82%]
            11,867 iTLB-load-misses          #   42.78% of all iTLB cache hits      [14.82%]
       488,638,178 branch-loads                                                     [14.82%]
           417,566 branch-load-misses                                               [14.82%]

       5.495609117 seconds time elapsed

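#!/bin/sh

# Collect a "-e EVENT" option for each hardware event that
# "perf stat" can actually open on this system.
hw_events=$(
 for i in $(perf list | sed -n 's/\[Hardware.*event\]//; T; s/ OR .*//; p'); do
    perf stat -e "$i" true >/dev/null 2>&1 &&
     printf -- "-e %s \n" "$i"
 done | tr -d '\n'
)

# $hw_events is left unquoted so it splits into separate arguments.
perf stat $hw_events "$@"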
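
To illustrate what the sed filter in the script does, here is a small sketch
run against a hand-written sample in the style of "perf list" output (the
sample is an assumption, not captured from a real run, and the function names
are made up). The T command is a GNU sed extension, so this assumes GNU sed.

```shell
#!/bin/sh
# Hand-written sample in the style of "perf list" output (illustrative only).
sample_perf_list() {
cat <<'EOF'
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cpu-clock                                          [Software event]
  L1-dcache-loads                                    [Hardware cache event]
EOF
}

# Same sed program as in the script:
#   1. strip the "[Hardware ... event]" tag;
#   2. T: if nothing was stripped, move on to the next line without printing;
#   3. drop the " OR alias" part and print what is left.
hw_event_names() {
  sed -n 's/\[Hardware.*event\]//; T; s/ OR .*//; p'
}

# The unquoted $(...) splits on whitespace, which also trims the padding.
for name in $(sample_perf_list | hw_event_names); do
  printf '%s\n' "$name"
done
```

With the sample above this prints cpu-cycles, instructions and
L1-dcache-loads, one per line; the software event is filtered out.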
