Hi,

Here is my latest results and patch. Please find the patches to
sha1.c, sha256.c and sh512.c attached and the "time" of the resulting
binaries in sha_benchs.log. For all binaries, in 64 and 32 bits modes
(.m32), I run 3 times the command "\time sha*sum zero1G" where zero1G
is a 10^9 bytes file created by the command:
dd if=/dev/zero of=zero1G count=1 bs=1 seek=$(( 1000 * 1000 * 1000 - 1 ))

The compilation of coreutils was done using the command
make CFLAGS="-O3"
for 64 bit version and
make CFLAGS="-m32 -O3"
for 32 bit version.

gcc is version 4.4.5 (Ubuntu 10.10)

My CPU is a Sandy Bridge @2.5GHz.

For sha1, the result is very close to Linus' version for git.

I think it could be a good idea to include thoses patches to improve
the C versions, it is probably close to the best it can be done in
"pure" C.

To improve further, assembly with or without SSE could be done in a second pass.

What to you think of that ?

I don't have a GCC farm access yet, so I can only test on my system for now.

Best regards.
Loïc

2011/9/6 Pádraig Brady <p...@draigbrady.com>:
> On 09/06/2011 02:25 PM, Loďc Le Loarer wrote:
>> Hi Pádraig,
>>
>> Thank you for your answer.
>>
>> 2011/9/6 Pádraig Brady <p...@draigbrady.com <mailto:p...@draigbrady.com>>
>>
>>     A few general points.
>>     You essentially used Linus' code (albeit by
>>     very helpfully isolating the significant differences).
>>     It might be easier/required to just include it in gnulib?
>>     There are a few files in gnulib that are not copyright of the FSF,
>>     so would Nicolas and Linus need to assign copyright?
>>
>>
>> Yes, this is what I did. I don't thing that including Linus' is easier as 
>> the functions have a different prototype. Also, sha1, sha256 and sha512 
>> share the same structure in gnulib, changing one without changing the other 
>> would be weird. But if you thing it is required, I have not problem with 
>> that.
>
> Ok, let's just use your patches to gnulib so.
> The techniques were fairly generic anyway.
>
>>
>> By the way, I have done a test on sha512 and I have improved the speed on 
>> the same 1Gb zero file from 4.5 to 3.9s. Please find the patch attached. So 
>> I thing that using the same technics, we could improve all sha's speed.
>>
>>     For performance testing I've found gcc generates
>>     much more deterministic results with a -march
>>     as close to native as possible or otherwise
>>     the code is very susceptible to alignment issues etc.
>>     Your compiler supports -march=native.
>>     Note also gcc 4.6 has much better support for your sandy bridge CPU,
>>     either with -march=native or -march=corei7-avx
>>
>>
>> I tried using gcc-4.6.1 (I recompiled it under my ubuntu 10.10) but I 
>> couldn't see any differences. For me, using any combination of -march=native 
>> or not and gcc 4.4.5 or 4.6.1 doesn't make a difference, all the times are 
>> in the measurement margin.
>
> OK that at least confirms the improvement is fairly deterministic.
>
>>
>>     As for the SSE version, I would also like to see that included,
>>     given the proportion of hardware supporting that these days.
>>     I previously noticed a coreutils SSE2 patch here:
>>     http://www.arctic.org/~dean/crypto/sha1.html 
>> <http://www.arctic.org/%7Edean/crypto/sha1.html>
>>     Though we'd probably need some runtime SSE detection to include that.
>>
>>
>> Ok, I could try to work on this. The real problem is to test that 
>> compilation and SSE detection is done correctly on several platform. I only 
>> have access to a few x86 machines, what is the usual way to test more 
>> platforms ?
>
> It would probably be best to get an account on the GCC compile farm.
> http://gcc.gnu.org/wiki/CompileFarm
>
> cheers,
> Pádraig.
>



-- 
Loïc
--- lib/sha1.c.orig	2011-09-12 15:22:57.104202004 +0200
+++ lib/sha1.c	2011-09-12 16:32:41.714201999 +0200
@@ -29,6 +29,7 @@
 #include <stddef.h>
 #include <stdlib.h>
 #include <string.h>
+#include <arpa/inet.h>
 
 #if USE_UNLOCKED_IO
 # include "unlocked-io.h"
@@ -316,112 +317,138 @@
 
 #define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n))))
 
-#define M(I) ( tm =   x[I&0x0f] ^ x[(I-14)&0x0f] \
-                    ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \
-               , (x[I&0x0f] = rol(tm, 1)) )
+#define X(I) ntohl(words[I])
 
-#define R(A,B,C,D,E,F,K,M)  do { E += rol( A, 5 )     \
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ *
+ * On ARM we get the best code generation by forcing a full memory barrier
+ * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
+ * the stack frame size simply explode and performance goes down the drain.
+ */
+
+#if defined(__i386__) || defined(__x86_64__)
+  #define setX(I, val) (*(volatile unsigned int *)&x[(I)&0x0f] = (val))
+#elif defined(__GNUC__) && defined(__arm__)
+  #define setX(I, val) do { x[(I)&0x0f] = (val); __asm__("":::"memory"); } while (0)
+#else
+  #define setX(I, val) (x[(I)&0x0f] = (val))
+#endif
+
+#define M(I) rol(  x[I&0x0f] ^ x[(I-14)&0x0f] \
+                 ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f], 1) \
+
+#define R(A,B,C,D,E,F,K,M,I) do { unsigned int TEMP = M(I); \
+                                 setX(I,TEMP);\
+				 E += rol( A, 5 )     \
                                       + F( B, C, D )  \
                                       + K             \
-                                      + M;            \
+                                      + TEMP;         \
                                  B = rol( B, 30 );    \
                                } while(0)
 
   while (words < endp)
     {
-      uint32_t tm;
-      int t;
-      for (t = 0; t < 16; t++)
-        {
-          x[t] = SWAP (*words);
-          words++;
-        }
-
-      R( a, b, c, d, e, F1, K1, x[ 0] );
-      R( e, a, b, c, d, F1, K1, x[ 1] );
-      R( d, e, a, b, c, F1, K1, x[ 2] );
-      R( c, d, e, a, b, F1, K1, x[ 3] );
-      R( b, c, d, e, a, F1, K1, x[ 4] );
-      R( a, b, c, d, e, F1, K1, x[ 5] );
-      R( e, a, b, c, d, F1, K1, x[ 6] );
-      R( d, e, a, b, c, F1, K1, x[ 7] );
-      R( c, d, e, a, b, F1, K1, x[ 8] );
-      R( b, c, d, e, a, F1, K1, x[ 9] );
-      R( a, b, c, d, e, F1, K1, x[10] );
-      R( e, a, b, c, d, F1, K1, x[11] );
-      R( d, e, a, b, c, F1, K1, x[12] );
-      R( c, d, e, a, b, F1, K1, x[13] );
-      R( b, c, d, e, a, F1, K1, x[14] );
-      R( a, b, c, d, e, F1, K1, x[15] );
-      R( e, a, b, c, d, F1, K1, M(16) );
-      R( d, e, a, b, c, F1, K1, M(17) );
-      R( c, d, e, a, b, F1, K1, M(18) );
-      R( b, c, d, e, a, F1, K1, M(19) );
-      R( a, b, c, d, e, F2, K2, M(20) );
-      R( e, a, b, c, d, F2, K2, M(21) );
-      R( d, e, a, b, c, F2, K2, M(22) );
-      R( c, d, e, a, b, F2, K2, M(23) );
-      R( b, c, d, e, a, F2, K2, M(24) );
-      R( a, b, c, d, e, F2, K2, M(25) );
-      R( e, a, b, c, d, F2, K2, M(26) );
-      R( d, e, a, b, c, F2, K2, M(27) );
-      R( c, d, e, a, b, F2, K2, M(28) );
-      R( b, c, d, e, a, F2, K2, M(29) );
-      R( a, b, c, d, e, F2, K2, M(30) );
-      R( e, a, b, c, d, F2, K2, M(31) );
-      R( d, e, a, b, c, F2, K2, M(32) );
-      R( c, d, e, a, b, F2, K2, M(33) );
-      R( b, c, d, e, a, F2, K2, M(34) );
-      R( a, b, c, d, e, F2, K2, M(35) );
-      R( e, a, b, c, d, F2, K2, M(36) );
-      R( d, e, a, b, c, F2, K2, M(37) );
-      R( c, d, e, a, b, F2, K2, M(38) );
-      R( b, c, d, e, a, F2, K2, M(39) );
-      R( a, b, c, d, e, F3, K3, M(40) );
-      R( e, a, b, c, d, F3, K3, M(41) );
-      R( d, e, a, b, c, F3, K3, M(42) );
-      R( c, d, e, a, b, F3, K3, M(43) );
-      R( b, c, d, e, a, F3, K3, M(44) );
-      R( a, b, c, d, e, F3, K3, M(45) );
-      R( e, a, b, c, d, F3, K3, M(46) );
-      R( d, e, a, b, c, F3, K3, M(47) );
-      R( c, d, e, a, b, F3, K3, M(48) );
-      R( b, c, d, e, a, F3, K3, M(49) );
-      R( a, b, c, d, e, F3, K3, M(50) );
-      R( e, a, b, c, d, F3, K3, M(51) );
-      R( d, e, a, b, c, F3, K3, M(52) );
-      R( c, d, e, a, b, F3, K3, M(53) );
-      R( b, c, d, e, a, F3, K3, M(54) );
-      R( a, b, c, d, e, F3, K3, M(55) );
-      R( e, a, b, c, d, F3, K3, M(56) );
-      R( d, e, a, b, c, F3, K3, M(57) );
-      R( c, d, e, a, b, F3, K3, M(58) );
-      R( b, c, d, e, a, F3, K3, M(59) );
-      R( a, b, c, d, e, F4, K4, M(60) );
-      R( e, a, b, c, d, F4, K4, M(61) );
-      R( d, e, a, b, c, F4, K4, M(62) );
-      R( c, d, e, a, b, F4, K4, M(63) );
-      R( b, c, d, e, a, F4, K4, M(64) );
-      R( a, b, c, d, e, F4, K4, M(65) );
-      R( e, a, b, c, d, F4, K4, M(66) );
-      R( d, e, a, b, c, F4, K4, M(67) );
-      R( c, d, e, a, b, F4, K4, M(68) );
-      R( b, c, d, e, a, F4, K4, M(69) );
-      R( a, b, c, d, e, F4, K4, M(70) );
-      R( e, a, b, c, d, F4, K4, M(71) );
-      R( d, e, a, b, c, F4, K4, M(72) );
-      R( c, d, e, a, b, F4, K4, M(73) );
-      R( b, c, d, e, a, F4, K4, M(74) );
-      R( a, b, c, d, e, F4, K4, M(75) );
-      R( e, a, b, c, d, F4, K4, M(76) );
-      R( d, e, a, b, c, F4, K4, M(77) );
-      R( c, d, e, a, b, F4, K4, M(78) );
-      R( b, c, d, e, a, F4, K4, M(79) );
+      R( a, b, c, d, e, F1, K1, X,  0);
+      R( e, a, b, c, d, F1, K1, X,  1);
+      R( d, e, a, b, c, F1, K1, X,  2);
+      R( c, d, e, a, b, F1, K1, X,  3);
+      R( b, c, d, e, a, F1, K1, X,  4);
+      R( a, b, c, d, e, F1, K1, X,  5);
+      R( e, a, b, c, d, F1, K1, X,  6);
+      R( d, e, a, b, c, F1, K1, X,  7);
+      R( c, d, e, a, b, F1, K1, X,  8);
+      R( b, c, d, e, a, F1, K1, X,  9);
+      R( a, b, c, d, e, F1, K1, X, 10);
+      R( e, a, b, c, d, F1, K1, X, 11);
+      R( d, e, a, b, c, F1, K1, X, 12);
+      R( c, d, e, a, b, F1, K1, X, 13);
+      R( b, c, d, e, a, F1, K1, X, 14);
+      R( a, b, c, d, e, F1, K1, X, 15);
+      R( e, a, b, c, d, F1, K1, M, 16);
+      R( d, e, a, b, c, F1, K1, M, 17);
+      R( c, d, e, a, b, F1, K1, M, 18);
+      R( b, c, d, e, a, F1, K1, M, 19);
+      R( a, b, c, d, e, F2, K2, M, 20);
+      R( e, a, b, c, d, F2, K2, M, 21);
+      R( d, e, a, b, c, F2, K2, M, 22);
+      R( c, d, e, a, b, F2, K2, M, 23);
+      R( b, c, d, e, a, F2, K2, M, 24);
+      R( a, b, c, d, e, F2, K2, M, 25);
+      R( e, a, b, c, d, F2, K2, M, 26);
+      R( d, e, a, b, c, F2, K2, M, 27);
+      R( c, d, e, a, b, F2, K2, M, 28);
+      R( b, c, d, e, a, F2, K2, M, 29);
+      R( a, b, c, d, e, F2, K2, M, 30);
+      R( e, a, b, c, d, F2, K2, M, 31);
+      R( d, e, a, b, c, F2, K2, M, 32);
+      R( c, d, e, a, b, F2, K2, M, 33);
+      R( b, c, d, e, a, F2, K2, M, 34);
+      R( a, b, c, d, e, F2, K2, M, 35);
+      R( e, a, b, c, d, F2, K2, M, 36);
+      R( d, e, a, b, c, F2, K2, M, 37);
+      R( c, d, e, a, b, F2, K2, M, 38);
+      R( b, c, d, e, a, F2, K2, M, 39);
+      R( a, b, c, d, e, F3, K3, M, 40);
+      R( e, a, b, c, d, F3, K3, M, 41);
+      R( d, e, a, b, c, F3, K3, M, 42);
+      R( c, d, e, a, b, F3, K3, M, 43);
+      R( b, c, d, e, a, F3, K3, M, 44);
+      R( a, b, c, d, e, F3, K3, M, 45);
+      R( e, a, b, c, d, F3, K3, M, 46);
+      R( d, e, a, b, c, F3, K3, M, 47);
+      R( c, d, e, a, b, F3, K3, M, 48);
+      R( b, c, d, e, a, F3, K3, M, 49);
+      R( a, b, c, d, e, F3, K3, M, 50);
+      R( e, a, b, c, d, F3, K3, M, 51);
+      R( d, e, a, b, c, F3, K3, M, 52);
+      R( c, d, e, a, b, F3, K3, M, 53);
+      R( b, c, d, e, a, F3, K3, M, 54);
+      R( a, b, c, d, e, F3, K3, M, 55);
+      R( e, a, b, c, d, F3, K3, M, 56);
+      R( d, e, a, b, c, F3, K3, M, 57);
+      R( c, d, e, a, b, F3, K3, M, 58);
+      R( b, c, d, e, a, F3, K3, M, 59);
+      R( a, b, c, d, e, F4, K4, M, 60);
+      R( e, a, b, c, d, F4, K4, M, 61);
+      R( d, e, a, b, c, F4, K4, M, 62);
+      R( c, d, e, a, b, F4, K4, M, 63);
+      R( b, c, d, e, a, F4, K4, M, 64);
+      R( a, b, c, d, e, F4, K4, M, 65);
+      R( e, a, b, c, d, F4, K4, M, 66);
+      R( d, e, a, b, c, F4, K4, M, 67);
+      R( c, d, e, a, b, F4, K4, M, 68);
+      R( b, c, d, e, a, F4, K4, M, 69);
+      R( a, b, c, d, e, F4, K4, M, 70);
+      R( e, a, b, c, d, F4, K4, M, 71);
+      R( d, e, a, b, c, F4, K4, M, 72);
+      R( c, d, e, a, b, F4, K4, M, 73);
+      R( b, c, d, e, a, F4, K4, M, 74);
+      R( a, b, c, d, e, F4, K4, M, 75);
+      R( e, a, b, c, d, F4, K4, M, 76);
+      R( d, e, a, b, c, F4, K4, M, 77);
+      R( c, d, e, a, b, F4, K4, M, 78);
+      R( b, c, d, e, a, F4, K4, M, 79);
 
       a = ctx->A += a;
       b = ctx->B += b;
       c = ctx->C += c;
       d = ctx->D += d;
       e = ctx->E += e;
+      words += 16;
     }
 }
--- lib/sha256.c.orig	2011-09-09 23:08:04.521357998 +0200
+++ lib/sha256.c	2011-09-12 16:32:42.184202007 +0200
@@ -27,6 +27,7 @@
 #include <stddef.h>
 #include <stdlib.h>
 #include <string.h>
+#include <arpa/inet.h>
 
 #if USE_UNLOCKED_IO
 # include "unlocked-io.h"
@@ -468,94 +469,88 @@
 #define SS0(x) (rol(x,30)^rol(x,19)^rol(x,10))
 #define SS1(x) (rol(x,26)^rol(x,21)^rol(x,7))
 
-#define M(I) ( tm =   S1(x[(I-2)&0x0f]) + x[(I-7)&0x0f] \
-                    + S0(x[(I-15)&0x0f]) + x[I&0x0f]    \
-               , x[I&0x0f] = tm )
-
-#define R(A,B,C,D,E,F,G,H,K,M)  do { t0 = SS0(A) + F2(A,B,C); \
-                                     t1 = H + SS1(E)  \
-                                      + F1(E,F,G)     \
-                                      + K             \
-                                      + M;            \
-                                     D += t1;  H = t0 + t1; \
-                               } while(0)
-
+#define X(I) (ntohl(words[I]))
+#define M(I) ( S1(x[(I-2)&0x0f]) + x[(I-7)&0x0f] \
+               + S0(x[(I-15)&0x0f]) + x[I&0x0f]    \
+               )
+
+#define R(A,B,C,D,E,F,G,H,M,I)  do { \
+                uint32_t t1;     \
+                uint32_t tm = M(I);  \
+                x[I&0x0f] = tm;      \
+                t1 = H + SS1(E)    \
+                   + F1(E,F,G)     \
+                   + K(I)          \
+                   + tm;           \
+                D += t1;           \
+                H = SS0(A) + F2(A,B,C) + t1; \
+              } while(0)
   while (words < endp)
     {
-      uint32_t tm;
-      uint32_t t0, t1;
-      int t;
-      /* FIXME: see sha1.c for a better implementation.  */
-      for (t = 0; t < 16; t++)
-        {
-          x[t] = SWAP (*words);
-          words++;
-        }
-
-      R( a, b, c, d, e, f, g, h, K( 0), x[ 0] );
-      R( h, a, b, c, d, e, f, g, K( 1), x[ 1] );
-      R( g, h, a, b, c, d, e, f, K( 2), x[ 2] );
-      R( f, g, h, a, b, c, d, e, K( 3), x[ 3] );
-      R( e, f, g, h, a, b, c, d, K( 4), x[ 4] );
-      R( d, e, f, g, h, a, b, c, K( 5), x[ 5] );
-      R( c, d, e, f, g, h, a, b, K( 6), x[ 6] );
-      R( b, c, d, e, f, g, h, a, K( 7), x[ 7] );
-      R( a, b, c, d, e, f, g, h, K( 8), x[ 8] );
-      R( h, a, b, c, d, e, f, g, K( 9), x[ 9] );
-      R( g, h, a, b, c, d, e, f, K(10), x[10] );
-      R( f, g, h, a, b, c, d, e, K(11), x[11] );
-      R( e, f, g, h, a, b, c, d, K(12), x[12] );
-      R( d, e, f, g, h, a, b, c, K(13), x[13] );
-      R( c, d, e, f, g, h, a, b, K(14), x[14] );
-      R( b, c, d, e, f, g, h, a, K(15), x[15] );
-      R( a, b, c, d, e, f, g, h, K(16), M(16) );
-      R( h, a, b, c, d, e, f, g, K(17), M(17) );
-      R( g, h, a, b, c, d, e, f, K(18), M(18) );
-      R( f, g, h, a, b, c, d, e, K(19), M(19) );
-      R( e, f, g, h, a, b, c, d, K(20), M(20) );
-      R( d, e, f, g, h, a, b, c, K(21), M(21) );
-      R( c, d, e, f, g, h, a, b, K(22), M(22) );
-      R( b, c, d, e, f, g, h, a, K(23), M(23) );
-      R( a, b, c, d, e, f, g, h, K(24), M(24) );
-      R( h, a, b, c, d, e, f, g, K(25), M(25) );
-      R( g, h, a, b, c, d, e, f, K(26), M(26) );
-      R( f, g, h, a, b, c, d, e, K(27), M(27) );
-      R( e, f, g, h, a, b, c, d, K(28), M(28) );
-      R( d, e, f, g, h, a, b, c, K(29), M(29) );
-      R( c, d, e, f, g, h, a, b, K(30), M(30) );
-      R( b, c, d, e, f, g, h, a, K(31), M(31) );
-      R( a, b, c, d, e, f, g, h, K(32), M(32) );
-      R( h, a, b, c, d, e, f, g, K(33), M(33) );
-      R( g, h, a, b, c, d, e, f, K(34), M(34) );
-      R( f, g, h, a, b, c, d, e, K(35), M(35) );
-      R( e, f, g, h, a, b, c, d, K(36), M(36) );
-      R( d, e, f, g, h, a, b, c, K(37), M(37) );
-      R( c, d, e, f, g, h, a, b, K(38), M(38) );
-      R( b, c, d, e, f, g, h, a, K(39), M(39) );
-      R( a, b, c, d, e, f, g, h, K(40), M(40) );
-      R( h, a, b, c, d, e, f, g, K(41), M(41) );
-      R( g, h, a, b, c, d, e, f, K(42), M(42) );
-      R( f, g, h, a, b, c, d, e, K(43), M(43) );
-      R( e, f, g, h, a, b, c, d, K(44), M(44) );
-      R( d, e, f, g, h, a, b, c, K(45), M(45) );
-      R( c, d, e, f, g, h, a, b, K(46), M(46) );
-      R( b, c, d, e, f, g, h, a, K(47), M(47) );
-      R( a, b, c, d, e, f, g, h, K(48), M(48) );
-      R( h, a, b, c, d, e, f, g, K(49), M(49) );
-      R( g, h, a, b, c, d, e, f, K(50), M(50) );
-      R( f, g, h, a, b, c, d, e, K(51), M(51) );
-      R( e, f, g, h, a, b, c, d, K(52), M(52) );
-      R( d, e, f, g, h, a, b, c, K(53), M(53) );
-      R( c, d, e, f, g, h, a, b, K(54), M(54) );
-      R( b, c, d, e, f, g, h, a, K(55), M(55) );
-      R( a, b, c, d, e, f, g, h, K(56), M(56) );
-      R( h, a, b, c, d, e, f, g, K(57), M(57) );
-      R( g, h, a, b, c, d, e, f, K(58), M(58) );
-      R( f, g, h, a, b, c, d, e, K(59), M(59) );
-      R( e, f, g, h, a, b, c, d, K(60), M(60) );
-      R( d, e, f, g, h, a, b, c, K(61), M(61) );
-      R( c, d, e, f, g, h, a, b, K(62), M(62) );
-      R( b, c, d, e, f, g, h, a, K(63), M(63) );
+      R( a, b, c, d, e, f, g, h, X,  0 );
+      R( h, a, b, c, d, e, f, g, X,  1 );
+      R( g, h, a, b, c, d, e, f, X,  2 );
+      R( f, g, h, a, b, c, d, e, X,  3 );
+      R( e, f, g, h, a, b, c, d, X,  4 );
+      R( d, e, f, g, h, a, b, c, X,  5 );
+      R( c, d, e, f, g, h, a, b, X,  6 );
+      R( b, c, d, e, f, g, h, a, X,  7 );
+      R( a, b, c, d, e, f, g, h, X,  8 );
+      R( h, a, b, c, d, e, f, g, X,  9 );
+      R( g, h, a, b, c, d, e, f, X, 10 );
+      R( f, g, h, a, b, c, d, e, X, 11 );
+      R( e, f, g, h, a, b, c, d, X, 12 );
+      R( d, e, f, g, h, a, b, c, X, 13 );
+      R( c, d, e, f, g, h, a, b, X, 14 );
+      R( b, c, d, e, f, g, h, a, X, 15 );
+      R( a, b, c, d, e, f, g, h, M, 16 );
+      R( h, a, b, c, d, e, f, g, M, 17 );
+      R( g, h, a, b, c, d, e, f, M, 18 );
+      R( f, g, h, a, b, c, d, e, M, 19 );
+      R( e, f, g, h, a, b, c, d, M, 20 );
+      R( d, e, f, g, h, a, b, c, M, 21 );
+      R( c, d, e, f, g, h, a, b, M, 22 );
+      R( b, c, d, e, f, g, h, a, M, 23 );
+      R( a, b, c, d, e, f, g, h, M, 24 );
+      R( h, a, b, c, d, e, f, g, M, 25 );
+      R( g, h, a, b, c, d, e, f, M, 26 );
+      R( f, g, h, a, b, c, d, e, M, 27 );
+      R( e, f, g, h, a, b, c, d, M, 28 );
+      R( d, e, f, g, h, a, b, c, M, 29 );
+      R( c, d, e, f, g, h, a, b, M, 30 );
+      R( b, c, d, e, f, g, h, a, M, 31 );
+      R( a, b, c, d, e, f, g, h, M, 32 );
+      R( h, a, b, c, d, e, f, g, M, 33 );
+      R( g, h, a, b, c, d, e, f, M, 34 );
+      R( f, g, h, a, b, c, d, e, M, 35 );
+      R( e, f, g, h, a, b, c, d, M, 36 );
+      R( d, e, f, g, h, a, b, c, M, 37 );
+      R( c, d, e, f, g, h, a, b, M, 38 );
+      R( b, c, d, e, f, g, h, a, M, 39 );
+      R( a, b, c, d, e, f, g, h, M, 40 );
+      R( h, a, b, c, d, e, f, g, M, 41 );
+      R( g, h, a, b, c, d, e, f, M, 42 );
+      R( f, g, h, a, b, c, d, e, M, 43 );
+      R( e, f, g, h, a, b, c, d, M, 44 );
+      R( d, e, f, g, h, a, b, c, M, 45 );
+      R( c, d, e, f, g, h, a, b, M, 46 );
+      R( b, c, d, e, f, g, h, a, M, 47 );
+      R( a, b, c, d, e, f, g, h, M, 48 );
+      R( h, a, b, c, d, e, f, g, M, 49 );
+      R( g, h, a, b, c, d, e, f, M, 50 );
+      R( f, g, h, a, b, c, d, e, M, 51 );
+      R( e, f, g, h, a, b, c, d, M, 52 );
+      R( d, e, f, g, h, a, b, c, M, 53 );
+      R( c, d, e, f, g, h, a, b, M, 54 );
+      R( b, c, d, e, f, g, h, a, M, 55 );
+      R( a, b, c, d, e, f, g, h, M, 56 );
+      R( h, a, b, c, d, e, f, g, M, 57 );
+      R( g, h, a, b, c, d, e, f, M, 58 );
+      R( f, g, h, a, b, c, d, e, M, 59 );
+      R( e, f, g, h, a, b, c, d, M, 60 );
+      R( d, e, f, g, h, a, b, c, M, 61 );
+      R( c, d, e, f, g, h, a, b, M, 62 );
+      R( b, c, d, e, f, g, h, a, M, 63 );
 
       a = ctx->state[0] += a;
       b = ctx->state[1] += b;
@@ -565,5 +560,6 @@
       f = ctx->state[5] += f;
       g = ctx->state[6] += g;
       h = ctx->state[7] += h;
+      words += 16;
     }
 }
--- lib/sha512.c.orig	2011-09-06 15:24:17.320209997 +0200
+++ lib/sha512.c	2011-09-12 16:32:42.604202003 +0200
@@ -27,6 +27,7 @@
 #include <stddef.h>
 #include <stdlib.h>
 #include <string.h>
+#include <endian.h>
 
 #if USE_UNLOCKED_IO
 # include "unlocked-io.h"
@@ -383,6 +384,7 @@
 #if !_STRING_ARCH_unaligned
 # define alignof(type) offsetof (struct { char c; type x; }, x)
 # define UNALIGNED_P(p) (((size_t) p) % alignof (u64) != 0)
+#error "unaligned"
       if (UNALIGNED_P (buffer))
         while (len > 128)
           {
@@ -498,19 +500,29 @@
 #define SS0(x) u64xor (u64rol (x, 36), u64xor (u64rol (x, 30), u64rol (x, 25)))
 #define SS1(x) u64xor (u64rol(x, 50), u64xor (u64rol (x, 46), u64rol (x, 23)))
 
-#define M(I) (x[(I) & 15]                                                 \
-              = u64plus (x[(I) & 15],                                     \
-                         u64plus (S1 (x[((I) - 2) & 15]),                 \
-                                  u64plus (x[((I) - 7) & 15],             \
-                                           S0 (x[((I) - 15) & 15])))))
+#if __WORDSIZE == 64
+#define setX(I, val) (*(volatile u64 *)&x[(I)&15] = (val))
+// be64toh in define in endian.h
+#define X(I) (be64toh(words[I]))
+#else
+#define setX(I, val) (x[(I)&15] = (val))
+#define X(I) (SWAP(words[I]))
+#endif
 
-#define R(A, B, C, D, E, F, G, H, K, M)                                   \
+#define M(I) (u64plus (x[(I) & 15],                                       \
+                       u64plus (S1 (x[((I) - 2) & 15]),                   \
+                                u64plus (x[((I) - 7) & 15],               \
+                                         S0 (x[((I) - 15) & 15])))))
+#define R(A, B, C, D, E, F, G, H, M, I)                                   \
   do                                                                      \
     {                                                                     \
+      u64 temp = M(I);                                                    \
       u64 t0 = u64plus (SS0 (A), F2 (A, B, C));                           \
       u64 t1 =                                                            \
         u64plus (H, u64plus (SS1 (E),                                     \
-                             u64plus (F1 (E, F, G), u64plus (K, M))));    \
+                             u64plus (F1 (E, F, G),                       \
+                                      u64plus (K(I), temp))));            \
+      setX(I, temp);                                                      \
       D = u64plus (D, t1);                                                \
       H = u64plus (t0, t1);                                               \
     }                                                                     \
@@ -518,94 +530,86 @@
 
   while (words < endp)
     {
-      int t;
-      /* FIXME: see sha1.c for a better implementation.  */
-      for (t = 0; t < 16; t++)
-        {
-          x[t] = SWAP (*words);
-          words++;
-        }
-
-      R( a, b, c, d, e, f, g, h, K( 0), x[ 0] );
-      R( h, a, b, c, d, e, f, g, K( 1), x[ 1] );
-      R( g, h, a, b, c, d, e, f, K( 2), x[ 2] );
-      R( f, g, h, a, b, c, d, e, K( 3), x[ 3] );
-      R( e, f, g, h, a, b, c, d, K( 4), x[ 4] );
-      R( d, e, f, g, h, a, b, c, K( 5), x[ 5] );
-      R( c, d, e, f, g, h, a, b, K( 6), x[ 6] );
-      R( b, c, d, e, f, g, h, a, K( 7), x[ 7] );
-      R( a, b, c, d, e, f, g, h, K( 8), x[ 8] );
-      R( h, a, b, c, d, e, f, g, K( 9), x[ 9] );
-      R( g, h, a, b, c, d, e, f, K(10), x[10] );
-      R( f, g, h, a, b, c, d, e, K(11), x[11] );
-      R( e, f, g, h, a, b, c, d, K(12), x[12] );
-      R( d, e, f, g, h, a, b, c, K(13), x[13] );
-      R( c, d, e, f, g, h, a, b, K(14), x[14] );
-      R( b, c, d, e, f, g, h, a, K(15), x[15] );
-      R( a, b, c, d, e, f, g, h, K(16), M(16) );
-      R( h, a, b, c, d, e, f, g, K(17), M(17) );
-      R( g, h, a, b, c, d, e, f, K(18), M(18) );
-      R( f, g, h, a, b, c, d, e, K(19), M(19) );
-      R( e, f, g, h, a, b, c, d, K(20), M(20) );
-      R( d, e, f, g, h, a, b, c, K(21), M(21) );
-      R( c, d, e, f, g, h, a, b, K(22), M(22) );
-      R( b, c, d, e, f, g, h, a, K(23), M(23) );
-      R( a, b, c, d, e, f, g, h, K(24), M(24) );
-      R( h, a, b, c, d, e, f, g, K(25), M(25) );
-      R( g, h, a, b, c, d, e, f, K(26), M(26) );
-      R( f, g, h, a, b, c, d, e, K(27), M(27) );
-      R( e, f, g, h, a, b, c, d, K(28), M(28) );
-      R( d, e, f, g, h, a, b, c, K(29), M(29) );
-      R( c, d, e, f, g, h, a, b, K(30), M(30) );
-      R( b, c, d, e, f, g, h, a, K(31), M(31) );
-      R( a, b, c, d, e, f, g, h, K(32), M(32) );
-      R( h, a, b, c, d, e, f, g, K(33), M(33) );
-      R( g, h, a, b, c, d, e, f, K(34), M(34) );
-      R( f, g, h, a, b, c, d, e, K(35), M(35) );
-      R( e, f, g, h, a, b, c, d, K(36), M(36) );
-      R( d, e, f, g, h, a, b, c, K(37), M(37) );
-      R( c, d, e, f, g, h, a, b, K(38), M(38) );
-      R( b, c, d, e, f, g, h, a, K(39), M(39) );
-      R( a, b, c, d, e, f, g, h, K(40), M(40) );
-      R( h, a, b, c, d, e, f, g, K(41), M(41) );
-      R( g, h, a, b, c, d, e, f, K(42), M(42) );
-      R( f, g, h, a, b, c, d, e, K(43), M(43) );
-      R( e, f, g, h, a, b, c, d, K(44), M(44) );
-      R( d, e, f, g, h, a, b, c, K(45), M(45) );
-      R( c, d, e, f, g, h, a, b, K(46), M(46) );
-      R( b, c, d, e, f, g, h, a, K(47), M(47) );
-      R( a, b, c, d, e, f, g, h, K(48), M(48) );
-      R( h, a, b, c, d, e, f, g, K(49), M(49) );
-      R( g, h, a, b, c, d, e, f, K(50), M(50) );
-      R( f, g, h, a, b, c, d, e, K(51), M(51) );
-      R( e, f, g, h, a, b, c, d, K(52), M(52) );
-      R( d, e, f, g, h, a, b, c, K(53), M(53) );
-      R( c, d, e, f, g, h, a, b, K(54), M(54) );
-      R( b, c, d, e, f, g, h, a, K(55), M(55) );
-      R( a, b, c, d, e, f, g, h, K(56), M(56) );
-      R( h, a, b, c, d, e, f, g, K(57), M(57) );
-      R( g, h, a, b, c, d, e, f, K(58), M(58) );
-      R( f, g, h, a, b, c, d, e, K(59), M(59) );
-      R( e, f, g, h, a, b, c, d, K(60), M(60) );
-      R( d, e, f, g, h, a, b, c, K(61), M(61) );
-      R( c, d, e, f, g, h, a, b, K(62), M(62) );
-      R( b, c, d, e, f, g, h, a, K(63), M(63) );
-      R( a, b, c, d, e, f, g, h, K(64), M(64) );
-      R( h, a, b, c, d, e, f, g, K(65), M(65) );
-      R( g, h, a, b, c, d, e, f, K(66), M(66) );
-      R( f, g, h, a, b, c, d, e, K(67), M(67) );
-      R( e, f, g, h, a, b, c, d, K(68), M(68) );
-      R( d, e, f, g, h, a, b, c, K(69), M(69) );
-      R( c, d, e, f, g, h, a, b, K(70), M(70) );
-      R( b, c, d, e, f, g, h, a, K(71), M(71) );
-      R( a, b, c, d, e, f, g, h, K(72), M(72) );
-      R( h, a, b, c, d, e, f, g, K(73), M(73) );
-      R( g, h, a, b, c, d, e, f, K(74), M(74) );
-      R( f, g, h, a, b, c, d, e, K(75), M(75) );
-      R( e, f, g, h, a, b, c, d, K(76), M(76) );
-      R( d, e, f, g, h, a, b, c, K(77), M(77) );
-      R( c, d, e, f, g, h, a, b, K(78), M(78) );
-      R( b, c, d, e, f, g, h, a, K(79), M(79) );
+      R( a, b, c, d, e, f, g, h, X,  0 );
+      R( h, a, b, c, d, e, f, g, X,  1 );
+      R( g, h, a, b, c, d, e, f, X,  2 );
+      R( f, g, h, a, b, c, d, e, X,  3 );
+      R( e, f, g, h, a, b, c, d, X,  4 );
+      R( d, e, f, g, h, a, b, c, X,  5 );
+      R( c, d, e, f, g, h, a, b, X,  6 );
+      R( b, c, d, e, f, g, h, a, X,  7 );
+      R( a, b, c, d, e, f, g, h, X,  8 );
+      R( h, a, b, c, d, e, f, g, X,  9 );
+      R( g, h, a, b, c, d, e, f, X, 10 );
+      R( f, g, h, a, b, c, d, e, X, 11 );
+      R( e, f, g, h, a, b, c, d, X, 12 );
+      R( d, e, f, g, h, a, b, c, X, 13 );
+      R( c, d, e, f, g, h, a, b, X, 14 );
+      R( b, c, d, e, f, g, h, a, X, 15 );
+      R( a, b, c, d, e, f, g, h, M, 16 );
+      R( h, a, b, c, d, e, f, g, M, 17 );
+      R( g, h, a, b, c, d, e, f, M, 18 );
+      R( f, g, h, a, b, c, d, e, M, 19 );
+      R( e, f, g, h, a, b, c, d, M, 20 );
+      R( d, e, f, g, h, a, b, c, M, 21 );
+      R( c, d, e, f, g, h, a, b, M, 22 );
+      R( b, c, d, e, f, g, h, a, M, 23 );
+      R( a, b, c, d, e, f, g, h, M, 24 );
+      R( h, a, b, c, d, e, f, g, M, 25 );
+      R( g, h, a, b, c, d, e, f, M, 26 );
+      R( f, g, h, a, b, c, d, e, M, 27 );
+      R( e, f, g, h, a, b, c, d, M, 28 );
+      R( d, e, f, g, h, a, b, c, M, 29 );
+      R( c, d, e, f, g, h, a, b, M, 30 );
+      R( b, c, d, e, f, g, h, a, M, 31 );
+      R( a, b, c, d, e, f, g, h, M, 32 );
+      R( h, a, b, c, d, e, f, g, M, 33 );
+      R( g, h, a, b, c, d, e, f, M, 34 );
+      R( f, g, h, a, b, c, d, e, M, 35 );
+      R( e, f, g, h, a, b, c, d, M, 36 );
+      R( d, e, f, g, h, a, b, c, M, 37 );
+      R( c, d, e, f, g, h, a, b, M, 38 );
+      R( b, c, d, e, f, g, h, a, M, 39 );
+      R( a, b, c, d, e, f, g, h, M, 40 );
+      R( h, a, b, c, d, e, f, g, M, 41 );
+      R( g, h, a, b, c, d, e, f, M, 42 );
+      R( f, g, h, a, b, c, d, e, M, 43 );
+      R( e, f, g, h, a, b, c, d, M, 44 );
+      R( d, e, f, g, h, a, b, c, M, 45 );
+      R( c, d, e, f, g, h, a, b, M, 46 );
+      R( b, c, d, e, f, g, h, a, M, 47 );
+      R( a, b, c, d, e, f, g, h, M, 48 );
+      R( h, a, b, c, d, e, f, g, M, 49 );
+      R( g, h, a, b, c, d, e, f, M, 50 );
+      R( f, g, h, a, b, c, d, e, M, 51 );
+      R( e, f, g, h, a, b, c, d, M, 52 );
+      R( d, e, f, g, h, a, b, c, M, 53 );
+      R( c, d, e, f, g, h, a, b, M, 54 );
+      R( b, c, d, e, f, g, h, a, M, 55 );
+      R( a, b, c, d, e, f, g, h, M, 56 );
+      R( h, a, b, c, d, e, f, g, M, 57 );
+      R( g, h, a, b, c, d, e, f, M, 58 );
+      R( f, g, h, a, b, c, d, e, M, 59 );
+      R( e, f, g, h, a, b, c, d, M, 60 );
+      R( d, e, f, g, h, a, b, c, M, 61 );
+      R( c, d, e, f, g, h, a, b, M, 62 );
+      R( b, c, d, e, f, g, h, a, M, 63 );
+      R( a, b, c, d, e, f, g, h, M, 64 );
+      R( h, a, b, c, d, e, f, g, M, 65 );
+      R( g, h, a, b, c, d, e, f, M, 66 );
+      R( f, g, h, a, b, c, d, e, M, 67 );
+      R( e, f, g, h, a, b, c, d, M, 68 );
+      R( d, e, f, g, h, a, b, c, M, 69 );
+      R( c, d, e, f, g, h, a, b, M, 70 );
+      R( b, c, d, e, f, g, h, a, M, 71 );
+      R( a, b, c, d, e, f, g, h, M, 72 );
+      R( h, a, b, c, d, e, f, g, M, 73 );
+      R( g, h, a, b, c, d, e, f, M, 74 );
+      R( f, g, h, a, b, c, d, e, M, 75 );
+      R( e, f, g, h, a, b, c, d, M, 76 );
+      R( d, e, f, g, h, a, b, c, M, 77 );
+      R( c, d, e, f, g, h, a, b, M, 78 );
+      R( b, c, d, e, f, g, h, a, M, 79 );
 
       a = ctx->state[0] = u64plus (ctx->state[0], a);
       b = ctx->state[1] = u64plus (ctx->state[1], b);
@@ -615,5 +619,6 @@
       f = ctx->state[5] = u64plus (ctx->state[5], f);
       g = ctx->state[6] = u64plus (ctx->state[6], g);
       h = ctx->state[7] = u64plus (ctx->state[7], h);
+      words += 16;
     }
 }
./sha1sum.orig
3.54user 0.10system 0:03.65elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
3.50user 0.14system 0:03.65elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
3.50user 0.11system 0:03.61elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
./sha1sum.v1
2.59user 0.11system 0:02.71elapsed 99%CPU (0avgtext+0avgdata 2832maxresident)k
2.59user 0.11system 0:02.70elapsed 99%CPU (0avgtext+0avgdata 2816maxresident)k
2.60user 0.11system 0:02.72elapsed 99%CPU (0avgtext+0avgdata 2832maxresident)k
./sha1sum.m32.orig
5.15user 0.10system 0:05.26elapsed 99%CPU (0avgtext+0avgdata 2528maxresident)k
5.15user 0.11system 0:05.27elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
5.08user 0.16system 0:05.25elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
./sha1sum.m32.v1
2.93user 0.14system 0:03.08elapsed 99%CPU (0avgtext+0avgdata 2512maxresident)k
2.90user 0.18system 0:03.09elapsed 99%CPU (0avgtext+0avgdata 2512maxresident)k
3.00user 0.08system 0:03.09elapsed 99%CPU (0avgtext+0avgdata 2528maxresident)k
./sha256sum.orig
6.44user 0.12system 0:06.58elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.50user 0.08system 0:06.59elapsed 99%CPU (0avgtext+0avgdata 2864maxresident)k
6.48user 0.14system 0:06.63elapsed 99%CPU (0avgtext+0avgdata 2864maxresident)k
./sha256sum.v1
6.09user 0.10system 0:06.20elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.10user 0.08system 0:06.19elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
6.05user 0.13system 0:06.19elapsed 99%CPU (0avgtext+0avgdata 2848maxresident)k
./sha256sum.m32.orig
7.07user 0.13system 0:07.21elapsed 99%CPU (0avgtext+0avgdata 2576maxresident)k
7.09user 0.12system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2576maxresident)k
7.06user 0.13system 0:07.20elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
./sha256sum.m32.v1
6.47user 0.13system 0:06.61elapsed 99%CPU (0avgtext+0avgdata 2544maxresident)k
6.53user 0.11system 0:06.65elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
6.48user 0.19system 0:06.68elapsed 99%CPU (0avgtext+0avgdata 2560maxresident)k
./sha512sum.orig
4.43user 0.12system 0:04.55elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.45user 0.11system 0:04.56elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.48user 0.09system 0:04.58elapsed 99%CPU (0avgtext+0avgdata 2912maxresident)k
./sha512sum.v1
4.05user 0.11system 0:04.17elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.11user 0.06system 0:04.18elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
4.15user 0.14system 0:04.29elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
./sha512sum.m32.orig
25.23user 0.20system 0:25.46elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
25.22user 0.11system 0:25.36elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
25.13user 0.15system 0:25.32elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k
./sha512sum.m32.v1
22.13user 0.14system 0:22.30elapsed 99%CPU (0avgtext+0avgdata 2784maxresident)k
22.09user 0.21system 0:22.33elapsed 99%CPU (0avgtext+0avgdata 2784maxresident)k
22.19user 0.13system 0:22.35elapsed 99%CPU (0avgtext+0avgdata 2800maxresident)k

Reply via email to