Tom Elken wrote:
> To add some details to what Christian says, the HPC Challenge version of
> STREAM uses dynamic arrays and is hard to optimize. I don't know what's
> best with current compiler versions, but you could try some of these that
> were used in past HPCC submissions with your program, Bill:

Thanks for the heads up. I've checked the specbench.org compiler options
for hints on where to start with optimization flags, but I didn't know
about the dynamic STREAM. Is the HPC Challenge code open source?

> PathScale 2.2.1 on Opteron:
> Base OPT flags: -O3 -OPT:Ofast:fold_reassociate=0
> STREAMFLAGS=-O3 -OPT:Ofast:fold_reassociate=0
> -OPT:alias=restrict:align_unsafe=on -CG:movnti=1

Alas, my PathScale license expired, and I believe that with SiCortex's
death (RIP) I can't renew it. I tried open64-4.2.2 with those flags on a
single-socket Nehalem:

$ opencc -O4 -fopenmp stream.c -o stream-open64 -static
$ opencc -O4 -fopenmp stream-malloc.c -o stream-open64-malloc -static
$ ./stream-open64
Total memory required = 457.8 MB.
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         22061.4958    0.0145     0.0145     0.0146
Scale:        22228.4705    0.0144     0.0144     0.0145
Add:          20659.2638    0.0233     0.0232     0.0233
Triad:        20511.0888    0.0235     0.0234     0.0235

Dynamic:
$ ./stream-open64-malloc
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         14436.5155    0.0222     0.0222     0.0222
Scale:        14667.4821    0.0218     0.0218     0.0219
Add:          15739.7070    0.0305     0.0305     0.0305
Triad:        15770.7775    0.0305     0.0304     0.0305

> Intel C/C++ Compiler 10.1 on Harpertown CPUs:
> Base OPT flags: -O2 -xT -ansi-alias -ip -i-static
> Intel recently used
> Intel C/C++ Compiler 11.0.081 on Nehalem CPUs:
> -O2 -xSSE4.2 -ansi-alias -ip
> and got good STREAM results in their HPCC submission on their Endeavor
> cluster.

$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream.c -o stream-icc
$ icc -O2 -xSSE4.2 -ansi-alias -ip -openmp stream-malloc.c -o stream-icc-malloc
$ ./stream-icc | grep ":"
STREAM version $Revision: 5.9 $
Copy:         14767.0512    0.0022     0.0022     0.0022
Scale:        14304.3513    0.0022     0.0022     0.0023
Add:          15503.3568    0.0031     0.0031     0.0031
Triad:        15613.9749    0.0031     0.0031     0.0031
$ ./stream-icc-malloc | grep ":"
STREAM version $Revision: 5.9 $
Copy:         14604.7582    0.0022     0.0022     0.0022
Scale:        14480.2814    0.0022     0.0022     0.0022
Add:          15414.3321    0.0031     0.0031     0.0031
Triad:        15738.4765    0.0031     0.0030     0.0031

So ICC does manage zero penalty, but alas it is no faster than open64
with the penalty. I'll attempt to track down the HPCC STREAM source code
to see if their dynamic arrays are any friendlier than mine (I just use
malloc). In any case, many thanks for the pointer.

Oh, my dynamic tweak:

$ diff stream.c stream-malloc.c
43a44
> # include <stdlib.h>
97c98
< static double a[N+OFFSET],
---
> /* static double a[N+OFFSET],
99c100,102
< c[N+OFFSET];
---
> c[N+OFFSET]; */
>
> double *a, *b, *c;
134a138,142
>
> a=(double *)malloc(sizeof(double)*(N+OFFSET));
> b=(double *)malloc(sizeof(double)*(N+OFFSET));
> c=(double *)malloc(sizeof(double)*(N+OFFSET));
>
283c291,293
<
---
> free(a);
> free(b);
> free(c);

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf