On Apr 5, 2009, at 12:38 AM, Nifty Tom Mitchell wrote:

On Sat, Apr 04, 2009 at 12:19:02PM -0700, Greg Lindahl wrote:
On Sat, Apr 04, 2009 at 04:06:22PM +0200, Eugen Leitl wrote:
On Fri, Apr 03, 2009 at 01:32:13PM -0700, Greg Lindahl wrote:

Will have to do with embedded memory or stacked 3d memory a la
http://www.cc.gatech.edu/~loh/Papers/isca2008-3Ddram.pdf

We've been building bigger and bigger SMPs for a long time, making
changes to improve the memory system as needed. How is multicore any different?

Off-die memory bandwidth and latency are limited, so many codes
start running into memory bottlenecks even at a moderate number
of cores.

Exactly like shared-bus multiprocessors.

The incremental method of solving this is what Opteron/Nehalem does.
The more radical method is what Origin/Altix did. It all comes down to
how many pins you're willing to commit to memory and how few pins you
can squeeze a memory bus down to; once you bump up against that limit,
then you need an improvement in packaging, which is no more radical
than what Opteron/Nehalem or Origin/Altix did. And, of course, all
this was invented before Opteron or Origin.

The memory bus issue is a serious one, but with a kilo-core chip
one solution may surface with a reflection back to the Inmos Transputer.

If there were a core-to-core pipe-like bus, perhaps with a switch (not a
full-blown any-to-any crossbar), one core could communicate via a
pipe equivalent with one or more other cores.  The pipe need not be too deep.
Perhaps 128 cache-line equivalents.  I do not think that it will be
necessary for each core to connect to all the other cores directly as
long as data could flow across the die. The interesting challenge for
coding would be to take advantage of a data flow across the die.

Two of the problems with the Transputer were the lack of virtual memory
and that the network was fixed.
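A minimal sketch of what such a bounded core-to-core pipe might look like from software (my illustration only; core_pipe, the field names, and the single-producer/single-consumer convention are all hypothetical; the 128-slot depth is taken from the paragraph above):

#include <stdint.h>
#include <string.h>

#define PIPE_DEPTH 128   /* "perhaps 128 cache line equivalents" */
#define LINE_BYTES  64   /* one cache line per slot */

typedef struct {
    uint8_t slot[PIPE_DEPTH][LINE_BYTES];
    volatile uint32_t head;   /* advanced only by the producing core */
    volatile uint32_t tail;   /* advanced only by the consuming core */
} core_pipe;

/* Non-blocking send of one cache line; returns 0 if the pipe is full. */
static int pipe_send(core_pipe *p, const void *line)
{
    uint32_t h = p->head;
    if (h - p->tail == PIPE_DEPTH)
        return 0;                                   /* full */
    memcpy(p->slot[h % PIPE_DEPTH], line, LINE_BYTES);
    p->head = h + 1;                                /* publish the slot */
    return 1;
}

/* Non-blocking receive of one cache line; returns 0 if the pipe is empty. */
static int pipe_recv(core_pipe *p, void *line)
{
    uint32_t t = p->tail;
    if (p->head == t)
        return 0;                                   /* empty */
    memcpy(line, p->slot[t % PIPE_DEPTH], LINE_BYTES);
    p->tail = t + 1;                                /* free the slot */
    return 1;
}

On real hardware the indices would live in the switch and the memcpy would be a line-sized transfer; a software rendering like this would also need memory barriers between filling a slot and publishing it, which I've left out for brevity.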


Hi, Transputers are a bit before my time,
but wasn't it the case that Transputers ran at something like 1 MHz with a network of nearly 10 Mbit? The processor speeds reported to me came with a really bad, ugly IPC. I'm not sure whether I was misinformed there. So let's say the IPC was effectively 0.4.

So, in short, the processor speed in MHz was nearly the same as the datapath's, but the datapath far outperformed the number of bytes per second the processor
could produce.

Processor speed being nearly the same as the datapath speed is, today, wishful
thinking.

The equivalent of that today would be, say, 1024 cores at 1.5 GHz with 128-bit
vectors and 2 to nearly 3 instructions per cycle. That's:

16 bytes * 3 * 1.5 GHz = 48 * 1.5 GB/s = 72 GB/s from each processor. Or, including error correction (I'm no expert there), maybe 720 Gbit lanes from each processor
going to the network.
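A quick sanity check of that arithmetic (standard C, my numbers-only example; the 10/8 coding factor is my assumption for the error-correction overhead):

#include <stdio.h>

int main(void)
{
    double bytes_per_op  = 16.0;  /* 128-bit vector operand */
    double ops_per_cycle = 3.0;   /* "nearly 3 instructions a cycle" */
    double ghz           = 1.5;   /* clock frequency in GHz */

    double gbytes = bytes_per_op * ops_per_cycle * ghz;  /* GB/s per core */
    double gbits  = gbytes * 8.0;                        /* raw Gbit/s    */
    double coded  = gbits * 10.0 / 8.0;   /* assumed 8b/10b-style coding */

    printf("%.0f GB/s = %.0f Gbit/s raw, ~%.0f Gbit/s coded per core\n",
           gbytes, gbits, coded);         /* 72 GB/s = 576, ~720 Gbit/s */
    return 0;
}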

In short, there is an ever bigger need for low-level programmers and math-type guys who can rewrite algorithms into something that does most of its work within the L1 cache and limits
the amount of bandwidth used over the network. Not to speak of latency.

I already proposed a while ago to rewrite some FFT-type codes into integer versions, in order to limit the number of bytes stored to RAM (integer transforms can effectively use more bits per 64-bit integer than the FFT transforms can store in doubles). It would require more integer resources on the CPUs though, especially for multiplication. An additional advantage for free is that integer transforms have no errors, which at today's huge FFT sizes is not a small advantage but already a real huge one. The last advantage is that when such software codes work really well, you can put the logic in hardware a lot more easily than floating point,
which gives a huge speed boost.

To give exact, concrete numbers, an example:

We consider now the Yap transform, using the prime that fits within a 64-bit register:

P = 2^64 - 2^32 + 1

There are 3 "instructions" needed, so to speak.
I cut'n'paste the C source code here from the FFT code.

Note: those interested can get the relevant code (maybe I need to clean up some
factorisation code that's inside, not sure):

FORCEINLINE uint64 plusmod(uint64 x,uint64 y) { // returns : x + y (mod P)
  uint64 result=x+y,res1=result,res2=result-P,resultored=x|y,res3=result+4294967295;
  if( resultored > res1 )
    result = res3; // result + 2^64 - P
  if( res1 >= P )
    result = res2; // result - P
  return result;
}

FORCEINLINE uint64 minusmod(uint64 x,uint64 y) { // returns : x - y (mod P)
  uint64 res1=y-x,result=x-y,res2=P-res1;
  if( y > x )
    result = res2;
  return result;
}

FORCEINLINE uint64 mulmod(uint64 x,uint64 y) { // returns : x * y (mod P)
  uint64 result,reslo,reshi;
  uint64 hi_least,hi_most,hi_shifted,hi_addtolo;

  #if USEINTRINSICSX64
  reslo = _umul128(x,y,&reshi);
  #else
  reshi = mul_limbs(&reslo,x,y);
  #endif
  /* taking Modulo:
   *                           p = 2^64 - 2^32 + 1
   *  ==>  i * (2^64 - 2^32 + 1) = 0
   * <==>               2^64 * i = 2^32 * i - i
   *
   * so if i is top (most significant) 32 bits from 128 bits
   * then :  2^96 * i == 2^64 * i - 2^32 * i
   *
   * because 2^96 == -1 (mod P) ==>
   * reduction of i * 2^96 amounts to
   * subtraction of i (the most significant dword)
   */

// first a simplistic implementation, to see whether the DFT/NTT with this P works at all
  // split the high quadword into 2 dwords:

  // we postpone the reduction of bits 96..127 until the end of the modulo,
  // so we start with bits 64..95
  hi_least   = reshi&0x00000000ffffffff;
  hi_shifted = reshi<<32;
  hi_most    = reshi>>32;

  // now: 2^64 * i = 2^32 * i - i
  hi_addtolo = hi_shifted-hi_least;

  // now it is : reslo + hi_addtolo - hi_most;
  result = plusmod(reslo,hi_addtolo);
  result = minusmod(result,hi_most);
  return result;
}
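For completeness, here is how the three primitives would combine in a single radix-2 NTT butterfly (my sketch only, not from the code above; the name butterfly and its calling convention are hypothetical):

/* One radix-2 butterfly of the NTT over GF(P), using the three
 * primitives above; w is the twiddle factor, a power of a root
 * of unity mod P. */
static void butterfly(uint64 *a, uint64 *b, uint64 w)
{
    uint64 t = mulmod(*b, w);   /* twiddle multiply */
    *b = minusmod(*a, t);       /* a - w*b  (mod P) */
    *a = plusmod(*a, t);        /* a + w*b  (mod P) */
}

The full transform is then just the usual FFT recursion built from these butterflies, all in exact integer arithmetic.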

You see, it's a lot of C code for something that, put in hardware (and that shouldn't be too complicated to do),
totally crushes existing floating-point FFTs.

Such generic FFTs can be used for anything. There is just one difference from the floating-point FFTs: the floating-point FFTs, which run in all that SSE2 code, can have an error; this FFT can't.
AND this FFT can store more bits in each 64-bit unit.
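To make the "more bits per 64-bit unit" point concrete: a double carries only a 53-bit mantissa, so exact integers collapse past 2^53, while a 64-bit NTT word modulo P stays exact over nearly the full 64 bits. A quick check (standard C, my example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t n = (1ULL << 53) + 1;  /* 9007199254740993 */
    double   d = (double)n;         /* rounds to 2^53: the +1 is lost */

    printf("uint64: %llu\n", (unsigned long long)n);
    printf("double: %.0f\n", d);    /* prints 9007199254740992 */
    return 0;
}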

You can then use these instructions to speed up all matrix calculations major league, which is something like 50% of all system time on supercomputers, and it most definitely runs on all those GPU-type clusters.

Thanks,
Vincent


--
        T o m  M i t c h e l l
        Found me a new hat, now what?


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

