On Apr 5, 2009, at 12:38 AM, Nifty Tom Mitchell wrote:

On Sat, Apr 04, 2009 at 12:19:02PM -0700, Greg Lindahl wrote:
On Sat, Apr 04, 2009 at 04:06:22PM +0200, Eugen Leitl wrote:
On Fri, Apr 03, 2009 at 01:32:13PM -0700, Greg Lindahl wrote:

Will have to do with embedded memory or stacked 3d memory a la
http://www.cc.gatech.edu/~loh/Papers/isca2008-3Ddram.pdf

We've been building bigger and bigger SMPs for a long time, making
changes to improve the memory system as needed. How is multicore any different?

Off-die memory bandwidth and latency are limited, so many codes
start running into memory bottlenecks even at a moderate number
of cores.

Exactly like shared-bus multiprocessors.

The incremental method of solving this is what Opteron/Nehalem does.
The more radical method is what Origin/Altix did. It all comes down to
how many pins you're willing to commit to memory and how few pins you
can squeeze a memory bus down to; once you bump up against that limit,
then you need an improvement in packaging, which is no more radical
than what Opteron/Nehalem or Origin/Altix did. And, of course, all
this was invented before Opteron or Origin.

The memory bus issue is a serious one, but with a kilo-core chip
one solution may surface with a reflection back to the Inmos Transputer.

If there were a core-to-core pipe-like bus, perhaps with a switch (not a
full-blown any-to-any crossbar), one core could communicate via a
pipe equivalent with one or more other cores.  The pipe need not be too deep.
Perhaps 128 cache-line equivalents.  I do not think that it will be
necessary for each core to connect to all the other cores directly as
long as data could flow across the die. The interesting challenge for
coding would be to take advantage of a data flow across the die.

Two of the problems with the Transputer were the lack of virtual memory
and that the network was fixed.
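A minimal sketch of what such a bounded core-to-core pipe might look like from software (my illustration only; core_pipe, the field names, and the single-producer/single-consumer convention are all hypothetical; the 128-slot depth is taken from the paragraph above):

#include <stdint.h>
#include <string.h>

#define PIPE_DEPTH 128   /* "perhaps 128 cache line equivalents" */
#define LINE_BYTES  64   /* one cache line per slot */

typedef struct {
    uint8_t slot[PIPE_DEPTH][LINE_BYTES];
    volatile uint32_t head;   /* advanced only by the producing core */
    volatile uint32_t tail;   /* advanced only by the consuming core */
} core_pipe;

/* Non-blocking send of one cache line; returns 0 if the pipe is full. */
static int pipe_send(core_pipe *p, const void *line)
{
    uint32_t h = p->head;
    if (h - p->tail == PIPE_DEPTH)
        return 0;                                   /* full */
    memcpy(p->slot[h % PIPE_DEPTH], line, LINE_BYTES);
    p->head = h + 1;                                /* publish the slot */
    return 1;
}

/* Non-blocking receive of one cache line; returns 0 if the pipe is empty. */
static int pipe_recv(core_pipe *p, void *line)
{
    uint32_t t = p->tail;
    if (p->head == t)
        return 0;                                   /* empty */
    memcpy(line, p->slot[t % PIPE_DEPTH], LINE_BYTES);
    p->tail = t + 1;                                /* free the slot */
    return 1;
}

On real hardware the indices would live in the switch and the memcpy would be a line-sized transfer; a software rendering like this would also need memory barriers between filling a slot and publishing it, which I've left out for brevity.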


Hi, Transputers are a bit before my time,
but wasn't it the case that Transputers ran at something like 1 MHz with a network of nearly 10 Mbit? The processor speeds reported to me came with a really bad, ugly IPC. I'm not sure whether I was misinformed there. So let's say the IPC was effectively 0.4.

So, in short, the processor speed in MHz was nearly the same as the datapath's, but the datapath far outperformed the number of bytes per second the processor
could produce.

Processor speed being nearly the same as the datapath speed is, today, wishful
thinking.

The equivalent of that today would be, say, 1024 cores at 1.5 GHz with 128-bit
vectors and 2 to nearly 3 instructions per cycle. That's:

16 bytes * 3 * 1.5 GHz = 48 * 1.5 GB/s = 72 GB/s from each processor. Or, including error correction (I'm no expert there), maybe 720 Gbit lanes from each processor
going to the network.
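A quick sanity check of that arithmetic (standard C, my numbers-only example; the 10/8 coding factor is my assumption for the error-correction overhead):

#include <stdio.h>

int main(void)
{
    double bytes_per_op  = 16.0;  /* 128-bit vector operand */
    double ops_per_cycle = 3.0;   /* "nearly 3 instructions a cycle" */
    double ghz           = 1.5;   /* clock frequency in GHz */

    double gbytes = bytes_per_op * ops_per_cycle * ghz;  /* GB/s per core */
    double gbits  = gbytes * 8.0;                        /* raw Gbit/s    */
    double coded  = gbits * 10.0 / 8.0;   /* assumed 8b/10b-style coding */

    printf("%.0f GB/s = %.0f Gbit/s raw, ~%.0f Gbit/s coded per core\n",
           gbytes, gbits, coded);         /* 72 GB/s = 576, ~720 Gbit/s */
    return 0;
}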

In short, there is an ever bigger need for low-level programmers and math-type guys who can rewrite algorithms into something that does most of its work within the L1 cache and limits
the amount of bandwidth used over the network. Not to speak of latency.

I already proposed a while ago to rewrite some FFT-type codes into integer versions, in order to limit the number of bytes stored to RAM (integer transforms can effectively use more bits per 64-bit integer than the FFT transforms can store in doubles). It would require more integer resources on the CPUs though, especially for multiplication. An additional advantage for free is that integer transforms have no errors, which at today's huge FFT sizes is not a small advantage but already a real huge one. The last advantage is that when such software codes work really well, you can put the logic in hardware a lot more easily than floating point,
which gives a huge speed boost.

To give exact, concrete numbers, an example:

We consider now the Yap transform, using the prime that fits within a 64-bit register:

P = 2^64 - 2^32 + 1

There are 3 "instructions" needed, so to speak.
I cut'n'paste the C source code here from the FFT code.

Note: those interested can get the relevant code (maybe I need to clean up some
factorisation code that's inside, not sure):

FORCEINLINE uint64 plusmod(uint64 x,uint64 y) { // returns : x + y (mod P)
  uint64 result=x+y,res1=result,res2=result-P,resultored=x|y,res3=result+4294967295;
  if( resultored > res1 )
    result = res3; // result + 2^64 - P
  if( res1 >= P )
    result = res2; // result - P
  return result;
}

FORCEINLINE uint64 minusmod(uint64 x,uint64 y) { // returns : x - y (mod P)
  uint64 res1=y-x,result=x-y,res2=P-res1;
  if( y > x )
    result = res2;
  return result;
}

FORCEINLINE uint64 mulmod(uint64 x,uint64 y) { // returns : x * y (mod P)
  uint64 result,reslo,reshi;
  uint64 hi_least,hi_most,hi_shifted,hi_addtolo;

  #if USEINTRINSICSX64
  reslo = _umul128(x,y,&reshi);
  #else
  reshi = mul_limbs(&reslo,x,y);
  #endif
  /* taking Modulo:
   *                           p = 2^64 - 2^32 + 1
   *  ==>  i * (2^64 - 2^32 + 1) = 0
   * <==>               2^64 * i = 2^32 * i - i
   *
   * so if i is top (most significant) 32 bits from 128 bits
   * then :  2^96 * i == 2^64 * i - 2^32 * i
   *
   * because 2^96 == -1 (mod P) ==>
   * reduction of i * 2^96 amounts to
   * subtraction of i (the most significant dword)
   */

// first a simplistic implementation, to see whether the DFT/NTT with this P works at all
  // split the high quadword into 2 dwords:

  // we postpone the reduction of bits 96..127 until the end of the modulo,
  // so we start with bits 64..95
  hi_least   = reshi&0x00000000ffffffff;
  hi_shifted = reshi<<32;
  hi_most    = reshi>>32;

  // now: 2^64 * i = 2^32 * i - i
  hi_addtolo = hi_shifted-hi_least;

  // now it is : reslo + hi_addtolo - hi_most;
  result = plusmod(reslo,hi_addtolo);
  result = minusmod(result,hi_most);
  return result;
}
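For completeness, here is how the three primitives would combine in a single radix-2 NTT butterfly (my sketch only, not from the code above; the name butterfly and its calling convention are hypothetical):

/* One radix-2 butterfly of the NTT over GF(P), using the three
 * primitives above; w is the twiddle factor, a power of a root
 * of unity mod P. */
static void butterfly(uint64 *a, uint64 *b, uint64 w)
{
    uint64 t = mulmod(*b, w);   /* twiddle multiply */
    *b = minusmod(*a, t);       /* a - w*b  (mod P) */
    *a = plusmod(*a, t);        /* a + w*b  (mod P) */
}

The full transform is then just the usual FFT recursion built from these butterflies, all in exact integer arithmetic.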

You see, it's a lot of C code for something that, put in hardware (and that shouldn't be too complicated to do),
totally crushes existing floating-point FFTs.

Such generic FFTs can be used for anything. There is just one difference from the floating-point FFTs: the floating-point FFTs, which run in all that SSE2 code, can have an error; this FFT can't.
AND this FFT can store more bits in each 64-bit unit.
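To make the "more bits per 64-bit unit" point concrete: a double carries only a 53-bit mantissa, so exact integers collapse past 2^53, while a 64-bit NTT word modulo P stays exact over nearly the full 64 bits. A quick check (standard C, my example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t n = (1ULL << 53) + 1;  /* 9007199254740993 */
    double   d = (double)n;         /* rounds to 2^53: the +1 is lost */

    printf("uint64: %llu\n", (unsigned long long)n);
    printf("double: %.0f\n", d);    /* prints 9007199254740992 */
    return 0;
}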

You can then use these instructions to speed up all matrix calculations major league, which is something like 50% of all system time on supercomputers, and it most definitely runs on all those GPU-type clusters.

Thanks,
Vincent


--
        T o m  M i t c h e l l
        Found me a new hat, now what?


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

