On Nov 8, 2011, at 2:46 AM, Gilad Shainer wrote:

>> I just test things and go for the fastest. But if we do theoretic math,
>> SHMEM is difficult to beat of course.
>> Google for measurements with shmem, not many out there.
>
> SHMEM within the node or between nodes?
SHMEM is the programming library that Cray had and that Quadrics had.
Basically your program doesn't need message-catching MPI calls everywhere.
You only define at program start which arrays get tracked by the Elan4 and
which nodes they get updated to, and so on. So there is no need to check for
MPI overflows in the complex code that starts and stops CPUs, and that code
can easily be reused to start remote nodes and CPUs. So where the majority of
the latency goes into RDMA reads and/or reads from remote Elan memory, the
tricky (though negligible in overhead) code to start/stop CPUs is a bit
easier to program with the SHMEM library (a small sketch of what that style
looks like is further down in this mail). The caches on the Quadrics cards
hold the SHMEM data, so you don't touch the RAM at all; it is already on the
card. I didn't check whether those features got added to MPI somehow. So you
just need to read the card - on the remote node it doesn't go through PCI-X
at all. Of course all of this is not so relevant to explain here, as Quadrics
is long gone and I'm just looking for a cheap solution :) In such a case you
lose only 2x the PCI-X latency, versus 4x the PCIe latency.

For an RDMA read I doubt the latency of DDR InfiniBand is faster than
Quadrics. That 0.7 you mentioned, if it is microseconds, sounds like an
overestimated latency for PCI-X. The MPI one-way pingpong on the QM500 is
1.3 us, so the round trip is 2.6 us. According to your math the four PCI-X
crossings alone would already cost 2.8 us, which is more than the whole round
trip. Per direction the Elan parts cost roughly 130 ns for the receiving
Elan, say 300 ns for the switch including cables for a 128-port router, and
100 ns for the sending Elan. That's 530 ns, times 2 is 1060 ns. There is
really little left for PCI-X: 2.6 - 1.06 = 1.54 us for the 4 PCI-X crossings,
so 1.54 / 4 is roughly 0.39 us per crossing. I used the Los Alamos National
Laboratory example numbers for Elan4 here. In the end it is about price, not
user-friendliness of programming :)

>> Fact that so few standardized/rewrote their floating point software to
>> gpu's, is already saying enough about all the legacy codes in HPC world :)
>>
>> When some years ago i had a working 2 cluster node here with QM500-A, it
>> had at 32 bits, 33Mhz pci long sleeve slots a blocked read latency of
>> under 3 us is what i saw on my screen. Sure i had no switch in between it.
>> Direct connection between the 2 elan4's.
>>
>> I'm not sure what pci-x adds to it when clocked at 133Mhz, but it won't be
>> a big diff with pci-e.
>
> There is a big different between PCIX and PCIe. PCIe is half the
> latency - from 0.7 to 0.3 more or less.

Well, I'm not so sure the difference is that huge. All those measurements
from the past were on old Xeon P4 machines, and I have never really seen a
good comparison there. Furthermore, fabrics like Dolphin at the time already
got around 1.36 us one-way pingpong latency with a 66 MHz, 64-bit PCI card,
which is not exactly a lot slower than the claimed 1.2 us of QLogic's DDR
InfiniBand.

>> PCI-e probably only has a bigger bandwidth isn't it?
>
> Also bandwidth ...:-)

That's a non-discussion here. I need latency :) If I really needed big
bandwidth for transport I would of course use a boat - 90% of all cargo here
gets transported over the rivers and hand-dug canals, especially the river
Rhine.
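Coming back to SHMEM for a moment: here is a minimal sketch of the style I
mean, written with modern OpenSHMEM-style calls rather than the exact
Quadrics/Cray libshmem API; the array and variable names are made up just for
illustration.

#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();            /* this node / processing element */
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE gets the same array and the NIC knows
     * about it, so remote reads and writes need no receive code at all. */
    long *table = shmem_malloc(1024 * sizeof(long));
    table[me] = 42 + me;
    shmem_barrier_all();

    /* One-sided get: read straight out of the neighbour's memory; the
     * remote CPU is never interrupted to "catch" a message. */
    long remote_val;
    int neighbour = (me + 1) % npes;
    shmem_long_get(&remote_val, &table[neighbour], 1, neighbour);

    /* One-sided put: write my element into the neighbour's copy. */
    shmem_long_put(&table[me], &table[me], 1, neighbour);
    shmem_quiet();                       /* wait until the put has landed */

    printf("PE %d read %ld from PE %d\n", me, remote_val, neighbour);

    shmem_barrier_all();
    shmem_free(table);
    shmem_finalize();
    return 0;
}

With an OpenSHMEM implementation that would typically be built with its oshcc
wrapper and launched with oshrun; the point is only that there is no matching
receive anywhere - the card resolves the get/put on its own.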
>> Beating such hardware 2nd hand is difficult. $30 on ebay and i can install
>> 4 rails or so.
>> Didn't find the cables yet though...
>>
>> So i don't see how to outdo that with old infiniband cards which are $130
>> and upwards for the connectx, say $150 soon, which would allow only single
>> rail or maybe at best 2 rails. So far didn't hear anyone yet who has more
>> than single rail IB.
>>
>> Is it possible to install 2 rails with IB?
>
> Yes, you can do dual rails very well
>
>> So if i use your number in pessimistic manner, which means that there is
>> some overhead of pci-x, then the connectx type IB, can do 1 million
>> blocked reads per second theoretic with 2 rails. Which is $300 or so,
>> cables not counted.
>
> Are you referring to RDMA reads?

As I use all CPU cores 100%, I simply cannot catch MPI messages, let alone
handle overflow. So any form of communication where the card's processor does
the job of digging in the RAM, rather than bugging one of the very busy
cores, is very welcome. 99.9% of all communication to remote nodes consists
of 32-byte RDMA writes and 128-256 byte reads. I can choose myself whether a
read is 128, 192 or 256 bytes; probably I'll make it 128. The number of reads
is a few percent higher than the number of writes. The other 0.1% is the very
complex parallel algorithm that basically parallelizes a sequential
algorithm. That algorithm is roughly 150 A4 pages full of insights and proofs
of why it works correctly :)
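For what it's worth, on the InfiniBand side such a one-sided write is just a
posted work request. Here is a rough verbs sketch; all the queue pair, memory
registration and rkey/address exchange boilerplate is left out, and the
function and variable names are only for illustration.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch only.  Assumes qp/cq are an already connected RC queue pair and its
 * completion queue, mr/buf a locally registered buffer, and remote_addr/rkey
 * the peer's buffer address and key, exchanged out of band beforehand. */
static int post_small_rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                                 struct ibv_mr *mr, void *buf,
                                 uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,    /* local source of the 32 bytes */
        .length = 32,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* IBV_WR_RDMA_READ for the reads */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* where it lands on the peer */
    wr.wr.rdma.rkey        = rkey;

    /* The HCA moves the data; the remote cores never see an interrupt and
     * never post a receive. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Reap the local completion whenever convenient. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                        /* busy-poll, for the sketch */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}

In practice the completion polling would be folded into the search loop
rather than busy-waited on, but the point is the same as with SHMEM: the
remote cores never have to catch anything.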