Hi Patrick,
Interesting to learn that you nowadays market Ethernet cards. You
still seem to possess some knowledge of other companies' switches,
too. Congrats.
My faith in the switches and crossbars is actually quite high; not so
much in the MPI cards, however.
Let's assume for now that I was speaking of high-end network cards
that send and receive MPI packets in our Myrinet cluster.
Let's say we've got a cool quad Xeon MP node with 64 logical cores @
3.x GHz. Soon a very popular machine, I'd guess; there isn't really
any other x86 CPU that can take it on head to head. That would be the
upcoming 'Beckton' CPU from Intel, the monster that probably eats
more power than any Intel x86 CPU before it :)
Feeding that monster node, however, is not so easy.
Just look at one node now, please. The MPI card receives one big
packet of a few megabytes. Likewise, a few other threads receive
megabyte-sized packets. At the same time as all this, the card also
gets a packet of a few bytes for thread 42.
How long does it take thread 42 to get its packet?
Will the card first handle all the megabyte-sized packets, or deliver
the quick short packet 'in between' to our "logical core 42"?
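To put some numbers on the question (a back-of-the-envelope Python sketch; the link speed, message sizes and MTU are my assumptions, not measurements from any real card):

```python
# Toy model of the scenario above (all numbers are assumptions):
# three 4 MB messages are ahead of a 64-byte message for thread 42
# on a 10 Gbit/s (Myri-10G class) link.

LINK_BPS = 10e9 / 8          # link bandwidth in bytes/s (10 Gbit/s)
BIG = 4 * 1024 * 1024        # one big message: 4 MB
SMALL = 64                   # the message for thread 42: 64 bytes

def wire_time(nbytes):
    """Serialization time of a message on the link, in seconds."""
    return nbytes / LINK_BPS

# Case 1: strict FIFO -- thread 42 waits behind all three big messages.
fifo_wait = 3 * wire_time(BIG) + wire_time(SMALL)

# Case 2: the fabric interleaves at packet granularity (assume a 4 KB
# MTU), so the small message waits at most one frame already in flight.
MTU = 4096
interleaved_wait = wire_time(MTU) + wire_time(SMALL)

print(f"FIFO:        {fifo_wait * 1e6:9.1f} us")
print(f"Interleaved: {interleaved_wait * 1e6:9.3f} us")
```

Under these assumptions the answer differs by three to four orders of magnitude, which is why the question matters.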
Thanks,
Vincent
P.S. I expect, of course, the answer '42' :)
On Feb 12, 2009, at 2:30 AM, Patrick Geoffray wrote:
Vincent Diepeveen wrote:
All such switch latencies are at least a factor of 50-100
worse than their one-way pingpong latency.
I think you are a bit confused about switch latencies.
There is the crossbar latency that is the time it takes for a
packet to be decoded and routed to the right output port. It is
essentially the difference between the pingpong latency with and
without the crossbar in the middle for the smallest packet size.
Typical crossbar latencies are on the order of 100ns for recent
Ethernot, 200ns for Ethernet. To build a bigger fabric, you need to
connect multiple crossbars into Clos, Fat-tree or Torus topologies.
The end-to-end switch latency is then dependent on the number of
crossbars the packet crosses.
There is the PHY/transceiver latency. That only applies to the edge
of the switch, where a physical cable plugs into a socket. SFP+,
for example, requires serialization compared to QSFP. With fiber,
the transceivers have some overhead. Typical overhead is 250ns per
port for a serial fiber PHY, and almost nothing for parallel copper.
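Putting the two components together (a minimal sketch using the figures above; the topologies and port counts are my illustrative assumptions):

```python
# End-to-end switch latency = (crossbars crossed) * (crossbar latency)
#                           + (fiber ports traversed) * (PHY overhead)
# Numbers from the discussion above: 100ns per recent crossbar,
# 250ns per serial fiber port, ~0 for parallel copper.

CROSSBAR_NS = 100   # per-crossbar decode-and-route latency
FIBER_PHY_NS = 250  # serial fiber transceiver overhead, per port

def switch_latency_ns(crossbars, fiber_ports):
    return crossbars * CROSSBAR_NS + fiber_ports * FIBER_PHY_NS

# Single crossbar, copper in and out: just the crossbar itself.
print(switch_latency_ns(crossbars=1, fiber_ports=0))   # -> 100

# Three-stage Clos/fat-tree, fiber at both edge ports:
print(switch_latency_ns(crossbars=3, fiber_ports=2))   # -> 800
```

So a multi-stage fabric multiplies the crossbar term, while the PHY term is paid only at the edges.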
Another overhead is head-of-line (HOL) blocking. It happens when a
packet has to wait for another one to pass in order to be switched.
This is equivalent to two cars turning onto the same road: one has
to wait for the other to make the turn. This latency can be
high, especially if the packets are large (imagine a couple of
trains instead of cars).
Is that what you call "ugly switch latency"? HOL blocking will
reduce your switch efficiency to ~40% with random traffic. That
means your latency will be about two times higher on average,
assuming all packets have the same size. Where is the factor of 50-100?
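The two-cars analogy can be made concrete with a toy sketch (my own illustration, not a model of any particular switch; the packet size and link speed are assumptions):

```python
# Head-of-line blocking in miniature: two same-size packets arrive at
# the same instant and want the same output port. The second must wait
# for the first to serialize, so its latency doubles -- in line with
# the "about two times higher on average" figure for same-size packets.

PACKET_BYTES = 1500
LINK_BPS = 10e9 / 8                      # 10 Gbit/s, in bytes/s
SLOT = PACKET_BYTES / LINK_BPS           # serialization time of one packet

arrivals = [0.0, 0.0]                    # both packets arrive together
port_free = 0.0                          # when the output port is next idle
latencies = []
for t in arrivals:                       # FIFO through the output port
    start = max(t, port_free)
    port_free = start + SLOT
    latencies.append(port_free - t)

print([l / SLOT for l in latencies])     # -> [1.0, 2.0]
```

Make the packets trains instead of cars (larger SLOT) and the absolute penalty grows proportionally, but it stays a small constant factor, not 50-100.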
My assumption is always: "if a manufacturer doesn't respond, it
must be really bad for his network card".
Maybe they don't respond because the question does not make any sense.
Note that pingpong latency also gets demonstrated in the wrong manner.
A requirement for determining one-way pingpong latency should be
that measuring it eats no CPU time.
You mean blocking on an interrupt? When you go to a restaurant, do
you place your order and go back home to wait for a phone call, or
do you wait at a table? I, for one, sit down and busy poll.
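The two waiting strategies look roughly like this (a toy Python sketch of the idea, not any real MPI library's completion path; an event stands in for the interrupt, a plain flag for the memory word a NIC would write):

```python
# Two ways to wait for a message: block on a notification ("go home
# and wait for the phone call") vs. busy-poll memory ("sit at the
# table"). Busy polling avoids the interrupt/wakeup latency at the
# cost of burning a CPU core.

import threading
import time

done = threading.Event()       # stands in for the interrupt/notification
flag = [False]                 # stands in for a word the NIC writes

def server():
    time.sleep(0.01)           # the "kitchen" takes a while
    flag[0] = True             # completion lands in memory first...
    done.set()                 # ...then the notification fires

threading.Thread(target=server).start()

# Strategy 1: busy polling -- lowest latency, one core pegged.
while not flag[0]:
    pass                       # spin; no syscall, no scheduler wakeup

# Strategy 2: blocking -- frees the CPU, but pays the wakeup cost.
# (Here it returns immediately, since completion already happened.)
done.wait()

print("order served")
```

Real implementations often mix the two: spin for a bounded time, then fall back to blocking if nothing arrives.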
Patrick
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf