So, for whoever replaces Greg:
I understand that your cards can't interrupt at all.
Users just have to wait until other messages have passed over the wire
before they receive a very short message (one that, for example, aborts the entire job).

In short, if someone else on the cluster is streaming data to some nodes,
then you have a major latency problem.

Vincent

From: Greg Lindahl
To: Vincent Diepeveen
On Fri, Aug 04, 2006 at 11:19:42AM +0100, Vincent Diepeveen wrote:

What is that time 'in between' for your specific card?

Zero. That's the whole point of the message-rate benchmark, and a
unique aspect of the interconnect that I designed.

Now please stop emailing me personally; you know I find you extremely
annoying.

So the time needed to *interrupt* the current long message.

Our interconnect uses no interrupts.

The cruel reality of trying to scale to 100% on a network is that you can't make a special thread just nonstop checking for an MPI message, like all you guys do for your ping-pong measurements.

We do not ever use a special thread.

That is the REAL problem.

The real problem is that you do not understand that you don't know
everything.

Now, as I said earlier, never email me personally.

-- greg

----- Original Message ----- From: "Vincent Diepeveen" <[EMAIL PROTECTED]>
To: "Greg Lindahl" <[EMAIL PROTECTED]>
Sent: Friday, August 04, 2006 11:19 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes


Yeah, you meant it's 200 usec latency.

When all 16 cores want something from the card and that card is serving 16 threads, then 200 usec is probably the minimum latency from the moment one long MPI message (say about 200 MB) starts arriving until some other thread can receive a very short message "in between".

What is that time 'in between' for your specific card?

So the time needed to *interrupt* the current long message.

The cruel reality of trying to scale to 100% on a network is that you can't make a special thread just nonstop checking for an MPI message, like all you guys do for your ping-pong measurements.

If you have 16 cores, you want to run 16 processes on those 16 cores. A 17th thread already means time-slicing, and an 18th thread is doing I/O to and from the user. If a 19th thread checks regularly for MPI messages from other nodes and is not currently in the run queue, the OS already has a wakeup latency of some 10 milliseconds just to put that thread back in the run queue.

That is the REAL problem.

Thanks to that run-queue latency, you just can't dedicate a special thread to short messages if you want to use all the cores.

So the only solution for that is polling from the working processes themselves.

For non-embarrassingly-parallel software that needs to poll for short messages, the time a single poll takes,
just to check whether a tiny message is there, is therefore CRUCIAL.
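Something like this is what I mean; a rough sketch only (assuming MPI_Iprobe as the polling call, with a made-up ABORT_TAG and a dummy work loop just for illustration):

    /* Sketch: each worker process interleaves its own computation with a
     * cheap non-blocking probe for short control messages, instead of
     * dedicating a separate listener thread. */
    #include <mpi.h>

    #define ABORT_TAG 99                      /* illustrative control tag */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int done = 0;
        for (long iter = 0; iter < 100000000 && !done; ++iter) {
            /* ... do one slice of the real computation here ... */

            /* The cost of this single call is the poll time in question. */
            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(MPI_ANY_SOURCE, ABORT_TAG, MPI_COMM_WORLD, &flag, &status);

            if (flag) {
                int cmd;
                MPI_Recv(&cmd, 1, MPI_INT, status.MPI_SOURCE, ABORT_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                done = 1;                     /* e.g. "abort the entire job" */
            }
        }

        MPI_Finalize();
        return 0;
    }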

If it is a read from local RAM (local to that processor) taking 0.13 us, then that already slows the
program down a tad.

Preferably most of these polls are served from the L2 cache, which costs a cycle or 13.

It's quite interesting to know which card/implementation has the fastest poll time here for processes that regularly poll for short messages,
including the overhead of checking for overflow of the given protocol.

If that's 0.5 us because you have to check for all kinds of MPI overflow, then that sucks a lot. Such a card I'd throw away directly.
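To get that number for a given card you can time the empty poll itself; a rough micro-benchmark sketch (just averaging MPI_Iprobe calls in one process, an assumption on my part, not any vendor's benchmark):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const long n = 1000000;               /* number of polls to average over */
        int flag;
        MPI_Status status;

        double t0 = MPI_Wtime();
        for (long i = 0; i < n; ++i)
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        double t1 = MPI_Wtime();

        printf("average poll time: %.3f us\n", 1e6 * (t1 - t0) / n);

        MPI_Finalize();
        return 0;
    }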

Vincent

----- Original Message ----- From: "Greg Lindahl" <[EMAIL PROTECTED]>
To: "Joachim Worringen" <[EMAIL PROTECTED]>; <beowulf@beowulf.org>
Sent: Thursday, August 03, 2006 10:07 PM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes


On Thu, Aug 03, 2006 at 12:53:40PM -0700, Greg Lindahl wrote:

We have clearly stated that the Mellanox switch is around 200 usec per
hop.  Myricom's number is also well known.

Er, 200 nanoseconds. Y'all know what I meant, right? :-)

-- greg

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


