Rahul,
On 7/13/2010 12:04 AM, Rahul Nabar wrote:
I am puzzled by a bunch of ARP requests on my network that I captured
using tcpdump. Shouldn't ARP discovery requests always be sent to a
broadcast address?
No, the kernel regularly refreshes the entries in the ARP cache with
unicast requests.
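For example, you can watch both flavors with tcpdump and poke at the
neighbor cache directly. A quick sketch (eth0 is a placeholder for your
interface):

    # broadcast discovery and unicast refreshes both match this filter
    tcpdump -n -e -i eth0 arp
    # entry states (REACHABLE, STALE, DELAY, PROBE) show the refresh cycle
    ip neigh show dev eth0
    # the refresh interval is derived from this kernel setting
    sysctl net.ipv4.neigh.default.base_reachable_time_ms

The -e flag prints the Ethernet header, so you can see that the refresh
requests go to the unicast MAC of the cached entry, not to
ff:ff:ff:ff:ff:ff.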
On 3/15/2010 5:24 PM, richard.wa...@comcast.net wrote:
to best and worst case). It would be good to add Ethernet to the mix
(1Gb, 10Gb, and 40Gb) as well.
10 Gb Ethernet (over CX4/XAUI) uses 8b/10b with a signal rate of 12.5
Gb/s, for a raw bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is
On 3/15/2010 5:33 PM, Gilad Shainer wrote:
To make it more accurate, most PCIe chipsets support 256B reads, and
the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20.
I know marketers live in their own universe, but here are a few nuts
for you to crack:
* If most PCIe chipsets woul
Hi Richard,
I meant to reply earlier but got busy.
On 2/27/2010 11:17 PM, richard.wa...@comcast.net wrote:
If anyone finds errors in it please let me know so that I can fix
them.
You don't consider the protocol efficiency, and this is a major issue on
PCIe.
First of all, I would change the
Brian,
On 2/19/2010 1:25 PM, Brian Dobbins wrote:
the IB cards. With a 4-socket node having between 32 and 48 cores, lots
of computing can get done fast, possibly stressing the network.
I know Qlogic has made a big deal about the InfiniPath adapter's
extremely good message rate in the past.
Joe,
On 2/19/2010 9:52 PM, Joe Landman wrote:
This aside, both AoE and iSCSI provide block device services. Both
systems can present a block device with a RAID backing store. Patrick
and others will talk about the beauty of the standards, but this is
unfortunately irrelevant in the market. The m
On 2/18/2010 2:26 PM, Jesse Becker wrote:
On Thu, Feb 18, 2010 at 01:12:05PM -0500, Gerald Creager wrote:
For what you're describing, I'd consider CoRAID's AoE technology and
I'll second this recommendation. The Coraid servers are fairly
+1. The AoE spec is very simple, I wish it would have
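For the curious, the Linux side of AoE is pleasantly small: a kernel
module plus the aoetools package. A minimal sketch (shelf/slot numbers
are examples; devices appear as /dev/etherd/e<shelf>.<slot>):

    modprobe aoe
    aoe-discover                   # probe the LAN for AoE targets
    aoe-stat                       # list what was found
    mkfs -t ext3 /dev/etherd/e0.0  # then use it like any block device
    mount /dev/etherd/e0.0 /mnt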
Bogdan Costescu wrote:
long as it fits in one page. At another time, the switch was more
likely to drop large frames under high load (maybe something to do
with internal memory management), so the 9000-byte frames worked most
of the time while the 1500-byte ones worked all the time...
This is a
Rahul Nabar wrote:
What was your tool to measure this latency? Just curious.
I like to use netperf to measure performance over Sockets, including
latency (it's there but not obvious). For OS-bypass interfaces, your
favorite MPI benchmark is fine.
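The latency is buried in the request/response tests: netperf's TCP_RR
reports transactions per second, and with 1-byte payloads the round-trip
time is just the inverse of that rate. A sketch, assuming netserver is
running on the remote host:

    # 1-byte request, 1-byte response; RTT ~= 1 / (transactions/s)
    netperf -H remotehost -t TCP_RR -- -r 1,1

Half of that RTT is the usual one-way latency figure.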
Patrick
Rahul Nabar wrote:
The TSO and LRO are only relevant to TCP though, aren't they? I am
using RDMA so that shouldn't matter. Maybe I am wrong.
TSO/LRO applies to TCP, but you can have the same technique with a
different protocol: USO, UDP Send Offload, for example.
RDMA is everything you want
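On Linux, these offloads are per-NIC toggles, so it is easy to check
what your driver is actually doing. For example (eth0 as a stand-in):

    ethtool -k eth0           # lower-case k: show current offload settings
    ethtool -K eth0 tso off   # upper-case K: change one, e.g. disable TSO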
Rahul Nabar wrote:
Thanks! So I could push it beyond 9000 as well?
1500 Bytes is the standard MTU for Ethernet, anything larger is out of
spec. The convention for a larger MTU is Jumbo Frames at 9000 Bytes, and
most switches support it these days. Some hardware even supports Super
Jumbo Frame
Hi Richard,
richard.wa...@comcast.net wrote:
It would seem that a larger MTU would help in at least two situations,
clearly applications with very large messages, but also those that
have transmission bursts of messages below the MTU that could
take advantage of hardware coalescing.
Such coales
Rahul,
Rahul Nabar wrote:
I have seen a considerable performance boost for my codes by using
Jumbo Frames. But are there any systematic tools or strategies to
select the optimum MTU size?
There is no optimal MTU size. This is the maximum payload you can fit in
one packet, so there is no drawb
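In practice you just set the largest MTU your NICs and switches all
support, then verify the path end to end. A quick sketch for 9000-byte
frames (8972 = 9000 minus 20 bytes of IP header and 8 of ICMP):

    ip link set dev eth0 mtu 9000
    # forbid fragmentation and exactly fill a jumbo frame
    ping -M do -s 8972 remotehost

If a device on the path cannot take the frame, the ping fails instead
of silently fragmenting.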
John Hearns wrote:
I would say take the switch out and do a direct point-to-point link
between two systems.
Is this possible with 10gig ethernet?
Yes, no need for crossover cables with 10GE.
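Just put the two ports on the same subnet and you have a two-node
network. A minimal sketch (addresses and interface names are examples):

    ip addr add 10.0.0.1/24 dev eth2   # on host A
    ip addr add 10.0.0.2/24 dev eth2   # on host B
    ip link set eth2 up                # on both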
Patrick
Hey Larry,
Larry Stewart wrote:
Does anyone know, or know where to find out, how long it takes to do a
store to a device register on a Nehalem system with a PCI Express device?
Are you asking for latency or throughput? For latency, it depends on
the distance between the core and the IOH (eac
Dave, Scott,
Dave Love wrote:
Scott Atchley writes:
When I test Open-MX, I turn interrupt coalescing off. I run
omx_pingpong to determine the lowest latency (LL). If the NIC's driver
allows one to specify the interrupt value, I set it to LL-1.
Note that it is only meaningful wrt ping-pon
Dave Love wrote:
That's something I haven't seen. However, I'm only using rx-frames=1
because simply adjusting rx-usec doesn't behave as expected.
Instead of rx-usecs being the time between interrupts, it is sometimes
implemented as the delay between the first packet and the following
in
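Either way, ethtool -C is the knob for these experiments. A sketch of
the LL-1 recipe above, assuming omx_pingpong reported about 10 us:

    ethtool -c eth0              # show the current coalescing settings
    ethtool -C eth0 rx-usecs 9   # one microsecond below the measured latency
    ethtool -C eth0 rx-usecs 0 rx-frames 1   # or: interrupt on every frame

Which parameters exist, and what rx-usecs actually means, depends on
the driver, as noted above.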
Hi Carsten,
Carsten Aulbert wrote:
I've run some early tests and these *seem* to suggest that the IP-layer
takes care of this correctly (e.g. tcpdump shows that the maximum frame
length is 1514 bytes when the link is in use).
If you use TCP, the kernel will negotiate the Max Segment Size (MSS
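You can also watch the negotiation itself: each side advertises its MSS
in the TCP SYN options, so capturing the handshake shows what was agreed
on. For example:

    # print SYN packets only; look for 'mss 8960' (9000 minus 40 bytes
    # of IP and TCP headers) in the options
    tcpdump -n -i eth0 'tcp[tcpflags] & tcp-syn != 0'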
Hi Igor,
Igor Kozin wrote:
- Switch latency (btw, the data sheet says x86 inside);
AFAIK, it is using the 24-port Fulcrum chip, which has a latency of
~300ns. The 48-port models use multiple crossbars in a Clos, partially
(S) or fully (SX) connected. I have never benchmarked the 48-port
ver
Vincent Diepeveen wrote:
All such sorts of switch latencies are at least a factor of 50-100 worse
than their one-way pingpong latency.
I think you are a bit confused about switch latencies.
There is the crossbar latency that is the time it takes for a packet to
be decoded and routed to the right o
Greg Lindahl wrote:
little computation. InfiniPath gets a speedup on lots of codes that
you wouldn't predict given the raw latency and bandwidth; how else
would you explain it?
There are tons of variables. The one I keep thinking about is PIO
sending for larger message sizes than usual. If th
Greg,
Greg Lindahl wrote:
that in practice using multiple VLs will suffer from significant
negative effects due to implementation details. Does anyone know of a
proof point of this?
In practice, the per-port amount of buffering in the switch crossbars is
not big enough for multiple VLs (buffe
Hi Greg,
Greg Lindahl wrote:
On Fri, Jan 30, 2009 at 07:07:34AM -0700, Michael H. Frese wrote:
John Adams said "Facts are stubborn things," and there just aren't
enough of them in your example to determine whether bandwidth or latency
dominates communication time.
Mark asked for an exam
Bill Broadley wrote:
The differences I've seen between "raid edition" drives and regular drives are:
* Dramatically better vibration resistance. If you are going to bolt a drive
Enough to make them shouting-proof? :-)
http://www.youtube.com/watch?v=tDacjrSCeq4
Patrick
Kyle,
Kyle Spaans wrote:
Take that as you will, but for me it only means that Prof. Dongarra is
tangentially related to Beowulf through NETLIB FORTRAN code. And thus is
probably not a ``mad scientist'' of Beowulf fame. ;-)
Jack Dongarra's group has produced a large set of free and o
Lux, James P wrote:
Recognizing the name, I’m prompted to ask the real question, is Jack an
Italian mad scientist?
Jack has Sicilian roots.
Patrick
Bogdan Costescu wrote:
about on this list: interconnect hardware being able to DMA directly
to/from CPU cache. I don't know how useful such a feature is for a
You can do something similar today using Direct Cache Access (DCA) on
(recent) Intel chips with IOAT. It's an indirect cache access, y
Greg Lindahl wrote:
Submit one. HPCC isn't perfect, but it's the best that's available,
I disagree. The test integration sucks and some of the tests themselves
are poorly written. For example, last time I looked, bench_lat_bw
measures *8* half-RTTs with MPI_Wtime(). If your MPI_Wtime() is no
Lawrence Stewart wrote:
Is anyone aware of available test suites or API benchmark suites for
shmem? I am thinking of the equivalent of the Intel MPI tests or
Intel MPI Benchmarks, awful though they are.
I don't know any publicly available shmem validation or benchmark
suites. Not surprising s
Perry E. Metzger wrote:
You realize that most big HPC systems are using interconnects that
don't generate many or any interrupts, right?
Of course. Usually one uses interrupt pacing/mitigation even in
gig ethernet on a modern machine -- otherwise you're not going to get
reasonable performa
Perry E. Metzger wrote:
from processing interrupts, or prevent your OS from properly switching
to a high priority process following an interrupt, but SMM will and
you can't get rid of it.
You can usually disable SMI, either through the BIOS or directly from
the chipset. However, you will lose
Greg,
Greg Lindahl wrote:
I see that HP has a 6-port switch for ~ $4k, too small.
Don't know if it was specific to the model we tested, but hardware flow
control did not work in one direction, even when turned on.
Arastra looks nice, except that their inexpensive 10GBase-CR only does
SFP+, n
Hi Mark,
Mark Hahn wrote:
With any network you need to avoid like the plague any kind of loop,
they can cause weird problems and are pretty much unnecessary. for
well, I don't think that's true - the most I'd say is that given
It is kind of true for wormhole switches, you can deadlock if you
Hi Jan,
Jan Heichler wrote:
1) most applications are latency driven - not bandwidth driven. That
means that half bisectional bandwidth is not cutting your application
performance down to 50%. For most applications the impact should be less
than 5% - for some it is really 0%.
If the app is pu
Gilad Shainer wrote:
The injection rate is irrelevant
The injection rate is super relevant. If your injection rate is 10% of
The injection rate is absolutely irrelevant for contention due to
Head-of-Line Blocking. You will have the same fabric efficiency under
contention if your link rate i
Hi Hakon,
Håkon Bugge wrote:
This is information we're using to optimize how pnt-to-pnt communication
is implemented. The code-base involved is fairly complicated and I do
not expect resource management systems to cope with it.
Why not? It's its job to know the resources it has to manage. Th
Gilad Shainer wrote:
Not only was I there, but I also had conversations afterwards. It is
a really "fair" comparison when you have different injection
rate/network capacity parameters. You can also take 10Mb and inject it
into a 10Gb/s network to show the same, and you can always create the
netw
Gilad Shainer wrote:
Static routing is the best approach if your pattern is known. In other
If your pattern is known, and if it is persistent, and it is perfectly
synchronized, and if you have a single job running on the fabric, and if
you have total control of the process/node mapping and if
Hi Don,
Don Holmgren wrote:
latency difference here matters to many codes). Perhaps of more
significance, though, is that you can use oversubscription to lower the
cost of your fabric. Instead of connecting 12 ports of a leaf switch to
nodes and using the other 12 ports as uplinks, you might
Gilad,
Gilad Shainer wrote:
My apologies. I meant that the MPI includes an option to collect several
MPI messages into one network message. For application cases, sometimes
it helps with performance and sometimes it does not. OSU has shown both
cases, and every user can decide what works best for
Gilad Shainer wrote:
It is the same benchmark that QLogic were and are using for MPI message
rate, and I guess you know that better than me, don't you? I want
to make sure that when one does a comparison he/she will be using the
same benchmark/output to compare.
It is not the benchmark, it's the
Peter St. John wrote:
One could use the ...I'm thinking of the extra-big-packet size in IPv6. But
if you have small numbers of large datasets, you could increase your
perceived bandwidth with two NICs and larger packets, maybe by using some
protocol other than TCP?
If you don't drop packets,
Peter St. John wrote:
I don't get it? I would have thought that if a large packet were split
between two NICs with two cables, then, assuming the buffering and
recombination at each end to be faster than the transmission, the
transmission would be faster than over a single cable? You don't m
Greg Lindahl wrote:
On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:
No, it just means the NIC supports it.
Well, then how about ethtool -S? That looks like an actual count of
flow control events, so rx flow control events means the switch
must support it in some fashion
Hi Greg,
Greg Lindahl wrote:
ethtool -a eth0
and it says RX/TX pause are on, doesn't that mean that the switch
supports it?
No, it just means the NIC supports it. RX means that the NIC will send
PAUSE packets if the host does not consume fast enough (rare) and TX
means that the NIC will sto
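So to find out what the switch really does, ask for PAUSE on the NIC and
then watch the counters under load. A sketch (counter names vary between
drivers):

    ethtool -A eth0 rx on tx on       # request PAUSE in both directions
    ethtool -a eth0                   # what the NIC ended up negotiating
    ethtool -S eth0 | grep -i pause   # e.g. rx_pause/tx_pause counters

If the rx pause counters move while the switch is congested, the switch
is emitting PAUSE frames for real.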
Hi Joe, Brendan
Joe Landman wrote:
Since it is a full duplex switched network, there should not be any
collisions happening. Since the image is less than 1 MB total, I don't
There could be blocking ... if one unit grabs the single network pipe
of the display node while another node trie
amjad ali wrote:
Which implementations of MPI (commercial or free) make automatic
and efficient use of shared memory for message passing within a node?
(That is, which MPI libraries automatically communicate over shared
memory instead of the interconnect on the same node.)
All of them. Pret
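With Open MPI, for instance, you can even force the choice to convince
yourself; the sm BTL carries intra-node traffic (component names from
the classic 1.x series):

    # shared memory within a node, TCP between nodes
    mpirun -np 8 --mca btl self,sm,tcp ./my_app

MPICH2's nemesis channel does the equivalent automatically.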
Hi,
润东 万 wrote:
> I am thinking of building a Beowulf of 17 dual-core nodes, one head node
> and 16 computation nodes for material research simulation. I am also a
> newcomer to the simulation world, but I have some programming experience (no
> parallel programming), knowledge of Unix operati
language barriers and lack of steady contact. Other times,
problems do not reach us because integrators/customers try to fix them
internally. This is not perfect, but we tend to fix things that are broken.
Patrick
Hi Andrew,
andrew holway wrote:
I'm trying to find out about the effects of virtualisation on high
performance interconnects. Effects on latency and bandwidth.
Virtualization has virtually (pun intended) no effect on OS-bypass
(user-level talking directly to the virtualized hardware) operatio
ver really starts.
What is your Myricom Tech Support ticket number?
Patrick
Hi Ivan,
Ivan Paganini wrote:
The myrinet connection was working right, but sometimes a user program
just got stuck - one of the processes was sleeping, and all others
were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking"
stage the binary on local disk prior to spawning, to
not rely on GPFS over Ethernet to serve it. Or even run GPFS over IPoM too.
Patrick
Hi Jim,
Jim Lux wrote:
Highly parallelized real time signal processing? Seems like a classic
Wouldn't you need a real-time OS and a real-time communication layer to
do real-time processing? Or at least within the same level of time
accuracy? The Linux scheduler is still on a 10ms quantum o
Greg Lindahl wrote:
time on all your nodes, which should improve performance when you have
big clusters and a lot of synchronization in your code.
Provided it scales well and it's integrated in the kernel. I guess you
could also think about revisiting the gang scheduling ideas with a
better s
Hi Patrick,
Patrick Ohly wrote:
That's all for now (and probably enough stuff, too ... although perhaps
you prefer detailed emails over bullet items on a PowerPoint
presentation). So what do you think?
Since you are asking, here is my personal opinion: I don't think there
is a need for a syst
Mark Hahn wrote:
I don't think that's what I meant. imagine instead that you have 48pt
GE switches, each of which has 4x 10G extra ports. now, take
5 such switches and fully connect them (each switch has a 10G link
to each of the other 4 switches). I don't think 802.3ad helps here,
since what
Hi Mark,
Mark Hahn wrote:
my question is: do switches these days have smart protocols for mapping
and routing in such a configuration? I know that the original spanning
That's 802.3ad. Quick pointer:
http://en.wikipedia.org/wiki/Link_aggregation
You can use it between switches to use multip
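On the host side, the Linux equivalent is a bonding interface in 802.3ad
(LACP) mode. A rough sketch with iproute2 (interface names are examples;
slaves must be down when enslaved):

    ip link add bond0 type bond mode 802.3ad
    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    ip link set bond0 up

Remember that 802.3ad hashes each flow onto one member link, so a single
TCP stream never exceeds the speed of one physical port.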
Hi Jim,
Jim Lux wrote:
At 10:52 AM 5/23/2007, Peter St. John wrote:
But oh and Jim if you recall any papers about this I could read that
would be "Jim" Dandy.
I seem to recall that if you google hypercube and intel, you'll turn up
some of the papers that were written early on. The guys who
Hi Jess,
Jess Cannata wrote:
By too expensive, I mean much more expensive than Gig-E which is "free"
on the NIC side and quite cheap on the switch side.
Everything is very expensive when compared to GigE :-)
What I should have said is that the NetEffect card is competitive as the
number of n
Jonathan Boyle wrote:
For 2 processor blocking calls, mpptest indicates a latency of about 30
microseconds.
However when I measure communication times in my own program using a loop as
follows
If you don't want flow control problems, you need to do a pingpong (each
node send and recv al
[EMAIL PROTECTED] wrote:
Back then we were struggling with PIO transfers and how they were
treated in the CPU/North bridge (write combining and all that). I
believe this might still be an issue, correct?
WC is well implemented on Opteron: it will aggregate consecutive PIO
writes at 16, 32 and
nside, just
wires on PCB.
Patrick
; application people understand easily).
I would bet that UPC could more efficiently leverage a strided or vector
communication primitive instead of message aggregation. I don't know if
GasNet provides one, I know ARMCI does.
Patrick
I
savagely want to cut Greg's hair when he is wrong, but they mostly (and
definitely Quadrics) know what they are doing.
Patrick
cores/nodes would exercise the same metric (many sends/recvs on the same
NIC at the same time), but would be harder to cheat and be much more
meaningful IMHO.
Patrick
throughput becomes 13.5
Gb/s. The real limit depends on the chipset and can be much lower than that.
ating at the same time. Bigger pipes
help contention a bit, but not much.
People doing their homework are still buying more 2G than 10G today,
because of better price/performance for their codes (and thin cables).
Patrick
expensive), but it probably will never happen.
scientists, who have their own codes and trying to persuade them to recompile
would be very hard - which would be necessary as we've not been able to
convince MPICH-GM to build shared libraries on Linux on Power with the IBM
compilers. :-(
hes. It's
going to get better eventually, but it's going to take time. I would
expect cheaper (quad) fiber solutions sooner than pervasive 10G-BaseT.
Patrick
ssage), looking at an MPI trace would make it
obvious. This is where the improvement/investment ratio is the greatest
for communications.
Patrick
the TCP overhead is not that large.
Well, there is always this case where the storage nodes are very
oversubscribed to save enough to pay for the service contract :-\
Patrick
significant
I don't know if time is really the constraint here. For grad students,
sure, but I would not think that more time would help with profs. A good
programming book maybe, but they are too proud to read those :-)
Patrick
to not confuse everything else in the system
that assumes I/O bound applications sleep.
I totally agree, and interrupt coalescing is a wonderful thing.
Patrick
Hi Toon,
Toon Knapen wrote:
In particular I'm interested if e.g. the mpirun script of version 1.2.7
is supposed to be able to launch an application that was compiled with
1.2.5.2.
The mpirun scripts are specific to each device, i.e. the mpirun.ch_p4 is
not the same as mpirun.ch_gm or mpirun.ch
CPU for everything is just fine.
However, you cannot have both, it's a design choice.
I like this thread, don't you? I wasted tons of precious time, but
that's what I want to see on this list, not marketing fluff,
even if half
e. Tax break is usually a good incentive for that :-)
Patrick
Look at the
upcoming EuroPVM/MPI or Cluster conference for example. I would never
believe the comparative results from another vendor.
used far more bandwidth than Greg's original post which you criticized as
spam.
Yes, I did. That's always the case in this situation. But please note
elf. We bought a
Quadrics cluster a long time ago to do just that :-) You can also ask
friends to get access to clusters. The web is the last place I would
look to find reliable information.
Patrick
use
the test bed but you have to allow your benchmark code to be available
to everyone and the code will be run on all interconnects and the
results made public.
What do you think of that?
Patrick
software market?
Patrick
Vincent Diepeveen wrote:
That third remark is not good marketing at all.
Because if there were really something interesting to report,
it would already have been reported by the *official* marketing
department.
No. Marketing effort implies coordination; that's why most announcements
are emb
Joe Landman wrote:
Greg Lindahl wrote:
On Wed, Jun 28, 2006 at 08:28:06AM -0400, Mark Hahn wrote:
the "I know something that I can't tell" bit was childish though ;)
Indeed, it was. I plead jet-lag.
No. It was good marketing. Anyone on the list not at least a little
curious what it is th
Chris,
Chris Dagdigian wrote:
In short, this was appropriate (and interesting). We've all seen vendor
spam and disguised marketing and this does not rise anywhere close to
that level.
I disagree on the level. I use the rule that a vendor should never
initiate a thread, only answer someone el
Greg Lindahl wrote:
On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:
I have kept quiet even when you were saying things driven by
marketing rather than technical considerations (the packet per
second nonsense),
Patrick, that "packet per second nonsense" is the
Greg Lindahl wrote:
Second, we have a new whitepaper about performance of the Intel
Woodcrest CPU and InfiniPath interconnect on real applications, email
me for a copy.
Third, MH MHH MH. (That's the sound I make when I
can't tell you something.)
Since when is Beowulf a plac
Hi Mark,
Mark Hahn wrote:
- isn't Mellanox still the sole source for IB chips, for both
nics and switches? this seems odd if it's a thriving ecosystem.
no offense intended! yes, I know quadrics/SGI/SCI/Myri are all
also sole-source. but compared to the eth worl
Hi Gilad,
Gilad Shainer wrote:
a) There's likely to be 10Gbps ethernet over ordinary cat 5e/6 cabling
soon. (Solarflare is one company working on it)
It is not a surprise, as you can run InfiniBand on cat 6 cables today.
There are several solutions on the market that make it happen.
You
Vincent,
Vincent Diepeveen wrote:
Just measure the random ring latency of that 1024-node Myri system and
compare.
There are several tables around with the random ring latency.
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
I just ran it on an 8-node dual Opteron 2.2 GHz cluster with F cards (Myrinet-2G)
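For anyone who wants to reproduce the number: HPCC is a single MPI
binary driven by an input file, so on a small cluster it is roughly:

    # hpccinf.txt holds the problem sizes; results land in hpccoutf.txt
    mpirun -np 8 ./hpcc

The random ring numbers are the RandomlyOrderedRingLatency/Bandwidth
entries of the output.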
Vincent,
So, I just got back from vacation today and I find this post in my huge
mailbox. Reason would tell me not to waste time and ignore it, but I
can't resist such a treat.
Vincent Diepeveen wrote:
With so many nodes I'd go for either InfiniBand or Quadrics, assuming
the largest partition also gets
Mark Hahn wrote:
does anyone know why the submission deadline for top500
is so far in advance of the list's publication?
Because everybody ignores it?
Patrick
quad fiber transceiver is about $350. It was
expected that this premium would disappear, but it's not happening and
that's why the quad fiber cable is being ratified as an official 10GigE
medium.
Patrick