We are going through a similar experience at one of our customer sites.
They are trying to run Intel MPI on more than 1,000 nodes. Are you
experiencing problems starting the MPD ring? We noticed it takes a
really long time, especially when the node count is large, and sometimes
it simply doesn't come up at all.
I didn't mean to imply we're using Intel MPI (in fact we're using HP-MPI,
which had issues with very large numbers of fds as well; I believe we
caused them to recode from select to epoll), so my comment was general:
MPI vendors sometimes forget how many fds they're using per node.
in general, though, on a modern linux system you should be able
to simply raise ulimit -n. I don't think even a sysctl is necessary
(though there may also be network-derived limits: open sockets,
routing entries, iptables, core memory limits, etc.)
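for completeness, here's a minimal sketch (my own, not anything a vendor
ships) of checking and raising the per-process fd limit from inside a
program; the soft limit can be pushed up to the hard limit without root,
same effect as ulimit -n in the launching shell:

/* sketch: query RLIMIT_NOFILE and raise the soft limit to the hard limit */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft fd limit: %lu, hard fd limit: %lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    /* raise the soft limit; the hard limit itself needs root or
       an entry in /etc/security/limits.conf */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}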
it would probably be illuminating to measure exactly what the critical
number is: does 999 nodes work but 1000 fail? also, you may find
that turning off some features will reduce the consumption of fds
or sockets (disable stdin forwarding/replication to all but rank 0?
disable stdout/err from all but rank 0?)
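if you want to measure it directly, a trivial program like this (mine,
purely illustrative) reports how many fds a process can actually allocate
before the limit bites; run it in the same environment the launcher sees,
since pam/limits.conf settings can differ from your interactive shell:

/* sketch: open fds until the per-process limit is hit and report the count */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>

int main(void)
{
    int count = 0;
    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            printf("open failed after %d extra fds: %s\n",
                   count, strerror(errno));
            break;
        }
        count++;
    }
    return 0;
}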
this reminds me of a peeve of mine: eth-based MPI never takes
advantage of the hardware's inherent broadcast/multicast capabilities.
yes, it's convenient to use the standard TCP stack so you can ignore
reliable-delivery issues, but creating rings and consuming many
sockets to forward stdio are great examples of the downside.
treating eth as a multicast fabric and doing the retransmission in MPI
(or a sub-layer) would solve some of these problems, and I suspect it
could lead to some interesting performance advantages as well.
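to make that concrete, here's roughly the primitive such a sub-layer would
be built on (my own sketch; the group address and port are made-up examples,
and reliable delivery/retransmission would still have to be layered on top):
one UDP socket joined to a multicast group receives the same datagram on
every node, instead of one TCP stream per peer for forwarded stdio:

/* sketch: join an IP multicast group and receive one datagram */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9999);                      /* example port */
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  /* example group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) < 0) {
        perror("setsockopt");
        return 1;
    }

    /* one socket per node regardless of cluster size */
    char buf[2048];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    if (n > 0)
        printf("received %zd bytes\n", n);
    close(sock);
    return 0;
}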
regards, mark.
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Mark Hahn
Sent: Friday, September 29, 2006 8:47 AM
To: Clements, Brent M (SAIC)
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow
to start up, sometimes not at all
Does anyone have experience running Intel MPI on over 1,000 nodes, and
do you have any tips to speed up task execution? Any tips for solving
this issue?
it's not uncommon for someone to write naive select() code that fails
when the number of open file descriptors hits 1024... yes, even in
the internals of major MPI implementations.
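the root cause is that fd_set is a fixed-size bitmap of FD_SETSIZE bits
(1024 with glibc), so once a descriptor's numeric value reaches 1024 it
simply can't be represented; a toy illustration (mine, not lifted from
any MPI):

/* sketch: the select() interface has a hard ceiling of FD_SETSIZE fds */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    printf("FD_SETSIZE = %d\n", FD_SETSIZE);

    /* FD_SET(fd, &rfds) with fd >= FD_SETSIZE writes past the end of
       the bitmap (undefined behaviour); a launcher holding ~1000
       sockets plus stdio and log fds crosses that line easily.
       poll() and epoll take explicit fd values and have no such cap. */
    return 0;
}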
--
operator may differ from spokesperson. [EMAIL PROTECTED]
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf