Re: [Beowulf] Monitoring communication + processing time

2025-02-06 Thread Weidendorfer, Josef
Have a look at the tools page of VI-HPS: https://www.vi-hps.org/tools/tools.html Most are open source, some are commercial. It includes mpiP and OpenSpeedShop, but there are also Scalasca, TAU, Vampir … Josef > On 06.02.2025 at 12:17, Jim Cownie wrote: > > There are a number of open source M

Re: [Beowulf] Monitoring communication + processing time

2025-02-06 Thread Jim Cownie
There are a number of open source MPI profiling libraries which Google can no doubt find for you; as recommended below, mpiP looks sane (though I haven't tried it myself). Or you can use the MPI profiling interface to intercept MPI calls and time them yourself, though this is in effect writing y

Re: [Beowulf] Monitoring communication + processing time

2025-02-02 Thread Chris Samuel
On 15/1/25 5:04 pm, Alexandre Ferreira Ramos via Beowulf wrote: Does anyone have a hint about how we should proceed for this monitoring? LLNL also has an MPI profiling library: https://github.com/LLNL/mpiP I've not tried it myself, but I like the idea of it. All the best, Chris _
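In case it helps a future reader of the archive: mpiP is usually attached to an unmodified binary at run time via LD_PRELOAD. A minimal sketch, assuming an Open-MPI-style mpirun and a site-specific library path; the MPIRUN override exists only so the wrapper can be exercised without a cluster:

```shell
# Sketch: run an MPI program with mpiP preloaded so it can intercept and
# time MPI calls. The library path and program name are assumptions, not
# from the thread; MPIRUN may be overridden (e.g. MPIRUN=echo) for a dry run.
run_with_mpip() {
    # $1: path to libmpiP.so, $2: number of ranks, remaining args: program
    local lib="$1" np="$2"; shift 2
    ${MPIRUN:-mpirun} -np "$np" env LD_PRELOAD="$lib" "$@"
}
# e.g. run_with_mpip /usr/local/lib/libmpiP.so 16 ./annealing
```

mpiP normally emits its text report (per-call-site and aggregate MPI timings) when the application reaches MPI_Finalize.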

Re: [Beowulf] Monitoring communication + processing time

2025-01-17 Thread Prentice Bisbal
If you need a free/open-source tool, OpenSpeedShop may fit the bill. I've never used it myself, but I've stopped by the Krell Institute booth over the years at SC and got a few live demos. Give it a look-see. https://github.com/OpenSpeedShop On 1/17/25 9:52 AM, Michael DiDomenico wrote: sadly

Re: [Beowulf] Monitoring communication + processing time

2025-01-17 Thread Michael DiDomenico
Sadly, most people still use printfs to debug C code. But there are some parallel debuggers on the market, like TotalView; it's pricey depending on how many ranks you want to spin up under the debugger. On Thu, Jan 16, 2025 at 7:48 AM Alexandre Ferreira Ramos via Beowulf wrote: > > Hi all, I ho

[Beowulf] Monitoring communication + processing time

2025-01-15 Thread Alexandre Ferreira Ramos via Beowulf
Hi all, I hope you are fine! We are working on a parallel computing project and need to monitor communication and processing time. Our code is an algorithm for parallel simulated annealing, written in C, and we are using MPI. We have communication within multicore processors and among d

Re: [Beowulf] Monitoring and Metrics

2017-10-08 Thread Benson Muite
May also be of interest: JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy http://russianscdays.org/files/pdf1

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread lange
> On Sat, 7 Oct 2017 08:21:08 -0400, Josh Catana said: > This may have been brought up in the past, but I couldn't find much in my message  archive. > What are people using for HPC cluster monitoring and metrics lately? I've been low on time to add features to my home grown solution

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread Lachlan Musicman
> On 10/7/2017 8:21 AM, Josh Catana wrote: > > This may have been brought up in the past, but I couldn't find much in my > message archive. > What are people using for HPC cluster monitoring and metrics lately? I've > been low on time to add features to my home grown solution and looking at > some

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread Paul Edmon
So for general monitoring of the cluster usage we use: https://github.com/fasrc/slurm-diamond-collector and pipe to Grafana.  We also use XDMoD: http://open.xdmod.org/7.0/index.html As for specific node alerting, we use the old standby of Nagios. -Paul Edmon- On 10/7/2017 8:21 AM, Josh Cat

[Beowulf] Monitoring and Metrics

2017-10-07 Thread Josh Catana
This may have been brought up in the past, but I couldn't find much in my message archive. What are people using for HPC cluster monitoring and metrics lately? I've been low on time to add features to my home grown solution and looking at some OTS products. I'm looking for something that can do mo

Re: [Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
pps. I guess I could clear the errors every time this runs, but have decided to just do an initial clear of the errors and look at the cumulative rate. ppps. there is a better list for this chatter, isn't there... On 19 June 2014 15:10, John Hearns wrote: > If anyone is interested, here is my

Re: [Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
If anyone is interested, here is my solution, which seems good enough. Someone will no doubt say there is a neater way! A shell script which runs ibqueryerrors and returns 1 if anything is found: #!/bin/bash # check for errors on the Infiniband fabric 0 # another script runs for port 1 errors=`/
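The preview above cuts the script off; purely as a guess at its shape (assuming ibqueryerrors prints nothing when the fabric is clean, and with an IBQUERY override added only so the logic can be exercised without a fabric), it might look like:

```shell
#!/bin/bash
# Sketch of the check described above: return 1 (so Monit or Nagios can
# alert) if ibqueryerrors reports any error counters, 0 otherwise.
# IBQUERY is a testing hook; by default the real tool is run.
check_ib_errors() {
    local out
    out=$(${IBQUERY:-/usr/sbin/ibqueryerrors} 2>/dev/null)
    if [ -n "$out" ]; then
        echo "Infiniband errors found on fabric"
        return 1
    fi
    return 0
}
```

Monit can then watch the script's exit status and raise an alert whenever it returns non-zero.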

[Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
Does anyone have good tips on monitoring a cluster for Infiniband errors? Specifically Mellanox/OpenFabrics on an SGI cluster. I am thinking of running ibcheckerrors or ibqueryerrors and parsing the output. I have Monit set up on the cluster head node http://mmonit.com/monit/ which I find quite

Re: [Beowulf] monitoring...

2013-08-30 Thread Tina Friedrich
Just to throw another possibility in here - we use Zenoss, which does both. And can use Nagios plugins. Tina On 30/08/13 07:36, Tim Cutts wrote: > > On 29 Aug 2013, at 20:38, Raphael Verdugo P. > wrote: > >> Hi, >> >>I need help . Ganglia or Nagios to monitoring activity in cluster?. >>

Re: [Beowulf] monitoring...

2013-08-29 Thread Tim Cutts
On 29 Aug 2013, at 20:38, Raphael Verdugo P. wrote: > Hi, > > I need help . Ganglia or Nagios to monitoring activity in cluster?. > Both. They have different if overlapping purposes. Ganglia is very nice for historical load metric graphs. Nagios is rather better at actually alerting

[Beowulf] monitoring...

2013-08-29 Thread Raphael Verdugo P.
Hi, I need help: Ganglia or Nagios to monitor activity in the cluster? -- Raphael Verdugo P. Unix Admin & Developer raphael.verd...@gmail.com +56 999010022 ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change y

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Mark Hahn wrote: > there's a semi-recent kernel feature which allows the kernel to avoid > user-space by putting console traffic onto the net directly > see Documentation/networking/netconsole.txt Now that looks very interesting. Thanks for the pointer! Cheers Carsten _

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Robert G. Brown wrote: > > "putting a cheap monitor on a suspect or crashed node" > One monitor to > 1300 1U servers is not practical :) > Or even after a crash. If the primary graphics card is being used as a > console, the frame buffer will probably retain the last kernel oops > written to

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, Carsten Aulbert wrote: We did get a few messages, albeit not from the kernel when an error happened. I'll have another look today, maybe I did something wrong. If your kernel is out and out crashing, you might not get anything at all. In that case, let me add: "putting a

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Robert G. Brown
On Tue, 9 Sep 2008, Carsten Aulbert wrote: My question now, is there a cute little way to gather all the console outputs of > 1000 nodes? The nodes don't have physical serial cables attached to them - nor do we want to use many concentrators to achieve this - but the off-the-shelf Supermicro box

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Mark Hahn
We did get a few messages, albeit not from the kernel when an error happened. I'll have another look today, maybe I did something wrong. there's a semi-recent kernel feature which allows the kernel to avoid user-space by putting console traffic onto the net directly see Documentation/networkin
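For the archive: netconsole is configured through a module parameter of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac, per the kernel document cited above. A hedged helper that composes that string; every address below is illustrative:

```shell
# Build a netconsole module parameter string. All addresses are
# illustrative; the syntax follows Documentation/networking/netconsole.txt
# (6665 is the default sender port, 6666 the default listener port).
netconsole_param() {
    # $1: sender IP, $2: sender NIC, $3: loghost IP, $4: loghost MAC
    echo "netconsole=6665@$1/$2,6666@$3/$4"
}
# Then, on each node (as root):
#   modprobe netconsole "$(netconsole_param 10.0.0.42 eth0 10.0.0.1 00:11:22:33:44:55)"
# and on the loghost, something like: netcat -u -l 6666
```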

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Loic Tortay
Carsten Aulbert wrote: [server console management for many servers with conserver] > We use conserver to get serial console access to almost all our machines. Below is the forwarded answer to your messages from my coworker who's in charge of this. The tools he created for interfacing IPMI and con

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Lawrence Stewart
Carsten Aulbert wrote: > Hi all, > > I would tend to guess this problem is fairly common and many solutions > are already in place, so I would like to enquire about your solutions > to the problem: > > In our large cluster we have certain nodes going down with I/O hard disk > errors. We have some

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Perry E. Metzger
Carsten Aulbert <[EMAIL PROTECTED]> writes: > For the time being we are experimenting with using "script" in many > "screen" environments, which should be able to monitor ipmitool's SoL > output, but somehow that strikes me as inefficient as well. First, you should probably never want script+screen
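For reference, attaching to a node's serial-over-LAN console with ipmitool generally looks like the invocation composed below; the hostname and user are placeholders, and authentication options (-P, -E, or -f) are omitted:

```shell
# Compose the ipmitool Serial-over-LAN invocation for one BMC.
# Hostname and user are placeholders; credentials are deliberately left out.
sol_cmd() {
    # $1: BMC hostname or IP, $2: BMC user
    echo "ipmitool -I lanplus -H $1 -U $2 sol activate"
}
# e.g. a conserver or per-node wrapper might run: $(sol_cmd node042-bmc admin)
```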

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi, Geoff Galitz wrote: > You can also configure any standard (distribution shipped) syslog to log > remotely to your head node or even a separate logging master. Anything that > gets reported to the syslog facility can be reported/archived in this > manner, you just need to dig into the document

RE: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Geoff Galitz
>Does this capture (almost) everything that happens to a machine? We have >not yet looked into syslog-ng but a look into your config files would >be very nice. You can also configure any standard (distribution shipped) syslog to log remotely to your head node or even a separate logging master.

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi, thanks for the reply. Reuti wrote: > I set up syslog-ng on the nodes to log to the headnode. There each node > will have a distinct file, e.g. "/var/log/nodes/node42.messages". If you > are interested, I could post my configuration files for headnode and > clients. Does this capture (almost) eve

Re: [Beowulf] Monitoring crashing machines

2008-09-09 Thread Reuti
Hi, On 09.09.2008 at 09:53, Carsten Aulbert wrote: Hi all, I would tend to guess this problem is fairly common and many solutions are already in place, so I would like to enquire about your solutions to the problem: In our large cluster we have certain nodes going down with I/O hard disk
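Reuti's actual files aren't in the archive; a minimal sketch in the same spirit (clients forward over UDP 514, the headnode writes one file per host via the $HOST macro; the port, paths, and macro use are assumptions based on syslog-ng conventions) could be dropped in with a heredoc:

```shell
# Sketch of a per-node syslog-ng destination for the headnode; NOT
# Reuti's actual configuration. Port, paths and the $HOST macro are
# assumptions; adapt them to your distribution's syslog-ng layout.
write_node_log_conf() {
    # $1: path to write the configuration fragment to
    cat > "$1" <<'EOF'
source s_net { udp(ip(0.0.0.0) port(514)); };
destination d_pernode { file("/var/log/nodes/$HOST.messages"); };
log { source(s_net); destination(d_pernode); };
EOF
}
# e.g. write_node_log_conf /etc/syslog-ng/conf.d/nodes.conf
```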

[Beowulf] Monitoring crashing machines

2008-09-09 Thread Carsten Aulbert
Hi all, I would tend to guess this problem is fairly common and many solutions are already in place, so I would like to enquire about your solutions to the problem: In our large cluster we have certain nodes going down with I/O hard disk errors. We have some suspicion about the causes but would