Have a look at the tools page of VI-HPS:
https://www.vi-hps.org/tools/tools.html
Most are open source, some are commercial.
It includes mpiP and OpenSpeedShop, but there are also Scalasca, TAU, Vampir …
Josef
> On 06.02.2025 at 12:17, Jim Cownie wrote:
>
There are a number of open source MPI profiling libraries which Google can no
doubt find for you; as recommended below, mpiP looks sane (though I haven't
tried it myself).
Or, you can use the MPI profiling interface (PMPI) to intercept MPI calls and
time them yourself, though this is in effect writing your own profiling library.
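As a minimal sketch of what that looks like (illustrative only: the choice of
MPI_Send and the counters are arbitrary, and the tools mentioned elsewhere in
this thread do the job far more thoroughly):

/* Intercept MPI_Send via the PMPI profiling interface and accumulate
 * the time spent in it; report per rank at MPI_Finalize. */
#include <mpi.h>
#include <stdio.h>

static double send_time  = 0.0;
static long   send_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld MPI_Send calls, %.3f s total\n",
            rank, send_calls, send_time);
    return PMPI_Finalize();
}

Since MPI implementations are required to export the PMPI_ entry points, the
wrapper overrides the normal MPI_Send without any change to the application
source; it can be compiled into the application or into a small library linked
ahead of the MPI library.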
On 15/1/25 5:04 pm, Alexandre Ferreira Ramos via Beowulf wrote:
Does anyone have a hint about how we should proceed for this monitoring?
LLNL also has an MPI profiling library: https://github.com/LLNL/mpiP
I've not tried it myself, but I like the idea of it.
All the best,
Chris
If you need a free/open-source tool, OpenSpeedShop may fit the bill.
I've never used it myself, but I've stopped by the Krell Institute booth
over the years at SC and got a few live demos. Give it a look-see.
https://github.com/OpenSpeedShop
On 1/17/25 9:52 AM, Michael DiDomenico wrote:
Sadly, most people still use printf's to debug C code. But there are
some parallel debuggers on the market, like TotalView; it's pricey
depending on how many ranks you want to spin up under the debugger.
On Thu, Jan 16, 2025 at 7:48 AM Alexandre Ferreira Ramos via Beowulf
wrote:
Hi all, I hope you are fine!
We are working on a parallel computing project, and we need to monitor
communication and processing time.
Our code is an algorithm for parallel simulated annealing written in C and
we are using MPI.
We have communication within multicore processors and among different nodes.
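A minimal sketch of that kind of measurement with MPI_Wtime (the annealing
step and the Allreduce below are placeholders for whatever the real code does,
not the actual algorithm):

/* Separate per-rank compute time from communication time with MPI_Wtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t_comp = 0.0, t_comm = 0.0;
    double local_best, global_best;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local_best = (double)rank;                    /* placeholder state */

    for (int iter = 0; iter < 100; iter++) {
        double t0 = MPI_Wtime();
        local_best -= 0.001;                      /* placeholder annealing step */
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Allreduce(&local_best, &global_best, 1, MPI_DOUBLE,
                      MPI_MIN, MPI_COMM_WORLD);   /* placeholder exchange */
        t_comm += MPI_Wtime() - t0;
    }

    printf("rank %d: compute %.3f s, communication %.3f s\n",
           rank, t_comp, t_comm);
    MPI_Finalize();
    return 0;
}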
May also be of interest:
JobDigest – Detailed System Monitoring-Based Supercomputer Application
Behavior Analysis
Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev,
Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin and Sergey Zhumatiy
http://russianscdays.org/files/pdf1
> On Sat, 7 Oct 2017 08:21:08 -0400, Josh Catana said:
> This may have been brought up in the past, but I couldn't find much in my
> message archive.
> What are people using for HPC cluster monitoring and metrics lately? I've
> been low on time to add features to my home grown solution
> On 10/7/2017 8:21 AM, Josh Catana wrote:
>
> This may have been brought up in the past, but I couldn't find much in my
> message archive.
> What are people using for HPC cluster monitoring and metrics lately? I've
> been low on time to add features to my home grown solution and looking at
> some
So for general monitoring of the cluster usage we use:
https://github.com/fasrc/slurm-diamond-collector
and pipe to Grafana. We also use XDMoD:
http://open.xdmod.org/7.0/index.html
As for specific node alerting, we use the old standby of Nagios.
-Paul Edmon-
On 10/7/2017 8:21 AM, Josh Catana wrote:
This may have been brought up in the past, but I couldn't find much in my
message archive.
What are people using for HPC cluster monitoring and metrics lately? I've
been low on time to add features to my home grown solution and looking at
some OTS products.
I'm looking for something that can do mo
pps. I guess I could clear the errors every time this runs, but have
decided to just do an initial clear of the errors and look at the
cumulative rate.
ppps. there is a better list for this chatter, isn't there...
On 19 June 2014 15:10, John Hearns wrote:
If anyone is interested, here is my solution, which seems good enough.
Someone will no doubt say there is a neater way!
A shell script which runs ibqueryerrors and returns 1 if anything is found:
#!/bin/bash
# check for errors on the Infiniband fabric 0
# another script runs for port 1
# (ending reconstructed; the archive truncated it -- adjust the ibqueryerrors path for your system)
errors=`/usr/sbin/ibqueryerrors 2>/dev/null`
[ -n "$errors" ] && exit 1
exit 0
Does anyone have good tips on monitoring a cluster for Infiniband errors?
Specifically Mellanox/OpenFabrics on an SGI cluster.
I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
output.
I have Monit set up on the cluster head node
http://mmonit.com/monit/
which I find quite useful.
Just to throw another possibility in here: we use Zenoss, which does
both, and it can use Nagios plugins.
Tina
On 30/08/13 07:36, Tim Cutts wrote:
>
> On 29 Aug 2013, at 20:38, Raphael Verdugo P. wrote:
>
>> Hi,
>>
>> I need help: Ganglia or Nagios for monitoring activity in a cluster?
>>
On 29 Aug 2013, at 20:38, Raphael Verdugo P. wrote:
> Hi,
>
> I need help: Ganglia or Nagios for monitoring activity in a cluster?
>
Both. They have different, if overlapping, purposes. Ganglia is very nice for
historical load metric graphs; Nagios is rather better at actually alerting.
Hi,
I need help: Ganglia or Nagios for monitoring activity in a cluster?
--
Raphael Verdugo P.
Unix Admin & Developer
raphael.verd...@gmail.com
+56 999010022,
Mark Hahn wrote:
> there's a semi-recent kernel feature which allows the kernel to avoid
> user-space by putting console traffic onto the net directly
> see Documentation/networking/netconsole.txt
Now that looks very interesting. Thanks for the pointer!
Cheers
Carsten
Robert G. Brown wrote:
>
> "putting a cheap monitor on a suspect or crashed node"
>
One monitor for > 1300 1U servers is not practical :)
> Or even after a crash. If the primary graphics card is being used as a
> console, the frame buffer will probably retain the last kernel oops
> written to
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
We did get a few messages, albeit not from the kernel when an error
happened. I'll have another look today, maybe I did something wrong.
If your kernel is out and out crashing, you might not get anything at
all. In that case, let me add:
"putting a cheap monitor on a suspect or crashed node"
On Tue, 9 Sep 2008, Carsten Aulbert wrote:
My question now: is there a cute little way to gather all the console
outputs of > 1000 nodes? The nodes don't have physical serial cables
attached to them - nor do we want to use many concentrators to achieve
this - but the off-the-shelf Supermicro box
We did get a few messages, albeit not from the kernel when an error
happened. I'll have another look today, maybe I did something wrong.
there's a semi-recent kernel feature which allows the kernel to
avoid user-space by putting console traffic onto the net directly;
see Documentation/networking/netconsole.txt
Carsten Aulbert wrote:
[server console management for many servers with conserver]
>
We use conserver to get serial console access to almost all our machines.
Below is the forwarded answer to your messages from my coworker who's in
charge of this.
The tools he created for interfacing IPMI and conserver
Carsten Aulbert wrote:
> Hi all,
>
> I would tend to guess this problem is fairly common and many solutions
> are already in place, so I would like to enquire about your solutions
> to the problem:
>
> In our large cluster we have certain nodes going down with I/O hard disk
> errors. We have some
Carsten Aulbert <[EMAIL PROTECTED]> writes:
> For the time being we are experimenting with using "script" in many
> "screen" environments, which should be able to monitor ipmitool's SoL
> output, but somehow that strikes me as inefficient as well.
First, you should probably never want script+screen
Hi,
Geoff Galitz wrote:
> You can also configure any standard (distribution shipped) syslog to log
> remotely to your head node or even a separate logging master. Anything that
> gets reported to the syslog facility can be reported/archived in this
> manner, you just need to dig into the documentation
>Does this capture (almost) everything that happens to a machine? We have
>not yet looked into syslog-ng but a look at your config files would
>be very nice.
You can also configure any standard (distribution shipped) syslog to log
remotely to your head node or even a separate logging master.
Hi
thanks for the reply
Reuti wrote:
> I set up syslog-ng on the nodes to log to the headnode. There each node
> will have a distinct file e.g. "/var/log/nodes/node42.messages". If you
> are interested, I could post my configuration files for headnode and
> clients.
Does this capture (almost) everything that happens to a machine?
Hi,
On 09.09.2008 at 09:53, Carsten Aulbert wrote:
Hi all,
I would tend to guess this problem is fairly common and many solutions
are already in place, so I would like to enquire about your solutions
to the problem:
In our large cluster we have certain nodes going down with I/O hard
disk
Hi all,
I would tend to guess this problem is fairly common and many solutions
are already in place, so I would like to enquire about your solutions
to the problem:
In our large cluster we have certain nodes going down with I/O hard disk
errors. We have some suspicion about the causes but would