Re: [Beowulf] And wearing another hat ...

2023-11-13 Thread Christopher Samuel
On 11/13/23 09:06, Joshua Mora wrote: Some folks trying to legally bypass government restrictions. I'm afraid that seems to be a parody/hoax/performance art thing: https://www.vice.com/en/article/88xk7b/del-complex-ai-training-barge > But there’s one glaring issue: Del Complex is not a real

Re: [Beowulf] naming clusters

2023-03-23 Thread Christopher Samuel
On 3/23/23 3:12 pm, Prentice Bisbal via Beowulf wrote: honestly is there any better task for a system admin than coming up with good hostnames? I remember at $JOB-2 the first of our HPC systems had been all racked up but it wasn't until I was in the datacentre in the CentOS 5 installer s

Re: [Beowulf] Checkpointing MPI applications

2023-03-23 Thread Christopher Samuel
On 2/19/23 10:26 am, Scott Atchley wrote: Hi Chris, Hi Scott! It looks like it tries to checkpoint application state without checkpointing the application or its libraries (including MPI). I am curious if the checkpoint sizes are similar or significantly larger to the application's typical

Re: [Beowulf] [External] Checkpointing MPI applications

2023-03-23 Thread Christopher Samuel
Hi Prentice, On 2/20/23 7:46 am, Prentice Bisbal via Beowulf wrote: Is anyone working on DMTCP or MANA going to start monitoring the dmtcp-forum mailing list? Sorry I didn't get a chance to circle back here! I did raise this with them and they promised to reach out to you, hopefully they'll

Re: [Beowulf] [External] Re: old sm/sgi bios

2023-03-23 Thread Christopher Samuel
On 3/23/23 11:59 am, Fischer, Jeremy wrote: HPUX 9. Hands down. v9? Luxury! For my sins in the mid-90s I was part of the small team that managed a heterogeneous UNIX network for folks doing portable compiler development, I think we had ~12 UNIX variants on ~8 hardware platforms (conservative

[Beowulf] Checkpointing MPI applications

2023-02-18 Thread Christopher Samuel
Hi all, The list has been very quiet recently, so as I just posted something to the Slurm list in reply to the topic of checkpointing MPI applications I thought it might interest a few of you here (apologies if you've already seen it there). If you're looking to try checkpointing MPI applica
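
For those who want to try it, the DMTCP side of this is roughly the following (a minimal sketch; the ./a.out binary and the hourly interval are just placeholders, and MANA layers MPI awareness on top of plain DMTCP):

  dmtcp_coordinator --daemon --exit-on-last
  dmtcp_launch --interval 3600 ./a.out        # checkpoint every hour
  ./dmtcp_restart_script.sh                   # restart from the most recent checkpoint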

[Beowulf] OpenMPI over libfabric (was Re: Top 5 reasons why mailing lists are better than Twitter)

2022-11-21 Thread Christopher Samuel
On 11/21/22 4:39 am, Scott Atchley wrote: We have OpenMPI running on Frontier with libfabric. We are using HPE's CXI (Cray eXascale Interface) provider instead of RoCE though. Yeah I'm curious to know if Matt's issues are about OpenMPI->libfabric or libfabric->RoCE ? FWIW we're using Cray's

Re: [Beowulf] [External] beowulf hall of fame

2022-02-26 Thread Christopher Samuel
On 2/26/22 5:10 am, H. Vidal, Jr. wrote: Is Don on the list any more? I can neither confirm nor deny it. :-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Compu

Re: [Beowulf] Question about fair share

2022-01-24 Thread Christopher Samuel
On 1/24/22 11:17 am, Tom Harvill wrote: We use a 'fair share' feature of our scheduler (SLURM) and have our decay half-life (the time needed for priority penalty to halve) set to 30 days.  Our maximum job runtime is 7 days.  I'm wondering what others use, please let me know if you can spare a
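
For context, those knobs live in slurm.conf; a rough sketch of the setup being described (the half-life and runtime limit are the values from the post, the fairshare weight is illustrative):

  PriorityType=priority/multifactor
  PriorityDecayHalfLife=30-0        # 30 day decay half-life
  PriorityWeightFairshare=10000     # illustrative weight
  PartitionName=batch MaxTime=7-00:00:00 ...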

Re: [Beowulf] SC21 Beowulf Bash Panels

2021-11-15 Thread Christopher Samuel
On 11/14/21 12:16 pm, Douglas Eadline wrote: While there is always a bit of Beowulf snark surrounding the Bash I wanted to mention the technical panels are looking to be very interesting. Enjoy! Sadly the timing doesn't work for me this year (time zones and family commitments) but I'm sure it

Re: [Beowulf] List archives

2021-08-16 Thread Christopher Samuel
Hi John, On 8/16/21 12:57 am, John Hearns wrote: The Beowulf list archives seem to end in July 2021. I was looking for Doug Eadline's post on limiting AMD power and the results on performance. Hmm, that's odd, I'll take a look tonight, thanks for the heads up! All the best, Chris -- Chris

Re: [Beowulf] [External] RIP CentOS 8

2020-12-08 Thread Christopher Samuel
On 12/8/20 1:06 pm, Prentice Bisbal via Beowulf wrote: I wouldn't be surprised if this causes Scientific Linux to come back into existence. It sounds like Greg K is already talking about CentOS-NG (via the ACM SIGHPC syspro Slack): https://www.linkedin.com/posts/gmkurtzer_centos-project-shi

Re: [Beowulf] [External] Re: Administrivia: Beowulf list moved to new server

2020-11-23 Thread Christopher Samuel
On 11/23/20 10:33 pm, Tony Brian Albers wrote: What they said: Thank you all for your kind words on and off list, really appreciated! My next task is to invite back to the list those who got kicked off when our previous hosting lost its reverse DNS records and various sites started rejectin

Re: [Beowulf] [External] CentOS 8 with OpenHPC 1.3.9 available on Qlustar

2020-04-17 Thread Christopher Samuel
On 4/17/20 12:17 PM, Prentice Bisbal via Beowulf wrote: I'm aware. I just meant to correct the announcement which stated "Slurm 18.10.x", which is a version that never existed. I know, I was just commenting for Roland. :-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA __

Re: [Beowulf] [External] CentOS 8 with OpenHPC 1.3.9 available on Qlustar

2020-04-17 Thread Christopher Samuel
On 4/17/20 10:14 AM, Prentice Bisbal via Beowulf wrote: I think you mean Slurm 18.08.x Just a heads up that Slurm 18.08 is no longer supported, 20.02 is the current release and 19.05 is now only getting security fixes from what I've read on the Slurm list (though some fixes have gone into th

Re: [Beowulf] [EXTERNAL] Re: HPE completes Cray acquisition

2019-09-27 Thread Christopher Samuel
On 9/27/19 9:19 AM, Scott Atchley wrote: Cray: This one goes up to 10^18 ROTFL. Sir, you win the interwebs today. ;-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Pen

Re: [Beowulf] [EXTERNAL] Re: HPE completes Cray acquisition

2019-09-27 Thread Christopher Samuel
On 9/27/19 7:40 AM, Lux, Jim (US 337K) via Beowulf wrote: “A HPE company” seems sort of bloodless and corporate.  I would kind of hope for  something like “CRAY – How Fast Do You Want to Go?” or something like that to echo back to their long history of “just make it fast” "Cray: this one goe

[Beowulf] HPE completes Cray acquisition

2019-09-25 Thread Christopher Samuel
Cray joins SGI as part of the HPE stable: https://www.hpe.com/us/en/newsroom/press-release/2019/09/hpe-completes-acquisition-of-supercomputing-leader-cray-inc.html > As part of the acquisition, Cray president and CEO Peter Ungaro, will join HPE as head of the HPC and AI business unit in Hybrid

Re: [Beowulf] Build Recommendations - Private Cluster

2019-08-21 Thread Christopher Samuel
On 8/20/19 11:03 PM, John Hearns via Beowulf wrote: A Transputer cluster? Squ! I know John Taylor (formerly Meiko/Quadrics) very well. Hah, when I was a young sysadmin we had a heterogeneous network and one box was a Parsys transputer system running Idris. The only UNIX system I've admin

Re: [Beowulf] Build Recommendations - Private Cluster

2019-08-21 Thread Christopher Samuel
On 8/21/19 3:00 PM, Richard Edwards wrote: So I am starting to see a pattern. Some combination of CentOS + Ansible + OpenHPC + SLURM + Old CUDA/Nvidia Drivers;-). My only comment there would be I do like xCAT, especially with statelite settings so you can PXE boot a RAM disk on the nodes but
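
For anyone unfamiliar with it, the xCAT netboot/statelite workflow being described looks roughly like this (image and node names are placeholders):

  genimage centos7-x86_64-netboot-compute     # build the diskless image from the chroot
  packimage centos7-x86_64-netboot-compute    # pack it for PXE/RAM-disk booting
  nodeset compute01-compute99 osimage=centos7-x86_64-netboot-compute
  rpower compute01-compute99 reset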

Re: [Beowulf] software for activating one of many programs but not the others?

2019-08-20 Thread Christopher Samuel
On 8/20/19 10:40 AM, Alex Chekholko via Beowulf wrote: Other examples include RPM or EasyBuild+Lmod or less common tools like Singularity or Snap/Snappy or Flatpak. +1 for Easybuild from me. https://easybuilders.github.io/easybuild/ There's also Spack (I really don't like the name, it's too
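
For anyone who hasn't used it, a typical EasyBuild install is roughly (the easyconfig name is just an example):

  eb foss-2019a.eb --robot     # --robot resolves and builds dependencies
  module avail foss            # the resulting modules land under the EasyBuild module tree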

Re: [Beowulf] Lustre on google cloud

2019-07-22 Thread Christopher Samuel
On 7/22/19 10:48 AM, Jonathan Aquilina wrote: I am looking at https://cloud.google.com/blog/products/storage-data-transfer/introducing-lustre-file-system-cloud-deployment-manager-scripts Amazon's done similar: https://aws.amazon.com/blogs/storage/building-an-hpc-cluster-with-aws-parallelclust

Re: [Beowulf] flatpack

2019-07-22 Thread Christopher Samuel
On 7/21/19 7:30 PM, Jonathan Engwall wrote: Some distros will be glad to know Flatpack will load your software center with working downloads. Are you thinking of this as an alternative to container systems & tools like easybuild as a software delivery system for HPC systems? How widely supp

Re: [Beowulf] Rsync - checksums

2019-06-17 Thread Christopher Samuel
On 6/17/19 6:43 AM, Bill Wichser wrote: md5 checksums take a lot of compute time with huge files and even with millions of smaller ones.  The bulk of the time for running rsync is spent in computing the source and destination checksums and we'd like to alleviate that pain of a cryptographic al
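
One mitigation worth noting here: by default rsync skips files whose size and mtime match, and recent releases (3.2+, if memory serves) can also swap MD5 for a much cheaper hash, roughly:

  rsync -a --checksum --checksum-choice=xxh64 /src/ host:/dst/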

Re: [Beowulf] Containers in HPC

2019-05-24 Thread Christopher Samuel
On 5/22/19 6:10 AM, Gerald Henriksen wrote: Paper on arXiv that may be of interest to some as it may be where HPC is heading even for private clusters: In case it's of interest NERSC has a page on how Shifter does containers and how it packs filesystems to improve performance here: https://

Re: [Beowulf] Frontier Announcement

2019-05-08 Thread Christopher Samuel
On 5/8/19 10:47 AM, Jörg Saßmannshausen wrote: As I follow these things rather loosely, my understanding was that OpenACC should run on both nVidia and other GPUs. So maybe that is the reason why it is a 'pure' AMD cluster where both GPUs and CPUs are from the same supplier? IF all of that is wo

Re: [Beowulf] Frontier Announcement

2019-05-07 Thread Christopher Samuel
On 5/7/19 1:59 PM, Prentice Bisbal via Beowulf wrote: I agree. That means a LOT of codes will have to be ported from CUDA to whatever AMD uses. I know AMD announced their HIP interface to convert CUDA code into something that will run on AMD processors, but I don't know how well that works in

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Christopher Samuel
On 5/2/19 8:40 AM, Faraz Hussain wrote: So should I be paying Mellanox to help? Or is it a RedHat issue? Or is it our hardware vendor, HP, who should be involved? I suspect that would be set out in the contract for the HP system. The clusters I've been involved in purchasing in the past have a

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Christopher Samuel
On 5/1/19 8:50 AM, Faraz Hussain wrote: Unfortunately I get this: root@lustwzb34:/root # systemctl status rdma Unit rdma.service could not be found. You're missing this RPM then, which might explain a lot: $ rpm -qi rdma-core Name: rdma-core Version : 17.2 Release : 3.el7 Arc

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Christopher Samuel
On 5/1/19 7:05 AM, Faraz Hussain wrote: [hussaif1@lustwzb34 ~]$ sminfo ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0) sminfo: iberror: failed: Failed to open '(null)' port '0' Sorry I'm late to this. What does this say? systemctl status rdma You should see something alon
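
The usual first checks when UMAD ports can't be opened are roughly along these lines:

  systemctl status rdma            # did rdma-core load the stack at boot?
  lsmod | grep -E 'ib_umad|mlx'    # umad + HCA driver modules present?
  ibstat                           # port State should be Active, Physical state LinkUp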

Re: [Beowulf] Large amounts of data to store and process

2019-03-15 Thread Christopher Samuel
On 3/14/19 12:30 AM, Jonathan Aquilina wrote: I will obviously keep the list updated in regards to Julia and my experiences with it but the little I have looked at the language it is easy to write code for. It's still in its infancy as the latest version I believe is 1.0.1 Whilst new it has b

[Beowulf] Application independent checkpoint/resume?

2019-03-04 Thread Christopher Samuel
Hi folks, Just wondering if folks here have recent experiences here with application independent checkpoint/resume mechanisms like DMTCP or CRIU? Especially interested for MPI uses, and extra bonus points for experiences on Cray. :-) From what I can see CRIU doesn't seem to support MPI at a

Re: [Beowulf] New tools from FB and Uber

2018-10-30 Thread Christopher Samuel
On 31/10/18 5:14 am, Tim Cutts wrote: I vaguely remember hearing about Btrfs from someone at Oracle, it seems the main developer has moved around a bit since! Yeah Chris Mason (and another) left Oracle for Fusion-IO in 2012 and then shifted from there to Facebook in late 2013. Jens Axboe (the

Re: [Beowulf] Oh.. IBM eats Red Hat

2018-10-30 Thread Christopher Samuel
On 31/10/18 4:07 am, INKozin via Beowulf wrote: Will Red Hat come out Blue Hat after IBM blue washing? Best one I've heard was by Kenneth Hoste on the Easybuild Slack. Deep Purple. ;-) -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC __

Re: [Beowulf] An Epyc move for Cray

2018-10-30 Thread Christopher Samuel
On 31/10/18 9:39 am, Christopher Samuel wrote: For those who haven't seen, Cray has announced their new Shasta architecture for forthcoming systems like NERSC-9 (the replacement for Edison). Now I've seen the Cray PR it seems it might not be as closely coupled as it initially read

[Beowulf] An Epyc move for Cray

2018-10-30 Thread Christopher Samuel
For those who haven't seen, Cray has announced their new Shasta architecture for forthcoming systems like NERSC-9 (the replacement for Edison). https://www.hpcwire.com/2018/10/30/cray-unveils-shasta-lands-nersc-9-contract/ It's interesting as Cray have jumped back to using AMD CPUs (Epyc) alo

Re: [Beowulf] Poll - Directory implementation

2018-10-24 Thread Christopher Samuel
On 25/10/18 3:42 am, Tom Harvill wrote: - what directory solution do you implement? - if LDAP, which flavor? - do you have any opinions one way or another on the topic? At VLSCI we originally ran 389-DS multi-master with an LDAP server on each cluster management node plus another one on the s
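
On the client side of that sort of setup, an sssd.conf stanza might look roughly like this (server names and base DN are placeholders):

  [domain/cluster]
  id_provider = ldap
  ldap_uri = ldap://mgmt1, ldap://mgmt2
  ldap_search_base = dc=example,dc=org
  cache_credentials = true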

Re: [Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

2018-09-25 Thread Christopher Samuel
On 10/09/18 11:16, Joe Landman wrote: If you have dumps from the crash, you could load them up in the debugger. Would be the most accurate route to determine why that was triggered. Thanks Joe, after a bit of experimentation we've now successfully got a crash dump. It seems to confirm what I

Re: [Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

2018-09-09 Thread Christopher Samuel
On 10/09/18 11:16, Joe Landman wrote: If you have dumps from the crash, you could load them up in the debugger. Would be the most accurate route to determine why that was triggered. Thanks Joe! Looking at our nodes I don't think we've got crash dumps enabled, I'll see if we can get that done
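
For RHEL-family nodes the kdump setup is roughly as follows (the crashkernel reservation size is just an example and scales with RAM):

  grubby --update-kernel=ALL --args="crashkernel=256M"
  systemctl enable --now kdump
  # after the next panic the vmcore lands under /var/crash/ for use with the crash utility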

[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

2018-09-09 Thread Christopher Samuel
Hi folks, We've had 2 different nodes crash over the past few days with kernel panics triggered by (what is recorded as) a "simd exception" (console messages below). In both cases the triggering application is given as the same binary, a user application built against OpenFOAM v16.06. This doesn

Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA

2018-08-18 Thread Christopher Samuel
On 18/08/18 17:22, Jörg Saßmannshausen wrote: if the problem is RDMA, how about InfiniBand? Will that be broken as well? For RDMA it appears yes, though IPoIB still works for us (though ours is OPA rather than IB; Kilian reported the same). All the best, Chris -- Chris Samuel : http://www.c

Re: [Beowulf] emergent behavior - correlation of job end times

2018-07-24 Thread Christopher Samuel
On 25/07/18 04:52, David Mathog wrote: One possibility is that at the "leading" edge the first job that reads a section of data will do so slowly, while later jobs will take the same data out of cache. That will lead to a "peloton" sort of effect, where the leader is slowed and the followers ac

Re: [Beowulf] Fwd: Project Natick

2018-06-10 Thread Christopher Samuel
On 11/06/18 07:46, John Hearns via Beowulf wrote: Stuart Midgley works for DUG? Yup, for over a decade.. :-) -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing

Re: [Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

2018-05-17 Thread Christopher Samuel
On 14/05/18 21:53, Michael Di Domenico wrote: Can you expand on "image stored on lustre" part? I'm pretty sure i understand the gist, but i'd like to know more. I didn't set this part of the system up, but we have a local chroot on the management nodes disk that we add/modify/remove things fr

Re: [Beowulf] cursed (and perhaps blessed) Intel microcode

2018-05-09 Thread Christopher Samuel
Hi Mark, On 30/03/18 16:28, Chris Samuel wrote: I'll try and nudge a person I know there on that... They did some prodding, and finally new firmware emerged at the end of last month. /tmp/microcode-20180425$ iucode_tool -L intel-ucode-with-caveats/06-4f-01 microcode bundle 1: intel-ucode-wit

Re: [Beowulf] Bright Cluster Manager

2018-05-01 Thread Christopher Samuel
On 02/05/18 06:57, Robert Taylor wrote: It appears to do node management, monitoring, and provisioning, so we would still need a job scheduler like lsf, slurm,etc, as well. Is that correct? I've not used it, but I've heard from others that it can/does supply schedulers like Slurm, but (at leas

Re: [Beowulf] Large Dell, odd IO delays

2018-02-14 Thread Christopher Samuel
On 15/02/18 09:26, David Mathog wrote: Sometimes for no reason that I can discern an IO operation on this machine will stall. Things that should take seconds will run for minutes, or at least until I get tired of waiting and kill them. Here is today's example: gunzip -c largeFile.gz > largeF
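
When a single stream stalls like that it's often worth watching the device and writeback state while it happens, e.g.:

  iostat -xz 1                                  # per-device utilisation and await
  grep -E '^(Dirty|Writeback):' /proc/meminfo   # how much dirty data is queued
  sysctl vm.dirty_ratio vm.dirty_background_ratio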

Re: [Beowulf] Dell syscfg for KNL nodes - different to regular Dell syscfg?

2018-02-13 Thread Christopher Samuel
On 14/02/18 13:38, Christopher Samuel wrote: Now that *might* be because I'm having to (currently) run it on a non-KNL system for testing, and perhaps it probes the BMC to work out what options it makes sense to show me.. So yes, that appeared to be the case. Also if you run it with

Re: [Beowulf] Dell syscfg for KNL nodes - different to regular Dell syscfg?

2018-02-13 Thread Christopher Samuel
Hi Kilian, On 14/02/18 12:40, Kilian Cavalotti wrote: AFAIK, despite their unfortunate sharing of the same name, Dell's syscfg and Intel's syscfg are completely different tools: That I understand. :-) The problem is that the Dell syscfg doesn't seem to have the options that Slurm thinks it s

[Beowulf] Dell syscfg for KNL nodes - different to regular Dell syscfg?

2018-02-13 Thread Christopher Samuel
Hi all, I'm helping bring up a cluster which includes a handful of Dell KNL boxes (PowerEdge C6320p). Now Slurm can manipulate the MCDRAM settings on KNL nodes via syscfg, but Dell ones need to use the Dell syscfg and not the Intel one. The folks have a Dell syscfg (6.1.0) but that doesn't appe
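
For reference, Slurm wires this up through the node features plugin plus knl.conf; a sketch under the assumption that the Dell syscfg lives under /opt/dell (paths and allowed modes are illustrative):

  # slurm.conf
  NodeFeaturesPlugins=knl_generic
  # knl.conf
  SyscfgPath=/opt/dell/syscfg/syscfg
  AllowMCDRAM=flat,cache
  AllowNUMA=a2a,snc2,snc4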

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-07 Thread Christopher Samuel
On 08/01/18 09:18, Richard Walsh wrote: Mmm ... maybe I am missing something, but for an HPC cluster-specific solution ... how about skipping the fixes, and simply requiring all compute node jobs to run in exclusive mode and then zero-ing out user memory between jobs ... ?? If you are running

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-07 Thread Christopher Samuel
On 07/01/18 23:22, Jörg Saßmannshausen wrote: the first court cases against Intel have been filed: These would have to be Meltdown related then, given that Spectre is so widely applicable. Greg K-H has a useful post up about the state of play with the various Linux kernel patches for mainline

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-05 Thread Christopher Samuel
On 06/01/18 12:00, Gerald Henriksen wrote: For anyone interested this is AMD's response: https://www.amd.com/en/corporate/speculative-execution Cool, so variant 1 is likely the one that SuSE has firmware for to disable branch prediction on Epyc. cheers, Chris -- Chris Samuel : http://www.

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-05 Thread Christopher Samuel
On 06/01/18 03:46, Jonathan Aquilina wrote: Chris on a number of articles I read they are saying AMD's are not affected by this. That's only 1 of the 3 attacks to my understanding. The Spectre paper says: # Hardware. We have empirically verified the vulnerability of several # Intel processors

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-05 Thread Christopher Samuel
On 05/01/18 10:48, Jörg Saßmannshausen wrote: What I would like to know is: how about compensation? For me that is the same as the VW scandal last year. We, the users, have been deceived. I think you would be hard pressed to prove that, especially as it seems that pretty much every mainstream

Re: [Beowulf] [upgrade strategy] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-05 Thread Christopher Samuel
On 03/01/18 23:56, Remy Dernat wrote: So here is my question: if this is not confidential, what will you do? Any system where you do not have 100% trust in your users, their passwords and the devices they use will (IMHO) need to be patched. But as ever this will need to be a site-specific r

Re: [Beowulf] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-03 Thread Christopher Samuel
On 03/01/18 19:46, John Hearns via Beowulf wrote: I guess the phrase "to some extent" is the vital one here. Are there any security exploits which use this information? It's more the fact that it reduces/negates the protection that existing kernel address space randomisation gives you, the ide

Re: [Beowulf] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-02 Thread Christopher Samuel
On 03/01/18 14:46, Christopher Samuel wrote: This is going to be interesting I think... Also looks like ARM64 may have a similar issue, a subscriber only article on LWN points to this patch set being worked on to address the problem there: https://lwn.net/Articles/740393/ All the best

[Beowulf] Intel CPU design bug & security flaw - kernel fix imposes performance penalty

2018-01-02 Thread Christopher Samuel
Hi all, Just a quick break from my holiday in Philadelphia (swapped forecast 40C on Saturday in Melbourne for -10C forecast here) to let folks know about what looks like a longstanding Intel CPU design flaw that has security implications. There appears to be no microcode fix possible and the ker

Re: [Beowulf] Openlava down?

2017-12-24 Thread Christopher Samuel
On 24/12/17 10:52, Jeffrey Layton wrote: I remember that email. Just curious if things have progressed so Openlava is no longer available. That would appear to be the case. Here is the DMCA notice from IBM to Github: https://github.com/github/dmca/blob/master/2016/2016-10-17-IBM.md FWIW the

Re: [Beowulf] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread Christopher Samuel
On 19/12/17 09:20, Prentice Bisbal wrote: What are the pros/cons of using these two methods, other than the portability issue I already mentioned? Does srun+pmi use a different method to wire up the connections? Some things I read online seem to indicate that. If slurm was built with PMI suppo
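
A quick way to see which wire-up methods a given Slurm build supports, and to pick one at launch, is roughly:

  srun --mpi=list
  srun --mpi=pmi2 -n 64 ./my_mpi_app    # or --mpi=pmix on builds linked against PMIx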

Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-19 Thread Christopher Samuel
On 19/11/17 10:40, Jonathan Engwall wrote: > I had no idea x86 began its life as a co-processor chip, now it is not > even a product at all. Ah no, this was when floating point was done via a co-processor for the Intel x86.. -- Christopher Samuel Senior Systems Administrator Mel

Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-15 Thread Christopher Samuel
On 16/11/17 13:59, C Bergström wrote: > I'm torn between Knowing Stu and what he does I'll take the former over the latter.. :-) -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone:

Re: [Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-15 Thread Christopher Samuel
h they killed in 2010). -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by P

[Beowulf] Intel kills Knights Hill, Xeon Phi line "being revised"

2017-11-15 Thread Christopher Samuel
addendum to say: # [Update: Intel denies they are dropping the Xeon Phi line, # saying only that it has "been revised based on recent # customer and overall market needs."] cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of

Re: [Beowulf] Killing nodes with Open-MPI?

2017-11-05 Thread Christopher Samuel
no > OOPS > or other diagnostics and has to be power cycled. It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which came out a few days ago). cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne

Re: [Beowulf] slow mpi init/finalize

2017-10-17 Thread Christopher Samuel
MPI with the config option: --with-verbs to get it to enable IB support? cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

Re: [Beowulf] slow mpi init/finalize

2017-10-15 Thread Christopher Samuel
I2 is a contrib plugin in the source tree). Hope that helps.. Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 ___ Beow

Re: [Beowulf] mpi alltoall help

2017-10-10 Thread Christopher Samuel
On 11/10/17 02:58, Michael Di Domenico wrote: > i'm getting stuck trying to run some fairly large IMB-MPI alltoall > tests under openmpi 2.0.2 on rhel 7.4 Did this work on RHEL 7.3? I've heard rumours of issues with RHEL 7.4 and OFED. cheers, Chris -- Christopher Samuel

[Beowulf] Intel/Cray Aurora system pushed back a few years, expanded to 1PF

2017-10-04 Thread Christopher Samuel
Knights Hill (the followup to KNL) might be the cause of the delay... All the best, Chris (back in Melbourne) -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

Re: [Beowulf] What is rdma, ofed, verbs, psm etc?

2017-09-23 Thread Christopher Samuel
ency than our circa 2013 FDR14 Infiniband cluster. All the best, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 ___ Beowulf

[Beowulf] Administrivia: List admin away at Slurm User Group next week

2017-09-20 Thread Christopher Samuel
u might need to do that a few times before it finally sinks in. :-) Take care all, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 _

Re: [Beowulf] What is rdma, ofed, verbs, psm etc?

2017-09-20 Thread Christopher Samuel
en it is worth understanding the basics, even just doing some performance comparisons can be educational. But of course you have to have the gear that can do this in the first place! Best of luck, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The Uni

Re: [Beowulf] What is rdma, ofed, verbs, psm etc?

2017-09-19 Thread Christopher Samuel
7 (at least in the experience of the folks I'm helping out here). cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

Re: [Beowulf] slurm in heterogenous cluster

2017-09-19 Thread Christopher Samuel
On 18/09/17 23:11, Mikhail Kuzminsky wrote: > Thank you very much ! > I hope than modern major slurm versions will be succesfully translated > and builded also w/old Linux distributions > (for example, w/2.6 kernel). We run Slurm 16.05.8 on RHEL6 (2.6.32 base) without issue. --

Re: [Beowulf] slurm in heterogenous cluster

2017-09-17 Thread Christopher Samuel
.x or: slurmdbd: 17.02.x slurmctld: 16.05.x slurmd: 16.05.x & 15.08.x or: slurmdbd: 16.05.x slurmctld: 16.05.x slurmd: 16.05.x & 15.08.x or: slurmdbd: 16.05.x slurmctld: 15.08.x slurmd: 15.08.x or: slurmdbd: 15.08.x slurmctld: 15.08.x slurmd: 15.08.x Good luck! Chris -- Christopher

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-17 Thread Christopher Samuel
On 15/09/17 04:45, Prentice Bisbal wrote: > I'm happy to announce that I finally found the cause of this problem: numad. Very interesting, it sounds like it was migrating processes onto a single core over time! Anything diagnostic in its log? -- Christopher Samuel Senior

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-13 Thread Christopher Samuel
0.05| 2503 1| 13| 29| 0.00| 0.00| 0.00| 0.00|| 99.57| 0.43| 2503 1| 14| 30| 0.00| 0.00| 0.00| 0.00|| 99.94| 0.06| 2503 1| 15| 31| 0.00| 0.00| 0.00| 0.00|| 99.58| 0.42| 2503 cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinforma

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-10 Thread Christopher Samuel
h and then diff that. Hopefully that will reveal any differences in kernel boot options, driver messages, power saving settings, etc, that might be implicated. Good luck! Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbou
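
A low-tech way to gather that across nodes for diffing (assuming pdsh/dshbak are installed; node names are placeholders):

  pdsh -w node[01-02] 'cat /proc/cmdline; cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor' | dshbak -c
  # dshbak -c folds identical output together, so only the node that differs stands out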

Re: [Beowulf] RAID5 rebuild, remount with write without reboot?

2017-09-10 Thread Christopher Samuel
t I have that just "happy to be walking away" > feeling about the whole incident. +1 :-) Glad to hear you survived.. cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Em

Re: [Beowulf] cluster deployment and config management

2017-09-05 Thread Christopher Samuel
On 06/09/17 09:51, Christopher Samuel wrote: > Nothing like your scale, of course, but it works and we know if a node > has booted a particular image it will be identical to any other node > that's set to boot the same image. I should mention that we set the osimage for nodes v

Re: [Beowulf] cluster deployment and config management

2017-09-05 Thread Christopher Samuel
e only person who understood it left. Don't miss it at all... cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

Re: [Beowulf] Supercomputing comes to the Daily Mail

2017-08-14 Thread Christopher Samuel
he next year, the spaceborne computer will continuously run # through a set of computing benchmarks to determine its performance # over time. Meanwhile, on the ground, an identical copy of the # computer will run in a lab as a control. No details on the actual systems there though. cheers, Chris

Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
the same OS image then you'd not expect the kernel command lines etc to differ, but the UEFI settings might (depending on how they are configured usually). cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbo

Re: [Beowulf] How to debug slow compute node?

2017-08-13 Thread Christopher Samuel
as going on and why performance was so bad. It wasn't until they got Mellanox into the calls that Mellanox pointed this out to them. cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Ph

Re: [Beowulf] mlx 10g ethernet

2017-08-06 Thread Christopher Samuel
On 05/08/17 03:49, Michael Di Domenico wrote: > so given that qperf seems to agree with iperf, i guess it's an > interesting question now why, lustre lnet_selftest and IMB sendrecv > seem throttled at 500MB/sec Is this over TCP/IP or using RoCE (RDMA over Converged Ethernet) ? --

Re: [Beowulf] Cluster Hat

2017-08-06 Thread Christopher Samuel
Raspbian (Debian Jessie), in case that # makes a difference. Hope that helps.. Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3

Re: [Beowulf] How to know if infiniband network works?

2017-08-02 Thread Christopher Samuel
e IB (and fail if it cannot) by doing this before running the application: export OMPI_MCA_btl=openib,self,sm Out of interest, are you running it via a batch system of some sort? All the best, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The Univ

Re: [Beowulf] Hyperthreading and 'OS jitter'

2017-08-01 Thread Christopher Samuel
ame that Bull's test was too small to show any benefit! All the best, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 __

Re: [Beowulf] Hyperthreading and 'OS jitter'

2017-08-01 Thread Christopher Samuel
reason they wrote the original cpuset code in the Linux kernel so they could constrain a set of cores for the boot services and the rest were there to run jobs on. All the best, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melb

Re: [Beowulf] Administrivia: emergency Mailman work for the Beowulf list

2017-07-04 Thread Christopher Samuel
, sorry about that. All the best, Chris On 05/07/17 10:07, Christopher Samuel wrote: > Hi all, > > Someone sent a message from a site that uses DMARC last night and > consequently almost 500 subscribers had their subscriptions suspended as > DMARC requires sites to break the

Re: [Beowulf] Administrivia: emergency Mailman work for the Beowulf list

2017-07-04 Thread Christopher Samuel
On 05/07/17 10:07, Christopher Samuel wrote: > I'm about to install the latest backport of Mailman on beowulf.org which > will then allow me to automatically reject any DMARC'd emails to the > list as a quick fix. I've now completed this work, though with a slight m

[Beowulf] Administrivia: emergency Mailman work for the Beowulf list

2017-07-04 Thread Christopher Samuel
ilman on beowulf.org which will then allow me to automatically reject any DMARC'd emails to the list as a quick fix. Then I'll need to work out how to get the folks who got dropped back onto the list. :-/ Sorry about this, Chris -- Christopher Samuel Senior Systems Administr

[Beowulf] BeeGFS usage question

2017-06-28 Thread Christopher Samuel
, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To

Re: [Beowulf] Register article on Epyc

2017-06-21 Thread Christopher Samuel
I thought it interesting that the only performance info in that article for Epyc was SpecINT (the only mention of SpecFP was for Radeon). -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0

Re: [Beowulf] Heads up - Stack-Clash local root vulnerability

2017-06-21 Thread Christopher Samuel
about with users' own codes being copied onto the system or containers utilised through Shifter and Singularity which exist to disarm Docker containers. Phew, thanks so much for pointing that out! :-) All the best, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinfor

Re: [Beowulf] Heads up - Stack-Clash local root vulnerability

2017-06-21 Thread Christopher Samuel
s... Yes, a double edged sword, lots more vulnerable software that will never get an update.. :-/ cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 __

Re: [Beowulf] Heads up - Stack-Clash local root vulnerability

2017-06-21 Thread Christopher Samuel
already (and possibly predate Qualys identifying this) from what I've heard. :-( Qualys's PoCs are due to drop next Tuesday (timezone unclear). cheers, Chris -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The University of Melbourne Email:

Re: [Beowulf] Heads up - Stack-Clash local root vulnerability

2017-06-20 Thread Christopher Samuel
On 21/06/17 10:21, Christopher Samuel wrote: > I suspect in those cases you have to rely entirely on the kernel > mitigation of increasing the stack guard gap size. I'm now seeing indications this kernel change can break some applications (we've not touched our HPC syst

[Beowulf] Heads up - Stack-Clash local root vulnerability

2017-06-20 Thread Christopher Samuel
stion of statically linked binaries (yes, I know, don't do that, but they are common) and containers such as Shifter & Singularity with older glibc's. I suspect in those cases you have to rely entirely on the kernel mitigation of increasing the stack guard gap size. cheers, Chr
