[Beowulf] HPC Principal System Engineer at Broad

2024-04-25 Thread Paul Edmon via Beowulf
A friend asked me to pass this along. Figured some folks on this list might be interested. https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773 -Paul Edmon-

[Beowulf] Cleaning up orphaned fuse mounts

2022-04-14 Thread Paul Edmon via Beowulf
Does anyone have a handy script or epilog script that you run to clean up fuse mounts that users may have made during a job? -Paul Edmon-
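
A minimal sketch of one way to do this, assuming a Slurm epilog running as root with SLURM_JOB_USER set by slurmd; the script is hypothetical, not a tested production epilog:

    #!/usr/bin/env python3
    # Epilog sketch: unmount FUSE mounts owned by the finishing job's user.
    import os
    import pwd
    import subprocess

    user = os.environ.get("SLURM_JOB_USER")
    if user:
        uid = pwd.getpwnam(user).pw_uid
        with open("/proc/mounts") as mounts:
            for line in mounts:
                fields = line.split()
                mountpoint, fstype, opts = fields[1], fields[2], fields[3]
                # user FUSE mounts record the owning uid in their mount options
                if fstype.startswith("fuse") and f"user_id={uid}" in opts.split(","):
                    # lazy unmount so a wedged fuse daemon can't hang the epilog;
                    # a real epilog should also check the user has no other jobs on the node
                    subprocess.run(["fusermount", "-u", "-z", mountpoint], check=False)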

Re: [Beowulf] LSF vs Slurm

2022-03-10 Thread Paul Edmon via Beowulf
with the current version of Slurm it should be much faster, as things have really come a long way over the past decade. -Paul Edmon- On 3/10/2022 11:39 AM, Lohit Valleru via Beowulf wrote: Hello Everyone, I wanted to ask if there is anyone who could explain to me the benefits of movi

Re: [Beowulf] Question about fair share

2022-01-24 Thread Paul Edmon via Beowulf
/#Slurm_partitions -Paul Edmon- On 1/24/2022 2:59 PM, Tom Harvill via Beowulf wrote: Thank you Mr. Edmon, The link you provided is comprehensive and well-written. However, I don't see the scheduler's configured half-life time length.  Do you know what it is?  And what is your cluster's maximum j
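
For anyone else hunting for that setting: the decay half-life lives in slurm.conf; a sketch with illustrative values, not FASRC's actual configuration:

    # slurm.conf (illustrative values only)
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=14-0    # fairshare usage decays with a 14-day half-life
    # on a running system the live value shows up in:
    #   scontrol show config | grep PriorityDecayHalfLife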

Re: [Beowulf] Question about fair share

2022-01-24 Thread Paul Edmon via Beowulf
Here is our fairshare policy doc: https://docs.rc.fas.harvard.edu/kb/fairshare/  We use the classic fairshare here. -Paul Edmon- On 1/24/2022 2:17 PM, Tom Harvill wrote: Hello, We use a 'fair share' feature of our scheduler (SLURM) and have our decay half-life (the time

Re: [Beowulf] Infiniband for MPI computations setup guide

2021-10-20 Thread Paul Edmon via Beowulf
Oh, you will also need an IB subnet manager (opensm) running since you have an unmanaged switch.  You can start this on one of the compute nodes.  I would probably start up two so you have redundancy. -Paul Edmon- On 10/20/2021 6:08 AM, leo camilo wrote:  I have recently acquired a few ConnectX
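
A minimal sketch of that setup, assuming the opensm package is installed on two of the nodes; the priority values are arbitrary, and the higher one wins the master election:

    # node A: preferred (master) subnet manager
    opensm -B -p 15
    # node B: standby, takes over if node A's SM disappears
    opensm -B -p 10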

Re: [Beowulf] Infiniband for MPI computations setup guide

2021-10-20 Thread Paul Edmon via Beowulf
You can provide command-line options to ensure that. 5. Test and verify it is working. -Paul Edmon- On 10/20/2021 6:08 AM, leo camilo wrote:  I have recently acquired a few ConnectX-3 cards and an unmanaged IB switch (IS5022) to upgrade my department's beowulf cluster. Thus far, I have be

Re: [Beowulf] Data Destruction

2021-09-29 Thread Paul Edmon via Beowulf
I guess the question is, for a parallel filesystem, how do you make sure you have zeroed out the file without borking the whole filesystem, since the data is spread over a RAID set and could be spread over multiple hosts. -Paul Edmon- On 9/29/2021 10:32 AM, Scott Atchley wrote: For our users that
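
On a plain local filesystem the usual tool for an in-place overwrite is coreutils shred, sketched below with a placeholder path; as this thread points out, on a striped or copy-on-write parallel filesystem the overwrite may never reach every physical block, which is exactly the problem being raised:

    # overwrite the contents three times, add a final zero pass, then unlink
    shred -n 3 -z -u /path/to/sensitive_file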

Re: [Beowulf] Data Destruction

2021-09-29 Thread Paul Edmon via Beowulf
Yeah, that's what we were surmising.  But paranoia and compliance being what they are, we were curious what others were doing. -Paul Edmon- On 9/29/2021 10:32 AM, Renfro, Michael wrote: I have to wonder if the intent of the DUA is to keep physical media from winding up in the wrong hands.

Re: [Beowulf] Data Destruction

2021-09-29 Thread Paul Edmon via Beowulf
The former.  We are curious how to selectively delete data from a parallel filesystem.  For example, we commonly use Lustre, Ceph, and Isilon in our environment.  That said, if other types allow for easier destruction of selective data, we would be interested in hearing about it. -Paul Edmon

[Beowulf] Data Destruction

2021-09-29 Thread Paul Edmon via Beowulf
es of filesystems do people generally use for this, and how do people ensure destruction?  Do these types of DUAs preclude certain storage technologies from consideration, or are there creative ways to comply using more common scalable filesystems? Thanks in advance for the info. -

[Beowulf] Open Positions at FASRC

2021-09-20 Thread Paul Edmon via Beowulf
https://www.rc.fas.harvard.edu/about/employment/ -Paul Edmon-

Re: [Beowulf] fdr in edr switch

2021-03-02 Thread Paul Edmon
We haven't had any problems with plugging FDR stuff into an EDR switch.  It does downgrade the connections, but they still work. -Paul Edmon- On 3/2/2021 6:44 AM, Darren Wise wrote: Heya, I do very much the same, QSFP28-EDR 100G adapter cards into QSFP+-FDR 56G, just with the use of a cab

Re: [Beowulf] [EXTERNAL] Re: pdsh

2020-11-30 Thread Paul Edmon
Slurm is definitely still under active development with a vibrant community.  SchedMD is the one mainly driving its development.  In fact the 20.11 version just came out with some nice new features like scrontab, which I am super excited for. -Paul Edmon- On 11/29/2020 11:26 PM, Lux, Jim (US
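
For the curious, scrontab entries look like a regular crontab with sbatch-style options supplied on #SCRON lines; the partition name and script path here are made up:

    # edit with: scrontab -e
    #SCRON --partition=serial_requeue
    #SCRON --time=00:10:00
    0 3 * * * /home/someuser/bin/nightly_report.sh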

Re: [Beowulf] 10G and rsync

2020-01-02 Thread Paul Edmon
We use fpsync for all our large-scale data movement here and Globus for external transfers. -Paul Edmon- On 1/2/20 10:45 AM, Joe Landman wrote: On 1/2/20 10:26 AM, Michael Di Domenico wrote: does anyone know or has anyone gotten rsync to push wire-speed transfers of big files over 10G links?  i
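
As a rough illustration of the fpsync pattern (several rsync workers fed from a partitioned file list); the paths are placeholders and the flag should be checked against the local fpart/fpsync version:

    # run 8 concurrent rsync jobs over the source tree
    fpsync -n 8 /old/filesystem/project/ /new/filesystem/project/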

[Beowulf] Dual IB Cards in a Single Box

2019-05-10 Thread Paul Edmon
card running simultaneously? -Paul Edmon-

Re: [Beowulf] Fortran is Awesome

2018-11-29 Thread Paul Edmon
with Fortran. -Paul Edmon- On 11/29/18 10:09 AM, Nathan Moore wrote: I've probably mentioned this before.  If a student only has one programming course, teaching Fortran feels like malpractice; however, this book is awesome! Classical Fortran, Kupferschmid https://www.crcpress.com

Re: [Beowulf] Fortran is Awesome

2018-11-28 Thread Paul Edmon
tool that's best for the job.  That's the moral of the story. -Paul Edmon- On 11/28/2018 12:17 PM, Robert G. Brown wrote: On Wed, 28 Nov 2018, Paul Edmon wrote: Once C has native arrays and orders them properly, then we can talk :). Yeah, like this.  That's really the big diffe

Re: [Beowulf] Fortran is Awesome

2018-11-28 Thread Paul Edmon
does well it does very well, and it still does very well. Once C has native arrays and orders them properly, then we can talk :). -Paul Edmon- On 11/28/18 11:36 AM, Peter St. John wrote: Maybe I'm being too serious but in the old days, Fortran was the most mature, maintained compil

[Beowulf] Fortran is Awesome

2018-11-28 Thread Paul Edmon
Fortran is and remains an awesome language.  More people should use it: https://wordsandbuttons.online/fortran_is_still_a_thing.html -Paul Edmon-

Re: [Beowulf] Contents of Compute Nodes Images vs. Login Node Images

2018-10-23 Thread Paul Edmon
cific architectures we tell them to start up an interactive session on the hardware they want to run on to build. -Paul Edmon- On 10/23/18 12:15 PM, Ryan Novosielski wrote: Hi there, I realize this may not apply to all cluster setups, but I’m curious what other sites do with regard to sof

Re: [Beowulf] Lustre Upgrades

2018-07-24 Thread Paul Edmon
an problem anymore.  However, it has made us very gun-shy about trying Gluster again.  Instead we've decided to use Ceph, as we've gained a bunch of experience with it in our OpenNebula installation. -Paul Edmon- On 07/24/2018 11:02 AM, John Hearns via Beowulf wrote: Paul, thanks fo

Re: [Beowulf] Lustre Upgrades

2018-07-24 Thread Paul Edmon
nd in one's own environment. -Paul Edmon- On 07/24/2018 10:31 AM, John Hearns via Beowulf wrote: Forgive me for saying this, but the philosophy for software-defined storage such as CEPH and Gluster is that forklift-style upgrades should not be necessary. When a storage server is to be retire

Re: [Beowulf] Lustre Upgrades

2018-07-24 Thread Paul Edmon
you also have to have the budget to buy the new hardware. Right now we are just exploring our options. -Paul Edmon- On 07/24/2018 04:52 AM, Jörg Saßmannshausen wrote: Hi Paul, with a file system being 93% full, in my humble opinion it would make sense to increase the underlying hardware c

Re: [Beowulf] Lustre Upgrades

2018-07-23 Thread Paul Edmon
el for the IEEL appliance we have been running. Odds are your systems are fine, as they aren't taking quite the pounding ours is.  The problem doesn't happen that frequently. -Paul Edmon- On 07/23/2018 02:03 PM, Michael Di Domenico wrote: On Mon, Jul 23, 2018 at 1:34 PM, Paul Edmon wro

Re: [Beowulf] Lustre Upgrades

2018-07-23 Thread Paul Edmon
scale linearly with size. They all have the same hardware. The head nodes are Dell R620s, while the shelves are M3420 (MDS) and M3260 (OSS).  The MDT is 2.2T with 466G used and 268M inodes used.  Each OST is 30T, with each OSS hosting six.  The filesystem itself is 93% full. -Paul Edmon- On

Re: [Beowulf] Lustre Upgrades

2018-07-23 Thread Paul Edmon
est the upgrade path before committing to upgrading our larger systems.  One of the questions we had, though, was whether we needed to run e2fsck before/after the upgrade, as that could add significant time to the outage. -Paul Edmon- On 07/23/2018 01:18 PM, Jeff Johnson wrote:

Re: [Beowulf] Lustre Upgrades

2018-07-23 Thread Paul Edmon
My apologies, I meant 2.5.34, not 2.6.34.  We'd like to get up to 2.10.4, which is what our clients are running.  Recently we upgraded our cluster to CentOS 7, which necessitated the client upgrade.  Our storage servers, though, stayed behind on 2.5.34. -Paul Edmon- On 07/23/2018 01:00 PM,

[Beowulf] Lustre Upgrades

2018-07-23 Thread Paul Edmon
your wisdom. -Paul Edmon-

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-08 Thread Paul Edmon
d on your single-node/core users' tolerance for being requeued. -Paul Edmon- On 06/08/2018 03:55 AM, John Hearns via Beowulf wrote: Chris, good question. I can't give a direct answer there, but let me share my experiences. In the past I managed SGI ICE clusters and a large memory UV sy

Re: [Beowulf] Monitoring and Metrics

2017-10-07 Thread Paul Edmon
So for general monitoring of cluster usage we use https://github.com/fasrc/slurm-diamond-collector and pipe to Grafana.  We also use XDMoD: http://open.xdmod.org/7.0/index.html As for specific node alerting, we use the old standby of Nagios. -Paul Edmon- On 10/7/2017 8:21 AM, Josh

Re: [Beowulf] slurm in heterogenous cluster

2017-09-18 Thread Paul Edmon
We run both CentOS 6 and 7 here for our install of Slurm.  There have been no problems with using Slurm on either simultaneously. -Paul Edmon- On 09/18/2017 09:11 AM, Mikhail Kuzminsky wrote: In message from Christopher Samuel (Mon, 18 Sep 2017 16:03:47 +1000): ... The best info is in the

[Beowulf] Jobs at Harvard Research Computing

2017-09-11 Thread Paul Edmon
We have a number of openings here at Harvard FAS RC. If you are interested, please check out our employment page for details: https://www.rc.fas.harvard.edu/about/employment/ -Paul Edmon-