Re: [Beowulf] Question about fair share

2022-01-24 Thread Skylar Thompson
On Mon, Jan 24, 2022 at 01:17:30PM -0600, Tom Harvill wrote: > > > Hello, > > We use a 'fair share' feature of our scheduler (SLURM) and have our decay > half-life (the time needed for priority penalty to halve) set to 30 days.  > Our maximum job runtime is 7 days.  I'm wondering what others use

Re: [Beowulf] [EXTERNAL] server lift

2021-10-22 Thread Skylar Thompson
On Fri, Oct 22, 2021 at 10:24:47AM -0700, David Mathog wrote: > On Thu, 21 Oct 2021 15:00:22 -0400 Prentice Bisbal wrote: > > > We have one of these where I work: > > > > https://serverlift.com/data-center-lifts/sl-350x/ > > Wish I had had something like that. > > The only downside to that unit

Re: [Beowulf] server lift

2021-10-18 Thread Skylar Thompson
We're using a GL-8 lift, which can fit down our ~2'-wide cold aisles. It can be a little bit awkward but definitely keeps the workplace safety people happy. On Mon, Oct 18, 2021 at 10:32:31AM -0400, Michael Di Domenico wrote: > we're using an older genie lift as a server lift currently. which as

Re: [Beowulf] Data Destruction

2021-09-29 Thread Skylar Thompson
In this case, we've successfully pushed back with the granting agency (US NIH, generally, for us) that it's just not feasible to guarantee that the data are truly gone on a production parallel filesystem. The data are encrypted at rest (including offsite backups), which has been sufficient for our

Re: [Beowulf] Data Destruction

2021-09-29 Thread Skylar Thompson
We have one storage system (DDN/GPFS) that is required to be NIST-compliant, and we bought self-encrypting drives for it. The up-charge for SED drives has diminished significantly over the past few years so that might be easier than doing it in software and then having to verify/certify that the so

Re: [Beowulf] Power Cycling Question

2021-07-16 Thread Skylar Thompson
One problem with suspend/sleep is if you have services that depend on persistent TCP connections. I don't know that GPFS (er, sorry, "Spectrum Scale"), for instance, would be consistently tolerant of its daemon connections being interrupted, even if the node in question wasn't actually doing any I/

Re: [Beowulf] Odd NFS write issue for commands issued in a script

2020-12-11 Thread Skylar Thompson
Is it possible that /usr/common/tmp/outfile.txt already exists, and the shell has noclobber set? On Tue, Dec 08, 2020 at 05:30:14PM -0800, David Mathog wrote: > Can anybody suggest why a script which causes writes to an NFS mounted > directory like so > >ssh remotenode 'command >/usr/common/t
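
A quick way to reproduce the noclobber behavior suggested above; this is a sketch, not part of the original thread, and the path simply reuses the one from the message:

    # With noclobber set, redirection refuses to overwrite an existing file
    set -o noclobber
    echo first  > /usr/common/tmp/outfile.txt   # succeeds if the file does not exist yet
    echo second > /usr/common/tmp/outfile.txt   # fails with "cannot overwrite existing file"
    echo third >| /usr/common/tmp/outfile.txt   # >| forces the overwrite despite noclobber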

Re: [Beowulf] ***UNCHECKED*** Re: OT, RAID controller replacement batteries?

2020-11-04 Thread Skylar Thompson
On Wed, Nov 04, 2020 at 09:54:36AM -0800, David Mathog wrote: > > on Mon, 2 Nov 2020 12:31:39 Skylar Thompson wrote: > > > We've had the same problems, one somewhat-effective trick has been to > > scavenge working batteries from systems we're sending to surplus s

Re: [Beowulf] OT, RAID controller replacement batteries?

2020-11-02 Thread Skylar Thompson
We've had the same problems; one somewhat-effective trick has been to scavenge working batteries from systems we're sending to surplus so we have our own supply of batteries to swap in. The failure rate is marginally better than the batteries we've bought from Amazon/Newegg/eBay (as you note, not g

Re: [Beowulf] [External] SLURM - Where is this exit status coming from?

2020-08-13 Thread Skylar Thompson
Hmm, apparently math is hard today. I of course meant 2^7, not 2^8. On Thu, Aug 13, 2020 at 02:37:46PM -0700, Skylar Thompson wrote: > I think this is an artifact of the job process running as a child process of > the job script, where POSIX defines the low-order 8 bits of the process >

Re: [Beowulf] [External] SLURM - Where is this exit status coming from?

2020-08-13 Thread Skylar Thompson
I think this is an artifact of the job process running as a child process of the job script, where POSIX defines the low-order 8 bits of the process exit code as indicating which signal the child process received when it exited. As others noted, 137 is 2^8+9, where 9 is SIGKILL (exceeding memory,
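
A quick demonstration of the 128+signal convention (137 = 128 + 9 for SIGKILL, per the correction above); a sketch, not part of the original message:

    # A child killed by a signal is reported by the shell as 128 + signal number
    bash -c 'kill -KILL $$'   # the child sends itself SIGKILL (signal 9)
    echo $?                   # prints 137, i.e. 128 + 9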

Re: [Beowulf] [EXTERNAL] Re: Have machine, will compute: ESXi or bare metal?

2020-02-19 Thread Skylar Thompson
dsh) to move files around. > > Note that with beagles (and Rpis) you usually use a "network over USB" to get > it up originally (the gadget interface) > > On 2/11/20, 7:57 PM, "Beowulf on behalf of Skylar Thompson" > wrote: > > On Tue, Feb 11, 2

Re: [Beowulf] HPC for community college?

2020-02-19 Thread Skylar Thompson
This was basically our intent with the LittleFe project[1] as well: a lot of small institutions don't have the facilities or the expertise to run a supercomputer, but they have students and faculty that would really benefit from being able to learn the basics of parallel computing and HPC if only t

Re: [Beowulf] Have machine, will compute: ESXi or bare metal?

2020-02-11 Thread Skylar Thompson
On Tue, Feb 11, 2020 at 06:25:24AM +0800, Benson Muite wrote: > > > On Tue, Feb 11, 2020, at 9:31 AM, Skylar Thompson wrote: > > On Sun, Feb 09, 2020 at 10:46:05PM -0800, Chris Samuel wrote: > > > On 9/2/20 10:36 pm, Benson Muite wrote: > > > > > > >

Re: [Beowulf] Have machine, will compute: ESXi or bare metal?

2020-02-10 Thread Skylar Thompson
On Sun, Feb 09, 2020 at 10:46:05PM -0800, Chris Samuel wrote: > On 9/2/20 10:36 pm, Benson Muite wrote: > > > Take a look at the bootable cluster CD here: > > http://www.littlefe.net/ > > From what I can see BCCD hasn't been updated for just over 5 years, and the > last email on their developer l

Re: [Beowulf] Interactive vs batch, and schedulers [EXT]

2020-01-17 Thread Skylar Thompson
In the Grid Engine world, we've worked around some of the resource fragmentation issues by assigning static sequence numbers to queue instances (a node publishing resources to a queue) and then having the scheduler fill nodes by sequence number rather than spreading jobs across the cluster. This le

Re: [Beowulf] software for activating one of many programs but not the others?

2019-08-20 Thread Skylar Thompson
We also use Environment Modules, with a well-established hierarchy for software installs (software-name/software-version/OS/OS-version/architecture). Combined with some custom Tcl functions and common header files for our module files, this lets us keep the size of most module files very small (2-5

Re: [Beowulf] GPFS question

2019-04-30 Thread Skylar Thompson
I'm glad to hear you got it working again! Out of curiosity, how many files and bytes do you have spread across how many NSDs? It's been nearly a decade since we've had to run mmfsck but the run-time of it is a concern now that our storage is an order of magnitude bigger than it was then. Also, i

Re: [Beowulf] live free SGE descendent (for Centos 7)?

2019-03-05 Thread Skylar Thompson
Hi David, Not sure if you saw this, but Univa just announced that they will be selling support for the open source GE forks: http://www.univa.com/about/news/press_2019/02282019.php I don't know how much development time they will include, but as you note, at some point the open source forks will

[Beowulf] HPC sysadmin position at University of Washington Genome Sciences

2019-01-07 Thread Skylar Thompson
Hi Beowulfers, UW Genome Sciences is hiring for a HPC sysadmin to support our bioinformatics and general research computing for the department: https://uwhires.admin.washington.edu/eng/candidates/default.cfm?szCategory=jobprofile&szOrderID=163509&szCandidateID=0&szSearchWords=linux&szReturnToSear

Re: [Beowulf] Poll - Directory implementation

2018-10-27 Thread Skylar Thompson
> nslcd is something completely different (*) and whoever chose similar > names should be forced to watch endless re-runs of the Parrot Sketch. > https://wiki.samba.org/index.php/Nslcd > (*) obligatory Python reference > > On Sat, 27 Oct 2018 at 04:12, Skylar Thomp

Re: [Beowulf] Poll - Directory implementation

2018-10-26 Thread Skylar Thompson
want to speed up caching with sssd itself you can put its local > caches on a RAMdisk. This has the cost of no persistence of course and > uses up RAM which you may prefer to put to better use. > > On Sat, 27 Oct 2018 at 00:59, Skylar Thompson > <skylar.thomp..

Re: [Beowulf] Poll - Directory implementation

2018-10-26 Thread Skylar Thompson
On Fri, Oct 26, 2018 at 08:44:28PM +, Ryan Novosielski wrote: > Our LDAP is very small, compared to the sorts of things some people run. > > We added indexes today on uid, uidNumber, and gidNumber and the problem went > away. Didn’t try it earlier as it had virtually no impact on our testing

Re: [Beowulf] Poll - Directory implementation

2018-10-25 Thread Skylar Thompson
At Univ. of WA Genome Sciences, we use Active Directory, but we also support a modest desktop environment. As much as I am not a fan of Microsoft, AD just works (even the replication) and, since someone else is responsible for the Windows gear here, I can just think of it as a LDAP/Krb5 store with

Re: [Beowulf] Contents of Compute Nodes Images vs. Login Node Images

2018-10-23 Thread Skylar Thompson
At Univ. of WA Genome Sciences, we run the same build on both login and compute nodes. The login nodes are obviously not as capable as our compute nodes, but it's easier for us to provision them in the same way. On Tue, Oct 23, 2018 at 04:15:51PM +, Ryan Novosielski wrote: > Hi there, > > I r

Re: [Beowulf] memory usage

2018-06-22 Thread Skylar Thompson
On Friday, June 22, 2018, Michael Di Domenico wrote: > On Fri, Jun 22, 2018 at 2:28 PM, Skylar Thompson > wrote: > > Assuming Linux, you can get that information out of /proc//smaps and > > numa_maps. > > the memory regions are in there for the used bits, but i do

Re: [Beowulf] memory usage

2018-06-22 Thread Skylar Thompson
Assuming Linux, you can get that information out of /proc//smaps and numa_maps. Skylar On Friday, June 22, 2018, Michael Di Domenico wrote: > does anyone know of a tool that looks at a process > (single/multi-threaded) and tells you how much memory it's using and > in which numa domain the allo
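
A minimal sketch of pulling that information out of /proc (the PID is a placeholder, not from the thread):

    pid=1234   # hypothetical PID of the process being inspected
    # Total resident memory, summed across all mappings
    awk '/^Rss:/ {sum += $2} END {print sum " kB resident"}' /proc/$pid/smaps
    # Per-mapping NUMA placement: the N0=, N1=, ... fields count pages per node
    grep -E 'N[0-9]+=' /proc/$pid/numa_maps | head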

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-12 Thread Skylar Thompson
On Tue, Jun 12, 2018 at 11:08:44AM -0400, Prentice Bisbal wrote: > On 06/12/2018 12:33 AM, Chris Samuel wrote: > > >Hi Prentice! > > > >On Tuesday, 12 June 2018 4:11:55 AM AEST Prentice Bisbal wrote: > > > >>I to make this work, I will be using job_submit.lua to apply this logic > >>and assign a j

Re: [Beowulf] Clearing out scratch space

2018-06-12 Thread Skylar Thompson
On Tue, Jun 12, 2018 at 10:06:06AM +0200, John Hearns via Beowulf wrote: > What do most sites do for scratch space? We give users access to local disk space on nodes (spinning disk for older nodes, SSD for newer nodes), which (for the most part) GE will address with the $TMPDIR job environment var
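
A sketch of a Grid Engine job script using the per-job $TMPDIR scratch directory described above (the input/output paths and the analysis command are made up for illustration):

    #!/bin/bash
    #$ -cwd
    # Stage data into node-local scratch, work there, copy results back out;
    # GE removes $TMPDIR automatically when the job ends
    cp /shared/input.dat "$TMPDIR/"
    cd "$TMPDIR"
    my_analysis input.dat > result.out
    cp result.out /shared/results/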

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-12 Thread Skylar Thompson
On Tue, Jun 12, 2018 at 02:28:25PM +1000, Chris Samuel wrote: > On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote: > > > Unfortunately we don't have a mechanism to limit > > network usage or local scratch usage > > Our trick in Slurm is to use the slur

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Skylar Thompson
On Mon, Jun 11, 2018 at 02:36:14PM +0200, John Hearns via Beowulf wrote: > Skylar Thomson wrote: > >Unfortunately we don't have a mechanism to limit > >network usage or local scratch usage, but the former is becoming less of a > >problem with faster edge networking, and we have an opt-in bookkeepin

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-10 Thread Skylar Thompson
On Sun, Jun 10, 2018 at 06:46:04PM +1000, Chris Samuel wrote: > On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote: > > > We're a Grid Engine shop, and we have the execd/shepherds place each job in > > its own cgroup with CPU and memory limits in place. > &g

Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-09 Thread Skylar Thompson
We're a Grid Engine shop, and we have the execd/shepherds place each job in its own cgroup with CPU and memory limits in place. This lets our users make efficient use of our HPC resources whether they're running single-slot jobs, or multi-node jobs. Unfortunately we don't have a mechanism to limit

Re: [Beowulf] OT, X11 editor which works well for very remote systems?

2018-06-07 Thread Skylar Thompson
vim can actually use SCP automatically using URL-style file paths: http://vim.wikia.com/wiki/Editing_remote_files_via_scp_in_vim On Thu, Jun 07, 2018 at 07:22:57AM +0800, Deng Xiaodong wrote: > In this case I would think SSH + nano/vi may be better choice as the data > transited is less.
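
The scp-URL trick looks like this (host and path are placeholders):

    # Edit a remote file over SSH; vim's netrw plugin handles scp:// URLs
    vim scp://user@remotehost//etc/motd
    # Note the double slash: scp://host//absolute/path vs. scp://host/path/relative/to/home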

Re: [Beowulf] batch systems connection

2018-05-29 Thread Skylar Thompson
For software for which there are no pre-built RPMs, I'm a big fan of using fpm to build the RPMs, rather than trying to write a .spec file by hand: https://github.com/jordansissel/fpm You can point fpm at a directory tree, and build a RPM (or .deb, or Solaris pkg, etc.) for it. There's options fo
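
A minimal fpm invocation along those lines, packaging a directory tree into an RPM (the package name, version, and paths are made up for illustration):

    # Package everything under ./build-root as a mytool-1.2.3 RPM rooted at /usr/local/mytool
    fpm -s dir -t rpm \
        -n mytool -v 1.2.3 \
        --prefix /usr/local/mytool \
        -C ./build-root .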

Re: [Beowulf] Pdsh output to multiple windows

2018-04-14 Thread Skylar Thompson
I'm not sure how to do this with pdsh, but I know Ansible can capture output per task. It doesn't get you output per window, though; at the point it's split up per host, you could write that to a named pipe per host and then read the output anywhere. Skylar On Sat, Apr 14, 2018 at 11:22 AM, Lux, Ji

Re: [Beowulf] Puzzling Intel mpi behavior with slurm

2018-04-05 Thread Skylar Thompson
At least for Grid Engine/OpenMPI the preferred mechanism ("tight integration") involves the shepherds running on each exec hosts to start MPI, without any SSH/RSH required at all. I'm not sure if you've run across this documentation, but it might help to figure out what's going on: https://slurm.s

Re: [Beowulf] Slow RAID reads, no errors logged, why?

2018-03-19 Thread Skylar Thompson
Could it be a patrol read, possibly hitting a marginal disk? We've run into this on some of our Dell systems, and exporting the RAID HBA logs reveals what's going on. You can see those with "omconfig storage controller controller=n action=exportlog" (exports logs in /var/log/lsi_mmdd.log) or an equ

Re: [Beowulf] Update on dealing with Spectre and Meltdown

2018-03-08 Thread Skylar Thompson
We installed the kernel updates when they became available. Fortunately we were a little slower on the firmware updates, and managed to rollback the few we did apply that introduced instability. We're a bioinformatics shop (data parallel, lots of disk I/O mostly to GPFS, few-to-no cross-communicati

Re: [Beowulf] Storage Best Practices

2018-02-19 Thread Skylar Thompson
For our larger groups, we'll meet with them regularly to discuss their space usage (and other IT needs). Even that's unlikely to be frequent enough, so we direct usage alerts to their designated "data manager" if they're getting close to running out of space. Aside from regularly clearing out scrat

Re: [Beowulf] Cluster Authentication (LDAP,NIS,AD)

2017-12-29 Thread Skylar Thompson
now extensively uses automounter maps for bind mounts. > I may well learn something useful here. > > On 28 December 2017 at 15:28, Skylar Thompson <skylar.thomp...@gmail.com> wrote: > > We are an AD shop, with users, groups, and automounter maps (for a shor

Re: [Beowulf] Cluster Authentication (LDAP,NIS,AD)

2017-12-28 Thread Skylar Thompson
We are an AD shop, with users, groups, and automounter maps (for a short while longer at least[1]) in the directory. I think once you get to around schema level 2003R2 you'll be using RFC2307bis (biggest difference from RFC2307 is that it supports nested groups) which is basically what modern Linux

Re: [Beowulf] Thoughts on git?

2017-12-19 Thread Skylar Thompson
90% of the battle is using a VCS to begin with. Whether that's SVN, git, Mercurial, etc. is somewhat irrelevant - just pick something with the features (and ease of use is a feature!) that you and your team need and stick with it. In my professional life, I've found SVN to suit my needs and be eas

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Skylar Thompson
I would also suspect a thermal issue, though it could also be firmware. To verify a temperature problem, you might try setting up lm_sensors or scraping "ipmitool sdr" output (whichever is easier) regularly and try to make a performance-vs-temperature plot for each node. As Andrew mentioned, it cou
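
A rough sketch of the scraping half of that idea, run from cron on each node (the sensor name and log path are assumptions and vary by platform):

    # Append a timestamped inlet-temperature reading to a per-node log
    ts=$(date +%s)
    temp=$(ipmitool sdr type Temperature | awk -F'|' '/Inlet/ {print $5}')
    echo "$ts $(hostname) $temp" >> /var/log/node-temps.log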

Re: [Beowulf] How to debug slow compute node?

2017-08-10 Thread Skylar Thompson
We ran into something similar, though it turned out being a microcode bug in the CPU that caused it to remain stuck in its lowest power state. Fortunately it was easily testable with "perf stat" so it was pretty clear which nodes were impacted, which also happened to be bought as a batch with a uni
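
A sketch of the "perf stat" check: effective clock speed falls out of cycles versus task-clock, so a node stuck in its lowest power state stands out (the busy-loop workload is just an example):

    # Run a fixed CPU-bound workload and report cycles and effective clock speed
    perf stat -e cycles,task-clock -- timeout 10 sh -c 'while :; do :; done'
    # perf prints an effective GHz figure next to the cycle count; compare it across nodes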

Re: [Beowulf] HPC and Licensed Software

2017-04-14 Thread Skylar Thompson
Back when we had software requiring MathLM/FlexLM, we just used NAT to get the cluster nodes talking to the licensing server. We also had a consumable in Grid Engine so that people could keep their jobs queued if there were no licenses available. Skylar On Fri, Apr 14, 2017 at 12:23 PM, Mahmood S

Re: [Beowulf] solaris?

2017-02-14 Thread Skylar Thompson
It has a minor role for us for storage (ZFS), but we're retiring our Solaris boxes as quickly as we can in favor of more GPFS. Skylar On 02/14/2017 01:28 PM, Michael Di Domenico wrote: > just out of morbid curiosity, does Solaris even have a stake in HPC > anymore? I've not heard boo about it in

Re: [Beowulf] Suggestions to what DFS to use

2017-02-13 Thread Skylar Thompson
Is there anything in particular that is causing you to move away from GPFS? Skylar On 02/12/2017 11:55 PM, Tony Brian Albers wrote: > Hi guys, > > So, we're running a small(as in a small number of nodes(10), not > storage(170TB)) hadoop cluster here. Right now we're on IBM Spectrum > Scale(GPF

Re: [Beowulf] clusters of beagles

2017-01-28 Thread Skylar Thompson
On 01/27/2017 12:14 PM, Lux, Jim (337C) wrote: > The pack of Beagles do have local disk storage (there's a 2GB flash on > board with a Debian image that it boots from). > > The LittleFe depends on the BCDD (i.e. "CD rom with cluster image", > actually a USB stick) which is the sort of thing I was

Re: [Beowulf] non-stop computing

2016-10-25 Thread Skylar Thompson
Assuming you can contain a run on a single node, you could use containers and the freezer controller (plus maybe LVM snapshots) to do checkpoint/restart. Skylar On 10/25/2016 11:24 AM, Michael Di Domenico wrote: > here's an interesting thought exercise and a real problem i have to tackle. > > i
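
Not full checkpoint/restart, but the freezer piece of that idea looks roughly like this with the cgroup v1 interface (the mount point, group name, and $JOB_PID are assumptions):

    # Freeze every task in a hand-made cgroup, then thaw it later
    mkdir -p /sys/fs/cgroup/freezer/myjob          # assumes the v1 freezer controller is mounted here
    echo $JOB_PID > /sys/fs/cgroup/freezer/myjob/tasks
    echo FROZEN > /sys/fs/cgroup/freezer/myjob/freezer.state
    # ... take an LVM snapshot of local scratch here if desired ...
    echo THAWED > /sys/fs/cgroup/freezer/myjob/freezer.state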

Re: [Beowulf] Generation of strings MPI fashion..

2016-10-08 Thread Skylar Thompson
If you haven't used MPI before, I would break this up into chunks: 1. Write a MPI program that gathers a fixed amount of /dev/urandom (Brian's suggestion is wise) data and sends it back to the master. This will get you used to the basic MPI_Send/MPI_Recv commands. 2. Use the same program, but use

Re: [Beowulf] User notification of new software on the cluster

2016-09-27 Thread Skylar Thompson
On 09/27/2016 04:42 PM, Christopher Samuel wrote: > On 28/09/16 06:02, Rob Taylor wrote: > >> I wanted to ask how people announce new versions of software that has >> been installed on their cluster to their user base. > > We're pretty lucky, most users know to use "module avail" to see what's >

Re: [Beowulf] Netapp EF540 password reset

2016-08-26 Thread Skylar Thompson
I don't have any info to help, but you might try asking on the lopsa-tech list as well. I know there's Netapp admins over there who might be able to help. Skylar On Friday, August 26, 2016, dave.b...@diamond.ac.uk wrote: > Hi All, > > We have a Netapp EF540 that has a management password set.

Re: [Beowulf] Demo-cluster ideas

2016-03-08 Thread Skylar Thompson
Thanks for the report back! The color-coding is a nifty idea. Skylar On 03/07/2016 08:44 AM, Olli-Pekka Lehto wrote: > First iteration of the mini-cluster is now in production. Some takeaways and > observations: > > We did the first deployment of the mini-cluster, called "Sisunen" last weekend >

Re: [Beowulf] Demo-cluster ideas

2016-02-01 Thread Skylar Thompson
Hi Olli-Pekka, When we have LittleFe (http://littlefe.net) out in the wild (which sounds a lot like what you're trying to do!), GalaxSee and Game of Life are two favorites: http://shodor.org/petascale/materials/UPModules/NBody/ http://shodor.org/petascale/materials/UPModules/GameOfLife/ They're

Re: [Beowulf] job scheduler and accounting question

2015-07-14 Thread Skylar Thompson
On 07/14/2015 04:17 PM, Stuart Barkley wrote: > On Tue, 14 Jul 2015 at 15:06 -, Joe Landman wrote: > >> Has the gridengine mess ever been sorted out? > > We are using Son of Grid Engine which > seems to be not-dead-yet. It does seem to rely heavily on just one or

Re: [Beowulf] Software installation across cluster

2015-05-26 Thread Skylar Thompson
We also use Environment Modules. One of the great things about Modules is that the modulefiles are written in Tcl, which means you can make them arbitrarily complex. We have a common header that's sourced for all of our module files that automatically sets common environment variables based on the

Re: [Beowulf] Cluster networking/OSs

2015-05-08 Thread Skylar Thompson
Hi Trevor, I'm another BCCD developer. Since we target lower-end clusters, we don't support Infiniband, although as a Debian-based distribution it wouldn't be hard to install. Most of the software we support is pedagogical in nature - N-body simulations, numerical methods, etc. Our emphasis is on

Re: [Beowulf] RAID question

2015-03-14 Thread Skylar Thompson
On 3/13/2015 5:52 PM, mathog wrote: A bit off topic, but some of you may have run into something similar. Today I was called in to try and fix a server which had stopped working. Not my machine, the usual sysop is out sick. The model is a Dell PowerEdge T320 with a Raid PERC H710P controller

Re: [Beowulf] Replacement for C3 suite from ORNL

2015-02-28 Thread Skylar Thompson
On 02/27/2015 02:34 PM, Fabricio Cannini wrote: > Hello all > > Does anybody know of a replacement for the C3 suite of commands ( cpower > ...) that is very common in SGI machines and used to be available here: > > http://www.csm.ornl.gov/torc/C3/ We've been using pdsh: https://code.google.com/
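
Typical pdsh usage for this kind of fan-out (the node range is hypothetical):

    # Run a command on a range of nodes in parallel, then collapse identical output
    pdsh -w node[001-010] uptime
    pdsh -w node[001-010] 'uname -r' | dshbak -c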

Re: [Beowulf] HPC demonstrations

2015-02-11 Thread Skylar Thompson
On 02/10/2015 02:52 AM, John Hearns wrote: > I am giving an internal talk tomorrow, a lightweight introduction to HPC. > > > > Can anyone suggest any demonstrations of HPC which I could give – > something visual? > > Or are there any videos I could pick up from somewhere? > > > > I had th

Re: [Beowulf] IPoIB failure

2015-01-27 Thread Skylar Thompson
On 01/27/2015 02:24 PM, Christopher Samuel wrote: > On 24/01/15 01:29, Lennart Karlsson wrote: > >> This reminds me of when we upgraded to SL-6.6 (approximately the same as >> CentOS-6.6 and RHEL-6.6). >> >> The new kernel we got, could not handle our IPoIB for storage traffic, >> which broke down

Re: [Beowulf] Python libraries slow to load across Scyld cluster

2015-01-17 Thread Skylar Thompson
On 01/16/2015 04:38 PM, Don Kirkby wrote: > Thanks for the suggestions, everyone. I've used them to find more > information, but I haven't found a solution yet. > > It looks like the time is spent opening the Python libraries, but my attempts > to change the Beowulf configuration files have not

Re: [Beowulf] Python libraries slow to load across Scyld cluster

2015-01-15 Thread Skylar Thompson
Do any of your search paths (PATH, PYTHONPATH, LD_LIBRARY_PATH, etc.) include a remote filesystem (i.e. NFS)? This sounds a lot like you're blocked on metadata lookups on NFS. Using "strace -c" will give you a histogram of system calls by count and latency, which can be helpful in tracking down the
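
The "strace -c" histogram mentioned above looks like this; the Python import is just an example workload matching the thread's subject:

    # Summarize system calls made while loading a library; large stat/open counts
    # and times usually point at path searches over NFS
    strace -c -f python -c 'import numpy' 2>&1 | tail -n 20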

Re: [Beowulf] Putting /home on Lusture of GPFS

2014-12-24 Thread Skylar Thompson
On 12/24/2014 06:44 AM, Michael Di Domenico wrote: > On Tue, Dec 23, 2014 at 6:43 PM, Christopher Samuel > wrote: >> On 24/12/14 05:35, Michael Di Domenico wrote: >> >>> I've always shied away from gpfs/lustre on /home and favoured netapp's >>> for one simple reason. snapshots. i can't tell you

Re: [Beowulf] SC14 website scheduling issues

2014-11-16 Thread Skylar Thompson
On 11/15/2014 11:12 PM, Novosielski, Ryan wrote: >> On Nov 16, 2014, at 00:48, Stuart Barkley wrote: >> >> Another note: Just like last time SC was in New Orleans, the shuttle >> bus operator at the airport has never heard of a convention coming >> into town and seems to run the same shuttle sched

Re: [Beowulf] LSI Megaraid stalls system on very high IO?

2014-08-05 Thread Skylar Thompson
On 08/04/2014 01:24 AM, John Hearns wrote: > Mark, you are of course correct. > Flush often and flush early! > > As an aside, working with desktop systems with larger amounts of memory > I would adjust the 'swappiness' tunable > and also the min_free_kbytes. > Min_free_kbytes in Linux is by defau

Re: [Beowulf] Gentoo in the HPC environment

2014-06-30 Thread Skylar Thompson
On 06/30/2014 05:34 PM, Christopher Samuel wrote: > On 01/07/14 10:27, Christopher Samuel wrote: > >> then all the applications are in /usr/local > > To quickly qualify that, our naming scheme is: > > /usr/local/$application/$version-$compiler/ > >

Re: [Beowulf] Small files

2014-06-16 Thread Skylar Thompson
On 06/13/2014 01:37 PM, Lux, Jim (337C) wrote: > I've always advocated using the file system as a database: in the sense of > "lots of little files, one for each data blob", where "data blob" is > bigger than a few bytes, but perhaps in the hundreds/thousands of bytes or > larger. > > 1) Rather th

Re: [Beowulf] Small files

2014-06-13 Thread Skylar Thompson
We've recently implemented a quota of 1 million files per 1TB of filesystem space. And yes, we had to clean up a number of groups' and individuals' spaces before implementing that. There seems to be a trend in the bioinformatics community for using the filesystem as a database. I think it's enabled

Re: [Beowulf] job scheduler and health monitoring system

2014-01-11 Thread Skylar Thompson
On 01/10/2014 12:36 PM, reza azimi wrote: > hello guys, > > I'm looking for a state of art job scheduler and health monitoring for > my beowulf cluster and due to my research I've found many of them which > made me confused. Can you help or recommend me the ones which are very > hot and they are

Re: [Beowulf] ZFS for HPC

2013-12-22 Thread Skylar Thompson
On 12/22/2013 6:51 PM, Christopher Samuel wrote: On 23/12/13 13:30, Skylar Thompson wrote: The one time we actually had corruption on-disk, ZFS was nice enough to tell us exactly which files were corrupted. This made it really easy to recover

Re: [Beowulf] ZFS for HPC

2013-12-22 Thread Skylar Thompson
On 12/22/2013 4:14 PM, Christopher Samuel wrote: Hi Andrew, On 26/11/13 23:10, Andrew Holway wrote: Does checksumming save our data? Well that will depend on whether your setup is to just detect corruption, or be able to correct it too. The on

Re: [Beowulf] BeoBash/SC13

2013-11-10 Thread Skylar Thompson
I'll be there, along with the rest of the LittleFe crew. Skylar

Re: [Beowulf] Heterogeneous, intermitent beowulf cluster administration

2013-09-26 Thread Skylar Thompson
On 09/26/2013 06:25 AM, Gavin W. Burris wrote: > Hi, Ivan. > > I'm a nay-sayer in this kind of scenario. I believe your staff time, > and the time of your lab users, is too valuable to spend on > dual-classing desktop lab machines. I'm with Gavin here - hardware has gotten too cheap for this to

Re: [Beowulf] zfs

2013-09-14 Thread Skylar Thompson
On 9/14/2013 3:52 PM, Andrew Holway wrote: > Hello, > > Anyone using ZFS in production? Stories? challenges? Caveats? > > I've been spending a lot of time with zfs on freebsd and have found it > thoroughly awesome. > We have a bunch of ZFS-based storage systems, all running Solaris, and falling i

Re: [Beowulf] Strange "resume" statements generated for GRUB2

2013-06-10 Thread Skylar Thompson
On 06/10/2013 10:35 AM, Hearns, John wrote: > > > >> Taking into account small size of my swap partition (4GB only), less > than my RAM size, >> (I wrote about this situation in my 1st message) the hibernation image > may not fit into swap partition. Therefore coding of -part2 (for /) in > resum

Re: [Beowulf] Strange "resume" statements generated for GRUB2

2013-06-09 Thread Skylar Thompson
On 06/09/2013 11:37 AM, Mikhail Kuzminsky wrote: > I have swap in sda1 and "/" in sda2 partitions of HDD. At installation > of OpenSUSE 12.3 (where YaST2 is used) on my cluster node I found > erroneous, by my opinion, boot loader (GRUB2) settings. > > YaST2 proposed (at installation) to use > ...

Re: [Beowulf] Prevention of cpu frequency changes in cluster nodes (Was : cpupower, acpid & cpufreq)

2013-06-09 Thread Skylar Thompson
On 06/09/2013 11:35 AM, Mikhail Kuzminsky wrote: > I installed OpenSuSE 12.3/x86-64 now. I may now say about the reasons > why I am afraid of loading of cpufreq modules. > > 1) I found in /var/log/messages pairs of strings about governor like > > [kernel] "cpuidle: using governor ladder" > [kerne

Re: [Beowulf] why we need cheap, open learning clusters

2013-05-12 Thread Skylar Thompson
On 05/12/2013 10:55 AM, Lux, Jim (337C) wrote: > > This is why I think things like ArduWulf or, more particularly > LittleFE, are valuable. And it's also why nobody should start > packaging LittleFE clusters in an enclosure. Once all those mobos are > in a box with walls, it starts to discourag

Re: [Beowulf] El Reg: AMD reveals potent parallel processing breakthrough

2013-05-12 Thread Skylar Thompson
On 05/12/2013 07:07 AM, Lux, Jim (337C) wrote: > I think that if we want people to design and fix automobile and jet > engines, it is a wise thing to start them with lawnmower and moped engines > first, rather than have their first hands on experience be with a > hypersonic SCRAMjet burning hydrazi

Re: [Beowulf] El Reg: AMD reveals potent parallel processing breakthrough

2013-05-11 Thread Skylar Thompson
On 05/11/2013 10:39 AM, Lux, Jim (337C) wrote: > > > Hard to beat $19/node plus the cost of some wire and maybe a USB hub to > talk to them all. http://www.pjrc.com/store/teensy.html > rPi is in the same price range > > > > So, for, say, $200-300, you could give a student a platform with 8-10 > nod

Re: [Beowulf] Clustering VPS servers

2013-03-24 Thread Skylar Thompson
On 3/24/2013 10:25 AM, Geoffrey Jacobs wrote: > On 03/24/2013 01:56 AM, Jonathan Aquilina wrote: >> What I am not understanding is the difference between using a monolithic >> style kernel with everything compiled in vs. modules. Is there a lower >> memory footprint if modules are used. > Yes, if e

Re: [Beowulf] Configuration management tools/strategy

2013-01-06 Thread Skylar Thompson
On 01/06/2013 05:38 AM, Walid wrote: > Dear All, > > At work we are starting to evaluate Configuration management to be > used to manage several diverse hpc clusters, and their diverse node > types. I wanted to see what are other admins, and HPC users

Re: [Beowulf] Maker2 genomic software license experience?

2012-11-08 Thread Skylar Thompson
On 11/08/2012 06:10 AM, Tim Cutts wrote: > > On 8 Nov 2012, at 13:52, Skylar Thompson > wrote: > >> I guess if your development time is sufficiently shorter than >> the equivalent compiled code, it could make sense. > &

Re: [Beowulf] Maker2 genomic software license experience?

2012-11-08 Thread Skylar Thompson
On 11/08/12 02:35, Tim Cutts wrote: > On 8 Nov 2012, at 10:10, Andrew Holway wrote: > > >> It's all a bit academic now (ahem) as the MPI component is a Perl >> program, and Perl isn't supported on BlueGene/Q. :-( >> >> huh? perl mpi? >> >> Interpreted language? High performance message passing

Re: [Beowulf] Torrents for HPC

2012-06-12 Thread Skylar Thompson
On 06/12/2012 03:42 PM, Bill Broadley wrote: > Using MPI does make quite a bit of sense for clusters with high > speed interconnects. Although I suspect that being network bound > for IO is less of a problem. I'd consider it though, I do have > sdr/d

Re: [Beowulf] Torrents for HPC

2012-06-11 Thread Skylar Thompson
On 6/8/2012 5:06 PM, Bill Broadley wrote: > > I've built Myrinet, SDR, DDR, and QDR clusters ( no FDR yet), but I > still have users whose use cases and budgets still only justify GigE. > > I've setup a 160TB hadoop cluster is working well, but haven't found > justification for the complexity/c

Re: [Beowulf] rear door heat exchangers

2012-02-01 Thread Skylar Thompson
at full bore. It's pretty unpleasant standing behind them, though.

Re: [Beowulf] Users abusing screen

2011-10-27 Thread Skylar Thompson
ers to do the right thing. Mostly it works, but sometimes we do need to bring out the LART stick.

Re: [Beowulf] Users abusing screen

2011-10-22 Thread Skylar Thompson
r to use script or nohup I guess I'm with Andrew, where the first thing I do upon logging in is either connecting to an existing screen session or starting a fresh one.

Re: [Beowulf] cluster scheduler for dynamic tree-structured jobs?

2010-05-15 Thread Skylar Thompson
$SGE_CELL/local_conf. I'm not sure about BDB though.

Re: [Beowulf] cluster scheduler for dynamic tree-structured jobs?

2010-05-15 Thread Skylar Thompson
reasonable data model for that stuff. Thanks in advance for your help and advice! SGE does this and can make it available as XML.

Re: [Beowulf] which 24 port unmanaged GigE switch?

2010-04-05 Thread Skylar Thompson
switches at my last job with the two GBIC as uplinks bonded together with LACP. We probably burned out all four GBICs every year, and although Netgear was happy to continue replacing them it was certainly annoying.

Re: [Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps?

2010-01-31 Thread Skylar Thompson
into ... if anything. > > Thanks! We're running at OFED 1.4 for our GPFS cluster, with RDMA used for data and IPoIB used for metadata and backups. We're looking at an upgrade to 1.5 so if you do find anything out I'd be very interested in knowing.

Re: [Beowulf] Need some advise: Sun storage' management server hangs repeatedly

2010-01-13 Thread Skylar Thompson
dly. As an initial troubleshooting > we installed Ganglia, to check network utilization. But it's normal. > We're not getting how to troubleshoot it and resolve the problem. Can > anybody help us resolve this issue? Is there anything amiss according to the service pro

Re: [Beowulf] are compute nodes always kept in a private I/P and switch space?

2010-01-13 Thread Skylar Thompson
l with CIFS/NFS so it's just easier giving the nodes fully-routeable IP addresses.

Re: [Beowulf] PERC 5/E problems

2009-12-31 Thread Skylar Thompson
RAID... I can't speak to that card specifically, but Dell in the past did sneaky things like calling a system "RAID-capable", but in order to make it actually do RAID you'd have to buy a hardware key or daughter card at some inflated price.

Re: [Beowulf] PERC 5/E problems

2009-12-31 Thread Skylar Thompson
emember if the Dell BIOS has this option, but some BIOSs allow you to clear the PCI bus cache. That will trigger a full rescan of all the cards that are attached and could get it listed in the boot process again. If the BIOS doesn't have that option, you could try setting the BIOS cl

Re: [Beowulf] RAID for home beowulf

2009-10-03 Thread Skylar Thompson
t /boot using the /dev/md? device, but point your boot loader at one of the underlying /dev/sd? or /dev/hd? devices. This means updates get mirrored, but the boot loader itself only looks at one of the mirrors.
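
A sketch of that arrangement with mdadm and GRUB (device names, filesystem choice, and GRUB version details are assumptions):

    # /boot lives on a RAID1 md device, so kernel updates land on both members...
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /boot
    # ...but the boot loader itself is installed against one of the underlying disks
    grub-install /dev/sda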
