Hi all,

Thought some of you might find this interesting.

Using the WGS (aka CA, aka Celera) genome assembler there is a step that runs a large number of overlap comparison jobs (in this instance, 47634). There are N sequences (many millions, of three different types); the step carves them into many sequence ranges and compares the ranges pairwise, e.g. 100-200 vs. 1200-1300. A job scheduler keeps 40 jobs going at all times, but during a run the jobs are independent: they do not communicate with each other or with the job controller.

The initial observation was that "top" showed a very nonrandom distribution of elapsed times: large numbers of jobs (20 or 30) appeared to have correlated elapsed times. So the end times for the jobs were determined and stored in a histogram with 1-minute-wide bins. When plotted, it shows the job end times clumping up, along with what could be beat frequencies. I did not run this through any sort of autocorrelation analysis, but the patterns are easily seen by eye; see for instance the region around minutes 6200-6400. The patterns evolve over time, possibly because of differences in the regions of data. (Note: a script was changed around minute 2738, so don't compare patterns before that with patterns after it.)

The jobs were all running single-threaded and were pretty much nailed at 99.9% CPU usage except when they started up or shut down. Each wrote its output through a gzip process to a compressed file, and they all seemed to be writing more or less all the time. However, the gzip processes used a negligible fraction of the CPU time.

That histogram data is in end_times_histo.txt.gz, in the 6th or so post here:

   https://github.com/alekseyzimin/masurca/issues/45

The subrange data for the jobs is in ovlopt.gz.
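
For anyone who wants to check the periodicity more rigorously than by eye, a quick autocorrelation along these lines would do it. This is only a sketch: it assumes the decompressed histogram is plain text with one count per line (one 1-minute bin per line), which may not match the actual layout of the file.

    #!/usr/bin/env python
    # Sketch: autocorrelation of the job-end-time histogram.
    # Assumes plain text, one count per line, one 1-minute bin per line;
    # adjust the parsing if the real file is laid out differently.
    import sys
    import numpy as np

    counts = np.loadtxt(sys.argv[1])        # jobs ending in each 1-minute bin
    counts = counts - counts.mean()         # remove the mean so peaks stand out

    # Normalized autocorrelation; a peak at lag k means the end-time clumps
    # tend to recur every k minutes.
    acf = np.correlate(counts, counts, mode="full")
    acf = acf[acf.size // 2:] / acf[acf.size // 2]

    for lag, r in enumerate(acf[:120]):     # first two hours of lags
        print("%4d min  %+.3f" % (lag, r))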

So, the question is, what might be causing the correlation of the job run times?

The start times were also available, and they do not indicate any induced "binning". That is, the controlling process isn't waiting for a long interval to pass and then starting a bunch of jobs all at once. Probably it spins on a wait() with a 1 second sleep() [it uses essentially no CPU time] and starts the next job as soon as one exits.
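
A minimal sketch of the sort of controller loop I'm imagining (this is a guess at the behaviour, not the actual scheduler code, and overlap_job.sh is just a stand-in for whatever it really launches):

    #!/usr/bin/env python
    # Hypothetical polling scheduler: keep MAX_JOBS running, check once a
    # second, and start a replacement as soon as any job exits.
    import subprocess
    import time

    MAX_JOBS = 40
    pending = ["./overlap_job.sh %d" % i for i in range(47634)]   # stand-in commands
    running = []

    while pending or running:
        running = [p for p in running if p.poll() is None]    # drop finished jobs
        while pending and len(running) < MAX_JOBS:
            running.append(subprocess.Popen(pending.pop(0), shell=True))
        time.sleep(1)    # why the controller shows essentially zero CPU time

A scheduler like this starts jobs at whatever instants earlier jobs happen to end, so by itself it shouldn't impose any binning on the start times.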

One possibility is that at the "leading" edge the first job to read a section of data does so slowly, while later jobs take the same data out of the page cache. That would lead to a "peloton" sort of effect, where the leader is slowed and the followers accelerated. iostat didn't show very much disk IO though.

Another possibility is that the jobs are fighting over the memory caches (each process is many GB in size) and that the contention somehow also syncs them.

My last guess is that the average run times in a given section of data may be fairly constant, and that with a bit of drift in some parts of the run they became synchronized by chance. The extent of synchronization seems too high for that, though: around minute 6500 half the jobs are ending at about the same time, and it stayed that way for around 1000 minutes.
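
A toy model along these lines (all numbers invented, just to make that hypothesis concrete) is one way to check how much clumping chance drift alone can produce with this kind of keep-N-running scheduler:

    #!/usr/bin/env python
    # Toy model: 40 slots, job durations nearly constant plus a little jitter,
    # each slot starting its next job the instant the previous one ends.
    # DURATION and JITTER are invented; vary them and see how much the
    # end times clump in a 1-minute histogram.
    import random

    random.seed(1)
    SLOTS, DURATION, JITTER, END = 40, 30.0, 0.5, 6000.0   # minutes

    ends = []
    clock = [random.uniform(0, DURATION) for _ in range(SLOTS)]  # staggered first end times
    while True:
        i = clock.index(min(clock))          # slot whose job finishes next
        t = clock[i]
        if t >= END:
            break
        ends.append(t)
        clock[i] = t + DURATION + random.gauss(0, JITTER)   # next job in that slot

    # 1-minute bins, same as the real histogram; print the last hour.
    histo = {}
    for t in ends:
        histo[int(t)] = histo.get(int(t), 0) + 1
    for minute in range(int(END) - 60, int(END)):
        print("%5d  %s" % (minute, "#" * histo.get(minute, 0)))

With independent slots and roughly uniform phases, getting half of 40 slots into the same 1-minute bin by chance is vanishingly unlikely, which is part of why the chance-drift explanation feels inadequate to me.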

Is this sort of thing common? What else could cause it?

System info: Dell PowerEdge T630, CentOS 6.9, two Xeon E5-2650 CPUs with 10 cores/CPU and 2 threads/core for 40 logical "CPUs", NUMA with even-numbered CPUs on node0 and odd on node1, 512 GB RAM, RAID5 with 4 disks for 11.7 TB.

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
