Camaleón put forth on 1/12/2011 3:56 AM:
> On Tue, 11 Jan 2011 15:58:45 -0600, Stan Hoeppner wrote:
>
>> Camaleón put forth on 1/11/2011 9:38 AM:
>>
>>> I supposed you wouldn't care much about getting a script to run
>>> faster with all the available cores "occupied" if you had a modern
>>> (<4 years) cpu and plenty of speedy ram, because the routine you
>>> wanted to run should not take much time... unless you were going to
>>> process "thousands" of images :-)
>>
>> That's a bit ironic.  You're suggesting the solution is to upgrade to
>> a new system with a faster processor and memory.
>
> Why did you get that impression? No, I said I thought you were running
> a resource-scarce machine, so in order to simulate your environment I
> made the tests under my VM... nothing more.
My bad, Camaleón.  I misunderstood what you said.  My apologies.

>> However, all the newer processors have 2, 4, 6, 8, or 12 cores.  So
>> upgrading simply for single process throughput would waste all the
>> other cores, which was the exact situation I found myself in.
>
> But of course! I would not even think of upgrading the whole computer
> just to get one concrete task done a few seconds faster.

This depends on the task, of course.  In my case it just wouldn't make
sense, just as you say.  I've managed some systems that we'd upgrade
every two years because of a single application that never seemed to
have enough horsepower under the hood.  HPC compute centers seem to
follow this trend: for many of them there are never enough cycles or
enough nodes.

>> The ironic part is that parallelizing the script to maximize
>> performance on my system will also do the same for the newer chips,
>> but to an even greater degree on those with 4, 6, 8, or 12 cores.
>> Given that convert doesn't eat 100% of a core's time during its run,
>> and given the idle time between one process finishing and xargs
>> starting another, one could probably run 16-18 parallel convert
>> processes on a 12 core Magny Cours with this script before run times
>> stop decreasing.
>
> I think the script should also work very well with single-core cpus.

That might depend on the hardware, but as I mentioned, it looks like
the convert program doesn't use 100% CPU during its run.  So yes, even
on a single core, using the xargs script to fire up two concurrent
convert processes, with the kernel time slicing between them, would
probably decrease overall run time to some degree.

> Yeah, and tests are there to demonstrate the gain.

Which is always a big plus.  No guesswork. :)

>> I had run 4 processes (2 core machine) and run time was a few seconds
>> faster than with 2 processes, 3 seconds IIRC.  Running 8 processes
>> pushed the system into swap and run time increased dramatically.
>> Given that 4 processes were only a few seconds faster than two, yet
>> consumed twice as much memory, the best overall number of processes
>> to run on this system is two.
>
> Maybe the "best number of processes" is system-dependent (old
> processors could work better with a conservative value but newer ones
> can get some extra seconds with a higher one and without experiencing
> any significant penalty).

I don't have the machines here to confirm that hypothesis, but
knowledge and experience tell me you're exactly correct.  The reasons
are tied mostly to available L2/L3 cache bandwidth, and to memory size
and bandwidth.  On my SUT, one convert process at its peak easily
consumes more than half the memory bandwidth, which is why I see only
about a 50% speedup running 2 processes, one on each CPU, instead of
the ideal 100%.  Each 500 MHz Celeron CPU has only 128KB of L2 cache,
and system memory bandwidth of the 440BX chipset is only 800 MB/s.
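For anyone who wants to try this at home, here's a rough sketch of the
kind of xargs pipeline we've been discussing.  It assumes GNU xargs and
ImageMagick's convert; the resize geometry and the "small" output
directory are placeholders, not the exact values from my script:

  # run up to 2 convert processes at once (-P 2); raise -P to match
  # however many concurrent converts your cpu and memory can feed
  mkdir -p small
  printf '%s\0' *.jpg |
      xargs -0 -P 2 -I{} convert {} -resize 1024x768 small/{}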
Depending on the size of the photos one is converting, if they're
relatively small like my 8.3MP 1.8MB jpegs, I'd think something like a
dual core Phenom II X2 with 6MB of L3 cache and 21.4 GB/s of memory
bandwidth would likely continue to scale, i.e. keep reducing overall
script run time, up to 4 parallel convert processes, maybe more, due to
the "excess" of L3 cache and the 10.7 GB/s available to each core.
Conversely, I'd think that a quad core Athlon II X4 with no L3 cache
and only 512KB of L2 cache per core, with each core receiving
effectively only 5.3 GB/s of bandwidth, would not scale to
core_count*2 parallel processes as effectively as the Phenom II X2
would.  In fact, with 4 cores and little cache sharing the same 21.4
GB/s of memory bandwidth, the quad core Athlon II would probably see
the run time gains start to decline going from 2 processes to 4, as
twice as many cores compete for memory access, and tail off
dramatically as the process count is increased to 5 and up.  Just a
guess.  Anyone have such systems to test with? :)
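If anyone does want to test, something along these lines (bash, same
placeholder pipeline as the sketch above) would show where run time
stops improving as the process count climbs:

  # time one full conversion run at each process count
  for procs in 1 2 4 6 8; do
      rm -rf small && mkdir small
      echo "== $procs parallel convert processes =="
      time ( printf '%s\0' *.jpg |
             xargs -0 -P "$procs" -I{} convert {} -resize 1024x768 small/{} )
  done

-- 
Stan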