On Wed, Mar 26, 2014 at 12:54:12PM +0000, Pádraig Brady wrote:
> On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> > For each file inside the directory $output, I do a cat to the file and
> > generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> > the total data of 556MB and generate the digests. Is there a way to make
> > this script faster? Maybe generate digests in parallel?

First, determine where the bottleneck is.  Is it CPU power to run the
hash command?  Or is it I/O (network? disk?) to read the files?

> > for path in $output
> > do
> >     # sha256sum
> >     digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum |
> > awk '{ print $1 }')
> >     (( count ++ ))
> > done

If $output is actually the name of a directory, then your syntax is
somewhat off.  It should be:

for file in "$output"/*; do
    digests[count++]=$( ... "$file" ... )
done

I wouldn't use "output" as the name of a variable that holds a directory,
though.  That's confusing.

I don't know what a hadoop is, or an hdfs, or a dfs... in any case, you
do not appear to be "catting to" a file.  The files appear to be some kind
of input, not output (appended or overwritten).

If there are only 105 input files, and therefore 105 loop iterations, then
optimizing the bashy parts of the code to reduce forks isn't likely to do
very much.  Supposing you removed the awk, that would only save you 105
forks, which is not likely to be noticeable (we're talking milliseconds
here) when the whole loop takes 9 minutes.

> Off the top of my head I'd do something like the following to get xargs to
> parallelize:

Running multiple hdfs-whatevers in parallel may make the problem worse, if
that's where the bottleneck is.

Running multiple sha256sums in parallel would only help if the computer
has multiple CPU cores, and if the CPU happens to be the bottleneck here.
If it's a single-core machine, running multiple CPU-heavy processes in
parallel would just make it worse, because you'd introduce a whole bunch
of extra context switching.

Really, we don't have anywhere near enough information about the problem
to give a solution.  We can only give suggestions.

And this should be on help-bash, not bug-bash.  I've Cc'ed the former.
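
One rough way to tell the two costs apart (reading the files versus
hashing them) is to time a read-only pass and then a read-plus-hash pass
over the same files.  The sketch below is only an illustration: the names
read_cmd and time_stages are made up for it, and read_cmd defaults to
plain cat so it can be tried on local files.  For the script in question
it would presumably be set to the hdfs command from the original loop.

```shell
#!/usr/bin/env bash
# Sketch only: read_cmd and time_stages are hypothetical names.
# read_cmd is the command that writes one file to stdout.  For the
# original script it would presumably be:
#   read_cmd=("$HADOOP_HOME/bin/hdfs" dfs -cat)
read_cmd=(cat)

# time_stages FILE...
# Prints the whole seconds spent reading only, then reading and hashing.
time_stages() {
    local f start mid end

    start=$SECONDS
    for f in "$@"; do
        "${read_cmd[@]}" "$f" > /dev/null              # read, discard
    done
    mid=$SECONDS

    for f in "$@"; do
        "${read_cmd[@]}" "$f" | sha256sum > /dev/null  # read and hash
    done
    end=$SECONDS

    printf 'read only: %ds\nread+hash: %ds\n' \
        "$((mid - start))" "$((end - mid))"
}
```

Calling it as time_stages "$dir"/* (with dir being whatever holds the
files) prints two numbers.  If they are close, most of the time goes
into reading, and more sha256sum processes are unlikely to help.  If the
second is much larger, hashing is the expensive part.  Two caveats:
SECONDS only has one-second resolution, and the second pass may be
served from a cache, which would make reading look cheaper than it was
in the first pass.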