improve performance of a script
For each file inside the directory $output, I do a cat to the file and generate a sha256 hash. This script takes 9 minutes to read 105 files, with the total data of 556MB and generate the digests. Is there a way to make this script faster? Maybe generate digests in parallel?

    for path in $output
    do
        # sha256sum
        digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | awk '{ print $1 }')
        (( count ++ ))
    done

Thanks,
Re: improve performance of a script
On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> the total data of 556MB and generate the digests. Is there a way to make this
> script faster? Maybe generate digests in parallel?
>
> for path in $output
> do
> # sha256sum
> digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum |
> awk '{ print $1 }')
> (( count ++ ))
> done

This is not a bash question, so please ask on a more appropriate user-oriented (rather than developer-oriented) list in future.

Off the top of my head I'd do something like the following to get xargs to parallelize:

    digests=( $(
      find "$output" -type f |
      xargs -I '{}' -n1 -P$(nproc) \
      sh -c "$HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum" |
      cut -f1 -d' '
    ) )

You might want to distribute that load across systems too with something like dxargs or perhaps something like hadoop :p

thanks,
Pádraig.
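Editor's sketch of the parallel idea above, offered as one possible shape and not a tested Hadoop recipe. Assumptions: FETCH_CMD is a hypothetical stand-in for "$HADOOP_HOME/bin/hdfs dfs -cat" (it defaults to plain cat so the sketch runs on local files), digest_parallel is a made-up function name, and the files live in a local directory as in the reply above. Because parallel jobs finish in arbitrary order, each output line carries its own path so a digest can still be matched to its file. The filename is passed as a positional argument to sh instead of being substituted into the command string.

```shell
# FETCH_CMD is an assumption: a stand-in for "$HADOOP_HOME/bin/hdfs dfs -cat".
# It is expanded unquoted below, so it may contain arguments but no
# spaces inside a single path component.
FETCH_CMD=${FETCH_CMD:-cat}

# digest_parallel DIR [JOBS]
# Prints one "<sha256>  <path>" line per regular file under DIR.
digest_parallel() {
    local dir=$1 jobs=${2:-4}
    # -print0 / -0 keep unusual filenames intact; -P runs JOBS workers at once.
    find "$dir" -type f -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c '
            [ -n "$1" ] || exit 0              # nothing to hash
            sum=$($0 "$1" | sha256sum) || exit 1
            printf "%s  %s\n" "${sum%% *}" "$1"
        ' "$FETCH_CMD"
}
```

A call such as digest_parallel "$output" "$(nproc)" would use one worker per core. Whether that helps depends on where the time actually goes, which the thread has not established.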
Re: improve performance of a script
On Wed, Mar 26, 2014 at 12:54:12PM +, Pádraig Brady wrote:
> On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> > For each file inside the directory $output, I do a cat to the file and
> > generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> > the total data of 556MB and generate the digests. Is there a way to make
> > this script faster? Maybe generate digests in parallel?

First, determine where the bottleneck is. Is it CPU power to run the hash command? Or is it I/O (network? disk?) to read the files?

> > for path in $output
> > do
> > # sha256sum
> > digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum |
> > awk '{ print $1 }')
> > (( count ++ ))
> > done

If $output is actually the name of a directory, then your syntax is somewhat off. It should be:

    for file in "$output"/*; do
        digests[count++]=$( ... "$file" ... )
    done

I wouldn't use "output" as the name of a variable that holds a directory, though. That's confusing.

I don't know what a hadoop is, or an hdfs, or a dfs... in any case, you do not appear to be "catting to" a file. The files appear to be some kind of input, not output (appended or overwritten).

If there are only 105 input files, and therefore 105 loop iterations, then optimizing the bashy parts of the code to reduce forks isn't likely to do very much. Supposing you removed the awk, that would only save you 105 forks, which is not likely to be noticeable (we're talking milliseconds here) when the whole loop takes 9 minutes.

> Off the top of my head I'd do something like the following to get xargs to
> parallelize:

Running multiple hdfs-whatevers in parallel may make the problem worse, if that's where the bottleneck is. Running multiple sha256sums in parallel would only help if the computer has multiple CPU cores, and if the CPU happens to be the bottleneck here. If it's a single-core machine, running multiple CPU-heavy processes in parallel would just make it worse, because you'd introduce a whole bunch of extra context switching.

Really, we don't have anywhere near enough information about the problem to give a solution. We can only give suggestions.

And this should be on help-bash, not bug-bash. I've Cc'ed the former.
Re: improve performance of a script
(I forgot to CC the list in my first reply)

On Tue, Mar 25, 2014 at 07:12:16AM -0700, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> the total data of 556MB and generate the digests. Is there a way to make this
> script faster? Maybe generate digests in parallel?
>
> for path in $output
> do
> # sha256sum
> digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum |
> awk '{ print $1 }')
> (( count ++ ))
> done
>
>
> Thanks,

You were already told in #bash at Freenode that this is not a bash issue, and yet you report it as a bug.

Once bash runs the commands, it has nothing to do with their performance. Rather, ask the Hadoop people and also maybe the support for your operating system to see what you can do to optimize that.

Maybe it cannot be optimized... it depends on what the bottleneck is (disk, network, etc.)

-- 
Eduardo Alan Bustamante López
ls doesn't work in if statements in bash 4.3
I tested on bash 4.3 and 3.0

testing]$ bash --version
bash --version
GNU bash, version 4.3.0(1)-release (x86_64-unknown-linux-gnu)

In a directory I have:

testing]$ ls -l
total 16
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog1
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog2
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog3
-rw-r--r-- 1 hpierce hpierce  0 Mar 26 20:07 dog4
-rwxr-xr-x 1 hpierce hpierce 80 Mar 26 20:02 test

dog1, dog2, and dog3 have content. dog4 is empty.

test is a simple script:

testing]$ cat test
#!/bin/bash
FILE=$1
echo $FILE
if [ ! -e $FILE ]
then
    echo "Usage: $0 "
    exit 1
else
    echo $FILE exists
fi

Here's a regular run:

testing]$ for f in *; do ./test $f; done
dog1
dog1 exists
dog2
dog2 exists
dog3
dog3 exists
dog4
dog4 exists
test
test exists

Now I add a ls:

testing]$ for f in `ls dog*`; do ./test $f; done
dog1
Usage: ./test
dog2
Usage: ./test
dog3
Usage: ./test
dog4
Usage: ./test

So I moved it to an earlier version of bash

testing]$ bash --version
bash --version
GNU bash, version 3.00.15(1)-release (x86_64-redhat-linux-gnu)

testing]$ for f in `ls dog*`; do ./test $f; done
dog1
dog1 exists
dog2
dog2 exists
dog3
dog3 exists
dog4
dog4 exists
Re: ls doesn't work in if statements in bash 4.3
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> testing]$ for f in `ls dog*`; do ./test $f; done

I thought about the changes I have made recently and I had added the following into my .bashrc:

    eval $(dircolors -b ~/.dir_colors)

I commented it out, and now everything works. I think it's still a bug, though I know how to fix it.
Re: ls doesn't work in if statements in bash 4.3
On Wed 26 Mar 2014 17:45:33 billyco...@gmail.com wrote:
> I thought about the changes I have made recently and I had added the
> following into my .bashrc:
>
> eval $(dircolors -b ~/.dir_colors)
>
> I commented it out, and now everything works. I think it's still a bug,
> though I know how to fix it.

doubtful the problem is bash. if ls is writing control codes, then bash will treat them as part of the filename. you can verify by piping the output through hexdump and seeing what shows up.

this is a good example though of why using `ls` is almost always the wrong answer. use unadorned globs:

    for f in dog*; do ...

i'd point out that if any of the files in your dir have whitespace, your code would also break:

    touch 'dog a b c'
    for f in `ls dog*`; do ...

-mike
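Editor's sketch of the check suggested above. It does not run a colorizing ls; it simulates, with printf, the kind of escape sequences such an ls might write around a name (the exact codes are an assumption and depend on the LS_COLORS setting). od -c is used for the byte dump because it is available wherever hexdump might not be. The names scratch, colored and check are made up for the illustration.

```shell
# Scratch directory holding one real file named dog1.
scratch=$(mktemp -d)
: > "$scratch/dog1"

# Simulated colorized name (assumed codes: bold green, then reset).
colored=$(printf '\033[01;32mdog1\033[0m')

# On a terminal this would display as plain "dog1"; the dump shows
# the extra escape bytes that are really part of the string.
printf '%s\n' "$colored" | od -c

# check PATH: report whether PATH names an existing file.
check() {
    if [ -e "$1" ]; then echo "exists"; else echo "missing"; fi
}

check "$scratch/$colored"                        # prints: missing
for f in "$scratch"/dog*; do check "$f"; done    # prints: exists
```

The glob is expanded by the shell itself, so nothing can decorate the names on the way into the loop.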
Re: ls doesn't work in if statements in bash 4.3
This is a "user" problem. You are using the wrong features for the task; your code should read:

| for f in *; do ./test "$f"; done

and quote all other variable expansions. NEVER do:

    for foo in `...`

OR

    for foo in $(...)

This is wrong, because you're relying on word splitting and glob expansion, which is wrong in 99% of cases. Read:

- http://mywiki.wooledge.org/WordSplitting
- http://mywiki.wooledge.org/Quotes

It is totally not a bash bug (read the manual, it's documented).

On Wed, Mar 26, 2014 at 05:30:12PM -0700, billyco...@gmail.com wrote:
> testing]$ for f in `ls dog*`; do ./test $f; done

-- 
Eduardo Alan Bustamante López
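Editor's sketch of the word-splitting problem described above, using a filename that contains spaces. The function names count_split and count_glob are made up; each simply counts how many times its loop body runs. The sketch assumes the temporary directory path itself contains no whitespace.

```shell
# The discouraged form: loop over the output of a command substitution.
count_split() {
    n=0
    for f in $(ls "$1"/dog*); do n=$((n + 1)); done
    echo "$n"
}

# The recommended form: loop over the glob itself.
count_glob() {
    n=0
    for f in "$1"/dog*; do n=$((n + 1)); done
    echo "$n"
}

dir=$(mktemp -d)
touch "$dir/dog a b c"

count_split "$dir"   # prints 4: one filename was split into four words
count_glob "$dir"    # prints 1: the filename stays in one piece
```

Quoting the loop variable when it is used ("$f") matters for the same reason: an unquoted expansion is split again.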
Re: ls doesn't work in if statements in bash 4.3
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> testing]$ for f in `ls dog*`; do ./test $f; done

The original intention was to play a randomly sorted list of songs (except ABBA):

    for f in `ls *.mp3 | grep -v abba | sort -R`
    do
        mplayer $f
    done

what I can still do instead is:

    for f in `find . -name "*.mp3" | grep -v abba | sort -R`
    do
        mplayer $f
    done
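Editor's sketch of a shuffled playlist that follows the advice given elsewhere in this thread (globs instead of ls, quoted expansions). Assumptions: PLAYER is a hypothetical stand-in for mplayer and defaults to echo so the sketch is harmless to run, play_shuffled is a made-up name, the exclusion is a case-sensitive substring match on the file name, and shuf -z from GNU coreutils is available. This is bash-specific because it uses arrays and read -d.

```shell
#!/bin/bash
# PLAYER is an assumption: a single command name standing in for mplayer.
PLAYER=${PLAYER:-echo}

# play_shuffled DIR SKIP
# Plays every DIR/*.mp3 whose name does not contain SKIP, in random order.
play_shuffled() {
    local dir=$1 skip=$2 f
    local -a songs=() shuffled=()

    for f in "$dir"/*.mp3; do
        [ -e "$f" ] || continue          # the glob matched nothing
        case ${f##*/} in
            *"$skip"*) continue ;;       # leave out names containing SKIP
        esac
        songs+=("$f")
    done
    [ "${#songs[@]}" -gt 0 ] || return 0

    # NUL-delimited shuffle, safe for names with spaces or newlines.
    # The list is read back into an array first so that the player
    # does not inherit the pipe as its standard input.
    while IFS= read -r -d '' f; do
        shuffled+=("$f")
    done < <(printf '%s\0' "${songs[@]}" | shuf -z)

    for f in "${shuffled[@]}"; do
        "$PLAYER" "$f"
    done
}
```

With PLAYER=mplayer, a call such as play_shuffled . abba would correspond to the loop in the message above.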
Re: ls doesn't work in if statements in bash 4.3
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> testing]$ for f in `ls dog*`; do ./test $f; done

I have a script which goes out and converts all my filenames (/home/user/) with spaces into filenames with underscores. I also convert them to lower case.

There is something messed up in my .dir_colors file. I just thought it was interesting that

    for f in `ls dog*`

doesn't work but

    for f in `find dog*`

does work.