improve performance of a script

2014-03-26 Thread xeon Mailinglist
For each file inside the directory $output, I cat the file and generate a 
sha256 hash. This script takes 9 minutes to read 105 files (556MB of data in 
total) and generate the digests. Is there a way to make this script faster? 
Maybe by generating the digests in parallel?

for path in $output
do
# sha256sum
digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | awk '{ print $1 }')
(( count ++ ))
done


Thanks,


Re: improve performance of a script

2014-03-26 Thread Pádraig Brady
On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and 
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with 
> the total data of 556MB and generate the digests. Is there a way to make this 
> script faster? Maybe generate digests in parallel?
> 
> for path in $output
> do
> # sha256sum
> digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | 
> awk '{ print $1 }')
> (( count ++ ))
> done

This is not a bash question, so in future please ask on a more appropriate
user-oriented rather than developer-oriented list.
Off the top of my head I'd do something like the following to get xargs to 
parallelize:

digests=( $(
 find "$output" -type f |
 xargs -I '{}' -P"$(nproc)" \
 sh -c "$HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum" |
 cut -f1 -d' '
) )
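
A slightly more defensive variant of the same idea, as a sketch only. It passes each path to sh as a positional argument instead of splicing it into the command string, and uses NUL delimiters so unusual file names survive. hash_parallel is a hypothetical name, and plain cat is a stand-in for "$HADOOP_HOME/bin/hdfs dfs -cat" so that the example runs without Hadoop. Note that find only sees local files, so for HDFS paths the list would have to come from somewhere else.

```shell
#!/usr/bin/env bash
# Sketch: hash every regular file under a directory in parallel.
# Assumption: "cat" is a stand-in for "$HADOOP_HOME/bin/hdfs dfs -cat".
hash_parallel() {
  local dir=$1
  find "$dir" -type f -print0 |
    xargs -0 -n 1 -P "$(nproc)" sh -c '
      # $1 is the file path, passed as an argument rather than spliced in
      digest=$(cat -- "$1" | sha256sum | cut -d" " -f1)
      printf "%s  %s\n" "$digest" "$1"
    ' sh |
    sort -k 2   # parallel jobs finish in any order; sort by path for stable output
}
```

Each output line is "digest  path", so the digests can still be matched to their files even though the jobs finish in arbitrary order. File names containing newlines would still confuse the line-based output.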

You might want to distribute that load across systems too
with something like dxargs or perhaps something like hadoop :p

thanks,
Pádraig.



Re: improve performance of a script

2014-03-26 Thread Greg Wooledge
On Wed, Mar 26, 2014 at 12:54:12PM +, Pádraig Brady wrote:
> On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> > For each file inside the directory $output, I do a cat to the file and
> > generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> > the total data of 556MB and generate the digests. Is there a way to make
> > this script faster? Maybe generate digests in parallel?

First, determine where the bottleneck is.  Is it CPU power to run the
hash command?  Or is it I/O (network? disk?) to read the files?
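
One rough way to answer that question, as a sketch: time the read and the hash as separate steps for a single file. profile_file and the read_cmd array are hypothetical names; cat is a stand-in so the example runs locally, on the assumption that the hdfs command would be substituted for it.

```shell
#!/usr/bin/env bash
# Assumption: replace (cat) with ("$HADOOP_HOME/bin/hdfs" dfs -cat) to time HDFS reads.
read_cmd=(cat)

# profile_file PATH: report wall-clock time of the read and of the hash separately.
# The timings are written to stderr by the bash "time" keyword.
profile_file() {
  local path=$1 tmp TIMEFORMAT
  tmp=$(mktemp) || return 1
  TIMEFORMAT='read: %R s'
  time "${read_cmd[@]}" "$path" > "$tmp"      # fetch only (network/disk)
  TIMEFORMAT='hash: %R s'
  time sha256sum < "$tmp" > /dev/null         # mostly CPU: the data is already local
  rm -f "$tmp"
}
```

If the "read" figure dominates, running more hashes in parallel is unlikely to help much.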

> > for path in $output
> > do
> > # sha256sum
> > digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | 
> > awk '{ print $1 }')
> > (( count ++ ))
> > done

If $output is actually the name of a directory, then your syntax is
somewhat off.  It should be:

for file in "$output"/*; do
  digests[count++]=$( ... "$file" ... )
done

I wouldn't use "output" as the name of a variable that holds a directory,
though.  That's confusing.

I don't know what a hadoop is, or an hdfs, or a dfs... in any case, you
do not appear to be "catting to" a file.  The files appear to be some kind
of input, not output (appended or overwritten).

If there are only 105 input files, and therefore 105 loop iterations,
then optimizing the bashy parts of the code to reduce forks isn't likely
to do very much.  Supposing you removed the awk, that would only save
you 105 forks, which is not likely to be noticeable (we're talking
milliseconds here) when the whole loop takes 9 minutes.
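
For completeness, a sketch of the loop without the awk process: the digest is cut out of the sha256sum output with a parameter expansion instead. collect_digests is a hypothetical name, and local files read via redirection stand in for the hdfs pipeline.

```shell
#!/usr/bin/env bash
# collect_digests DIR: fill the global array "digests" with one hash per regular file.
collect_digests() {
  local dir=$1 file sum
  digests=()
  for file in "$dir"/*; do
    [ -f "$file" ] || continue      # also skips the literal pattern if nothing matches
    sum=$(sha256sum < "$file")      # output looks like "<hex digest>  -"
    digests+=("${sum%% *}")         # keep everything before the first space
  done
}
```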


> Off the top of my head I'd do something like the following to get xargs to
> parallelize:

Running multiple hdfs-whatevers in parallel may make the problem worse,
if that's where the bottleneck is.

Running multiple sha256sums in parallel would only help if the computer
has multiple CPU cores, and if the CPU happens to be the bottleneck here.
If it's a single-core machine, running multiple CPU-heavy processes in
parallel would just make it worse, because you'd introduce a whole bunch
of extra context switching.

Really, we don't have anywhere near enough information about the problem
to give a solution.  We can only give suggestions.

And this should be on help-bash, not bug-bash.  I've Cc'ed the former.



Re: improve performance of a script

2014-03-26 Thread Eduardo A . Bustamante López
(I forgot to CC the list in my first reply)

On Tue, Mar 25, 2014 at 07:12:16AM -0700, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and 
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with 
> the total data of 556MB and generate the digests. Is there a way to make this 
> script faster? Maybe generate digests in parallel?
> 
> for path in $output
> do
> # sha256sum
> digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | 
> awk '{ print $1 }')
> (( count ++ ))
> done
> 
> 
> Thanks,
You were already told in #bash at Freenode that this is not a bash
issue, and yet, you report it as a bug.

Once bash has started the commands, it has no influence on their
performance.

Rather, ask the Hadoop people and also maybe the support for your
operating system to see what you can do to optimize that. Maybe it
cannot be optimized... it depends on what the bottleneck is (disk,
network, etc.)

-- 
Eduardo Alan Bustamante López



ls doesn't work in if statements in bash 4.3

2014-03-26 Thread billycongo
I tested on bash 4.3 and 3.0

testing]$ bash --version
bash --version
GNU bash, version 4.3.0(1)-release (x86_64-unknown-linux-gnu)

In a directory I have:

testing]$ ls -l
total 16
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog1
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog2
-rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog3
-rw-r--r-- 1 hpierce hpierce  0 Mar 26 20:07 dog4
-rwxr-xr-x 1 hpierce hpierce 80 Mar 26 20:02 test

dog1, dog2, and dog3 have content.  dog4 is empty.

test is a simple script:

testing]$ cat test
#!/bin/bash
FILE=$1
echo $FILE
if [ ! -e $FILE ]
then
echo "Usage: $0 "
exit 1
else
echo $FILE exists
fi

Here's a regular run:

testing]$ for f in *; do ./test $f; done
dog1
dog1 exists
dog2
dog2 exists
dog3
dog3 exists
dog4
dog4 exists
test
test exists

Now I add a ls:

testing]$ for f in `ls dog*`; do ./test $f; done
dog1
Usage: ./test 
dog2
Usage: ./test 
dog3
Usage: ./test 
dog4
Usage: ./test 

So I moved it to an earlier version of bash

testing]$ bash --version
bash --version
GNU bash, version 3.00.15(1)-release (x86_64-redhat-linux-gnu)

testing]$ for f in `ls dog*`; do ./test $f; done
dog1
dog1 exists
dog2
dog2 exists
dog3
dog3 exists
dog4
dog4 exists





Re: ls doesn't work in if statements in bash 4.3

2014-03-26 Thread billycongo
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> [...]


I thought about the changes I have made recently and I had added the following 
into my .bashrc:

eval $(dircolors -b ~/.dir_colors)

I commented it out, and now everything works.  I think it's still a bug, though 
I know how to fix it.



Re: ls doesn't work in if statements in bash 4.3

2014-03-26 Thread Mike Frysinger
On Wed 26 Mar 2014 17:45:33 billyco...@gmail.com wrote:
> I thought about the changes I have made recently and I had added the
> following into my .bashrc:
> 
> eval $(dircolors -b ~/.dir_colors)
> 
> I commented it out, and now everything works.  I think it's still a bug,
> though I know how to fix it.

doubtful the problem is bash.  if ls is writing control codes, then bash will 
treat them as part of the filename.  you can verify by piping the output 
through hexdump and seeing what shows up.
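
To illustrate the point, a sketch with a simulated colorized name. The escape sequence below is an assumption about what a color-forcing ls might emit, has_control_chars is a hypothetical helper, and od -c is used here in place of hexdump.

```shell
#!/usr/bin/env bash
# has_control_chars STRING: succeed if STRING contains any control character.
has_control_chars() {
  [[ $1 == *[[:cntrl:]]* ]]
}

plain='dog1'
# Assumed example of a name wrapped in ANSI color codes
colored=$(printf '\033[01;32mdog1\033[0m')

printf '%s\n' "$colored" | od -c    # the escape characters show up as 033

if has_control_chars "$colored"; then
  echo "colored name is not the same string as '$plain'"
fi
```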

this is a good example though of why using `ls` is almost always the wrong 
answer.  use unadorned globs:
for f in dog*; do ...

i'd point out that if any of the files in your dir have whitespace, your code 
would also break:
touch 'dog a b c'
for f in `ls dog*`; do ...
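
A runnable sketch of that failure, in a throwaway directory: the single file 'dog a b c' turns into four words when the ls output is split, and stays one word with a glob. count_words is a hypothetical helper.

```shell
#!/usr/bin/env bash
# count_words: print how many arguments it received.
count_words() { echo "$#"; }

tmpdir=$(mktemp -d)
touch "$tmpdir/dog a b c"

cd "$tmpdir" || exit 1
via_ls=$(count_words $(ls dog*))   # unquoted on purpose: word splitting applies
via_glob=$(count_words dog*)       # the glob yields one word per file

echo "ls: $via_ls  glob: $via_glob"   # prints "ls: 4  glob: 1"
cd - > /dev/null
rm -rf "$tmpdir"
```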
-mike



Re: ls doesn't work in if statements in bash 4.3

2014-03-26 Thread Eduardo A . Bustamante López
This is a "user" problem. You are using the wrong features for the
task; your code should read:

| for f in *; do ./test "$f"; done

and quote all other variable expansions.
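
As a sketch of what quoting everything looks like for the test script, here it is rewritten as a function (with the hypothetical name check_file) so that it can be tried in the current shell. The explicit empty-argument check is an addition, not part of the original script.

```shell
#!/usr/bin/env bash
# check_file FILE: report whether FILE exists; every expansion is quoted.
check_file() {
  local file=${1-}
  printf '%s\n' "$file"
  if [ -z "$file" ] || [ ! -e "$file" ]; then
    echo "Usage: check_file FILE" >&2
    return 1
  fi
  printf '%s exists\n' "$file"
}
```

The empty check matters because with an unquoted, empty $FILE the original test collapses to [ ! -e ], which is false, so the script would report that an empty name exists.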

NEVER do: for foo in `...` OR for foo in $(...)

This is wrong, because you're relying on word splitting and glob
expansion, which is the wrong thing to do in 99% of cases.

Read:

- http://mywiki.wooledge.org/WordSplitting
- http://mywiki.wooledge.org/Quotes


It is totally not a bash bug (read the manual, it's documented).

On Wed, Mar 26, 2014 at 05:30:12PM -0700, billyco...@gmail.com wrote:
> I tested on bash 4.3 and 3.0
> 
> testing]$ bash --version
> bash --version
> GNU bash, version 4.3.0(1)-release (x86_64-unknown-linux-gnu)
> 
> In a directory I have:
> 
> testing]$ ls -l
> total 16
> -rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog1
> -rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog2
> -rw-r--r-- 1 hpierce hpierce 77 Mar 26 20:09 dog3
> -rw-r--r-- 1 hpierce hpierce  0 Mar 26 20:07 dog4
> -rwxr-xr-x 1 hpierce hpierce 80 Mar 26 20:02 test
> 
> dog1, dog2, and dog3 have content.  dog4 is empty.
> 
> test is a simple script:
> 
> testing]$ cat test
> #!/bin/bash
> FILE=$1
> echo $FILE
> if [ ! -e $FILE ]
> then
>   echo "Usage: $0 "
>   exit 1
> else
> echo $FILE exists
> fi
> 
> Here's a regular run:
> 
> testing]$ for f in *; do ./test $f; done
> dog1
> dog1 exists
> dog2
> dog2 exists
> dog3
> dog3 exists
> dog4
> dog4 exists
> test
> test exists
> 
> Now I add a ls:
> 
> testing]$ for f in `ls dog*`; do ./test $f; done
> dog1
> Usage: ./test 
> dog2
> Usage: ./test 
> dog3
> Usage: ./test 
> dog4
> Usage: ./test 
> 
> So I moved it to an earlier version of bash
> 
> testing]$ bash --version
> bash --version
> GNU bash, version 3.00.15(1)-release (x86_64-redhat-linux-gnu)
> 
> testing]$ for f in `ls dog*`; do ./test $f; done
> dog1
> dog1 exists
> dog2
> dog2 exists
> dog3
> dog3 exists
> dog4
> dog4 exists
> 
> 
> 

-- 
Eduardo Alan Bustamante López



Re: ls doesn't work in if statements in bash 4.3

2014-03-26 Thread billycongo
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> [...]

The original intention was to play a randomly sorted list of songs (except 
ABBA):

for f in `ls *.mp3 | grep -v abba | sort -R`
do
   mplayer $f
done

what I can still do instead is:

for f in `find . -name "*.mp3" | grep -v abba | sort -R`
do
   mplayer $f
done
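
A sketch of the same playlist that does not rely on word splitting. It assumes GNU find and sort (for -print0 and -z), and shuffled_songs is a hypothetical name. The file names travel as NUL-terminated strings, so spaces in them are harmless.

```shell
#!/usr/bin/env bash
# shuffled_songs DIR: fill the global array "songs" with the .mp3 files under DIR
# whose path does not contain "abba", in random order.
shuffled_songs() {
  local dir=$1 f
  songs=()
  while IFS= read -r -d '' f; do
    songs+=("$f")
  done < <(find "$dir" -name '*.mp3' ! -path '*abba*' -print0 | sort -zR)
}

# Usage:
#   shuffled_songs .
#   for f in "${songs[@]}"; do mplayer "$f"; done
```

Collecting the names into an array first keeps the player out of the pipeline, so it cannot consume the file list from its standard input.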



Re: ls doesn't work in if statements in bash 4.3

2014-03-26 Thread billycongo
On Wednesday, March 26, 2014 8:30:12 PM UTC-4, billy...@gmail.com wrote:
> [...]

I have a script which goes out and converts all my filenames (/home/user/) with 
spaces into filenames with underscores.  I also convert them to lower case.  

There is something messed up in my .dir_colors file.  I just thought it was 
interesting that

for f in `ls dog*`  doesn't work but

for f in `find dog*` does work.