[email protected] (Jim Gibson) writes:
> You have a logic error in your code somewhere. Your logic for counting
> the files with a name matching a certain pattern is too complicated to
> follow and point out where your error lies and how to fix it. The
> basic problem is that your logic depends upon detecting when the
> directory name changes, and will therefore not work on the last file
> encountered. It also depends upon all of the files in a directory
> being scanned sequentially. While that seems to work, it is not
> guaranteed by File::Find.
>
> A simpler method for counting files in each directory would be to use
> a hash with the directory name as key and the number of files as
> value. Here is a program for doing that:
Wow. That really is simpler. Nice to see such succinct code.
I guess all those same directory names getting crammed into $counts
just get swallowed by that hash characteristic... no need to seek out
the uniq name with $counts{$dir}++ = 0 like I was doing.
> Note that I recommend keeping the raw numerical data (file counts) and
> not saving the output lines. Generate the output lines when you print
> them.
I saw that, and liked that too.
One thing I'm pretty curious about is that when I change your code
enough to run it against the real archive .. which was just to change
the directory name being fed and change the printf a bit to get the
count of returned directories at print out time and the length of the
directory names (even shortened) accounted for.
[...]
>From yours:
------- ------- ---=--- ------- -------
my $startdir = './dir1';
my %counts;
find sub {
return unless -f;
return unless /^\d+$/;
$counts{$File::Find::dir}++;
}, $startdir;
my( $total_dir, $total_files);
print "# Files Directory\n------- ---------------\n";
for my $dir ( sort keys %counts ) {
(my $shortdir = $dir) =~ s{.*News/agent/nntp}{};
my $nfiles = $counts{$dir};
printf "%5d %s\n", $nfiles, $shortdir;
$total_dir++;
$total_files += $nfiles;
}
print "\n<$total_files> files in <$total_dir> directories\n”;
------- ------- ---=--- ------- -------
To this:
Yours with slight alterations.
------- ------- ---=--- ------- -------
my $startdir = '/home/gnusu/News/agent/nntp';
my %counts;
find sub {
return unless -f;
return unless /^\d+$/;
$counts{$File::Find::dir}++;
}, $startdir;
my( $total_dir, $total_files);
my $gcnt = 0;
for my $dir ( sort keys %counts ) {
$gcnt++;
(my $shortdir = $dir) =~ s{.*News/agent/nntp}{};
my $nfiles = $counts{$dir};
printf "%2d: %-56s %6d\n", $gcnt, $shortdir, $nfiles;
$total_dir++;
$total_files += $nfiles;
}
print "\n<$total_files> posts in <$total_dir> directories\n";
------- ------- ---=--- ------- -------
The outputs of my original code and yours with the alterations are the
same:
Printing out 44 directories, each with its file count and ending with
the totals.
<4261964> posts in <44> directories overall
What surprised me is the that when I ran them prefaced with the `time'
utility, I see the sloppy mess I wrote is nearly twice as fast.
I named my own code `cntNews' so named yours `hash_cntNews'
time cntNews (the code from OP):
[...] showing last 5 lines of output
42: nntp.perl.org/perl/beginners 126568
43: nntp.perl.org/perl/beginners/cgi 13609
44: nntp.perl.org/perl/perl6/users 3967
<4261964> posts in <44> directories overall
real 0m15.976s
user 0m10.948s
sys 0m3.416s
------- ------- ---=--- ------- -------
time hash_cntNews (your code with slight alterations):
[...] showing last 5 lines of output
42: /nntp.perl.org/perl/beginners 126568
43: /nntp.perl.org/perl/beginners/cgi 13609
44: /nntp.perl.org/perl/perl6/users 3967
<4261964> posts in <44> directories overall
real 0m29.886s
user 0m12.892s
sys 0m12.824s
Even though I'd rather use your code since it is more reliable,
because like you say; File::Find does not guarantee mine will always
work: But, counting over 4.2 million msgs, can you guess why that time
factor might be the way it is? One is nearly twice as fast.
I did try them at least a dozen times so far... and the timings
come out either the same or nearly so each time. The biggest difference I
saw was +- 2 seconds on each code's time. But still remained roughly
twice as fast.
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/