On Jun 11, 2012, at 7:31 AM, venkates wrote:
> Hi all,
>
> I am trying to filter files from a directory (code provided below) by
> comparing the contents of each file with a hash ref (a parsed id map file
> provided as an argument). The code is working however, is extremely slow.
> The .csv files (81 files) that I am reading are not very large (largest file
> is 183,258 bytes). I would appreciate if you could suggest improvements to
> the code.
>
> sub filter {
> my ( $pazar_dir_path, $up_map, $output ) = @_;
> croak "Not enough arguments! " if ( @_ < 3 );
>
> my $accepted = 0;
> my $rejected = 0;
>
> opendir DH, $pazar_dir_path or croak ("Error in opening directory
> '$pazar_dir_path': $!");
> open my $OUT, '>', $output or croak ("Cannot open file for writing
> '$output': $!");
> while ( my @data_files = grep(/\.csv$/,readdir(DH)) ) {
This loop will only be executed one time, so it need not be a loop at all.
readdir in list context (as established by grep) will return a list of all
entries in the directory the first time it i called. Therefore, the second time
this loop is executed, readdir will return an empty list.
> my @records;
There doesn't seem to be any reason to declare @records here, as it is not used
outside of its innermost scope.
> foreach my $file ( @data_files ) {
> open my $FH, '<', "$pazar_dir_path/$file" or croak ("Cannot open
> file '$file': $!");
> while ( my $data = <$FH> ) {
> chomp $data;
> my $record_output;
> @records = split /\t/, $data;
You can declare @records here: my @records = split /\t, $data;
> foreach my $up_acs ( keys %{$up_map} ) {
You are having to iterate over the entire hash here because you are not looking
for the primary hash key, but an entry in a nested hash. One way to speed up
your program would be to create a temporary hash that contains the entries of
your nested hash:
my %ensembles;
for my $up_acs ( keys %{$p_map} ) {
$ensembles{$up_map->{$up_acs}{'Ensembl_TRS'}} = 1;
}
Do that once at the beginning of the routine (or in the calling routine and
pass that hash instead of the full hash, if the routine is called more than
once an the hash doesn't change).
> foreach my $ensemble_id (
> @{$up_map->{$up_acs}{'Ensembl_TRS'}} ){
> if ( $records[1] eq $ensemble_id ) {
Now, instead of iterating over the %{$up_map} hash, you can test directly for
the entry:
if( $ensembles{$records[1] ) {
...
)
> $record_output = join( "\t", @records );
> print $OUT "$record_output\n";
> $accepted++;
> }
> else {
> $rejected++;
> next;
> }
> }
> }
> }
> close $FH;
> }
> }
> close $OUT;
> closedir (DH);
> print "accepted records: $accepted\n, rejected records: $rejected\n";
> return $output;
> }
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/