On Mon, Jun 11, 2012 at 4:31 PM, venkates <[email protected]> wrote:
> Hi all,
>
> I am trying to filter files from a directory (code provided below) by
> comparing the contents of each file with a hash ref (a parsed id map file
> provided as an argument). The code is working however, is extremely slow.
> The .csv files (81 files) that I am reading are not very large (largest
> file is 183,258 bytes). I would appreciate if you could suggest
> improvements to the code.
>
> sub filter {
>     my ( $pazar_dir_path, $up_map, $output ) = @_;
>     croak "Not enough arguments! " if ( @_ < 3 );
>
>     my $accepted = 0;
>     my $rejected = 0;
>
>     opendir DH, $pazar_dir_path
>         or croak("Error in opening directory '$pazar_dir_path': $!");
>     open my $OUT, '>', $output
>         or croak("Cannot open file for writing '$output': $!");
>     while ( my @data_files = grep( /\.csv$/, readdir(DH) ) ) {
>         my @records;
>         foreach my $file (@data_files) {
>             open my $FH, '<', "$pazar_dir_path/$file"
>                 or croak("Cannot open file '$file': $!");
>             while ( my $data = <$FH> ) {
>                 chomp $data;
>                 my $record_output;
>                 @records = split /\t/, $data;
>                 foreach my $up_acs ( keys %{$up_map} ) {
>                     foreach my $ensemble_id ( @{ $up_map->{$up_acs}{'Ensembl_TRS'} } ) {
>                         if ( $records[1] eq $ensemble_id ) {
>                             $record_output = join( "\t", @records );
>                             print $OUT "$record_output\n";
>                             $accepted++;
>                         }
>                         else {
>                             $rejected++;
>                             next;
>                         }
>                     }
>                 }
>             }
>             close $FH;
>         }
>     }
>     close $OUT;
>     closedir(DH);
>     print "accepted records: $accepted\n, rejected records: $rejected\n";
>     return $output;
> }
>
> __DATA__
>
> TF0000210	ENSMUST00000001326	SP1_MOUSE	GS0000422	ENSMUSG00000037974	7	148974877	149005136	Mus musculus	MUC5AC	14570593	ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
> TF0000211	ENSMUST00000066003	SP3_MOUSE	GS0000422	ENSMUSG00000037974	7	148974877	149005136	Mus musculus	MUC5AC	14570593	ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
>
>
> Thanks a lot,
>
> Aravind
>
> --
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> http://learn.perl.org/
>
>
>
Hi Aravind,

Even though I don't have time to go through the code and work out exactly why
it is slow, I can instantly see one bad thing. In pseudo-code it looks like:
Loop (1) {
    Loop (2) {
        Loop (3) {
        }
    }
}
To me that looks like: if I have to loop over 81 files, as you said, and each
of the underlying loops has 81 iterations as well, then loop (3) runs
81 x 81 x 81 = 531,441 times, loop (2) runs 81 x 81 = 6,561 times, and
loop (1) runs 81 times.
To make this a lot faster, try to pull the loops apart. You are looping over
a directory: simply get the list of files once and stop that loop, then open
each file separately, and only after you are sure it is a useful file start
processing its contents. Try to limit, as much as possible, constructs where
you have a loop inside a loop.
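Just to sketch what I mean (rough and untested against your data — the sub
name is my own invention): collect the .csv names in one go and close the
directory handle immediately, instead of keeping readdir inside a while loop:

```perl
use strict;
use warnings;

# Sketch: grab the whole list of .csv files once, then close the
# directory handle. The list is returned for a plain foreach loop;
# no readdir call survives inside any processing loop.
sub csv_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory '$dir': $!";
    my @files = grep { /\.csv$/ } readdir $dh;
    closedir $dh;
    return @files;
}
```

Then the processing loop is just `for my $file ( csv_files($pazar_dir_path) ) { ... }`.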
You seem to be parsing the files line by line. As you say they are all very
small, you might want to pull each one in in one go and then treat it as one
long string rather than processing it line by line.
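For small files that can look like this (a minimal sketch — localizing `$/`
makes the readline operator return the whole file in a single read):

```perl
use strict;
use warnings;

# Sketch: slurp a small file in one read by localizing the input
# record separator ($/). With $/ undefined, <$fh> returns the
# entire file contents as one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;                 # undef => no record separator
    my $contents = <$fh>;
    close $fh;
    return $contents;
}
```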
Then there is one more thing I saw and don't quite understand: you take your
hash and compare every key in it against each and every value in your file.
Why? Why not simply take the value from your file and see if it has a
matching key in the hash (a lot faster than your way)? As you are only
comparing the two values, you can simply ask the hash whether key X exists:
if it does, you process the record; if not, you reject it instantly. As a
hash is optimized for exactly this kind of lookup, you skip the need to loop
over the entire hash every single time.
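Concretely — assuming your map has the shape your code suggests, i.e.
`$up_map->{$uniprot_ac}{'Ensembl_TRS'}` is an array ref of Ensembl IDs — you
can invert it once into a flat lookup hash before the file loop, and then
each record costs a single exists() check (the sub name is mine, just for
illustration):

```perl
use strict;
use warnings;

# Sketch: flatten the id map once, up front. Every Ensembl ID that
# appears anywhere in the map becomes a key of %is_wanted, so later
# checks are one hash lookup instead of a scan of the whole map.
sub build_lookup {
    my ($up_map) = @_;
    my %is_wanted;
    for my $up_acs ( keys %$up_map ) {
        $is_wanted{$_} = 1 for @{ $up_map->{$up_acs}{'Ensembl_TRS'} };
    }
    return \%is_wanted;
}

# Inside the per-line loop the two inner foreach loops collapse to:
#   my @records = split /\t/, $data;
#   if ( exists $is_wanted->{ $records[1] } ) { $accepted++; ... }
#   else                                      { $rejected++; }
```

This also fixes your counters as a side effect: each record is then counted
once, instead of once per hash entry.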
In very short: if you want speed, writing a loop in a loop in a loop in a
loop is never a good idea. If you have to, a loop in a loop is acceptable;
more than that should be avoided as much as possible.
Regards,
Rob Coops