Re: how to sort a big file

Madhu Reddy Mon, 24 Feb 2003 07:44:29 -0800

Hi David,

How are you? I am using your program (quoted at the end of this mail)
to sort a 7.5 GB file. It has 13 million records.


I changed your program as follows:

if(@buffer > 500000){
        my $tmp = "tmp" . $counter++ . ".txt";
        .....
}

Here are the statistics: it took 5 hours 30 minutes to sort the
13-million-record file on a machine with 8 CPUs and 8 GB of RAM.

How can I improve the speed? I think 5 hours is too long.

Thanks,
-Madhu




--- david <[EMAIL PROTECTED]> wrote:
> Madhu Reddy wrote:
> 
> > Hi,
> >   I want to sort a file based on its 3rd column and write the
> > result back to the same file.
> >
> > My file format is as follows:
> >
> >  C1   C2   C3     C4
> > 1234 guhr 89890 uierfer
> > 1324 guii 60977 hiofver
> > 5467 frwf 56576 errtttt
> >
> > After sorting on column 3 (C3), the file should be:
> >
> > 5467 frwf 56576 errtttt
> > 1324 guii 60977 hiofver
> > 1234 guhr 89890 uierfer
> >
> > How do I do this? The file may have around 20 million rows.
> > 
> 
> If you are using a *nix OS, you should try the sort utility. If you are
> not using *nix and you don't have the sort utility, you will have to rely
> on Perl's sort function. With 20m rows, you probably don't want to store
> everything in memory and then sort it. What you have to do is sort the
> data file segment by segment and then merge the segments back together.
> Merging is the really tricky business. The following script (which I wrote
> for someone a while ago) will do that for you. It breaks the file into
> multiple chunks of 100000 lines, sorts each chunk into a tmp file on disk,
> and then merges all the chunks back together. When I sort a chunk, I keep
> its smallest value (its lower boundary) and use these numbers to order the
> tmp files before merging, so you don't have to compare all the tmp files.
> 
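> #!/usr/bin/perl -w
> use strict;
> 
> my @buffer  = ();   # lines of the chunk currently being read
> my @tmps    = ();   # names of the sorted chunk files
> my %bounds  = ();   # chunk file name => smallest column-3 value in it
> my $counter = 0;
> 
> open(FILE,"file.txt") || die $!;
> while(<FILE>){
>         push(@buffer,$_);
>         if(@buffer > 100000){
>                 my $tmp = "tmp" . $counter++ . ".txt";
>                 push(@tmps,$tmp);
>                 sort_it(\@buffer,$tmp);
>                 @buffer = ();
>         }
> }
> close(FILE);
> 
> # sort the leftover lines too, otherwise the last partial chunk is lost
> if(@buffer){
>         my $tmp = "tmp" . $counter++ . ".txt";
>         push(@tmps,$tmp);
>         sort_it(\@buffer,$tmp);
> }
> 
> merge_it(\%bounds);
> unlink(@tmps);
> 
> #-- DONE --#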
> 
> sub sort_it{
>         my $ref = shift;
>         my $tmp = shift;
>         my $first = 1;
>         open(TMP,">$tmp") || die $!;
>         for(sort {my @fields1 = split(/\s/,$a);
>                   my @fields2 = split(/\s/,$b);
>                   $fields1[2] <=> $fields2[2] } @{$ref}){
>                 if($first){
>                         # remember the smallest key of this chunk
>                         $bounds{$tmp} = (split(/\s/))[2];
>                         $first = 0;
>                 }
>                 print TMP $_;
>         }
>         close(TMP);
> }
> 
> sub merge_it{
>         my $ref = shift;
>         # order the chunk files by their smallest key
>         my @files = sort {$ref->{$a} <=> $ref->{$b}} keys %{$ref};
>         my $merged_to = $files[0];
>         for(my $i=1; $i<@files; $i++){
>                 open(FIRST,$merged_to) || die $!;
>                 open(SECOND,$files[$i]) || die $!;
>                 my $merged_tmp = "merged_tmp$i.txt";
>                 open(MERGED,">$merged_tmp") || die $!;
>                 my $line1 = <FIRST>;
>                 my $line2 = <SECOND>;
>                 while(1){
>                         # one file is exhausted: copy the rest of the other
>                         if(!defined($line1) && defined($line2)){
>                                 print MERGED $line2;
>                                 print MERGED while(<SECOND>);
>                                 last;
>                         }
>                         if(!defined($line2) && defined($line1)){
>                                 print MERGED $line1;
>                                 print MERGED while(<FIRST>);
>                                 last;
>                         }
>                         last if(!defined($line1) && !defined($line2));
>                         my $value1 = (split(/\s/,$line1))[2];
>                         my $value2 = (split(/\s/,$line2))[2];
>                         if($value1 == $value2){
>                                 print MERGED $line1;
>                                 print MERGED $line2;
>                                 $line1 = <FIRST>;
>                                 $line2 = <SECOND>;
>                         }elsif($value1 > $value2){
>                                 while($value1 > $value2){
>                                         print MERGED $line2;
>                                         $line2 = <SECOND>;
>                                         last unless(defined $line2);
>                                         $value2 = (split(/\s/,$line2))[2];
>                                 }
>                         }else{
>                                 while($value1 < $value2){
>                                         print MERGED $line1;
>                                         $line1 = <FIRST>;
>                                         last unless(defined $line1);
>                                         $value1 = (split(/\s/,$line1))[2];
>                                 }
>                         }
>                 }
>                 close(FIRST);
>                 close(SECOND);
>                 close(MERGED);
>                 $merged_to = $merged_tmp;
>         }
> }
> 
> __END__
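
A note on the comparator in sort_it above: it splits both lines on every
comparison, so each line is split many times per chunk. One common way to
avoid that is to extract the key once per line and sort on the cached key.
The following is only a sketch under that assumption, not david's code;
sort_by_third_column is a hypothetical helper name, and it splits on runs
of whitespace rather than on single whitespace characters.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not part of the original script): sort lines
# numerically by their third whitespace-separated column, splitting each
# line only once instead of once per comparison.
sub sort_by_third_column {
    my @lines = @_;
    return map  { $_->[1] }                      # 3. recover the original line
           sort { $a->[0] <=> $b->[0] }          # 2. compare the cached keys
           map  { [ (split(' ', $_))[2], $_ ] }  # 1. cache column 3 with the line
           @lines;
}

# The three sample rows from the original question.
my @sorted = sort_by_third_column(
    "1234 guhr 89890 uierfer\n",
    "1324 guii 60977 hiofver\n",
    "5467 frwf 56576 errtttt\n",
);
print @sorted;
```

Inside sort_it, this would replace only the sort call; whether it helps
noticeably on a 7.5 GB file has not been measured here.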
> 
> After the script finishes, you will notice some files named
> merged_tmp<number>.txt. If you look at merged_tmp<largest number>.txt,
> you should see the contents of your original file, sorted. I decided not to
> delete those merged_tmp files so you can see exactly how each chunk is
> merged in one by one. Great for debugging. I omitted a lot of error checks,
> which you should add if you decide to use the script. It can sort extremely
> large files without using a lot of memory, but it does use up your disk
> space and it isn't very fast. Finally, if you find the script not working,
> please let me know so I can fix it.
> 
> david
> 
> -- 
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
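
For the *nix route david mentions first, the external sort utility can be
called from Perl. This is a minimal sketch, assuming a POSIX-style sort is
on the PATH; sort_file_in_place is a hypothetical wrapper name, and the
file name is taken from the command line.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical wrapper around the external sort utility (an assumption:
# a POSIX-style sort is available on the PATH).
#   -n        compare numerically
#   -k 3,3    use only the third blank-separated field as the key
#   -o FILE   write the result to FILE; POSIX allows this to be the input file
sub sort_file_in_place {
    my $file = shift;
    system('sort', '-n', '-k', '3,3', '-o', $file, $file) == 0
        or die "sort failed on $file: $?";
}

# Usage: perl thisscript.pl file.txt
sort_file_in_place($ARGV[0]) if @ARGV;
```

The sort utility does its own chunking and merging with temporary files,
so it still needs free disk space comparable to the input size.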

