Madhu Reddy wrote:
> Hi,
> I want to sort a file and write the result
> to the same file.
> I want to sort it based on the 3rd column.
>
> following is my file format
>
> C1 C2 C3 C4
> 1234 guhr 89890 uierfer
> 1324 guii 60977 hiofver
> 5467 frwf 56576 errtttt
>
>
> I want to sort the above file based on column 3 (C3)
> and write the sorted result to the same file.
>
> After sorting my file should be
>
> 5467 frwf 56576 errtttt
> 1324 guii 60977 hiofver
> 1234 guhr 89890 uierfer
>
>
>
> How do I do this?
> The file may have around 20 million rows.
>
If you are on a *nix OS, you should try the sort utility. If you are not on
*nix and don't have the sort utility, you will have to rely on Perl's sort
function. With 20 million rows, you probably don't want to hold everything in
memory and sort it there. Instead, sort the data file segment by segment and
then merge the segments back together; the merging is the really tricky part.
The following script (which I wrote for someone a while ago) does that for
you. It breaks the file into chunks of 100,000 lines, sorts each chunk into a
temporary file on disk, and then merges all the chunks back together. While
sorting each chunk, I record the smallest column-3 value of that chunk and
later use it to order the temporary files, so you don't have to compare all
of the tmp files against each other.
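For the *nix route, something like this should do it. A sketch, assuming the
file is named file.txt and a POSIX sort is available: -k3,3n sorts numerically
on the third whitespace-separated field only, and -o lets sort write back to
the input file safely (it reads all input before truncating). Note that a
header line like "C1 C2 C3 C4" would get sorted along with the data, so strip
it first if you have one.

```shell
# sample data (the original poster's rows, minus the header line)
printf '%s\n' \
  '1234 guhr 89890 uierfer' \
  '1324 guii 60977 hiofver' \
  '5467 frwf 56576 errtttt' > file.txt

# sort numerically (-n) on the 3rd field only (-k3,3) and write the
# result back to the same file (-o file.txt)
sort -k3,3n -o file.txt file.txt

cat file.txt
```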
#!/usr/bin/perl -w
use strict;

my @buffer  = ();   # lines of the current chunk
my @tmps    = ();   # names of the sorted chunk files
my %bounds  = ();   # smallest column-3 value in each chunk file
my $counter = 0;

open(FILE, "file.txt") || die $!;
while(<FILE>){
    push(@buffer, $_);
    if(@buffer >= 100000){
        my $tmp = "tmp" . $counter++ . ".txt";
        push(@tmps, $tmp);
        sort_it(\@buffer, $tmp);
        @buffer = ();
    }
}
close(FILE);

# don't forget the final partial chunk
if(@buffer){
    my $tmp = "tmp" . $counter++ . ".txt";
    push(@tmps, $tmp);
    sort_it(\@buffer, $tmp);
}

merge_it(\%bounds);
unlink(@tmps);
#-- DONE --#

# sort one chunk numerically on column 3, write it to a tmp file, and
# remember the chunk's smallest column-3 value in %bounds
sub sort_it{
    my $ref   = shift;
    my $tmp   = shift;
    my $first = 1;
    open(TMP, ">$tmp") || die $!;
    for(sort { (split ' ', $a)[2] <=> (split ' ', $b)[2] } @{$ref}){
        if($first){
            $bounds{$tmp} = (split ' ')[2];
            $first = 0;
        }
        print TMP $_;
    }
    close(TMP);
}

# merge the chunk files two at a time, starting with the chunks whose
# smallest column-3 values are lowest
sub merge_it{
    my $ref       = shift;
    my @files     = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
    my $merged_to = $files[0];
    for(my $i = 1; $i < @files; $i++){
        open(FIRST,  $merged_to) || die $!;
        open(SECOND, $files[$i]) || die $!;
        my $merged_tmp = "merged_tmp$i.txt";
        open(MERGED, ">$merged_tmp") || die $!;
        my $line1 = <FIRST>;
        my $line2 = <SECOND>;
        while(1){
            # one side exhausted: copy the rest of the other side
            if(!defined($line1) && defined($line2)){
                print MERGED $line2;
                print MERGED while(<SECOND>);
                last;
            }
            if(!defined($line2) && defined($line1)){
                print MERGED $line1;
                print MERGED while(<FIRST>);
                last;
            }
            last if(!defined($line1) && !defined($line2));
            my $value1 = (split ' ', $line1)[2];
            my $value2 = (split ' ', $line2)[2];
            if($value1 == $value2){
                print MERGED $line1;
                print MERGED $line2;
                $line1 = <FIRST>; $line2 = <SECOND>;
            }elsif($value1 > $value2){
                # drain SECOND while its values are smaller
                while($value1 > $value2){
                    print MERGED $line2;
                    $line2 = <SECOND>;
                    last unless(defined $line2);
                    $value2 = (split ' ', $line2)[2];
                }
            }else{
                # drain FIRST while its values are smaller
                while($value1 < $value2){
                    print MERGED $line1;
                    $line1 = <FIRST>;
                    last unless(defined $line1);
                    $value1 = (split ' ', $line1)[2];
                }
            }
        }
        close(FIRST);
        close(SECOND);
        close(MERGED);
        $merged_to = $merged_tmp;
    }
}
__END__
After the script finishes, you will notice some files named
merged_tmp<number>.txt. If you look at merged_tmp<largest number>.txt,
you should see your original file sorted in it. I decided not to delete
the merged_tmp files, so you can see exactly how the chunks are merged
in one by one; great for debugging. I omitted a lot of error checks
which you should add if you decide to use the script. It can sort
extremely large files without using much memory, but it does eat up
disk space and it isn't very fast. Finally, if you find the script
doesn't work, please let me know so I can fix it.
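If you do have the sort utility available somewhere, its check mode is a quick
way to verify the final merged file. A sketch; merged_tmp2.txt below is a
hypothetical stand-in for whatever the largest-numbered merged_tmp file turns
out to be, and -c makes sort report the first out-of-order line (exit status 0
means the file is already ordered):

```shell
# a small already-sorted file standing in for the final merged output
printf '%s\n' \
  '5467 frwf 56576 errtttt' \
  '1324 guii 60977 hiofver' \
  '1234 guhr 89890 uierfer' > merged_tmp2.txt

# -c checks ordering on field 3 (numeric) without producing output
if sort -c -k3,3n merged_tmp2.txt; then
    echo "sorted on column 3"
else
    echo "NOT sorted"
fi
```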
david