Hi David,
How are you?
I am using your program for sorting (it is quoted at the end of this mail).
I am sorting a 7.5 GB file with this program; it has 13 million records.
I changed your program as follows:
if(@buffer > 500000){
    my $tmp = "tmp" . $counter++ . ".txt";
    .....
}
Here are the statistics:
It took 5 hours 30 minutes to sort the 13-million-record file
on a machine with 8 CPUs and 8 GB of RAM.
How can I improve the speed?
I think 5 hours is too long.
Thanks
-Madhu
--- david <[EMAIL PROTECTED]> wrote:
> Madhu Reddy wrote:
>
> > Hi,
> > I want to sort a file and write the result back to the same file.
> > I want to sort it based on the 3rd column.
> >
> > The following is my file format:
> >
> > C1 C2 C3 C4
> > 1234 guhr 89890 uierfer
> > 1324 guii 60977 hiofver
> > 5467 frwf 56576 errtttt
> >
> >
> > I want to sort the above file on column 3 (C3)
> > and write the sorted result to the same file.
> >
> > After sorting, my file should be:
> >
> > 5467 frwf 56576 errtttt
> > 1324 guii 60977 hiofver
> > 1234 guhr 89890 uierfer
> >
> >
> >
> > How do I do this?
> > The file may have around 20 million rows.
> >
>
> If you are using a *nix OS, you should try the sort utility. If you are not
> using *nix and you don't have the sort utility, you will have to rely on
> Perl's sort function. With 20 million rows, you probably don't want to store
> everything in memory and then sort it. What you have to do is sort the data
> file segment by segment and then merge the segments back together. Merging
> is the really tricky part. The following script (which I wrote for someone a
> while ago) will do that for you. It breaks the file into chunks of 100000
> lines, sorts each chunk and writes it to a tmp file on disk, and then merges
> all the chunks back together. While sorting each chunk, I record its smallest
> value in column 3 and use these values to decide the order in which the tmp
> files are merged, so they are merged one at a time, in that order, rather
> than all compared at once.
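
For the *nix route mentioned above, a minimal sketch of calling the sort utility from Perl might look like the following. It assumes a POSIX-style sort is on the PATH; the sub name sort_file_by_third_column is made up for illustration and is not part of the script below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort a file in place, numerically, on its third whitespace-separated column
# by handing the work to the external sort utility (assumed to be installed).
sub sort_file_by_third_column {
    my ($file) = @_;

    # -k3,3n : the sort key is field 3 only, compared numerically
    # -o     : write the result to this file; POSIX allows it to be the input
    system('sort', '-k3,3n', '-o', $file, $file) == 0
        or die "sort failed on $file (exit status $?)";
}
```

It would be called as sort_file_by_third_column("file.txt"). The list form of system avoids passing the file name through a shell.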
>
> #!/usr/bin/perl -w
> use strict;
>
> my @buffer = ();
> my @tmps = ();
> my %bounds = ();
> my $counter = 0;
>
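> open(FILE,"file.txt") || die $!;
> while(<FILE>){
>     push(@buffer,$_);
>     if(@buffer > 100000){
>         my $tmp = "tmp" . $counter++ . ".txt";
>         push(@tmps,$tmp);
>         sort_it(\@buffer,$tmp);
>         @buffer = ();
>     }
> }
> close(FILE);
>
> # sort whatever is left in the buffer (the last, shorter chunk),
> # otherwise those lines never reach a tmp file
> if(@buffer){
>     my $tmp = "tmp" . $counter++ . ".txt";
>     push(@tmps,$tmp);
>     sort_it(\@buffer,$tmp);
>     @buffer = ();
> }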
>
> merge_it(\%bounds);
> unlink(@tmps);
>
> #-- DONE --#
>
> sub sort_it{
>     my $ref = shift;
>     my $tmp = shift;
>     my $first = 1;
>     open(TMP,">$tmp") || die $!;
>     # sort the chunk numerically on the 3rd column
>     for(sort {my @fields1 = split(/\s/,$a);
>               my @fields2 = split(/\s/,$b);
>               $fields1[2] <=> $fields2[2] } @{$ref}){
>         if($first){
>             # remember the smallest key of this chunk
>             $bounds{$tmp} = (split(/\s/))[2];
>             $first = 0;
>         }
>         print TMP $_;
>     }
>     close(TMP);
> }
>
> sub merge_it{
>     my $ref = shift;
>     # merge the chunks in order of their smallest key
>     my @files = sort {$ref->{$a} <=> $ref->{$b}} keys %{$ref};
>     my $merged_to = $files[0];
>     for(my $i=1; $i<@files; $i++){
>         open(FIRST,$merged_to) || die $!;
>         open(SECOND,$files[$i]) || die $!;
>         my $merged_tmp = "merged_tmp$i.txt";
>         open(MERGED,">$merged_tmp") || die $!;
>         my $line1 = <FIRST>;
>         my $line2 = <SECOND>;
>         while(1){
>             # one input is exhausted: copy the rest of the other
>             if(!defined($line1) && defined($line2)){
>                 print MERGED $line2;
>                 print MERGED while(<SECOND>);
>                 last;
>             }
>             if(!defined($line2) && defined($line1)){
>                 print MERGED $line1;
>                 print MERGED while(<FIRST>);
>                 last;
>             }
>             last if(!defined($line1) && !defined($line2));
>             my $value1 = (split(/\s/,$line1))[2];
>             my $value2 = (split(/\s/,$line2))[2];
>             if($value1 == $value2){
>                 print MERGED $line1;
>                 print MERGED $line2;
>                 $line1 = <FIRST>;
>                 $line2 = <SECOND>;
>             }elsif($value1 > $value2){
>                 while($value1 > $value2){
>                     print MERGED $line2;
>                     $line2 = <SECOND>;
>                     last unless(defined $line2);
>                     $value2 = (split(/\s/,$line2))[2];
>                 }
>             }else{
>                 while($value1 < $value2){
>                     print MERGED $line1;
>                     $line1 = <FIRST>;
>                     last unless(defined $line1);
>                     $value1 = (split(/\s/,$line1))[2];
>                 }
>             }
>         }
>         close(FIRST);
>         close(SECOND);
>         close(MERGED);
>         $merged_to = $merged_tmp;
>     }
> }
>
> __END__
>
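
To get a feel for how much data the one-at-a-time merge in merge_it moves, here is a rough estimate. It is only a sketch: it assumes equal-sized chunks, uses the 13 million records and 500000-line chunks reported at the top of this message, counts only the lines written while merging (not the time spent sorting each chunk), and the sub name cascade_lines_written is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough cost of merging the chunks one at a time: step i merges the running
# result (i-1 chunks long) with one more chunk and writes i chunks back to disk.
sub cascade_lines_written {
    my ($total_lines, $chunk_lines) = @_;
    my $chunks  = int(($total_lines + $chunk_lines - 1) / $chunk_lines);  # round up
    my $written = 0;
    for my $i (2 .. $chunks) {
        my $lines = $i * $chunk_lines;
        $lines = $total_lines if $lines > $total_lines;  # the last chunk may be short
        $written += $lines;
    }
    return ($chunks, $written);
}

my ($chunks, $written) = cascade_lines_written(13_000_000, 500_000);
printf "%d chunks, %d lines written while merging (%.1fx the input)\n",
    $chunks, $written, $written / 13_000_000;
```

Under these assumptions the merge step writes 175 million lines, about 13.5 times the size of the input, and reads a similar amount. With the script's default of 100000 lines per chunk the same estimate comes to about 65 times the input.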
> After the script finishes, you will notice some files named
> merged_tmp<number>.txt. If you look at merged_tmp<largest number>.txt, you
> should see the contents of your original file, sorted. I decided not to
> delete the merged_tmp files so you can see exactly how each chunk is merged
> in, one by one, which is great for debugging. I omitted a lot of error
> checks, which you should add if you decide to use the script. It can sort
> extremely large files without using a lot of memory, but it does use up disk
> space and it isn't very fast. Finally, if you find the script doesn't work,
> please let me know so I can fix it.
>
> david
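
Since the script above is described as not very fast, one possible alternative to merging the tmp files one at a time is to open all of them together and merge in a single pass. This is a different technique from the one the script uses, shown only as a sketch: the sub names key_of and merge_chunks are hypothetical, every chunk file is assumed to be already sorted on column 3, and the number of chunks is assumed to be small enough to keep all files open at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Numeric sort key: the third whitespace-separated column of a line.
sub key_of {
    my @fields = split ' ', $_[0];
    return $fields[2];
}

# Merge already-sorted chunk files into $out in one pass over the data.
sub merge_chunks {
    my ($out, @chunks) = @_;
    my @heads;    # one entry per open chunk: [key, current line, filehandle]

    for my $chunk (@chunks) {
        open(my $fh, '<', $chunk) or die "Cannot open $chunk: $!";
        my $line = <$fh>;
        if (defined $line) {
            push @heads, [ key_of($line), $line, $fh ];
        } else {
            close($fh);    # empty chunk, nothing to merge
        }
    }

    open(my $out_fh, '>', $out) or die "Cannot write $out: $!";
    while (@heads) {
        # Linear scan for the smallest key; fine for a few dozen chunks.
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i][0] < $heads[$min][0];
        }
        my $head = $heads[$min];
        print {$out_fh} $head->[1];

        # Advance the chunk we just took a line from, or drop it at EOF.
        my $next = readline($head->[2]);
        if (defined $next) {
            $head->[0] = key_of($next);
            $head->[1] = $next;
        } else {
            close($head->[2]);
            splice(@heads, $min, 1);
        }
    }
    close($out_fh) or die "Cannot close $out: $!";
}
```

With the script above it could be called as merge_chunks("sorted.txt", @tmps) in place of merge_it(\%bounds), where "sorted.txt" is a made-up output name. Each line is then read and written once during merging, and its key is extracted once rather than on every comparison.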
>
> --
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
>