Madhu Reddy wrote:
> Hi,
> I want to sort a file and write the result
> to the same file.
> I want to sort it based on the 3rd column.
>
> following is my file format
>
> C1 C2 C3 C4
> 1234 guhr 89890 uierfer
> 1324 guii 60977 hiofver
> 5467 frwf 56576 errtttt
>
>
> I want to sort the above file based on column 3 (C3)
> and write the sorted result to the same file.
>
> After sorting my file should be
>
> 5467 frwf 56576 errtttt
> 1324 guii 60977 hiofver
> 1234 guhr 89890 uierfer
>
>
>
> How do I do this?
> The file may have around 20 million rows.
>
If you are on a *nix OS, you should try the sort utility. If you are not on
*nix and don't have the sort utility, you will have to rely on Perl's sort
function. With 20 million rows, you probably don't want to hold everything in
memory and sort it there. Instead, sort the data file segment by segment and
then merge the segments back together; the merging is the really tricky part.
The following script (which I wrote for someone a while ago) does that for
you. It breaks the file into chunks of 100,000 lines, sorts each chunk into a
temporary file on disk, and then merges all the chunks back together. While
sorting each chunk, I record the smallest column-3 value of that chunk and
later use it to order the temporary files, so you don't have to compare all
of the tmp files against each other.
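For the *nix route, something like this should do it. A sketch, assuming the
file is named file.txt and a POSIX sort is available: -k3,3n sorts numerically
on the third whitespace-separated field only, and -o lets sort write back to
the input file safely (it reads all input before truncating). Note that a
header line like "C1 C2 C3 C4" would get sorted along with the data, so strip
it first if you have one.

```shell
# sample data (the original poster's rows, minus the header line)
printf '%s\n' \
  '1234 guhr 89890 uierfer' \
  '1324 guii 60977 hiofver' \
  '5467 frwf 56576 errtttt' > file.txt

# sort numerically (-n) on the 3rd field only (-k3,3) and write the
# result back to the same file (-o file.txt)
sort -k3,3n -o file.txt file.txt

cat file.txt
```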
#!/usr/bin/perl -w
use strict;

my @buffer  = ();   # lines of the current chunk
my @tmps    = ();   # names of the sorted chunk files
my %bounds  = ();   # smallest column-3 value in each chunk file
my $counter = 0;

open(FILE, "file.txt") || die $!;
while(<FILE>){
    push(@buffer, $_);
    if(@buffer >= 100000){
        my $tmp = "tmp" . $counter++ . ".txt";
        push(@tmps, $tmp);
        sort_it(\@buffer, $tmp);
        @buffer = ();
    }
}
close(FILE);

# don't forget the final partial chunk
if(@buffer){
    my $tmp = "tmp" . $counter++ . ".txt";
    push(@tmps, $tmp);
    sort_it(\@buffer, $tmp);
}

merge_it(\%bounds);
unlink(@tmps);
#-- DONE --#

# sort one chunk numerically on column 3, write it to a tmp file, and
# remember the chunk's smallest column-3 value in %bounds
sub sort_it{
    my $ref   = shift;
    my $tmp   = shift;
    my $first = 1;
    open(TMP, ">$tmp") || die $!;
    for(sort { (split ' ', $a)[2] <=> (split ' ', $b)[2] } @{$ref}){
        if($first){
            $bounds{$tmp} = (split ' ')[2];
            $first = 0;
        }
        print TMP $_;
    }
    close(TMP);
}

# merge the chunk files two at a time, starting with the chunks whose
# smallest column-3 values are lowest
sub merge_it{
    my $ref       = shift;
    my @files     = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
    my $merged_to = $files[0];
    for(my $i = 1; $i < @files; $i++){
        open(FIRST,  $merged_to) || die $!;
        open(SECOND, $files[$i]) || die $!;
        my $merged_tmp = "merged_tmp$i.txt";
        open(MERGED, ">$merged_tmp") || die $!;
        my $line1 = <FIRST>;
        my $line2 = <SECOND>;
        while(1){
            # one side exhausted: copy the rest of the other side
            if(!defined($line1) && defined($line2)){
                print MERGED $line2;
                print MERGED while(<SECOND>);
                last;
            }
            if(!defined($line2) && defined($line1)){
                print MERGED $line1;
                print MERGED while(<FIRST>);
                last;
            }
            last if(!defined($line1) && !defined($line2));
            my $value1 = (split ' ', $line1)[2];
            my $value2 = (split ' ', $line2)[2];
            if($value1 == $value2){
                print MERGED $line1;
                print MERGED $line2;
                $line1 = <FIRST>; $line2 = <SECOND>;
            }elsif($value1 > $value2){
                # drain SECOND while its values are smaller
                while($value1 > $value2){
                    print MERGED $line2;
                    $line2 = <SECOND>;
                    last unless(defined $line2);
                    $value2 = (split ' ', $line2)[2];
                }
            }else{
                # drain FIRST while its values are smaller
                while($value1 < $value2){
                    print MERGED $line1;
                    $line1 = <FIRST>;
                    last unless(defined $line1);
                    $value1 = (split ' ', $line1)[2];
                }
            }
        }
        close(FIRST);
        close(SECOND);
        close(MERGED);
        $merged_to = $merged_tmp;
    }
}
__END__
After the script finishes, you will notice some files named
merged_tmp<number>.txt. If you look at merged_tmp<largest number>.txt,
you should see your original file sorted in it. I decided not to delete
the merged_tmp files, so you can see exactly how the chunks are merged
in one by one; great for debugging. I omitted a lot of error checks
which you should add if you decide to use the script. It can sort
extremely large files without using much memory, but it does eat up
disk space and it isn't very fast. Finally, if you find the script
doesn't work, please let me know so I can fix it.
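If you do have the sort utility available somewhere, its check mode is a quick
way to verify the final merged file. A sketch; merged_tmp2.txt below is a
hypothetical stand-in for whatever the largest-numbered merged_tmp file turns
out to be, and -c makes sort report the first out-of-order line (exit status 0
means the file is already ordered):

```shell
# a small already-sorted file standing in for the final merged output
printf '%s\n' \
  '5467 frwf 56576 errtttt' \
  '1324 guii 60977 hiofver' \
  '1234 guhr 89890 uierfer' > merged_tmp2.txt

# -c checks ordering on field 3 (numeric) without producing output
if sort -c -k3,3n merged_tmp2.txt; then
    echo "sorted on column 3"
else
    echo "NOT sorted"
fi
```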
david