On Jun 30, 8:01 am, [EMAIL PROTECTED] (Cheez) wrote:
> Howdy, scripting with perl is a hobby and not a vocation so i
> apologize in advance for rough looking code.
>
> I have a very large list of 16-letter words called
> "hashsequence16.txt". This file is 203MB in size.
>
> I have a large list of data called "newrawdata.txt". This file is
> 95MB.
>
> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the the full line of rawdata.txt and
> associate that with the 16-letter word.
>
> Using a filesize line-counter and timing how long it takes to process
> my data lets me know that I have 9534 hours to see if I can find an
> alternative solution. It's pretty brute force but I don't know if
> there is another way to do it.
>
> Any comments or guidance would be greatly appreciated.
>
> Thanks,
> Dan
> ==========================================
>
> print "**fisher**";
>
> $flatfile = "newrawdata.txt";
> # 95MB in size
>
> $datafile = "hashsequence16.txt";
> # 203MB in size
>
> my $filesize = -s "hashsequence16.txt";
> # for use in processing time calculation
>
> open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
> open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
> open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!
> \n";
>
> @preparse = <FILE>;
> @hashdata = <FILE2>;
>
> close(FILE);
> close(FILE2);
>
> for my $list1 (@hashdata) {
> # iterating through hash16 data
>
> $finish++;
>
> if ($finish ==10 ) {
> # line counter
>
> $marker = $marker + $finish;
>
> $finish =0;
>
> $left = $filesize - $marker;
>
> printf "$left\/$filesize\n";
> # this prints every 17 seconds
> }
>
> ($line, $freq) = split(/\t/, $list1);
>
> for my $rawdata (@preparse) {
> # iterating through rawdata
>
> $rawdata=~ s/\n//;
>
> if ($rawdata =~ m/$line/) {
> # matching hash16 word with rawdata line
>
> my $first_pos = index $rawdata,$line;
>
> print SEQFILE "$first_pos\t$rawdata\n";
> # printing to info to new file
>
> }
>
> }
>
> print SEQFILE "PROCESS\t$line\n";
> # printing hash16 word and "process"
>
> }
Hi there, let me see if I can help.
Always include these two lines. They catch typos and undeclared variables, which helps a lot with debugging:
use strict;
use warnings;
> @preparse = <FILE>;
> @hashdata = <FILE2>;
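That may be part of why your program runs so slowly: you are slurping two big files into arrays.
The bigger cost is probably the nested loop, which rescans all of the raw data for every single word.
For the reading part, try something like this, which handles one line at a time: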
my $temp_file = "temp.txt";
open(my $temp_file_fh, "<", $temp_file) or die "Can't open '$temp_file': $!";
while (<$temp_file_fh>) {
    s/[\r\n]+\z//;   # Remove trailing carriage returns and newlines
    if (m/<your_regex-here>/) {
        print "found\n";
    }
}
close($temp_file_fh);
See what I mean? Save slurping for really, really small files, and even then think twice.
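If the real goal is to avoid rescanning the raw data for every word, one option is to put the 16-letter words into a hash and slide a 16-character window along each raw line. This is only a rough sketch, not a drop-in fix: load_words and find_words are names I made up, the sample data is invented, and I'm assuming each line of the word file is "word<TAB>frequency" as your split suggests. It also assumes the word list fits in memory as hash keys, which for a 203MB file will likely need several times that much RAM.

```perl
use strict;
use warnings;

# Hypothetical helper: read the word list into a hash for fast lookups.
# Assumes one word per line, optionally followed by a tab and a frequency.
sub load_words {
    my ($fh) = @_;
    my %words;
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;
        my ($word) = split /\t/, $line;
        $words{$word} = 1 if defined $word && length $word;
    }
    return \%words;
}

# Hypothetical helper: return [position, word] for every window of
# $len characters in $text that is a key of %$words.
sub find_words {
    my ($words, $text, $len) = @_;
    my @hits;
    for my $pos (0 .. length($text) - $len) {
        my $window = substr $text, $pos, $len;
        push @hits, [$pos, $window] if exists $words->{$window};
    }
    return @hits;
}

# Invented sample data, read through in-memory filehandles so the
# sketch runs without the real files.
my $word_list = "AAAAAAAAAAAAAAAA\t3\nACGTACGTACGTACGT\t1\n";
open(my $word_fh, "<", \$word_list) or die "Can't open word list: $!";
my $words = load_words($word_fh);
close($word_fh);

my $raw_data = "xxACGTACGTACGTACGTyy\nnothing to see here\n";
open(my $raw_fh, "<", \$raw_data) or die "Can't open raw data: $!";
while (my $raw = <$raw_fh>) {
    $raw =~ s/[\r\n]+\z//;
    for my $hit (find_words($words, $raw, 16)) {
        my ($pos, $word) = @$hit;
        print "$pos\t$word\t$raw\n";
    }
}
close($raw_fh);
```

With the real files you would open "hashsequence16.txt" and "newrawdata.txt" instead of the in-memory strings. Each raw line is then read once, and every lookup is a hash probe rather than a scan of the whole word list.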
> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the the full line of rawdata.txt and
> associate that with the 16-letter word.
I'd just find whatever I'm looking for in both files and push the values into
an array or an external file.
Then I'd create a hash to associate both entries.
Try the scripts below.
For example, file 1: we want to find apples. File 1 contains apples,
oranges, and bananas.
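#!/usr/bin/perl
use strict;
use warnings;
my $ca_dir_path = "ca_files";
my $ca_log_path = "log_ca.txt";
my %log_ca;
opendir(my $ca_dir, $ca_dir_path) or die "Can't open directory '$ca_dir_path': $!";
while (defined(my $file = readdir($ca_dir))) {
    # skip the . and .. entries
    next if $file =~ m#^\.\.?$#;
    open(my $fh, "<", "$ca_dir_path/$file") or die "Can't open '$file': $!";
    while (<$fh>) {
        chomp;
        if (m/^IEA\*/) {
            # With no capture group, /apples/ returns 1 in list context
            # on a match, so the key is 1; add parentheses to capture a value.
            my ($key) = /apples/;
            $log_ca{$key} = $file if defined $key;
        }
    }
    close($fh);
}
closedir($ca_dir);
# The log goes in the current directory, where the report script looks for it
open(my $ca_log, ">", $ca_log_path) or die "Can't open '$ca_log_path': $!";
foreach (sort keys %log_ca) {
    print $ca_log "$_->$log_ca{$_}\n";
}
close($ca_log);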
File 2: also looking for apples. This file has apples, melons,
and berries.
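#!/usr/bin/perl
use strict;
use warnings;
my $aa_dir_path = "aa_files";
my $aa_log_path = "log_aa.txt";
my %log_aa;
opendir(my $aa_dir, $aa_dir_path) or die "Can't open directory '$aa_dir_path': $!";
while (defined(my $file = readdir($aa_dir))) {
    # skip the . and .. entries
    next if $file =~ m#^\.\.?$#;
    open(my $fh, "<", "$aa_dir_path/$file") or die "Can't open '$file': $!";
    while (<$fh>) {
        chomp;
        if (m/^IEA\{/) {
            # With no capture group, /apples/ returns 1 in list context
            # on a match, so the key is 1; add parentheses to capture a value.
            my ($key) = /apples/;
            $log_aa{$key} = $file if defined $key;
        }
    }
    close($fh);
}
closedir($aa_dir);
# The log goes in the current directory, where the report script looks for it
open(my $aa_log, ">", $aa_log_path) or die "Can't open '$aa_log_path': $!";
foreach (sort keys %log_aa) {
    print $aa_log "$_->$log_aa{$_}\n";
}
close($aa_log);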
File 3 would be the actual "report" generator:
#!/usr/bin/perl
use warnings;
use strict;
my $ca_log_path = "log_ca.txt";
my $aa_log_path = "log_aa.txt";
my %final_report;
my @ca_filenames;
my @aa_filenames;
open(my $ca_fh, "<", $ca_log_path) or die "Can't open '$ca_log_path': $!";
my @ca_files = <$ca_fh>;
close($ca_fh);
open(my $aa_fh, "<", $aa_log_path) or die "Can't open '$aa_log_path': $!";
my @aa_files = <$aa_fh>;
close($aa_fh);
# sort arrays
my @ca_files_sorted = sort @ca_files;
my @aa_files_sorted = sort @aa_files;
foreach (@ca_files_sorted) {
    s/\s+\z//;   # Remove all trailing whitespace
    push @ca_filenames, /\d+->(.+)/;
}
foreach (@aa_files_sorted) {
    s/\s+\z//;   # Remove all trailing whitespace
    push @aa_filenames, /\d+->(.+)/;
}
# Pair entries by position, stopping at the shorter list
my $total_items = @ca_filenames < @aa_filenames ? @ca_filenames : @aa_filenames;
for (1 .. $total_items) {
    $final_report{ pop @ca_filenames } = pop @aa_filenames;
}
print "APPLES FILE 1 => APPLES FILE 2\n";
print '-' x 27, "\n";
foreach (sort keys %final_report) {
    print "$_ => $final_report{$_}\n";
}
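The report above pairs the two lists by position, which only lines up if both logs have the same entries in the same order. If that's not guaranteed, a sketch like the following pairs them by the key in front of the "->" instead. The names parse_log and join_logs and the sample log lines are made up for illustration; the only thing taken from the scripts above is the "key->filename" line format.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key->filename" lines into a key => filename hash.
sub parse_log {
    my (@lines) = @_;
    my %by_key;
    for my $line (@lines) {
        $line =~ s/\s+\z//;                       # Remove trailing whitespace
        my ($key, $file) = $line =~ /^(.*?)->(.+)$/ or next;
        $by_key{$key} = $file;
    }
    return %by_key;
}

# Hypothetical helper: pair up filenames whose keys appear in both logs.
sub join_logs {
    my ($ca, $aa) = @_;                           # two hash references
    my %report;
    for my $key (keys %$ca) {
        $report{ $ca->{$key} } = $aa->{$key} if exists $aa->{$key};
    }
    return %report;
}

# Invented sample lines standing in for log_ca.txt and log_aa.txt
my %ca = parse_log("1->ca_one.txt\n", "2->ca_two.txt\n");
my %aa = parse_log("1->aa_one.txt\n", "3->aa_three.txt\n");
my %report = join_logs(\%ca, \%aa);
print "$_ => $report{$_}\n" for sort keys %report;
```

Keys that show up in only one log are simply left out of the report, so a missing entry can't shift every later pair.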
Is this homework, by the way, dude?
Anyway, my two cents: run them. If it works right away, cool. If not,
that'll get you started. There's more than one way to do it.