Made some more progress. The problem seems to depend on the size of the
string the regexp is processing, in addition to the new ID tag in the <a
href> for each message.  I tried running the regexp over different
portions of the $tmpPage string, and as the portion got longer the CPU
time appeared to grow exponentially.

On a 2GHz CPU with 512MB of RAM and not much else going on, it didn't
finish processing a single pass in an hour.
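
A minimal sketch of how this kind of slowdown could be reproduced outside
fetchyahoo. Everything here is an assumption for illustration: fake_page()
and match_mid() are hypothetical helpers, the markup only mimics the shape
the pattern expects, and the pattern is a shortened version of the real one.
The idea is that when the final capture can never match, the engine has to
try every combination of the lazy .*? runs before giving up, so the time
should grow much faster than the page size.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build a hypothetical page of $n message rows. The markup only mimics the
# shape the fetchyahoo pattern expects; it is not real Yahoo HTML.
sub fake_page {
    my ($n, $mid) = @_;
    $mid = '' unless defined $mid;
    my $row = qq{<tr class=msgold>\n<td><input name="Mid" value="$mid"></td>\n</tr>\n};
    return $row x $n;
}

# Shortened version of the original pattern: three lazy .*? runs under /ms.
# With an empty value="" the final ([^"]+) can never match, so the engine
# has to work through the combinations of the .*? runs before failing.
sub match_mid {
    my ($page) = @_;
    return $page =~ /^.*?^\s*<tr class=msg(new|old).*?^<td.*?name="Mid".value="([^"]+)"/ms
        ? $2 : undef;
}

# Time the failing match on pages of increasing size.
for my $rows (10, 20, 40) {
    my $page  = fake_page($rows);
    my $start = time();
    my $mid   = match_mid($page);
    printf "%3d rows (%5d bytes): %s in %.4f s\n",
        $rows, length($page), defined($mid) ? "matched" : "no match",
        time() - $start;
}
```

Doubling the row count and comparing the reported times gives a rough feel
for the growth; the row counts are kept small on purpose so the script
finishes quickly.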

The attached patch sort of fixes the problem. It's almost certainly not
the way this should be solved, but it hopefully sheds some light on what's
going wrong.  One oddity: with --delete, the first run leaves the last
message in the inbox, and the next run then picks up that remaining
message.
-- 
Adam Rosi-Kessel
http://adam.rosi-kessel.org
--- fetchyahoo  2005-11-13 08:49:58.000000000 -0500
+++ fetchyahoo.new      2005-11-13 08:44:24.000000000 -0500
@@ -853,7 +853,9 @@
   my $tmpLine = '';
 
   # the long regex matches and removes a single message
-  while ( $tmpPage =~ s/^.*?^[\s]*<tr class=msg(new|old).*?^<td.*?name="Mid".value="([^"]+)".*?^<td>(.*?)<.*?^<td>.*?^[\s]*<a.href=.*?ShowLetter\?MsgId=([^&]+)&.*?\n(.*?)\n.*?^[\s]*<td .*?>(.*?)<.*?^[\s]*<td>(.*?)<//ms ) {
+  # Adam Rosi-Kessel 2005/11/13 Hackish patch to stop regexp from hanging
+  $tmpPage =~ s/^.*?^[\s]*<tr class=msg/<tr class=msg/ms;
+  while ( $tmpPage =~ s/^<tr class=msg(new|old).*?^<td.*?name="Mid".value="([^"]+)".*?^<td>(.*?)<.*?^<td>.*?^[\s]*<a.*?href=.*?ShowLetter\?MsgId=([^&]+)&.*?\n(.*?)\n.*?^[\s]*<td .*?>(.*?)<.*?^[\s]*<td>(.*?)<//ms ) {
     if (! $2 eq $4) {
         print "\nWarning: message ID's $2 and $4 don't match.\n" unless $quiet;
     }
@@ -871,6 +873,7 @@
         if ($newOnly) { $tmpLine =~ s/^(new|old) //; }
         print $msgcount . ". " . $tmpLine . "\n";
     }
+  $tmpPage =~ s/^.*?^[\s]*(<tr class=msg)/$1/sm;
   }
 
   $pagecount = $pagecount+1 ;            # next summary page
