tag 614966 +patch
thanks

Okay, I managed to reproduce the bug by killing a first update and
then updating the mirror.

The issue was really painful to track down (it occurs very randomly),
and is related to the "delayed type check" option ("don't make any link
test but wait for files download to start instead").

The bug does not occur with -%N0 (delayed type checking disabled).

Basically, httrack scans all HTML files sequentially, using a heap of
links. Each new link is recorded on the heap, and httrack processes
links until no new link is found. HTML pages produce more links when
scanned; static data (images, etc.) is simply skipped. The process stops
when every encountered link has either already been added or is
forbidden.
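
For illustration only, here is a toy model of that scan loop. This is
not httrack's code: link_heap, heap_add, is_forbidden, fake_links and
crawl are made-up names, the "site" is hard-coded, and treating any URL
with a scheme as forbidden is just an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 64

/* Toy link heap: an append-only array, scanned front to back. */
typedef struct {
  const char *url[MAX_LINKS];
  int count;
} link_heap;

/* Index of url on the heap, or -1 if it was never recorded. */
static int heap_find(const link_heap *h, const char *url) {
  for (int i = 0; i < h->count; i++)
    if (strcmp(h->url[i], url) == 0) return i;
  return -1;
}

/* Assumption: anything with a scheme is external, hence forbidden. */
static int is_forbidden(const char *url) {
  return strstr(url, "://") != NULL;
}

/* Record a link unless it is forbidden or already known; 1 if added. */
static int heap_add(link_heap *h, const char *url) {
  if (is_forbidden(url) || heap_find(h, url) >= 0) return 0;
  assert(h->count < MAX_LINKS);
  h->url[h->count++] = url;
  return 1;
}

static int is_html(const char *url) {
  size_t n = strlen(url);
  return n >= 5 && strcmp(url + n - 5, ".html") == 0;
}

/* Stand-in for the HTML parser: a hard-coded three-page site. */
static int fake_links(const char *url, const char **out) {
  int n = 0;
  if (strcmp(url, "index.html") == 0) {
    out[n++] = "a.html";
    out[n++] = "logo.gif";
  } else if (strcmp(url, "a.html") == 0) {
    out[n++] = "index.html";
    out[n++] = "b.html";
    out[n++] = "http://other.example/x.html";
  }
  return n;
}

/* Scan until the cursor reaches the end of the heap.
   Returns the number of HTML pages that were parsed. */
static int crawl(link_heap *h) {
  int parsed = 0;
  for (int ptr = 0; ptr < h->count; ptr++) {
    const char *found[4];
    if (!is_html(h->url[ptr])) continue;   /* static data: skipped */
    int n = fake_links(h->url[ptr], found);
    for (int i = 0; i < n; i++) heap_add(h, found[i]);
    parsed++;
  }
  return parsed;
}
```

Starting from a heap that only holds index.html, the loop ends on its
own once the duplicate and the external link have been rejected.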

To enhance the process, a background downloader ensures that links can
be added regularly. Once a download has finished, the entry is removed
from the background heap, and the link heap is notified that the file
was processed in the background (so that httrack can just skip this
entry). This obviously means that the background downloader must be able
to find the file reference on the link heap.

The "delayed type check" option is a feature that allows the download of
a file to start before it is added to the link heap. The HTTP headers
are then available before the link name is generated, so the file gets
a correct extension on disk (e.g. www.example.com/foo.cgi will be named
foo.gif if it is an image). Local filesystem browsing requires files to
have a correct extension, because files on disk have no MIME type
metadata attached otherwise.
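
As a rough sketch of the renaming idea (again not httrack's actual
code: ext_for_mime, local_name and the small MIME table are made up for
this example), the local name can only be computed once the
Content-Type header is known:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical MIME-to-extension table; NULL when the type is unknown. */
static const char *ext_for_mime(const char *mime) {
  static const struct { const char *mime, *ext; } map[] = {
    {"image/gif", "gif"},
    {"image/jpeg", "jpg"},
    {"image/png", "png"},
    {"text/html", "html"},
  };
  for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
    if (strcmp(map[i].mime, mime) == 0) return map[i].ext;
  return NULL;
}

/* Build the on-disk name from the URL file part and the MIME type.
   Returns 0 on success, -1 if the result does not fit in out. */
static int local_name(const char *url_fil, const char *mime,
                      char *out, size_t outsz) {
  const char *ext = ext_for_mime(mime);
  const char *slash = strrchr(url_fil, '/');
  const char *dot = strrchr(url_fil, '.');
  size_t stem = strlen(url_fil);
  int n;

  /* Drop the old extension only if the dot is in the last path segment. */
  if (ext != NULL && dot != NULL && (slash == NULL || dot > slash))
    stem = (size_t)(dot - url_fil);

  if (ext != NULL)
    n = snprintf(out, outsz, "%.*s.%s", (int)stem, url_fil, ext);
  else
    n = snprintf(out, outsz, "%s", url_fil);  /* unknown type: keep name */

  return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

With this sketch, local_name("foo.cgi", "image/gif", ...) yields
foo.gif, while an unknown MIME type leaves the name untouched.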

This is obviously buggy, because there is a small race condition window
in which the background downloader finishes downloading the file before
the link is added.

In this case, httrack fails to find the link reference on the link
heap, and displays the cryptic warning:

Info:   engine: warning: entry cleaned up, but no trace on heap:  (...)

This causes a lot of trouble, including corrupted files in case of HTTP
retries with preconditions, and many other headaches.
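
To make the race window concrete, here is a stripped-down model. The
early_add flag and the shape of the guard follow the patch below;
toy_heap, toy_slot and the two can_finalize_* helpers are hypothetical
and much simpler than the real lien_back and hash structures.

```c
#include <string.h>

#define MAX_LINKS 16

/* Toy link heap, keyed by local save name. */
typedef struct {
  const char *sav[MAX_LINKS];
  int count;
} toy_heap;

/* Toy background slot. */
typedef struct {
  const char *url_sav;  /* local save name */
  int is_write;         /* file is on disk, ready */
  int early_add;        /* download started before the heap saw the link */
} toy_slot;

/* Position of sav on the heap, or -1 if the link is not recorded yet. */
static int toy_heap_find(const toy_heap *h, const char *sav) {
  for (int i = 0; i < h->count; i++)
    if (strcmp(h->sav[i], sav) == 0) return i;
  return -1;
}

/* Old rule: a finished slot is always cleaned, even if the link heap
   has never heard of it (the "no trace on heap" case). */
static int can_finalize_unguarded(const toy_slot *s) {
  return s->is_write;
}

/* Guarded rule: an early-added slot is kept until the link shows up. */
static int can_finalize_guarded(const toy_heap *h, const toy_slot *s) {
  return s->is_write
      && (!s->early_add || toy_heap_find(h, s->url_sav) >= 0);
}
```

A slot that finished downloading with early_add set and no heap entry
is cleaned by the old rule but held back by the guarded one; once the
link is recorded, the guarded rule lets it go as well.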


Suggested patch that should fix this longstanding (and very painful) issue:

diff -rudb httrack-3.43.12.orig/src/htsback.c httrack-3.43.12/src/htsback.c
--- httrack-3.43.12.orig/src/htsback.c  2010-12-21 11:30:12.000000000 +0100
+++ httrack-3.43.12/src/htsback.c       2011-02-27 21:18:11.531580000 +0100
@@ -2150,10 +2150,12 @@

 static int slot_can_be_finalized(httrackp* opt, const lien_back* back) {
   return
-    (back->r.is_write                             // not in memory (on disk, ready)
+    back->r.is_write                             // not in memory (on disk, ready)
     && !is_hypertext_mime(opt,back->r.contenttype, back->url_fil)   // not HTML/hypertext
     && !may_be_hypertext_mime(opt,back->r.contenttype, back->url_fil)   // may NOT be parseable mime type
-    );
+    /* Has not been added before the heap saw the link, or now exists on heap */
+    && ( !back->early_add || hash_read(opt->hash,back->url_sav,"",0,opt->urlhack) >= 0 )
+    ;
 }

 void back_clean(httrackp* opt,cache_back* cache,struct_back* sback) {
@@ -3243,7 +3245,7 @@
                     /*
                     Solve "false" 416 problems
                     */
-                    if (back[i].r.statuscode==416) {  // 'Requested Range Not Satisfiable'
+                    if (back[i].r.statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE) {  // 'Requested Range Not Satisfiable'
                       // Example:
                       // Range: bytes=2830-
                       // ->
diff -rudb httrack-3.43.12.orig/src/htscore.h httrack-3.43.12/src/htscore.h
--- httrack-3.43.12.orig/src/htscore.h  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htscore.h       2011-02-27 21:07:51.514117000 +0100
@@ -207,6 +207,7 @@
   char info[256];         // éventuel status pour le ftp
   int stop_ftp;           // flag stop pour ftp
   int finalized;          // finalized (optim memory)
+  int early_add;          // was added before link heap saw it
 #if DEBUG_CHECKINT
   char magic2;
 #endif
diff -rudb httrack-3.43.12.orig/src/htshash.c httrack-3.43.12/src/htshash.c
--- httrack-3.43.12.orig/src/htshash.c  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htshash.c       2011-02-27 20:20:09.714432000 +0100
@@ -63,7 +63,7 @@
 // type: numero enregistrement - 0 est case insensitive (sav) 1 (adr+fil) 2 (former_adr+former_fil)
 // recherche dans la table selon nom1,nom2 et le no d'enregistrement
 // retour: position ou -1 si non trouvé
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
   char BIGSTK normfil_[HTS_URLMAXSIZE*2];
        char catbuff[CATBUFF_SIZE];
   char* normfil;
diff -rudb httrack-3.43.12.orig/src/htshash.h httrack-3.43.12/src/htshash.h
--- httrack-3.43.12.orig/src/htshash.h  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htshash.h       2011-02-27 20:20:47.289581000 +0100
@@ -50,7 +50,7 @@
 #endif

 // tables de hachage
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized);
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized);
 void hash_write(hash_struct* hash,int lpos,int normalized);
 int* hash_calc_chaine(hash_struct* hash,int type,int pos);
 unsigned long int hash_cle(char* nom1,char* nom2);
diff -rudb httrack-3.43.12.orig/src/htsparse.c httrack-3.43.12/src/htsparse.c
--- httrack-3.43.12.orig/src/htsparse.c 2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htsparse.c      2011-02-27 21:10:47.974210000 +0100
@@ -3427,8 +3427,8 @@
         }     // bloc
         // erreur HTTP (ex: 404, not found)
       } else if (
-        (r->statuscode==412)
-        || (r->statuscode==416)
+        (r->statuscode==HTTP_PRECONDITION_FAILED)
+        || (r->statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)
         ) {    // Precondition Failed, c'est à dire pour nous redemander TOUT le fichier
           if (fexist(liens[ptr]->sav)) {
             remove(liens[ptr]->sav);    // Eliminer
@@ -4283,6 +4283,9 @@
           return -1;
         }

+        /* We added the link before the parsed recorded it -- the background download MUST NOT clean silently this entry! */
+        back[b].early_add = 1;
+
         /* Cache read failed because file does not exists (bad delayed name!)
         Just re-add with the correct name, as we know the MIME now!
         */
@@ -4329,6 +4332,10 @@
             XH_uninit;    // désallocation mémoire & buffers
             return -1;
           }
+
+          /* We added the link before the parsed recorded it -- the background download MUST NOT clean silently this entry! */
+          back[b].early_add = 1;
+
           if ((opt->debug>1) && (opt->log!=NULL)) {
             HTS_LOG(opt,LOG_DEBUG); fprintf(opt->log,"Type immediately loaded from cache: %s"LF, delayed_back.r.contenttype);
             test_flush;






