tag 614966 +patch
thanks

Okay, I managed to reproduce the bug by killing a first update and then updating the mirror.

The issue was really painful to track down (very random), and is related to the "delayed type check" feature ("don't make any link test but wait for files download to start instead"). The bug will not occur with -%N0 (disabled delayed type checking).

Basically, httrack scans all HTML files sequentially, using a heap of links. Each new link is recorded on the heap, and httrack processes links until no more links are found. HTML pages produce more links when scanned; static data (images, ...) is just skipped. The process basically stops when all encountered links have either already been added or are forbidden.

To enhance the process, a background downloader ensures that links can be added regularly. Once a download has finished, its entry is kicked from the background heap, and the link heap is notified that the file was processed in the background (so that httrack can just skip this entry). This means that the background downloader must, obviously, find the file reference on the link heap.

The "delayed type check" option is a feature that allows the download of a file to start before the file is added to the link heap. The HTTP headers are then available before the link name is generated, so the file gets a correct extension on disk (i.e. www.example.com/foo.cgi will be named foo.gif if it is an image). Local filesystem browsing requires files to have a correct extension, because files have no MIME type metadata attached otherwise.

This is obviously buggy, because there is a small race condition window in which the background downloader finishes downloading the file before the link is added. In this case, httrack fails to find the link reference on the link heap, and displays the cryptic:

  Info: engine: warning: entry cleaned up, but no trace on heap: (...)

This causes a lot of trouble, including corrupted files in case of HTTP retries with preconditions, and many other headaches.

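To make the race window concrete, here is a minimal sketch of the finalization decision, not the actual httrack code. The types link_heap and back_slot and the helpers heap_find, heap_add, can_finalize_unpatched and can_finalize_patched are hypothetical names invented for this illustration; only the idea of an early_add flag combined with a lookup on the link heap comes from the suggested patch. The MIME type checks of the real slot_can_be_finalized() are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LINKS 16

/* Hypothetical, simplified link heap: a flat list of local file names. */
typedef struct {
  const char *sav[MAX_LINKS];
  int count;
} link_heap;

/* Hypothetical background slot, loosely modelled on lien_back. */
typedef struct {
  const char *url_sav;  /* local name of the downloaded file */
  bool is_write;        /* file is fully written to disk */
  bool early_add;       /* download started before the link was recorded */
} back_slot;

/* Return the position of 'sav' on the heap, or -1 if it is not there. */
static int heap_find(const link_heap *heap, const char *sav) {
  for (int i = 0; i < heap->count; i++) {
    if (strcmp(heap->sav[i], sav) == 0)
      return i;
  }
  return -1;
}

/* Record a link on the heap (the parser does this in the real engine). */
static void heap_add(link_heap *heap, const char *sav) {
  assert(heap->count < MAX_LINKS);
  heap->sav[heap->count++] = sav;
}

/* Unpatched behaviour (simplified): a slot is finalized as soon as the
   file is on disk, even if the link heap has never seen the link. */
static bool can_finalize_unpatched(const link_heap *heap, const back_slot *slot) {
  (void) heap;
  return slot->is_write;
}

/* Patched behaviour (simplified): a slot that was added early may only be
   finalized once its link can be found on the heap. */
static bool can_finalize_patched(const link_heap *heap, const back_slot *slot) {
  return slot->is_write
    && (!slot->early_add || heap_find(heap, slot->url_sav) >= 0);
}
```

With an empty heap and a finished slot whose early_add is set, can_finalize_unpatched() returns true (the entry would be cleaned up with no trace on the heap), while can_finalize_patched() returns false until heap_add() has recorded the link.
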
Suggested patch that should fix this longstanding (and very painful) issue:

diff -rudb httrack-3.43.12.orig/src/htsback.c httrack-3.43.12/src/htsback.c
--- httrack-3.43.12.orig/src/htsback.c  2010-12-21 11:30:12.000000000 +0100
+++ httrack-3.43.12/src/htsback.c  2011-02-27 21:18:11.531580000 +0100
@@ -2150,10 +2150,12 @@
 
 static int slot_can_be_finalized(httrackp* opt, const lien_back* back) {
   return
-    (back->r.is_write     // not in memory (on disk, ready)
+    back->r.is_write     // not in memory (on disk, ready)
     && !is_hypertext_mime(opt,back->r.contenttype, back->url_fil)   // not HTML/hypertext
     && !may_be_hypertext_mime(opt,back->r.contenttype, back->url_fil)  // may NOT be parseable mime type
-    );
+    /* Has not been added before the heap saw the link, or now exists on heap */
+    && ( !back->early_add || hash_read(opt->hash,back->url_sav,"",0,opt->urlhack) >= 0 )
+    ;
 }
 
 void back_clean(httrackp* opt,cache_back* cache,struct_back* sback) {
@@ -3243,7 +3245,7 @@
       /* Solve "false" 416 problems */
-      if (back[i].r.statuscode==416) {  // 'Requested Range Not Satisfiable'
+      if (back[i].r.statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE) {  // 'Requested Range Not Satisfiable'
         // Example:
         // Range: bytes=2830-
         // ->
diff -rudb httrack-3.43.12.orig/src/htscore.h httrack-3.43.12/src/htscore.h
--- httrack-3.43.12.orig/src/htscore.h  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htscore.h  2011-02-27 21:07:51.514117000 +0100
@@ -207,6 +207,7 @@
   char info[256];       // éventuel status pour le ftp
   int stop_ftp;         // flag stop pour ftp
   int finalized;        // finalized (optim memory)
+  int early_add;        // was added before link heap saw it
 #if DEBUG_CHECKINT
   char magic2;
 #endif
diff -rudb httrack-3.43.12.orig/src/htshash.c httrack-3.43.12/src/htshash.c
--- httrack-3.43.12.orig/src/htshash.c  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htshash.c  2011-02-27 20:20:09.714432000 +0100
@@ -63,7 +63,7 @@
 // type: numero enregistrement - 0 est case insensitive (sav) 1 (adr+fil) 2 (former_adr+former_fil)
 // recherche dans la table selon nom1,nom2 et le no d'enregistrement
 // retour: position ou -1 si non trouvé
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
   char BIGSTK normfil_[HTS_URLMAXSIZE*2];
   char catbuff[CATBUFF_SIZE];
   char* normfil;
diff -rudb httrack-3.43.12.orig/src/htshash.h httrack-3.43.12/src/htshash.h
--- httrack-3.43.12.orig/src/htshash.h  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htshash.h  2011-02-27 20:20:47.289581000 +0100
@@ -50,7 +50,7 @@
 #endif
 
 // tables de hachage
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized);
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized);
 void hash_write(hash_struct* hash,int lpos,int normalized);
 int* hash_calc_chaine(hash_struct* hash,int type,int pos);
 unsigned long int hash_cle(char* nom1,char* nom2);
diff -rudb httrack-3.43.12.orig/src/htsparse.c httrack-3.43.12/src/htsparse.c
--- httrack-3.43.12.orig/src/htsparse.c  2010-12-21 11:30:13.000000000 +0100
+++ httrack-3.43.12/src/htsparse.c  2011-02-27 21:10:47.974210000 +0100
@@ -3427,8 +3427,8 @@
       } // bloc
       // erreur HTTP (ex: 404, not found)
     } else if (
-      (r->statuscode==412)
-      || (r->statuscode==416)
+      (r->statuscode==HTTP_PRECONDITION_FAILED)
+      || (r->statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)
       ) {    // Precondition Failed, c'est à dire pour nous redemander TOUT le fichier
       if (fexist(liens[ptr]->sav)) {
         remove(liens[ptr]->sav);    // Eliminer
@@ -4283,6 +4283,9 @@
           return -1;
         }
 
+        /* We added the link before the parsed recorded it -- the background download MUST NOT clean silently this entry! */
+        back[b].early_add = 1;
+
         /* Cache read failed because file does not exists (bad delayed name!)
            Just re-add with the correct name, as we know the MIME now!
         */
@@ -4329,6 +4332,10 @@
           XH_uninit;    // désallocation mémoire & buffers
           return -1;
         }
+
+        /* We added the link before the parsed recorded it -- the background download MUST NOT clean silently this entry! */
+        back[b].early_add = 1;
+
         if ((opt->debug>1) && (opt->log!=NULL)) {
           HTS_LOG(opt,LOG_DEBUG); fprintf(opt->log,"Type immediately loaded from cache: %s"LF, delayed_back.r.contenttype);
           test_flush;