Re: Wget ignores robot.txt entry

2003-02-14 Thread Randall R Schulz
Lowell, Max Bowsher reported: Or, on the command line -erobots=off :-) Whilst this does control whether wget downloads robots.txt, a quick test confirms that even when it does get robots.txt, it still wanders into cgi-bin. I'd suggest taking this to the wget list, except wget is currently m…
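The command-line switch mentioned above hands wget a .wgetrc-style directive for a single run; a minimal sketch (the URL and extra options are placeholders for illustration, not taken from the thread):

```shell
# Turn off robots.txt processing for this invocation only.
# -e executes a .wgetrc-style command before the run starts;
# the URL below is hypothetical.
wget -e robots=off --recursive --no-parent http://www.example.com/cgi-bin/
```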

Re: Wget ignores robot.txt entry

2003-02-14 Thread L Anderson
Randall R Schulz wrote: Lowell, What's in your "~/.wgetrc" file? If it contains this: robots = off Then wget will not respect a "robots.txt" file on the host from which it is retrieving files. Before I learned of this option (accessible _only_ via this directive in the .wgetrc file), I did…

Re: Wget ignores robot.txt entry

2003-02-14 Thread Max Bowsher
I just did some archive-diving, and it turns out that the situation is not as grim as I thought. It seems that Hrvoje Niksic will be returning at some point later this year. http://www.mail-archive.com/wget@sunsite.dk/msg04467.html http://www.mail-archive.com/wget@sunsite.dk/msg04555.html Max.

RE: Wget ignores robot.txt entry

2003-02-14 Thread Richard Campbell
>Wasn't a patch applied to CVS HEAD of the wget repos only a few weeks ago? >That's what it looks like, anyway. A cursory scan through the CVS web interface shows a patch for an FTP vulnerability applied 4 weeks ago by someone who does not appear on most of the rest of the latest changes; everything…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Elfyn McBratney
> Max, No, I don't think cURL does recursive retrieval. I don't think it does Web page dependency retrieval, either. Both of these are a big deal for me. How could a tool of wget's versatility be replaced by something inferior? Whatever happened to technological meritocracy? (Please, no…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Charles Wilson
Randall R Schulz wrote: What happens to an open source project when it devolves to this state? Who, for example, could hand out writable access to the wget CVS repository? Surely this isn't an unrecoverable state of affairs, is it? A fork happens. Somebody gets fed up, opens up a new sourcef…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Randall R Schulz
Max, No, I don't think cURL does recursive retrieval. I don't think it does Web page dependency retrieval, either. Both of these are a big deal for me. How could a tool of wget's versatility be replaced by something inferior? Whatever happened to technological meritocracy? (Please, no laughing…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Christopher Faylor
On Fri, Feb 14, 2003 at 03:04:10AM -, Max Bowsher wrote: >Randall R Schulz wrote: >> Wget is orphaned? That's bad news, since it seems to have it all over >> cURL. (Sure. Go ahead and prove me wrong. I might as well get it over >> with... for now.) > >cURL doesn't do recursive web-suck (does it…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Max Bowsher
Randall R Schulz wrote: > Wget is orphaned? That's bad news, since it seems to have it all over > cURL. (Sure. Go ahead and prove me wrong. I might as well get it over > with... for now.) cURL doesn't do recursive web-suck (does it?) Yes, wget is orphaned. There's no one on the wget mailing list…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Randall R Schulz
Max, Right. How can I have read the wget man page so many times and not have seen that? I guess it's 'cause I'm always looking for something specific, like the difference between "-o" and "-O". The only thing I hate worse than being wrong is not knowing it (plus showing it). Wget is orphaned…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Max Bowsher
Randall R Schulz wrote: > Lowell, > > What's in your "~/.wgetrc" file? If it contains this: > > robots = off > > Then wget will not respect a "robots.txt" file on the host from which > it is retrieving files. > > Before I learned of this option (accessible _only_ via this directive > in the .wgetrc…

Re: Wget ignores robot.txt entry

2003-02-13 Thread Randall R Schulz
Lowell, What's in your "~/.wgetrc" file? If it contains this: robots = off Then wget will not respect a "robots.txt" file on the host from which it is retrieving files. Before I learned of this option (accessible _only_ via this directive in the .wgetrc file), I did something too clever by ha…
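The directive discussed in this message lives in wget's per-user startup file; a minimal sketch of the relevant ~/.wgetrc fragment:

```shell
# ~/.wgetrc -- wget's per-user startup file.
# With this set, wget ignores robots.txt (and the robot META tag)
# on every host it retrieves from.
robots = off
```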

Wget ignores robot.txt entry

2003-02-13 Thread L Anderson
Using the latest of things Cygwin, I downloaded some stuff with wget from to peruse off-line and noticed a problem I can't explain: The file has the entries:

User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/

so wget…
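For reference, the effect of the Disallow entries quoted above can be checked offline with Python's standard urllib.robotparser; a small sketch (the file paths in the can_fetch calls are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the message, parsed without any network access.
rules = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant client (wget's default behavior) must skip these paths:
print(parser.can_fetch("*", "/cgi-bin/script.cgi"))    # False
print(parser.can_fetch("*", "/snapshots/foo.tar.gz"))  # False
# ...but may fetch anything not covered by a Disallow line:
print(parser.can_fetch("*", "/index.html"))            # True
```

This is what the thread is arguing wget should honor by default — and does, unless `robots = off` (or `-e robots=off`) overrides it.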