On Sunday 03 November 2019 11:56:52 Reco wrote:

> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
> > On Sunday 03 November 2019 10:23:50 Reco wrote:
> > > On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
> > > > Greetings all
> > > >
> > > > I am developing a list of broken webcrawlers who are repeatedly
> > > > downloading my entire web site including the hidden stuff.
> > > >
> > > > These crawlers/bots are ignoring my robots.txt
> > >
> > > $ wget -O - https://www.shentel.com/robots.txt
> > > --2019-11-03 15:22:35-- https://www.shentel.com/robots.txt
> > > Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
> > > Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
> > > HTTP request sent, awaiting response... 403 Forbidden
> > > 2019-11-03 15:22:36 ERROR 403: Forbidden.
> > >
> > > Allowing said bots to *see* your robots.txt would be a step into
> > > the right direction.
> >
> > But you are asking for shentel.com/robots.txt which is my isp.
> > You should be asking for
> >
> > http://geneslinuxbox.net:6309/gene/robots.txt
>
> Wow. You sir owe me a new set of eyes.

Chuckle :) That was the default I'd picked up from someplace years ago.

> I advise you to compare your monstrosity to this (a hint - it does
> work) - [1].
>
> Reco
>
> [1] https://enotuniq.net/robots.txt

I'll trim mine forthwith to the last entry. I've wondered if that was
too long a list. And restart apache2, of course.

But now I see the next access is not a 200 but a 404, which is not
intended. From the access log:

coyote.coyote.den:80 209.197.24.34 - - [03/Nov/2019:12:19:55 -0500] "GET /gene/lathe-stf/linuxcnc4rpi4 HTTP/1.1" 404 498 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

That directory exists, so shouldn't that have been a 200?

Cheers, Gene Heskett
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>