On Fri, 2020-12-04 at 17:06 -0500, Gene Heskett wrote:
> On Friday 04 December 2020 16:14:29 Tixy wrote:
> > On Fri, 2020-12-04 at 14:51 -0500, Gene Heskett wrote:
> > > On Friday 04 December 2020 12:39:24 Reco wrote:
> > > > Hi.
> > > >
> > > > On Fri, Dec 04, 2020 at 08:39:42AM -0500, Gene Heskett wrote:
> > > > > But I asked specifically how to enable it for one bot, and I've
> > > > > asked that question several times, getting smoke-and-mirrors
> > > > > answers you all assume are helpful, but which are useless to a
> > > > > new user installing the now 7-year-old and long out-of-date
> > > > > package that in effect has no "how it works" docs. I asked 3
> > > > > questions in the previous day or so, and no one has actually
> > > > > attempted to answer even one of them. Here is one line from
> > > > > that log, from a bot I just blocked:
> > > > >
> > > > > coyote.coyote.den:80 192.99.6.226 - -
> > > > > [04/Dec/2020:07:18:20 -0500] "GET
> > > > > /gene/toolshed/c3/build/win32/prep/?C=S;O=D HTTP/1.1" 200 673
> > > > > "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
> > > > > http://mj12bot.com/)"
> > > >
> > > > Taken directly from the link.
> > > >
> > > > Bot Type          Good crawler (always identifies itself)
> > > > IP Range          Distributed, Worldwide
> > > > Obeys Robots.txt  *Yes*
> > >
> > > Sorry, they do not; they've read it and ignored it 428 times in the
> > > life of that log, which I zeroed out around 1 July of this year.
> >
> > Why would they read it if they were going to just ignore it? Perhaps
> > your robots.txt is broken? Hint: it is, in 2 or 3 different ways I
> > can see (if it's http://geneslinuxbox.net:6309/robots.txt we're
> > talking about). That file doesn't have any syntactically correct
> > entry in there for blocking that bot.
>
> And what might that be like? I'll fix it right now.
OK, I'll do your proofreading... At the end of the robots.txt you are
missing a colon from a rule that disallows everything for all bots...

User-agent *
Disallow: /

That should be:

User-agent: *
Disallow: /

But if you just want to disable the bot you reckon is a problem, the
front page of their site (https://mj12bot.com/) says you want:

User-agent: MJ12bot
Disallow: /

Or you could read their page to see the robots.txt syntax for slowing
down crawling, which I assume is relevant to other bots you may have
problems with.

The other rules above your disallow-everything rule (which are
superfluous if you keep that final rule) also have typos. You have a
'0' here...

User-0agent: *
Disallow: /doc/

And this rule has a space in the URL...

User-agent: *
Disallow: stress test

I'm pretty sure URLs can't have actual space characters in them, and
that must be a typo on your part. Also, something I read when looking
at this issue a few hours ago (but can't find again) reckoned that
Google's bot lets you have multiple statements on a line separated by
spaces, e.g.

Disallow: foo Disallow: bar

So it seems likely that having a space in the URL like you have isn't
legal, and could possibly upset parsing.

--
Tixy
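P.S. If you want to sanity-check corrected rules before deploying them,
Python's standard-library urllib.robotparser can parse a robots.txt
snippet locally. A minimal sketch, assuming the MJ12bot block plus a
"/doc/" rule for everyone else; the bot name "SomeOtherBot" and the
paths tested are just illustrative:

```python
# Sketch: locally check robots.txt rules with the Python standard library.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /doc/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # record a parse time so can_fetch() knows rules are loaded

# MJ12bot should be disallowed everywhere:
print(rp.can_fetch("MJ12bot", "/gene/toolshed/"))        # False
# Other agents are only disallowed under /doc/:
print(rp.can_fetch("SomeOtherBot", "/gene/toolshed/"))   # True
print(rp.can_fetch("SomeOtherBot", "/doc/manual.html"))  # False
```

That at least tells you whether a well-behaved parser reads your file
the way you intended; whether a given crawler actually obeys it is
another matter.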