Paul
In a perfect world, yes. Here's a trick you guys can use: generate a
robots.txt and add a few pages not to crawl. Assuming bad bots will
ignore it, one of the "do not crawl" pages will have a trigger that blocks
the IP address of the session. You would need the ability to communicate
the IP address of the offending bot to a process that does the blocking.
There are various ways to do that.
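A minimal sketch of that trick, assuming a Python site backend; the trap path name is hypothetical, and in practice the block list would be handed off to a firewall or a tool like fail2ban rather than kept in memory:

```python
# Sketch of the robots.txt honeypot described above.
# Assumptions: the trap path is made-up; any URL a normal user
# would never visit works, as long as robots.txt disallows it.

TRAP_PATH = "/private/trap-page"   # hypothetical "do not crawl" page
blocked_ips = set()                # IPs that tripped the wire

def robots_txt() -> str:
    """Serve a robots.txt that disallows the trap page.
    Well-behaved crawlers skip it; bad bots reveal themselves."""
    return f"User-agent: *\nDisallow: {TRAP_PATH}\n"

def handle_request(path: str, client_ip: str) -> int:
    """Return an HTTP status for this request, blocking trapped IPs."""
    if client_ip in blocked_ips:
        return 403                    # previously trapped: deny everything
    if path == TRAP_PATH:
        blocked_ips.add(client_ip)    # crawled a disallowed page: block it
        return 403
    return 200                        # normal traffic
```

The last step, communicating the offending IP to whatever does the real blocking, could be as simple as appending it to a file that a firewall script watches.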

On Wed, Sep 17, 2025 at 9:46 AM Paul Koning via cctalk <
[email protected]> wrote:

> A web crawler that does not obey robots.txt is not a law abiding outfit.
> Best would be to block it entirely.  If they are that dismissive of
> honesty, they are also unlikely to pay attention to such matters as
> copyright and intellectual property ownership.
>
>         paul
>
> > On Sep 16, 2025, at 8:55 PM, Wayne S via cctalk <[email protected]>
> wrote:
> >
> > They do not observe robots.txt.
> > Sent from my iPhone
> >
> >> On Sep 16, 2025, at 17:53, Wayne S <[email protected]> wrote:
> >>
> >> I did notice the scraping.
> >> I toyed with the idea of putting ludicrous text files up that a normal
> user would not see and see which bot got them.
> >>
> >> Sent from my iPhone
> >>
> >>> On Sep 16, 2025, at 17:02, Bill Degnan via cctalk <
> [email protected]> wrote:
> >>>
> >>> For those of you who run vintage computing-related info sites, have
> you
> >>> noticed all of the LLM scraper activity?    AI services are using the
> LLM
> >>> scrapers to populate their knowledge bases.
> >>>
> >>> At any given moment 5-10 of them are active on vintagecomputer.net.
> It’s
> >>> funny, when I ask an AI about something vintage computing-related,
> >>> something obscure, I can trick it into giving me an answer from my own
> site.
> >>>
> >>> I have actually had to modify the site code to manage the traffic, to
> >>> improve efficiency.
> >>>
> >>> But they’re not going after just my site, these scrapers are absorbing
> >>> copies of the entire WWW.
> >>>
> >>> I wonder how long the WWW will remain open, it would be a bummer if I
> found
> >>> copies of my site elsewhere.
> >>>
> >>> Bill
>
>
