[cctalk] Re: Large language model (LLM) Web Scrapers

Paul Koning via cctalk Wed, 17 Sep 2025 06:47:03 -0700

A web crawler that does not obey robots.txt is not a law-abiding outfit.  Best 
would be to block it entirely.  If they are that dismissive of honesty, they 
are also unlikely to pay attention to such matters as copyright and 
intellectual property ownership.
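
One way to see which crawlers ignore robots.txt is to compare the web server's
access log against the site's own rules.  The following is only a minimal
sketch under stated assumptions: the log is in the common "combined" format,
the /trap/ path and the ExampleBot name are hypothetical, and find_violators
is an illustrative helper, not anything a list member is known to run.  It
uses Python's standard urllib.robotparser module.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /trap/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

# Pulls the request path and user-agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def find_violators(log_lines, robots_txt=ROBOTS_TXT):
    """Return {user_agent: [requested paths that robots.txt disallows]}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violators = {}
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        path, agent = match.group("path"), match.group("agent")
        if not parser.can_fetch(agent, path):
            violators.setdefault(agent, []).append(path)
    return violators
```

User-agent strings are self-reported and easy to forge, so a report like this
is a hint about who to block, not proof of who is behind the requests.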


        paul

> On Sep 16, 2025, at 8:55 PM, Wayne S via cctalk <[email protected]> wrote:
> 
> They do not observe robots.txt.
> Sent from my iPhone
> 
>> On Sep 16, 2025, at 17:53, Wayne S <[email protected]> wrote:
>> 
>> I did notice the scraping.
>> I toyed with the idea of putting ludicrous text files up that a normal user 
>> would not see and see which bot got them.
>> 
>> Sent from my iPhone
>> 
>>> On Sep 16, 2025, at 17:02, Bill Degnan via cctalk <[email protected]> 
>>> wrote:
>>> 
>>> For those of you who run vintage computing-related info sites, have you
>>> noticed all of the LLM scraper activity?  AI services are using the LLM
>>> scrapers to populate their knowledge bases.
>>> 
>>> At any given moment 5-10 of them are active on vintagecomputer.net.  It’s
>>> funny: when I ask an AI about something obscure in vintage computing,
>>> I can trick it into giving me an answer from my own site.
>>> 
>>> I have actually had to modify the site code to manage the traffic, to
>>> improve efficiency.
>>> 
>>> But they’re not going after just my site, these scrapers are absorbing
>>> copies of the entire WWW.
>>> 
>>> I wonder how long the WWW will remain open.  It would be a bummer if I
>>> found copies of my site elsewhere.
>>> 
>>> Bill
