Re: scraperbot protection - Patchwork and Bunsen behind Anubis

2025-04-22 Thread Guinevere Larsen

On 4/22/25 10:06 AM, Jonathan Wakely wrote:

On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc wrote:

On 4/21/25 12:59 PM, Mark Wielaard wrote:

Hi hackers,

TLDR; When using https://patchwork.sourceware.org or Bunsen
https://builder.sourceware.org/testruns/ you might now have to enable
javascript. This should not impact any scripts, just browsers (or bots
pretending to be browsers). If it does cause trouble, please let us
know. If this works out we might also "protect" bugzilla, gitweb,
cgit, and the wikis this way.

We don't like to have to do this, but as some of you might have noticed,
Sourceware has been fighting the new AI scraperbots since the start of
the year. We are not alone in this.

https://lwn.net/Articles/1008897/
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

We have tried to isolate services more and block various IP blocks
that were abusing the servers. But that has helped only so much.
Unfortunately the scraper bots are using lots of IP addresses
(probably by installing "free" VPN services that use normal user
connections as exit points) and pretending to be common
browsers/agents. We seem to have to make access to some services
depend on solving a javascript challenge.
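
(For the curious: the kind of challenge Anubis poses is a small
proof-of-work. Below is a minimal sketch of the general idea in
Python; the hashing scheme and difficulty handling here are
assumptions for illustration, not Anubis's actual protocol, which
runs the search in the visitor's browser.)

import hashlib
import secrets

def solve_challenge(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce so that sha256(challenge + nonce)
    starts with `difficulty` zero hex digits. Anubis runs this kind of
    search in the visitor's browser via javascript; Python here is
    purely for illustration."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single cheap hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = secrets.token_hex(16)      # issued fresh per visitor
nonce = solve_challenge(challenge, 4)  # ~65k hashes on average
assert verify(challenge, nonce, 4)     # success -> set a signed cookie

The asymmetry is the point: a real visitor pays a fraction of a second
of CPU once per cookie, while a scraper hitting every page from
thousands of addresses has to pay it over and over.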

Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
scrapers might be doing this:
https://social.wildeboer.net/@jwildeboer/114360486804175788 (this links
to the last post in the thread, because the thread is hard to follow
given the number of replies; please go all the way up and read all 8
posts).

Essentially, there's a library developer that pays app developers to
just "include this library and a few more lines in your TOS". The
library then lets the app sell the end-user's bandwidth to the library
developer's clients, allowing them to make requests from that user's
connection. This is how big companies are managing to have so many IP
addresses, many of them residential, and it also means that by blocking
those IP addresses we will necessarily be blocking real user traffic to
our platforms.
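
(To make concrete why per-IP blocking fails against this model, here
is a minimal sketch in Python of what the scraper's side might look
like. The proxy pool is hypothetical, using documentation-range
addresses, and `requests` is a third-party library; real networks of
this kind expose similar rotating-proxy endpoints.)

import itertools
import requests  # third-party HTTP client: pip install requests

# Hypothetical pool of residential exits rented from such a network
# (documentation-range addresses); each is some app user's home line.
RESIDENTIAL_PROXIES = [
    "http://198.51.100.23:8080",
    "http://203.0.113.7:8080",
    "http://192.0.2.149:8080",
]

def scrape(urls):
    """Send each request through a different residential exit. The
    server just sees ordinary household IPs presenting a browser
    User-Agent, and blocking any one IP also blocks the household
    (or CGNAT neighborhood) behind it."""
    exits = itertools.cycle(RESIDENTIAL_PROXIES)
    for url in urls:
        proxy = next(exits)
        requests.get(url,
                     proxies={"http": proxy, "https": proxy},
                     headers={"User-Agent": "Mozilla/5.0"},
                     timeout=10)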

It seems to me that blocking real users *who are running these shady
apps* is perfectly reasonable.

They might not realise it, but those users are part of the problem. If
we block them, maybe they'll be incentivised to stop using the shady
apps. And if users stop using those apps, maybe those app developers
will stop bundling the libraries that piggyback on users' bandwidth.


If an IP mapped perfectly to one user, maybe. But I can't control what
other users of the same ISP in the same area as me are doing, and we're
sharing an IP (carrier-grade NAT routinely puts many households behind
one address). And worse, if I still lived with my family, there is no
way I could veto what my parents use their phones for, so because they
have a shady app I wouldn't be able to access these systems? That
doesn't seem fair at all.


Not to mention that "read and understand the entirety of the TOS of
every single app" assumes a decent amount of free time that users may
not have, and we wouldn't want to make open source even more hostile to
people who are already overwhelmed or overworked. Of course people
should read the terms, but making that a requirement excludes people
like... well, myself, to be quite honest.

I'm happy to see that Sourceware is moving to a more comprehensive
solution. If this is successful, I'd suggest we also apply it to the
forgejo instance and remove the IPs that were blocked because of this
scraping.

For now, maybe. This thread already explained how to get around Anubis
by changing the User-Agent string; how long will it be until these
peer-to-business network libraries figure that out?


Hopefully longer than the bubble lasts.

--
Cheers,
Guinevere Larsen
She/Her/Hers



Re: scraperbot protection - Patchwork and Bunsen behind Anubis

2025-04-22 Thread Guinevere Larsen

On 4/21/25 12:59 PM, Mark Wielaard wrote:

Hi hackers,

TLDR; When using https://patchwork.sourceware.org or Bunsen
https://builder.sourceware.org/testruns/ you might now have to enable
javascript. This should not impact any scripts, just browsers (or bots
pretending to be browsers). If it does cause trouble, please let us
know. If this works out we might also "protect" bugzilla, gitweb,
cgit, and the wikis this way.

We don't like to have to do this, but as some of you might have noticed,
Sourceware has been fighting the new AI scraperbots since the start of
the year. We are not alone in this.

https://lwn.net/Articles/1008897/
https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

We have tried to isolate services more and block various IP blocks
that were abusing the servers. But that has helped only so much.
Unfortunately the scraper bots are using lots of IP addresses
(probably by installing "free" VPN services that use normal user
connections as exit points) and pretending to be common
browsers/agents. We seem to have to make access to some services
depend on solving a javascript challenge.


Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
scrapers might be doing this:
https://social.wildeboer.net/@jwildeboer/114360486804175788 (this links
to the last post in the thread, because the thread is hard to follow
given the number of replies; please go all the way up and read all 8
posts).


Essentially, there's a library developer that pays app developers to
just "include this library and a few more lines in your TOS". The
library then lets the app sell the end-user's bandwidth to the library
developer's clients, allowing them to make requests from that user's
connection. This is how big companies are managing to have so many IP
addresses, many of them residential, and it also means that by blocking
those IP addresses we will necessarily be blocking real user traffic to
our platforms.


I'm happy to see that Sourceware is moving to a more comprehensive
solution. If this is successful, I'd suggest we also apply it to the
forgejo instance and remove the IPs that were blocked because of this
scraping.

So we have installed Anubis https://anubis.techaro.lol/ in front of
patchwork and bunsen. This means that if you are using a browser that
identifies as Mozilla or Opera in its User-Agent you will get a
brief page showing the happy anime girl, which requires javascript to
solve a challenge and get a cookie to get through. Scripts and search
engines should get through without it. Also, removing Mozilla and/or
Opera from your User-Agent will get you through without javascript.
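
(A minimal sketch of that User-Agent rule in Python; Anubis itself is
written in Go, so this is an illustration of the rule as described
above, not its source.)

def needs_challenge(user_agent: str) -> bool:
    # Challenge only clients claiming to be interactive browsers;
    # plain scripts and search-engine crawlers pass straight through.
    return "Mozilla" in user_agent or "Opera" in user_agent

# curl, wget, python-requests etc. are untouched...
assert not needs_challenge("curl/8.5.0")
# ...while a browser (or a bot imitating one) gets the challenge page.
assert needs_challenge("Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0")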

We want to thank Xe Iaso, who helped us set this up and worked
with us over the Easter weekend solving some of our problems/typos.
Please check out the links below if you want to become one of their
patrons as a thank you.
https://xeiaso.net/notes/2025/anubis-works/
https://xeiaso.net/patrons/

Cheers,

Mark



--
Cheers,
Guinevere Larsen
She/Her/Hers