: What I mean is that when you have publicly exposed search that bots crawl,
: they issue all kinds of crazy "queries" that result in errors, that add
: noise to Solr caches, increase Solr cache evictions, etc. etc.

I dealt with this type of thing a few years back by having my front end app 
execute queries against different Solr tiers based on the User-Agent: typical 
users went to the main tier, known bots of partners to their own alt tier, 
and known bots of public crawlers to a third alt tier.
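The routing itself can be very simple. Here's a minimal sketch of the kind of User-Agent dispatch a front end app might do -- the tier URLs and bot name patterns below are made up for illustration, not anything from my actual setup:

```python
import re

# Hypothetical Solr tier endpoints -- substitute your own hosts/collections.
MAIN_TIER = "http://solr-main:8983/solr/collection1"
PARTNER_TIER = "http://solr-partner:8983/solr/collection1"
CRAWLER_TIER = "http://solr-crawler:8983/solr/collection1"

# Example patterns; real deployments would maintain their own lists.
PARTNER_BOTS = re.compile(r"AcmePartnerBot|SomeOtherPartnerBot", re.I)
PUBLIC_CRAWLERS = re.compile(r"Googlebot|bingbot|Slurp|DuckDuckBot", re.I)

def pick_tier(user_agent: str) -> str:
    """Return the Solr base URL this request should be proxied to."""
    if PARTNER_BOTS.search(user_agent):
        return PARTNER_TIER
    if PUBLIC_CRAWLERS.search(user_agent):
        return CRAWLER_TIER
    return MAIN_TIER
```

The front end then sends the query to whatever `pick_tier()` returns; unrecognized agents fall through to the main tier so real users are never misrouted.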

In some cases these alternate tiers had the same configs as my normal 
search tier, but by being distinct, the unusual and erratic query volume 
and number of unique queries didn't screw up the cache hit rates or user 
stats generated by log parsing that I would use on my regular search tier.  
In other cases the tiers had slightly different configs, ie: the bots of my 
known partners ran twice a day at predictable times, didn't do any 
faceting, and used a very predictable set of filters -- so I did 
snappulling only twice a day, and force-warmed those filters.
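Force-warming a known filter set can be done with a static warming listener in solrconfig.xml. A hypothetical fragment for that partner-bot tier might look like this -- the field names and filter values are invented for the example:

```xml
<!-- Warm the handful of filters the partner bots always use, after each
     snappull/commit, so their first queries hit a hot filterCache.
     Field names here are hypothetical. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="fq">in_stock:true</str></lst>
    <lst><str name="q">*:*</str><str name="fq">doc_type:product</str></lst>
  </arr>
</listener>
```

Since the bots only hit the tier twice a day at predictable times, warming a fixed list like this covers essentially all of their filter traffic.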

I advocate this kind of distinct search tier per "user base" even for 
human users -- assuming your volume is high enough and you have the 
budget for the hardware.  Users who do similar queries on a certain 
subset of documents (with tons of faceting on a certain subset of fields) 
should all use the same set of query servers -- but if a different group of 
users tends to issue different types of queries (and facet on different 
fields), and you know this in advance, you might as well have that second 
group of people query different boxes.

It's essentially "session affinity", except it's not about sessions -- it's 
about expected behavior based on what you know about the user.


-Hoss
