All, thanks for the good feedback. Letting the load balancer route bots to
specific slaves and humans to others seems like the way forward this time.
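Roughly what that routing could look like, sketched at the application layer
(a minimal sketch only: the class name, pool URLs and bot pattern below are
made up for illustration, and in practice the User-Agent check would live in
the load balancer itself):

    import java.util.regex.Pattern;

    // Minimal sketch: pick a Solr slave pool per request based on User-Agent.
    public class BotAwareSolrRouter {

        // Hypothetical pool URLs -- substitute the real slave hosts.
        private static final String HUMAN_POOL = "http://solr-human:8983/solr";
        private static final String BOT_POOL   = "http://solr-bots:8983/solr";

        // Very rough bot detection; a real setup would use the load
        // balancer's own User-Agent rules or a maintained bot list.
        private static final Pattern BOT_UA =
                Pattern.compile("(?i)googlebot|msnbot|slurp|crawler|spider");

        /** Returns the Solr base URL this request should be sent to. */
        public String solrBaseUrl(String userAgent) {
            if (userAgent != null && BOT_UA.matcher(userAgent).find()) {
                return BOT_POOL;   // bots only warm (and evict from) their own caches
            }
            return HUMAN_POOL;     // human traffic keeps a cache shaped by human queries
        }
    }

That way the bots can never evict human-generated entries, and nothing inside
Solr needs to change.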
Thanks,
Tobias

2008/9/1 Walter Underwood <[EMAIL PROTECTED]>

> How many documents do you have in your index? How many unique
> queries per day, bot and human? What are your cache hit ratios?
>
> Maybe you can increase the size of the caches and not worry about
> it. Search engine position is important. Have marketing pay for
> the extra memory (I'm not kidding).
>
> Sending all the bot queries to a separate machine is also
> a reasonable approach. Heck, bill that machine to marketing!
>
> wunder
>
> On 9/1/08 7:34 AM, "Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote:
>
> > Apart from hacking the internals, there's nothing inside Solr which will
> > let you do that. EHCache is for application-layer caches; Solr is an
> > external server, so it can't know about your application. I think that
> > over a period of time the caches will be back to normal (through
> > user-generated requests) and it shouldn't be a big problem.
> >
> > How slow are your user queries becoming? Will it help if you limit all
> > bot queries to a certain fixed number of Solr instances?
> >
> > On Mon, Sep 1, 2008 at 7:44 PM, Tobias Hill <[EMAIL PROTECTED]> wrote:
> >
> >> Maybe I was a bit unclear, so let me try other words.
> >>
> >> I didn't have the statistics page in mind. All I care about is that I
> >> don't want a massive amount of bot-generated queries to affect the
> >> internal statistics of the caches in Solr. If caching could be switched
> >> off for bot queries, the cache would reflect the human search pattern
> >> much better. This in turn would increase the cache hit rate enormously
> >> for the clients we care most about (i.e. humans).
> >>
> >> Think about it: say you have 10-20 queries per second coming from bots
> >> exploring the corners of your data (because that is what they do best).
> >> Wouldn't you consider it a problem that such a result (which is highly
> >> unlikely to get another hit during its lifetime) gets cached, pushing
> >> out other (possibly human-generated) items from the cache in LRU
> >> fashion?
> >>
> >> Most other cache solutions I've worked with handle things like this by
> >> providing silent ways (statistics-wise) to get data from the cache.
> >>
> >> For instance, we are using EHCache for another part of our application
> >> like this:
> >>
> >>     Result result = search.isCacheUpdateAllowed()
> >>             ? cache.get(search)
> >>             : cache.getQuietly(search);
> >>
> >> Equally, we never put any results emanating from a bot into that
> >> EHCache. Back when we did, the hit rate on the cache was much worse
> >> than it is today.
> >>
> >> So my question remains: is there an easy way to instruct Solr to handle
> >> my request *quietly*, cache-statistics-wise(*)?
> >>
> >> Best regards,
> >> Tobias
> >>
> >> (*) i.e. instruct Solr to:
> >>     a1) serve the result from the cache if possible
> >>     a2) ... and if so, never update the cache statistics for this "get";
> >>     - or -
> >>     b1) serve the result from the index
> >>     b2) ... and if so, never put that result into the cache.
> >>
> >> 2008/9/1 Shalin Shekhar Mangar <[EMAIL PROTECTED]>
> >>
> >>> If you are serving cached queries to the bot, what would be the benefit
> >>> of suppressing those queries from figuring in the cache statistics
> >>> page?
> >>>
> >>> On Mon, Sep 1, 2008 at 2:46 PM, Tobias Hill <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Is there any way to prevent a certain query from being added to the
> >>>> caches (or from affecting cache statistics) in Solr?
> >>>>
> >>>> *Reason:* We have a very search-oriented website. The SEO aspects of
> >>>> the site are also important, which is why almost the entire search
> >>>> space is traversable by indexing bots (Googlebot, for instance). These
> >>>> bots are a substantial part of the traffic on the site*. Needless to
> >>>> say, the usage pattern of a bot is very different from that of a human
> >>>> being, and in short the bots are filling the caches with "corner data"
> >>>> from the search space. As a consequence, human-initiated searches
> >>>> suffer a lot and are far from *as cached as they could be*.
> >>>>
> >>>> I have no problem with serving a bot a cached page; the only problem
> >>>> is that the bots are allowed to be part of the cache statistics.
> >>>>
> >>>> Is there any way to easily suppress this?
> >>>>
> >>>> Best regards,
> >>>> Tobias
> >>>>
> >>>> *) Actually this is not rare; see the book "Release It!: Design and
> >>>> Deploy Production-Ready Software" for more details on this reality.
> >>>
> >>> --
> >>> Regards,
> >>> Shalin Shekhar Mangar.