I get what you are trying to do... yes, Googlebot essentially fills up the cache with edge cases.

There is nothing in Solr that lets you bypass the cache for some queries and not others. Given the way parts of Solr work, it is a bad idea to turn off caching completely (a Document may be retrieved a few times within a single request).

One idea (I don't know if it is a good one): if you are in a load-balanced environment, you could send all the bot-based requests to a single machine or set of machines while normal requests use the whole cluster. This would keep the caches on most of the machines filled with common 'user' requests.
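
A rough sketch of that routing idea, in case it helps. Everything here is an assumption for illustration: the class name, the host names and the list of User-Agent markers are made up, and none of it is a Solr API. It would sit in whatever front end dispatches requests to the Solr machines.

```java
import java.util.List;
import java.util.Locale;

/** Picks a Solr host per request, based on the User-Agent header. */
class BotAwareRouter {
    // Hypothetical substrings used to spot crawlers; extend as needed.
    private static final List<String> BOT_MARKERS =
            List.of("googlebot", "bingbot", "slurp");

    private final List<String> botPool;   // small pool reserved for crawlers
    private final List<String> fullPool;  // whole cluster, used for human traffic
    private int botNext = 0;
    private int fullNext = 0;

    BotAwareRouter(List<String> botPool, List<String> fullPool) {
        this.botPool = List.copyOf(botPool);
        this.fullPool = List.copyOf(fullPool);
    }

    /** Case-insensitive substring match against the known crawler markers. */
    static boolean isBot(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return BOT_MARKERS.stream().anyMatch(ua::contains);
    }

    /** Round-robin within the pool that matches the kind of client. */
    synchronized String pickHost(String userAgent) {
        if (isBot(userAgent)) {
            return botPool.get(Math.floorMod(botNext++, botPool.size()));
        }
        return fullPool.get(Math.floorMod(fullNext++, fullPool.size()));
    }
}
```

With a bot pool of one machine, crawler traffic only ever warms that machine's caches, while the caches on the rest of the cluster see human queries only.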

ryan



On Sep 1, 2008, at 4:14 PM, Tobias Hill wrote:

Maybe I was a bit unclear; let me try in other words.

I didn't have the statistics page in mind. All I care about is that I don't
want a massive amount of bot-generated queries to affect the internal
statistics of the caches in Solr. If caching could be switched off for
bot queries, the cache would reflect the human search pattern much
better. This in turn would increase the cache hit rate enormously for
the clients that we care most about (i.e. humans).

Think about it: Say that you have 10-20 queries per second coming from
bots exploring the corners of your data (because that is what they do
best)... wouldn't you consider it a problem that this result (which is
highly unlikely to get another hit during its lifetime) gets cached,
pushing out other (possibly human-generated) items from the cache in an
LRU fashion?

Most other cache solutions I've worked with offer ways to handle things
like this by providing silent ways (statistics-wise) to get the data
from the cache.

For instance, we are using EHCache for another part of our application
like this:

    Result result = search.isCacheUpdateAllowed()
        ? cache.get(search)
        : cache.getQuietly(search);

Likewise, we never put any results originating from a bot into that EHCache. Back when we did, the hit rate on the cache was much worse than it is today.


So my question remains: Is there an easy way to instruct Solr to handle
my request *quietly* cache-statistics-wise(*)?

Best regards,
Tobias


(*) i.e. instruct Solr to:
     a1) serve the result from the cache if possible
     a2) ... and if so never update the statistics of the cache for
         this "get".

      - or -

     b1) serve the results from the index
     b2) ... and if so never put that result in the cache.
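
To make the two options concrete, here is a minimal sketch of a small LRU cache that supports both. It is not Solr's cache and not EHCache; the class `QuietLruCache` and its method names are hypothetical, and it assumes null values are never stored. A quiet lookup (a1/a2) returns a cached value without touching hit/miss counters or recency, and a store flagged as coming from a bot (b1/b2) is simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small LRU cache whose lookups can skip statistics and recency updates. */
class QuietLruCache<K, V> {
    private final LinkedHashMap<K, V> map;
    private long hits = 0;
    private long misses = 0;

    QuietLruCache(final int capacity) {
        // Insertion-ordered map: the eldest entry is the least recently used,
        // because every counted hit re-inserts its entry at the tail.
        this.map = new LinkedHashMap<K, V>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Normal lookup: counts as a hit or miss and refreshes recency. */
    synchronized V get(K key) {
        V value = map.remove(key);
        if (value == null) {
            misses++;
            return null;
        }
        map.put(key, value); // move to the most recently used position
        hits++;
        return value;
    }

    /** Quiet lookup (a1/a2): serves a cached value and changes nothing else. */
    synchronized V getQuietly(K key) {
        return map.get(key);
    }

    /** Conditional store (b1/b2): results produced for a bot are not cached. */
    synchronized void put(K key, V value, boolean fromBot) {
        if (fromBot) {
            return;
        }
        map.remove(key);     // so that an overwrite also moves to the tail
        map.put(key, value); // may evict the least recently used entry
    }

    synchronized boolean contains(K key) { return map.containsKey(key); }
    synchronized long hits() { return hits; }
    synchronized long misses() { return misses; }
}
```

A caller would use `getQuietly` and `put(key, value, true)` for bot requests and the normal `get` and `put(key, value, false)` for everyone else, so both the eviction order and the hit rate reflect human traffic only.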






2008/9/1 Shalin Shekhar Mangar <[EMAIL PROTECTED]>

If you are serving cached queries to the bot, what would be the benefit of suppressing those queries from figuring into the cache statistics page?

On Mon, Sep 1, 2008 at 2:46 PM, Tobias Hill <[EMAIL PROTECTED]> wrote:

Hi all,

Is there any way to prevent a certain query from being added to the
caches (or from affecting cache statistics) in Solr?

*Reason:* We have a very search-oriented website. The SEO aspects
of the site are also important, which is why almost the entire search
space is traversable by indexing bots (Googlebot for instance). These
bots are a substantial part of the traffic on the site*. Needless to
say, the usage pattern of a bot is very different from that of a human
being... and in short the bots are filling the caches with
"corner-data" from the search space. As a consequence, human-initiated
searches suffer a lot and are far from *as cached as they could be*.

I have no problem with serving a bot a cached page; the only problem
is that the bots are allowed to be part of the cache statistics.

Is there any way to easily suppress this?

Best regards,
Tobias


*) Actually this is not rare; see the book "Release It!: Design and
 Deploy Production-Ready Software" for more details on this reality.




--
Regards,
Shalin Shekhar Mangar.

