I get what you are trying to do... yes, Googlebot essentially fills
up the cache with edge cases.
There is nothing in Solr that lets you use the cache for some queries
and not others. Given the way parts of Solr work, it is a bad idea
to turn off caching completely (a Document may be retrieved a few times
within a single request).
One idea (I don't know if it is a good one): if you are in a
load-balanced environment, you could send all the bot-based requests to a
single machine or set of machines while normal requests use the whole
cluster. This would keep the caches on most of the machines filled with
common 'user' requests.
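A minimal sketch of that routing idea, assuming the decision is made in application code in front of Solr and that bots can be recognised by their User-Agent header. The class name, the host names and the list of User-Agent markers are all hypothetical; nothing here is a Solr or load-balancer API.

```java
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from Solr or from a real load balancer.
class BotAwareRouter {
    // Illustrative, incomplete list of crawler User-Agent substrings (an assumption).
    private static final List<String> BOT_MARKERS = List.of("googlebot", "bingbot", "slurp");

    private final List<String> allHosts;   // whole cluster, used for normal requests
    private final List<String> botHosts;   // machine(s) reserved for bot traffic
    private final AtomicInteger userCounter = new AtomicInteger();
    private final AtomicInteger botCounter = new AtomicInteger();

    BotAwareRouter(List<String> allHosts, List<String> botHosts) {
        if (allHosts.isEmpty() || botHosts.isEmpty()) {
            throw new IllegalArgumentException("both host lists must be non-empty");
        }
        this.allHosts = List.copyOf(allHosts);
        this.botHosts = List.copyOf(botHosts);
    }

    // Case-insensitive substring match on the User-Agent header; null means "not a bot".
    static boolean isBot(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return BOT_MARKERS.stream().anyMatch(ua::contains);
    }

    // Round-robin within the pool selected by the User-Agent.
    String pickHost(String userAgent) {
        boolean bot = isBot(userAgent);
        List<String> pool = bot ? botHosts : allHosts;
        AtomicInteger counter = bot ? botCounter : userCounter;
        return pool.get(Math.floorMod(counter.getAndIncrement(), pool.size()));
    }
}
```

As in the suggestion above, normal requests still rotate over the whole cluster, so the bot machines keep serving some users; only the bot traffic is confined.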
ryan
On Sep 1, 2008, at 4:14 PM, Tobias Hill wrote:
Maybe I was a bit unclear; let me try in other words.
I didn't have the statistics page in mind. All I care about is that I
don't want a massive number of bot-generated queries to affect the
internal statistics of the caches in Solr. If caching could be switched
off for bot queries, the cache would reflect the human search pattern
much better. This in turn would increase the cache hit rate enormously
for the clients that we care most about (i.e. humans).
Think about it: say that you have 10-20 queries per second coming from
bots exploring the corners of your data (because that is what they do
best) ... wouldn't you consider it a problem that such a result (which
is highly unlikely to get another hit during its lifetime) gets cached,
pushing out other (possibly human-generated) items from the cache in
LRU fashion?
Most other cache solutions I've worked with offer ways to handle things
like this by providing statistically silent ways to get the data from
the cache.
For instance, we are using EHCache for another part of our application
like this:
    Result result = search.isCacheUpdateAllowed()
            ? cache.get(search)
            : cache.getQuietly(search);
Equally, we never put any results emanating from a bot into that
EHCache. And when we did, the hit rate on the cache was much worse than
it is today.
So my question remains: is there an easy way to instruct Solr to handle
my request *quietly*, cache-statistics-wise (*)?
Best regards,
Tobias
(*) i.e. instruct Solr to:
  a1) serve the result from the cache if possible
  a2) ... and if so, never update the statistics of the cache for this "get".
- or -
  b1) serve the results from the index
  b2) ... and if so, never put that result in the cache.
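The semantics requested in (a) and (b) can be sketched as a small LRU cache with a quiet read path. This is only an illustration under stated assumptions: `QuietLruCache`, `lookup` and the `fromBot` flag are hypothetical names, not part of Solr or Ehcache; only the method name `getQuietly` is borrowed from the Ehcache snippet above. Values are assumed to be non-null.

```java
import java.util.LinkedHashMap;
import java.util.function.Function;

// Hypothetical sketch; not how Solr's caches are implemented.
class QuietLruCache<K, V> {
    private final int capacity;
    // Insertion-ordered map; recency is managed by hand so that quiet reads leave it untouched.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>();
    private long hits;
    private long misses;

    QuietLruCache(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Normal read: counts a hit or miss and marks the entry as most recently used.
    synchronized V get(K key) {
        V value = entries.remove(key);
        if (value == null) {
            misses++;
            return null;
        }
        hits++;
        entries.put(key, value); // re-insert at the tail = most recently used
        return value;
    }

    // Quiet read (a1/a2): serves from the cache without touching statistics or recency.
    synchronized V getQuietly(K key) {
        return entries.get(key);
    }

    // Insert or refresh an entry, evicting the least recently used one when over capacity.
    synchronized void put(K key, V value) {
        entries.remove(key);
        entries.put(key, value);
        if (entries.size() > capacity) {
            K eldest = entries.keySet().iterator().next();
            entries.remove(eldest);
        }
    }

    // Bot requests read quietly and never populate the cache (b1/b2); others use it normally.
    V lookup(K key, boolean fromBot, Function<K, V> loadFromIndex) {
        V cached = fromBot ? getQuietly(key) : get(key);
        if (cached != null) {
            return cached;
        }
        V fresh = loadFromIndex.apply(key);
        if (!fromBot) {
            put(key, fresh);
        }
        return fresh;
    }

    synchronized long hits() { return hits; }
    synchronized long misses() { return misses; }
    synchronized int size() { return entries.size(); }
}
```

With this shape, a bot that walks the corners of the search space can still be served cached results, but it neither shifts the hit/miss counters nor pushes human-generated entries out of the cache.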
2008/9/1 Shalin Shekhar Mangar <[EMAIL PROTECTED]>
If you are serving cached queries to the bot, what would be the benefit
of suppressing those queries from figuring into the cache statistics
page?
On Mon, Sep 1, 2008 at 2:46 PM, Tobias Hill <[EMAIL PROTECTED]> wrote:
Hi all,
Is there any way to suppress that a certain query gets added to the
caches (or is allowed to affect cache statistics) in Solr?
*Reason:* We have a very search-oriented website. The SEO aspects
of the site are also important, which is why almost the entire search
space is traversable by indexing bots (Googlebot, for instance). These
bots are a substantial part of the traffic on the site*. Needless to
say, the usage pattern of a bot is very different from that of a human
being ... and in short the bots are filling the caches with
"corner data" from the search space. As a consequence, human-initiated
searches suffer a lot and are far from *as cached as they could be*.
I have no problem with serving a bot a cached page, the only problem
is that the bots are allowed to be part of the cache-statistics.
Is there any way to easily suppress this?
Best regards,
Tobias
*) Actually this is not rare; see the book "Release It!: Design and
Deploy Production-Ready Software" for more details on this reality.
--
Regards,
Shalin Shekhar Mangar.