Re: range queries on string field with millions of values

Chris Hostetter Sat, 29 Nov 2008 15:00:00 -0800

: The results are correct.  But the response time sucks.
: 
: Reading the docs about caches, I thought I could populate the query result
: cache with an autowarming query and the response time would be okay.  But that
: hasn't worked.  (See excerpts from my solrConfig file below.)
: 
: A repeated query is very fast, implying caching happens for a particular
: starting point ("42" above).
: 
: Is there a way to populate the cache with the ENTIRE sorted list of values for
: the field, so any arbitrary starting point will get results from the cache,
: rather than grabbing all results from (x) to the end, then sorting all these
: results, then returning the first 10?


there are two "caches" that come into play for something like this...

the first cache is a low level Lucene cache called the "FieldCache" that 
is completely hidden from you (and for the most part: from Solr).  
anytime you sort on a field, it gets built, and is reused for all sorts on 
that field.  My original concern was that it wasn't getting warmed on 
"newSearcher" (because you have to be explicit about that).

the second cache is the queryResultCache which caches a "window" of an 
ordered list of documents based on a query, and a sort.  you can see this 
cache in your Solr stats, and yes: these two requests result in different 
cache keys for the queryResultCache...

        q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
        q=yourField:[52+TO+*]&sort=yourField+asc&rows=10

...BUT! ... the two queries below will result in the same cache key, and 
the second will be a cache hit, provided a sufficient value for 
the "queryResultWindowSize" ...

        q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
        q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10
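
To make the "sufficient value" condition concrete, here is a rough model of 
the window arithmetic.  It assumes Solr rounds the number of documents a 
request needs (start + rows) up to the next multiple of 
queryResultWindowSize and caches that many leading documents; that rounding 
rule and both function names are assumptions made for illustration, not 
Solr's actual code.

```python
def cached_window_end(start, rows, window_size):
    """Leading documents cached for one request (simplified model).

    Assumption: the cached window is start + rows rounded up to the
    next multiple of window_size.
    """
    needed = start + rows
    # Ceiling division without floats, then scale back up.
    return -(-needed // window_size) * window_size


def is_cache_hit(first, second, window_size):
    """True if `second` is covered by the window cached for `first`.

    Both arguments are (start, rows) pairs for the SAME q and sort,
    since those two values make up the cache key.
    """
    cached = cached_window_end(first[0], first[1], window_size)
    return second[0] + second[1] <= cached
```

Under this model, with a window size of 20 the rows=10 request caches 20 
documents, so the start=10 follow-up is a hit; with a window size of 10 it 
is not.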

so perhaps the key to your problem is to just make sure that once the user 
gives you an id to start with, you "scroll" by increasing the start param 
(not altering the id) ... the first query might be "slow" but every query 
after that should be a cache hit (depending on your page size, and how far 
you expect people to scroll, you should consider increasing 
queryResultWindowSize)
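
A minimal client-side sketch of that scrolling approach: the starting id 
stays fixed in q, and only start moves from page to page.  The function 
names, the field name "yourField", and the base URL in the usage note are 
placeholders, not anything from your setup.

```python
from urllib.parse import urlencode


def page_params(start_id, page, rows=10, field="yourField"):
    """Request parameters for one page of a scroll.

    q and sort never change while scrolling, so every page shares
    one cache key; only `start` advances.
    """
    return {
        "q": f"{field}:[{start_id} TO *]",
        "sort": f"{field} asc",
        "rows": rows,
        "start": page * rows,
    }


def page_url(base_url, start_id, page, rows=10, field="yourField"):
    """Full request URL for one page (base_url is a placeholder)."""
    query = urlencode(page_params(start_id, page, rows, field))
    return f"{base_url}?{query}"
```

For example, page_url("http://localhost:8983/solr/select", "42", 2) asks for 
rows 20 through 29 of the same ordered list that page 0 started.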

But as Yonik said: the new TermsComponent may actually be a better option 
for you -- doing two requests for every page (the first to get the N Terms 
in your id field starting with your input, the second to do a query for 
docs matching any of those N ids) might actually be faster even though 
there likely won't even be any cache hits.
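
A sketch of what those two requests could look like from the client side.  
The parameter names (terms, terms.fl, terms.lower, terms.limit) are the 
TermsComponent ones as I understand them; check them against the version 
you are running.  The helper names and "yourField" are made up, and the 
request handler the terms request goes to depends on your solrconfig, so it 
is left out here.

```python
def terms_params(start_id, n, field="yourField"):
    """Request 1: the next n terms in `field`, starting at start_id."""
    return {
        "terms": "true",
        "terms.fl": field,
        "terms.lower": start_id,
        "terms.limit": n,
    }


def quote_term(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def docs_params(ids, field="yourField"):
    """Request 2: fetch the docs matching any of the ids from request 1.

    Assumption: one document per id, so rows = len(ids) is enough.
    """
    clause = " OR ".join(quote_term(i) for i in ids)
    return {
        "q": f"{field}:({clause})",
        "sort": f"{field} asc",
        "rows": len(ids),
    }
```

To page forward, the last term returned by request 1 becomes the next 
terms.lower value (skipping that term itself on the following page).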


My opinion:  Your use case sounds like a waste of effort.  I can't imagine 
anyone using a library catalog system ever wanting to look up a callnumber, 
and then scroll through all possible books with similar call numbers -- it 
seems much more likely that i'd want to look at other books with similar 
authors, or keywords, or tags ... all things that are actually *easier* to 
do with Solr.  (but then again: i don't work in a library.  i trust that 
you know something i don't about what your users want.)


-Hoss
