: Retrieving 1 million docids from Solr through paging is running into deep
: paging issues, so I wonder if I can use filter queries to fetch all 1
: million docids chunk by chunk. For me the best filter would be score: if
: I can find the maximum score, I can filter out the other docs.
: 
: What is the minimum value of a Solr score? I don't think it can be
: negative, so if it's always above 0, my first chunk would be
: score:[0 TO *] with rows=10000, and my next chunk would go from the max
: score of the first chunk to * with rows=10000. This would ensure that
: while fetching the 1000th chunk, Solr doesn't have to pull all the
: previous docids into memory.

a) given an arbitrary query, there is no min/max score (think about 
function queries: you could write a math-based query that results in 
-100000 being the highest score)
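
For instance, here's a sketch of a function query (any schema; the 
constant is arbitrary) where every matching doc scores exactly -100000, 
so the *highest* score in the result set is negative:

  q={!func}sub(0,100000)&fl=id,score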

b) you could use a frange query on score to partition your docs like 
this: you'd need to start with an unfiltered query, record the docid and 
score for all of "page #1", and then use the score of the last docid on 
page #1 as the min for your filter when asking for "page #2" (still with 
start=0 though) .. but you'd have to manually ignore any docs you'd 
already seen because of duplicate scores.
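
As a sketch, assuming a score-ascending sort (q, rows, and the 42.73 
boundary value are all stand-ins for whatever your data produces):

  (page #1)
  q=your_query&sort=score asc&fl=id,score&start=0&rows=10000

  (page #2: the frange keeps only docs scoring >= 42.73, the last score
   seen on page #1; docs tied at exactly 42.73 must be skipped client-side)
  q=your_query&sort=score asc&fl=id,score&start=0&rows=10000
    &fq={!frange l=42.73}query($q)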

I'm not sure this would really gain you much though -- yes, this would 
work around some of the memory issues inherent in "deep paging", but it 
would still require a lot of rescoring of documents again and again.

If that type of approach works for you, then you'd probably be better off 
using your own ID field as the sort/filter instead of score (since there 
would be no duplicates)
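
For instance, a sketch assuming a sortable string "id" uniqueKey field 
and an illustrative last-seen value of doc_12345:

  (page #1)
  q=your_query&sort=id asc&fl=id&start=0&rows=10000

  (page #2: exclusive lower bound, so no client-side de-duping is needed)
  q=your_query&sort=id asc&fl=id&start=0&rows=10000&fq=id:{doc_12345 TO *}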

Based on your problem description though, it sounds like you don't 
actually care about the scores -- and I don't see anything in your writeup 
that suggests the order actually matters to you -- you just want them 
"all" ... correct?

in that case, have you considered just using "sort=_docid_ asc" ?

that gives you the internal Lucene doc id "sorting", which actually means 
no sorting work is needed, which I *think* means there is no in-memory 
buffering needed for the deep paging situation.
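
Something like this sketch (q, fl, and rows are illustrative), just 
advancing start each time until fewer than rows docs come back:

  q=*:*&fl=id&sort=_docid_ asc&start=0&rows=10000
  q=*:*&fl=id&sort=_docid_ asc&start=10000&rows=10000
  q=*:*&fl=id&sort=_docid_ asc&start=20000&rows=10000
  ...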


-Hoss
