RE: Re: Paging and cursorMark

Vanlerberghe, Luc Wed, 23 Mar 2016 05:26:59 -0700

I worked on something similar a couple of years ago, but didn’t continue work 
on it in the end.


I've included the text of my original mail below.
If you're interested, I could try to find the sources I was working on at the 
time.

Luc

In Solr 4.7 an exciting new feature was added that allows one to page through a 
complete result set without having to worry about missing or double results at 
page boundaries while keeping resource utilization low.

I have a common use case that has similar performance and consistency problems 
that could be solved by extending the way CursorMarks work:

A. The user executes a search and obtains thousands of results of which he sees 
the first 'page'.
   Apart from scrolling through the list he also has a scrollbar (or paging 
controls) to jump to anywhere in the list.
B. The user uses the scrollbar to jump to an arbitrary place in the list.
C. The user scrolls down a bit (but past the current 'page') to find what he's 
looking for.
D. The user realizes he's too far down and scrolls up a bit again (but before 
the current 'page' again...)

(Yes, I know that users should be educated to refine their search, but 
unfortunately, if the client for which the application is developed specifies 
that it should be possible to use it this way...)

For the moment this is implemented by using the start/rows parameters to get 
the appropriate ‘page’ and this has the disadvantages that cursorMark solves:
- Solr (actually I use Lucene directly, but that doesn’t matter here) needs to 
collect *all* documents up to position (start+rows) to be able to return just 
the rows requested. Except for step A (where start==0), this may be a huge 
performance hit.
- If the index is modified concurrently (especially when using NRT), jumping to 
the next/previous page can cause documents to be repeated or skipped at page 
boundaries (as explained in 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results)
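
To make the second point concrete, here is a minimal toy sketch in Python. It 
is not Solr or Lucene code: the integers simply stand in for unique sort 
values, and the names page_by_offset and page_after are made up for this 
illustration.

```python
# Toy model: each integer stands for one document's (unique) sort value.
def page_by_offset(index, start, rows):
    """start/rows paging: sort everything, then cut out the requested slice."""
    return sorted(index)[start:start + rows]


def page_after(index, cursor, rows):
    """Cursor-style paging: the first `rows` values sorting strictly after `cursor`."""
    return [v for v in sorted(index) if v > cursor][:rows]


index = [10, 20, 30, 40, 50, 60]
page1 = page_by_offset(index, 0, 3)             # [10, 20, 30]

index.append(5)  # concurrent update: a new document that sorts before page 1

offset_page2 = page_by_offset(index, 3, 3)      # 30 shows up a second time
cursor_page2 = page_after(index, page1[-1], 3)  # continues right after 30

print(offset_page2)  # [30, 40, 50]
print(cursor_page2)  # [40, 50, 60]
```

With the offset, the document with sort value 30 is shown on both pages; with 
the cursor, page 2 starts right after the last document the user has seen.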

Here's the way an extension to the cursorMark system could solve the problem:
A. Solr/Lucene executes the search and returns the total number of hits and the 
requested number of top documents.
   start=0, rows=n, cursorMark=*
B. start=x, rows=n, cursorMark=*: Here Solr should allow combining both 
start!=0 and cursorMark=*. It should execute a normal request using start=x and 
rows=n and add two cursorMarks: one corresponding to the sort values of the 
first document and one corresponding to the sort values of the last document.
C. Use cursorMark to get the 'next' pages: This is the same way cursorMark 
works for the moment:  the user passes the cursorMark corresponding to the sort 
values of the last document.
D. Use the cursorMark corresponding to the sort values of the first document to 
get the 'previous' pages.

In terms of implementing these changes, I've been looking at the source code 
and already did the easy ones :)
- If a cursorMark is passed (either cursorMark=* or a 'real' value), Solr 
should return two cursorMarks in the result: nextCursorMark as before and 
prevCursorMark corresponding to the sort values of the first document. Done.
- start!=0 and cursorMark=* should no longer be mutually exclusive (but 
start!=0 and cursorMark!=* should). Done.
- When returning a result using a cursorMark, the start value returned should 
correspond to the actual position of the first document in the full result set. 
For the next page, this equals the number of documents skipped during 
processing, but unfortunately I haven't (yet) found a way to pass that 
information along everywhere.  This start value, together with the (possibly 
changed) numFound value, can be used in the GUI to adjust the position of the 
scrollbar or the paging controls accordingly without having to estimate it.
- Implementing reverse paging could actually be easier than it sounds by 
internally reversing the sort order (really reversing, not just reversing 
ASC/DESC!) using the cursor as in the normal case and afterwards reversing the 
obtained list of documents.  I've updated PagingFieldCollector in 
TopFieldCollector.java by negating the values in reverseMul and overriding 
topDocs(start, howMany), but I still have to check every place where partial 
results are merged as well...
- Implement as many test cases for the paging-up case as exist for the 
paging-down case (help! :)
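
The reverse-paging idea can be sketched on a plain sorted list. This is only 
an illustration in Python under simplifying assumptions, not the 
PagingFieldCollector change itself: the keys are (value, id) tuples so that 
the tie-breaker is reversed along with the main sort, and next_page and 
prev_page are hypothetical names.

```python
def next_page(sorted_keys, cursor, rows):
    """Normal case: the first `rows` keys sorting strictly after the cursor."""
    return [k for k in sorted_keys if k > cursor][:rows]


def prev_page(sorted_keys, cursor, rows):
    """Paging up: reverse the whole order, page forward, reverse the result."""
    # 1. Walk the keys in fully reversed order (tie-breakers included).
    reversed_keys = sorted_keys[::-1]
    # 2. 'After the cursor' in reversed order means 'before' in the original.
    hits = [k for k in reversed_keys if k < cursor][:rows]
    # 3. Put the collected page back into the original sort order.
    return hits[::-1]


keys = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (4, "f")]

print(next_page(keys, (1, "b"), 2))  # [(2, 'c'), (3, 'd')]
print(prev_page(keys, (3, "d"), 2))  # [(1, 'b'), (2, 'c')]
```

Paging up from (3, "d") yields the two documents immediately before it, in the 
same order the user would see when paging down.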

While working on the code, I thought of another use case as well: refreshing 
the current page:
Instead of passing the same start value again, the prevCursorMark could be 
passed, but with a hint that the documents on or after this cursorMark should 
be returned.

Which brings me to the question of how to specify the new behavior to Solr 
without affecting the current behavior.

I propose that prevCursorMark and nextCursorMark simply encode the sort values 
for the first and last document (as nextCursorMark does now) and that a simple 
prefix is used when cursorMark should be used differently:
">": documents after the cursor position: use with nextCursorMark to get the 
next page of results
">=": documents after or on the cursor position: use with prevCursorMark to 
refresh the same page keeping the same sort position for the first document
"<": documents before the cursor position: use with prevCursorMark to get the 
previous page of results
"<=": documents before or on the cursor position: use with nextCursorMark to 
get the same page keeping the same sort position for the last document (for 
completeness, useful?)

So if prevCursorMark was "ABC" and nextCursorMark was "DEF",
- "<ABC" would return the previous page
- ">DEF" or "DEF" would return the next page
- ">=ABC" would return the same page (but with 'fresh' values/documents), 
keeping the 'visual' position the same
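
As a rough sketch of how the prefixes could be interpreted (Python, toy data; 
parse_cursor and select_page are hypothetical names, not part of Solr). It 
assumes the encoded mark itself never starts with "<" or ">", and it uses 
plain strings as stand-ins for the encoded sort values:

```python
def parse_cursor(param):
    """Split a cursorMark parameter into (operator, mark)."""
    for prefix in (">=", "<=", ">", "<"):  # two-character prefixes first
        if param.startswith(prefix):
            return prefix, param[len(prefix):]
    return ">", param  # no prefix: behave like cursorMark does today


def select_page(sorted_keys, param, rows):
    """Return the page of `rows` keys selected by a prefixed cursorMark."""
    op, mark = parse_cursor(param)
    if op == ">":
        return [k for k in sorted_keys if k > mark][:rows]
    if op == ">=":
        return [k for k in sorted_keys if k >= mark][:rows]
    # "<" and "<=": keep the last `rows` keys before (or on) the mark.
    if op == "<":
        hits = [k for k in sorted_keys if k < mark]
    else:
        hits = [k for k in sorted_keys if k <= mark]
    return hits[max(0, len(hits) - rows):]


keys = ["AAA", "ABC", "BBB", "DEF", "EEE", "FFF"]  # current page: ABC..DEF

print(select_page(keys, "<ABC", 3))   # ['AAA']
print(select_page(keys, ">DEF", 3))   # ['EEE', 'FFF']
print(select_page(keys, ">=ABC", 3))  # ['ABC', 'BBB', 'DEF']
```

A bare "DEF" gives the same result as ">DEF", so existing clients would not 
have to change.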

I'd appreciate any comments on this, and I'd like to hear if anyone else has 
already started work on similar changes.
In the meantime I'll continue working on what I have and check how I can make 
my changes available (through a patch attached to a new issue in Jira?)

Luc Vanlerberghe





-----Original Message-----
From: Steve Rowe [mailto:sar...@gmail.com] 
Sent: Tuesday, 22 March 2016 16:37
To: solr-user@lucene.apache.org
Subject: [Possibly spoofed] Re: Paging and cursorMark

Hi Tom,

There is an outstanding JIRA issue to directly support what you want (with a 
patch even!) but no work on it recently: 
<https://issues.apache.org/jira/browse/SOLR-6635>.  If you’re so inclined, 
please pitch in: bring the patch up-to-date, test it, contribute improvements, 
etc.

--
Steve
www.lucidworks.com

> On Mar 22, 2016, at 10:27 AM, Tom Evans <tevans...@googlemail.com> wrote:
> 
> Hi all
> 
> With Solr 5.5.0, we're trying to improve our paging performance. When
> we are delivering results using infinite scrolling, cursorMark is
> perfectly fine - one page is followed by the next. However, we also
> offer traditional paging of results, and this is where it gets a
> little tricky.
> 
> Say we have 10 results per page, and a user wants to jump from page 1
> to page 20, and then wants to view page 21, there doesn't seem to be a
> simple way to get the nextCursorMark. We can make an inefficient
> request for page 20 (start=190, rows=10), but we cannot give that
> request a cursorMark=* as it contains start=190.
> 
> Consequently, if the user clicks to page 21, we have to continue along
> using start=200, as we have no cursorMark. The only way I can see to
> get a cursorMark at that point is to omit the start=200, and instead
> say rows=210, and ignore the first 200 results on the client side.
> Obviously, this gets more and more inefficient the deeper we page - I
> know that internally to Solr, using start=200&rows=10 has to do the
> same work as rows=210, but less data is sent over the wire to the
> client.
> 
> As I understand it, the cursorMark is a hash of the sort values of the
> last document returned, so I don't really see why it is forbidden to
> specify start=190&rows=10&cursorMark=* - why is it not possible to
> calculate the nextCursorMark from the last document returned?
> 
> I was also thinking a possible temporary workaround would be to
> request start=190&rows=10, note the last document returned, and then
> make a subsequent query for q=id:"<last doc id>"&rows=1&cursorMark=*.
> This seems to work, but means an extra Solr query for no real reason.
> Is there any other problem to doing this?
> 
> Is there some other simple trick I am missing that we can use to get
> both the page of results we want and a nextCursorMark for the
> subsequent page?
> 
> Cheers
> 
> Tom
