On 3/13/2018 10:26 AM, Sebastian Riemer wrote:
> However, now we want to introduce a similar navigation in our detail views, 
> where only ever one document is displayed. Again, the navigation bar looks 
> like this:
>
> << First   < Prev   1 - 15 of 62181   Next >   Last >>
>
> But now, Prev / Next shall open up the previous / next _document_ instead of 
> the next page. The same goes for First and Last, it shall open the first / 
> last _document_ not the page.
>
> Our first approach to this was to simply add the param "fl=id" so we only get 
> the IDs of documents and set page size to ALL (i.e. no restriction on param 
> "rows"). That way, it was easy to extract the current document id from the 
> result list, and check which id was preceding and succeeding the current id, 
> as well as getting the very first id and the very last id, in order to render 
> the navigation bar.
>
> This led to Solr being under heavy load, since it must load 62181 documents 
> (in this example) in order to return the ids. I somehow thought this would be 
> easy for Solr to do, but it isn't.

This will indeed be very slow, and that is with only 62181 documents in
your result set, a result count that Solr normally handles with ease.
For a search that has 100 million results, this approach is
*impossible*.  I do have searches like this on my index, and my index is
not all that big compared to some of the indexes that the community has
built.

> Our second approach was, to simply keep the same value for params "start" and 
> "rows" since the user is always selecting a document from the list - thus the 
> selected document already is within the page. However, the edge cases are, 
> the selected document is the very first on the page or the very last one, 
> thus the previous or next document id is not within the page result from solr 
> -> I guess this we could handle by simply checking and sending a second query 
> where the param "start" would be adjusted accordingly.

Detail pages often include information that you do not want to store in
Solr.  A well-tuned Solr install will have responses that contain
everything that the application needs to build a search result grid, but
for really detailed information, the application should probably be
using the id information received from Solr to go to the main data
repository and retrieve full details.

Additionally, you should not allow the user to navigate to the last page
or to navigate to the last document, or even a page/document anywhere
near the end of the resultset.  The reason for this is that really high
start values are a serious performance killer.  61K is definitely a
start value high enough to see performance drops.  If the user tries to
page too deeply into results, your application should simply refuse to
go any further.  For comparison purposes -- the last time I checked how
deeply Google would let me go into a search result, I could get to page
39, but no further.  The number of results for my search was MILLIONS,
but Google wouldn't let me view them all.  The performance issues for
deep paging are universal for search engines, especially when it is
possible to jump to an arbitrary page number.

I recommend limiting how many results a user can page through to about
5000 or 10000.  If there are 50 results per page, this allows them to
get to at least page 100.  In general, most users of search engines will
never go deeper than about page 3.  There are some kinds of applications
where a typical user might visit the first few dozen pages ... but
anything deeper is NOT common.  If you have an atypical user, they are
probably prepared for large page numbers to take a lot longer to load. 
The main reason you should be limiting how deep users can go is that
when one user is going thousands of documents into a result set,
performance of the other queries on the system CAN drop dramatically.
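
As a rough illustration of such a limit, here is a minimal Python sketch.
The names MAX_DEPTH and paging_params are hypothetical, and the 5000 cap
and 50 rows per page are just the example figures from above.  It only
computes the "start" and "rows" values and refuses anything past the cap;
sending the request to Solr is left to the application.

```python
# Hypothetical cap on how many results a user may page through.
MAX_DEPTH = 5000


def paging_params(page, rows=50, max_depth=MAX_DEPTH):
    """Return Solr start/rows for a 1-based page number.

    Returns None when the requested page would reach past max_depth,
    so the caller can refuse to go any further.
    """
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    start = (page - 1) * rows
    # Refuse any page whose last row would fall beyond the cap.
    if start + rows > max_depth:
        return None
    return {"start": start, "rows": rows}
```

With these example values, page 100 is the deepest page that is still
served, and page 101 is refused.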

> However I would not know how to retrieve the id of the very first document 
> and the very last document (except for executing separate queries with I 
> guess start=0, rows=1 and start=62181 and rows=1)

When you display a page of results, your application already has the N
document IDs that Solr returned for that page.  Using that information,
you can navigate through the documents one at a time. 
Then if you reach the end of what you have on that page, you can issue
another query for the next page or the previous page.  If you are
restricting how deep a user can go, the performance of this approach
should be pretty good.
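
The idea above could be sketched roughly like this in Python.  The
function name neighbors and the ("fetch", offset) marker are hypothetical;
the marker just tells the application that the neighbor sits on an
adjacent page, so it needs one more query with an adjusted "start".
The sketch assumes the current id is present in the page and that
numFound does not change between requests.

```python
def neighbors(page_ids, start, num_found, current_id):
    """Find the previous and next document for current_id.

    page_ids   -- ids from the current page, in result order
    start      -- the Solr "start" value used to fetch that page
    num_found  -- total number of hits reported by Solr (numFound)

    Returns (prev, next).  Each element is an id, None at either end
    of the result set, or ("fetch", offset) when the neighbor lies on
    an adjacent page and must be requested separately.
    """
    i = page_ids.index(current_id)
    pos = start + i  # zero-based position within the full result set

    if pos == 0:
        prev = None                      # very first document
    elif i > 0:
        prev = page_ids[i - 1]           # still on this page
    else:
        prev = ("fetch", pos - 1)        # last document of previous page

    if pos == num_found - 1:
        nxt = None                       # very last document
    elif i < len(page_ids) - 1:
        nxt = page_ids[i + 1]            # still on this page
    else:
        nxt = ("fetch", pos + 1)         # first document of next page
    return prev, nxt
```

Note that "start" is zero-based, so the last document of a result set
with numFound hits is at offset numFound - 1.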

> For any query and a documentId (of which it is known it is within the query 
> result), what is a simple and efficient enough way, to get the following 
> navigational information:
>
> -          Previous document Id
>
> -          Next document id
>
> -          First document id
>
> -          Last document id

Having this information available is nearly impossible.

The values for each document will depend on the sort you use.  Change
the sort, and all the values will be wrong.  And if you delete documents
or add documents, those values will likely change, and the values for an
individual document could change several times per second.  Solr cannot
automatically provide this information, and it is pretty much impossible
to have accurate and up to date information if you calculate it at index
time and add it yourself.

Side note:  When sorting by relevance score, which is the default sort
order, changing the query also changes the sort.

----

Note that there *is* a Solr solution for the performance problems of
deep paging ... but cursorMark (the name of the feature) does not
support jumping directly to an arbitrary page number.  If you want page
25000 when using cursorMark, you have to retrieve the first 24999 pages
before you will have the cursor value for page 25000.  But once you HAVE
that value, retrieving page 25000 will be just as fast as page 1, which
is definitely not the case when using start/rows to get pages.

https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
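
For illustration, a minimal Python sketch of walking a result set with
cursorMark.  The fetch argument is a hypothetical callable supplied by
the application: it takes a dict of request parameters, sends them to
Solr, and returns the parsed JSON response.  The sketch assumes the
uniqueKey field is named "id"; the sort has to include the uniqueKey
field as a tiebreaker, and "start" must be left out (or be 0) when
cursorMark is used.

```python
def iter_pages(fetch, query, sort="score desc, id asc", rows=50):
    """Yield successive pages of documents using cursorMark.

    fetch -- hypothetical callable: dict of Solr params -> response dict
    sort  -- must include the uniqueKey field (assumed to be "id")
    """
    cursor = "*"  # the initial cursor value for the first request
    while True:
        params = {
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor,
        }
        data = fetch(params)
        docs = data["response"]["docs"]
        if docs:
            yield docs
        next_cursor = data["nextCursorMark"]
        # When Solr returns the same cursor that was sent, the end of
        # the result set has been reached.
        if next_cursor == cursor:
            break
        cursor = next_cursor
```

Each page can only be requested after the previous one has been
received, which is why jumping straight to an arbitrary page is not
possible with this approach.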

Newer versions of Solr also have things like the export handler and
streaming expressions, which are designed to provide REALLY large result
sets without putting major load on the server.  Very large result sets
do still take a lot of TIME, so they're only usable for offline
activities like research and data mining, not live usage in an
application.  But they won't kill the server when they are used.  I do
not know how to use these features, but information is available in the
Solr Reference Guide.

Thanks,
Shawn
