My suspicion is that it won't work in parallel, but we've only just asked
the ops team to start our upgrade so that we can look into it, and I don't
have a server to test on yet. The bug identified in SOLR-5875 has put them
off though :(

If things pan out as I think they will, I suspect we are going to end up
with two implementations here: one for our GUI applications that uses
traditional paging and is capped at some arbitrarily low limit (say 1000
records, like Google does), and another for our API users who harvest full
datasets, which will use cursor marks and support only serial harvests that
cannot skip content.
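
For the serial API harvest, a minimal SolrJ sketch of a cursorMark loop
might look like the below (the ZooKeeper host, collection name and class
name are illustrative placeholders, not details from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorHarvest {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181"); // placeholder zkHost
        server.setDefaultCollection("collection1");                   // placeholder collection

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(1000);
        // cursorMark requires a sort that includes the uniqueKey field
        query.setSort(SolrQuery.SortClause.asc("id"));

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                // process each harvested document here
            }
            String next = rsp.getNextCursorMark();
            // when the cursor stops advancing, the full result set has been read
            done = cursorMark.equals(next);
            cursorMark = next;
        }
        server.shutdown();
    }
}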

Ta,
Greg



On 18 March 2014 09:44, Mike Hugo <m...@piragua.com> wrote:

> Cursor mark definitely seems like the way to go.  If I can get it to work
> in parallel then that's an additional bonus.
>
>
> On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury <greg.pendleb...@gmail.com> wrote:
>
> > Shouldn't all deep pagination against a cluster use the new cursor mark
> > feature instead of 'start' and 'rows'?
> >
> > 4 or 5 requests still seems a very low limit to be running into OOM
> > issues though, so perhaps it is both issues combined?
> >
> > Ta,
> > Greg
> >
> >
> >
> > On 18 March 2014 07:49, Mike Hugo <m...@piragua.com> wrote:
> >
> > > Thanks!
> > >
> > >
> > > On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe <sar...@gmail.com> wrote:
> > >
> > > > Mike,
> > > >
> > > > Days.  I plan on making a 4.7.1 release candidate a week from today,
> > > > and assuming nobody finds any problems with the RC, it will be
> > > > released roughly four days thereafter (three days for voting + one
> > > > day for release propagation to the Apache mirrors): i.e., next
> > > > Friday-ish.
> > > >
> > > > Steve
> > > >
> > > > On Mar 17, 2014, at 4:40 PM, Mike Hugo <m...@piragua.com> wrote:
> > > >
> > > > > Thanks Steve,
> > > > >
> > > > > That certainly looks like it could be the culprit.  Any word on a
> > > > > release date for 4.7.1?  Days?  Weeks?  Months?
> > > > >
> > > > > Mike
> > > > >
> > > > >
> > > > > On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe <sar...@gmail.com> wrote:
> > > > >
> > > > >> Hi Mike,
> > > > >>
> > > > >> The OOM you're seeing is likely a result of the bug described in
> > > > >> (and fixed by a commit under) SOLR-5875:
> > > > >> <https://issues.apache.org/jira/browse/SOLR-5875>.
> > > > >>
> > > > >> If you can build from source, it would be great if you could
> > > > >> confirm the fix addresses the issue you're facing.
> > > > >>
> > > > >> This fix will be part of a to-be-released Solr 4.7.1.
> > > > >>
> > > > >> Steve
> > > > >>
> > > > >> On Mar 17, 2014, at 4:14 PM, Mike Hugo <m...@piragua.com> wrote:
> > > > >>
> > > > >>> Hello,
> > > > >>>
> > > > >>> We recently upgraded to Solr Cloud 4.7 (went from a single-node
> > > > >>> Solr 4.0 instance to a 3-node Solr 4.7 cluster).
> > > > >>>
> > > > >>> Part of our application does an automated traversal of all
> > > > >>> documents that match a specific query.  It does this by iterating
> > > > >>> through results by setting the start and rows parameters, starting
> > > > >>> with start=0 and rows=1000, then start=1000, rows=1000, then
> > > > >>> start=2000, rows=1000, etc.
> > > > >>>
> > > > >>> We do this in parallel fashion with multiple workers on multiple
> > > > >>> nodes.  It's easy to chunk up the work to be done by figuring out
> > > > >>> how many total results there are and then creating 'chunks'
> > > > >>> (0-1000, 1000-2000, 2000-3000) and sending each chunk to a worker
> > > > >>> in a pool of multi-threaded workers.
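
A rough, hypothetical sketch of the chunked fan-out described above (the
pool size and the PageFetcher callback are invented for illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedTraversal {
    // Callback that fetches one window of results, e.g. start=2000&rows=1000
    public interface PageFetcher {
        void fetch(long start, int rows);
    }

    public static void traverse(long totalHits, final int pageSize,
                                final PageFetcher fetcher) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (long start = 0; start < totalHits; start += pageSize) {
            final long chunkStart = start;
            pool.submit(new Runnable() {
                public void run() {
                    fetcher.fetch(chunkStart, pageSize); // one start/rows page per task
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}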
> > > > >>>
> > > > >>> This worked well for us with a single server.  However, upon
> > > > >>> upgrading to Solr Cloud, we've found that this quickly (within the
> > > > >>> first 4 or 5 requests) causes an OutOfMemory error on the
> > > > >>> coordinating node that receives the query.  I don't fully
> > > > >>> understand what's going on here, but it looks like the
> > > > >>> coordinating node receives the query and sends it to the shard
> > > > >>> requested.  For example, given:
> > > > >>>
> > > > >>> shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
> > > > >>>
> > > > >>> The coordinating node sends this query to shard3:
> > > > >>>
> > > > >>> NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
> > > > >>>
> > > > >>> Notice the rows parameter is 5000 (start + rows).  If the
> > > > >>> coordinator node is able to process the result set (which works
> > > > >>> for the first few pages; after that it will quickly run out of
> > > > >>> memory), it eventually issues this request back to shard3:
> > > > >>>
> > > > >>> NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
> > > > >>>
> > > > >>> and then finally returns the response to the client.
> > > > >>>
> > > > >>> One possible workaround: we've found that if we issue
> > > > >>> non-distributed requests to specific shards, we get performance
> > > > >>> along the same lines that we did before.  E.g. issue a query with
> > > > >>> shards=shard3&distrib=false directly to the URL of the shard3
> > > > >>> instance, rather than going through the CloudSolrServer SolrJ API.
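
A hypothetical SolrJ fragment of that per-shard workaround (the shard URL
and core path are placeholders, not taken from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardDirectQuery {
    public static QueryResponse fetchPage(int start, int rows) throws Exception {
        // Talk to one shard's core directly instead of going through CloudSolrServer
        HttpSolrServer shard3 = new HttpSolrServer("http://shard3-host:8080/solr/collection1");
        try {
            SolrQuery q = new SolrQuery("*:*");
            q.setSort(SolrQuery.SortClause.asc("id"));
            q.setStart(start);
            q.setRows(rows);
            q.set("distrib", "false"); // keep the query on this shard only
            return shard3.query(q);
        } finally {
            shard3.shutdown();
        }
    }
}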
> > > > >>>
> > > > >>> The other workaround is to adapt to use the new cursorMark
> > > > >>> functionality.  I've manually tried a few requests and it is
> > > > >>> pretty efficient, and doesn't result in the OOM errors on the
> > > > >>> coordinating node.  However, I've only done this in a
> > > > >>> single-threaded manner.  I'm wondering if there would be a way to
> > > > >>> get cursor marks for an entire result set at a given page
> > > > >>> interval, so that they could then be fed to the pool of parallel
> > > > >>> workers to get the results in parallel rather than single
> > > > >>> threaded.  Is there a way to do this so we could process the
> > > > >>> results in parallel?
> > > > >>>
> > > > >>> Any other possible solutions?  Thanks in advance.
> > > > >>>
> > > > >>> Mike
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>
