solrconfig.xml has entries you can tweak for your use case. One of them is
queryResultWindowSize. You could try a value of 2000 and see if it helps
improve performance. Please make sure you have enough memory allocated for
the queryResultCache.
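
For reference (sizes here are illustrative, not recommendations - tune them
to your heap and result set sizes), the relevant entries in solrconfig.xml
look something like:

  <queryResultWindowSize>2000</queryResultWindowSize>
  <queryResultCache class="solr.LRUCache" size="512"
                    initialSize="512" autowarmCount="0"/>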
A combination of sharding and distributing the workload (requesting
2000/number-of-shards rows per shard) with an aggregator would be a good way
to maximize performance.
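
As a rough sketch of that idea (hostnames and the 3-shard split are
illustrative), the aggregator would query each shard directly with
distrib=false, ask for 2000/3 (~667) rows, and merge the lists by score on
the client:

  http://shard1:8983/solr/prodinfo/select?q=allText:ipad&rows=667&fl=id,score&distrib=false
  http://shard2:8983/solr/prodinfo/select?q=allText:ipad&rows=667&fl=id,score&distrib=false
  http://shard3:8983/solr/prodinfo/select?q=allText:ipad&rows=667&fl=id,score&distrib=false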

Thanks,

Jagdish


On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> 50M documents, depending on a bunch of things,
> may not be unreasonable for a single node; only
> testing will tell.
>
> But the question I have is whether you should be
> using standard Solr queries for this or building a custom
> component that goes at the base Lucene index
> and "does the right thing". Or even re-indexing your
> entire corpus periodically to add this kind of data.
>
> FWIW,
> Erick
>
>
> On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
> > Thanks Erick/Peter.
> >
> > This is an offline process, used by a relevancy engine implemented around
> > Solr. The engine computes boost scores for related keywords based on
> > clickstream data.
> > E.g., say the clickstream has: ipad=upc1,upc2,upc3
> > I query Solr with the keyword "ipad" (to get 2000 documents) and then make
> > 3 individual queries for upc1, upc2, upc3 (which are fast).
> > The data is then used to compute related keywords to "ipad" with their
> > boost values.
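> >
> > Illustratively (the hostname and the upc field name are made up), the
> > flow is one broad query followed by a few cheap lookups:
> >
> >   http://host:8983/solr/prodinfo/select?q=allText:ipad&rows=2000&wt=json
> >   http://host:8983/solr/prodinfo/select?q=upc:upc1&wt=json
> >   ...and likewise for upc2 and upc3.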
> >
> > So, I cannot really replace that, since I need full text search over my
> > dataset to retrieve top 2000 documents.
> >
> > I tried paging: I retrieve 500 Solr documents 4 times (0-500,
> > 500-1000, ...), but don't see any improvement.
> >
> >
> > Some questions:
> > 1. Maybe the JVM heap size might help?
> > This is what I see in the dashboard:
> > Physical Memory 76.2%
> > Swap Space NaN% (don't have any swap space, running on AWS EBS)
> > File Descriptor Count 4.7%
> > JVM-Memory 73.8%
> >
> > Screenshot: http://i.imgur.com/aegKzP6.png
> >
> > 2. Will reducing the shards from 3 to 1 improve performance? (maybe
> > increase the RAM from 30 to 60GB) The problem I will face in that case
> > will be fitting 50M documents on 1 machine.
> >
> > Thanks,
> > -Utkarsh
> >
> >
> > On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
> >
> > > Hello Utkarsh,
> > > This may or may not be relevant for your use-case, but the way we deal
> > > with this scenario is to retrieve the top N documents 5, 10, 20, or 100
> > > at a time (user selectable). We can then page the results, changing the
> > > start parameter to return the next set. This allows us to 'retrieve'
> > > millions of documents - we just do it at the user's leisure, rather than
> > > make them wait for the whole lot in one go.
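> > > For example (hostname and field are illustrative), walking the same
> > > result set 100 documents at a time looks like:
> > >   .../select?q=allText:ipad&start=0&rows=100
> > >   .../select?q=allText:ipad&start=100&rows=100
> > >   .../select?q=allText:ipad&start=200&rows=100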
> > > This works well because users very rarely want to see ALL 2000 (or
> > > whatever number) documents at once - it's simply too much to take in at
> > > one time.
> > > If your use-case involves an automated or offline procedure (e.g.
> > > running a report or some data-mining op), then presumably it doesn't
> > > matter so much if it takes a bit longer (as long as it returns in some
> > > reasonable time).
> > > Have you looked at doing paging on the client side? This will hugely
> > > speed up your search time.
> > > HTH
> > > Peter
> > >
> > >
> > >
> > > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Well, depending on how many docs get served
> > > > from the cache the time will vary. But this is
> > > > just ugly, if you can avoid this use-case it would
> > > > be a Good Thing.
> > > >
> > > > Problem here is that each and every shard must
> > > > assemble the list of 2,000 documents (just ID and
> > > > sort criteria, usually score).
> > > >
> > > > Then the node serving the original request merges
> > > > the sub-lists to pick the top 2,000. Then the node
> > > > sends another request to each shard to get
> > > > the full document. Then the node merges this
> > > > into the full list to return to the user.
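> > > >
> > > > Very roughly (URLs illustrative), the two phases per shard look like:
> > > >   phase 1: .../select?q=...&rows=2000&fl=id,score&distrib=false
> > > >   phase 2: fetch the full documents for only the ids that survived
> > > >            the merge.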
> > > >
> > > > Solr really isn't built for this use-case - is it
> > > > actually a compelling situation?
> > > >
> > > > And having your document cache set at 1M is kinda
> > > > high if you have very big documents.
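> > > >
> > > > Something far smaller is usually plenty; purely as an example (not a
> > > > recommendation):
> > > >   <documentCache class="solr.LRUCache" size="16384"
> > > >                  initialSize="16384" autowarmCount="0"/>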
> > > >
> > > > FWIW,
> > > > Erick
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > > >
> > > > > Also, I don't see consistent response times from Solr. I ran ab
> > > > > again and I get this:
> > > > >
> > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > "
> > > > >
> > > > >
> > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > Completed 100 requests
> > > > > Completed 200 requests
> > > > > Completed 300 requests
> > > > > Completed 400 requests
> > > > > Completed 500 requests
> > > > > Finished 500 requests
> > > > >
> > > > >
> > > > > Server Software:
> > > > > Server Hostname:       x.amazonaws.com
> > > > > Server Port:            8983
> > > > >
> > > > > Document Path:
> > > > > /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > Document Length:        1538537 bytes
> > > > >
> > > > > Concurrency Level:      10
> > > > > Time taken for tests:   10.858 seconds
> > > > > Complete requests:      500
> > > > > Failed requests:        8
> > > > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > > > Write errors:           0
> > > > > Total transferred:      769297992 bytes
> > > > > HTML transferred:       769268492 bytes
> > > > > Requests per second:    46.05 [#/sec] (mean)
> > > > > Time per request:       217.167 [ms] (mean)
> > > > > Time per request:       21.717 [ms] (mean, across all concurrent requests)
> > > > > Transfer rate:          69187.90 [Kbytes/sec] received
> > > > >
> > > > > Connection Times (ms)
> > > > >               min  mean[+/-sd] median   max
> > > > > Connect:        0    0   0.3      0       2
> > > > > Processing:   110  215  72.0    190     497
> > > > > Waiting:       91  180  70.5    152     473
> > > > > Total:        112  216  72.0    191     497
> > > > >
> > > > > Percentage of the requests served within a certain time (ms)
> > > > >   50%    191
> > > > >   66%    225
> > > > >   75%    252
> > > > >   80%    272
> > > > >   90%    319
> > > > >   95%    364
> > > > >   98%    420
> > > > >   99%    453
> > > > >  100%    497 (longest request)
> > > > >
> > > > >
> > > > > Sometimes it takes a lot of time, sometimes it's pretty quick.
> > > > >
> > > > > Thanks,
> > > > > -Utkarsh
> > > > >
> > > > >
> > > > > On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I have a use case where I need to retrieve the top 2000 documents
> > > > > > matching a query.
> > > > > > What are the parameters (in query, solrconfig, schema) I should
> > > > > > look at to improve this?
> > > > > >
> > > > > > I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3
> > > > > > shards, 30GB RAM, 8 vCPUs and a 7GB JVM heap.
> > > > > >
> > > > > > I have this documentCache:
> > > > > >   <documentCache class="solr.LRUCache" size="1000000"
> > > > > >                  initialSize="1000000" autowarmCount="0"/>
> > > > > >
> > > > > > allText is a copyField.
> > > > > >
> > > > > > This is the result I get:
> > > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > > "
> > > > > >
> > > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > > Completed 100 requests
> > > > > > Completed 200 requests
> > > > > > Completed 300 requests
> > > > > > Completed 400 requests
> > > > > > Completed 500 requests
> > > > > > Finished 500 requests
> > > > > >
> > > > > >
> > > > > > Server Software:
> > > > > > Server Hostname:        x.amazonaws.com
> > > > > > Server Port:            8983
> > > > > >
> > > > > > Document Path:
> > > > > > /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > > Document Length:        1538537 bytes
> > > > > >
> > > > > > Concurrency Level:      10
> > > > > > Time taken for tests:   35.999 seconds
> > > > > > Complete requests:      500
> > > > > > Failed requests:        21
> > > > > >    (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
> > > > > > Write errors:           0
> > > > > > Non-2xx responses:      2
> > > > > > Total transferred:      766221660 bytes
> > > > > > HTML transferred:       766191806 bytes
> > > > > > Requests per second:    13.89 [#/sec] (mean)
> > > > > > Time per request:       719.981 [ms] (mean)
> > > > > > Time per request:       71.998 [ms] (mean, across all concurrent requests)
> > > > > > Transfer rate:          20785.65 [Kbytes/sec] received
> > > > > >
> > > > > > Connection Times (ms)
> > > > > >               min  mean[+/-sd] median   max
> > > > > > Connect:        0    0   0.6      0       8
> > > > > > Processing:     9  717 2339.6    199   12611
> > > > > > Waiting:        9  635 2233.6    164   12580
> > > > > > Total:          9  718 2339.6    199   12611
> > > > > >
> > > > > > Percentage of the requests served within a certain time (ms)
> > > > > >   50%    199
> > > > > >   66%    236
> > > > > >   75%    263
> > > > > >   80%    281
> > > > > >   90%    548
> > > > > >   95%    838
> > > > > >   98%  12475
> > > > > >   99%  12545
> > > > > >  100%  12611 (longest request)
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > -Utkarsh
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > -Utkarsh
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > -Utkarsh
> >
>



-- 
Jagdish Nomula
Sr. Manager Search
Simply Hired, Inc.
370 San Aleso Ave., Ste 200
Sunnyvale, CA 94085

office - 408.400.4700
cell - 408.431.2916
email - jagd...@simplyhired.com

www.simplyhired.com
