50M documents, depending on a bunch of things,
may not be unreasonable for a single node; only
testing will tell.

But the question I have is whether you should be
using standard Solr queries for this or building a custom
component that goes at the base Lucene index
and "does the right thing". Or even re-indexing your
entire corpus periodically to add this kind of data.
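
(To make the second option a bit more concrete - this is only a rough,
untested sketch of reading the Lucene index under a Solr core directly;
the path, field names and query are purely illustrative:

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class RawLuceneLookup {
        public static void main(String[] args) throws Exception {
            // Open the core's index directory directly (illustrative path).
            DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(new File("/var/solr/prodinfo/data/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            // Single-term query against the indexed allText field.
            TopDocs top = searcher.search(
                new TermQuery(new Term("allText", "ipad")), 2000);

            for (ScoreDoc sd : top.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                // Pull only the stored fields the offline job actually needs.
                String upc = doc.get("upc");
                // ... feed into whatever "does the right thing" ...
            }
            reader.close();
        }
    }

A real custom SearchComponent would do something similar from inside
Solr, where it already has the SolrIndexSearcher handy.)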

FWIW,
Erick


On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Thanks Erick/Peter.
>
> This is an offline process, used by a relevancy engine implemented around
> Solr. The engine computes boost scores for related keywords based on
> clickstream data.
> i.e.: say the clickstream has: ipad=upc1,upc2,upc3
> I query Solr with the keyword "ipad" (to get 2000 documents) and then make 3
> individual queries for upc1,upc2,upc3 (which are fast).
> The data is then used to compute related keywords to "ipad" with their
> boost values.
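>
> (Roughly what that flow looks like in SolrJ - the collection name,
> field names and UPC values below are just illustrative, not the real
> code:
>
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.impl.HttpSolrServer;
>     import org.apache.solr.common.SolrDocumentList;
>
>     public class RelatedKeywordLookup {
>         public static void main(String[] args) throws Exception {
>             HttpSolrServer solr =
>                 new HttpSolrServer("http://localhost:8983/solr/prodinfo");
>
>             // Step 1: full-text query for the keyword, top 2000 docs.
>             SolrQuery keywordQuery = new SolrQuery("allText:ipad");
>             keywordQuery.setRows(2000);
>             SolrDocumentList keywordDocs =
>                 solr.query(keywordQuery).getResults();
>
>             // Step 2: one small lookup per UPC from the clickstream (fast).
>             for (String upc : new String[] {"upc1", "upc2", "upc3"}) {
>                 SolrQuery upcQuery = new SolrQuery("upc:" + upc);
>                 upcQuery.setRows(1);
>                 SolrDocumentList upcDocs = solr.query(upcQuery).getResults();
>                 // ... compare against keywordDocs to compute boost values ...
>             }
>         }
>     }
> )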
>
> So, I cannot really replace that, since I need full-text search over my
> dataset to retrieve the top 2000 documents.
>
> I tried paging: I retrieved 500 Solr documents 4 times (0-500, 500-1000, ...),
> but don't see any improvement.
>
>
> Some questions:
> 1. Maybe adjusting the JVM heap size would help?
> This is what I see in the dashboard:
> Physical Memory 76.2%
> Swap Space NaN% (don't have any swap space, running on AWS EBS)
> File Descriptor Count 4.7%
> JVM-Memory 73.8%
>
> Screenshot: http://i.imgur.com/aegKzP6.png
>
> 2. Will reducing the shards from 3 to 1 (and maybe increasing the RAM from
> 30 to 60GB) improve performance? The problem I will face in that case is
> fitting 50M documents on one machine.
>
> Thanks,
> -Utkarsh
>
>
> On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
>
> > Hello Utkarsh,
> > This may or may not be relevant for your use-case, but the way we deal
> > with this scenario is to retrieve the top N documents 5, 10, 20 or 100
> > at a time (user selectable). We can then page the results, changing the
> > start parameter to return the next set. This allows us to 'retrieve'
> > millions of documents - we just do it at the user's leisure, rather than
> > making them wait for the whole lot in one go.
> > This works well because users very rarely want to see ALL 2000 (or
> > whatever number) documents at one time - it's simply too much to take
> > in at once.
> > If your use-case involves an automated or offline procedure (e.g. running
> > a report or some data-mining op), then presumably it doesn't matter so
> > much if it takes a bit longer (as long as it returns in some reasonable
> > time).
> > Have you looked at doing paging on the client side? This will hugely
> > speed up your search time.
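> >
> > (Client-side it's just a matter of bumping the start parameter on each
> > request - a minimal SolrJ-style sketch, with a made-up query and page
> > size; 'solr' here would be an HttpSolrServer pointed at the collection:
> >
> >     int pageSize = 100;
> >     for (int start = 0; start < 2000; start += pageSize) {
> >         SolrQuery q = new SolrQuery("allText:ipad");
> >         q.setStart(start);
> >         q.setRows(pageSize);
> >         SolrDocumentList page = solr.query(q).getResults();
> >         // process 'page' before requesting the next one
> >     }
> > )
> >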
> > HTH
> > Peter
> >
> >
> >
> > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > Well, depending on how many docs get served
> > > from the cache, the time will vary. But this is
> > > just ugly; if you can avoid this use-case it would
> > > be a Good Thing.
> > >
> > > Problem here is that each and every shard must
> > > assemble the list of 2,000 documents (just ID and
> > > sort criteria, usually score).
> > >
> > > Then the node serving the original request merges
> > > the sub-lists to pick the top 2,000. Then the node
> > > sends another request to each shard to fetch
> > > the full documents. Then the node merges these
> > > into the full list to return to the user.
> > >
> > > Solr really isn't built for this use-case; is it actually
> > > a compelling situation?
> > >
> > > And having your document cache set at 1M is kinda
> > > high if you have very big documents.
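> > >
> > > (Purely as an illustration - not a recommendation for your data -
> > > something much smaller is more typical, e.g.:
> > >
> > >   <documentCache class="solr.LRUCache" size="16384"
> > >                  initialSize="4096" autowarmCount="0"/>
> > > )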
> > >
> > > FWIW,
> > > Erick
> > >
> > >
> > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > >
> > > > Also, I don't see a consistent response time from Solr. I ran ab
> > > > again and I get this:
> > > >
> > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > "
> > > >
> > > >
> > > > Benchmarking x.amazonaws.com (be patient)
> > > > Completed 100 requests
> > > > Completed 200 requests
> > > > Completed 300 requests
> > > > Completed 400 requests
> > > > Completed 500 requests
> > > > Finished 500 requests
> > > >
> > > >
> > > > Server Software:
> > > > Server Hostname:       x.amazonaws.com
> > > > Server Port:            8983
> > > >
> > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > Document Length:        1538537 bytes
> > > >
> > > > Concurrency Level:      10
> > > > Time taken for tests:   10.858 seconds
> > > > Complete requests:      500
> > > > Failed requests:        8
> > > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > > Write errors:           0
> > > > Total transferred:      769297992 bytes
> > > > HTML transferred:       769268492 bytes
> > > > Requests per second:    46.05 [#/sec] (mean)
> > > > Time per request:       217.167 [ms] (mean)
> > > > Time per request:       21.717 [ms] (mean, across all concurrent requests)
> > > > Transfer rate:          69187.90 [Kbytes/sec] received
> > > >
> > > > Connection Times (ms)
> > > >               min  mean[+/-sd] median   max
> > > > Connect:        0    0   0.3      0       2
> > > > Processing:   110  215  72.0    190     497
> > > > Waiting:       91  180  70.5    152     473
> > > > Total:        112  216  72.0    191     497
> > > >
> > > > Percentage of the requests served within a certain time (ms)
> > > >   50%    191
> > > >   66%    225
> > > >   75%    252
> > > >   80%    272
> > > >   90%    319
> > > >   95%    364
> > > >   98%    420
> > > >   99%    453
> > > >  100%    497 (longest request)
> > > >
> > > >
> > > > Sometimes it takes a lot of time, sometimes it's pretty quick.
> > > >
> > > > Thanks,
> > > > -Utkarsh
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a use case where I need to retrieve the top 2000 documents
> > > > > matching a query.
> > > > > What are the parameters (in query, solrconfig, schema) I should look
> > > > > at to improve this?
> > > > >
> > > > > I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3 shards,
> > > > > 30GB RAM, 8 vCPUs, and a 7GB JVM heap.
> > > > >
> > > > > I have documentCache:
> > > > >   <documentCache class="solr.LRUCache" size="1000000"
> > > > >                  initialSize="1000000" autowarmCount="0"/>
> > > > >
> > > > > allText is a copyField.
> > > > >
> > > > > This is the result I get:
> > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > "
> > > > >
> > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > Completed 100 requests
> > > > > Completed 200 requests
> > > > > Completed 300 requests
> > > > > Completed 400 requests
> > > > > Completed 500 requests
> > > > > Finished 500 requests
> > > > >
> > > > >
> > > > > Server Software:
> > > > > Server Hostname:        x.amazonaws.com
> > > > > Server Port:            8983
> > > > >
> > > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > Document Length:        1538537 bytes
> > > > >
> > > > > Concurrency Level:      10
> > > > > Time taken for tests:   35.999 seconds
> > > > > Complete requests:      500
> > > > > Failed requests:        21
> > > > >    (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
> > > > > Write errors:           0
> > > > > Non-2xx responses:      2
> > > > > Total transferred:      766221660 bytes
> > > > > HTML transferred:       766191806 bytes
> > > > > Requests per second:    13.89 [#/sec] (mean)
> > > > > Time per request:       719.981 [ms] (mean)
> > > > > Time per request:       71.998 [ms] (mean, across all concurrent requests)
> > > > > Transfer rate:          20785.65 [Kbytes/sec] received
> > > > >
> > > > > Connection Times (ms)
> > > > >               min  mean[+/-sd] median   max
> > > > > Connect:        0    0   0.6      0       8
> > > > > Processing:     9  717 2339.6    199   12611
> > > > > Waiting:        9  635 2233.6    164   12580
> > > > > Total:          9  718 2339.6    199   12611
> > > > >
> > > > > Percentage of the requests served within a certain time (ms)
> > > > >   50%    199
> > > > >   66%    236
> > > > >   75%    263
> > > > >   80%    281
> > > > >   90%    548
> > > > >   95%    838
> > > > >   98%  12475
> > > > >   99%  12545
> > > > >  100%  12611 (longest request)
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > -Utkarsh
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > -Utkarsh
> > > >
> > >
> >
>
>
>
> --
> Thanks,
> -Utkarsh
>
