50M documents, depending on a bunch of things, may not be unreasonable
for a single node; only testing will tell.
But the question I have is whether you should be using standard Solr
queries for this, or building a custom component that goes at the base
Lucene index and "does the right thing". Or even re-indexing your
entire corpus periodically to add this kind of data.

FWIW,
Erick

On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Thanks Erick/Peter.
>
> This is an offline process, used by a relevancy engine implemented
> around Solr. The engine computes boost scores for related keywords
> based on clickstream data.
> For example, say the clickstream has: ipad=upc1,upc2,upc3
> I query Solr with the keyword "ipad" (to get 2000 documents) and then
> make 3 individual queries for upc1, upc2, upc3 (which are fast).
> The data is then used to compute keywords related to "ipad", with
> their boost values.
>
> So I cannot really replace that, since I need full-text search over
> my dataset to retrieve the top 2000 documents.
>
> I tried paging: I retrieve 500 Solr documents 4 times (0-500,
> 500-1000, ...), but I don't see any improvement.
>
> Some questions:
> 1. Maybe the JVM size might help?
> This is what I see in the dashboard:
> Physical Memory: 76.2%
> Swap Space: NaN% (don't have any swap space, running on AWS EBS)
> File Descriptor Count: 4.7%
> JVM Memory: 73.8%
>
> Screenshot: http://i.imgur.com/aegKzP6.png
>
> 2. Will reducing the shards from 3 to 1 improve performance (maybe
> with an increase in RAM from 30 to 60GB)? The problem I will face in
> that case will be fitting 50M documents on one machine.
>
> Thanks,
> -Utkarsh
>
> On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge <peter.stu...@gmail.com> wrote:
>
> > Hello Utkarsh,
> > This may or may not be relevant for your use case, but the way we
> > deal with this scenario is to retrieve the top N documents 5, 10,
> > 20 or 100 at a time (user selectable). We can then page the
> > results, changing the start parameter to return the next set. This
> > allows us to 'retrieve' millions of documents - we just do it at
> > the user's leisure, rather than make them wait for the whole lot
> > in one go.
> > This works well because users very rarely want to see ALL 2000 (or
> > whatever number) documents at one time - it's simply too much to
> > take in at one time.
> > If your use case involves an automated or offline procedure (e.g.
> > running a report or some data-mining op), then presumably it
> > doesn't matter so much that it takes a bit longer (as long as it
> > returns in some reasonable time).
> > Have you looked at doing paging on the client side? This will
> > hugely speed up your search time.
> > HTH
> > Peter
> >
> > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > Well, depending on how many docs get served from the cache, the
> > > time will vary. But this is just ugly; if you can avoid this
> > > use case it would be a Good Thing.
> > >
> > > Problem here is that each and every shard must assemble the list
> > > of 2,000 documents (just ID and sort criteria, usually score).
> > >
> > > Then the node serving the original request merges the sub-lists
> > > to pick the top 2,000. Then the node sends another request to
> > > each shard to get the full documents. Then the node merges these
> > > into the full list to return to the user.
> > >
> > > Solr really isn't built for this use case; is it actually a
> > > compelling situation?
> > >
> > > And having your document cache set at 1M is kinda high if you
> > > have very big documents.
> > >
> > > FWIW,
> > > Erick
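To make the paging approach concrete: below is a minimal SolrJ 4.x
sketch of the client-side paging Peter suggests and Utkarsh tried,
fetching 2000 documents in pages of 500. It assumes the offline
relevancy engine only needs each document's ID and score, so fl is
trimmed accordingly; the localhost endpoint URL is a placeholder.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class PagedTopDocs {
        public static void main(String[] args) throws SolrServerException {
            // Placeholder endpoint; point at any node of the cluster.
            HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/prodinfo");

            final int wanted = 2000;
            final int pageSize = 500;
            List<SolrDocument> top = new ArrayList<SolrDocument>(wanted);

            for (int start = 0; start < wanted; start += pageSize) {
                SolrQuery q = new SolrQuery("allText:(huggies diapers size 1)");
                q.setStart(start);
                q.setRows(pageSize);
                // Assumption: the offline engine only needs IDs and
                // scores, so don't ship whole stored documents back.
                q.setFields("id", "score");

                QueryResponse rsp = solr.query(q);
                SolrDocumentList page = rsp.getResults();
                top.addAll(page);
                if (page.size() < pageSize) {
                    break; // fewer matches than requested; stop early
                }
            }
            System.out.println("Fetched " + top.size() + " documents");
        }
    }

Note that paging alone does not reduce the per-shard work Erick
describes above: to serve start=1500&rows=500, each shard still has to
assemble its top 2,000 entries, which would explain why Utkarsh saw no
improvement from paging by itself. The saving comes from the much
smaller responses once fl is trimmed, given that each rows=2000
response in the benchmarks below weighs in at roughly 1.5MB.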
> > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > >
> > > > Also, I don't see a consistent response time from Solr. I ran
> > > > ab again and I get this:
> > > >
> > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > "
> > > >
> > > > Benchmarking x.amazonaws.com (be patient)
> > > > Completed 100 requests
> > > > Completed 200 requests
> > > > Completed 300 requests
> > > > Completed 400 requests
> > > > Completed 500 requests
> > > > Finished 500 requests
> > > >
> > > > Server Software:
> > > > Server Hostname:        x.amazonaws.com
> > > > Server Port:            8983
> > > >
> > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > Document Length:        1538537 bytes
> > > >
> > > > Concurrency Level:      10
> > > > Time taken for tests:   10.858 seconds
> > > > Complete requests:      500
> > > > Failed requests:        8
> > > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > > Write errors:           0
> > > > Total transferred:      769297992 bytes
> > > > HTML transferred:       769268492 bytes
> > > > Requests per second:    46.05 [#/sec] (mean)
> > > > Time per request:       217.167 [ms] (mean)
> > > > Time per request:       21.717 [ms] (mean, across all concurrent requests)
> > > > Transfer rate:          69187.90 [Kbytes/sec] received
> > > >
> > > > Connection Times (ms)
> > > >               min  mean[+/-sd] median   max
> > > > Connect:        0    0   0.3      0       2
> > > > Processing:   110  215  72.0    190     497
> > > > Waiting:       91  180  70.5    152     473
> > > > Total:        112  216  72.0    191     497
> > > >
> > > > Percentage of the requests served within a certain time (ms)
> > > >   50%    191
> > > >   66%    225
> > > >   75%    252
> > > >   80%    272
> > > >   90%    319
> > > >   95%    364
> > > >   98%    420
> > > >   99%    453
> > > >  100%    497 (longest request)
> > > >
> > > > Sometimes it takes a lot of time, sometimes it's pretty quick.
> > > >
> > > > Thanks,
> > > > -Utkarsh
> > > >
> > > > On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a use case where I need to retrieve the top 2000
> > > > > documents matching a query.
> > > > > What are the parameters (in the query, solrconfig, schema) I
> > > > > should look at to improve this?
> > > > >
> > > > > I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with
> > > > > 3 shards, 30GB RAM, 8 vCPUs and a 7GB JVM heap.
> > > > >
> > > > > I have documentCache:
> > > > > <documentCache class="solr.LRUCache" size="1000000"
> > > > >                initialSize="1000000" autowarmCount="0"/>
> > > > >
> > > > > allText is a copyField.
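On the documentCache just above: as Erick notes earlier in the thread,
1M entries is high when documents are big, since the documentCache
holds fully materialized stored documents. A commonly cited sizing
guideline is max rows times expected concurrent queries; with
rows=2000 and 10 concurrent clients, that suggests on the order of
20,000 entries rather than 1M. A sketch of a more conservative setting
(the size is illustrative, not tuned):

    <documentCache class="solr.LRUCache" size="20480"
                   initialSize="20480" autowarmCount="0"/>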
> > > > > This is the result I get:
> > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > "
> > > > >
> > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > Completed 100 requests
> > > > > Completed 200 requests
> > > > > Completed 300 requests
> > > > > Completed 400 requests
> > > > > Completed 500 requests
> > > > > Finished 500 requests
> > > > >
> > > > > Server Software:
> > > > > Server Hostname:        x.amazonaws.com
> > > > > Server Port:            8983
> > > > >
> > > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > Document Length:        1538537 bytes
> > > > >
> > > > > Concurrency Level:      10
> > > > > Time taken for tests:   35.999 seconds
> > > > > Complete requests:      500
> > > > > Failed requests:        21
> > > > >    (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
> > > > > Write errors:           0
> > > > > Non-2xx responses:      2
> > > > > Total transferred:      766221660 bytes
> > > > > HTML transferred:       766191806 bytes
> > > > > Requests per second:    13.89 [#/sec] (mean)
> > > > > Time per request:       719.981 [ms] (mean)
> > > > > Time per request:       71.998 [ms] (mean, across all concurrent requests)
> > > > > Transfer rate:          20785.65 [Kbytes/sec] received
> > > > >
> > > > > Connection Times (ms)
> > > > >               min  mean[+/-sd] median   max
> > > > > Connect:        0    0   0.6      0       8
> > > > > Processing:     9  717 2339.6    199   12611
> > > > > Waiting:        9  635 2233.6    164   12580
> > > > > Total:          9  718 2339.6    199   12611
> > > > >
> > > > > Percentage of the requests served within a certain time (ms)
> > > > >   50%    199
> > > > >   66%    236
> > > > >   75%    263
> > > > >   80%    281
> > > > >   90%    548
> > > > >   95%    838
> > > > >   98%  12475
> > > > >   99%  12545
> > > > >  100%  12611 (longest request)
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > -Utkarsh
> > > >
> > > > --
> > > > Thanks,
> > > > -Utkarsh
>
> --
> Thanks,
> -Utkarsh
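One last aside on the "sometimes slow, sometimes quick" behavior and
the 12-second outliers at the 98th percentile in the run above: this
is consistent with Erick's point that timings depend on how many
documents are served from cache, and with autowarmCount=0 everywhere,
queries arriving just after a new searcher opens run against cold
caches. The thread does not discuss it, but one standard mitigation is
a warming query in solrconfig.xml; the example below simply reuses the
benchmark query as a stand-in:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">allText:(huggies diapers size 1)</str>
          <str name="rows">2000</str>
        </lst>
      </arr>
    </listener>

A matching listener on the firstSearcher event covers the cold start
after a restart.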