Hi Shawn,

Thanks for the insights into the memory requirements. It looks like the cursor approach is going to require a lot of memory for millions of documents. If I run a query that returns only 500K documents while still keeping 100K docs per page, I don't see long GC pauses. So it is not really the number of rows per page but the overall number of docs that matters. Maybe I can reduce the document cache and the field cache. What do you think?
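For context, this is roughly the kind of cursorMark paging I mean (just a minimal SolrJ 5.5 sketch, not my actual code; the zkHost, collection, field names, and the 100K page size are placeholders, and the sort has to include the uniqueKey field):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPageSketch {
      public static void main(String[] args) throws Exception {
        // Placeholders: adjust zkHost, collection, fields, and page size.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr");
        client.setDefaultCollection("myCollection");

        SolrQuery query = new SolrQuery("*:*");
        query.setFields("id", "field_s", "field_l");
        query.setRows(100000);                          // docs per page
        query.setSort(SolrQuery.SortClause.asc("id"));  // sort must include uniqueKey

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
          query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
          QueryResponse rsp = client.query(query);
          for (SolrDocument doc : rsp.getResults()) {
            // ... process each document ...
          }
          String nextCursorMark = rsp.getNextCursorMark();
          done = cursorMark.equals(nextCursorMark);     // unchanged cursor => no more pages
          cursorMark = nextCursorMark;
        }
        client.close();
      }
    }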
Erick,

I was using the streaming approach to get results back from Solr, but I was running into some runtime exceptions. That bug has been fixed in Solr 6.0, but for other reasons I won't be able to move to Java 8, so I will have to stick with Solr 5.5.0. That is why I had to switch to the cursor approach.

Thanks!

On Wed, Apr 12, 2017 at 8:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> You're missing the point of my comment. Since they already are
> docValues, you can use the /export functionality to get the results
> back as a _stream_ and avoid all of the overhead of the aggregator
> node doing a merge sort and all of that.
>
> You'll have to do this from SolrJ, but see CloudSolrStream. You can
> see examples of its usage in StreamingTest.java.
>
> This should
> 1> complete much, much faster. The design goal is 400K rows/second, but YMMV.
> 2> use vastly less memory on your Solr instances.
> 3> only require _one_ query.
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> >> I am getting back 100K results per page.
> >> The fields have docValues enabled and I am getting sorted results based
> >> on "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
> >>
> >> I have a Solr Cloud of 80 nodes. There will be one shard that will get
> >> the top 100K docs from each shard and apply a merge sort. So the max
> >> memory usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why
> >> would heap memory usage shoot up from 8 GB to 17 GB?
> >
> > From what I understand, Java overhead for a String object is 56 bytes
> > above the actual byte size of the string itself. And each character in
> > the string will be two bytes -- Java uses UTF-16 for character
> > representation internally. If I'm right about these numbers, it means
> > that each of those id values will take 120 bytes -- and that doesn't
> > include the size of the actual response (xml, json, etc).
> >
> > I don't know what the overhead for a long is, but you can be sure that
> > it's going to take more than eight bytes total memory usage for each one.
> >
> > Then there is overhead for all the Lucene memory structures required to
> > execute the query and gather results, plus Solr memory structures to
> > keep track of everything. I have absolutely no idea how much memory
> > Lucene and Solr use to accomplish a query, but it's not going to be
> > small when you have 200 million documents per shard.
> >
> > Speaking of Solr memory requirements, under normal query circumstances
> > the aggregating node is going to receive at least 100K results from
> > *every* shard in the collection, which it will condense down to the
> > final result with 100K entries. The behavior during a cursor-based
> > request may be more memory-efficient than what I have described, but I
> > am unsure whether that is the case.
> >
> > If the cursor behavior is not more efficient, then each entry in those
> > results will contain the uniqueKey value and the score. That's going to
> > be many megabytes for every shard. If there are 80 shards, it would
> > probably be over a gigabyte for one request.
> >
> > Thanks,
> > Shawn
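For reference, this is the shape of the CloudSolrStream usage Erick points to, patterned loosely on StreamingTest.java (a rough, untested sketch against the SolrJ 5.5 streaming API; the zkHost, collection, and field names are placeholders, and every field in fl and sort needs docValues since the request goes through the /export handler):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

    public class ExportStreamSketch {
      public static void main(String[] args) throws Exception {
        // Placeholders: adjust zkHost, collection, and field names for your cluster.
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr";

        Map<String, String> params = new HashMap<>();
        params.put("q", "*:*");
        params.put("qt", "/export");            // stream from the export handler
        params.put("fl", "id,field_s,field_l"); // docValues fields only
        params.put("sort", "id asc");

        CloudSolrStream stream = new CloudSolrStream(zkHost, "myCollection", params);
        try {
          stream.open();
          Tuple tuple = stream.read();
          while (!tuple.EOF) {            // EOF tuple marks the end of the stream
            String id = tuple.getString("id");
            // ... process the tuple ...
            tuple = stream.read();
          }
        } finally {
          stream.close();
        }
      }
    }

The idea, as Erick describes it, is that the tuples arrive sorted as a stream from each shard, so nothing has to buffer 100K rows per shard on an aggregator node the way a paged query does.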