Some more info: profiling the heap dump shows org.apache.lucene.index.ReadOnlySegmentReader as the biggest object, taking up almost 80% of total memory (6G) - see the attached screenshot for a smaller dump. There are also some norms objects - I'm not sure where they are coming from, as I've set omitNorms=true for all indexed fields.
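One thing worth ruling out: omitNorms only applies to newly written segments, so documents indexed before the schema change (or segments merged with ones that had norms) can still carry norms on disk, and those get loaded per reader. Here is a minimal sketch to check, assuming the Lucene 2.9-era API that Solr 1.4 bundles; the index path argument is whatever your core's data/index directory is:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Prints, for every indexed field, whether norms are actually present
    // in the index - regardless of what the current schema says.
    public class NormsCheck {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            try {
                for (Object name : reader.getFieldNames(IndexReader.FieldOption.INDEXED)) {
                    String field = (String) name;
                    System.out.println(field + " hasNorms=" + reader.hasNorms(field));
                }
            } finally {
                reader.close();
            }
        }
    }

If any field still reports hasNorms=true, rebuilding the index with the current schema should reclaim roughly one byte per document per affected field.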
I also noticed that if I run a query - let's say a generic query that hits 100 million records - and then follow up with a specific query that hits only 1 record, the second query causes the increase in heap. It looks like a few bytes are being loaded into memory for each document. I've checked the schema - all indexed fields have omitNorms=true - and all caches are commented out. I'm still looking to see what else might put things in memory that don't get collected by GC.

I also saw https://issues.apache.org/jira/browse/SOLR-1111 for Solr 1.4 (which I'm using) - not sure if that can cause any problem. I do use range queries for dates - would that have any effect? Any other ideas?

Thanks,
-vivek

On Thu, May 14, 2009 at 8:38 PM, vivek sar <vivex...@gmail.com> wrote:
> Thanks Mark.
>
> I checked all the items you mentioned:
>
> 1) I have omitNorms=true for all my indexed fields (stored-only fields, I guess, don't matter).
> 2) I've tried commenting out all caches in solrconfig.xml, but that doesn't help much.
> 3) I've tried commenting out the firstSearcher and newSearcher listener settings in solrconfig.xml - the only way that helps is that the memory usage doesn't spike at startup, and that's only because there is no auto-warming query to run. But I noticed that commenting out the searchers slows down any other queries to Solr.
> 4) I don't have any sort or facet in my queries.
> 5) I'm not sure how to change the Lucene term index interval from Solr - is there a way to do that?
>
> I've been playing around with this memory issue the whole day and have found that it's the search that's hogging the memory. Any time there is a search on all the records (800 million), the heap consumption jumps by 5G. This makes me think there has to be some configuration in Solr that's causing some bytes per document to be loaded into memory.
>
> I've posted my settings several times on this forum, but no one has been able to pinpoint what configuration might be causing this. If someone is interested I can attach the solrconfig and schema files as well.
>
> Here are the settings again, under the query tag:
>
> <query>
>   <maxBooleanClauses>1024</maxBooleanClauses>
>   <enableLazyFieldLoading>true</enableLazyFieldLoading>
>   <queryResultWindowSize>50</queryResultWindowSize>
>   <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
>   <HashDocSet maxSize="3000" loadFactor="0.75"/>
>   <useColdSearcher>false</useColdSearcher>
>   <maxWarmingSearchers>2</maxWarmingSearchers>
> </query>
>
> and the schema:
>
> <field name="id" type="long" indexed="true" stored="true" required="true" omitNorms="true" compressed="false"/>
> <field name="atmps" type="integer" indexed="false" stored="true" compressed="false"/>
> <field name="bcid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="cmpcd" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="ctry" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="dlt" type="date" indexed="false" stored="true" default="NOW/HOUR" compressed="false"/>
> <field name="dmn" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="eaddr" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="emsg" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="erc" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="evt" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="from" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="lfid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="lsid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="prsid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="rc" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="rmcd" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="rmscd" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="scd" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
> <field name="sip" type="string" indexed="false" stored="true" compressed="false"/>
> <field name="ts" type="date" indexed="true" stored="false" default="NOW/HOUR" omitNorms="true"/>
>
> <!-- catchall field, containing all other searchable text fields (implemented via copyField further on in this schema) -->
> <field name="all" type="text_ws" indexed="true" stored="false" omitNorms="true" multiValued="true"/>
>
> Any help is greatly appreciated.
>
> Thanks,
> -vivek
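On item 5 above: the term index interval is a write-time Lucene setting. In the Lucene 2.9 line that Solr 1.4 bundles, it is IndexWriter.setTermIndexInterval(); with the default of 128, each SegmentReader loads every 128th term into RAM, so raising the interval shrinks the in-memory term index proportionally at the cost of slower term lookups. Later Solr versions expose this as a termIndexInterval element in the index section of solrconfig.xml - whether 1.4 honors that element is worth verifying against its SolrIndexConfig. A sketch at the raw Lucene level (the analyzer here is just a placeholder):

    import java.io.File;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    // Raises the term index interval from the default 128 to 1024, which
    // cuts the in-RAM term index of newly written segments roughly 8x.
    public class RaiseTermIndexInterval {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File(args[0])),   // index directory
                    new WhitespaceAnalyzer(),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setTermIndexInterval(1024);             // default is 128
            // ... add/reindex documents here; segments written before this
            // change keep the old interval until they are rewritten.
            writer.close();
        }
    }

Note that this only affects newly written segments, so the index has to be rebuilt (or at least re-merged) before readers see the benefit.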
>
> On Thu, May 14, 2009 at 6:22 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 800 million docs is on the high side for modern hardware.
>>
>> If even one field has norms on, you're talking almost 800 MB right there. And then if another searcher is brought up while the old one is serving (which happens when you update)? Doubled.
>>
>> Your best bet is to distribute across a couple of machines.
>>
>> To minimize, you would want to turn off or turn down caching, don't facet, don't sort, turn off all norms, and possibly get at the Lucene term index interval and raise it. Drop the on-deck searchers setting. Even then, 800 million... time to distribute, I'd think.
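Mark's "almost 800 MB" figure is straightforward to check: Lucene holds norms as one byte per document per normed field, resident in each SegmentReader for the life of the searcher:

    // Back-of-envelope check of the "almost 800 MB" figure.
    public class NormsMath {
        public static void main(String[] args) {
            long docs = 800000000L;         // ~800 million documents
            long perField = docs * 1L;      // norms cost 1 byte/doc/field
            System.out.printf("one normed field: ~%d MB%n",
                    perField / (1024 * 1024));
            // While a new searcher warms alongside the old one (e.g. after
            // a commit), both readers hold their norms arrays: ~2x.
            System.out.printf("during warmup:    ~%d MB%n",
                    2 * perField / (1024 * 1024));
        }
    }

That prints roughly 762 MB for a single normed field, and about double that while an old and a new searcher overlap - which matches the "Doubled" point above.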
>>
>> vivek sar wrote:
>>>
>>> Some update on this issue:
>>>
>>> 1) I attached jconsole to my app and monitored the memory usage.
>>> During indexing the memory usage goes up and down, which I think is
>>> normal. The memory stays around the min heap size (4G) during
>>> indexing, but as soon as I run a search the tenured heap usage jumps
>>> to 6G and stays there. Subsequent searches increase the heap usage
>>> even more, until it reaches the max (8G) - after which everything
>>> (indexing and searching) becomes slow.
>>>
>>> The search query in this case is a very generic one that goes through
>>> all the cores (4 of them - 800 million records), finds 400 million
>>> matches, and returns 100 rows.
>>>
>>> Does the Solr searcher hold references to objects in memory? I
>>> couldn't find any setting that says it does, but every search causing
>>> the heap to go up is definitely suspicious.
>>>
>>> 2) I ran jmap -histo to get the top objects (this is on a smaller
>>> instance with 2G memory, and before running a search - after running
>>> a search I wasn't able to run jmap):
>>>
>>> num   #instances   #bytes      class name
>>> ----------------------------------------------
>>>   1:  3890855      222608992   [C
>>>   2:  3891673      155666920   java.lang.String
>>>   3:  3284341      131373640   org.apache.lucene.index.TermInfo
>>>   4:  3334198      106694336   org.apache.lucene.index.Term
>>>   5:  271          26286496    [J
>>>   6:  16           26273936    [Lorg.apache.lucene.index.Term;
>>>   7:  16           26273936    [Lorg.apache.lucene.index.TermInfo;
>>>   8:  320512       15384576    org.apache.lucene.index.FreqProxTermsWriter$PostingList
>>>   9:  10335        11554136    [I
>>>
>>> I'm not sure what the first one ([C) is. I couldn't profile it to find
>>> out what is allocating all those Strings - any ideas?
>>>
>>> Any ideas on what the Searcher might be holding on to, and how we can
>>> change that behavior?
>>>
>>> Thanks,
>>> -vivek
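A rough way to read that histogram: [C is char[] (the storage behind java.lang.String), and the String/char[]/Term/TermInfo quartet at the top is characteristic of Lucene's in-memory term index - TermInfosReader keeps every termIndexInterval-th term of each segment (default 128) in RAM, and each entry costs a Term, a TermInfo, a String, and a char[]. A back-of-envelope model using the counts above; the per-object byte sizes are estimates, not measurements:

    // Rough model of the term index behind the jmap histogram.
    public class TermIndexEstimate {
        public static void main(String[] args) {
            long indexedTerms = 3284341L;   // TermInfo count from the histogram
            int interval = 128;             // Lucene default
            // The in-RAM index holds 1/interval of all terms in the segments.
            long totalTerms = indexedTerms * interval;
            // Very rough per-entry cost: TermInfo (~40 B) + Term (~32 B)
            // + String (~40 B) + char[] (~57 B avg, from the histogram).
            long bytesPerEntry = 40 + 32 + 40 + 57;
            System.out.printf("~%d terms on disk, ~%d MB of term index in RAM%n",
                    totalTerms, indexedTerms * bytesPerEntry / (1024 * 1024));
            // Raising the interval to 1024 would cut the in-RAM entries ~8x.
            System.out.printf("at interval=1024: ~%d MB%n",
                    (indexedTerms / 8) * bytesPerEntry / (1024 * 1024));
        }
    }

Under those assumptions the term index alone accounts for roughly 500 MB on the 2G instance, which fits the histogram. It also suggests a very large term dictionary (lots of unique terms, e.g. the id field), and raising the term index interval, as in the sketch further up, is the usual lever for that. The FreqProxTermsWriter$PostingList entries, by contrast, are indexing-side buffers and unrelated to search.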