Thanks Mark.

I checked all the items you mentioned:

1) I've set omitNorms="true" for all my indexed fields (for stored-only
   fields I guess it doesn't matter).
2) I've tried commenting out all the caches in solrconfig.xml, but that
   doesn't help much.
3) I've tried commenting out the firstSearcher and newSearcher listener
   settings in solrconfig.xml. The only way that helps is that memory
   usage doesn't spike at startup, and that's only because there is no
   auto-warming query to run. But I noticed that commenting out the
   listeners slows down other queries to Solr.
4) I don't have any sort or facet in my queries.
5) I'm not sure how to change the "Lucene term interval" from Solr - is
   there a way to do that?

I've been playing around with this memory thing the whole day and have
found that it's the search that's hogging the memory. Any time there is
a search on all the records (800 million), the heap consumption jumps by
5G. This makes me think there has to be some configuration in Solr
that's causing some terms per document to be loaded in memory.

I've posted my settings several times on this forum, but no one has been
able to pinpoint what configuration might be causing this. If someone is
interested I can attach the solrconfig and schema files as well.
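
For anyone following along, here is a rough sketch (not a measurement) of the norms estimate Mark gives in his quoted reply further down. It assumes one byte of heap per document for every indexed field that keeps norms, and that a second open searcher holds its own copy while the old one is still serving. The function name norms_heap_bytes is made up for illustration.

```python
# Rough heap estimate for Lucene norms, following Mark's "almost 800 MB" figure.
# Assumption: 1 byte per document for each indexed field that keeps norms.
# Assumption: every open searcher holds its own copy of the norms.

def norms_heap_bytes(num_docs, fields_with_norms, open_searchers=1):
    """Bytes of heap held by norms across all open searchers."""
    return num_docs * fields_with_norms * open_searchers

NUM_DOCS = 800_000_000  # total documents across the 4 cores

one_field = norms_heap_bytes(NUM_DOCS, fields_with_norms=1)
while_warming = norms_heap_bytes(NUM_DOCS, fields_with_norms=1, open_searchers=2)

print(f"1 field with norms, 1 searcher : {one_field / 2**20:.0f} MiB")
print(f"1 field with norms, 2 searchers: {while_warming / 2**20:.0f} MiB")
```

With omitNorms="true" on every indexed field, fields_with_norms is 0 and this term drops out entirely, so norms alone would not explain the 5G jump.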

Here are the settings again under the query tag,

<query>
  <maxBooleanClauses>1024</maxBooleanClauses>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>50</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
  <HashDocSet maxSize="3000" loadFactor="0.75"/>
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>

and schema,

<field name="id" type="long" indexed="true" stored="true"
       required="true" omitNorms="true" compressed="false"/>
<field name="atmps" type="integer" indexed="false" stored="true"
       compressed="false"/>
<field name="bcid" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="cmpcd" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="ctry" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="dlt" type="date" indexed="false" stored="true"
       default="NOW/HOUR" compressed="false"/>
<field name="dmn" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="eaddr" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="emsg" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="erc" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="evt" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="from" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="lfid" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="lsid" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="prsid" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="rc" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="rmcd" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="rmscd" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="scd" type="string" indexed="true" stored="true"
       omitNorms="true" compressed="false"/>
<field name="sip" type="string" indexed="false" stored="true"
       compressed="false"/>
<field name="ts" type="date" indexed="true" stored="false"
       default="NOW/HOUR" omitNorms="true"/>

<!-- catchall field, containing all other searchable text fields
     (implemented via copyField further on in this schema) -->
<field name="all" type="text_ws" indexed="true" stored="false"
       omitNorms="true" multiValued="true"/>

Any help is greatly appreciated.

Thanks,
-vivek

On Thu, May 14, 2009 at 6:22 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 800 million docs is on the high side for modern hardware.
>
> If even one field has norms on, you're talking almost 800 MB right there. And
> then if another Searcher is brought up while the old one is serving (which
> happens when you update)? Doubled.
>
> Your best bet is to distribute across a couple machines.
>
> To minimize you would want to turn off or down caching, don't facet, don't
> sort, turn off all norms, possibly get at the Lucene term interval and raise
> it. Drop on deck searchers setting. Even then, 800 million...time to
> distribute I'd think.
>
> vivek sar wrote:
>>
>> Some update on this issue,
>>
>> 1) I attached jconsole to my app and monitored the memory usage.
>> During indexing the memory usage goes up and down, which I think is
>> normal. The memory remains around the min heap size (4 G) for
>> indexing, but as soon as I run a search the tenured heap usage jumps
>> up to 6G and remains there. Subsequent searches increase the heap
>> usage even more until it reaches the max (8G) - after which everything
>> (indexing and searching) becomes slow.
>>
>> The search query is a very generic one in this case which goes through
>> all the cores (4 of them - 800 million records), finds 400 million
>> matches and returns 100 rows.
>>
>> Does the Solr searcher hold on to references to objects in memory? I
>> couldn't find any settings that would tell me it does, but every
>> search causing heap to go up is definitely suspicious.
>>
>> 2) I ran the jmap histo to get the top objects (this is on a smaller
>> instance with 2 G memory, this is before running search - after
>> running search I wasn't able to run jmap),
>>
>>  num   #instances      #bytes  class name
>> ----------------------------------------------
>>    1:     3890855   222608992  [C
>>    2:     3891673   155666920  java.lang.String
>>    3:     3284341   131373640  org.apache.lucene.index.TermInfo
>>    4:     3334198   106694336  org.apache.lucene.index.Term
>>    5:         271    26286496  [J
>>    6:          16    26273936  [Lorg.apache.lucene.index.Term;
>>    7:          16    26273936  [Lorg.apache.lucene.index.TermInfo;
>>    8:      320512    15384576  org.apache.lucene.index.FreqProxTermsWriter$PostingList
>>    9:       10335    11554136  [I
>>
>> I'm not sure what the first one ([C) is? I couldn't profile it to know
>> what all the Strings are being allocated by - any ideas?
>>
>> Any ideas on what the Searcher might be holding on to and how we can
>> change that behavior?
>>
>> Thanks,
>> -vivek
>>
>>
>> On Thu, May 14, 2009 at 11:33 AM, vivek sar <vivex...@gmail.com> wrote:
>>
>>> I don't know if field type has any impact on the memory usage - does it?
>>>
>>> Our use cases require complete matches, thus there is no need of any
>>> analysis in most cases - does it matter in terms of memory usage?
>>>
>>> Also, is there any default caching used by Solr if I comment out all
>>> the caches under query in solrconfig.xml? I also don't have any
>>> auto-warming queries.
>>>
>>> Thanks,
>>> -vivek
>>>
>>> On Wed, May 13, 2009 at 4:24 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Warning: I'm waaaay out of my competency range when I comment
>>>> on SOLR, but I've seen the statement that string fields are NOT
>>>> tokenized while text fields are, and I notice that almost all of your
>>>> fields are string type.
>>>>
>>>> Would someone more knowledgeable than me care to comment on whether
>>>> this is at all relevant? Offered in the spirit that sometimes there are
>>>> things so basic that only an amateur can see them <G>....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Wed, May 13, 2009 at 4:42 PM, vivek sar <vivex...@gmail.com> wrote:
>>>>
>>>>> Thanks Otis.
>>>>>
>>>>> Our use case doesn't require any sorting or faceting. I'm wondering if
>>>>> I've configured anything wrong.
>>>>>
>>>>> I've got a total of 25 fields (15 are indexed and stored, the other 10
>>>>> are just stored). All my fields are basic data types - which I thought
>>>>> are not sorted. My id field is the unique key.
>>>>>
>>>>> Is there any field here that might be getting sorted?
>>>>>
>>>>> <field name="id" type="long" indexed="true" stored="true"
>>>>>        required="true" omitNorms="true" compressed="false"/>
>>>>>
>>>>> <field name="atmps" type="integer" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="bcid" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="cmpcd" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="ctry" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="dlt" type="date" indexed="false" stored="true"
>>>>>        default="NOW/HOUR" compressed="false"/>
>>>>> <field name="dmn" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="eaddr" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="emsg" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="erc" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="evt" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="from" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="lfid" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="lsid" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="prsid" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="rc" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="rmcd" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="rmscd" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="scd" type="string" indexed="true" stored="true"
>>>>>        omitNorms="true" compressed="false"/>
>>>>> <field name="sip" type="string" indexed="false" stored="true"
>>>>>        compressed="false"/>
>>>>> <field name="ts" type="date" indexed="true" stored="false"
>>>>>        default="NOW/HOUR" omitNorms="true"/>
>>>>>
>>>>>
>>>>> <!-- catchall field, containing all other searchable text fields
>>>>>      (implemented via copyField further on in this schema) -->
>>>>> <field name="all" type="text_ws" indexed="true" stored="false"
>>>>>        omitNorms="true" multiValued="true"/>
>>>>>
>>>>> Thanks,
>>>>> -vivek
>>>>>
>>>>> On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
>>>>> <otis_gospodne...@yahoo.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Some answers:
>>>>>> 1) .tii files in the Lucene index. When you sort, all distinct values
>>>>>> for the field(s) used for sorting. Similarly for facet fields. Solr
>>>>>> caches.
>>>>>>
>>>>>> 2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will
>>>>>> consume during indexing. There is no need to commit every 50K docs
>>>>>> unless you want to trigger snapshot creation.
>>>>>>
>>>>>> 3) see 1) above
>>>>>>
>>>>>> 1.5 billion docs per instance where each doc is cca 1KB? I doubt
>>>>>> that's going to fly. :)
>>>>>>
>>>>>> Otis
>>>>>> --
>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>
>>>>>>
>>>>>> ----- Original Message ----
>>>>>>
>>>>>>> From: vivek sar <vivex...@gmail.com>
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Sent: Wednesday, May 13, 2009 3:04:46 PM
>>>>>>> Subject: Solr memory requirements?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm pretty sure this has been asked before, but I couldn't find a
>>>>>>> complete answer in the forum archive. Here are my questions,
>>>>>>>
>>>>>>> 1) When Solr starts up, what does it load into memory? Let's say
>>>>>>> I've 4 cores with each core 50G in size.
>>>>>>> When Solr comes up how much of it would be loaded in memory?
>>>>>>>
>>>>>>> 2) How much memory is required during index time? If I'm committing
>>>>>>> 50K records at a time (1 record = 1KB) using solrj, how much memory
>>>>>>> do I need to give to Solr?
>>>>>>>
>>>>>>> 3) Is there a minimum memory requirement by Solr to maintain a
>>>>>>> certain size index? Is there any benchmark on this?
>>>>>>>
>>>>>>> Here are some of my configuration settings from solrconfig.xml,
>>>>>>>
>>>>>>> 1) 64
>>>>>>> 2) All the caches (under query tag) are commented out
>>>>>>> 3) Few others,
>>>>>>>    a) true ==> would this require memory?
>>>>>>>    b) 50
>>>>>>>    c) 200
>>>>>>>    d)
>>>>>>>    e) false
>>>>>>>    f) 2
>>>>>>>
>>>>>>> The problem we are having is the following,
>>>>>>>
>>>>>>> I've given Solr 6G of RAM. As the total index size (all cores
>>>>>>> combined) starts growing, the Solr memory consumption goes up. With
>>>>>>> 800 million documents, I see Solr already taking up all the memory
>>>>>>> at startup. After that, commits and searches all become slow. We
>>>>>>> will be having a distributed setup with multiple Solr instances
>>>>>>> (around 8) on four boxes, but our requirement is to have each Solr
>>>>>>> instance maintain at least around 1.5 billion documents.
>>>>>>>
>>>>>>> We are trying to see if we can somehow reduce the Solr memory
>>>>>>> footprint. If someone can provide a pointer on what parameters affect
>>>>>>> memory and what effects they have, we can then decide whether we want
>>>>>>> each parameter or not. I'm not sure if there is any minimum Solr
>>>>>>> requirement for it to be able to maintain large indexes. I've used
>>>>>>> Lucene before and that didn't require anything by default - it used
>>>>>>> up memory only during index and search times - not otherwise.
>>>>>>>
>>>>>>> Any help is very much appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -vivek
>
> --
> - Mark
>
> http://www.lucidimagination.com
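
A rough reading of the jmap histogram quoted above, offered only as a sketch. It assumes the TermInfo and Term instances belong to the in-memory term index, and that the index was written with Lucene's default term index interval of 128 (nothing in the thread confirms which value is in use). The helper names are hypothetical, and the histogram came from the smaller 2G instance, so the numbers do not describe the full 800 million document setup.

```python
# Back-of-the-envelope reading of the jmap histogram from the thread.
# Assumption: the reader keeps every Nth term in memory, where N is the
# term index interval (Lucene's default is 128).

HISTOGRAM = {  # class name -> (instances, bytes), copied from the jmap output
    "org.apache.lucene.index.TermInfo": (3_284_341, 131_373_640),
    "org.apache.lucene.index.Term": (3_334_198, 106_694_336),
}

def bytes_per_instance(instances, total_bytes):
    """Average shallow size of one instance of a class."""
    return total_bytes / instances

def estimated_terms_on_disk(indexed_terms, interval=128):
    """Approximate number of unique terms if every `interval`-th one is indexed."""
    return indexed_terms * interval

def indexed_terms_after_raise(indexed_terms, old_interval, new_interval):
    """Approximate in-memory term count after rewriting with a larger interval."""
    return indexed_terms * old_interval // new_interval

term_infos, term_info_bytes = HISTOGRAM["org.apache.lucene.index.TermInfo"]

print("bytes per TermInfo     :", bytes_per_instance(term_infos, term_info_bytes))
print("approx. terms on disk  :", estimated_terms_on_disk(term_infos))
print("in memory at 512       :", indexed_terms_after_raise(term_infos, 128, 512))
```

If these assumptions hold, raising the interval by a factor of four would cut the Term, TermInfo and related String objects to roughly a quarter, at the cost of slower term lookups. As far as I know the interval is applied when segments are written, so existing segments would have to be rewritten before any saving shows up.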