Some update on this issue:

1) I attached jconsole to my app and monitored the memory usage. During
indexing the memory usage goes up and down, which I think is normal. The
memory stays around the min heap size (4G) during indexing, but as soon as
I run a search the tenured heap usage jumps to 6G and stays there.
Subsequent searches increase the heap usage even more until it reaches the
max (8G), after which everything (indexing and searching) becomes slow.
The search query in this case is a very generic one that goes through all
the cores (4 of them, 800 million records), finds 400 million matches, and
returns 100 rows. Does the Solr searcher hold on to references to objects
in memory? I couldn't find any setting that suggests it does, but every
search causing the heap to go up is definitely suspicious.

2) I ran jmap -histo to get the top objects (this is on a smaller instance
with 2G memory, and before running a search - after running a search I
wasn't able to run jmap):

 num     #instances         #bytes  class name
----------------------------------------------
   1:       3890855      222608992  [C
   2:       3891673      155666920  java.lang.String
   3:       3284341      131373640  org.apache.lucene.index.TermInfo
   4:       3334198      106694336  org.apache.lucene.index.Term
   5:           271       26286496  [J
   6:            16       26273936  [Lorg.apache.lucene.index.Term;
   7:            16       26273936  [Lorg.apache.lucene.index.TermInfo;
   8:        320512       15384576  org.apache.lucene.index.FreqProxTermsWriter$PostingList
   9:         10335       11554136  [I

I'm not sure what the first one ([C) is. I couldn't profile it to find out
what is allocating all those Strings - any ideas? Any ideas on what the
Searcher might be holding on to, and how we can change that behavior?

Thanks,
-vivek

On Thu, May 14, 2009 at 11:33 AM, vivek sar <vivex...@gmail.com> wrote:
> I don't know if field type has any impact on the memory usage - does it?
>
> Our use cases require complete matches, thus there is no need for any
> analysis in most cases - does it matter in terms of memory usage?
>
> Also, is there any default caching used by Solr if I comment out all
> the caches under query in solrconfig.xml? I also don't have any
> auto-warming queries.
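An aside on the histogram above: jmap prints class names as JVM internal
type descriptors, so `[C` is an array of char - i.e. the char[] buffers
backing all those java.lang.String instances - and `[J` and `[I` are
long[] and int[]. A small helper to decode these descriptor names (just a
sketch, not part of any Solr/Lucene API):

```python
# Decode JVM internal class descriptors as printed by `jmap -histo`,
# e.g. "[C" -> "char[]", "[Lorg.apache.lucene.index.Term;" -> "...Term[]".
PRIMITIVES = {
    "Z": "boolean", "B": "byte", "C": "char", "S": "short",
    "I": "int", "J": "long", "F": "float", "D": "double",
}

def decode(descriptor: str) -> str:
    """Turn a jmap class name into readable Java array syntax."""
    dims = 0
    while descriptor.startswith("["):          # each '[' is one array dimension
        dims += 1
        descriptor = descriptor[1:]
    if dims == 0:
        return descriptor                      # plain class names print as-is
    if descriptor in PRIMITIVES:
        base = PRIMITIVES[descriptor]
    elif descriptor.startswith("L") and descriptor.endswith(";"):
        base = descriptor[1:-1]                # object array: strip "L...;"
    else:
        base = descriptor
    return base + "[]" * dims

for name in ("[C", "[J", "[I", "[Lorg.apache.lucene.index.Term;"):
    print(name, "->", decode(name))
# prints: [C -> char[], [J -> long[], [I -> int[],
#         [Lorg.apache.lucene.index.Term; -> org.apache.lucene.index.Term[]
```

Given the matching TermInfo/Term counts right below them, those String and
char[] instances are most likely the term text held by the in-memory term
dictionary rather than something the Searcher leaks per query.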
>
> Thanks,
> -vivek
>
> On Wed, May 13, 2009 at 4:24 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>> Warning: I'm waaaay out of my competency range when I comment
>> on SOLR, but I've seen the statement that string fields are NOT
>> tokenized while text fields are, and I notice that almost all of your
>> fields are string type.
>>
>> Would someone more knowledgeable than me care to comment on whether
>> this is at all relevant? Offered in the spirit that sometimes there are
>> things so basic that only an amateur can see them <G>....
>>
>> Best
>> Erick
>>
>> On Wed, May 13, 2009 at 4:42 PM, vivek sar <vivex...@gmail.com> wrote:
>>
>>> Thanks Otis.
>>>
>>> Our use case doesn't require any sorting or faceting. I'm wondering if
>>> I've configured anything wrong.
>>>
>>> I've got a total of 25 fields (15 are indexed and stored, the other 10
>>> are just stored). All my fields are basic data types - which I thought
>>> are not sorted. My id field is the unique key.
>>>
>>> Is there any field here that might be getting sorted?
>>>
>>> <field name="id" type="long" indexed="true" stored="true" required="true" omitNorms="true" compressed="false"/>
>>>
>>> <field name="atmps" type="integer" indexed="false" stored="true" compressed="false"/>
>>> <field name="bcid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="cmpcd" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="ctry" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="dlt" type="date" indexed="false" stored="true" default="NOW/HOUR" compressed="false"/>
>>> <field name="dmn" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="eaddr" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="emsg" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="erc" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="evt" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="from" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="lfid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="lsid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="prsid" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="rc" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="rmcd" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="rmscd" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="scd" type="string" indexed="true" stored="true" omitNorms="true" compressed="false"/>
>>> <field name="sip" type="string" indexed="false" stored="true" compressed="false"/>
>>> <field name="ts" type="date" indexed="true" stored="false" default="NOW/HOUR" omitNorms="true"/>
>>>
>>> <!-- catchall field, containing all other searchable text fields
>>>      (implemented via copyField further on in this schema) -->
>>> <field name="all" type="text_ws" indexed="true" stored="false"
>>>        omitNorms="true" multiValued="true"/>
>>>
>>> Thanks,
>>> -vivek
>>>
>>> On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
>>> <otis_gospodne...@yahoo.com> wrote:
>>> >
>>> > Hi,
>>> > Some answers:
>>> > 1) The .tii files in the Lucene index. When you sort, all distinct
>>> > values for the field(s) used for sorting are loaded into memory.
>>> > Similarly for facet fields. Plus the Solr caches.
>>> > 2) ramBufferSizeMB dictates, more or less, how much memory Lucene/Solr
>>> > will consume during indexing. There is no need to commit every 50K
>>> > docs unless you want to trigger snapshot creation.
>>> > 3) See 1) above.
>>> >
>>> > 1.5 billion docs per instance where each doc is cca 1KB? I doubt
>>> > that's going to fly. :)
>>> >
>>> > Otis
>>> > --
>>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >
>>> > ----- Original Message ----
>>> >> From: vivek sar <vivex...@gmail.com>
>>> >> To: solr-user@lucene.apache.org
>>> >> Sent: Wednesday, May 13, 2009 3:04:46 PM
>>> >> Subject: Solr memory requirements?
>>> >>
>>> >> Hi,
>>> >>
>>> >> I'm pretty sure this has been asked before, but I couldn't find a
>>> >> complete answer in the forum archives. Here are my questions:
>>> >>
>>> >> 1) When Solr starts up, what does it load into memory? Let's say
>>> >> I've got 4 cores, each core 50G in size. When Solr comes up, how much
>>> >> of that would be loaded into memory?
>>> >>
>>> >> 2) How much memory is required during index time? If I'm committing
>>> >> 50K records at a time (1 record = 1KB) using solrj, how much memory
>>> >> do I need to give to Solr?
>>> >>
>>> >> 3) Is there a minimum memory requirement for Solr to maintain an
>>> >> index of a certain size? Are there any benchmarks on this?
>>> >>
>>> >> Here is some of my configuration from solrconfig.xml:
>>> >>
>>> >> 1) 64
>>> >> 2) All the caches (under the query tag) are commented out
>>> >> 3) A few others:
>>> >>    a) true ==> would this require memory?
>>> >>    b) 50
>>> >>    c) 200
>>> >>    d)
>>> >>    e) false
>>> >>    f) 2
>>> >>
>>> >> The problem we are having is the following:
>>> >>
>>> >> I've given Solr 6G of RAM. As the total index size (all cores
>>> >> combined) starts growing, the Solr memory consumption goes up. With
>>> >> 800 million documents, I see Solr already taking up all the memory
>>> >> at startup. After that, commits, searches - everything becomes slow.
>>> >> We will have a distributed setup with multiple Solr instances
>>> >> (around 8) on four boxes, but our requirement is for each Solr
>>> >> instance to maintain at least around 1.5 billion documents.
>>> >>
>>> >> We are trying to see if we can somehow reduce the Solr memory
>>> >> footprint. If someone can provide a pointer on which parameters
>>> >> affect memory, and what effects they have, we can then decide
>>> >> whether we want a given parameter or not. I'm not sure if there is a
>>> >> minimum Solr requirement for it to be able to maintain large
>>> >> indexes. I've used Lucene before and that didn't require anything by
>>> >> default - it used memory only during index and search time, not
>>> >> otherwise.
>>> >>
>>> >> Any help is very much appreciated.
>>> >>
>>> >> Thanks,
>>> >> -vivek
>>> >
>>> >
>>>
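Otis's point about the .tii term index squares with the jmap histogram
posted at the top of the thread: summing the term-dictionary-related
entries accounts for a large slice of that 2G test instance. A quick
back-of-envelope (treating the String/char[] totals as mostly term text
is an assumption - other code allocates Strings too):

```python
# Byte counts taken from the `jmap -histo` output posted earlier in the
# thread (2G test instance, before any search was run).
histo_bytes = {
    "[C (char[] behind Strings)":        222_608_992,
    "java.lang.String":                  155_666_920,
    "org.apache.lucene.index.TermInfo":  131_373_640,
    "org.apache.lucene.index.Term":      106_694_336,
    "Term[] + TermInfo[] index arrays":   26_273_936 * 2,
}

total = sum(histo_bytes.values())
for name, b in histo_bytes.items():
    print(f"{name:36s} {b / 2**20:7.1f} MB")
print(f"{'total':36s} {total / 2**20:7.1f} MB")  # ~638 MB of a 2 GB heap
```

If that footprint grows roughly with term count, running out of an 8G
heap at 800 million documents across 4 cores is less surprising. In
Lucene of that era the knob for loading fewer index terms into memory
was, if I remember right, the termInfosIndexDivisor on IndexReader -
worth checking whether the Solr version in use exposes it.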