Unfortunately, the terms index (before 4.0) is not RAM efficient -- I wrote about this here:
http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html

Every indexed term that's loaded into RAM creates 4 objects (TermInfo,
Term, String, char[]), as you can see in your profiler output. Each of
those objects also carries its own fields, the object header required
by the JRE, GC cost, etc.

Do you use a terms index divisor? Setting it to 2 would halve the
amount of RAM required, but double (on average) the seek time to
locate a given term. Depending on your queries, that seek time may
still be a negligible part of overall query time, i.e. the tradeoff
could be well worth it.

In 4.0, with flex indexing, the RAM efficiency is much better -- we
use large parallel arrays instead of separate objects, and we hold
much less in RAM. Simply upgrading to 4.0 and re-indexing will show
this gain; however, we have reduced the default terms index interval
from 128 to 32, so if you want a "fair" comparison you should set it
back to 128 for your indexing (or set a terms index divisor of 4 when
opening your readers).

Note that [C ("char[]") aren't UTF8 character arrays -- they are
UTF16, meaning they always consume 2 bytes per character. In 4.0 they
are in fact UTF8 byte arrays, so, depending on your character
distribution, this can also be a win (or, in some cases, a loss, which
is why we are considering the more efficient BOCU1 encoding as the
default in LUCENE-1799).

I'd be really curious to test the RAM reduction in 4.0 on your terms
dict/index -- is there any way I could get a copy of just the tii/tis
files in your index? Your index is a great test for Lucene!

Mike

On Fri, Sep 10, 2010 at 6:46 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> When we run the first query after starting up Solr, memory use goes up
> from about 1GB to 15GB and never goes below that level. In debugging a
> recent OOM problem I ran jmap with the output appended below.
> Not surprisingly, given the size of our indexes, it looks like the
> TermInfo and Term data structures, which are the in-memory
> representation of the tii file, are taking up most of the memory.
> This is running Solr under Tomcat with 16GB allocated to the JVM and
> 3 shards, each with a tii file of about 600MB.
>
> Total index size is about 400GB for each shard (we are indexing about
> 600,000 full-text books in each shard).
>
> In interpreting the jmap output, can we assume that the listings for
> utf8 character arrays ("[C"), java.lang.String, long int arrays
> ("[J"), and int arrays ("[I") are all part of the data structures
> involved in representing the tii file in memory?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
> (jmap output, commas in numbers added)
>
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:    82,496,803  4,273,137,904  [C
>    2:    82,498,673  3,299,946,920  java.lang.String
>    3:    27,810,887  1,112,435,480  org.apache.lucene.index.TermInfo
>    4:    27,533,080  1,101,323,200  org.apache.lucene.index.TermInfo
>    5:    27,115,577  1,084,623,080  org.apache.lucene.index.TermInfo
>    6:    27,810,894    889,948,608  org.apache.lucene.index.Term
>    7:    27,533,088    881,058,816  org.apache.lucene.index.Term
>    8:    27,115,589    867,698,848  org.apache.lucene.index.Term
>    9:           148    659,685,520  [J
>   10:             2    222,487,072  [Lorg.apache.lucene.index.Term;
>   11:             2    222,487,072  [Lorg.apache.lucene.index.TermInfo;
>   12:             2    220,264,600  [Lorg.apache.lucene.index.Term;
>   13:             2    220,264,600  [Lorg.apache.lucene.index.TermInfo;
>   14:             2    216,924,560  [Lorg.apache.lucene.index.Term;
>   15:             2    216,924,560  [Lorg.apache.lucene.index.TermInfo;
>   16:       737,060    155,114,960  [I
>   17:       627,793     35,156,408  java.lang.ref.SoftReference
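P.S. For reference, here is a minimal sketch of the two knobs
discussed above, using the 3.x-era APIs. This is an illustration, not
tested against your setup; the path is hypothetical, and you should
check the exact method signatures in the javadocs for your version:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TermsIndexKnobs {
    public static void main(String[] args) throws Exception {
        // Hypothetical index location, for illustration only.
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // Read side: a terms index divisor of 2 loads every 2nd indexed
        // term into RAM, halving the in-memory terms index at the cost of
        // (on average) doubling the time to seek to a given term.
        // Signature: open(Directory, IndexDeletionPolicy, readOnly, divisor)
        IndexReader reader = IndexReader.open(dir, null, true, 2);

        // Write side, when building a 4.0 index for a "fair" comparison:
        // raise the terms index interval back from 32 to 128, e.g. via
        // IndexWriterConfig.setTermIndexInterval(128) on 3.1+/4.0.

        reader.close();
    }
}
```

If you are configuring this through Solr rather than raw Lucene, the
divisor can also be set on the index reader factory in solrconfig.xml;
check the example config that ships with your Solr version for the
exact element name.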
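P.P.S. Working the per-term cost out from the histogram above -- a
back-of-the-envelope sketch, assuming (as a simplification I'm making,
not something the histogram proves) roughly one String and one char[]
per Term, with those two pools split evenly across the three shards:

```java
public class PerTermRam {
    public static void main(String[] args) {
        // Figures taken from the jmap histogram (largest TermInfo group).
        long terms         = 27_810_887L;     // TermInfo instances
        long termInfoBytes = 1_112_435_480L;  // bytes in those TermInfos
        long termBytes     =   889_948_608L;  // bytes in the matching Terms
        // Assumption: String and char[] usage is dominated by the terms
        // index and divides roughly evenly across the 3 shards.
        long stringBytes   = 3_299_946_920L / 3;
        long charBytes     = 4_273_137_904L / 3;

        long perTerm = (termInfoBytes + termBytes + stringBytes + charBytes)
                       / terms;
        System.out.println("~" + perTerm + " bytes of RAM per indexed term");
    }
}
```

That works out to roughly 160 bytes per loaded term; multiplied by the
~82M term entries across your three shards, it accounts for most of
the ~14GB jump you observed at the first query.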