Unfortunately, the terms index (before 4.0) is not RAM efficient -- I
wrote about this here:

    http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html

Every indexed term that's loaded into RAM creates 4 objects (TermInfo,
Term, String, char[]), as you can see in your profiler output, and
each object adds its own fields plus the per-object JVM header and GC
overhead.
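As a rough back-of-envelope check, the per-object sizes below are just #bytes / #instances taken from the jmap histogram in your mail (not authoritative JVM numbers, but a good sanity check):

```python
# Average per-term RAM cost, computed from the jmap histogram below.
# TermInfo/Term counts are per shard; String/char[] are totals across
# all three shards.
term_info = 1_112_435_480 / 27_810_887   # ~40 bytes per TermInfo
term      =   889_948_608 / 27_810_894   # ~32 bytes per Term
string    = 3_299_946_920 / 82_498_673   # ~40 bytes per String
char_arr  = 4_273_137_904 / 82_496_803   # ~52 bytes per char[]
print(round(term_info + term + string + char_arr))  # ~164 bytes per loaded term
```

Across the ~82M loaded terms in your three shards, that's roughly 13-14 GB, which lines up with the jump from 1GB to 15GB you describe.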

Do you use a terms index divisor?  Setting it to 2 would halve the
RAM required but, on average, double the seek time to locate a given
term.  Depending on your queries, though, that seek time may still be
a negligible part of overall query time, i.e. the tradeoff could be
well worth it.
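To make the tradeoff concrete, here's a sketch; the ~28M figure is the per-shard TermInfo count from your jmap output, and 128 is the 3.x default terms index interval:

```python
# Effect of the terms index divisor: a divisor of D keeps only every
# D-th entry of the on-disk terms index (tii) in RAM, while a seek
# must then scan up to interval * D terms after the binary search.
index_interval = 128                      # 3.x default
loaded_per_shard = 28_000_000             # ~TermInfo count per shard (from jmap)
for divisor in (1, 2, 4):
    in_ram = loaded_per_shard // divisor
    worst_scan = index_interval * divisor
    print(f"divisor={divisor}: ~{in_ram:,} entries in RAM, scan <= {worst_scan} terms")
```

So divisor=2 drops the terms index RAM roughly in half at the cost of scanning up to 256 terms per seek.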

In 4.0, w/ flex indexing, the RAM efficiency is much better -- we use
large parallel arrays instead of separate objects, and, we hold much
less in RAM.  Simply upgrading to 4.0 and re-indexing will show this
gain; however, we have reduced the terms index interval from 128 to
32, so if you want a "fair" comparison you should set this back to 128
for your indexing (or, set a terms index divisor of 4 when opening
your readers).
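The arithmetic behind that "fair comparison" is just (a quick check; 128 and 32 are the 3.x and 4.0 defaults mentioned above):

```python
# A 4.0 index written with interval 32, opened with a terms index
# divisor of 4, keeps every 32 * 4 = 128th term in RAM -- the same
# density as the 3.x defaults.
interval_3x = 128
interval_40 = 32
divisor = 4
assert interval_40 * divisor == interval_3x
print("effective interval:", interval_40 * divisor)
```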

Note that those [C instances aren't UTF8 character arrays -- they are
UTF16 (java char[]), meaning they always consume 2 bytes per
character.  In 4.0, however, terms are held as UTF8 byte arrays, so,
depending on your character distribution, this can also be a win (or,
in some cases, a loss, which is why we are considering making the more
efficient BOCU1 encoding the default in LUCENE-1799).
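A quick illustration of why the UTF8 switch can go either way (plain Python here just to show the byte counts; in Lucene these are char[] vs byte[]):

```python
# JVM char[] is UTF-16: 2 bytes per char, always.  UTF-8 is 1 byte
# for ASCII but 3 bytes for e.g. CJK, so the 4.0 byte[] terms can be
# smaller or larger depending on the character distribution.
for s in ("search", "héllo", "日本語"):
    utf16_bytes = len(s) * 2              # ignoring array header/length
    utf8_bytes = len(s.encode("utf-8"))
    print(f"{s}: UTF-16={utf16_bytes}, UTF-8={utf8_bytes}")
```

ASCII-heavy term dictionaries win big; CJK-heavy ones can actually grow, which is the motivation for BOCU1.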

I'd be really curious to test the RAM reduction in 4.0 on your terms
dict/index -- is there any way I could get a copy of just the tii/tis
files in your index?  Your index is a great test for Lucene!

Mike

On Fri, Sep 10, 2010 at 6:46 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> When we run the first query after starting up Solr, memory use goes up from 
> about 1GB to 15GB and never goes below that level.  In debugging a recent OOM 
> problem I ran jmap with the output appended below.  Not surprisingly, given 
> the size of our indexes, it looks like the TermInfo and Term data structures 
> which are the in-memory representation of the tii file are taking up most of 
> the memory. This is running Solr under Tomcat with 16GB allocated to the jvm 
> and 3 shards each with a tii file of about 600MB.
>
> Total index size is about 400GB for each shard (we are indexing about 600,000 
> full-text books in each shard).
>
> In interpreting the jmap output, can we assume that the listings for utf8 
> character arrays ("[C"), java.lang.String, long arrays ("[J"), and int 
> arrays ("[I") are all part of the data structures involved in representing the 
> tii file in memory?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
> (jmap output, commas in numbers added)
>
> num     #instances         #bytes  class name
> ----------------------------------------------
>   1:      82,496,803     4,273,137,904  [C
>   2:      82,498,673     3,299,946,920  java.lang.String
>   3:      27,810,887     1,112,435,480  org.apache.lucene.index.TermInfo
>   4:      27,533,080     1,101,323,200  org.apache.lucene.index.TermInfo
>   5:      27,115,577     1,084,623,080  org.apache.lucene.index.TermInfo
>   6:      27,810,894      889,948,608  org.apache.lucene.index.Term
>   7:      27,533,088      881,058,816  org.apache.lucene.index.Term
>   8:      27,115,589      867,698,848  org.apache.lucene.index.Term
>   9:           148      659,685,520  [J
>  10:             2      222,487,072  [Lorg.apache.lucene.index.Term;
>  11:             2      222,487,072  [Lorg.apache.lucene.index.TermInfo;
>  12:             2      220,264,600  [Lorg.apache.lucene.index.Term;
>  13:             2      220,264,600  [Lorg.apache.lucene.index.TermInfo;
>  14:             2      216,924,560  [Lorg.apache.lucene.index.Term;
>  15:             2      216,924,560  [Lorg.apache.lucene.index.TermInfo;
>  16:        737,060      155,114,960  [I
>  17:        627,793       35,156,408  java.lang.ref.SoftReference
>