Sorry Mike, Mark, I am confused again...
Yes, obviously I need some more memory for processing ("while the FieldCache
is being loaded"), but that was not the main subject...
With StringIndexCache, I have 10 arrays (the cardinality of this field is 10),
each storing (int) Lucene document IDs.
> Except: as Mark said, you'll also need transient memory = pointer (4
> or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded.
Ok, I see it:
final int[] retArray = new int[reader.maxDoc()];
String[] mterms = new String[reader.maxDoc()+1];
I can't trace it right now (limited time), but I think mterms is a local
variable and will size down to 0 after loading...
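If it helps, here is a simplified standalone sketch (not the actual
FieldCacheImpl source, just the shape of its memory behavior) of what the
loader does with mterms: it is over-allocated to maxDoc+1 up front, then
trimmed to the unique-term count once loading finishes, so only the small
trimmed array stays reachable and the big one becomes garbage:

```java
// Simplified sketch of the StringIndex term-loading step; the method
// name and the String[] input stand in for the real TermEnum iteration.
public class StringIndexSketch {
    static String[] loadTerms(String[] sortedUniqueTerms, int maxDoc) {
        // Transient over-allocation: worst case, every doc has its own term.
        String[] mterms = new String[maxDoc + 1];
        int t = 0;
        mterms[t++] = null; // ord 0 is reserved for "no term"
        for (String term : sortedUniqueTerms) {
            mterms[t++] = term;
        }
        // Trim to the actual unique-term count; the maxDoc+1 array is
        // unreachable (collectable) as soon as this method returns.
        String[] terms = new String[t];
        System.arraycopy(mterms, 0, terms, 0, t);
        return terms;
    }

    public static void main(String[] args) {
        String[] terms = loadTerms(new String[] {"a", "b", "c"}, 1000);
        System.out.println(terms.length); // 4 = null slot + 3 unique terms
    }
}
```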
So the correct formula is... a weird one... if you don't want an unexpected
OOM or an overloaded GC (WeakHashMaps...):
[some heap] + [Non-Tokenized_Field_Count] x [maxdoc] x [4 bytes + 8 bytes]
(for 64-bit)
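To make the formula concrete, here is a tiny sketch (field count and maxdoc
are made-up example numbers, not from any real index): 4 bytes per doc for
the steady-state int ords, plus up to 8 bytes per doc of transient pointers
on a 64-bit JVM while loading.

```java
// Hypothetical helper illustrating the per-field estimate above.
public class FieldCacheEstimate {
    // 4 bytes (int ord) + 8 bytes (transient 64-bit pointer while loading)
    static final long BYTES_PER_DOC_64BIT = 4 + 8;

    static long estimateBytes(int nonTokenizedFieldCount, int maxDoc) {
        return (long) nonTokenizedFieldCount * maxDoc * BYTES_PER_DOC_64BIT;
    }

    public static void main(String[] args) {
        // e.g. 3 cached non-tokenized fields over a 100M-doc index:
        long bytes = estimateBytes(3, 100_000_000);
        System.out.println(bytes + " bytes"); // 3,600,000,000 ~ 3.4 GB peak
    }
}
```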
-Fuad
> -----Original Message-----
> From: Michael McCandless [mailto:[email protected]]
> Sent: November-03-09 5:00 AM
> To: [email protected]
> Subject: Re: Lucene FieldCache memory requirements
>
> On Mon, Nov 2, 2009 at 9:27 PM, Fuad Efendi <[email protected]> wrote:
> > I believe this is correct estimate:
> >
> >> C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
> >>
> >> same as
> >> [String1_Document_Count + ... + String10_Document_Count + ...]
> >> x [4 bytes per DocumentID]
>
> That's right.
>
> Except: as Mark said, you'll also need transient memory = pointer (4
> or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded. After
> it's done being loaded, this sizes down to the number of unique terms.
>
> But, if Lucene did the basic int packing, which really we should do,
> since you only have 10 unique values, with a naive 4 bits per doc
> encoding, you'd only need 1/8th the memory usage. We could do a bit
> better by encoding more than one document at a time...
>
> Mike
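Mike's packed-ints suggestion in the quote above can be sketched like this
(a hypothetical standalone sketch, not Lucene code): with only 10 unique
values, each doc's ord fits in 4 bits, so two docs share one byte and the
steady-state cost drops to 1/8th of the 4-bytes-per-doc int[].

```java
// Naive 4-bits-per-doc packing of field ords (values 0..15).
public class PackedOrds {
    static byte[] pack(int[] ords) {
        byte[] packed = new byte[(ords.length + 1) / 2];
        for (int doc = 0; doc < ords.length; doc++) {
            int shift = (doc & 1) == 0 ? 0 : 4; // low or high nibble
            packed[doc >> 1] |= (ords[doc] & 0xF) << shift;
        }
        return packed;
    }

    static int get(byte[] packed, int doc) {
        int shift = (doc & 1) == 0 ? 0 : 4;
        return (packed[doc >> 1] >> shift) & 0xF;
    }

    public static void main(String[] args) {
        int[] ords = {0, 9, 3, 7, 1};
        byte[] packed = pack(ords); // 5 ords in 3 bytes instead of 20
        for (int doc = 0; doc < ords.length; doc++) {
            System.out.println(get(packed, doc));
        }
    }
}
```

Encoding more than one doc per operation (as Mike hints) could shave this
further when the cardinality is not a power of two, e.g. three base-10 ords
per 10-bit group.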