RE: Lucene FieldCache memory requirements

Fuad Efendi Mon, 02 Nov 2009 16:38:09 -0800

This is a simple field (10 different values: Canada, USA, UK, ...) on a 64-bit
JVM. For this estimate there is no difference between maxdoc and maxdoc + 1;
the difference that matters is between 0.4 GB and 1.2 GB.



So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]

B. [maxdoc] x [8 bytes ~ pointer to Document object]

C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] 
- same as [String1_Document_Count + ... + String10_Document_Count] x
[4 bytes ~ DocumentID]

D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]
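
For anyone who wants to check the arithmetic, here is a small sketch of the
estimates behind options A-D. It assumes maxdoc = 100,000,000 (from the
original question) and decimal gigabytes; the class and method names are made
up for illustration and nothing here comes from the Lucene source. The last
figure is a purely hypothetical bit-packed layout for 10 distinct values, in
the spirit of what Mike mentions for LUCENE-1990, not a description of what
that issue implements.

```java
// Back-of-envelope estimates only; nothing here is taken from Lucene's source.
class FieldCacheEstimate {
    // Assumption: 100 million documents, as in the original question.
    static final long MAX_DOC = 100_000_000L;

    // Bytes for one fixed-size entry per document.
    static long bytes(long maxDoc, int bytesPerDoc) {
        return maxDoc * bytesPerDoc;
    }

    // Bits needed to tell apart `distinctValues` different values (at least 1 bit).
    static int bitsPerValue(int distinctValues) {
        if (distinctValues <= 2) {
            return 1;
        }
        return 32 - Integer.numberOfLeadingZeros(distinctValues - 1);
    }

    // Hypothetical bit-packed layout: bitsPerValue bits per document, rounded up to whole bytes.
    static long packedBytes(long maxDoc, int distinctValues) {
        return (maxDoc * bitsPerValue(distinctValues) + 7) / 8;
    }

    // Decimal gigabytes (1 GB = 10^9 bytes), matching the figures used in this thread.
    static double gb(long bytes) {
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("A, B (8-byte pointer per doc):  %.2f GB%n", gb(bytes(MAX_DOC, 8)));
        System.out.printf("C    (4-byte int per doc):      %.2f GB%n", gb(bytes(MAX_DOC, 4)));
        System.out.printf("D    (4 + 8 bytes per doc):     %.2f GB%n", gb(bytes(MAX_DOC, 12)));
        System.out.printf("bit-packed, 10 distinct values: %.2f GB%n", gb(packedBytes(MAX_DOC, 10)));
    }
}
```

Under these assumptions A and B come to 0.8 GB, C to 0.4 GB and D to 1.2 GB,
which is the 0.4 GB to 1.2 GB spread mentioned above.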


Please confirm whether it really is a pointer to an object and not a Lucene
document ID... I hope it is an (int) document ID...
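
To make the question concrete, here is a toy model of the two arrays that the
quoted SimpleFacets code reads from FieldCache.StringIndex: order (one int per
document) and lookup (one String per unique term). It is a simplified stand-in,
not Lucene's implementation: ToyStringIndex and build are made-up names,
ordinals are assigned in first-seen order rather than term order, and reserving
slot 0 for documents without a value is an assumption of this sketch. It also
mimics the temporary maxdoc + 1 array that Mark describes, which is then sized
down.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy model only: mirrors the shape of StringIndex (order + lookup), not Lucene's code.
class ToyStringIndex {
    final int[] order;      // one int per document: index into lookup
    final String[] lookup;  // one entry per unique term; slot 0 means "no value" (assumption)

    ToyStringIndex(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }

    // values[doc] is the field value of that document, or null if it has none.
    static ToyStringIndex build(String[] values) {
        int maxDoc = values.length;
        int[] order = new int[maxDoc];
        // Temporary worst-case array (every document has a distinct term); shrunk below.
        String[] terms = new String[maxDoc + 1];
        int termCount = 1; // slot 0 stays null for documents without a value
        Map<String, Integer> ordOf = new HashMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            String value = values[doc];
            if (value == null) {
                continue; // order[doc] stays 0
            }
            Integer ord = ordOf.get(value);
            if (ord == null) {
                ord = termCount;
                terms[termCount++] = value;
                ordOf.put(value, ord);
            }
            order[doc] = ord;
        }
        // Size the term array down to the number of slots actually used.
        return new ToyStringIndex(order, Arrays.copyOf(terms, termCount));
    }

    public static void main(String[] args) {
        String[] docs = {"Canada", "USA", "Canada", null, "UK", "USA"};
        ToyStringIndex si = build(docs);
        System.out.println("order  = " + Arrays.toString(si.order));
        System.out.println("lookup = " + Arrays.toString(si.lookup));
    }
}
```

If the real layout matches this shape, the per-document cost is the 4-byte int
in order (option C), and the String references are paid once per unique term,
not once per document.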





> -----Original Message-----
> From: Mark Miller [mailto:[email protected]]
> Sent: November-02-09 6:52 PM
> To: [email protected]
> Subject: Re: Lucene FieldCache memory requirements
> 
> It also briefly requires more memory than just that - it allocates an
> array the size of maxdoc+1 to hold the unique terms - and then sizes down.
> 
> Possibly we can use the getUniqueTermCount method in the flexible
> indexing branch to get rid of that - which is why I was thinking it
> might be a good idea to drop the unsupported exception in that method
> for things like multi reader and just do the work to get the right
> number (currently there is a comment that the user should do that work
> if necessary, making the call unreliable for this).
> 
> Fuad Efendi wrote:
> > Thank you very much Mike,
> >
> > I found it:
> > org.apache.solr.request.SimpleFacets
> > ...
> >         // TODO: future logic could use filters instead of the fieldcache if
> >         // the number of terms in the field is small enough.
> >         counts = getFieldCacheCounts(searcher, base, field, offset, limit,
> >                 mincount, missing, sort, prefix);
> > ...
> >     FieldCache.StringIndex si =
> >         FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> >     final String[] terms = si.lookup;
> >     final int[] termNum = si.order;
> > ...
> >
> >
> > So a 64-bit JVM requires more memory :)
> >
> >
> > Mike, am I right here?
> > [(8-byte pointer) + (4-byte DocID)] x [number of documents (100 million)]
> > (64-bit JVM)
> > 1.2 GB of RAM for this...
> >
> > Or maybe I am wrong:
> >
> >> For Lucene directly, simple strings would consume a pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc.
> >>
> >
> > [8 bytes (64-bit)] x [number of documents (100 million)]?
> > 0.8 GB
> >
> > It is a kind of Map between String and DocSet, saving 4 bytes... The "Key"
> > is a String, and the "Value" is an array of 64-bit pointers to Document.
> > Why 64-bit (for a 64-bit JVM)? I always thought it was an (int)
> > documentId...
> >
> > Am I right?
> >
> >
> > Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
> >
> >
> >>> Note that for your use case, this is exceptionally wasteful.
> >>>
> > This is probably a very common case... I think it should be confirmed by
> > Lucene developers too... FieldCache is warmed anyway, even when we don't
> > use SOLR...
> >
> >
> > -Fuad
> >
> >
> >
> >
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: Michael McCandless [mailto:[email protected]]
> >> Sent: November-02-09 6:00 PM
> >> To: [email protected]
> >> Subject: Re: Lucene FieldCache memory requirements
> >>
> >> OK I think someone who knows how Solr uses the fieldCache for this
> >> type of field will have to pipe up.
> >>
> >> For Lucene directly, simple strings would consume a pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc.  (Each also consumes
> >> negligible (for your case) memory to hold the actual string values).
> >>
> >> Note that for your use case, this is exceptionally wasteful.  If
> >> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> >> then it'd take far fewer bits to reference the values, since you have
> >> only 10 unique string values.
> >>
> >> Mike
> >>
> >> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <[email protected]> wrote:
> >>
> >>> I am not using the Lucene API directly; I am using SOLR, which uses Lucene
> >>> FieldCache for faceting on non-tokenized fields...
> >>> I think this cache will be lazily loaded, until a user executes a sorted
> >>> (by this field) SOLR query for all documents *:* - in this case it will
> >>> be fully populated...
> >>>
> >>>
> >>>
> >>>> Subject: Re: Lucene FieldCache memory requirements
> >>>>
> >>>> Which FieldCache API are you using?  getStrings?  or getStringIndex
> >>>> (which is used, under the hood, if you sort by this field).
> >>>>
> >>>> Mike
> >>>>
> >>>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <[email protected]> wrote:
> >>>>
> >>>>> Any thoughts regarding the subject? I hope FieldCache doesn't use more
> >>>>> than 6 bytes per document-field instance... I am too lazy to research
> >>>>> the Lucene source code; I hope someone can provide an exact answer...
> >>>>> Thanks
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Subject: Lucene FieldCache memory requirements
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>
> >>>>>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> >>>>>> million docs with non-tokenized field "country" (10 different
> >>>>>> countries); I expect it requires array of ("int", "long"), size of
> >>>>>> array 100,000,000, without any impact of "country" field length;
> >>>>>>
> >>>>>> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> >>>>>> document ID), and "long" is pointer to String value...
> >>>>>>
> >>>>>> Am I right, is it 600Mb just for this "country" (indexed,
> >>>>>> non-tokenized, non-boolean) field and 100 million docs? I need to
> >>>>>> calculate exact minimum RAM requirements...
> >>>>>>
> >>>>>> I believe it shouldn't depend on cardinality (distribution) of field...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Fuad
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
> >
> 
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
- Fuad

http://www.linkedin.com/in/liferay
