Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM. Whether the array holds maxdoc or maxdoc + 1 entries makes no difference for such an estimate; the real difference is between 0.4Gb and 1.2Gb...
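To put numbers on the layouts being debated in this thread, here is a rough sketch of the arithmetic. It assumes 100,000,000 documents (the figure from the original question), 8-byte references on a 64-bit JVM without compressed pointers, and 1 Gb counted as 10^9 bytes. The class and method names are made up for illustration and are not part of Lucene or Solr, and the bit-packed figure is an idealized lower bound in the spirit of LUCENE-1990, not a measurement.

```java
// Back-of-the-envelope FieldCache sizing for the layouts discussed in this thread.
// Assumptions (not taken from the Lucene source): 100,000,000 documents,
// 8-byte object references, 4-byte ints, and no per-array or per-String overhead.
final class FieldCacheEstimate {

    /** Bytes needed for one fixed-width array entry per document. */
    static long bytesFor(long maxDoc, int bytesPerDoc) {
        return maxDoc * bytesPerDoc;
    }

    /** Smallest number of bits that can distinguish the given number of values. */
    static int bitsNeeded(int uniqueValues) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(uniqueValues - 1));
    }

    /** Bytes needed if each document stored only a bit-packed ordinal (idealized). */
    static long packedBytes(long maxDoc, int uniqueValues) {
        return (maxDoc * bitsNeeded(uniqueValues) + 7) / 8;
    }

    public static void main(String[] args) {
        long maxDoc = 100_000_000L;
        System.out.println("int per doc (4 bytes):        " + bytesFor(maxDoc, 4));
        System.out.println("reference per doc (8 bytes):  " + bytesFor(maxDoc, 8));
        System.out.println("reference + int (12 bytes):   " + bytesFor(maxDoc, 12));
        System.out.println("bit-packed, 10 unique values: " + packedBytes(maxDoc, 10));
    }
}
```

Under these assumptions the three fixed-width layouts come to 0.4Gb, 0.8Gb and 1.2Gb, while ten distinct values need only 4 bits per document if the ordinals are packed.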
So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]
B. [maxdoc] x [8 bytes ~ pointer to Document object]
C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count] x [4 bytes ~ DocumentID]
D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]

Please confirm that it is a pointer to an Object and not a Lucene Document ID... I hope it is an (int) Document ID...

> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: November-02-09 6:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Lucene FieldCache memory requirements
>
> It also briefly requires more memory than just that - it allocates an
> array the size of maxdoc+1 to hold the unique terms - and then sizes down.
>
> Possibly we can use the getUniqueTermCount method in the flexible
> indexing branch to get rid of that - which is why I was thinking it
> might be a good idea to drop the unsupported exception in that method
> for things like multi reader and just do the work to get the right
> number (currently there is a comment that the user should do that work
> if necessary, making the call unreliable for this).
>
> Fuad Efendi wrote:
> > Thank you very much Mike,
> >
> > I found it:
> > org.apache.solr.request.SimpleFacets
> > ...
> > // TODO: future logic could use filters instead of the fieldcache if
> > // the number of terms in the field is small enough.
> > counts = getFieldCacheCounts(searcher, base, field, offset,limit,
> > mincount, missing, sort, prefix);
> > ...
> > FieldCache.StringIndex si =
> > FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> > final String[] terms = si.lookup;
> > final int[] termNum = si.order;
> > ...
> >
> >
> > So that 64-bit requires more memory :)
> >
> >
> > Mike, am I right here?
> > [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> > (64-bit JVM)
> > 1.2Gb RAM for this...
> >
> > Or, maybe I am wrong:
> >
> >> For Lucene directly, simple strings would consume a pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc.
> >>
> >
> > [8 bytes (64bit)] x [number of documents (100mlns)]?
> > 0.8Gb
> >
> > Kind of Map between String and DocSet, saving 4 bytes... "Key" is String,
> > and "Value" is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
> > JVM)? I always thought it is (int) documentId...
> >
> > Am I right?
> >
> >
> > Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
> >
> >
> >>> Note that for your use case, this is exceptionally wasteful.
> >>>
> > This is probably a very common case... I think it should be confirmed by
> > Lucene developers too... FieldCache is warmed anyway, even when we don't use
> > SOLR...
> >
> >
> > -Fuad
> >
> >
> >> -----Original Message-----
> >> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> >> Sent: November-02-09 6:00 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Lucene FieldCache memory requirements
> >>
> >> OK I think someone who knows how Solr uses the fieldCache for this
> >> type of field will have to pipe up.
> >>
> >> For Lucene directly, simple strings would consume a pointer (4 or 8
> >> bytes depending on whether your JRE is 64bit) per doc, and the string
> >> index would consume an int (4 bytes) per doc. (Each also consumes
> >> negligible (for your case) memory to hold the actual string values).
> >>
> >> Note that for your use case, this is exceptionally wasteful. If
> >> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> >> then it'd take much fewer bits to reference the values, since you have
> >> only 10 unique string values.
> >>
> >> Mike
> >>
> >> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <f...@efendi.ca> wrote:
> >>
> >>> I am not using Lucene API directly; I am using SOLR which uses Lucene
> >>> FieldCache for faceting on non-tokenized fields...
> >>> I think this cache will be lazily loaded, until user executes sorted (by
> >>> this field) SOLR query for all documents *:* - in this case it will be
> >>> fully populated...
> >>>
> >>>> Subject: Re: Lucene FieldCache memory requirements
> >>>>
> >>>> Which FieldCache API are you using? getStrings? or getStringIndex
> >>>> (which is used, under the hood, if you sort by this field).
> >>>>
> >>>> Mike
> >>>>
> >>>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <f...@efendi.ca> wrote:
> >>>>
> >>>>> Any thoughts regarding the subject? I hope FieldCache doesn't use more than
> >>>>> 6 bytes per document-field instance... I am too lazy to research Lucene
> >>>>> source code, I hope someone can provide exact answer... Thanks
> >>>>>
> >>>>>> Subject: Lucene FieldCache memory requirements
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> >>>>>> millions docs with non-tokenized field "country" (10 different countries); I
> >>>>>> expect it requires array of ("int", "long"), size of array 100,000,000,
> >>>>>> without any impact of "country" field length;
> >>>>>>
> >>>>>> it requires 600,000,000 bytes: "int" is pointer to document (Lucene document
> >>>>>> ID), and "long" is pointer to String value...
> >>>>>>
> >>>>>> Am I right, is it 600Mb just for this "country" (indexed, non-tokenized,
> >>>>>> non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM
> >>>>>> requirements...
> >>>>>>
> >>>>>> I believe it shouldn't depend on cardinality (distribution) of field...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Fuad
> >>>>>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com

- Fuad
http://www.linkedin.com/in/liferay