It also briefly requires more memory than just that - it allocates an array of size maxDoc+1 to hold the unique terms - and then sizes it down.
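The over-allocate-then-trim pattern described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Lucene uninversion code: the point is only that the unique-term count isn't known up front, so the lookup array starts at maxDoc+1 entries (slot 0 reserved for "no value") and is copied down afterwards.

```java
import java.util.Arrays;

public class TrimSketch {
    // Hypothetical sketch: collect unique values from one value per doc.
    static String[] uninvert(String[] docValues) {
        int maxDoc = docValues.length;
        // Worst case: every document has a distinct term, plus slot 0.
        String[] lookup = new String[maxDoc + 1];
        int numTerms = 1; // index 0 = "no value"
        outer:
        for (String v : docValues) {
            for (int i = 1; i < numTerms; i++) {
                if (lookup[i].equals(v)) continue outer; // already seen
            }
            lookup[numTerms++] = v;
        }
        // Size down to the real unique-term count once it is known.
        return Arrays.copyOf(lookup, numTerms);
    }

    public static void main(String[] args) {
        String[] terms = uninvert(new String[] {"CA", "US", "CA", "DE"});
        System.out.println(terms.length - 1); // unique terms, excluding slot 0
    }
}
```

The transient cost is the maxDoc+1 array itself, which is exactly why a reliable getUniqueTermCount would help.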
Possibly we can use the getUniqueTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the UnsupportedOperationException in that method for things like MultiReader and just do the work to get the right number (currently there is a comment saying the user should do that work if necessary, which makes the call unreliable for this).

Fuad Efendi wrote:
> Thank you very much Mike,
>
> I found it:
> org.apache.solr.request.SimpleFacets
> ...
>     // TODO: future logic could use filters instead of the fieldcache if
>     // the number of terms in the field is small enough.
>     counts = getFieldCacheCounts(searcher, base, field, offset, limit,
>         mincount, missing, sort, prefix);
> ...
>     FieldCache.StringIndex si =
>         FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
>     final String[] terms = si.lookup;
>     final int[] termNum = si.order;
> ...
>
> So that 64-bit requires more memory :)
>
> Mike, am I right here?
> [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> (64-bit JVM)
> 1.2Gb RAM for this...
>
> Or, maybe I am wrong:
>
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc.
>
> [8 bytes (64bit)] x [number of documents (100mlns)]?
> 0.8Gb
>
> Kind of Map between String and DocSet, saving 4 bytes... "Key" is String,
> and "Value" is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
> JVM)? I always thought it is (int) documentId...
>
> Am I right?
>
> Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
>
>>> Note that for your use case, this is exceptionally wasteful.
>
> This is probably a very common case... I think it should be confirmed by
> Lucene developers too... FieldCache is warmed anyway, even when we don't
> use SOLR...
>
> -Fuad
>
>> -----Original Message-----
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: November-02-09 6:00 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Lucene FieldCache memory requirements
>>
>> OK I think someone who knows how Solr uses the FieldCache for this
>> type of field will have to pipe up.
>>
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc. (Each also consumes
>> negligible (for your case) memory to hold the actual string values.)
>>
>> Note that for your use case, this is exceptionally wasteful. If
>> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
>> then it'd take much fewer bits to reference the values, since you have
>> only 10 unique string values.
>>
>> Mike
>>
>> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <f...@efendi.ca> wrote:
>>
>>> I am not using the Lucene API directly; I am using SOLR, which uses
>>> Lucene FieldCache for faceting on non-tokenized fields...
>>> I think this cache will be lazily loaded, until a user executes a
>>> sorted (by this field) SOLR query for all documents *:* - in this case
>>> it will be fully populated...
>>>
>>>> Subject: Re: Lucene FieldCache memory requirements
>>>>
>>>> Which FieldCache API are you using? getStrings? Or getStringIndex
>>>> (which is used, under the hood, if you sort by this field).
>>>>
>>>> Mike
>>>>
>>>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <f...@efendi.ca> wrote:
>>>>
>>>>> Any thoughts regarding the subject? I hope FieldCache doesn't use
>>>>> more than 6 bytes per document-field instance... I am too lazy to
>>>>> research the Lucene source code, I hope someone can provide an exact
>>>>> answer...
>>>>> Thanks
>>>>>
>>>>>> Subject: Lucene FieldCache memory requirements
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
>>>>>> million docs with a non-tokenized field "country" (10 different
>>>>>> countries); I expect it requires an array of ("int", "long"), size of
>>>>>> array 100,000,000, without any impact of "country" field length;
>>>>>>
>>>>>> it requires 600,000,000 bytes: "int" is a pointer to the document
>>>>>> (Lucene document ID), and "long" is a pointer to the String value...
>>>>>>
>>>>>> Am I right, is it 600Mb just for this "country" (indexed,
>>>>>> non-tokenized, non-boolean) field and 100 million docs? I need to
>>>>>> calculate the exact minimum RAM requirements...
>>>>>>
>>>>>> I believe it shouldn't depend on cardinality (distribution) of the
>>>>>> field...
>>>>>>
>>>>>> Thanks,
>>>>>> Fuad

--
- Mark
http://www.lucidimagination.com
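The memory arithmetic discussed in the thread can be checked with a short sketch. The 100-million-doc and 10-country figures come from the messages above; the key point is that StringIndex stores an int[maxDoc] ordinal array (si.order, 4 bytes per doc regardless of pointer width) plus a small String[numTerms] lookup table, and the bit-packed estimate assumes the LUCENE-1990 idea of ceil(log2(numTerms+1)) bits per document:

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 100_000_000L; // 100M docs, as in the thread
        int numTerms = 10;          // 10 distinct country values

        // si.order: one int ordinal per document.
        long ordBytes = maxDoc * Integer.BYTES;

        // Bit-packed alternative: enough bits for numTerms + 1 ordinals
        // (slot 0 is reserved for documents with no value).
        int bitsPerDoc = 32 - Integer.numberOfLeadingZeros(numTerms + 1);
        long packedBytes = (maxDoc * bitsPerDoc + 7) / 8;

        System.out.println("StringIndex ords: " + ordBytes / (1024 * 1024) + " MB");
        System.out.println("Packed (" + bitsPerDoc + " bits/doc): "
                + packedBytes / (1024 * 1024) + " MB");
    }
}
```

So the ordinal array alone is about 400MB for 100M docs, independent of 32- vs 64-bit JVM and of the country-string lengths; with only 10 unique values, 4-bit packing would shrink that to roughly 50MB, which is what makes the current layout "exceptionally wasteful" for this case.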