It also briefly requires more memory than just that - it allocates an array of size maxDoc+1 to hold the unique terms - and then sizes it down.
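The over-allocate-then-trim pattern described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Lucene uninversion code: the point is only that the unique-term count isn't known up front, so the lookup array starts at maxDoc+1 entries (slot 0 reserved for "no value") and is copied down afterwards.

```java
import java.util.Arrays;

public class TrimSketch {
    // Hypothetical sketch: collect unique values from one value per doc.
    static String[] uninvert(String[] docValues) {
        int maxDoc = docValues.length;
        // Worst case: every document has a distinct term, plus slot 0.
        String[] lookup = new String[maxDoc + 1];
        int numTerms = 1; // index 0 = "no value"
        outer:
        for (String v : docValues) {
            for (int i = 1; i < numTerms; i++) {
                if (lookup[i].equals(v)) continue outer; // already seen
            }
            lookup[numTerms++] = v;
        }
        // Size down to the real unique-term count once it is known.
        return Arrays.copyOf(lookup, numTerms);
    }

    public static void main(String[] args) {
        String[] terms = uninvert(new String[] {"CA", "US", "CA", "DE"});
        System.out.println(terms.length - 1); // unique terms, excluding slot 0
    }
}
```

The transient cost is the maxDoc+1 array itself, which is exactly why a reliable getUniqueTermCount would help.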
Possibly we can use the getUniqueTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the UnsupportedOperationException in that method for things like MultiReader and just do the work to get the right number (currently there is a comment saying the user should do that work if necessary, which makes the call unreliable for this).

Fuad Efendi wrote:
> Thank you very much Mike,
>
> I found it:
> org.apache.solr.request.SimpleFacets
> ...
>     // TODO: future logic could use filters instead of the fieldcache if
>     // the number of terms in the field is small enough.
>     counts = getFieldCacheCounts(searcher, base, field, offset, limit,
>         mincount, missing, sort, prefix);
> ...
>     FieldCache.StringIndex si =
>         FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
>     final String[] terms = si.lookup;
>     final int[] termNum = si.order;
> ...
>
> So that 64-bit requires more memory :)
>
> Mike, am I right here?
> [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> (64-bit JVM)
> 1.2Gb RAM for this...
>
> Or, maybe I am wrong:
>
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc.
>
> [8 bytes (64bit)] x [number of documents (100mlns)]?
> 0.8Gb
>
> Kind of Map between String and DocSet, saving 4 bytes... "Key" is String,
> and "Value" is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
> JVM)? I always thought it is (int) documentId...
>
> Am I right?
>
> Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
>
>>> Note that for your use case, this is exceptionally wasteful.
>
> This is probably a very common case... I think it should be confirmed by
> Lucene developers too... FieldCache is warmed anyway, even when we don't
> use SOLR...
>
> -Fuad
>
>> -----Original Message-----
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: November-02-09 6:00 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Lucene FieldCache memory requirements
>>
>> OK I think someone who knows how Solr uses the FieldCache for this
>> type of field will have to pipe up.
>>
>> For Lucene directly, simple strings would consume a pointer (4 or 8
>> bytes depending on whether your JRE is 64bit) per doc, and the string
>> index would consume an int (4 bytes) per doc. (Each also consumes
>> negligible (for your case) memory to hold the actual string values.)
>>
>> Note that for your use case, this is exceptionally wasteful. If
>> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
>> then it'd take much fewer bits to reference the values, since you have
>> only 10 unique string values.
>>
>> Mike
>>
>> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <f...@efendi.ca> wrote:
>>
>>> I am not using the Lucene API directly; I am using SOLR, which uses
>>> Lucene FieldCache for faceting on non-tokenized fields...
>>> I think this cache will be lazily loaded, until a user executes a
>>> sorted (by this field) SOLR query for all documents *:* - in this case
>>> it will be fully populated...
>>>
>>>> Subject: Re: Lucene FieldCache memory requirements
>>>>
>>>> Which FieldCache API are you using? getStrings? Or getStringIndex
>>>> (which is used, under the hood, if you sort by this field).
>>>>
>>>> Mike
>>>>
>>>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <f...@efendi.ca> wrote:
>>>>
>>>>> Any thoughts regarding the subject? I hope FieldCache doesn't use
>>>>> more than 6 bytes per document-field instance... I am too lazy to
>>>>> research the Lucene source code, I hope someone can provide an exact
>>>>> answer...
>>>>> Thanks
>>>>>
>>>>>> Subject: Lucene FieldCache memory requirements
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
>>>>>> million docs with a non-tokenized field "country" (10 different
>>>>>> countries); I expect it requires an array of ("int", "long"), size of
>>>>>> array 100,000,000, without any impact of "country" field length;
>>>>>>
>>>>>> it requires 600,000,000 bytes: "int" is a pointer to the document
>>>>>> (Lucene document ID), and "long" is a pointer to the String value...
>>>>>>
>>>>>> Am I right, is it 600Mb just for this "country" (indexed,
>>>>>> non-tokenized, non-boolean) field and 100 million docs? I need to
>>>>>> calculate the exact minimum RAM requirements...
>>>>>>
>>>>>> I believe it shouldn't depend on cardinality (distribution) of the
>>>>>> field...
>>>>>>
>>>>>> Thanks,
>>>>>> Fuad

--
- Mark
http://www.lucidimagination.com
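The memory arithmetic discussed in the thread can be checked with a short sketch. The 100-million-doc and 10-country figures come from the messages above; the key point is that StringIndex stores an int[maxDoc] ordinal array (si.order, 4 bytes per doc regardless of pointer width) plus a small String[numTerms] lookup table, and the bit-packed estimate assumes the LUCENE-1990 idea of ceil(log2(numTerms+1)) bits per document:

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 100_000_000L; // 100M docs, as in the thread
        int numTerms = 10;          // 10 distinct country values

        // si.order: one int ordinal per document.
        long ordBytes = maxDoc * Integer.BYTES;

        // Bit-packed alternative: enough bits for numTerms + 1 ordinals
        // (slot 0 is reserved for documents with no value).
        int bitsPerDoc = 32 - Integer.numberOfLeadingZeros(numTerms + 1);
        long packedBytes = (maxDoc * bitsPerDoc + 7) / 8;

        System.out.println("StringIndex ords: " + ordBytes / (1024 * 1024) + " MB");
        System.out.println("Packed (" + bitsPerDoc + " bits/doc): "
                + packedBytes / (1024 * 1024) + " MB");
    }
}
```

So the ordinal array alone is about 400MB for 100M docs, independent of 32- vs 64-bit JVM and of the country-string lengths; with only 10 unique values, 4-bit packing would shrink that to roughly 50MB, which is what makes the current layout "exceptionally wasteful" for this case.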