I believe this is the correct estimate:

> C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
>
>    same as
>    [String1_Document_Count + ... + String10_Document_Count] x [4 bytes per DocumentID]


So, for 100 million docs we need 400Mb for each(!) non-tokenized field.
Although FieldCacheImpl is based on a WeakHashMap (somewhere internally), we
can't rely on it "sizing down" while SOLR faceting features keep the cache in
use...


I think I finally found the answer...

  /** Expert: Stores term text values and document ordering data. */
  public static class StringIndex {
    ...   
    /** All the term values, in natural order. */
    public final String[] lookup;

    /** For each document, an index into the lookup array. */
    public final int[] order;
    ...
  }
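
So the value of a document is recovered as lookup[order[docId]]: order[] costs
4 bytes per document, while lookup[] only grows with the number of unique
terms. A minimal sketch of that decoding (the "country" field name and docId
are my assumptions; Lucene 2.9-era API):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;

  public class StringIndexDecode {
    /** Prints the "country" value of one document from the StringIndex arrays. */
    static void printCountry(IndexReader reader, int docId) throws IOException {
      FieldCache.StringIndex si =
          FieldCache.DEFAULT.getStringIndex(reader, "country");
      // order[docId] is the 4-byte ordinal; lookup[] holds each unique term once
      System.out.println(si.lookup[si.order[docId]]);
    }
  }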



Another API:
  /** Checks the internal cache for an appropriate entry, and if none
   * is found, reads the term values in <code>field</code> and returns an array
   * of size <code>reader.maxDoc()</code> containing the value each document
   * has in the given field.
   * @param reader  Used to get field values.
   * @param field   Which field contains the strings.
   * @return The values in the given field for each document.
   * @throws IOException  If any error occurs.
   */
  public String[] getStrings (IndexReader reader, String field)
  throws IOException;


Looks similar; the cache size is also [maxdoc]; however, the values stored are
8-byte pointers on a 64-bit JVM.
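
For comparison, the getStrings path hands back one String reference per
document; at 100 million docs that is ~800Mb of references alone on a 64-bit
JVM (uncompressed oops), before counting the strings themselves. A hedged
sketch (same assumed field name as above):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;

  public class GetStringsCost {
    static void printCountry(IndexReader reader, int docId) throws IOException {
      // String[maxDoc]: one 8-byte reference per document on a 64-bit JVM
      String[] countries = FieldCache.DEFAULT.getStrings(reader, "country");
      System.out.println(countries[docId]);
    }
  }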


  private Map<Class<?>,Cache> caches;
  private synchronized void init() {
    caches = new HashMap<Class<?>,Cache>(7);
    ...
    caches.put(String.class, new StringCache(this));
    caches.put(StringIndex.class, new StringIndexCache(this));
    ...
  }


StringCache and StringIndexCache use WeakHashMap internally... but the cached
objects won't ever be garbage collected in a "faceted" production system...
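
A minimal illustration of why the weak keys don't help here (my own sketch,
not FieldCacheImpl code): the weak key is effectively the IndexReader, and a
live SOLR searcher holds a strong reference to it, so the entry can only be
collected once the reader itself is released.

  import java.util.Map;
  import java.util.WeakHashMap;

  public class CacheLifetime {
    static final Map<Object, int[]> cache = new WeakHashMap<>();

    public static void main(String[] args) throws InterruptedException {
      Object reader = new Object();     // stands in for an open IndexReader
      cache.put(reader, new int[100]);  // stands in for a FieldCache entry

      System.gc();
      Thread.sleep(100);
      System.out.println(cache.size()); // 1: entry survives while 'reader' is reachable

      reader = null;                    // i.e. the searcher finally releases the reader
      System.gc();
      Thread.sleep(100);
      System.out.println(cache.size()); // usually 0: only now can the entry go away
    }
  }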

SOLR SimpleFacets doesn't use the "getStrings" API, so the hope is that memory
requirements are minimized.


However, Lucene may use it internally for some queries (or, for instance, to
get access to a non-tokenized cached field without reading the index)... to be
safe, use this in your basic memory estimates:


[512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes]
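
Plugging in this thread's numbers (a sketch; the field count is an assumption,
use your own schema's):

  public class HeapFloor {
    public static void main(String[] args) {
      long base = 768L << 20;      // middle of the 512Mb ~ 1Gb base
      int nonTokenizedFields = 3;  // assumed; count yours
      long maxDoc = 100_000_000L;
      long bytes = base + (long) nonTokenizedFields * maxDoc * 8L;
      System.out.println("~" + (bytes >> 20) + "Mb minimum heap");
    }
  }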


-Fuad



> -----Original Message-----
> From: Fuad Efendi [mailto:f...@efendi.ca]
> Sent: November-02-09 7:37 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Lucene FieldCache memory requirements
> 
> 
> Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
> difference between maxdoc and maxdoc + 1 for such estimate... difference is
> between 0.4Gb and 1.2Gb...
> 
> 
> So, let's vote ;)
> 
> A. [maxdoc] x [8 bytes ~ pointer to String object]
> 
> B. [maxdoc] x [8 bytes ~ pointer to Document object]
> 
> C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
> - same as [String1_Document_Count + ... + String10_Document_Count] x [4
> bytes ~ DocumentID]
> 
> D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]
> 
> 
> Please confirm that it is Pointer to Object and not Lucene Document ID... I
> hope it is (int) Document ID...
> 
> 
> 
> 
> 
> > -----Original Message-----
> > From: Mark Miller [mailto:markrmil...@gmail.com]
> > Sent: November-02-09 6:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Lucene FieldCache memory requirements
> >
> > It also briefly requires more memory than just that - it allocates an
> > array the size of maxdoc+1 to hold the unique terms - and then sizes down.
> >
> > Possibly we can use the getUniqueTermCount method in the flexible
> > indexing branch to get rid of that - which is why I was thinking it
> > might be a good idea to drop the unsupported exception in that method
> > for things like multi reader and just do the work to get the right
> > number (currently there is a comment that the user should do that work
> > if necessary, making the call unreliable for this).
> >
> > Fuad Efendi wrote:
> > > Thank you very much Mike,
> > >
> > > I found it:
> > > org.apache.solr.request.SimpleFacets
> > > ...
> > >         // TODO: future logic could use filters instead of the fieldcache if
> > >         // the number of terms in the field is small enough.
> > >         counts = getFieldCacheCounts(searcher, base, field, offset, limit,
> > >             mincount, missing, sort, prefix);
> > > ...
> > >     FieldCache.StringIndex si =
> > >         FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
> > >     final String[] terms = si.lookup;
> > >     final int[] termNum = si.order;
> > > ...
> > >
> > >
> > > So 64-bit requires more memory :)
> > >
> > >
> > > Mike, am I right here?
> > > [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
> > > (64-bit JVM)
> > > 1.2Gb RAM for this...
> > >
> > > Or, may be I am wrong:
> > >
> > >> For Lucene directly, simple strings would consume a pointer (4 or 8
> > >> bytes depending on whether your JRE is 64bit) per doc, and the string
> > >> index would consume an int (4 bytes) per doc.
> > >>
> > >
> > > [8 bytes (64bit)] x [number of documents (100mlns)]?
> > > 0.8Gb
> > >
> > > Kind of a Map between String and DocSet, saving 4 bytes... "Key" is String,
> > > and "Value" is an array of 64-bit pointers to Document. Why 64-bit (for 64-bit
> > > JVM)? I always thought it is (int) documentId...
> > >
> > > Am I right?
> > >
> > >
> > > Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
> > >
> > >
> > >>> Note that for your use case, this is exceptionally wasteful.
> > >>>
> > > This is probably a very common case... I think it should be confirmed by
> > > Lucene developers too... FieldCache is warmed anyway, even when we don't
> > > use SOLR...
> > >
> > >
> > > -Fuad
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > >> Sent: November-02-09 6:00 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: Lucene FieldCache memory requirements
> > >>
> > >> OK I think someone who knows how Solr uses the fieldCache for this
> > >> type of field will have to pipe up.
> > >>
> > >> For Lucene directly, simple strings would consume a pointer (4 or 8
> > >> bytes depending on whether your JRE is 64bit) per doc, and the string
> > >> index would consume an int (4 bytes) per doc.  (Each also consumes
> > >> negligible (for your case) memory to hold the actual string values.)
> > >>
> > >> Note that for your use case, this is exceptionally wasteful.  If
> > >> Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
> > >> then it'd take far fewer bits to reference the values, since you have
> > >> only 10 unique string values.
> > >>
> > >> Mike
> > >>
> > >> On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi <f...@efendi.ca> wrote:
> > >>
> > >>> I am not using the Lucene API directly; I am using SOLR which uses Lucene
> > >>> FieldCache for faceting on non-tokenized fields...
> > >>> I think this cache will be lazily loaded, until a user executes a sorted (by
> > >>> this field) SOLR query for all documents *:* - in this case it will be
> > >>> fully populated...
> > >>>
> > >>>
> > >>>
> > >>>> Subject: Re: Lucene FieldCache memory requirements
> > >>>>
> > >>>> Which FieldCache API are you using?  getStrings?  or getStringIndex
> > >>>> (which is used, under the hood, if you sort by this field).
> > >>>>
> > >>>> Mike
> > >>>>
> > >>>> On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi <f...@efendi.ca> wrote:
> > >>>>
> > >>>>> Any thoughts regarding the subject? I hope FieldCache doesn't use more
> > >>>>> than 6 bytes per document-field instance... I am too lazy to research
> > >>>>> Lucene source code, I hope someone can provide an exact answer... Thanks
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> Subject: Lucene FieldCache memory requirements
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>>
> > >>>>>> Can anyone confirm Lucene FieldCache memory requirements? I have 100
> > >>>>>> million docs with non-tokenized field "country" (10 different countries);
> > >>>>>> I expect it requires an array of ("int", "long"), size of array
> > >>>>>> 100,000,000, without any impact of "country" field length;
> > >>>>>>
> > >>>>>> it requires 600,000,000 bytes: "int" is pointer to document (Lucene
> > >>>>>> document ID), and "long" is pointer to String value...
> > >>>>>>
> > >>>>>> Am I right, is it 600Mb just for this "country" (indexed, non-tokenized,
> > >>>>>> non-boolean) field and 100 million docs? I need to calculate exact
> > >>>>>> minimum RAM requirements...
> > >>>>>>
> > >>>>>> I believe it shouldn't depend on cardinality (distribution) of field...
> > >>>>>> Thanks,
> > >>>>>> Fuad
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >
> > >
> > >
> >
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> >
> >
> - Fuad
> 
> http://www.linkedin.com/in/liferay
> 


