On Mon, Aug 3, 2009 at 4:56 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Mon, Aug 3, 2009 at 4:18 PM, Stephen Duncan
> Jr<stephen.dun...@gmail.com> wrote:
> > On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley <yo...@lucidimagination.com
> >wrote:
> > Hmm, that's a hard thing to sell to the user and my boss, as it makes the
> > query time go from nearly always being sub-second (frequently less than
> > 60 ms) to ranging up to nearly 4 seconds for a new query not already in
> > the cache.  (My test was with 100 facets being requested, which may be
> > reasonable, as one reason to facet on a full-text field is to provide a
> > dynamic word cloud.)
>
> Could you possibly profile it to find out what the hotspot is?
> We don't really have a good algorithm for faceting text fields, but it
> would be nice to see what the current bottleneck is.


I'll put it on my TODO list to try that out soon.  I'll let you know the
results if I manage.


>
> > I guess I should update my code to use the enum method on all the fields
> > that are likely to risk crossing this line.  Should I be looking at the
> > termInstances property on the fields that are displayed in the
> > FieldValueCache on the stats page, and figuring those on the order of 10
> > million are likely to grow past the limit?
>
> For an index over 16M docs, it's perhaps closer to
> 16M/avg_bytes_per_term*256.
>
> The storage space for terms that aren't "big terms" (which come from
> the fieldCache) is 256 byte arrays, each of which can be up to 16MB in
> size.  Every block of 65536 documents shares one of those byte arrays
> (or more if you have more than 16M documents).  So the average
> document can't take up more than 256 bytes in the array.  That doesn't
> mean 256 term instances, though... that's the max.  The list is delta
> encoded vints, so if there are many terms, each vint could be bigger.
>
> More details in UnInvertedField after the comment:
>      //
>      // transform intermediate form into the final form, building a
> single byte[]
>      // at a time, and releasing the intermediate byte[]s as we go to avoid
>      // increasing the memory footprint.
>      //
>
> -Yonik
> http://www.lucidimagination.com
>
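To make the delta-encoded vint list Yonik describes more concrete, here is a small sketch (my own illustration, not Solr's actual `UnInvertedField` code): each document's sorted term ordinals are stored as deltas, each delta written as a variable-length int with 7 data bits per byte, so small deltas cost one byte and large ones cost more.

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Encode a non-negative int as a variable-length vint:
    // 7 data bits per byte; the high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    public static void main(String[] args) {
        // Hypothetical document: its terms as sorted term ordinals.
        int[] termOrds = {3, 10, 200, 70000};
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : termOrds) {
            writeVInt(buf, ord - prev); // store the delta, not the ordinal
            prev = ord;
        }
        // Deltas 3 and 7 take 1 byte each; 190 takes 2; 69800 takes 3.
        System.out.println("bytes used: " + buf.size());
    }
}
```

This is why "256 bytes per document on average" does not translate directly into 256 term instances: with many distinct terms the ordinals are far apart, the deltas grow, and each vint can take two or three bytes instead of one.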

Ok, a lot of that is going over my head for the moment.  I'll try to digest
this info a little further, but for now let's see if my minimal
understanding is correct:

What will cause me to exceed the limit and fail during faceting with the fc
method is if the documents within a block of 65536 combine to take up too
much space.  And is this (generally speaking) going to be a function of the
average number of unique terms per document?
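The back-of-envelope arithmetic behind that limit can be sketched like this (my restatement of the numbers Yonik gives above, with an assumed bytes-per-term-instance figure that is illustrative only):

```java
public class FacetBudget {
    public static void main(String[] args) {
        // From the thread: each block of 65536 documents shares a byte
        // array capped at 16MB, so on average a document gets 256 bytes.
        int docsPerBlock = 65536;
        int maxBytesPerArray = 16 * 1024 * 1024;
        int avgBytesPerDoc = maxBytesPerArray / docsPerBlock;
        System.out.println("avg byte budget per doc: " + avgBytesPerDoc);

        // Assumption (not measured): deltas average ~2 bytes per term
        // instance, which would put the average document at roughly
        // 128 term instances before a block overflows.
        int assumedBytesPerTermInstance = 2;
        System.out.println("approx term instances per doc: "
                + avgBytesPerDoc / assumedBytesPerTermInstance);
    }
}
```

So, under these assumptions, it is the average number of term instances per document within a 65536-document block (weighted by how many bytes each delta needs) that determines whether the limit is exceeded.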

-- 
Stephen Duncan Jr
www.stephenduncanjr.com
