On Mon, Aug 3, 2009 at 4:56 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Mon, Aug 3, 2009 at 4:18 PM, Stephen Duncan Jr
> <stephen.dun...@gmail.com> wrote:
> > On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley
> > <yo...@lucidimagination.com> wrote:
> > Hmm, that's a hard thing to sell to the user and my boss, as it makes
> > the query time go from nearly always being sub-second (frequently less
> > than 60 ms) to ranging up to nearly 4 seconds for a new query not
> > already in the cache. (My test was with 100 facets being requested,
> > which may be reasonable, as one reason to facet on a full-text field is
> > to provide a dynamic word cloud.)
>
> Could you possibly profile it to find out what the hotspot is?
> We don't really have a good algorithm for faceting text fields, but it
> would be nice to see what the current bottleneck is.

I'll put it on my TODO list to try that out soon. I'll let you know the
results if I manage.

> > I guess I should update my code to use the enum method on all the
> > fields that are likely to risk crossing this line. Should I be looking
> > at the termInstances property on the fields that are displayed in the
> > FieldValueCache on the stats page, and figuring those on the order of
> > 10 million are likely to grow past the limit?
>
> For an index over 16M docs, it's perhaps closer to
> 16M/avg_bytes_per_term*256.
>
> The storage space for terms that aren't "big terms" (which come from
> the fieldCache) is 256 byte arrays, each of which can be up to 16MB in
> size. Every block of 65536 documents shares one of those byte arrays
> (or more if you have more than 16M documents). So the average document
> can't take up more than 256 bytes in the array. That doesn't mean 256
> term instances, though... that's the max. The list is delta-encoded
> vints, so if there are many terms, each vint could be bigger.
>
> More details in UnInvertedField after the comment:
>     //
>     // transform intermediate form into the final form, building a single byte[]
>     // at a time, and releasing the intermediate byte[]s as we go to avoid
>     // increasing the memory footprint.
>     //
>
> -Yonik
> http://www.lucidimagination.com

Ok, a lot of that is going over my head for the moment. I'll try to digest
this info a little further, but for now let's see if my minimal
understanding is correct: what will cause me to exceed the limit and fail
when faceting with the fc method is if the documents within a block of
65536 combine to take up too much space. And is this (generally speaking)
going to be a function of the average number of unique terms in the
documents?

--
Stephen Duncan Jr
www.stephenduncanjr.com
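P.S. For anyone following along, the way I plan to switch individual fields
over to the enum method is Solr's per-field parameter override. A sketch,
assuming a full-text field named "body" (the field name here is just an
example from my own schema):

    f.body.facet.method=enum

e.g. as part of a request like:

    /select?q=*:*&facet=true&facet.field=body&f.body.facet.method=enum

Fields without an override should keep the default (fc) behavior.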
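P.P.S. To check my reading of the numbers, here's a back-of-the-envelope
sketch in Java. The class and variable names and the
avgBytesPerTermInstance figure are my own assumptions, not Solr code; the
16MB / 256-array / 65536-doc constants come from Yonik's description above.

    // Rough capacity estimate for UnInvertedField's non-"big terms"
    // storage, based on the description above (not actual Solr code).
    public class UifCapacitySketch {
        public static void main(String[] args) {
            long bytesPerArray = 16L * 1024 * 1024; // each byte[] can reach 16MB
            int docsPerBlock = 65536;               // docs that share one byte[]

            // average per-document byte budget within one block
            long avgBytesPerDoc = bytesPerArray / docsPerBlock; // = 256

            // assumption: average size of one delta-encoded vint entry
            double avgBytesPerTermInstance = 2.0;

            // rough term-instance budget per document before a block overflows
            double maxTermInstancesPerDoc = avgBytesPerDoc / avgBytesPerTermInstance;

            // whole-index ceiling, i.e. Yonik's 16M/avg_bytes_per_term*256
            double maxTotalTermInstances = 256.0 * bytesPerArray / avgBytesPerTermInstance;

            System.out.println("byte budget per doc:        " + avgBytesPerDoc);
            System.out.println("max term instances per doc: " + maxTermInstancesPerDoc);
            System.out.println("max total term instances:   " + maxTotalTermInstances);
        }
    }

If that's right, comparing a field's termInstances stat against that last
number (rather than my flat 10 million guess) is the better early warning.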