I'll see what I can do about that. Truthfully, the most important facet we need is the one on media_type, which has only 4 unique values. The second most important one to us is location, which has about 30 unique values.
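(An aside for anyone wanting to check cardinalities like these on their own index: the Luke request handler reports a per-field count of distinct terms. Assuming it's registered at the stock admin path, a request along these lines should show it; host, port and field names here are just illustrative:

    http://localhost:8983/solr/admin/luke?fl=media_type,location

The per-field "distinct" number it returns is what drives faceting memory in the first place.)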
So, it would seem like we actually need a counter-intuitive solution. That's why I thought filter queries might be the answer. Is there some reason to avoid setting multiValued to true here? It sounds like it would be the true cure-all. (Concretely, see the P.S. below my sig.)

Thanks again!

dave
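P.S. To make the multiValued question concrete, here is the sort of schema.xml change I have in mind. Treat it as a sketch, not tested config: the field names are ours, the type/indexed/stored attributes are what I believe we already have, and multiValued="true" is the only intended change.

    <!-- illustrative: only multiValued="true" is new -->
    <field name="media_type" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="location"   type="string" indexed="true" stored="true" multiValued="true"/>

The facet request would stay the same, e.g. something like:

    /select?q=*:*&rows=0&facet=true&facet.field=media_type&facet.field=location&facet.enum.cache.minDf=20

My understanding (correct me if I'm wrong) is that a multiValued field gets faceted via the term-enumeration code path, where facet.enum.cache.minDf applies, rather than the FieldCache path used for single-valued string fields, which is what Yonik said was blocking minDf for us. If so, we'd get the minDf benefit without re-tokenizing or rebuilding the index.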
> -----Original Message-----
> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 10, 2007 6:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Facets and running out of Heap Space
>
> On 10-Oct-07, at 2:40 PM, David Whalen wrote:
>
> > According to Yonik I can't use minDf because I'm faceting on a
> > string field. I'm thinking of changing it to a tokenized type so
> > that I can utilize this setting, but then I'll have to rebuild my
> > entire index.
> >
> > Unless there's some way around that?
>
> For the fields that matter (many unique values), this is likely to
> result in a performance regression.
>
> It might be better to try storing less unique data. For instance,
> faceting on the blog_url field, or create_date in your schema would
> cause problems (they probably have millions of unique values).
>
> It would be helpful to know which field is causing the problem. One
> way would be to do a sorted query on a quiescent index for each
> field, and see if there are any suspiciously large jumps in memory
> usage.
>
> -Mike
>
> >> -----Original Message-----
> >> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, October 10, 2007 4:56 PM
> >> To: solr-user@lucene.apache.org
> >> Cc: stuhood
> >> Subject: Re: Facets and running out of Heap Space
> >>
> >> On 10-Oct-07, at 12:19 PM, David Whalen wrote:
> >>
> >>> It looks now like I can't use facets the way I was hoping to
> >>> because the memory requirements are impractical.
> >>
> >> I can't remember if this has been mentioned, but upping the
> >> HashDocSet size is one way to reduce memory consumption. Whether
> >> this will work well depends greatly on the cardinality of your
> >> facet sets. facet.enum.cache.minDf set high is another option
> >> (will not generate a bitset for any value whose facet set is less
> >> than this value).
> >>
> >> Both options have performance implications.
> >>
> >>> So, as an alternative I was thinking I could get counts by doing
> >>> rows=0 and using filter queries.
> >>>
> >>> Is there a reason to think that this might perform better?
> >>> Or, am I simply moving the problem to another step in the process?
> >>
> >> Running one query per unique facet value seems impractical, if
> >> that is what you are suggesting. Setting minDf to a very high
> >> value should always outperform such an approach.
> >>
> >> -Mike
> >>
> >>> DW
> >>>
> >>>> -----Original Message-----
> >>>> From: Stu Hood [mailto:[EMAIL PROTECTED]
> >>>> Sent: Tuesday, October 09, 2007 10:53 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Facets and running out of Heap Space
> >>>>
> >>>>> Using the filter cache method on the things like media type and
> >>>>> location; this will occupy ~2.3MB of memory _per unique value_
> >>>>
> >>>> Mike, how did you calculate that value? I'm trying to tune my
> >>>> caches, and any equations that could be used to determine some
> >>>> balanced settings would be extremely helpful. I'm in a memory
> >>>> limited environment, so I can't afford to throw a ton of cache
> >>>> at the problem.
> >>>>
> >>>> (I don't want to thread-jack, but I'm also wondering whether
> >>>> anyone has any notes on how to tune cache sizes for the
> >>>> filterCache, queryResultCache and documentCache).
> >>>>
> >>>> Thanks,
> >>>> Stu
> >>>>
> >>>> -----Original Message-----
> >>>> From: Mike Klaas <[EMAIL PROTECTED]>
> >>>> Sent: Tuesday, October 9, 2007 9:30pm
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Facets and running out of Heap Space
> >>>>
> >>>> On 9-Oct-07, at 12:36 PM, David Whalen wrote:
> >>>>
> >>>>> (snip)
> >>>>> I'm sure we could stop storing many of these columns,
> >>>>> especially if someone told me that would make a big difference.
> >>>>
> >>>> I don't think that it would make a difference in memory
> >>>> consumption, but storage is certainly not necessary for
> >>>> faceting. Extra stored fields can slow down search if they are
> >>>> large (in terms of bytes), but don't really occupy extra memory,
> >>>> unless they are polluting the doc cache. Does 'text' need to be
> >>>> stored?
> >>>>
> >>>>>> what does the LukeRequest Handler tell you about the # of
> >>>>>> distinct terms in each field that you facet on?
> >>>>>
> >>>>> Where would I find that? I could probably estimate that myself
> >>>>> on a per-column basis. It ranges from 4 distinct values for
> >>>>> media_type to 30-ish for location to 200-ish for country_code
> >>>>> to almost 10,000 for site_id to almost 100,000 for
> >>>>> journalist_id.
> >>>>
> >>>> Using the filter cache method on the things like media type and
> >>>> location; this will occupy ~2.3MB of memory _per unique value_,
> >>>> so it should be a net win for those (although quite close in
> >>>> space requirements for a 30-ary field on your index size).
> >>>>
> >>>> -Mike
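P.P.S. Stu: re your question about the ~2.3MB figure, I don't want to put words in Mike's mouth, but I believe each cached filter is just a bitset over the whole index, one bit per document:

    memory per cached facet value ~= maxDoc / 8 bytes
    2.3MB * 8 bits/byte ~= 19 million documents

so working backwards, his estimate assumes an index of roughly 19M docs. That would also explain the HashDocSet advice: in solrconfig.xml, something like

    <!-- maxSize here is illustrative; sets with fewer docs than this
         are stored as hashes of doc ids rather than full bitsets -->
    <HashDocSet maxSize="10000" loadFactor="0.75"/>

caps which sets get the full maxDoc/8 bitset treatment, so raising it helps when most facet values match only a small number of documents.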