I'll see what I can do about that. Truthfully, the most important facet we need is the one on media_type, which has only 4 unique values. The second most important one to us is location, which has about 30 unique values.
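(An aside for anyone wanting to check cardinalities like these on their own index: the Luke request handler reports a per-field count of distinct terms. Assuming it's registered at the stock admin path, a request along these lines should show it; host, port and field names here are just illustrative:

    http://localhost:8983/solr/admin/luke?fl=media_type,location

The per-field "distinct" number it returns is what drives faceting memory in the first place.)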
So, it would seem like we actually need a counter-intuitive solution. That's why I thought filter queries might be the answer. Is there some reason to avoid setting multiValued to true here? It sounds like it would be the true cure-all. (Concretely, see the P.S. below my sig.)

Thanks again!

dave
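P.S. To make the multiValued question concrete, here is the sort of schema.xml change I have in mind. Treat it as a sketch, not tested config: the field names are ours, the type/indexed/stored attributes are what I believe we already have, and multiValued="true" is the only intended change.

    <!-- illustrative: only multiValued="true" is new -->
    <field name="media_type" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="location"   type="string" indexed="true" stored="true" multiValued="true"/>

The facet request would stay the same, e.g. something like:

    /select?q=*:*&rows=0&facet=true&facet.field=media_type&facet.field=location&facet.enum.cache.minDf=20

My understanding (correct me if I'm wrong) is that a multiValued field gets faceted via the term-enumeration code path, where facet.enum.cache.minDf applies, rather than the FieldCache path used for single-valued string fields, which is what Yonik said was blocking minDf for us. If so, we'd get the minDf benefit without re-tokenizing or rebuilding the index.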
> -----Original Message-----
> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 10, 2007 6:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Facets and running out of Heap Space
>
> On 10-Oct-07, at 2:40 PM, David Whalen wrote:
>
> > According to Yonik I can't use minDf because I'm faceting on a
> > string field. I'm thinking of changing it to a tokenized type so
> > that I can utilize this setting, but then I'll have to rebuild my
> > entire index.
> >
> > Unless there's some way around that?
>
> For the fields that matter (many unique values), this is likely to
> result in a performance regression.
>
> It might be better to try storing less unique data. For instance,
> faceting on the blog_url field, or create_date in your schema would
> cause problems (they probably have millions of unique values).
>
> It would be helpful to know which field is causing the problem. One
> way would be to do a sorted query on a quiescent index for each
> field, and see if there are any suspiciously large jumps in memory
> usage.
>
> -Mike
>
> >> -----Original Message-----
> >> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, October 10, 2007 4:56 PM
> >> To: solr-user@lucene.apache.org
> >> Cc: stuhood
> >> Subject: Re: Facets and running out of Heap Space
> >>
> >> On 10-Oct-07, at 12:19 PM, David Whalen wrote:
> >>
> >>> It looks now like I can't use facets the way I was hoping to
> >>> because the memory requirements are impractical.
> >>
> >> I can't remember if this has been mentioned, but upping the
> >> HashDocSet size is one way to reduce memory consumption. Whether
> >> this will work well depends greatly on the cardinality of your
> >> facet sets. facet.enum.cache.minDf set high is another option
> >> (will not generate a bitset for any value whose facet set is less
> >> than this value).
> >>
> >> Both options have performance implications.
> >>
> >>> So, as an alternative I was thinking I could get counts by doing
> >>> rows=0 and using filter queries.
> >>>
> >>> Is there a reason to think that this might perform better?
> >>> Or, am I simply moving the problem to another step in the process?
> >>
> >> Running one query per unique facet value seems impractical, if
> >> that is what you are suggesting. Setting minDf to a very high
> >> value should always outperform such an approach.
> >>
> >> -Mike
> >>
> >>> DW
> >>>
> >>>> -----Original Message-----
> >>>> From: Stu Hood [mailto:[EMAIL PROTECTED]
> >>>> Sent: Tuesday, October 09, 2007 10:53 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Facets and running out of Heap Space
> >>>>
> >>>>> Using the filter cache method on the things like media type and
> >>>>> location; this will occupy ~2.3MB of memory _per unique value_
> >>>>
> >>>> Mike, how did you calculate that value? I'm trying to tune my
> >>>> caches, and any equations that could be used to determine some
> >>>> balanced settings would be extremely helpful. I'm in a memory
> >>>> limited environment, so I can't afford to throw a ton of cache
> >>>> at the problem.
> >>>>
> >>>> (I don't want to thread-jack, but I'm also wondering whether
> >>>> anyone has any notes on how to tune cache sizes for the
> >>>> filterCache, queryResultCache and documentCache).
> >>>>
> >>>> Thanks,
> >>>> Stu
> >>>>
> >>>> -----Original Message-----
> >>>> From: Mike Klaas <[EMAIL PROTECTED]>
> >>>> Sent: Tuesday, October 9, 2007 9:30pm
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Facets and running out of Heap Space
> >>>>
> >>>> On 9-Oct-07, at 12:36 PM, David Whalen wrote:
> >>>>
> >>>>> (snip)
> >>>>> I'm sure we could stop storing many of these columns,
> >>>>> especially if someone told me that would make a big difference.
> >>>>
> >>>> I don't think that it would make a difference in memory
> >>>> consumption, but storage is certainly not necessary for
> >>>> faceting. Extra stored fields can slow down search if they are
> >>>> large (in terms of bytes), but don't really occupy extra memory,
> >>>> unless they are polluting the doc cache. Does 'text' need to be
> >>>> stored?
> >>>>
> >>>>>> what does the LukeRequest Handler tell you about the # of
> >>>>>> distinct terms in each field that you facet on?
> >>>>>
> >>>>> Where would I find that? I could probably estimate that myself
> >>>>> on a per-column basis. It ranges from 4 distinct values for
> >>>>> media_type to 30-ish for location to 200-ish for country_code
> >>>>> to almost 10,000 for site_id to almost 100,000 for
> >>>>> journalist_id.
> >>>>
> >>>> Using the filter cache method on the things like media type and
> >>>> location; this will occupy ~2.3MB of memory _per unique value_,
> >>>> so it should be a net win for those (although quite close in
> >>>> space requirements for a 30-ary field on your index size).
> >>>>
> >>>> -Mike
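P.P.S. Stu: re your question about the ~2.3MB figure, I don't want to put words in Mike's mouth, but I believe each cached filter is just a bitset over the whole index, one bit per document:

    memory per cached facet value ~= maxDoc / 8 bytes
    2.3MB * 8 bits/byte ~= 19 million documents

so working backwards, his estimate assumes an index of roughly 19M docs. That would also explain the HashDocSet advice: in solrconfig.xml, something like

    <!-- maxSize here is illustrative; sets with fewer docs than this
         are stored as hashes of doc ids rather than full bitsets -->
    <HashDocSet maxSize="10000" loadFactor="0.75"/>

caps which sets get the full maxDoc/8 bitset treatment, so raising it helps when most facet values match only a small number of documents.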