On 10-Oct-07, at 2:40 PM, David Whalen wrote:
According to Yonik, I can't use minDf because I'm faceting
on a string field. I'm thinking of changing it to a tokenized
type so that I can use this setting, but then I'd have to
rebuild my entire index.
Unless there's some way around that?
For the fields that matter (many unique values), this is likely
to result in a performance regression.
It might be better to try faceting on less-unique data. For
instance, faceting on the blog_url or create_date fields in your
schema would cause problems (they probably have millions of
unique values).
It would be helpful to know which field is causing the problem. One
way would be to do a sorted query on a quiescent index for each
field, and see if there are any suspiciously large jumps in memory
usage.
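For example, something like this (host, port and q are
placeholders; the field names are just the ones you've
mentioned):

  http://localhost:8983/solr/select?q=*:*&rows=0&sort=site_id+asc
  http://localhost:8983/solr/select?q=*:*&rows=0&sort=journalist_id+asc

Issuing one such query per candidate field against an
otherwise-idle server should pull that field into the
FieldCache, and the size of the heap jump after each one points
at the expensive fields.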
-Mike
-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 10, 2007 4:56 PM
To: solr-user@lucene.apache.org
Cc: stuhood
Subject: Re: Facets and running out of Heap Space
On 10-Oct-07, at 12:19 PM, David Whalen wrote:
It looks now like I can't use facets the way I was hoping to,
because the memory requirements are impractical.
I can't remember if this has been mentioned, but upping the
HashDocSet size is one way to reduce memory consumption.
Whether this will work well depends greatly on the cardinality
of your facet sets. Setting facet.enum.cache.minDf high is
another option (it will not cache a bitset for any value whose
facet set is smaller than that value).
Both options have performance implications.
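For concreteness (the numbers below are placeholders, not
recommendations): the HashDocSet ceiling lives in
solrconfig.xml, and minDf is just a request parameter:

  <!-- doc sets up to this many ids are stored as hashes
       instead of bitsets -->
  <HashDocSet maxSize="10000" loadFactor="0.75"/>

  ...&facet=true&facet.field=site_id&facet.enum.cache.minDf=100

Raising maxSize trades some CPU for memory; raising minDf skips
caching bitsets for the rarer values.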
So, as an alternative I was thinking I could get counts by doing
rows=0 and using filter queries.
Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the process?
Running one query per unique facet value seems impractical,
if that is what you are suggesting. Setting minDf to a very
high value should always outperform such an approach.
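To put numbers on it (the field and the values below are just
an illustration): the filter-query approach means one request
per value, reading numFound from each, e.g.

  .../select?q=<your query>&fq=media_type:video&rows=0
  .../select?q=<your query>&fq=media_type:audio&rows=0

That's fine for a 4-value field like media_type but hopeless
for something like journalist_id, whereas a single faceted
request returns all the counts at once.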
-Mike
DW
-----Original Message-----
From: Stu Hood [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2007 10:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space
Using the filter cache method on things like media type and
location will occupy ~2.3MB of memory _per unique value_
Mike, how did you calculate that value? I'm trying to tune my
caches, and any equations that could be used to determine some
balanced settings would be extremely helpful. I'm in a
memory-limited environment, so I can't afford to throw a ton of
cache at the problem.
(I don't want to thread-jack, but I'm also wondering whether
anyone has any notes on how to tune cache sizes for the
filterCache, queryResultCache and documentCache).
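The entries I mean are the ones in solrconfig.xml that look
roughly like this (sizes here are placeholders, not my actual
settings):

  <filterCache class="solr.LRUCache" size="512"
                initialSize="512" autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="512"
                initialSize="512" autowarmCount="256"/>
  <documentCache class="solr.LRUCache" size="512"
                initialSize="512" autowarmCount="0"/>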
Thanks,
Stu
-----Original Message-----
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Tuesday, October 9, 2007 9:30pm
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space
On 9-Oct-07, at 12:36 PM, David Whalen wrote:
(snip)
I'm sure we could stop storing many of these columns,
especially if someone told me that would make a big difference.
I don't think that it would make a difference in memory
consumption, but storage is certainly not necessary for
faceting. Extra stored fields can slow down search if they are
large (in terms of bytes), but don't really occupy extra
memory, unless they are polluting the doc cache. Does 'text'
need to be stored?
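For instance (guessing at your schema; the type name here is an
assumption), a definition like

  <field name="text" type="text" indexed="true" stored="false"/>

keeps the field searchable but means it is never returned in
responses or pulled into the document cache.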
what does the LukeRequestHandler tell you about the # of
distinct terms in each field that you facet on?
Where would I find that? I could probably estimate that myself
on a per-column basis. It ranges from 4 distinct values for
media_type, to 30-ish for location, to 200-ish for
country_code, to almost 10,000 for site_id, to almost 100,000
for journalist_id.
Using the filter cache method on things like media type and
location will occupy ~2.3MB of memory _per unique value_, so it
should be a net win for those (although quite close in space
requirements for a 30-ary field on your index size).
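(Where that figure comes from: a cached filter is, in the worst
case, a bitset with one bit per document in the index, so

  memory per cached value ~= maxDoc / 8 bytes
  e.g. 20,000,000 docs / 8 ~= 2.4 MB

Values whose sets stay under the HashDocSet maxSize are stored
as hashes of doc ids and cost far less.)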
-Mike