Hi Marcus.

You must supply dates in the format that you are using now -- ISO-8601 with the 
trailing Z to indicate UTC, i.e. no time-zone offset.  To reduce cardinality to 
the day level instead of the second-level precision you have now, the date you 
supply can include DateMathParser operations.  So if you supply 
2009-04-01T20:15:01Z/DAY then Solr rounds it down to the start of that day, 
2009-04-01T00:00:00Z, before indexing it.  Of course you then lose the ability 
to search at a granularity finer than a day.  And the date you get back (i.e. 
the stored value) is the rounded date, not the date prior to rounding.
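
As a rough illustration of the above, here is a small Python sketch of how an 
indexing client might build the value to send.  The helper names 
(solr_day_rounded, expected_stored_value) are made up for this example and are 
not part of Solr or any client library; the rounding itself is done by Solr 
when it sees /DAY, and the second helper only mimics the result so you can see 
what to expect back.  It assumes your timestamps are timezone-aware.

```python
from datetime import datetime, timedelta, timezone

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # ISO-8601 in UTC, as Solr date fields expect


def _to_utc(dt: datetime) -> datetime:
    # Refuse naive datetimes so a local time is never silently treated as UTC.
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc)


def solr_day_rounded(dt: datetime) -> str:
    """Hypothetical helper: field value that asks Solr to round down to the day."""
    return _to_utc(dt).strftime(SOLR_FORMAT) + "/DAY"


def expected_stored_value(dt: datetime) -> str:
    """Hypothetical helper: what the stored value should look like after /DAY."""
    midnight = _to_utc(dt).replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight.strftime(SOLR_FORMAT)


published = datetime(2009, 4, 1, 20, 15, 1, tzinfo=timezone.utc)
print(solr_day_rounded(published))       # 2009-04-01T20:15:01Z/DAY
print(expected_stored_value(published))  # 2009-04-01T00:00:00Z
```

Every post published on the same UTC day then ends up with the same indexed 
value, which is what brings the cardinality of the field down.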

Yes, you will certainly need to re-index.  Since you architected your 
indexing strategy, only you know how best to go about doing that.  By now I'm 
sure you are aware that you cannot update individual fields of a document that 
is already indexed.  By the way, if your current strategy involves periodic 
updates then you could simply wait until all your data eventually gets 
re-indexed.  There's no harm in some of the dates being rounded and some not -- 
it's just that until most of them are rounded, you will still have your current 
problem of sporadic OOM.

~ David
________________________________________
From: Marcus Herou [marcus.he...@tailsweep.com]
Sent: Saturday, April 25, 2009 6:54 AM
To: solr-user@lucene.apache.org
Subject: Date faceting - howto improve performance

Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts contain a certain
term, grouped by day/week/year etc. with the nice DateMathParser
functions.

The performance degrades really fast and consumes a lot of memory, which
forces OOM from time to time.
We think it is due to the fact that the cardinality of the field publishedDate
in our index is huge, almost equal to the number of documents in the index.

We need to address that...

Some questions:

1. Can a datefield have other date-formats than the default of yyyy-MM-dd
HH:mm:ssZ ?

2. We are thinking of adding a field to the index which has the format
yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
could perhaps be a string, but the question then is whether faceting can be
used ?

3. Since we now already have such a huge index, is there a way to add a
field afterwards and apply it to all documents without actually reindexing
the whole shebang ?

4. If the field cannot be a string, can we just leave out the
hour/minute/second information to reduce the cardinality and improve
performance ? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work
(negates Q3). We currently have 8 shards, what would the most efficient
way be to reindex the whole shebang ? Dump the entire database to disk
(sigh), create many xml file splits and use curl in a
random/hash(numServers) manner on them ?


Kindly

//Marcus







--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
