Guys! Thanks for these insights. I think we will go for a Lucene-level merging strategy (two or more indexes). When merging, I guess the second index needs to have the same doc ids somehow. This is an internal id in Lucene, not that easy to get hold of, right?
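To make the doc id concern concrete, here is a minimal sketch in plain Python (not Lucene code) of how parallel indexes line up. Lucene's ParallelReader requires every index to hold the same number of documents and joins them by document number, so in practice the second index has to be built by adding documents in the same order as the first. The function name `merge_parallel` and the field `publishedDay` are made up for illustration.

```python
def merge_parallel(main_index, extra_index):
    """Join two 'indexes' by position, the way parallel indexes are joined by doc number."""
    if len(main_index) != len(extra_index):
        raise ValueError("parallel indexes must hold the same number of documents")
    # Document n of the result is the union of the fields of document n in each index.
    return [{**main_doc, **extra_doc}
            for main_doc, extra_doc in zip(main_index, extra_index)]


# The existing index (hypothetical documents).
main_index = [
    {"id": "post-1", "publishedDate": "2009-04-25T06:54:02Z"},
    {"id": "post-2", "publishedDate": "2009-04-26T08:00:00Z"},
]

# The second index holds only the new field, written in the same document order.
extra_index = [
    {"publishedDay": "2009-04-25T00:00:00Z"},
    {"publishedDay": "2009-04-26T00:00:00Z"},
]

merged = merge_parallel(main_index, extra_index)
```

If the second index is written in a different order, nothing fails; the new field simply ends up on the wrong documents, which is why the ordering matters more than getting hold of the internal ids themselves.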
So you are saying that the Solr ExternalFileField + FunctionQuery stuff would not work very well performance-wise, or what do you mean?

I sure like bleeding edge :)

Cheers dudes

//Marcus

On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> I should emphasize that the PR trick I mentioned is something you'd do at
> the Lucene level, outside Solr, and then you'd just slip the modified index
> back into Solr.
> Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
> Solr index merging functionality (patch in JIRA).
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 9:41:45 AM
> > Subject: Re: Date faceting - howto improve performance
> >
> > Yes, you could simply round the date, no need for a non-date type field.
> > Yes, you can add a field after the fact by making use of ParallelReader and
> > merging (I don't recall the details, search the ML for ParallelReader and
> > Andrzej), I remember he once provided the working recipe.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: Marcus Herou
> > > To: solr-user@lucene.apache.org
> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> > > Subject: Date faceting - howto improve performance
> > >
> > > Hi.
> > >
> > > One of our faceting use-cases:
> > > We are creating trend graphs of how many blog posts contain a certain
> > > term, grouped by day/week/year etc. with the nice DateMathParser
> > > functions.
> > >
> > > The performance degrades really fast and consumes a lot of memory, which
> > > forces OOM from time to time.
> > > We think it is due to the fact that the cardinality of the field
> > > publishedDate in our index is huge, almost equal to the number of
> > > documents in the index.
> > >
> > > We need to address that...
> > >
> > > Some questions:
> > >
> > > 1. Can a datefield have other date formats than the default of
> > > yyyy-MM-dd HH:mm:ssZ?
> > >
> > > 2. We are thinking of adding a field to the index which has the format
> > > yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
> > > could perhaps be a string, but the question then is whether faceting can
> > > be used?
> > >
> > > 3. Since we already have such a huge index, is there a way to add a
> > > field afterwards and apply it to all documents without actually
> > > reindexing the whole shebang?
> > >
> > > 4. If the field cannot be a string, can we just leave out the
> > > hour/minute/second information to reduce the cardinality and improve
> > > performance? Example: 2009-01-01 00:00:00Z
> > >
> > > 5. I am afraid that we need to reindex everything to get this to work
> > > (negates Q3). We have 8 shards as of now. What would the most efficient
> > > way be to reindex the whole shebang? Dump the entire database to disk
> > > (sigh), create many XML file splits and use curl in a
> > > random/hash(numServers) manner on them?
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.he...@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
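For reference, a rough sketch of the "round the date" idea from the thread, in plain Python. It assumes timestamps in the `yyyy-MM-ddTHH:mm:ssZ` form and truncates each one to midnight UTC before indexing, so every post from the same day shares one field value. The helper name `round_to_day` and the sample timestamps are made up for illustration; this is not a Solr API.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def round_to_day(timestamp: str) -> str:
    """Truncate a UTC timestamp string to midnight of the same day."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    day_start = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start.strftime(TIMESTAMP_FORMAT)


# Hypothetical publishedDate values: four posts spread over two days.
published = [
    "2009-04-25T06:54:02Z",
    "2009-04-25T09:41:45Z",
    "2009-04-25T15:46:00Z",
    "2009-04-26T08:00:00Z",
]

rounded = [round_to_day(ts) for ts in published]

# Cardinality of the field before and after rounding.
distinct_before = len(set(published))
distinct_after = len(set(rounded))
```

The rounded value is still a valid date, so the field can stay a date type and date faceting keeps working; only the number of distinct terms drops, from roughly one per document to one per day.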