Re: Date faceting - howto improve performance

Marcus Herou Sat, 25 Apr 2009 08:10:52 -0700

Guys!

Thanks for these insights. I think we will go with the Lucene-level merging
strategy (two or more indexes).
When merging, I guess the second index needs to have the same doc ids somehow.
This is an internal id in Lucene, so not that easy to get hold of, right?
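
For reference, the value we would put in the new field is just publishedDate
rounded down to the day, as Otis suggested. A minimal Python sketch, assuming
the timestamps come out of the database as UTC strings like
2009-04-25T08:10:52Z (the function name and the sample values are made up for
illustration, this is not our actual indexing code):

```python
from datetime import datetime, timezone

def round_to_day(ts: str) -> str:
    """Truncate a UTC timestamp string to midnight of the same day."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    day = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return day.strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical sample: three posts, two of them on the same day.
stamps = [
    "2009-04-25T08:10:52Z",
    "2009-04-25T15:46:00Z",
    "2009-04-26T00:00:01Z",
]

# Distinct values before and after rounding; fewer distinct values means
# fewer terms to walk when faceting.
distinct_before = set(stamps)
distinct_after = {round_to_day(s) for s in stamps}
```

With real data the number of distinct values should drop from roughly one per
document to one per day.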


So you are saying that the Solr ExternalFileField + FunctionQuery approach
would not work very well performance-wise, or what do you mean?

I sure like bleeding edge :)

Cheers dudes

//Marcus




On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

>
> I should emphasize that the PR trick I mentioned is something you'd do at
> the Lucene level, outside Solr, and then you'd just slip the modified index
> back into Solr.
> Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
> Solr index merging functionality (patch in JIRA).
>
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 9:41:45 AM
> > Subject: Re: Date faceting - howto improve performance
> >
> >
> > Yes, you could simply round the date, no need for a non-date type field.
> > Yes, you can add a field after the fact by making use of ParallelReader and
> > merging (I don't recall the details, search the ML for ParallelReader and
> > Andrzej), I remember he once provided the working recipe.
> >
> >
> > Otis --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Marcus Herou
> > > To: solr-user@lucene.apache.org
> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> > > Subject: Date faceting - howto improve performance
> > >
> > > Hi.
> > >
> > > One of our faceting use-cases:
> > > We are creating trend graphs of how many blog posts contain a certain
> > > term, grouped by day/week/year etc. with the nice DateMathParser
> > > functions.
> > >
> > > The performance degrades really fast and consumes a lot of memory, which
> > > forces OOM from time to time.
> > > We think it is due to the fact that the cardinality of the field
> > > publishedDate in our index is huge, almost equal to the number of
> > > documents in the index.
> > >
> > > We need to address that...
> > >
> > > Some questions:
> > >
> > > 1. Can a datefield have other date formats than the default of
> > > yyyy-MM-dd HH:mm:ssZ ?
> > >
> > > 2. We are thinking of adding a field to the index which has the format
> > > yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
> > > could perhaps be a string, but the question then is whether faceting can
> > > be used ?
> > >
> > > 3. Since we already have such a huge index, is there a way to add a
> > > field afterwards and apply it to all documents without actually
> > > reindexing the whole shebang ?
> > >
> > > 4. If the field cannot be a string, can we just leave out the
> > > hour/minute/second information to reduce the cardinality and improve
> > > performance ? Example: 2009-01-01 00:00:00Z
> > >
> > > 5. I am afraid that we need to reindex everything to get this to work
> > > (negates Q3). We currently have 8 shards; what would be the most
> > > efficient way to reindex the whole shebang ? Dump the entire database to
> > > disk (sigh), create many XML file splits and use curl in a
> > > random/hash(numServers) manner on them ?
> > >
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.he...@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
>
>
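
For the archive: once a day-rounded date field exists, the per-day trend
counts discussed in the quoted thread could be requested roughly as below.
This is only a sketch under assumptions. The Solr URL and the field name
publishedDay are hypothetical, and the facet.date.* parameters are the date
faceting parameters as I understand them from the Solr wiki.

```python
from urllib.parse import urlencode

# Assumed values, not from the thread: the Solr URL and the field name
# "publishedDay" (a day-rounded copy of publishedDate) are hypothetical.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_day_facet_url(term: str, days_back: int = 30) -> str:
    """Build a select URL that counts documents matching `term` per day."""
    params = {
        "q": term,
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.date": "publishedDay",
        "facet.date.start": f"NOW/DAY-{days_back}DAYS",
        "facet.date.end": "NOW/DAY+1DAY",
        "facet.date.gap": "+1DAY",
    }
    # urlencode percent-escapes the "+" in the gap so it is not read as a space.
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_day_facet_url("solr")
```

Changing the gap to +7DAYS or +1YEAR would give the week and year groupings
from the same field.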


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
