Yes, that's exactly what I meant.

I think adding "new" fields to a separate index and using ParallelReader at
query time would be worth investigating at the Solr level.
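Roughly what I have in mind at the Lucene level (an untested sketch against
the 2.4-era API; the paths are made up, and the parallel index must contain
exactly as many docs, in the same order, as the main one):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    // open the existing index and a second index holding only the "new" fields
    ParallelReader pr = new ParallelReader();
    pr.add(IndexReader.open(FSDirectory.getDirectory("/data/index-main")));
    pr.add(IndexReader.open(FSDirectory.getDirectory("/data/index-extra")));

    // searches against this reader see the union of both indexes' fields
    IndexSearcher searcher = new IndexSearcher(pr);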
I can spend some time creating a patch for this if you think it is a good
idea and likely to be merged into the repo, haha.
It is not very mainstream, but I think everyone with more than a million
docs curses a lot about having to stop the entire service for a couple of
days just to add a field :)

We have 60M rows now and a 50,000MB index (ouch, roughly 0.8KB per doc, man
that is too much), so we are getting into a state where reindexing is
becoming close to impossible...

Keep up the fantastic work

//Marcus



On Mon, Apr 27, 2009 at 5:09 PM, Ning Li <ning.li...@gmail.com> wrote:

> You mean doc A and doc B will become one doc after adding index 2 to
> index 1? I don't think this is currently supported at either the Lucene
> level or the Solr level. If index 1 has m docs and index 2 has n docs,
> index 1 will have m+n docs after adding index 2 to index 1. Documents
> themselves are not modified by an index merge.
>
> Cheers,
> Ning
>
>
> On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
> <marcus.he...@tailsweep.com> wrote:
> > Hmm, looking at the code for the index merger in Solr
> > (org.apache.solr.update.DirectUpdateHandler2), I see that
> > IndexWriter.addIndexesNoOptimize(dirs) is used (a union of indexes),
> > right?
> >
> > And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
> > suggests:
> > add doc A to index1 with id=AAA,name=core1
> > add doc B to index2 with id=BBB,name=core2
> > merge the two indexes into one index which then contains both docs.
> > The resulting index will have 2 docs.
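> > i.e. the merge itself boils down to something like this (an untested
> > sketch, Lucene 2.4-era API; paths and analyzer are placeholders):
> >
> >   import org.apache.lucene.analysis.standard.StandardAnalyzer;
> >   import org.apache.lucene.index.IndexWriter;
> >   import org.apache.lucene.store.Directory;
> >   import org.apache.lucene.store.FSDirectory;
> >
> >   Directory dir1 = FSDirectory.getDirectory("/data/index1");
> >   Directory dir2 = FSDirectory.getDirectory("/data/index2");
> >   // open index1 for appending (create=false) and union index2 into it
> >   IndexWriter writer = new IndexWriter(dir1, new StandardAnalyzer(),
> >       false, IndexWriter.MaxFieldLength.UNLIMITED);
> >   writer.addIndexesNoOptimize(new Directory[] { dir2 }); // m + n docs after
> >   writer.close();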
> >
> > Great, but in my case I think it should work more like this:
> >
> > add doc A to index1 with id=X,title=blog entry title,description=blog
> > entry description
> > add doc B to index2 with id=X,score=1.2
> > somehow add index2 to index1 so that id=X has score=1.2 when searching
> > in index1
> > The resulting index should have 1 doc.
> >
> > So this is not really what I want, right?
> >
> > Sorry for being a smart-ass...
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> > On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou
> > <marcus.he...@tailsweep.com> wrote:
> >
> >> Guys!
> >>
> >> Thanks for these insights. I think we will go for a Lucene-level merging
> >> strategy (two or more indexes).
> >> When merging, I guess the second index needs to have the same doc ids
> >> somehow. That is an internal id in Lucene, not that easy to get hold of,
> >> right?
> >>
> >> So you are saying that the Solr ExternalFileField + FunctionQuery stuff
> >> would not work very well performance-wise, or what do you mean?
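> >> For reference, the kind of setup I mean (a sketch only; the field and
> >> type names are made up): in schema.xml,
> >>
> >>   <fieldType name="extRank" keyField="id" defVal="0"
> >>              stored="false" indexed="false"
> >>              class="solr.ExternalFileField" valType="float"/>
> >>   <field name="rank" type="extRank"/>
> >>
> >> with the values kept in a file named external_rank (lines of id=value)
> >> in the index data directory, picked up when a new searcher is opened.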
> >>
> >> I sure like the bleeding edge :)
> >>
> >> Cheers dudes
> >>
> >> //Marcus
> >>
> >>
> >>
> >>
> >>
> >> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> >> otis_gospodne...@yahoo.com> wrote:
> >>
> >>>
> >>> I should emphasize that the ParallelReader trick I mentioned is something
> >>> you'd do at the Lucene level, outside Solr, and then you'd just slip the
> >>> modified index back into Solr.
> >>> Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
> >>> Solr index merging functionality (patch in JIRA).
> >>>
> >>>
> >>> Otis --
> >>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>>
> >>>
> >>>
> >>> ----- Original Message ----
> >>> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> >>> > To: solr-user@lucene.apache.org
> >>> > Sent: Saturday, April 25, 2009 9:41:45 AM
> >>> > Subject: Re: Date faceting - howto improve performance
> >>> >
> >>> >
> >>> > Yes, you could simply round the date; no need for a non-date type field.
> >>> > Yes, you can add a field after the fact by making use of ParallelReader
> >>> > and merging (I don't recall the details; search the mailing list for
> >>> > ParallelReader and Andrzej). I remember he once provided a working recipe.
> >>> >
> >>> >
> >>> > Otis --
> >>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>> >
> >>> >
> >>> >
> >>> > ----- Original Message ----
> >>> > > From: Marcus Herou
> >>> > > To: solr-user@lucene.apache.org
> >>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> >>> > > Subject: Date faceting - howto improve performance
> >>> > >
> >>> > > Hi.
> >>> > >
> >>> > > One of our faceting use-cases:
> >>> > > We are creating trend graphs of how many blog posts contain a certain
> >>> > > term, grouped by day/week/year etc. using the nice DateMathParser
> >>> > > functions.
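> >>> > > (A request of roughly this shape, to illustrate; the field and the
> >>> > > ranges are just an example:)
> >>> > >
> >>> > >   q=someterm&facet=true&facet.date=publishedDate
> >>> > >     &facet.date.start=NOW/YEAR&facet.date.end=NOW&facet.date.gap=%2B1DAY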
> >>> > >
> >>> > > The performance degrades really fast and the queries consume a lot of
> >>> > > memory, which forces an OOM from time to time.
> >>> > > We think it is due to the fact that the cardinality of the field
> >>> > > publishedDate in our index is huge, almost equal to the number of
> >>> > > documents in the index.
> >>> > >
> >>> > > We need to address that...
> >>> > >
> >>> > > Some questions:
> >>> > >
> >>> > > 1. Can a date field have other date formats than the default
> >>> > > yyyy-MM-dd'T'HH:mm:ssZ ?
> >>> > >
> >>> > > 2. We are thinking of adding a field to the index which has the format
> >>> > > yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
> >>> > > could perhaps be a string, but the question then is whether faceting
> >>> > > can still be used?
> >>> > >
> >>> > > 3. Since we already have such a huge index, is there a way to add a
> >>> > > field afterwards and apply it to all documents without actually
> >>> > > reindexing the whole shebang?
> >>> > >
> >>> > > 4. If the field cannot be a string, can we just zero out the
> >>> > > hour/minute/second information to reduce the cardinality and improve
> >>> > > performance? Example: 2009-01-01T00:00:00Z
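> >>> > > (Client-side, the rounding I mean would be something like this plain
> >>> > > Java sketch, applied before the doc is sent to Solr:)
> >>> > >
> >>> > >   import java.text.SimpleDateFormat;
> >>> > >   import java.util.Date;
> >>> > >   import java.util.TimeZone;
> >>> > >
> >>> > >   // truncate the timestamp to day resolution (UTC) before indexing
> >>> > >   SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd'T'00:00:00'Z'");
> >>> > >   day.setTimeZone(TimeZone.getTimeZone("UTC"));
> >>> > >   String rounded = day.format(new Date()); // e.g. 2009-01-01T00:00:00Z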
> >>> > >
> >>> > > 5. I am afraid that we need to reindex everything to get this to work
> >>> > > (which negates Q3). We currently have 8 shards; what would be the most
> >>> > > efficient way to reindex the whole shebang? Dump the entire database to
> >>> > > disk (sigh), split it into many XML files and post them with curl in a
> >>> > > random/hash(numServers) manner?
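> >>> > > (Along these lines; the host and file names are made up:)
> >>> > >
> >>> > >   curl http://shard1:8983/solr/update -H 'Content-type: text/xml' \
> >>> > >        --data-binary @posts-split-0001.xml
> >>> > >   curl http://shard1:8983/solr/update -H 'Content-type: text/xml' \
> >>> > >        --data-binary '<commit/>'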
> >>> > >
> >>> > >
> >>> > > Kindly
> >>> > >
> >>> > > //Marcus
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Marcus Herou CTO and co-founder Tailsweep AB
> >>> > > +46702561312
> >>> > > marcus.he...@tailsweep.com
> >>> > > http://www.tailsweep.com/
> >>> > > http://blogg.tailsweep.com/
> >>>
> >>>
> >>
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.he...@tailsweep.com
> >> http://www.tailsweep.com/
> >> http://blogg.tailsweep.com/
> >>
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.he...@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
