Hmm, looking at the index merging code in Solr
(org.apache.solr.update.DirectUpdateHandler2), I see that
IndexWriter.addIndexesNoOptimize(dirs) is used, i.e. a union of the
indexes, right ?

And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
suggests:
add doc A to index1 with id=AAA,name=core1
add doc B to index2 with id=BBB,name=core2
merge the two indexes into one index, which then contains both docs.
The resulting index will have 2 docs.
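
(For reference, a minimal sketch of that union-style merge at the Lucene
level; the paths are hypothetical and the API is the Lucene 2.4-era one:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class UnionMerge {
        public static void main(String[] args) throws Exception {
            Directory target = FSDirectory.getDirectory("/indexes/index1");
            Directory source = FSDirectory.getDirectory("/indexes/index2");
            IndexWriter writer = new IndexWriter(target,
                    new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
            // Appends every doc from source to target; it does NOT join
            // documents that share a key such as id.
            writer.addIndexesNoOptimize(new Directory[] { source });
            writer.close();
        }
    }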

Great, but in my case I think it should work more like this:

add doc A to index1 with id=X,title=blog entry title,description=blog entry
description
add doc B to index2 with id=X,score=1.2
somehow add index2 to index1 so that id=X has score=1.2 when searching in
index1
The resulting index should have 1 doc.

So this is not really what I want, right ?
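
(For the record, the ParallelReader route Otis mentions below would look
roughly like this; a minimal sketch, assuming both indexes were written in
exactly the same document order, since ParallelReader matches documents by
internal doc id rather than by a field like id=X:)

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.store.FSDirectory;

    public class KeyedMergeSketch {
        public static void main(String[] args) throws Exception {
            ParallelReader reader = new ParallelReader();
            // Doc N of index1 is paired with doc N of index2.
            reader.add(IndexReader.open(FSDirectory.getDirectory("/indexes/index1")));
            reader.add(IndexReader.open(FSDirectory.getDirectory("/indexes/index2")));
            // reader now exposes title/description from index1 plus
            // score from index2 for the same logical document.
            reader.close();
        }
    }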

Sorry for being a smart-ass...

Kindly

//Marcus





On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:

> Guys!
>
> Thanks for these insights. I think we will go for a Lucene-level merging
> strategy (two or more indexes).
> When merging, I guess the second index needs to have the same doc ids
> somehow. That is an internal id in Lucene, not that easy to get hold of,
> right ?
>
> So you are saying that the Solr ExternalFileField + FunctionQuery stuff
> would not work very well performance-wise, or what do you mean ?
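>
> (For reference, a minimal sketch of that approach, with hypothetical
> field and file names; the scores live in a plain-text file next to the
> index rather than in the index itself. In schema.xml:)
>
>     <fieldType name="extfile" class="solr.ExternalFileField"
>                keyField="id" defVal="0" stored="false" indexed="false"
>                valType="float"/>
>     <field name="ext_score" type="extfile"/>
>
> (Then a file named external_ext_score in the index data directory holds
> one key=value line per document, e.g. X=1.2, and the values are read at
> search time through a FunctionQuery.)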
>
> I sure like the bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
>
>>
>> I should emphasize that the ParallelReader trick I mentioned is something
>> you'd do at the Lucene level, outside Solr, and then you'd just slip the
>> modified index back into Solr.
>> Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> > and merging (I don't recall the details; search the ML for ParallelReader
>> > and Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts contain a certain
>> > > term, grouped by day/week/year etc, with the nice DateMathParser
>> > > functions.
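>> > >
>> > > (For context, roughly the kind of date-facet request involved; a
>> > > sketch with illustrative parameter values:)
>> > >
>> > >     q=searchTerm
>> > >     &facet=true
>> > >     &facet.date=publishedDate
>> > >     &facet.date.start=NOW/YEAR
>> > >     &facet.date.end=NOW
>> > >     &facet.date.gap=%2B1DAY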
>> > >
>> > > The performance degrades really fast and the queries consume a lot of
>> > > memory, which forces an OOM from time to time.
>> > > We think it is due to the fact that the cardinality of the field
>> > > publishedDate in our index is huge, almost equal to the number of
>> > > documents in the index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date formats than the default of
>> > > yyyy-MM-dd'T'HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which has the
>> > > format yyyy-MM-dd to reduce the cardinality. If that field can't be
>> > > a date it could perhaps be a string, but the question then is whether
>> > > faceting can still be used ?
>> > >
>> > > 3. Since we already have such a huge index, is there a way to add a
>> > > field afterwards and apply it to all documents without actually
>> > > reindexing the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string, can we just zero out the
>> > > hour/minute/second information to reduce the cardinality and improve
>> > > performance ? Example: 2009-01-01T00:00:00Z
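>> > >
>> > > (One way to do that rounding at index time; a minimal Java sketch,
>> > > not any particular Solr API:)
>> > >
>> > >     import java.text.SimpleDateFormat;
>> > >     import java.util.Date;
>> > >     import java.util.TimeZone;
>> > >
>> > >     // Round a timestamp down to midnight UTC before indexing, so the
>> > >     // field's cardinality is bounded by the number of distinct days.
>> > >     SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd'T'00:00:00'Z'");
>> > >     day.setTimeZone(TimeZone.getTimeZone("UTC"));
>> > >     String rounded = day.format(new Date()); // e.g. 2009-01-01T00:00:00Z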
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to
>> > > work (which negates Q3). We currently have 8 shards; what would be
>> > > the most efficient way to reindex the whole shebang ? Dump the entire
>> > > database to disk (sigh), create many XML file splits and post them
>> > > with curl in a random/hash(numServers) manner ?
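>> > >
>> > > (A sketch of that hash-routing idea; shard URLs and file names are
>> > > hypothetical:)
>> > >
>> > >     import java.io.File;
>> > >
>> > >     // Pick a stable shard per XML split by hashing the file name,
>> > >     // then POST the file to that shard's /update handler, e.g.
>> > >     //   curl http://shardN:8983/solr/update \
>> > >     //        -H "Content-Type: text/xml" --data-binary @split-0001.xml
>> > >     String[] updateUrls = new String[8]; // the 8 shard /update URLs
>> > >     File split = new File("split-0001.xml");
>> > >     int shard = Math.abs(split.getName().hashCode()) % updateUrls.length;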
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.he...@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
