Re: Date faceting - howto improve performance

Marcus Herou Sat, 25 Apr 2009 08:13:53 -0700

Oh and the indexing strategy is just a stupid random across the shards.

What I asked about was a "Best Practice" of achieving most MB/sec indexing.
I feel that the java-api should be less efficient than something more raw
like curl or so but that is just my hunch.


/M


On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <marcus.he...@tailsweep.com>wrote:

> Guys!
>
> Thanks for these insights, I think we will head for Lucene level merging
> strategy (two or more indexes).
> When merging I guess the second index need to have the same doc ids
> somehow. This is an internal id in Lucene, not that easy to get hold of
> right ?
>
> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> would not work very well performance wise or what do you mean ?
>
> I sure like bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>>
>> I should emphasize that the PR trick I mentioned is something you'd do at
>> the Lucene level, outside Solr, and then you'd just slip the modified index
>> back into Solr.
>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> and
>> > merging (I don't recall the details, search the ML for ParallelReader
>> and
>> > Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts that contains a
>> certain
>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>> > > functions.
>> > >
>> > > The performance degrades really fast and consumes a lot of memory
>> which
>> > > forces OOM from time to time
>> > > We think it is due the fact that the cardinality of the field
>> publishedDate
>> > > in our index is huge, almost equal to the nr of documents in the
>> index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date-formats than the default of
>> yyyy-MM-dd
>> > > HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which have the
>> format
>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>> it
>> > > could perhaps be a string, but the question then is if faceting can be
>> used
>> > > ?
>> > >
>> > > 3. Since we now already have such a huge index, is there a way to add
>> a
>> > > field afterwards and apply it to all documents without actually
>> reindexing
>> > > the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string can we just leave out the
>> > > hour/minute/second information and to reduce the cardinality and
>> improve
>> > > performance ? Example: 2009-01-01 00:00:00Z
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to work
>> > > (negates Q3). We have 8 shards as of current, what would the most
>> efficient
>> > > way be to reindexing the whole shebang ? Dump the entire database to
>> disk
>> > > (sigh), create many xml file splits and use curl in a
>> > > random/hash(numServers) manner on them ?
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.he...@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Reply via email to