Yes, that's exactly what I meant. I think adding "new" fields to a separate index and using ParallelReader at query time would be something to investigate at the Solr level. I can spend some time creating a patch for this if you think it is a good idea and likely to be merged into the repo, haha. It is not very mainstream, but I think everyone with more than a million docs curses a lot over having to stop the entire service for a couple of days just to add a field :)
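Something like this is what I have in mind at the Lucene level — just an untested sketch against the Lucene 2.x-era API, with made-up paths, and assuming the two indexes hold the same documents in the same order so the internal doc ids line up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddFieldsViaParallelReader {
    public static void main(String[] args) throws Exception {
        // The existing index, plus a second index holding only the new fields.
        Directory mainDir = FSDirectory.getDirectory("/path/to/main-index");
        Directory newFieldsDir = FSDirectory.getDirectory("/path/to/new-fields-index");

        // ParallelReader presents the two indexes as one: each document
        // exposes the union of the fields stored in both.
        ParallelReader parallel = new ParallelReader();
        parallel.add(IndexReader.open(mainDir));
        parallel.add(IndexReader.open(newFieldsDir));

        // Optionally bake the combined view into a single physical index
        // that could be slipped back under Solr.
        Directory mergedDir = FSDirectory.getDirectory("/path/to/merged-index");
        IndexWriter writer = new IndexWriter(mergedDir, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addIndexes(new IndexReader[] { parallel });
        writer.close();
        parallel.close();
    }
}

The fragile part is of course keeping the doc ids aligned — the parallel index has to be built in exactly the same document order — which is why I'd like to see it handled inside Solr.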
We have 60M rows now and 50 000M index size (shit, 800k per doc, man that is
too much), so we are getting into a state where reindexing is starting to
become impossible...

Keep up the fantastic work

//Marcus

On Mon, Apr 27, 2009 at 5:09 PM, Ning Li <ning.li...@gmail.com> wrote:

> You mean doc A and doc B will become one doc after adding index 2 to
> index 1? I don't think this is currently supported, either at the Lucene
> level or at the Solr level. If index 1 has m docs and index 2 has n docs,
> index 1 will have m+n docs after adding index 2 to index 1. Documents
> themselves are not modified by an index merge.
>
> Cheers,
> Ning
>
>
> On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
> <marcus.he...@tailsweep.com> wrote:
> > Hmm, looking at the code for the index merger in Solr
> > (org.apache.solr.update.DirectUpdateHandler2), I see that
> > IndexWriter.addIndexesNoOptimize(dirs) is used (a union of indexes)?
> >
> > And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
> > suggests:
> > add doc A to index1 with id=AAA,name=core1
> > add doc B to index2 with id=BBB,name=core2
> > merge the two indexes into one index which then contains both docs.
> > The resulting index will have 2 docs.
> >
> > Great, but in my case I think it should work more like this:
> >
> > add doc A to index1 with id=X,title=blog entry title,description=blog
> > entry description
> > add doc B to index2 with id=X,score=1.2
> > somehow add index2 to index1 so that id=X has score=1.2 when searching
> > in index1
> > The resulting index should have 1 doc.
> >
> > So this is not really what I want, right?
> >
> > Sorry for being a smart-ass...
> >
> > Kindly
> >
> > //Marcus
> >
> >
> > On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou
> > <marcus.he...@tailsweep.com> wrote:
> >
> >> Guys!
> >>
> >> Thanks for these insights. I think we will head for a Lucene-level
> >> merging strategy (two or more indexes).
> >> When merging, I guess the second index needs to have the same doc ids
> >> somehow. That is an internal id in Lucene, not that easy to get hold
> >> of, right?
> >>
> >> So you are saying that the Solr ExternalFileField + FunctionQuery stuff
> >> would not work very well performance-wise, or what do you mean?
> >>
> >> I sure like bleeding edge :)
> >>
> >> Cheers dudes
> >>
> >> //Marcus
> >>
> >>
> >> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> >> otis_gospodne...@yahoo.com> wrote:
> >>
> >>> I should emphasize that the PR trick I mentioned is something you'd do
> >>> at the Lucene level, outside Solr, and then you'd just slip the
> >>> modified index back into Solr.
> >>> Or, if you like the bleeding edge, perhaps you can make use of Ning
> >>> Li's Solr index merging functionality (patch in JIRA).
> >>>
> >>>
> >>> Otis
> >>> --
> >>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>>
> >>>
> >>> ----- Original Message ----
> >>> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
> >>> > To: solr-user@lucene.apache.org
> >>> > Sent: Saturday, April 25, 2009 9:41:45 AM
> >>> > Subject: Re: Date faceting - howto improve performance
> >>> >
> >>> > Yes, you could simply round the date; no need for a non-date type
> >>> > field.
> >>> > Yes, you can add a field after the fact by making use of
> >>> > ParallelReader and merging (I don't recall the details; search the
> >>> > ML for ParallelReader and Andrzej). I remember he once provided the
> >>> > working recipe.
> >>> >
> >>> >
> >>> > Otis
> >>> > --
> >>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>> >
> >>> >
> >>> > ----- Original Message ----
> >>> > > From: Marcus Herou
> >>> > > To: solr-user@lucene.apache.org
> >>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> >>> > > Subject: Date faceting - howto improve performance
> >>> > >
> >>> > > Hi.
> >>> > >
> >>> > > One of our faceting use cases:
> >>> > > We are creating trend graphs of how many blog posts contain a
> >>> > > certain term, grouped by day/week/year etc. with the nice
> >>> > > DateMathParser functions.
> >>> > >
> >>> > > The performance degrades really fast and consumes a lot of memory,
> >>> > > which forces an OOM from time to time.
> >>> > > We think it is due to the fact that the cardinality of the field
> >>> > > publishedDate in our index is huge, almost equal to the number of
> >>> > > documents in the index.
> >>> > >
> >>> > > We need to address that...
> >>> > >
> >>> > > Some questions:
> >>> > >
> >>> > > 1. Can a date field have date formats other than the default of
> >>> > > yyyy-MM-dd HH:mm:ssZ?
> >>> > >
> >>> > > 2. We are thinking of adding a field to the index with the format
> >>> > > yyyy-MM-dd to reduce the cardinality. If that field can't be a
> >>> > > date, it could perhaps be a string, but the question then is
> >>> > > whether faceting can be used?
> >>> > >
> >>> > > 3. Since we already have such a huge index, is there a way to add
> >>> > > a field afterwards and apply it to all documents without actually
> >>> > > reindexing the whole shebang?
> >>> > >
> >>> > > 4. If the field cannot be a string, can we just leave out the
> >>> > > hour/minute/second information to reduce the cardinality and
> >>> > > improve performance? Example: 2009-01-01 00:00:00Z
> >>> > >
> >>> > > 5. I am afraid that we need to reindex everything to get this to
> >>> > > work (which negates Q3). We have 8 shards at the moment; what
> >>> > > would be the most efficient way to reindex the whole shebang?
> >>> > > Dump the entire database to disk (sigh), create many XML file
> >>> > > splits and use curl on them in a random/hash(numServers) manner?
> >>> > >
> >>> > >
> >>> > > Kindly
> >>> > >
> >>> > > //Marcus
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Marcus Herou CTO and co-founder Tailsweep AB
> >>> > > +46702561312
> >>> > > marcus.he...@tailsweep.com
> >>> > > http://www.tailsweep.com/
> >>> > > http://blogg.tailsweep.com/
> >>>
> >>
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.he...@tailsweep.com
> >> http://www.tailsweep.com/
> >> http://blogg.tailsweep.com/
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.he...@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/


--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
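PS: Regarding questions 2 and 4 in the quoted thread above — at the Lucene level the day-resolution rounding could look something like the sketch below (untested; the field name "publishedDay" is made up). DateTools with Resolution.DAY collapses every timestamp from the same day into a single term, which is exactly the cardinality reduction we're after:

import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DayResolutionField {
    public static void main(String[] args) {
        // DAY resolution renders any timestamp as "yyyyMMdd" (e.g. "20090427"),
        // so the field holds at most one distinct value per day instead of
        // roughly one per document.
        String day = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);

        Document doc = new Document();
        doc.add(new Field("publishedDay", day, Field.Store.NO,
                Field.Index.NOT_ANALYZED));
        System.out.println("indexed day term: " + day);
    }
}

In Solr the equivalent would of course need a schema change and a reindex (or the ParallelReader trick above), which is question 3 all over again.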