Hi Erick,

Our two "changeable" fields are used for linking between documents at the 
application level.
From Lucene's point of view they are just two searchable fields, with a stored 
term vector for one of them.
Our queries will use one of these fields together with a couple of the 
"stable" fields.

So the question is really about updating two fields in an existing Lucene index 
with more than fifty other fields.

Best regards
  Karsten

P.S. about our "linking between documents":
Our two fields are called "outgoingLinks" and "possibleIncomingLinks".

Our source documents have an abstract and some metadata.
We use regular expressions to find outgoing links in this abstract. That 
means a couple of words which indicate
 1. that the author made a reference (like "in my previous work published as 
'Very important Article' in Nature 2010, 12 page 7")
 2. that this reference contains metadata pointing to another document

Each of these links is transformed into a special key ("2010NaturNr12Page7").
On the other side, we transform the metadata into all possible keys.
This key generation grows with our knowledge of possible link patterns.
For the Lucene indexer this is a black box: there is a service which produces 
the keys for outgoingLinks and possibleIncomingLinks from our source 
(XML) documents; these keys must be searchable in Lucene/Solr.
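To make the black box a bit more concrete, here is a minimal sketch of the kind of key generation we mean. The class name, the single regular expression and the truncation rule are assumptions for illustration only; our real service knows many more link patterns:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the key-generation service (names and pattern assumed).
public class LinkKeyGenerator {

    // Matches references like: "... in Nature 2010, 12 page 7"
    private static final Pattern REFERENCE =
        Pattern.compile("in (\\w+) (\\d{4}), (\\d+) page (\\d+)");

    // Extracts outgoing-link keys from an abstract.
    public static List<String> outgoingKeys(String abstractText) {
        List<String> keys = new ArrayList<>();
        Matcher m = REFERENCE.matcher(abstractText);
        while (m.find()) {
            // year + journal + issue + page -> "2010NaturNr12Page7"
            keys.add(key(m.group(2), m.group(1), m.group(3), m.group(4)));
        }
        return keys;
    }

    // Builds the candidate keys under which a document might be cited,
    // from its own metadata. More variants would be added over time.
    public static List<String> possibleIncomingKeys(String journal, String year,
                                                    String issue, String page) {
        List<String> keys = new ArrayList<>();
        keys.add(key(year, journal, issue, page));
        return keys;
    }

    private static String key(String year, String journal, String issue, String page) {
        // Truncate the journal name to 5 chars, as in "2010NaturNr12Page7".
        String j = journal.length() > 5 ? journal.substring(0, 5) : journal;
        return year + j + "Nr" + issue + "Page" + page;
    }
}
```

On the example abstract above, outgoingKeys(...) yields "2010NaturNr12Page7", which is exactly the key that possibleIncomingKeys(...) produces from the cited document's metadata, so the two fields can be joined at query time.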

P.P.S. in Context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

-------- Original Message --------
> Date: Wed, 3 Aug 2011 09:57:03 -0400
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Subject: Re: Update some fields for all documents: LUCENE-1879 vs. 
> ParallelReader & FilterIndex

> How are these fields used? Because if they're not used for searching, you
> could put them in their own core and rebuild that index at your whim, then
> query that core when you need the relationship information.
> 
> If you have a DB backing your system, you could perhaps store the info
> there and query that (but I like the second core better <G>).
> 
> But if you could use a separate index just for the relationships, you
> wouldn't
> have to deal with the slow re-indexing of all the docs...
> 
> Best
> Erick
> 
> On Mon, Aug 1, 2011 at 4:12 AM,  <karsten-s...@gmx.de> wrote:
> > Hi lucene/solr-folk,
> >
> > Issue:
> > Our documents are stable except for two fields which are used for
> linking between the docs. So we would like to update these two fields in a 
> batch once a month (possibly once a week).
> > We cannot reindex all docs once a month, because we are using XeLDA in
> some fields for stemming (morphological analysis), and XeLDA is slow. We
> have 14 million docs (less than 100 GByte for the main index and 3 GByte for
> these two changeable fields).
> > In the next half year we will be migrating our search engine from Verity
> K2 to Solr, so we could wait for Solr 4.0
> > (
> > btw, any news about
> >
> http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
> > ?
> > ).
> >
> > Solution?
> >
> > Our issue is exactly the purpose of ParallelReader.
> > But Solr does not support ParallelReader (for a good reason:
> >
> http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
> > ).
> > So I see two possible ways to solve our issue:
> > 1. Wait for the new parallel incremental indexing
> > (
> > https://issues.apache.org/jira/browse/LUCENE-1879
> > ) and hope that Solr will integrate it.
> > Pro:
> >  - nothing to do for us except waiting.
> > Contra:
> >  - I did not find anything of the (old) patch in the current trunk.
> >
> > 2. Change the Lucene index below/without Solr in a batch:
> >   a) Each month, generate a new index with only our two changed fields
> >      (e.g. with DIH)
> >   b) Use FilterIndex and ParallelReader to mock a correct index
> >   c) "Merge" this mock index into a new index
> >      (via IndexWriter.addIndexes(IndexReader...))
> > Pro:
> >  - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
> >   should be a good example of how to do this.
> > Contra:
> >  - The relation between DocId and document index order is not a guaranteed
> feature of DIH (e.g. we will have to split the main index to ensure that
> no merge occurs in/after DIH).
> >  - To run this batch, Solr has to be stopped and restarted.
> >  - Even if we know that our two fields should change only for a subset
> of the docs, we nevertheless have to reindex these two fields for all the
> docs.
> >
> > Any comments, hints or tips?
> > Is there a third (better) way to solve our issue?
> > Is there already a working example of the second solution?
> > Will LUCENE-1879 (parallel incremental indexing) be part of Solr 4.0?
> >
> > Best regards
> >  Karsten
> >
