Re: solr keep old docs

Chris Hostetter Wed, 28 Dec 2011 10:17:02 -0800

: That said, writing your own update request handler
: that detected this case isn't very difficult,
: extend UpdateRequestProcessorFactory/UpdateRequestProcessor
: and use it as a plugin.


i can't find the thread at the moment, but the general issue that has 
caused people headaches with this type of approach in the past has been 
that the performance of doing a query on every update (to see if the doc 
is already in the index) can slow things down quite a bit -- in your 
usecase it may not be a significant bottleneck, but that's the general 
issue that has come up in the past.

If you look at systems (like nutch) that do large scale crawling, they 
treat the crawl phase as independent from the indexing phase precisely for 
reasons like this -- so the crawler can dedup the documents (by unique 
URL) and eliminate duplication before ever even adding them to the index.
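
As a rough illustration of that idea (a minimal sketch, not how nutch 
itself does it), the dedup step can run over the crawled batch before 
anything is sent to the index. The function name dedup_by_url and the 
"url"/"body" field names are made up for this example:

```python
def dedup_by_url(docs):
    """Keep the first document seen for each URL and drop later repeats."""
    seen = set()      # URLs already accepted
    unique = []       # documents that will be sent to the index
    for doc in docs:
        url = doc["url"]   # hypothetical field holding the unique key
        if url in seen:
            continue       # an older doc with this URL wins; skip the newer one
        seen.add(url)
        unique.append(doc)
    return unique


# Hypothetical crawl output: the same URL shows up twice.
crawled = [
    {"url": "http://example.com/a", "body": "first version"},
    {"url": "http://example.com/b", "body": "other page"},
    {"url": "http://example.com/a", "body": "later version"},
]
to_index = dedup_by_url(crawled)
```

Only the documents in to_index would then be added, so the indexer never 
has to query the index to find out whether a doc is already there. This 
only dedups within one batch; docs already in the index from earlier runs 
would need the set of known URLs to be persisted somewhere.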

: >> > I wonder why simple the overwrite parameter doesn't work here.
        ...
: >> > 2. overwrite=false and uniqueID exists then newer doc must be skipped
: >> > since old exists.

that is not what overwrite=false does (or was ever designed to do).  
overwrite=false is a way to tell Solr that you are already certain that 
the documents being added do not exist in the index, therefore Solr can 
save time by not attempting to overwrite an existing document.  It is 
intended for situations where you are bulk loading documents, ie: doing an 
initial build of an index from a system of record (ie: a single pass over 
a database that uses the same unique key) or importing documents from a 
new system of record with a completely different id space.
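
To make the difference concrete, here is a toy in-memory model of the two 
modes. It is only a sketch of the semantics described above, not Solr 
code; the ToyIndex class and its add method are invented for the example:

```python
class ToyIndex:
    """Toy stand-in for an index; NOT Solr's implementation."""

    def __init__(self):
        self.docs = []

    def add(self, doc, overwrite=True):
        if overwrite:
            # Default mode: look for an existing doc with the same id and
            # replace it. This lookup is the work overwrite=False skips.
            self.docs = [d for d in self.docs if d["id"] != doc["id"]]
        # With overwrite=False the caller has promised the id is new, so
        # the doc is appended without checking anything.
        self.docs.append(doc)


index = ToyIndex()
index.add({"id": "1", "v": "old"})
index.add({"id": "1", "v": "new"}, overwrite=False)  # promise broken
ids = [d["id"] for d in index.docs]
```

In this model, breaking the promise leaves two docs with the same id in 
the index; the newer doc is not skipped in favor of the old one.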



-Hoss
