: number of records. We collect our data from a number of sources and each
: source produces around 50,000 docs. Each of these documents has a "sourceId"
: field indicating the source of the document. Now assuming we're indexing all
: documents from SourceA (sourceId="SourceA"), the majority of these docs are
: already in Solr and we don't want to update them. However, there might be

How do you know that the document hasn't changed since the last time you
indexed it?  If there is a guarantee that documents never change, then why
would you ever get the same document twice?  (in my experience: if a
document can be deleted, it can be modified)


If for some reason i can't fathom, docs really do never change, but you
still get the same doc from the same source repeatedly, then i would just
keep track of every doc you've ever indexed and ignore any doc whose id is
in that list when indexing ... you can generate that list from Solr, or
even query Solr for ids in real time before deciding to index them
(optimizations could be made ... if you sort your docs by ID, then you
could divide them into chunks, do range queries based on the low/high id
of the chunk, prune anything in the chunk whose id is in the result from
the query, etc...)
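
A rough sketch of that chunked pruning, in Python. This is only an illustration under assumptions: prune_already_indexed and fetch_ids_in_range are hypothetical names, and fetch_ids_in_range stands in for whatever you use to ask Solr for the ids between low and high (for example a range query on the id field such as id:[low TO high], with the values escaped as needed). It also assumes the client sorts ids the same way the index orders them.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set


def prune_already_indexed(
    docs: Iterable[Dict[str, str]],
    fetch_ids_in_range: Callable[[str, str], Set[str]],
    chunk_size: int = 1000,
) -> Iterator[Dict[str, str]]:
    """Yield only the docs whose id is not already in the index."""
    # Sort by id so each chunk covers one contiguous id range.
    ordered: List[Dict[str, str]] = sorted(docs, key=lambda d: d["id"])
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        low, high = chunk[0]["id"], chunk[-1]["id"]
        # One lookup per chunk instead of one per document.
        # fetch_ids_in_range is hypothetical and supplied by the caller.
        existing = fetch_ids_in_range(low, high)
        for doc in chunk:
            if doc["id"] not in existing:
                yield doc


# Stand-in for the index, for demonstration only.
indexed_ids = {"a1", "a3", "b2"}


def fake_fetch(low: str, high: str) -> Set[str]:
    # Pretend range query: ids already indexed between low and high.
    return {i for i in indexed_ids if low <= i <= high}


incoming = [{"id": i} for i in ["b2", "a1", "a2", "c9", "a3"]]
new_docs = list(prune_already_indexed(incoming, fake_fetch, chunk_size=2))
new_ids = [d["id"] for d in new_docs]
```

With the stand-in data above, only the docs with ids a2 and c9 are left to index. The chunk size trades the number of queries against the size of each result; 1000 is an arbitrary default, not a recommendation.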



-Hoss
