: number of records. We collect our data from a number of sources and each
: source produces around 50,000 docs. Each of these document has a "sourceId"
: field indicating the source of the document. Now assuming we're indexing all
: documents from SourceA (sourceId="SourceA"), majority of these docs are
: already in Solr and we don't want to update them. However, there might be

How do you know that the document hasn't changed since the last time you indexed it? If there is a guarantee that documents never change, then why would you ever get the same document twice? (In my experience: if a document can be deleted, it can be modified.)

If, for some reason I can't fathom, docs really do never change, but you still get the same doc from the same source repeatedly, then I would just keep track of every doc you've ever indexed and ignore any doc whose id is in that list when indexing ... you can generate that list from Solr, or even query Solr for ids in real time before deciding to index them. (Optimizations could be made ... if you sort your docs by ID, then you could divide them into chunks, do range queries based on the low/high id of the chunk, prune anything in the chunk whose id is in the result from the query, etc...)

-Hoss
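
A rough sketch of the chunked range-query idea above, in Python. Everything named here is hypothetical: `prune_already_indexed`, `fetch_ids_in_range` and `make_fake_index` are illustration only, and the in-memory set stands in for Solr. Against a real index, the lookup would presumably be a range query along the lines of `id:[low TO high]` that returns only the id field, and it assumes the ids sort the same way on the client as they do in the index.

```python
from typing import Callable, Iterable, Iterator, List, Set


def chunked(sorted_ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(sorted_ids), size):
        yield sorted_ids[start:start + size]


def prune_already_indexed(
    doc_ids: Iterable[str],
    fetch_ids_in_range: Callable[[str, str], Set[str]],
    chunk_size: int = 1000,
) -> List[str]:
    """Return the ids that are not yet in the index, in sorted order.

    `fetch_ids_in_range(low, high)` is a hypothetical callback that returns
    the set of already-indexed ids between low and high, inclusive.
    """
    sorted_ids = sorted(set(doc_ids))
    new_ids: List[str] = []
    for chunk in chunked(sorted_ids, chunk_size):
        # One lookup per chunk, bounded by the chunk's lowest and highest id.
        existing = fetch_ids_in_range(chunk[0], chunk[-1])
        new_ids.extend(doc_id for doc_id in chunk if doc_id not in existing)
    return new_ids


def make_fake_index(indexed: Set[str]) -> Callable[[str, str], Set[str]]:
    """Build an in-memory stand-in for the real range query (assumption)."""
    def fetch(low: str, high: str) -> Set[str]:
        return {doc_id for doc_id in indexed if low <= doc_id <= high}
    return fetch


if __name__ == "__main__":
    fetch = make_fake_index({"a1", "a3", "b2"})
    incoming = ["b2", "a1", "a2", "c9", "a3"]
    print(prune_already_indexed(incoming, fetch, chunk_size=2))
```

With 50,000 docs per source and a chunk size of 1000, that is 50 lookups instead of 50,000. The chunk size is a guess and would need tuning against how sparse the ids are within each range.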