Hi all, I've been struggling to find a good way to keep Solr synchronized with a large number of records. We collect our data from a number of sources, and each source produces around 50,000 docs. Each of these documents has a "sourceId" field indicating which source it came from. Now suppose we're indexing the batch of all documents from SourceA (sourceId="SourceA"). The majority of these docs are already in Solr and we don't want to update them. However, there might be some docs in Solr that are not in the batch, and we do want to delete those from the index. So in summary:
1) If a doc is already in Solr, do nothing.
2) If a doc is in the batch but not in Solr, index it.
3) If a doc is in Solr but not in the batch, remove it from Solr.

The tricky part is 1): if not for that requirement, I could simply delete all documents with sourceId="SourceA" and reindex everything from SourceA.

Any suggestions?

Thanks.

--
Regards,
Cuong Hoang
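
P.S. To make the three cases concrete, here is a rough sketch of the outcome I'm after, written with plain Python sets rather than any Solr API. The function name plan_sync and the sample IDs are made up for illustration, and it assumes every doc has a unique ID and that the IDs currently indexed for a given sourceId can be listed somehow.

```python
def plan_sync(batch_ids, indexed_ids):
    """Split document IDs into the three cases from the summary.

    batch_ids:   IDs of the docs in the incoming batch for one source
    indexed_ids: IDs already in the index for that same source (assumed
                 to be obtainable; how to fetch them is not shown here)
    """
    batch = set(batch_ids)
    indexed = set(indexed_ids)
    return {
        "unchanged": batch & indexed,   # case 1: already in Solr, do nothing
        "to_index": batch - indexed,    # case 2: in the batch only, index it
        "to_delete": indexed - batch,   # case 3: in Solr only, remove it
    }


# Hypothetical sample data: "a" is new, "d" has disappeared from the source.
plan = plan_sync(["a", "b", "c"], ["b", "c", "d"])
```

With about 50,000 IDs per source, both sets fit in memory easily, so the open question is really how best to get the indexed IDs out of Solr and apply the adds and deletes.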