Hi all,

I've been struggling to find a good way to synchronize Solr with a large
number of records. We collect our data from a number of sources and each
source produces around 50,000 docs. Each of these documents has a "sourceId"
field indicating the source of the document. Now assuming we're indexing all
documents from SourceA (sourceId="SourceA"), the majority of these docs are
already in Solr and we don't want to update them. However, there might be
some docs in Solr that are not in the batch, and we do want to delete them from the
index. So in summary:

1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.
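
In set terms, cases 2) and 3) are just two set differences over document IDs.
Here is a rough sketch of that, not a tested solution. It assumes every doc
has a unique key, and that the IDs currently in Solr for the source can be
fetched up front (for example with a query on sourceId that returns only the
key field). The function name plan_sync and the sample IDs are made up for
illustration; nothing here talks to Solr.

```python
def plan_sync(batch_ids, solr_ids):
    """Work out what to index and what to delete for one source.

    batch_ids: unique keys of the docs in the incoming batch (assumed input).
    solr_ids:  unique keys already in Solr for the same sourceId (assumed input).
    """
    batch_ids = set(batch_ids)
    solr_ids = set(solr_ids)

    # Case 1: IDs present on both sides are left alone, so they appear in
    # neither result set.
    to_index = batch_ids - solr_ids   # case 2: in the batch, not in Solr
    to_delete = solr_ids - batch_ids  # case 3: in Solr, not in the batch
    return to_index, to_delete


if __name__ == "__main__":
    to_index, to_delete = plan_sync(["a", "b", "c"], ["b", "c", "d"])
    print(sorted(to_index), sorted(to_delete))  # prints ['a'] ['d']
```

With around 50,000 docs per source, both ID sets should fit in memory easily,
so the open part is really how best to get the IDs out of Solr and send the
deletes.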

The tricky part is 1) because, if not for that requirement, I could simply
delete all documents with sourceId="SourceA" and reindex all documents from
SourceA. Any suggestions?

Thanks.

-- 
Regards,

Cuong Hoang
