Cuong,

I accomplished this (in Collex) by attaching a "batch number" to each document. When indexing a batch (or source), a GUID is generated and every document from that batch/source gets that same identifier attached to it. At the end of the indexing run, I delete everything with that source except the documents carrying the current batch number, which gets rid of any documents in Solr that were not just (re)indexed. Note that in your case #1, documents are reindexed under this scheme, so if you truly need to skip reindexing for some reason (why, though?) you'll need to come up with some other mechanism. [Perhaps update could be enhanced to allow ignoring a duplicate id rather than reindexing?]
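
A rough sketch of the batch-number scheme, in Python for illustration only (this is not the Collex code). The field name "batchId" and all helper names are hypothetical; "sourceId" is the field from your message. Posting the documents and the delete message to Solr's update handler is left out; the sketch only builds the payloads.

```python
import uuid
from xml.sax.saxutils import escape


def new_batch_id() -> str:
    """Generate one GUID per indexing run."""
    return str(uuid.uuid4())


def tag_documents(docs, source_id, batch_id):
    """Return copies of the docs with the source and batch identifiers attached.

    "batchId" is a hypothetical field name; it would have to exist in the schema.
    """
    return [{**doc, "sourceId": source_id, "batchId": batch_id} for doc in docs]


def stale_delete_query(source_id, batch_id):
    """Query matching everything from this source that was NOT just (re)indexed."""
    return f'sourceId:"{source_id}" AND NOT batchId:"{batch_id}"'


def delete_message(query):
    """Wrap the query in a Solr XML delete-by-query message."""
    return f"<delete><query>{escape(query)}</query></delete>"


if __name__ == "__main__":
    batch_id = new_batch_id()
    docs = tag_documents([{"id": "doc-1"}, {"id": "doc-2"}], "SourceA", batch_id)
    # 1. add `docs` to Solr (every doc from the source is reindexed)
    # 2. at the end of the run, send the delete message below, then commit
    print(delete_message(stale_delete_query("SourceA", batch_id)))
```

The delete is sent only after the whole batch has been added, so a failed run leaves the previous documents in place.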

        Erik


On Sep 14, 2007, at 3:26 AM, climbingrose wrote:

Hi all,

I've been struggling to find a good way to synchronize Solr with a large number of records. We collect our data from a number of sources and each source produces around 50,000 docs. Each of these documents has a "sourceId" field indicating the source of the document. Now assuming we're indexing all documents from SourceA (sourceId="SourceA"), the majority of these docs are already in Solr and we don't want to update them. However, there might be some docs in Solr that are not in the batch, and we do want to delete them from the index. So in summary:

1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.
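
For illustration, the three rules above amount to a set comparison on document ids. This is only a sketch: it assumes every doc has a unique id and that the ids currently in Solr for the source have already been fetched somehow (not shown). The function name plan_sync is made up.

```python
def plan_sync(ids_in_solr, ids_in_batch):
    """Split document ids into the three cases listed above."""
    in_solr = set(ids_in_solr)
    in_batch = set(ids_in_batch)
    return {
        "skip": sorted(in_solr & in_batch),    # 1) already in Solr: do nothing
        "index": sorted(in_batch - in_solr),   # 2) in the batch only: index it
        "delete": sorted(in_solr - in_batch),  # 3) in Solr only: remove it
    }


if __name__ == "__main__":
    print(plan_sync(["a", "b", "c"], ["b", "c", "d"]))
```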

The tricky part is 1), because if not for that requirement, I could simply delete all documents with sourceId="SourceA" and reindex all documents from
SourceA. Any suggestions?

Thanks.

--
Regards,

Cuong Hoang
