Re: Synchronize large number of records with Solr

Erik Hatcher Fri, 14 Sep 2007 04:25:26 -0700

Cuong,

I accomplished this (in Collex) by attaching a "batch number" to each document. When indexing a batch (or source), a GUID is generated and every document from that batch/source gets that same identifier attached to it. At the end of the indexing run, I delete everything with that source minus documents with that batch number, which gets rid of any documents in Solr that were not just (re)indexed. So in your case #1, documents are reindexed with this scheme - so if you truly need to skip a reindexing for some reason (why, though?) you'll need to come up with some other mechanism. [perhaps update could be enhanced to allow ignoring a duplicate id rather than reindexing?]
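
A minimal sketch of the batch-number scheme described above, written as an illustration rather than taken from Collex. It only builds the XML update messages; the field name "batchId" and the helper names are assumptions (only "sourceId" appears in this thread), and it assumes source and batch values contain no double quotes.

```python
import uuid
from xml.sax.saxutils import escape, quoteattr


def new_batch_id():
    """Generate one GUID per indexing run (hypothetical helper)."""
    return uuid.uuid4().hex


def add_message(docs, source_id, batch_id):
    """Build an <add> message that stamps every doc with source and batch.

    'docs' is a list of dicts mapping field name to value. The field
    name 'batchId' is an assumption made for this sketch.
    """
    parts = ["<add>"]
    for doc in docs:
        fields = dict(doc, sourceId=source_id, batchId=batch_id)
        parts.append("<doc>")
        for name, value in fields.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def stale_delete_message(source_id, batch_id):
    """Build a delete-by-query for docs of this source NOT in this batch."""
    query = 'sourceId:"%s" AND NOT batchId:"%s"' % (source_id, batch_id)
    return "<delete><query>%s</query></delete>" % escape(query)
```

Each message would then be POSTed to Solr's update handler, with the delete sent only after the whole batch has been added, followed by a commit. As noted, every document in the batch is reindexed under this scheme.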


        Erik


On Sep 14, 2007, at 3:26 AM, climbingrose wrote:

Hi all,
I've been struggling to find a good way to synchronize Solr with a large number of records. We collect our data from a number of sources and each source produces around 50,000 docs. Each of these documents has a "sourceId" field indicating the source of the document. Now assuming we're indexing all documents from SourceA (sourceId="SourceA"), the majority of these docs are already in Solr and we don't want to update them. However, there might be some docs in Solr that are not in the batch and we do want to delete them from the index. So in summary:

1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.
The tricky part is 1) because if not for that requirement, I can just simply delete all documents with sourceId="SourceA" and reindex all documents from SourceA. Any suggestions?
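
The three rules above amount to a set comparison between the ids already in Solr for a source and the ids in the incoming batch. A rough sketch, assuming both id lists can be obtained up front (how the Solr ids are fetched is not covered in this thread, and the function name is hypothetical):

```python
def plan_sync(solr_ids, batch_ids):
    """Split ids into the three cases listed in the post.

    solr_ids:  ids currently in Solr for one source
    batch_ids: ids present in the incoming batch for that source
    """
    solr_ids = set(solr_ids)
    batch_ids = set(batch_ids)
    return {
        "skip": solr_ids & batch_ids,    # 1) already in Solr, do nothing
        "index": batch_ids - solr_ids,   # 2) in the batch only, index it
        "delete": solr_ids - batch_ids,  # 3) in Solr only, remove it
    }


plan = plan_sync(solr_ids=["a", "b", "c"], batch_ids=["b", "c", "d"])
```

Here "a" would be deleted, "d" indexed, and "b" and "c" left alone.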

Thanks.

--
Regards,

Cuong Hoang
