Cuong,
I accomplished this (in Collex) by attaching a "batch number" to each
document. When indexing a batch (or source), a GUID is generated and
every document from that batch/source gets that same identifier
attached to it. At the end of the indexing run, I delete everything
from that source except the documents carrying the current batch
number, which gets rid of any documents in Solr that were not just
(re)indexed. Note that in your case #1 documents are reindexed under
this scheme, so if you truly need to skip a reindexing for some reason
(why, though?) you'll need to come up with some other mechanism.
[Perhaps update could be enhanced to allow ignoring a duplicate id
rather than reindexing?]
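
To make the batch-number scheme concrete, here is a minimal sketch in
Python. It is not the Collex code: the field name "batchId" and both
helper functions are hypothetical, "sourceId" is the field from your
message, and the sketch assumes the ids contain no double quotes. It
only builds the tagged documents and the delete-by-query message;
posting them to Solr's update handler and committing is left out.

```python
import uuid
from xml.sax.saxutils import escape


def tag_batch(docs, batch_id):
    """Return copies of the docs with the batch identifier attached.

    "batchId" is an assumed field name; it must exist in the schema.
    """
    return [dict(doc, batchId=batch_id) for doc in docs]


def stale_delete_xml(source_id, batch_id):
    """Build a delete-by-query message that removes every document
    from this source that does not carry the current batch id."""
    query = 'sourceId:"%s" AND NOT batchId:"%s"' % (source_id, batch_id)
    return "<delete><query>%s</query></delete>" % escape(query)


# One GUID per indexing run, shared by every document in the run.
batch_id = str(uuid.uuid4())

docs = tag_batch(
    [
        {"id": "1", "sourceId": "SourceA"},
        {"id": "2", "sourceId": "SourceA"},
    ],
    batch_id,
)

# Sent after all docs have been (re)indexed, followed by a commit.
delete_msg = stale_delete_xml("SourceA", batch_id)
```

The delete is issued last so that the old documents stay searchable
until their replacements have been added.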
Erik
On Sep 14, 2007, at 3:26 AM, climbingrose wrote:
Hi all,
I've been struggling to find a good way to synchronize Solr with a
large number of records. We collect our data from a number of sources,
and each source produces around 50,000 docs. Each of these documents
has a "sourceId" field indicating the source of the document. Now,
assuming we're indexing all documents from SourceA
(sourceId="SourceA"), the majority of these docs are already in Solr
and we don't want to update them. However, there might be some docs in
Solr that are not in the batch, and we do want to delete them from the
index. So in summary:
1) If a doc is already in Solr, do nothing
2) If a doc is in the batch but not in Solr, index it
3) If a doc is in Solr but not in the batch, remove it from Solr.
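
Put as set operations on document ids, the three cases look roughly
like the sketch below. The function name and the sample ids are made
up for illustration, and it assumes the ids already in Solr for the
source can be listed cheaply, which may not hold for large sources.

```python
def plan_sync(indexed_ids, batch_ids):
    """Split ids into the three cases: skip, add, delete."""
    indexed, batch = set(indexed_ids), set(batch_ids)
    return {
        "skip": sorted(indexed & batch),    # case 1: already in Solr
        "add": sorted(batch - indexed),     # case 2: only in the batch
        "delete": sorted(indexed - batch),  # case 3: only in Solr
    }


plan = plan_sync(["a", "b", "c"], ["b", "c", "d"])
# plan == {"skip": ["b", "c"], "add": ["d"], "delete": ["a"]}
```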
The tricky part is 1), because if not for that requirement, I could
simply delete all documents with sourceId="SourceA" and reindex all
documents from SourceA. Any suggestions?
Thanks.
--
Regards,
Cuong Hoang