You could MD4 the parts you care about, store that, fetch it and compare.
If there is a reliable timestamp, you could use that. But that would be
app-dependent.
In general, you need to store some info about each source document
and figure out whether it is new. This gets much hairier with a web
spider.
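A rough sketch of that hash-and-compare idea, in Python. The doc shape, the field names, and the stored_hashes store are all assumptions for illustration; MD5 stands in for MD4 here, since Python's hashlib only guarantees MD5 is available:

```python
import hashlib

def content_fingerprint(doc, fields=("title", "body")):
    """Hash only the fields you care about, so cosmetic changes
    elsewhere don't force a reindex. (MD5 substituted for MD4.)"""
    h = hashlib.md5()
    for field in fields:
        h.update(doc.get(field, "").encode("utf-8"))
        h.update(b"\x00")  # separator so field boundaries are unambiguous
    return h.hexdigest()

def needs_reindex(doc, stored_hashes):
    """stored_hashes maps doc id -> fingerprint from the last run."""
    fp = content_fingerprint(doc)
    if stored_hashes.get(doc["id"]) == fp:
        return False          # unchanged, skip it
    stored_hashes[doc["id"]] = fp
    return True               # new or modified, reindex it
```

In practice stored_hashes would live in a database or in the index itself rather than in memory.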
Hi Erik,
>>So in your case #1, documents are reindexed with this scheme - so if you
>>truly need to skip a reindexing for some reason (why, though?) you'll
>>need to come up with some other mechanism. [perhaps update could be
>>enhanced to allow ignoring a duplicate id rather than reindexing?]
I
Cuong,
I accomplished this (in Collex) by attaching a "batch number" to each
document. When indexing a batch (or source), a GUID is generated and
every document from that batch/source gets that same identifier
attached to it. At the end of the indexing run, I delete everything
with that sour
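A minimal sketch of that batch-number scheme. The solr_add and solr_delete_by_query callables are hypothetical stand-ins for whatever Solr client is in use, and since the message is cut off, the exact delete query is my guess (same source, any older batch id):

```python
import uuid

def index_batch(docs, source_id, solr_add, solr_delete_by_query):
    """Tag every document in this run with a fresh batch GUID, then
    delete anything from the same source still carrying an old one."""
    batch_id = str(uuid.uuid4())
    for doc in docs:
        doc["sourceId"] = source_id
        doc["batchId"] = batch_id
        solr_add(doc)
    # Anything from this source without the new batchId is stale.
    solr_delete_by_query(
        'sourceId:"%s" AND NOT batchId:"%s"' % (source_id, batch_id)
    )
    return batch_id
```

The nice property is that you never need to diff old and new record sets: whatever the source stopped producing simply fails to get the new batch id and is swept away at the end.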
Hi all,
I've been struggling to find a good way to synchronize Solr with a large
number of records. We collect our data from a number of sources and each
source produces around 50,000 docs. Each of these documents has a "sourceId"
field indicating the source of the document. Now assuming we're indexing all
documents from SourceA (sourceId="SourceA"), the majority of these d