How do you determine a duplicate?

Solr has de-duplication built in. Alternatively, you could hash each
document on a few chosen fields to create a consistent doc id, so that
identical documents always get the same id and Solr simply overwrites
the earlier copy. Either approach would reduce or eliminate duplicates
in the index and save you the time spent deleting them afterwards.
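
To make the hashing idea concrete, here is a minimal sketch in Python.
It is only an illustration: the field names ("title", "body"), the
normalization (trim and lowercase) and the helper name make_doc_id are
assumptions of mine, not anything from your setup. Which fields define
"the same document" depends on the answer to the question above.

```python
import hashlib


def make_doc_id(doc, fields=("title", "body")):
    """Return a deterministic id built from the chosen fields of `doc`.

    `fields` is a hypothetical choice; use whatever fields decide that
    two documents are duplicates in your data.
    """
    digest = hashlib.sha1()
    for name in fields:
        # Assumed normalization: trim whitespace and ignore case.
        value = str(doc.get(name, "")).strip().lower()
        # Separate name and value with NUL bytes so that adjacent
        # fields cannot run together and collide by accident.
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# The same story arriving from two sources maps to one id.
doc_a = {"title": "Solr tips", "body": "Commit less often.", "source": "feed-1"}
doc_b = {"title": "solr tips ", "body": "Commit less often.", "source": "feed-2"}
same_id = make_doc_id(doc_a) == make_doc_id(doc_b)
```

If the resulting value is sent as the document's unique key, a second
copy replaces the first instead of being added next to it, so no
separate delete is needed. Note that "source" is deliberately left out
of the hash here; including it would make every copy unique again.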


> Hi all,
>
> we are indexing real-time documents from various sources. Since we have
> multiple sources, we encounter quite a number of duplicates, which we
> delete from the index. This mostly occurs within a short timeframe;
> deletes of older documents may happen, but they do not have a high
> priority. Search results do not need to be exactly real-time (they can
> be a minute or so behind), but facet counts should be correct, as we
> use them to visualize frequencies in the data. We are now looking for
> a good commit/merge strategy. Any advice?
>
> Thanks and best,
> Peter
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Looking-for-a-good-commit-merge-strategy-tp3582294p3582294.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
