How do you determine a duplicate? Solr has de-duplication built in. You may also consider hashing each document on some fields to create a consistent doc id, so that identical documents get the same id and Solr simply overwrites the earlier copy. Either approach would reduce or eliminate the duplicates and save you the time spent deleting them.
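As a rough illustration of the hashing idea, here is a minimal Python sketch. The function name `make_doc_id` and the field names `title` and `body` are made up for the example; which fields define "the same document" depends on your data. It assumes the resulting id is sent as the value of the uniqueKey field, so a second copy of a document replaces the first instead of adding a duplicate.

```python
import hashlib


def make_doc_id(doc, fields=("title", "body")):
    """Return a stable hex digest built from the chosen fields of a document.

    `fields` is a hypothetical choice; pick whatever identifies a duplicate.
    """
    h = hashlib.sha1()
    for name in fields:
        # Light normalization so trivial whitespace/case differences do not
        # produce different ids. Adjust or remove to match your data.
        value = str(doc.get(name, "")).strip().lower()
        encoded = value.encode("utf-8")
        # Prefix each value with its field name and length so that
        # ("ab", "c") and ("a", "bc") do not hash to the same id.
        h.update(f"{name}:{len(encoded)}:".encode("utf-8"))
        h.update(encoded)
    return h.hexdigest()


# The same article arriving from two sources gets the same id.
doc_a = {"source": "feed-1", "title": "Solr tips", "body": "Commit less often."}
doc_b = {"source": "feed-2", "title": "Solr Tips ", "body": "Commit less often."}
same_id = make_doc_id(doc_a) == make_doc_id(doc_b)
```

Fields that differ per source (here `source`) are deliberately left out of the hash, otherwise the two copies would never collide.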
> Hi all,
>
> we are indexing real-time documents from various sources. Since we have
> multiple sources, we encounter quite a number of duplicates which we
> delete from the index. This mostly occurs within a short timeframe;
> deletes of older documents may happen, but they do not have a high
> priority. Search results do not need to be exactly real-time (they can
> be 1 minute or so behind), but facet counts should be correct as we use
> them to visualize frequencies in the data. We are now looking for a good
> commit/merge strategy. Any advice?
>
> Thanks and best,
> Peter
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Looking-for-a-good-commit-merge-strategy-tp3582294p3582294.html
> Sent from the Solr - User mailing list archive at Nabble.com.