I just finished adding support for persisted ("backed", as I call them) Bloom filters to Guava's BloomFilter. I implemented one kind of persisted Bloom filter that works on memory-mapped files. I have changed our Solr code so that it uses this enhanced Guava BloomFilter. Keeping it up to date, and using it for quick "definitely does not exist" checks, will help performance.
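The memory-mapped-file idea can be sketched in plain Java without Guava: back the filter's bit array with a MappedByteBuffer so the bits survive restarts. The class name and layout here are hypothetical illustrations, not the actual implementation described above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a bit array whose storage is a memory-mapped file,
// the kind of backing a persisted ("backed") Bloom filter could sit on top of.
class MappedBitArray {
    private final MappedByteBuffer buf;

    MappedBitArray(Path file, int numBits) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // One byte holds 8 bits; round the file size up.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, (numBits + 7) / 8);
        }
    }

    void set(int bit) {
        int idx = bit >>> 3;
        buf.put(idx, (byte) (buf.get(idx) | (1 << (bit & 7))));
    }

    boolean get(int bit) {
        return (buf.get(bit >>> 3) & (1 << (bit & 7))) != 0;
    }

    // Force dirty pages to disk so the state is durable.
    void flush() {
        buf.force();
    }
}
```

Because the OS pages the file in and out, reads and writes stay fast while the state persists across process restarts.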

We also do a duplicate check, because we might get the "same" data from our external provider numerous times. We do it using the unique-id feature in Solr, where we make sure that two documents have the same id if and only if (in practice) they are "the same". We encode most of the information about a document in its id - including hashes of its textual fields. It works like a charm. It is exactly in this case that we want to improve performance. Most of the time a document does not already exist when we do this duplicate check (using the unique-id feature), but it takes a relatively long time to verify that, because you have to visit the index. With a Bloom filter on the id we can get a quick "a document with this id does not exist" answer.
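The "definitely does not exist" check works because a Bloom filter never gives false negatives. A minimal stdlib-only sketch of that property (a simplified stand-in, not Guava's actual BloomFilter API, using the standard double-hashing trick for the k probe positions):

```java
import java.util.BitSet;

// Simplified Bloom filter sketch: k bit positions per key, derived from two
// base hashes. mightContain() returning false is a guaranteed "not present";
// returning true only means "might be present".
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // i-th probe position via double hashing: h1 + i*h2 mod m.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false; // at least one bit clear => definitely absent
            }
        }
        return true;
    }
}
```

Before visiting the index for a duplicate check on a document id, a `mightContain(id) == false` result lets you skip the lookup entirely.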

Regards, Per Steffensen

On 03/08/14 03:58, Umesh Prasad wrote:
+1 to Guava's BloomFilter implementation.

You can actually hook into the UpdateProcessor chain and put the logic for
updating and checking the bloom filter there.

We had a somewhat similar use case. We were using DIH, and it was possible
that the same Solr input document (meaning the same content) would arrive many
times, leading to a lot of unnecessary updates to the index. I introduced a
DuplicateDetector via the update processor chain, which kept a map of
unique ID --> Solr doc hash code and dropped a document if it was a
duplicate.
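The core of that detector can be sketched outside Solr. The class and method names here are hypothetical; a real implementation would extend Solr's UpdateRequestProcessor and pull the unique key and content hash from the SolrInputDocument in processAdd().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the duplicate-drop logic: remember the content hash
// last indexed for each unique ID, and drop updates whose hash is unchanged.
class DuplicateDetector {
    private final Map<String, Integer> lastSeenHash = new HashMap<>();

    // Returns true if the document should be indexed,
    // false if it is an exact duplicate of what was last seen for this id.
    boolean shouldIndex(String uniqueId, int docContentHash) {
        Integer previous = lastSeenHash.put(uniqueId, docContentHash);
        return previous == null || previous != docContentHash;
    }
}
```

Note this trades memory for indexing work: the map grows with the number of distinct ids, which is where a persisted or Bloom-filter-backed structure becomes attractive.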

There is a nice video on other uses of the update chain:

https://www.youtube.com/watch?v=qoq2QEPHefo
