I just finished adding support for persisted ("backed" as I call them)
bloom filters to Guava's BloomFilter, and implemented one kind of
persisted bloom filter that works on memory-mapped files.
I have changed our Solr code so that it uses such an enhanced Guava
BloomFilter. We keep it up to date and use it for quick "definitely does
not exist" checks, which helps performance.
We also do a duplicate check, because we might get the "same" data from
our external provider numerous times. We do it using the unique-id
feature in Solr, where we make sure that (in practice) two documents
have the same id if and only if they are "the same". We encode most of
the info in a document into its id - including hashes of its textual
fields. Works like a charm. It is exactly in this case that we want to
improve performance.
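To illustrate the idea (the field names and the separator are made up for the example, not our actual scheme), a content-derived id can be built by hashing the textual fields, so identical content always yields the same id:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DocIdBuilder {

    // Hypothetical sketch: derive a deterministic unique id from a document's
    // fields, including a hash of the large textual fields, so that two
    // documents with the same content get the same id.
    static String buildId(String source, String title, String body) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(title.getBytes(StandardCharsets.UTF_8));
        md.update(body.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(source).append('!');
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String a = buildId("feedA", "Title", "Some body text");
        String b = buildId("feedA", "Title", "Some body text");
        String c = buildId("feedA", "Title", "Different body text");
        System.out.println(a.equals(b)); // same content -> same id: true
        System.out.println(a.equals(c)); // different content -> different id: false
    }
}
```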
Most of the time a document does not already exist when we do this
duplicate check (using the unique-id feature), but it takes a relatively
long time to verify that, because you have to visit the index. A bloom
filter on the id gives us a quick "no document with this id exists"
answer.
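The quick-reject check with Guava's BloomFilter looks roughly like this (a minimal sketch, not our persisted variant; requires the Guava library on the classpath, and the sizing numbers are illustrative):

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class IdBloomCheck {
    public static void main(String[] args) {
        // Size for the expected number of ids and an acceptable false-positive rate.
        BloomFilter<CharSequence> ids = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        ids.put("feedA!abc123"); // record each indexed document's id

        // false is a definite "no such id in the index" -> skip the index lookup;
        // true may be a false positive -> fall back to the real index check.
        System.out.println(ids.mightContain("feedA!abc123")); // true
        System.out.println(ids.mightContain("some-other-id"));
    }
}
```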
Regards, Per Steffensen
On 03/08/14 03:58, Umesh Prasad wrote:
+1 to Guava's BloomFilter implementation.
You can actually hook into the UpdateProcessor chain and put the logic
for updating/checking the bloom filter there.
We had a somewhat similar use case. We were using DIH, and it was
possible that the same Solr input document (meaning the same content)
would arrive many times, leading to a lot of unnecessary updates to the
index. I introduced a DuplicateDetector via the update processor chain
which kept a map of unique id -> Solr doc hash code and dropped a
document if it was a duplicate.
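Solr plumbing aside, the core of that duplicate-drop logic is just a map lookup; a minimal standalone sketch (class and method names are mine, not the actual DuplicateDetector):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetector {

    // Remember a content hash per unique id; a document is a duplicate iff
    // the same id arrives again with the same content hash.
    private final Map<String, Integer> seen = new ConcurrentHashMap<>();

    boolean isDuplicate(String id, String docContent) {
        int hash = docContent.hashCode();
        Integer previous = seen.put(id, hash);
        return previous != null && previous == hash;
    }

    public static void main(String[] args) {
        DuplicateDetector d = new DuplicateDetector();
        System.out.println(d.isDuplicate("doc1", "hello"));   // first time: false
        System.out.println(d.isDuplicate("doc1", "hello"));   // same content again: true
        System.out.println(d.isDuplicate("doc1", "changed")); // changed content: false
    }
}
```

In an actual Solr setup this check would live in the processAdd method of a custom UpdateRequestProcessor, which drops the document (returns without calling super.processAdd) when isDuplicate is true.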
There is a nice video on other usages of the update chain:
https://www.youtube.com/watch?v=qoq2QEPHefo