I just finished adding support for persisted ("backed" as I call them)
bloom filters to Guava's BloomFilter, and implemented one kind of
persisted bloom filter that works on memory-mapped files.
I have changed our Solr code so that it uses such an enhanced Guava
BloomFilter. We keep it up to date and use it for quick "definitely does
not exist" checks, which helps performance.
We also do a duplicate check, because we might get the "same" data from
our external provider numerous times. We do it using the unique-id
feature in Solr, where we make sure that (in practice) two documents
have the same id if and only if they are "the same". We encode most of
the info in a document into its id - including hashes of its textual
fields. Works like a charm. It is exactly in this case that we want to
improve performance.
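To illustrate the idea (the field names and the separator are made up for the example, not our actual scheme), a content-derived id can be built by hashing the textual fields, so identical content always yields the same id:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DocIdBuilder {

    // Hypothetical sketch: derive a deterministic unique id from a document's
    // fields, including a hash of the large textual fields, so that two
    // documents with the same content get the same id.
    static String buildId(String source, String title, String body) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(title.getBytes(StandardCharsets.UTF_8));
        md.update(body.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(source).append('!');
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String a = buildId("feedA", "Title", "Some body text");
        String b = buildId("feedA", "Title", "Some body text");
        String c = buildId("feedA", "Title", "Different body text");
        System.out.println(a.equals(b)); // same content -> same id: true
        System.out.println(a.equals(c)); // different content -> different id: false
    }
}
```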
Most of the time a document does not already exist when we do this
duplicate check (using the unique-id feature), but it takes a relatively
long time to verify that, because you have to visit the index. A bloom
filter on the id gives us a quick "no document with this id exists"
answer.
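The quick-reject check with Guava's BloomFilter looks roughly like this (a minimal sketch, not our persisted variant; requires the Guava library on the classpath, and the sizing numbers are illustrative):

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class IdBloomCheck {
    public static void main(String[] args) {
        // Size for the expected number of ids and an acceptable false-positive rate.
        BloomFilter<CharSequence> ids = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        ids.put("feedA!abc123"); // record each indexed document's id

        // false is a definite "no such id in the index" -> skip the index lookup;
        // true may be a false positive -> fall back to the real index check.
        System.out.println(ids.mightContain("feedA!abc123")); // true
        System.out.println(ids.mightContain("some-other-id"));
    }
}
```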
Regards, Per Steffensen
On 03/08/14 03:58, Umesh Prasad wrote:
+1 to Guava's BloomFilter implementation.
You can actually hook into the UpdateProcessor chain and put the logic
for updating/checking the bloom filter there.
We had a somewhat similar use case. We were using DIH, and it was
possible that the same Solr input document (meaning the same content)
would arrive many times, leading to a lot of unnecessary updates to the
index. I introduced a DuplicateDetector via the update processor chain
which kept a map of unique id -> Solr doc hash code and dropped a
document if it was a duplicate.
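Solr plumbing aside, the core of that duplicate-drop logic is just a map lookup; a minimal standalone sketch (class and method names are mine, not the actual DuplicateDetector):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetector {

    // Remember a content hash per unique id; a document is a duplicate iff
    // the same id arrives again with the same content hash.
    private final Map<String, Integer> seen = new ConcurrentHashMap<>();

    boolean isDuplicate(String id, String docContent) {
        int hash = docContent.hashCode();
        Integer previous = seen.put(id, hash);
        return previous != null && previous == hash;
    }

    public static void main(String[] args) {
        DuplicateDetector d = new DuplicateDetector();
        System.out.println(d.isDuplicate("doc1", "hello"));   // first time: false
        System.out.println(d.isDuplicate("doc1", "hello"));   // same content again: true
        System.out.println(d.isDuplicate("doc1", "changed")); // changed content: false
    }
}
```

In an actual Solr setup this check would live in the processAdd method of a custom UpdateRequestProcessor, which drops the document (returns without calling super.processAdd) when isDuplicate is true.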
There is a nice video on other usages of the update chain:
https://www.youtube.com/watch?v=qoq2QEPHefo