date:20241012

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Gary Gregory

Hi Claude,

The MD file was likely added after M2 was released and would explain why it
didn't make it to the site (yet).

In the MD, you write:

"Bloom filter comparisons are extremely fast taking on approximately five
(5) machine instructions for the simple comparison."

I think this needs clarification: Are we talking about 5 Java byte codes?
We certainly can't make claims as to how many CPU specific assembly
instructions this would correspond to, aside from guestimating a range. Did
you look at javap output to measure this? Or, maybe we should only talk
about our Java implementation and talk about Java statements or expressions.

Still reading...

TY,
Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren  wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic searching works.
>
> Example 3: Locating objects with specific properties
>
> In this example assume a number of objects with common propertie

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Gary Gregory

In the MD:

" This index arises from the observation that no target can match a filter
with a lower hamming value."

Shouldn't "hamming" be capitalized?

Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren  wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic searching works.
>
> Example 3: Locating objects with specific properties
>
> In this example assume a number of objects with common properties names. We
> construct a Bloom filter for each object. The Bloom filter comprises the
> hashes of the values of the properties (potentially with different seeds to
> differentiate between the property names). We write the objects into a data
> storage and associate the storage with their Bloom filters. The Bloom
> filter is then placed into a multidimensional Bloom filter.
>
> If we want to find all the objects where X=five and Y=blue and Z=yesterday
> we simply generate the hashes and create the Bloom f

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Gary Gregory

Hi,

The MD refers several time to "10K filters". Where does this limit come
from? If there is a baseline CPU and RAM combination this is based on, the
document should state it IMO. Or, is it based on the width of a Java int or
a Java long?

TY,
Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren  wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic searching works.
>
> Example 3: Locating objects with specific properties
>
> In this example assume a number of objects with common properties names. We
> construct a Bloom filter for each object. The Bloom filter comprises the
> hashes of the values of the properties (potentially with different seeds to
> differentiate between the property names). We write the objects into a data
> storage and associate the storage with their Bloom filters. The Bloom
> filter is then placed into a multidimensional Bloom filter.
>
> If we want to find all the objects wh

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Gary Gregory

I read in the MD:

"...he data in transit and at rest is encrypted or at least strongly hashed"

What does "strongly hashed" mean here? Is this part of the Bloom Filter
vernacular? Shouldn't we define this somewhere?

TY,
Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren  wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic searching works.
>
> Example 3: Locating objects with specific properties
>
> In this example assume a number of objects with common properties names. We
> construct a Bloom filter for each object. The Bloom filter comprises the
> hashes of the values of the properties (potentially with different seeds to
> differentiate between the property names). We write the objects into a data
> storage and associate the storage with their Bloom filters. The Bloom
> filter is then placed into a multidimensional Bloom filter.
>
> If we want to find all the objects where X=five and Y=blue an

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Gary Gregory

Thank you for explaining multidimensional filters here. It's not clear to
me yet if we need to introduce the new interface. I'll have to reread the
great docs you wrote 😀

The search() vs. locate() names are a bit confusing. Are these BF naming
conventions? If not, and if the difference is the exactitude of the match,
then using "exact" in the method name might be better. For example, Java
uses this naming in
https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Math.html#addExact(long,long)

An alternative would be to define a MatchType enum with 2 values EXACT and
... whatever the other name should be. I find enums easier to read than a
boolean but that's also a possibility. Then you'd have a single search
method with a MatchType argument (or a boolean). Just a thought.

TY!
Gary

On Sat, Oct 12, 2024, 5:07 AM Claude Warren  wrote:

> > ...
> > We can delay MultidimensionalBloomFilter unless introducing it later
> would
> > break binary compatibility. How beneficial would introducing this
> interface
> > for users?
> >
> > Gary
> >
> >
> > On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
> >
> > > This is starting tondelve into the realm of multidimensional bloom
> > filters.
> > ...
> > > We could define an interface MultidimensionalBloomFilter that could be
> > > implemented by any Bloom filter that comprises multiple filters and
> givr
> > it
> > > the method flatten().  But this is a can of worms I would like to keep
> > > sealed for a bit longer.
> > >
> > > Claude
> >
>
> I have been meaning to answer this question for a week now but time always
> seems to get away and it is a rather long discussion.
>
> Multidimensional Bloom filters are basically any collection of Bloom
> filters. However they have very different capabilities and usages. I wrote
> an introduction to Multidimensional Bloom filters (
>
> https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md
> )
> but I can’t seem to find it online. Perhaps it does not get published to
> the site, though the site descriptions are out of date.
>
> I identify two categories of Bloom filters: Gatekeeper and
> Representational.
>
> Gatekeeper is the standard use where the key for an item is hashed and
> placed in a Bloom filter and when an expensive lookup is required the Bloom
> filter is consulted first to see if the lookup should be performed.
>
> Representational filters encode properties of objects and are then used to
> find objects with the same properties. This use case is found in genomics,
> encrypted database searching, and some proposals for performing pattern
> matching in the presence of wildcards. In the representational case the
> Bloom filter is associated with a data record of some sort.
>
> Multidimensional Bloom filters can contain either category.
>
> As a couple of examples:
>
> Example 1: Tracking ageing records on disk
>
> A streaming data system tracks active records by storing them on disk for
> an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
> layered Bloom filter (a type of Multidimensional Bloom filter) is
> constructed and a new layer added whenever one becomes full (the max number
> of items have been added) or 5 minutes has elapsed (therefore they must
> have timestamps associated with them). Bloom filters are removed whenever
> their timestamp expires.
>
> When a record is to be added the layered Bloom filter is checked to see if
> the object is on the disk. If not, it is added. If so, the disk is searched
> for the record. Note, that if each Bloom filter in the layered Bloom filter
> is associated with a separate file then identifying the layers that match
> the data identifies which files to search further reducing the search time.
> In this case the layers are gatekeeper filters.
>
> Example 2: locating potential pattern matches in the presence of wildcards
>
> A system of patterns is stored where each pattern may have a wildcard
> matching one or more characters. Let’s assume the wildcard is ‘*’ and the
> patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
> digits 0-9. When a pattern is added to the system every pair of characters
> in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
> the pattern. The Bloom filter is then placed in a multidimensional Bloom
> filter.
>
> Queries are then textual strings for which potential matches are located.
> So the query is hashed in the same way and each Bloom filter in the
> multidimensional Bloom filter is checked to see if it matches. If it does
> then the associated pattern is checked to see if it matches.
>
> So the string “how now brown cow” would yield a Bloom filter that was the
> combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
> pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
> “co” thus it would match the Bloom filter for the phrase. This is similar
> to how some genomic s

[collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

2024-10-12 Thread Claude Warren

> ...
> We can delay MultidimensionalBloomFilter unless introducing it later would
> break binary compatibility. How beneficial would introducing this interface
> for users?
>
> Gary
>
>
> On Sun, Oct 6, 2024, 10:58 AM Claude Warren  wrote:
>
> > This is starting tondelve into the realm of multidimensional bloom
> filters.
> ...
> > We could define an interface MultidimensionalBloomFilter that could be
> > implemented by any Bloom filter that comprises multiple filters and givr
> it
> > the method flatten().  But this is a can of worms I would like to keep
> > sealed for a bit longer.
> >
> > Claude
>

I have been meaning to answer this question for a week now but time always
seems to get away and it is a rather long discussion.

Multidimensional Bloom filters are basically any collection of Bloom
filters. However they have very different capabilities and usages. I wrote
an introduction to Multidimensional Bloom filters (
https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md)
but I can’t seem to find it online. Perhaps it does not get published to
the site, though the site descriptions are out of date.

I identify two categories of Bloom filters: Gatekeeper and Representational.

Gatekeeper is the standard use where the key for an item is hashed and
placed in a Bloom filter and when an expensive lookup is required the Bloom
filter is consulted first to see if the lookup should be performed.

Representational filters encode properties of objects and are then used to
find objects with the same properties. This use case is found in genomics,
encrypted database searching, and some proposals for performing pattern
matching in the presence of wildcards. In the representational case the
Bloom filter is associated with a data record of some sort.

Multidimensional Bloom filters can contain either category.

As a couple of examples:

Example 1: Tracking ageing records on disk

A streaming data system tracks active records by storing them on disk for
an hour. Every 5 minutes it flushes any record that is older than 1 hour. A
layered Bloom filter (a type of Multidimensional Bloom filter) is
constructed and a new layer added whenever one becomes full (the max number
of items have been added) or 5 minutes has elapsed (therefore they must
have timestamps associated with them). Bloom filters are removed whenever
their timestamp expires.

When a record is to be added the layered Bloom filter is checked to see if
the object is on the disk. If not, it is added. If so, the disk is searched
for the record. Note, that if each Bloom filter in the layered Bloom filter
is associated with a separate file then identifying the layers that match
the data identifies which files to search further reducing the search time.
In this case the layers are gatekeeper filters.

Example 2: locating potential pattern matches in the presence of wildcards

A system of patterns is stored where each pattern may have a wildcard
matching one or more characters. Let’s assume the wildcard is ‘*’ and the
patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and
digits 0-9. When a pattern is added to the system every pair of characters
in [A-Z0-9] is hashed and added to a Bloom filter that is associated with
the pattern. The Bloom filter is then placed in a multidimensional Bloom
filter.

Queries are then textual strings for which potential matches are located.
So the query is hashed in the same way and each Bloom filter in the
multidimensional Bloom filter is checked to see if it matches. If it does
then the associated pattern is checked to see if it matches.

So the string “how now brown cow” would yield a Bloom filter that was the
combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The
pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”,
“co” thus it would match the Bloom filter for the phrase. This is similar
to how some genomic searching works.

Example 3: Locating objects with specific properties

In this example assume a number of objects with common properties names. We
construct a Bloom filter for each object. The Bloom filter comprises the
hashes of the values of the properties (potentially with different seeds to
differentiate between the property names). We write the objects into a data
storage and associate the storage with their Bloom filters. The Bloom
filter is then placed into a multidimensional Bloom filter.

If we want to find all the objects where X=five and Y=blue and Z=yesterday
we simply generate the hashes and create the Bloom filter and then search
the multidimensional Bloom filter for matches. This is the solution found
in searching encrypted data, where the clear text values are hashed and the
encrypted data written to storage.

So we have multidimensional Bloom filters that store gatekeeper type
filters (example 1) and others that store representational (example 3).
Example 2 seems to fall somewhere on the continuum be

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

[collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)

6 matches

Site Navigation

Mail list logo

Footer information