Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
Hi Claude, The MD file was likely added after M2 was released and would explain why it didn't make it to the site (yet). In the MD, you write: "Bloom filter comparisons are extremely fast taking on approximately five (5) machine instructions for the simple comparison." I think this needs clarification: Are we talking about 5 Java byte codes? We certainly can't make claims as to how many CPU specific assembly instructions this would correspond to, aside from guestimating a range. Did you look at javap output to measure this? Or, maybe we should only talk about our Java implementation and talk about Java statements or expressions. Still reading... TY, Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic searching works. > > Example 3: Locating objects with specific properties > > In this example assume a number of objects with common propertie
Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
In the MD: " This index arises from the observation that no target can match a filter with a lower hamming value." Shouldn't "hamming" be capitalized? Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic searching works. > > Example 3: Locating objects with specific properties > > In this example assume a number of objects with common properties names. We > construct a Bloom filter for each object. The Bloom filter comprises the > hashes of the values of the properties (potentially with different seeds to > differentiate between the property names). We write the objects into a data > storage and associate the storage with their Bloom filters. The Bloom > filter is then placed into a multidimensional Bloom filter. > > If we want to find all the objects where X=five and Y=blue and Z=yesterday > we simply generate the hashes and create the Bloom f
Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
Hi, The MD refers several time to "10K filters". Where does this limit come from? If there is a baseline CPU and RAM combination this is based on, the document should state it IMO. Or, is it based on the width of a Java int or a Java long? TY, Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic searching works. > > Example 3: Locating objects with specific properties > > In this example assume a number of objects with common properties names. We > construct a Bloom filter for each object. The Bloom filter comprises the > hashes of the values of the properties (potentially with different seeds to > differentiate between the property names). We write the objects into a data > storage and associate the storage with their Bloom filters. The Bloom > filter is then placed into a multidimensional Bloom filter. > > If we want to find all the objects wh
Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
I read in the MD: "...he data in transit and at rest is encrypted or at least strongly hashed" What does "strongly hashed" mean here? Is this part of the Bloom Filter vernacular? Shouldn't we define this somewhere? TY, Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic searching works. > > Example 3: Locating objects with specific properties > > In this example assume a number of objects with common properties names. We > construct a Bloom filter for each object. The Bloom filter comprises the > hashes of the values of the properties (potentially with different seeds to > differentiate between the property names). We write the objects into a data > storage and associate the storage with their Bloom filters. The Bloom > filter is then placed into a multidimensional Bloom filter. > > If we want to find all the objects where X=five and Y=blue an
Re: [collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
Thank you for explaining multidimensional filters here. It's not clear to me yet if we need to introduce the new interface. I'll have to reread the great docs you wrote 😀 The search() vs. locate() names are a bit confusing. Are these BF naming conventions? If not, and if the difference is the exactitude of the match, then using "exact" in the method name might be better. For example, Java uses this naming in https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/Math.html#addExact(long,long) An alternative would be to define a MatchType enum with 2 values EXACT and ... whatever the other name should be. I find enums easier to read than a boolean but that's also a possibility. Then you'd have a single search method with a MatchType argument (or a boolean). Just a thought. TY! Gary On Sat, Oct 12, 2024, 5:07 AM Claude Warren wrote: > > ... > > We can delay MultidimensionalBloomFilter unless introducing it later > would > > break binary compatibility. How beneficial would introducing this > interface > > for users? > > > > Gary > > > > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > > > This is starting tondelve into the realm of multidimensional bloom > > filters. > > ... > > > We could define an interface MultidimensionalBloomFilter that could be > > > implemented by any Bloom filter that comprises multiple filters and > givr > > it > > > the method flatten(). But this is a can of worms I would like to keep > > > sealed for a bit longer. > > > > > > Claude > > > > I have been meaning to answer this question for a week now but time always > seems to get away and it is a rather long discussion. > > Multidimensional Bloom filters are basically any collection of Bloom > filters. However they have very different capabilities and usages. I wrote > an introduction to Multidimensional Bloom filters ( > > https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md > ) > but I can’t seem to find it online. Perhaps it does not get published to > the site, though the site descriptions are out of date. > > I identify two categories of Bloom filters: Gatekeeper and > Representational. > > Gatekeeper is the standard use where the key for an item is hashed and > placed in a Bloom filter and when an expensive lookup is required the Bloom > filter is consulted first to see if the lookup should be performed. > > Representational filters encode properties of objects and are then used to > find objects with the same properties. This use case is found in genomics, > encrypted database searching, and some proposals for performing pattern > matching in the presence of wildcards. In the representational case the > Bloom filter is associated with a data record of some sort. > > Multidimensional Bloom filters can contain either category. > > As a couple of examples: > > Example 1: Tracking ageing records on disk > > A streaming data system tracks active records by storing them on disk for > an hour. Every 5 minutes it flushes any record that is older than 1 hour. A > layered Bloom filter (a type of Multidimensional Bloom filter) is > constructed and a new layer added whenever one becomes full (the max number > of items have been added) or 5 minutes has elapsed (therefore they must > have timestamps associated with them). Bloom filters are removed whenever > their timestamp expires. > > When a record is to be added the layered Bloom filter is checked to see if > the object is on the disk. If not, it is added. If so, the disk is searched > for the record. Note, that if each Bloom filter in the layered Bloom filter > is associated with a separate file then identifying the layers that match > the data identifies which files to search further reducing the search time. > In this case the layers are gatekeeper filters. > > Example 2: locating potential pattern matches in the presence of wildcards > > A system of patterns is stored where each pattern may have a wildcard > matching one or more characters. Let’s assume the wildcard is ‘*’ and the > patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and > digits 0-9. When a pattern is added to the system every pair of characters > in [A-Z0-9] is hashed and added to a Bloom filter that is associated with > the pattern. The Bloom filter is then placed in a multidimensional Bloom > filter. > > Queries are then textual strings for which potential matches are located. > So the query is hashed in the same way and each Bloom filter in the > multidimensional Bloom filter is checked to see if it matches. If it does > then the associated pattern is checked to see if it matches. > > So the string “how now brown cow” would yield a Bloom filter that was the > combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The > pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, > “co” thus it would match the Bloom filter for the phrase. This is similar > to how some genomic s
[collections] Multidimensional Bloom Filters (was BloomFilterExtractor.flatten)
> ... > We can delay MultidimensionalBloomFilter unless introducing it later would > break binary compatibility. How beneficial would introducing this interface > for users? > > Gary > > > On Sun, Oct 6, 2024, 10:58 AM Claude Warren wrote: > > > This is starting tondelve into the realm of multidimensional bloom > filters. > ... > > We could define an interface MultidimensionalBloomFilter that could be > > implemented by any Bloom filter that comprises multiple filters and givr > it > > the method flatten(). But this is a can of worms I would like to keep > > sealed for a bit longer. > > > > Claude > I have been meaning to answer this question for a week now but time always seems to get away and it is a rather long discussion. Multidimensional Bloom filters are basically any collection of Bloom filters. However they have very different capabilities and usages. I wrote an introduction to Multidimensional Bloom filters ( https://github.com/apache/commons-collections/blob/master/src/site/markdown/bloomFilters/multidimensional.md) but I can’t seem to find it online. Perhaps it does not get published to the site, though the site descriptions are out of date. I identify two categories of Bloom filters: Gatekeeper and Representational. Gatekeeper is the standard use where the key for an item is hashed and placed in a Bloom filter and when an expensive lookup is required the Bloom filter is consulted first to see if the lookup should be performed. Representational filters encode properties of objects and are then used to find objects with the same properties. This use case is found in genomics, encrypted database searching, and some proposals for performing pattern matching in the presence of wildcards. In the representational case the Bloom filter is associated with a data record of some sort. Multidimensional Bloom filters can contain either category. As a couple of examples: Example 1: Tracking ageing records on disk A streaming data system tracks active records by storing them on disk for an hour. Every 5 minutes it flushes any record that is older than 1 hour. A layered Bloom filter (a type of Multidimensional Bloom filter) is constructed and a new layer added whenever one becomes full (the max number of items have been added) or 5 minutes has elapsed (therefore they must have timestamps associated with them). Bloom filters are removed whenever their timestamp expires. When a record is to be added the layered Bloom filter is checked to see if the object is on the disk. If not, it is added. If so, the disk is searched for the record. Note, that if each Bloom filter in the layered Bloom filter is associated with a separate file then identifying the layers that match the data identifies which files to search further reducing the search time. In this case the layers are gatekeeper filters. Example 2: locating potential pattern matches in the presence of wildcards A system of patterns is stored where each pattern may have a wildcard matching one or more characters. Let’s assume the wildcard is ‘*’ and the patterns are restricted to upper case Latin characters A-Z, space, ‘*’, and digits 0-9. When a pattern is added to the system every pair of characters in [A-Z0-9] is hashed and added to a Bloom filter that is associated with the pattern. The Bloom filter is then placed in a multidimensional Bloom filter. Queries are then textual strings for which potential matches are located. So the query is hashed in the same way and each Bloom filter in the multidimensional Bloom filter is checked to see if it matches. If it does then the associated pattern is checked to see if it matches. So the string “how now brown cow” would yield a Bloom filter that was the combination of the hashes of “ho”, “ow”, “no”, “br”, “wn”, and “co”. The pattern “how now * cow” would be indexed with hashes of “ho”, “ow” “no”, “co” thus it would match the Bloom filter for the phrase. This is similar to how some genomic searching works. Example 3: Locating objects with specific properties In this example assume a number of objects with common properties names. We construct a Bloom filter for each object. The Bloom filter comprises the hashes of the values of the properties (potentially with different seeds to differentiate between the property names). We write the objects into a data storage and associate the storage with their Bloom filters. The Bloom filter is then placed into a multidimensional Bloom filter. If we want to find all the objects where X=five and Y=blue and Z=yesterday we simply generate the hashes and create the Bloom filter and then search the multidimensional Bloom filter for matches. This is the solution found in searching encrypted data, where the clear text values are hashed and the encrypted data written to storage. So we have multidimensional Bloom filters that store gatekeeper type filters (example 1) and others that store representational (example 3). Example 2 seems to fall somewhere on the continuum be