Thanks DuyHai,
I understand that the problem with a bloom filter on all row keys & column
names is memory usage. However, if a CF has only hundreds of columns per row,
the total number of columns is much smaller, so a bloom filter would be
feasible in that case, right? Is there a good way to switch the bloom filter
between row keys only and row keys + column names, either automatically or
through user configuration?
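
To make the comparison concrete, here is a rough sketch of the arithmetic. It
assumes the usual approximation p ≈ 0.6185^(m/n) and a false positive chance
of 10%, which is my own assumption. The helper names bloom_filter_bits and
bits_to_gb are hypothetical and are not part of Cassandra.

```python
import math


def bloom_filter_bits(n_keys: int, fp_chance: float) -> float:
    """Approximate number of bits m such that 0.6185 ** (m / n_keys) == fp_chance."""
    if n_keys <= 0:
        raise ValueError("n_keys must be positive")
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    # Solve p = 0.6185 ** (m / n) for m:  m = n * ln(p) / ln(0.6185)
    return n_keys * math.log(fp_chance) / math.log(0.6185)


def bits_to_gb(bits: float) -> float:
    """Convert bits to gigabytes (10^9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    rows = 1_000_000   # example row count from the thread
    fp_chance = 0.1    # assumed target, not a value stated in the thread
    for cols_per_row in (1, 100, 100_000):
        n = rows * cols_per_row
        gb = bits_to_gb(bloom_filter_bits(n, fp_chance))
        print(f"{cols_per_row:>7} keys per row -> {gb:.4f} GB")
```

Under these assumptions, 100 columns per row comes to roughly 60 MB in total,
versus roughly 60 GB for 100,000 columns per row.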
Thanks,
Philo Yang
2014-09-15 2:45 GMT+08:00 DuyHai Doan:
> Hello Philo
>
> Building a bloom filter on column names (what you call column keys) is
> technically possible but very expensive in terms of memory usage.
>
> An approximate formula for the space required by a bloom filter can
> be found on slide 27 here:
> http://fr.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees
>
> false positive chance = 0.6185 ^ (m/n) where m = number of bits for the
> filter and n = number of distinct keys
>
> For example, if you want to index 1 million rows, each having 100,000
> columns on average, you end up indexing 100 billion keys (row keys
> & column names) with the bloom filter.
>
> Applying the above formula with a false positive chance of about 10% gives
> m ≈ 4.8 * 10^11 bits ≈ 60 GB to allocate in RAM just for the bloom filter
> on all row keys & column names ...
>
> Regards
>
> Duy Hai DOAN
>
> On Sun, Sep 14, 2014 at 11:22 AM, Philo Yang wrote:
>
>> Hi all,
>>
>> After reading some docs, I found that the bloom filter is built on row keys,
>> not on column keys. Can anyone tell me the reasoning for not building the
>> bloom filter on column keys? Would it be a good idea to offer a table
>> property to choose whether the bloom filter is built on the row key or on
>> the primary key?
>>
>> Thanks,
>> Philo Yang
>>
>>
>