Re: Confused about get_slice SliceRange behavior with bloom filter

Aditya Narayan Mon, 14 Feb 2011 04:52:52 -0800

Thanks for the clarifications.

On Mon, Feb 14, 2011 at 6:13 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:


> On Mon, Feb 14, 2011 at 11:27 AM, Aditya Narayan <ady...@gmail.com> wrote:
>
>> Thanks Sylvain,
>>
>> I guess I might have misunderstood the meaning of column_index_size_in_kb.
>> My previous understanding was that it is the size threshold a row must
>> pass before its columns are indexed.
>>
>
> It is the size of the index 'bucket'. But given that there is no point in
> having an index with only one entry, it is true that it is also the threshold
> after which rows start to be indexed.
>
>
>>
>> If I have understood it correctly, it is the size of the "blocks
>> (containing columns) that are kept together on the same index". So if you
>> make it high, a large number of columns will need to be deserialized for a
>> single column access in that block. And if you make it lower than optimal,
>> the index size will grow, right?
>>
>
> yes
>
>
>> So I guess we should vary that depending on the size of our columns and
>> not the size of our rows? I have valueless columns for my use case.
>
>
> Yes, it depends mainly on the size of your columns. But if you have big
> rows, even with very tiny columns, you may still not want to set too small a
> value there. In general I would run careful tests with your workload
> before changing the value of column_index_size_in_kb to see if it does make
> a difference. Not sure there is much to gain here.
>
> --
> Sylvain
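The trade-off discussed above can be put into rough numbers. The following is only a back-of-the-envelope sketch, not Cassandra's actual indexing code: it assumes fixed-size columns and ignores serialization overhead, and the function name `index_tradeoff` and its parameters are hypothetical.

```python
import math

def index_tradeoff(row_bytes, column_bytes, index_size_kb=64):
    """Return (index_entries, worst_case_columns_scanned) for one row.

    Toy model: every column has the same size, and a row gets a column
    index only once it is larger than one index block.
    """
    block_bytes = index_size_kb * 1024
    if row_bytes <= block_bytes:
        # Row fits in one block: no index, the whole row may be scanned.
        return 0, math.ceil(row_bytes / column_bytes)
    entries = math.ceil(row_bytes / block_bytes)
    columns_per_block = max(1, block_bytes // column_bytes)
    return entries, columns_per_block

# A 64 MiB row of 64-byte columns, at two candidate settings.
for kb in (64, 4):
    entries, scanned = index_tradeoff(64 * 1024 * 1024, 64, kb)
    print(f"{kb} KB blocks: {entries} index entries, "
          f"up to {scanned} columns scanned per lookup")
```

Under these assumptions, a smaller block size means fewer columns to deserialize per lookup but proportionally more index entries, which is why testing against the real workload is the safer guide.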
>
>
>>
>>
>>
>>
>> On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>
>>> As Aaron said, if the whole row is under 64k, it won't matter. But
>>> since you spoke of a very wide row, I'll assume the whole row will be much
>>> more than 64k.
>>>
>>> If so, the row is indexed by block (of 64k, configurable). Then the read
>>> performance depends on how many of those blocks are needed for the query,
>>> since each block potentially means a seek (potentially because some blocks
>>> could happen to be sequential on disk). So if the columns you ask for are
>>> really randomly distributed, then yes, the bigger the row is, the greater
>>> the chance of having to hit many blocks, and the greater the chance that
>>> these blocks are far apart on disk.
>>>
>>> --
>>> Sylvain
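To get a feel for how many blocks a random multi-column read might touch, here is a rough estimate. It is only a sketch under stated assumptions (fixed-size columns, requested columns picked uniformly at random, one block per index entry); the function names are hypothetical and nothing here is taken from Cassandra's code.

```python
import math

def blocks_touched(column_positions, column_bytes, block_bytes=64 * 1024):
    """Set of block numbers holding the columns at the given positions."""
    columns_per_block = max(1, block_bytes // column_bytes)
    return {pos // columns_per_block for pos in column_positions}

def expected_blocks(total_columns, requested, column_bytes,
                    block_bytes=64 * 1024):
    """Expected number of distinct blocks hit by `requested` random columns.

    Uses the standard occupancy estimate B * (1 - (1 - 1/B) ** k), which
    treats each requested column as an independent uniform pick of a block.
    """
    columns_per_block = max(1, block_bytes // column_bytes)
    total_blocks = math.ceil(total_columns / columns_per_block)
    return total_blocks * (1 - (1 - 1 / total_blocks) ** requested)

# 200 random columns out of a million 64-byte columns.
print(round(expected_blocks(1_000_000, 200, 64)))
```

With these assumed numbers the row spans 977 blocks and the estimate comes out at roughly 180 distinct blocks, so nearly every requested column lands in its own block and may cost its own seek.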
>>>
>>> On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan <ady...@gmail.com> wrote:
>>>
>>>> Jonathan,
>>>> If I ask for around 150-200 columns (totally random, not sequential) from
>>>> a very wide row that contains a million or more columns, does the read
>>>> performance of the SliceQuery operation "depend on the length of the
>>>> row"? (For my use case, I would use the column names list for this
>>>> SliceQuery operation.)
>>>>
>>>>
>>>> Thanks
>>>> Aditya
>>>>
>>>>
>>>> On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>
>>>>> On Sun, Feb 13, 2011 at 12:37 AM, E S <tr1skl...@yahoo.com> wrote:
>>>>> > I've gotten myself really confused by
>>>>> > http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
>>>>> > someone can help me understand what the io behavior of this operation
>>>>> > would be.
>>>>> >
>>>>> > When I do a get_slice for a column range, will it seek to every SSTable?
>>>>> > I had thought that it would use the bloom filter on the row key so that
>>>>> > it would only do a seek to SSTables that have a very high probability of
>>>>> > containing columns for that row.
>>>>>
>>>>> Yes.
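The idea of using a per-SSTable bloom filter on the row key to skip SSTables can be illustrated with a toy filter. This is a minimal sketch, not Cassandra's implementation: the `BloomFilter` class, its sizing, the SHA-256 based hashing, and `sstables_to_read` are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def sstables_to_read(sstables, row_key):
    """Names of SSTables whose filter says the row key may be present."""
    return [name for name, bloom in sstables if bloom.might_contain(row_key)]

# Hypothetical SSTables: "a" holds row1, "b" holds row2, "c" is empty.
bloom_a, bloom_b, bloom_c = BloomFilter(), BloomFilter(), BloomFilter()
bloom_a.add("row1")
bloom_b.add("row2")
candidates = sstables_to_read(
    [("a", bloom_a), ("b", bloom_b), ("c", bloom_c)], "row1")
```

In this model the filter is consulted once per SSTable with the row key only, so it can rule out an SSTable for the whole row but says nothing about which columns that SSTable holds.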
>>>>>
>>>>> > In the linked doc above, it seems to say that it is only used for
>>>>> > exact column names.  Am I misunderstanding this?
>>>>>
>>>>> Yes.  You may be confusing multi-row behavior with multi-column.
>>>>>
>>>>> > On a related note, if instead of using a SliceRange I provide an
>>>>> > explicit list of columns, will I have to read all SSTables that have
>>>>> > values for the columns
>>>>>
>>>>> Yes.
>>>>>
>>>>> > or is it smart enough to stop after finding a value from the most
>>>>> > recent SSTable?
>>>>>
>>>>> There is no way to know which value is most recent without having to
>>>>> read it first.
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>> http://www.datastax.com
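The point that every candidate SSTable has to be read before the most recent value is known can be sketched as a merge by write timestamp. This is a toy model only: SSTables are represented as plain dicts, tombstones and the on-disk format are ignored, and `read_columns` is a hypothetical name.

```python
def read_columns(sstables, row_key, column_names):
    """Return the newest value of each requested column across all SSTables.

    Each SSTable is modelled as {row_key: {column_name: (value, timestamp)}}.
    Every SSTable is consulted, because a newer version of a column may sit
    in any of them regardless of the order in which they are visited.
    """
    newest = {}
    for sstable in sstables:
        row = sstable.get(row_key, {})
        for name in column_names:
            if name not in row:
                continue
            value, timestamp = row[name]
            if name not in newest or timestamp > newest[name][1]:
                newest[name] = (value, timestamp)
    return {name: value for name, (value, _) in newest.items()}

# Hypothetical data: column "a" was overwritten in a later SSTable.
sstables = [
    {"r": {"a": ("old", 1)}},
    {"r": {"a": ("new", 5), "b": ("x", 2)}},
    {"other": {"a": ("unrelated", 9)}},
]
result = read_columns(sstables, "r", ["a", "b", "c"])
```

Here `result` holds the value with the highest timestamp for each column that exists, and the outcome does not depend on the order in which the SSTables are listed.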
>>>>>
>>>>
>>>>
>>>
>>
>
