Thanks for the clarifications.

On Mon, Feb 14, 2011 at 6:13 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Mon, Feb 14, 2011 at 11:27 AM, Aditya Narayan <ady...@gmail.com> wrote:
>
>> Thanks Sylvain,
>>
>> I guess I might have misunderstood the meaning of column_index_size_in_kb.
>> My previous understanding was that it is the threshold size a row has to
>> pass, after which its columns will be indexed.
>>
>
> It is the size of the index 'bucket'. But given that there is no point in
> having an index with only one entry, it is true that it is also the
> threshold after which a row starts to be indexed.
>
>>
>> If I have understood it correctly, it implies the size of the "blocks
>> (containing columns) that are kept together on the same index". So if you
>> make that high, a large number of columns will need to be deserialized for
>> a single column access in that block. And if you make it lower than
>> optimal, then the index size will grow, right?
>>
>
> yes
>
>> So I guess we should vary that depending on the size of our columns and
>> not the size of rows!? I have valueless columns for my use case.
>
> Yes, it depends mainly on the size of your columns. But if you have big
> rows, even with very tiny columns, you may still not want to put too small
> a value there. In general I would really run careful tests with your
> workload before changing the value of column_index_size_in_kb to see if it
> makes a difference. Not sure there is much to gain here.
>
> --
> Sylvain
>
>>
>> On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne
>> <sylv...@datastax.com> wrote:
>>
>>> As said by aaron, if the whole row is under 64k, it won't matter. But
>>> since you spoke of a very wide row, I'll assume the whole row will be
>>> much more than 64k.
>>>
>>> If so, the row is indexed by block (of 64k, configurable). Then the read
>>> performance depends on how many of those blocks are needed for the query,
>>> since each block potentially means a seek (potentially because some blocks
>>> could happen to be sequential on disk).
>>> So if the columns you ask for are really randomly distributed, then yes,
>>> the bigger the row is, the greater the chance of having to hit many
>>> blocks and the greater the chance that these blocks are far apart on disk.
>>>
>>> --
>>> Sylvain
>>>
>>> On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan <ady...@gmail.com> wrote:
>>>
>>>> Jonathan,
>>>> If I ask for around 150-200 columns (totally random, not sequential)
>>>> from a very wide row that contains a million or more columns, is the
>>>> read performance of the SliceQuery operation affected by, or dependent
>>>> on, the length of the row? (For my use case, I would use the column
>>>> names list for this SliceQuery operation.)
>>>>
>>>> Thanks
>>>> Aditya
>>>>
>>>> On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>
>>>>> On Sun, Feb 13, 2011 at 12:37 AM, E S <tr1skl...@yahoo.com> wrote:
>>>>> > I've gotten myself really confused by
>>>>> > http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
>>>>> > someone can help me understand what the IO behavior of this operation
>>>>> > would be.
>>>>> >
>>>>> > When I do a get_slice for a column range, will it seek to every
>>>>> > SSTable? I had thought that it would use the bloom filter on the row
>>>>> > key so that it would only do a seek to SSTables that have a very high
>>>>> > probability of containing columns for that row.
>>>>>
>>>>> Yes.
>>>>>
>>>>> > In the linked doc above, it seems to say that it is only used for
>>>>> > exact column names. Am I misunderstanding this?
>>>>>
>>>>> Yes. You may be confusing multi-row behavior with multi-column.
>>>>>
>>>>> > On a related note, if instead of using a SliceRange I provide an
>>>>> > explicit list of columns, will I have to read all SSTables that have
>>>>> > values for the columns
>>>>>
>>>>> Yes.
>>>>>
>>>>> > or is it smart enough to stop after finding a value from the most
>>>>> > recent SSTable?
>>>>>
>>>>> There is no way to know which value is most recent without having to
>>>>> read it first.
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>> http://www.datastax.com
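
To put rough numbers on the block discussion quoted above, here is a minimal back-of-the-envelope sketch in Python. It is not Cassandra code. The function names and the avg_column_bytes parameter are hypothetical, and it assumes columns are spread evenly over the index blocks and that each requested column is an independent uniform pick. Real serialized column sizes and on-disk layout will differ, so treat the result as an estimate only.

```python
import math


def columns_per_block(avg_column_bytes: int, block_kb: int = 64) -> int:
    """Rough count of columns that fit in one index block of block_kb KB.

    avg_column_bytes is an assumed average serialized size per column.
    """
    return max(1, (block_kb * 1024) // avg_column_bytes)


def expected_blocks_touched(total_columns: int, requested: int,
                            avg_column_bytes: int, block_kb: int = 64) -> float:
    """Estimate how many distinct index blocks a by-name read touches.

    Assumes each requested column falls in a uniformly random block,
    independently of the others.
    """
    per_block = columns_per_block(avg_column_bytes, block_kb)
    blocks = math.ceil(total_columns / per_block)
    # Chance that one given block holds none of the requested columns.
    p_untouched = (1.0 - 1.0 / blocks) ** requested
    return blocks * (1.0 - p_untouched)


if __name__ == "__main__":
    # Hypothetical workload: a million tiny columns, 200 requested by name.
    for kb in (16, 64, 256):
        estimate = expected_blocks_touched(1_000_000, 200, 20, block_kb=kb)
        print(f"block size {kb} KB: about {estimate:.0f} blocks touched")
```

Under these assumptions a smaller block size means more blocks touched (more potential seeks) but fewer columns deserialized per block, which is the trade-off Sylvain describes. As he says, only tests against the real workload will show whether changing column_index_size_in_kb makes a difference.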