Hi Benedict,
> On Sep 8, 2014, at 4:01 AM, Benedict Elliott Smith
> wrote:
>
> Hi Mark,
>
> This specific heuristic is unlikely to be applied, as (if I've understood
> it correctly) it has a very narrow window of utility to only those updates
> that hit *exactly* the same clustering columns (cql rows) *and* data
> columns,
Well, method #1 indeed applies to updates that pertain to exactly the same
columns, as you said. Going back to the products example: if you end up updating
e.g. (price, title) every day and therefore accumulate a dozen SSTables, where
one SSTable holds all properties of the product (merchant info, category,
image, ...) and almost all the others hold an update for just (price, title),
this helps you avoid I/O and CPU costs by only having to consider 2 SSTables
(index chunk segments are almost always cached).
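To make method #1 concrete, here is a minimal sketch (illustrative Python; the
names and the in-memory data layout are my simplification, not how we actually
store things): an older SSTable can be skipped entirely when every column it
wrote for the key was also written by newer SSTables.

```python
def sstables_to_read(sstables, key):
    """sstables: newest-first list of dicts mapping key -> frozenset of the
    columns written for that key in that SSTable.
    Returns the indices of the SSTables that must actually be consulted."""
    covered = set()   # columns already satisfied by newer SSTables
    needed = []
    for i, table in enumerate(sstables):
        cols = table.get(key)
        if cols is None:
            continue  # key not present in this SSTable at all
        if cols <= covered:
            continue  # every column here was rewritten by a newer SSTable
        needed.append(i)
        covered |= cols
    return needed
```

With the products example (eleven daily (price, title) updates plus one old
full row), only the newest (price, title) SSTable and the full-row SSTable
survive the filter, i.e. 2 SSTables out of 12.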
Method #2 does not have this requirement. It works even if you don’t have the
exact same set of columns for the row in each SSTable; however, you are limited
to the first 31 distinct columns on a per-SSTable basis.
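The 31-column limit suggests a 32-bit word per row with one bit reserved as an
overflow flag; the encoding below is my guess at such a scheme, purely for
illustration, not a description of our actual on-disk format.

```python
OVERFLOW_BIT = 1 << 31  # assumed: set when a row touches untracked columns

class ColumnBitmap:
    """Per-SSTable map of the first 31 distinct column names to bit slots."""

    def __init__(self):
        self.slots = {}  # column name -> bit index (0..30)

    def encode(self, row_columns):
        """Return a 32-bit word describing which columns this row carries."""
        word = 0
        for name in row_columns:
            slot = self.slots.get(name)
            if slot is None:
                if len(self.slots) < 31:
                    slot = self.slots[name] = len(self.slots)
                else:
                    word |= OVERFLOW_BIT  # beyond 31 distinct: can't prune
                    continue
            word |= 1 << slot
        return word

    def covers(self, word, wanted):
        """True iff the word proves this row version holds every wanted column."""
        if word & OVERFLOW_BIT:
            return False  # untracked columns present; fall back to reading
        mask = 0
        for name in wanted:
            slot = self.slots.get(name)
            if slot is None:
                return False  # column never seen in this SSTable
            mask |= 1 << slot
        return word & mask == mask
```

A GET can then stop descending into older SSTables as soon as `covers` is true
for the newest row version it has merged so far.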
> and is not trivial to maintain (either cpu- or memory-wise).
> However two variants on this heuristic are already being considered for
> inclusion as part of the new sstable format we're introducing in
> CASSANDRA-7447 <https://issues.apache.org/jira/browse/CASSANDRA-7447>,
> which is an extension of the heuristic that is already applied at the
> whole sstable level.
>
Very interesting. We have made some similar choices (albeit implemented
differently, and for different reasons, e.g. timestamp/TTL compression, a
per-file layout index, etc.). In practice, and for the workloads and
update/retrieval patterns I described, the described heuristics are extremely
effective. They also turned out to be particularly ‘cheap’ to compute and
store, and add very little overhead to the GET implementation. Again, these are
configured on a per-CF basis (one of the two methods is selected), so we only
do this where we know it makes sense.
Thank you for the link; some great ideas there. This looks like a great update
to C*.
> (1) Per partition, we will store the maximum timestamp (whether or not this
> sits in the hash index / key cache, or in the clustering column index part,
> is an open question)
> - this permits us to stop looking at files once we have a complete set of
> the data we expect to return, i.e. all selected fields for the complete set
> of selected rows
>
> (2) Per clustering row, we may store enough information to construct the
> max timestamp for the row
> - this permits us to stop looking at data pages if we know we have all
> selected fields for a given row only
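If it helps, the stop-early read that (1) enables could be sketched like so
(illustrative Python; the per-partition max timestamps are made explicit here,
and all names are mine rather than anything from the 7447 design):

```python
def read_row(sstables, wanted):
    """sstables: (per-partition max timestamp, {column: (timestamp, value)})
    pairs for one key, in any order. Returns the merged last-write-wins row
    restricted to the wanted columns."""
    result = {}  # column -> (timestamp, value)
    # Visit SSTables in descending order of their per-partition max timestamp.
    for max_ts, row in sorted(sstables, key=lambda s: -s[0]):
        if set(wanted) <= set(result) and all(
            ts >= max_ts for ts, _ in result.values()
        ):
            break  # this and every remaining SSTable is older than all we hold
        for col, (ts, val) in row.items():
            if col in wanted and (col not in result or ts > result[col][0]):
                result[col] = (ts, val)
    return {col: val for col, (ts, val) in result.items()}
```

The break condition is the key point: once every selected field is present and
each collected cell is at least as fresh as the next SSTable's max timestamp,
no remaining file can contribute anything newer.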
>
>
>
>
> On Sun, Sep 7, 2014 at 11:30 PM, Mark Papadakis
> wrote:
>
>> Greetings,
>>
>> This heuristic helps us eliminate unnecessary I/O in certain workloads and
>> datasets, often by many orders of magnitude. This is a description of the
>> problems we faced and how we dealt with them — I am pretty certain this can
>> be easily implemented on C*, albeit it will likely require a new SSTable
>> format that can support the semantics described below.
>>
>> # Example
>> One of our services, a price comparison service, has many millions of
>> products in our datastore, and we access over 100+ rows on a single page
>> request (almost all of them in 2-3 MultiGets - those are executed in
>> parallel in our datastore implementation). This is fine, and it rarely
>> takes more than 100ms to get back responses from all those requests.
>>
>> Because we, in practice, need to update all key=>value rows multiple times
>> a day (merchants tend to update their prices every few hours for some odd
>> reason), a key’s columns exist in multiple (almost always all) SSTables of a
>> CF, and so we almost always have to merge the final value for each key from
>> all those many SSTables, as opposed to only needing to access a single
>> SSTable to do that.
>>
>> In fact, for most CFs of this service, we need to merge most of their
>> SSTables to get the final CF, for that same reason (rows update very
>> frequently, as opposed to, say, a ‘users’ CF where you typically set the
>> row once on creation and very infrequently after that).
>> Surely, there must be ways to exploit this access and update pattern.
>> (There are.)
>>
>> Our SSTables are partitioned into chunks. One of those chunks is the index
>> chunk, which holds distinctKeyId:u64 => offset:u32 entries, sorted by
>> distinctKeyId. We have a map which allows us to map distinctKeyId:u64 =>
>> data chunk region (a skip list), so that this offset is an offset into the respectiv
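For illustration, a packed index chunk along the lines described above might
look like the following sketch: fixed-width (u64 key id, u32 offset) entries
sorted by key id and binary-searched in place. The little-endian `<QI` layout
is my assumption, not the actual serialization.

```python
import struct

# One index entry: distinctKeyId as u64, data-chunk offset as u32, no padding.
ENTRY = struct.Struct("<QI")

def build_index_chunk(entries):
    """entries: iterable of (key_id, offset); returns the packed sorted chunk."""
    return b"".join(ENTRY.pack(key_id, off) for key_id, off in sorted(entries))

def lookup(chunk, key_id):
    """Binary-search the packed chunk without decoding it wholesale.
    Returns the data-chunk offset, or None if the key id is absent."""
    lo, hi = 0, len(chunk) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, off = ENTRY.unpack_from(chunk, mid * ENTRY.size)
        if k == key_id:
            return off
        if k < key_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because entries are fixed-width and sorted, a lookup touches O(log n) entries,
and a cached chunk makes the search essentially free, which is what makes the
per-SSTable pruning above so cheap in practice.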