So, looking more closely at your proposal, I realise what you are trying to do. 
The thing that threw me was your mention of lists and other collections. This 
will likely not work for them, as no index can be defined on a list (or other 
collection) within a single sstable: a list is defined over the whole on-disk 
contents, so the index is undefined within a given sstable.

Tuples and UDTs are encoded inefficiently if there are many null fields, but 
fixing this is a very localised change, affecting just one class. You should 
take a look at Columns.Serializer for code you can lift for encoding and 
decoding sparse subsets of fields.
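
As a rough illustration of the general idea, here is a minimal sketch of a 
presence-bitmap encoding for a tuple's fields. It is not the 
Columns.Serializer code, and every name in it (SparseTupleSketch, encodeDense, 
encodeSparse, decodeSparse) is hypothetical. It assumes the existing layout 
is a 4-byte length per field with a negative length marking null, and it uses 
a fixed 8-byte bitmap for simplicity, which limits it to 64 fields.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch only; not Cassandra code.
class SparseTupleSketch {
    // Assumed status quo: every field costs a 4-byte length (-1 for null),
    // followed by the value bytes when the field is present.
    static byte[] encodeDense(byte[][] fields) {
        int size = 0;
        for (byte[] f : fields) size += 4 + (f == null ? 0 : f.length);
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] f : fields) {
            if (f == null) {
                out.putInt(-1);
            } else {
                out.putInt(f.length);
                out.put(f);
            }
        }
        return out.array();
    }

    // Sparse variant: one presence bit per field, then only the non-null
    // fields, each still prefixed with a 4-byte length.
    static byte[] encodeSparse(byte[][] fields) {
        if (fields.length > 64)
            throw new IllegalArgumentException("sketch supports at most 64 fields");
        long bitmap = 0;
        int size = 8; // the fixed-width bitmap
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] != null) {
                bitmap |= 1L << i;
                size += 4 + fields[i].length;
            }
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putLong(bitmap);
        for (byte[] f : fields) {
            if (f != null) {
                out.putInt(f.length);
                out.put(f);
            }
        }
        return out.array();
    }

    // Inverse of encodeSparse; the field count comes from the type definition.
    static byte[][] decodeSparse(byte[] encoded, int fieldCount) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        long bitmap = in.getLong();
        byte[][] fields = new byte[fieldCount][];
        for (int i = 0; i < fieldCount; i++) {
            if ((bitmap & (1L << i)) != 0) {
                fields[i] = new byte[in.getInt()];
                in.get(fields[i]);
            }
        }
        return fields;
    }
}
```

Under those assumptions, a ten-field value with two 3-byte fields set takes 46 
bytes dense and 22 bytes sparse. A real implementation would want the bitmap 
itself to be variable length so that small types pay less than 8 bytes.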

It might be that this can be switched on or off per sstable with a header flag 
bit, so that there is no additional cost for datasets that would not benefit. 
We can likely also migrate to vint encoding for the component sizes (and 
either 1 or 0 bytes for fixed-width values), no doubt saving a lot of space 
over the status quo, even for small UDTs with few null entries.
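
To give a feel for the size-prefix saving, here is a small sketch using a 
generic LEB128-style variable-length integer as a stand-in. This is not the 
byte layout of Cassandra's own vint coding, and the names (VarintSizeSketch, 
sizePrefixBytes and so on) are hypothetical. It also assumes that presence is 
tracked elsewhere (for example in a bitmap), so a fixed-width value needs no 
size prefix at all.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch only; a generic varint, not Cassandra's vint format.
class VarintSizeSketch {
    // Writes 7 bits per byte, low bits first; the high bit marks "more follows".
    static void writeUnsignedVarint(long value, ByteArrayOutputStream out) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Reads a varint written by writeUnsignedVarint, starting at offset.
    static long readUnsignedVarint(byte[] in, int offset) {
        long result = 0;
        int shift = 0;
        int i = offset;
        while (true) {
            byte b = in[i++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Number of bytes writeUnsignedVarint emits for this value.
    static int varintSize(long value) {
        int size = 1;
        while ((value >>>= 7) != 0) size++;
        return size;
    }

    // Size-prefix cost for one present component: nothing for fixed-width
    // types (the length is implied by the type), a varint otherwise.
    static int sizePrefixBytes(int valueLength, boolean fixedWidth) {
        return fixedWidth ? 0 : varintSize(valueLength);
    }
}
```

With this stand-in, any value shorter than 128 bytes pays 1 byte for its size 
instead of 4, and a fixed-width value pays nothing.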

Essentially, at this point we’re talking about applying to tuples and UDTs the 
storage optimisations already used elsewhere, which is a very uncontroversial 
change.

> On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:
> 
> I agree a Jira would suffice, and if visibility there is required, a DISCUSS 
> thread or simply a notice sent to the list would do.
> 
> While we’re here though, while I don’t have a lot of time to engage in 
> discussion it’s unclear to me what advantage this encoding scheme brings. It 
> might be worth outlining what algorithmic advantage you foresee for what data 
> distributions in which collection types.
> 
>> On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org> 
>> wrote:
>> 
>> I am just learning the ropes here so perhaps it is not CEP worthy.  That 
>> being said, it felt like there was a lot of information to put into and 
>> track in a ticket, particularly when I expected discussion about how to best 
>> encode, changes to the algorithms etc.  It feels like it would be difficult 
>> to track. But if that is standard for this project I will move the 
>> information there.
>> 
>> As to the benchmarking, I had thought that usage and performance measures 
>> should be included.  Thank you for calling out the subset of data selected 
>> query as being of particular importance.
>> 
>> Claude
>> 
>>>> On 06/09/2022 03:11, Abe Ratnofsky wrote:
>>> Looking at this link: 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
>>> 
>>> Do you have any plans to include benchmarks in your test plan? It would be 
>>> useful to include disk usage / read performance / write performance 
>>> comparisons with the new encodings, particularly for sparse collections 
>>> where a subset of data is selected out of a collection.
>>> 
>>> I do wonder whether this is CEP-worthy. The CEP says that the changes will 
>>> not impact existing users, will be backwards compatible, and overall is an 
>>> efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
>>> significant user-facing or changes that cut across multiple subsystems”. 
>>> Any reason why a Jira isn’t sufficient?
>>> 
>>> Abe
>>> 
>>>>> On Sep 5, 2022, at 1:57 AM, Claude Warren via dev 
>>>>> <dev@cassandra.apache.org> wrote:
>>>> 
>>>> I have just posted a CEP covering an Enhancement for Sparse Data 
>>>> Serialization.  This is in response to CASSANDRA-8959.
>>>> 
>>>> I look forward to responses.
>>>> 
>>>> 
> 
