Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

Benedict Thu, 08 Sep 2022 01:13:59 -0700

I was referring to Column*s.*Serializer, which has serializeSubset methods.
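
For readers following along, here is a minimal sketch of the general idea being discussed: a presence bitmap followed by vint-prefixed values for the non-null fields only. It is an illustration, not the code in Columns.Serializer. The function names (write_unsigned_vint, read_unsigned_vint, encode_sparse, decode_sparse) are hypothetical, and the varint used is a simple LEB128-style format rather than Cassandra's own VInt layout. The field count is assumed to be known from the schema, so it is not written to the stream.

```python
from typing import List, Optional, Tuple


def write_unsigned_vint(value: int, out: bytearray) -> None:
    """Append value as a LEB128-style varint (7 payload bits per byte).

    Assumption: this is NOT Cassandra's VInt wire format, only a stand-in.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return


def read_unsigned_vint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_sparse(fields: List[Optional[bytes]]) -> bytes:
    """Encode a tuple/UDT-like list of fields, skipping nulls entirely."""
    bitmap = 0
    for i, field in enumerate(fields):
        if field is not None:
            bitmap |= 1 << i  # bit i set means field i is present
    out = bytearray()
    write_unsigned_vint(bitmap, out)
    for field in fields:
        if field is not None:
            write_unsigned_vint(len(field), out)  # vint size, not fixed width
            out.extend(field)
    return bytes(out)


def decode_sparse(buf: bytes, field_count: int) -> List[Optional[bytes]]:
    """Inverse of encode_sparse; field_count comes from the schema."""
    bitmap, pos = read_unsigned_vint(buf, 0)
    fields: List[Optional[bytes]] = []
    for i in range(field_count):
        if (bitmap >> i) & 1:
            length, pos = read_unsigned_vint(buf, pos)
            fields.append(buf[pos:pos + length])
            pos += length
        else:
            fields.append(None)
    return fields
```

With this layout a null field costs one bit in the bitmap instead of its own size marker, and an empty value (length 0) remains distinguishable from a null one.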


> On 8 Sep 2022, at 07:07, Claude Warren via dev <dev@cassandra.apache.org> 
> wrote:
> 
> I have looked through the code mentioned.  What I found in the 
> ColumnSerializer was the use of VInt encoding.  Are you proposing switching 
> directly to VInt encoding for sizes rather than one of the other encodings, 
> using -2 as the first length to signal that the new encoding is in use so 
> that existing encodings can be read unchanged?
> 
> 
>> On 06/09/2022 16:37, Benedict wrote:
>> So, looking more closely at your proposal I realise what you are trying to 
>> do. The thing that threw me was your mention of lists and other collections. 
>> This will likely not work as there is no index that is possible to define on 
>> a list (or other collection) within a single sstable - a list is defined 
>> over the whole on-disk contents, so the index is undefined within a given 
>> sstable.
>> 
>> Tuple and UDT are encoded inefficiently if there are many null fields, but 
>> this is a very localised change, affecting just one class. You should take a 
>> look at Columns.Serializer for code you can lift for encoding and decoding 
>> sparse subsets of fields.
>> 
>> It might be that this can be switched on or off per sstable with a header 
>> flag bit so that there is no additional cost for datasets that would not 
>> benefit. Likely we can also migrate to vint encoding for the component sizes 
>> (and either 1 or 0 bytes for fixed width values), no doubt saving a lot 
>> of space over the status quo, even for small UDT with few null entries.
>> 
>> Essentially at this point we’re talking about pushing through storage 
>> optimisations applied elsewhere to tuples and UDT, which is a very 
>> uncontroversial change.
>> 
>>>> On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:
>>> 
>>> I agree a Jira would suffice, with a DISCUSS thread or simply a notice 
>>> sent to the list if more visibility is required.
>>> 
>>> While we’re here though, while I don’t have a lot of time to engage in 
>>> discussion it’s unclear to me what advantage this encoding scheme brings. 
>>> It might be worth outlining what algorithmic advantage you foresee for what 
>>> data distributions in which collection types.
>>> 
>>>> On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org> 
>>>> wrote:
>>>> 
>>>> I am just learning the ropes here so perhaps it is not CEP worthy.  That 
>>>> being said, it felt like there was a lot of information to put into and 
>>>> track in a ticket, particularly when I expected discussion about how to 
>>>> best encode, changes to the algorithms etc.  It feels like it would be 
>>>> difficult to track. But if that is standard for this project I will move 
>>>> the information there.
>>>> 
>>>> As to the benchmarking, I had thought that usage and performance measures 
>>>> should be included.  Thank you for calling out queries that select a 
>>>> subset of the data as being of particular importance.
>>>> 
>>>> Claude
>>>> 
>>>>>> On 06/09/2022 03:11, Abe Ratnofsky wrote:
>>>>> Looking at this link: 
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
>>>>> 
>>>>> Do you have any plans to include benchmarks in your test plan? It would 
>>>>> be useful to include disk usage / read performance / write performance 
>>>>> comparisons with the new encodings, particularly for sparse collections 
>>>>> where a subset of data is selected out of a collection.
>>>>> 
>>>>> I do wonder whether this is CEP-worthy. The CEP says that the changes 
>>>>> will not impact existing users, will be backwards compatible, and overall 
>>>>> is an efficiency improvement. The CEP guidelines say a CEP is encouraged 
>>>>> “for significant user-facing or changes that cut across multiple 
>>>>> subsystems”. Any reason why a Jira isn’t sufficient?
>>>>> 
>>>>> Abe
>>>>> 
>>>>>>> On Sep 5, 2022, at 1:57 AM, Claude Warren via dev 
>>>>>>> <dev@cassandra.apache.org> wrote:
>>>>>> I have just posted a CEP covering an Enhancement for Sparse Data 
>>>>>> Serialization.  This is in response to CASSANDRA-8959.
>>>>>> 
>>>>>> I look forward to responses.
>>>>>> 
>>>>>>
