So, looking more closely at your proposal, I realise what you are trying to do. The thing that threw me was your mention of lists and other collections. That part will likely not work, as there is no index that can be defined on a list (or other collection) within a single sstable: a list is defined over the whole on-disk contents, so the index of an element is undefined within any given sstable.
Tuple and UDT are encoded inefficiently if there are many null fields, but fixing this is a very localised change, affecting just one class. You should take a look at Columns.Serializer for code you can lift for encoding and decoding sparse subsets of fields. It might be that this can be switched on or off per sstable with a header flag bit, so that there is no additional cost for datasets that would not benefit. Likely we can also migrate to vint encoding for the component sizes (and either 1 or 0 bytes for fixed width values), no doubt saving a lot of space over the status quo, even for small UDT with few null entries. Essentially, at this point we’re talking about pushing through storage optimisations applied elsewhere to tuples and UDT, which is a very uncontroversial change.

> On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:
>
> I agree a Jira would suffice, and if visibility there required a DISCUSS
> thread or simply a notice sent to the list.
>
> While we’re here though, while I don’t have a lot of time to engage in
> discussion it’s unclear to me what advantage this encoding scheme brings. It
> might be worth outlining what algorithmic advantage you foresee for what data
> distributions in which collection types.
>
>> On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org>
>> wrote:
>>
>> I am just learning the ropes here so perhaps it is not CEP worthy. That
>> being said, It felt like there was a lot of information to put into and
>> track in a ticket, particularly when I expected discussion about how to best
>> encode, changes to the algorithms etc. It feels like it would be difficult
>> to track. But if that is standard for this project I will move the
>> information there.
>>
>> As to the benchmarking, I had thought that usage and performance measures
>> should be included. Thank you for calling out the subset of data selected
>> query as being of particular importance.
>>
>> Claude
>>
>>>> On 06/09/2022 03:11, Abe Ratnofsky wrote:
>>> Looking at this link:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
>>>
>>> Do you have any plans to include benchmarks in your test plan? It would be
>>> useful to include disk usage / read performance / write performance
>>> comparisons with the new encodings, particularly for sparse collections
>>> where a subset of data is selected out of a collection.
>>>
>>> I do wonder whether this is CEP-worthy. The CEP says that the changes will
>>> not impact existing users, will be backwards compatible, and overall is an
>>> efficiency improvement. The CEP guidelines say a CEP is encouraged “for
>>> significant user-facing or changes that cut across multiple subsystems”.
>>> Any reason why a Jira isn’t sufficient?
>>>
>>> Abe
>>>
>>>>> On Sep 5, 2022, at 1:57 AM, Claude Warren via dev
>>>>> <dev@cassandra.apache.org> wrote:
>>>>
>>>> I have just posted a CEP covering an Enhancement for Sparse Data
>>>> Serialzation. This is in response to CASSANDRA-8959
>>>>
>>>> I look forward to responses.
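
To make the tuple/UDT suggestion at the top of this mail a little more concrete, here is a rough sketch of the idea, written in Python purely for brevity. None of it is Cassandra code: the function names are hypothetical, the varint is a plain LEB128-style stand-in rather than Cassandra's actual vint format, and the "baseline" layout (a 4-byte length per field, -1 for null) is only my understanding of the current tuple encoding. It shows a presence bitmap followed by only the non-null fields, with a varint size for variable-width fields and no size at all for fixed-width ones.

```python
import struct
from typing import List, Optional, Tuple


def write_uvarint(n: int) -> bytes:
    """LEB128-style unsigned varint (a stand-in, NOT Cassandra's vint format)."""
    if n < 0:
        raise ValueError("cannot encode a negative value")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_uvarint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_baseline(fields: List[Optional[bytes]]) -> bytes:
    """Assumed status quo: 4-byte length per field, -1 marking null."""
    out = bytearray()
    for f in fields:
        if f is None:
            out += struct.pack(">i", -1)
        else:
            out += struct.pack(">i", len(f)) + f
    return bytes(out)


def encode_sparse(fields: List[Optional[bytes]],
                  fixed_widths: List[Optional[int]]) -> bytes:
    """Presence bitmap, then only the non-null fields.

    fixed_widths[i] is the byte width of field i, or None if variable-width.
    """
    bitmap = 0
    for i, f in enumerate(fields):
        if f is not None:
            bitmap |= 1 << i  # bit i set means field i is present
    out = bytearray(write_uvarint(bitmap))
    for f, width in zip(fields, fixed_widths):
        if f is None:
            continue
        if width is None:
            out += write_uvarint(len(f))  # variable width: varint size prefix
        elif len(f) != width:
            raise ValueError("fixed-width field has the wrong size")
        out += f  # fixed width: no size prefix at all
    return bytes(out)


def decode_sparse(buf: bytes,
                  fixed_widths: List[Optional[int]]) -> List[Optional[bytes]]:
    bitmap, pos = read_uvarint(buf, 0)
    fields: List[Optional[bytes]] = []
    for i, width in enumerate(fixed_widths):
        if not bitmap & (1 << i):
            fields.append(None)
            continue
        if width is None:
            size, pos = read_uvarint(buf, pos)
        else:
            size = width
        fields.append(buf[pos:pos + size])
        pos += size
    return fields


# An 8-field UDT with only two fields set: a 4-byte int and a short string.
widths = [4, None, None, 8, None, None, None, None]
udt = [b"\x00\x00\x00\x2a", None, None, None, None, b"hello", None, None]
print(len(encode_baseline(udt)), len(encode_sparse(udt, widths)))
```

For this mostly-null value the sketch's baseline spends 4 bytes on every field whether it is set or not, while the sparse form spends one byte on the bitmap plus the two values themselves. A per-sstable header flag, as suggested above, would decide which of the two layouts a reader should expect.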