I was referring to Column*s.*Serializer, which has serializeSubset methods.
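To make the idea concrete, here is a minimal sketch of that kind of subset encoding: a presence bitmap followed by variable-length sizes for only the non-null fields. It is an illustration under assumptions, not the Cassandra code. The names `encode_sparse`, `decode_sparse`, `write_uvarint` and `read_uvarint` are hypothetical, and the varint used is plain LEB128 rather than Cassandra's own vint format.

```python
from typing import List, Optional, Tuple


def write_uvarint(value: int, out: bytearray) -> None:
    """Append value as an unsigned LEB128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return


def read_uvarint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_sparse(fields: List[Optional[bytes]]) -> bytes:
    """Hypothetical layout: presence bitmap, then (size, bytes) per present field."""
    bitmap = 0
    for i, field in enumerate(fields):
        if field is not None:
            bitmap |= 1 << i
    out = bytearray()
    write_uvarint(bitmap, out)
    for field in fields:
        if field is not None:
            write_uvarint(len(field), out)
            out += field
    return bytes(out)


def decode_sparse(data: bytes, field_count: int) -> List[Optional[bytes]]:
    """Inverse of encode_sparse; field_count comes from the type definition."""
    bitmap, pos = read_uvarint(data, 0)
    fields: List[Optional[bytes]] = []
    for i in range(field_count):
        if bitmap & (1 << i):
            size, pos = read_uvarint(data, pos)
            fields.append(data[pos:pos + size])
            pos += size
        else:
            fields.append(None)  # null fields cost no bytes beyond the bitmap
    return fields
```

With this layout, a ten-field value in which only the fourth field is set to `b"abc"` takes five bytes: one for the bitmap, one for the size, and three for the payload. Null fields contribute nothing beyond their bit in the bitmap, and an empty value stays distinguishable from a null one because it is marked present with size zero.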
> On 8 Sep 2022, at 07:07, Claude Warren via dev <dev@cassandra.apache.org>
> wrote:
>
> I have looked through the code mentioned. What I found in the
> ColumnSerializer was the use of VInt encoding. Are you proposing switching
> directly to VInt encoding for sizes rather than one of the other encodings?
> Using a -2 as the first length to signal that the new encoding is in use so
> that existing encodings can be read unchanged?
>
>
>> On 06/09/2022 16:37, Benedict wrote:
>> So, looking more closely at your proposal I realise what you are trying to
>> do. The thing that threw me was your mention of lists and other collections.
>> This will likely not work as there is no index that is possible to define on
>> a list (or other collection) within a single sstable - a list is defined
>> over the whole on-disk contents, so the index is undefined within a given
>> sstable.
>>
>> Tuple and UDT are encoded inefficiently if there are many null fields, but
>> this is a very localised change, affecting just one class. You should take a
>> look at Columns.Serializer for code you can lift for encoding and decoding
>> sparse subsets of fields.
>>
>> It might be that this can be switched on or off per sstable with a header
>> flag bit so that there is no additional cost for datasets that would not
>> benefit. Likely we can also migrate to vint encoding for the component sizes
>> also (and either 1 or 0 bytes for fixed width values), no doubt saving a lot
>> of space over the status quo, even for small UDT with few null entries.
>>
>> Essentially at this point we’re talking about pushing through storage
>> optimisations applied elsewhere to tuples and UDT, which is a very
>> uncontroversial change.
>>
>>>> On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:
>>>
>>> I agree a Jira would suffice, and if visibility there required a DISCUSS
>>> thread or simply a notice sent to the list.
>>>
>>> While we’re here though, while I don’t have a lot of time to engage in
>>> discussion it’s unclear to me what advantage this encoding scheme brings.
>>> It might be worth outlining what algorithmic advantage you foresee for what
>>> data distributions in which collection types.
>>>
>>>> On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org>
>>>> wrote:
>>>>
>>>> I am just learning the ropes here so perhaps it is not CEP worthy. That
>>>> being said, it felt like there was a lot of information to put into and
>>>> track in a ticket, particularly when I expected discussion about how to
>>>> best encode, changes to the algorithms etc. It feels like it would be
>>>> difficult to track. But if that is standard for this project I will move
>>>> the information there.
>>>>
>>>> As to the benchmarking, I had thought that usage and performance measures
>>>> should be included. Thank you for calling out the subset of data selected
>>>> query as being of particular importance.
>>>>
>>>> Claude
>>>>
>>>>>> On 06/09/2022 03:11, Abe Ratnofsky wrote:
>>>>> Looking at this link:
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
>>>>>
>>>>> Do you have any plans to include benchmarks in your test plan? It would
>>>>> be useful to include disk usage / read performance / write performance
>>>>> comparisons with the new encodings, particularly for sparse collections
>>>>> where a subset of data is selected out of a collection.
>>>>>
>>>>> I do wonder whether this is CEP-worthy. The CEP says that the changes
>>>>> will not impact existing users, will be backwards compatible, and overall
>>>>> is an efficiency improvement. The CEP guidelines say a CEP is encouraged
>>>>> “for significant user-facing or changes that cut across multiple
>>>>> subsystems”. Any reason why a Jira isn’t sufficient?
>>>>>
>>>>> Abe
>>>>>
>>>>>>> On Sep 5, 2022, at 1:57 AM, Claude Warren via dev
>>>>>>> <dev@cassandra.apache.org> wrote:
>>>>>> I have just posted a CEP covering an Enhancement for Sparse Data
>>>>>> Serialization. This is in response to CASSANDRA-8959
>>>>>>
>>>>>> I look forward to responses.
>>>>>>
>>>>>>