Re: [DISCUSS] CEP-20: Dynamic Data Masking
Is there enough support here for VIEWS to be the implementation strategy for displaying masking functions? It seems to me the view would have to store the query and apply a WHERE clause to it, so the same primary key would be in play, which has data-leaking properties. The view approach has more use cases, as it can be used to:
* construct views that filter out sensitive columns
* apply transforms to convert units of measure
Are there more thoughts along this line?
Re: [DISCUSS] LWT UPDATE semantics with + and - when null
I like this approach. However, in light of some of the discussions on views and the like, perhaps the function is (column value as returned by select) + 42, so a null counter column becomes 0 before the update calculation is applied. Then any null remains null unless addressed by ifNull() or zeroIfNull(): any operation on null returns null. I think this follows what most users would expect in most cases.

On 31/08/2022 11:55, Andrés de la Peña wrote: I think I'd prefer 2), the SQL behaviour. We could also get the convenience of 3) by adding CQL functions such as "ifNull(column, default)" or "zeroIfNull(column)", as is done by other dbs. So we could do things like "UPDATE ... SET name = zeroIfNull(name) + 42".

On Wed, 31 Aug 2022 at 04:54, Caleb Rackliffe wrote: Also +1 on the SQL behavior here. I was uneasy w/ coercing to "" / 0 / 1 (depending on the type) in our previous discussion, but for some reason didn't bring up the SQL analog :-|

On Tue, Aug 30, 2022 at 5:38 PM Benedict wrote: I'm a bit torn here, as consistency with counters is important. But they are a unique eventually consistent data type, and I am inclined to default standard numeric types to behave as SQL does, since they write a new value rather than a "delta". It is far from optimal to have divergent behaviours, but also suboptimal to diverge from relational algebra, and probably special-casing counters is the least bad outcome IMO.

On 30 Aug 2022, at 22:52, David Capwell wrote: 4.1 added the ability for LWT to support "UPDATE ... SET name = name + 42", but we never really fleshed out with the larger community what the semantics should be in the case where the column or row is NULL; I opened https://issues.apache.org/jira/browse/CASSANDRA-17857 for this issue. As I see it there are 3 possible outcomes:
1) fail the query
2) null + 42 = null (matches SQL)
3) null + 42 == 0 + 42 = 42 (matches counters)
In SQL you get NULL (option 2), but CQL counters treat NULL as 0 (option 3), meaning we already do not match SQL (though counters are not a standard SQL type, so this might not be applicable). Personally I lean towards option 3, as the "zero" for addition and subtraction is 0 (1 for multiplication and division). So looking for feedback so we can update CASSANDRA-17857 before the 4.1 release.
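For concreteness, here is a minimal Java sketch of the three candidate semantics, using boxed Integers to stand in for CQL column values; the class and method names are illustrative only, not Cassandra code:

// Illustrative only: models the three proposed semantics for "col + 42"
// when the stored value is NULL. Not Cassandra code.
public final class NullAdditionSemantics {

    // Option 1: fail the query outright when the operand is null.
    static int failOnNull(Integer stored, int delta) {
        if (stored == null) throw new IllegalStateException("operation on null column");
        return stored + delta;
    }

    // Option 2 (SQL): null propagates, so null + 42 = null.
    static Integer sqlNull(Integer stored, int delta) {
        return stored == null ? null : stored + delta;
    }

    // Option 3 (counters): null is treated as 0, so null + 42 = 42.
    static int counterStyle(Integer stored, int delta) {
        return (stored == null ? 0 : stored) + delta;
    }

    // The proposed zeroIfNull() convenience recovers option 3 on top of option 2.
    static Integer zeroIfNull(Integer stored) {
        return stored == null ? 0 : stored;
    }

    public static void main(String[] args) {
        System.out.println(sqlNull(null, 42));       // null
        System.out.println(counterStyle(null, 42));  // 42
        System.out.println(zeroIfNull(null) + 42);   // 42: option 2 plus zeroIfNull()
    }
}

As the last line shows, adopting option 2 as the default loses nothing, since zeroIfNull() lets a query opt back into counter-style behaviour.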
[DISCUSS] CEP-23: Enhancement for Sparse Data Serialization
I have just posted a CEP covering an Enhancement for Sparse Data Serialization. This is in response to CASSANDRA-8959. I look forward to responses.
Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization
I am just learning the ropes here, so perhaps it is not CEP-worthy. That being said, it felt like there was a lot of information to put into and track in a ticket, particularly when I expected discussion about how best to encode, changes to the algorithms, etc. It feels like it would be difficult to track, but if that is standard for this project I will move the information there. As to the benchmarking, I had thought that usage and performance measures should be included. Thank you for calling out the subset-of-data-selected query as being of particular importance. Claude

On 06/09/2022 03:11, Abe Ratnofsky wrote: Looking at this link: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be useful to include disk usage / read performance / write performance comparisons with the new encodings, particularly for sparse collections where a subset of data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not impact existing users, will be backwards compatible, and overall is an efficiency improvement. The CEP guidelines say a CEP is encouraged "for significant user-facing or changes that cut across multiple subsystems". Any reason why a Jira isn't sufficient? Abe

On Sep 5, 2022, at 1:57 AM, Claude Warren via dev wrote: I have just posted a CEP covering an Enhancement for Sparse Data Serialization. This is in response to CASSANDRA-8959. I look forward to responses.
Re: [DISCUSS] CEP-20: Dynamic Data Masking
My vote is B

On 07/09/2022 13:12, Benedict wrote: I'm not convinced there's been adequate resolution over which approach is adopted. I know you have expressed a preference for the table schema approach, but the weight of other opinion so far appears to be against this approach, even if it is broadly adopted by other databases. I will note that Postgres does not adopt this approach; it has a more sophisticated security label approach that has not been proposed by anybody so far. I think extra weight should be given to the implementer's preference, so while I personally do not like the table schema approach, I am happy to accept that this is an industry norm and leave the decision to you. However, we should ensure the community as a whole endorses this. I think an indicative poll should be undertaken first, e.g.:
A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to the implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement the view approach
D) We should NOT implement the table schema approach, and should implement some other scheme (or not implement this feature)
Where my vote is B

On 7 Sep 2022, at 12:50, Andrés de la Peña wrote: If nobody has more concerns regarding the CEP I will start the vote tomorrow.

On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña wrote:
> Is there enough support here for VIEWS to be the implementation strategy for displaying masking functions?

I'm not sure that views should be "the" strategy for masking functions. We have multiple approaches here:

1) CQL functions only. Users can decide to use the masking functions of their own accord. I think most dbs allow this pattern of usage, which is quite straightforward. Obviously, it doesn't allow admins to enforce that users see only masked data. Nevertheless, it's still useful for trusted database users generating masked data that will be consumed by the end users of the application.

2) Masking functions attached to specific columns. This way the same queries will see different data (masked or not) depending on the permissions of the user running the query. It has the advantage of not requiring changes to the queries that users with different permissions run. The downside is that users would need to query the schema if they need to know whether a column is masked, unless we change the names of the returned columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support applying the masking function to columns on the base table, and some of them also allow applying masking to views.

3) Masking functions as part of projected views. This way users might need to query the view appropriate for their permissions instead of the base table. This might mean changing the queries if the masking policy is changed by the admin. MySQL recommends this approach in a blog entry, although it's not part of its main documentation for data masking, and the implementation has security issues. Some of the other databases offering approach 2) as their main option also support masking on view columns.

Each approach has its own advantages and limitations, and I don't think we necessarily have to choose. The CEP proposes implementing 1) and 2), but nothing impedes us from also having 3) if we get to have projected views. However, I think that projected views are a new general-purpose feature with their own complexities, so they would deserve their own CEP, if someone is willing to work on the implementation.

On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev wrote: Is there enough support here for VIEWS to be the implementation strategy for displaying masking functions? It seems to me the view would have to store the query and apply a WHERE clause to it, so the same primary key would be in play, which has data-leaking properties. The view approach has more use cases, as it can be used to:
* construct views that filter out sensitive columns
* apply transforms to convert units of measure
Are there more thoughts along this line?
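For concreteness, a hypothetical Java sketch of how approach 2) behaves: the same read path returns masked or clear data depending on the caller's permissions, so queries need not change. The UNMASK permission name and the maskInner function are illustrative assumptions, not taken from the CEP:

// Hypothetical model of approach 2): the mask is attached to the column,
// and the database decides per caller whether to apply it. Not Cassandra code.
import java.util.Set;

public final class ColumnMaskingSketch {

    // Stand-in for a per-column masking function: keep first and last char.
    static String maskInner(String value) {
        if (value == null || value.length() <= 2) return "****";
        return value.charAt(0) + "*".repeat(value.length() - 2) + value.charAt(value.length() - 1);
    }

    // The query text is identical for both users; only permissions differ.
    static String readColumn(String rawValue, Set<String> callerPermissions) {
        return callerPermissions.contains("UNMASK") ? rawValue : maskInner(rawValue);
    }

    public static void main(String[] args) {
        System.out.println(readColumn("alice@example.com", Set.of("SELECT")));           // a***************m
        System.out.println(readColumn("alice@example.com", Set.of("SELECT", "UNMASK"))); // alice@example.com
    }
}

Under approach 3), by contrast, the branch above would disappear and each class of user would instead query a differently projected view.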
Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization
I have looked through the code mentioned. What I found in ColumnSerializer was the use of VInt encoding. Are you proposing switching directly to VInt encoding for sizes rather than one of the other encodings, using a -2 as the first length to signal that the new encoding is in use, so that existing encodings can be read unchanged?

On 06/09/2022 16:37, Benedict wrote: So, looking more closely at your proposal I realise what you are trying to do. The thing that threw me was your mention of lists and other collections. This will likely not work, as there is no index that can be defined on a list (or other collection) within a single sstable: a list is defined over the whole on-disk contents, so the index is undefined within a given sstable.

Tuple and UDT are encoded inefficiently if there are many null fields, but this is a very localised change, affecting just one class. You should take a look at Columns.Serializer for code you can lift for encoding and decoding sparse subsets of fields. It might be that this can be switched on or off per sstable with a header flag bit, so that there is no additional cost for datasets that would not benefit. Likely we can also migrate to vint encoding for the component sizes (and either 1 or 0 bytes for fixed-width values), no doubt saving a lot of space over the status quo, even for small UDTs with few null entries. Essentially at this point we're talking about pushing through storage optimisations applied elsewhere to tuples and UDTs, which is a very uncontroversial change.

On 6 Sep 2022, at 07:28, Benedict wrote: I agree a Jira would suffice, and if visibility is needed there, a DISCUSS thread or simply a notice sent to the list. While we're here, though, while I don't have a lot of time to engage in discussion, it's unclear to me what advantage this encoding scheme brings. It might be worth outlining what algorithmic advantage you foresee, for which data distributions, in which collection types.

On 6 Sep 2022, at 07:16, Claude Warren via dev wrote: I am just learning the ropes here, so perhaps it is not CEP-worthy. That being said, it felt like there was a lot of information to put into and track in a ticket, particularly when I expected discussion about how best to encode, changes to the algorithms, etc. It feels like it would be difficult to track, but if that is standard for this project I will move the information there. As to the benchmarking, I had thought that usage and performance measures should be included. Thank you for calling out the subset-of-data-selected query as being of particular importance. Claude

On 06/09/2022 03:11, Abe Ratnofsky wrote: Looking at this link: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be useful to include disk usage / read performance / write performance comparisons with the new encodings, particularly for sparse collections where a subset of data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not impact existing users, will be backwards compatible, and overall is an efficiency improvement. The CEP guidelines say a CEP is encouraged "for significant user-facing or changes that cut across multiple subsystems". Any reason why a Jira isn't sufficient? Abe

On Sep 5, 2022, at 1:57 AM, Claude Warren via dev wrote: I have just posted a CEP covering an Enhancement for Sparse Data Serialization. This is in response to CASSANDRA-8959. I look forward to responses.
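As a rough illustration of the kind of sparse encoding discussed above (a presence bitmap plus vint-prefixed values, so null fields cost almost nothing), here is a minimal self-contained Java sketch. The layout and names are assumptions for illustration, not Cassandra's actual Columns.Serializer wire format:

// Illustrative sketch of sparse field encoding: a bitmap of present fields,
// then each present value prefixed by a vint length. Not Cassandra's format.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public final class SparseFieldEncodingSketch {

    // Unsigned vint: 7 bits per byte, high bit marks continuation.
    static void writeUnsignedVInt(DataOutputStream out, long value) throws IOException {
        while ((value & ~0x7FL) != 0) {
            out.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((int) value);
    }

    // Encode a UDT/tuple with up to 64 fields; a null field costs nothing
    // beyond its bit in the presence bitmap.
    static byte[] encode(byte[][] fields) throws IOException {
        long present = 0;
        for (int i = 0; i < fields.length; i++)
            if (fields[i] != null) present |= 1L << i;

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeUnsignedVInt(out, present);          // which fields follow
        for (byte[] field : fields) {
            if (field == null) continue;
            writeUnsignedVInt(out, field.length); // vint length prefix
            out.write(field);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A 5-field UDT with only fields 0 and 3 set.
        byte[][] fields = { "id1".getBytes(), null, null, "x".getBytes(), null };
        System.out.println(encode(fields).length + " bytes"); // 1 bitmap + 2 lengths + 4 payload = 7
    }
}

The per-sstable header flag bit Benedict suggests would then simply select between this layout and the existing one at read time.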
Committer needed for Deprecate Throwables.propagate usage
I made the necessary fixes to remove the deprecated Throwables.propagate calls. However, I need a committer to review: https://issues.apache.org/jira/browse/CASSANDRA-14218

Thank you, Claude
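For reference, the standard replacement for Guava's deprecated Throwables.propagate(e) is to rethrow unchecked throwables as-is and wrap checked ones. A minimal before/after sketch (the wrapper methods are illustrative, not the actual patch):

// Typical migration away from Guava's deprecated Throwables.propagate.
import com.google.common.base.Throwables;

public final class PropagateMigrationSketch {

    // Before: Throwables.propagate is deprecated (since Guava 20).
    static RuntimeException before(Throwable t) {
        throw Throwables.propagate(t);
    }

    // After: the documented replacement pattern.
    static RuntimeException after(Throwable t) {
        Throwables.throwIfUnchecked(t);  // rethrows RuntimeException/Error as-is
        throw new RuntimeException(t);   // wraps checked exceptions
    }
}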