Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-31 Thread Claude Warren via dev
Is there enough support here for VIEWS to be the implementation strategy 
for displaying masking functions?


It seems to me the view would have to store the query and apply a where 
clause to it, so the same PK would be in play.


It has data leaking properties.

It has more use cases as it can be used to

 * construct views that filter out sensitive columns
 * apply transforms to convert units of measure

Are there more thoughts along this line?


Re: [DISCUSS] LWT UPDATE semantics with + and - when null

2022-08-31 Thread Claude Warren via dev
I like this approach.  However, in light of some of the discussions on 
views and the like, perhaps the function is (column value as returned by 
select) + 42


So a null counter column becomes 0 before the update calculation is applied.

Then any null can be considered null unless addressed by ifNull() or 
zeroIfNull().


Any operation on null returns null.

I think this follows what would be expected by most users in most cases.


On 31/08/2022 11:55, Andrés de la Peña wrote:
I think I'd prefer 2), the SQL behaviour. We could also get the 
convenience of 3) by adding CQL functions such as "ifNull(column, 
default)" or "zeroIfNull(column)", as it's done by other dbs. So we 
could do things like "UPDATE ... SET name = zeroIfNull(name) + 42".
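
The convenience functions suggested here could behave roughly as follows. 
This is an illustrative Python model of the proposed semantics only — not 
CQL and not Cassandra code; the function names are taken from the thread:

```python
# Illustrative model of the proposed CQL helpers (names from the thread;
# this Python sketch is not the actual Cassandra implementation).

def if_null(column, default):
    """ifNull(column, default): substitute a default when the column is null."""
    return default if column is None else column

def zero_if_null(column):
    """zeroIfNull(column): treat a null numeric column as 0."""
    return if_null(column, 0)

# "UPDATE ... SET name = zeroIfNull(name) + 42" applied to a null column:
print(zero_if_null(None) + 42)  # -> 42
# Without the helper, SQL-style semantics would propagate the null instead.
```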


On Wed, 31 Aug 2022 at 04:54, Caleb Rackliffe 
 wrote:


Also +1 on the SQL behavior here. I was uneasy w/ coercing to "" /
0 / 1 (depending on the type) in our previous discussion, but for
some reason didn't bring up the SQL analog :-|

On Tue, Aug 30, 2022 at 5:38 PM Benedict  wrote:

I’m a bit torn here, as consistency with counters is
important. But they are a unique eventually consistent data
type, and I am inclined to default standard numeric types to
behave as SQL does, since they write a new value rather than a
“delta”.

It is far from optimal to have divergent behaviours, but also
suboptimal to diverge from relational algebra, and probably
special casing counters is the least bad outcome IMO.



On 30 Aug 2022, at 22:52, David Capwell 
wrote:


4.1 added the ability for LWT to support "UPDATE ... SET name
= name + 42", but we never really fleshed out with the larger
community what the semantics should be in the case where the
column or row are NULL; I opened up
https://issues.apache.org/jira/browse/CASSANDRA-17857 for
this issue.

As I see it there are 3 possible outcomes:
1) fail the query
2) null + 42 = null (matches SQL)
3) null + 42 == 0 + 42 = 42 (matches counters)

In SQL you get NULL (option 2), but CQL counters treat NULL
as 0 (option 3) meaning we already do not match SQL (though
counters are not a standard SQL type so might not be
applicable).  Personally I lean towards option 3 as the
"zero" for addition and subtraction is 0 (1 for
multiplication and division).
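
The three candidate outcomes can be modeled roughly as below. This is a 
hypothetical Python sketch, not Cassandra code; the `policy` parameter 
exists only to contrast the options side by side:

```python
# Sketch of the three candidate semantics for "null + 42".

def lwt_add(column, delta, policy):
    """Apply 'column + delta' under one of the three proposed null policies."""
    if column is None:
        if policy == "fail":     # option 1: reject the query
            raise ValueError("cannot apply '+' to a null column")
        if policy == "sql":      # option 2: null + 42 = null
            return None
        if policy == "counter":  # option 3: null treated as 0
            return delta
    return column + delta

print(lwt_add(None, 42, "sql"))      # -> None
print(lwt_add(None, 42, "counter"))  # -> 42
```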

So looking for feedback so we can update in CASSANDRA-17857
before 4.1 release.



[DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-05 Thread Claude Warren via dev
I have just posted a CEP covering an Enhancement for Sparse Data 
Serialization.  This is in response to CASSANDRA-8959


I look forward to responses.




Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-05 Thread Claude Warren via dev
I am just learning the ropes here so perhaps it is not CEP worthy.  That 
being said, it felt like there was a lot of information to put into and 
track in a ticket, particularly when I expected discussion about how 
best to encode, changes to the algorithms, etc.  It feels like it would 
be difficult to track.  But if that is standard for this project I will 
move the information there.


As to the benchmarking, I had thought that usage and performance 
measures should be included.  Thank you for calling out the query that 
selects a subset of the data as being of particular importance.


Claude

On 06/09/2022 03:11, Abe Ratnofsky wrote:

Looking at this link: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be 
useful to include disk usage / read performance / write performance comparisons 
with the new encodings, particularly for sparse collections where a subset of 
data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not 
impact existing users, will be backwards compatible, and overall is an 
efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
significant user-facing or changes that cut across multiple subsystems”. Any 
reason why a Jira isn’t sufficient?

Abe


On Sep 5, 2022, at 1:57 AM, Claude Warren via dev  
wrote:

I have just posted a CEP covering an Enhancement for Sparse Data Serialization. 
This is in response to CASSANDRA-8959

I look forward to responses.




Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Claude Warren via dev

My vote is B

On 07/09/2022 13:12, Benedict wrote:
I’m not convinced there’s been adequate resolution over which approach 
is adopted. I know you have expressed a preference for the table 
schema approach, but the weight of other opinion so far appears to be 
against this approach - even if it is broadly adopted by other 
databases. I will note that Postgres does not adopt this approach; it 
has a more sophisticated security label approach that has not been 
proposed by anybody so far.


I think extra weight should be given to the implementer’s preference, 
so while I personally do not like the table schema approach, I am 
happy to accept this is an industry norm, and leave the decision to you.


However, we should ensure the community as a whole endorses this. I 
think an indicative poll should be undertaken first, eg:


A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to the 
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should 
implement the view approach
D) We should NOT implement the table schema approach, and should 
implement some other scheme (or not implement this feature)


Where my vote is B


On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:


If nobody has more concerns regarding the CEP I will start the vote 
tomorrow.


On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
 wrote:


Is there enough support here for VIEWS to be the
implementation strategy for displaying masking functions?


I'm not sure that views should be "the" strategy for masking
functions. We have multiple approaches here:

1) CQL functions only. Users can decide to use the masking
functions on their own will. I think most dbs allow this pattern
of usage, which is quite straightforward. Obviously, it doesn't
allow admins to enforce that users see only masked data.
Nevertheless, it's still useful for trusted database users
generating masked data that will be consumed by the end users of
the application.

2) Masking functions attached to specific columns. This way the
same queries will see different data (masked or not) depending on
the permissions of the user running the query. It has the
advantage of not requiring changes to the queries that users with
different permissions run. The downside is that users would need
to query the schema if they need to know whether a column is
masked, unless we change the names of the returned columns. This
is the approach offered by Azure/SQL Server, PostgreSQL, IBM Db2,
Oracle, MariaDB/MaxScale and Snowflake. All these databases
support applying the masking function to columns on the base
table, and some of them also allow applying masking to views.

3) Masking functions as part of projected views. This way users
might need to query the view appropriate for their permissions
instead of the base table. This might mean changing the queries
if the masking policy is changed by the admin. MySQL recommends
this approach on a blog entry, although it's not part of its main
documentation for data masking, and the implementation has
security issues. Some of the other databases offering the
approach 2) as their main option also support masking on view
columns.

Each approach has its own advantages and limitations, and I don't
think we necessarily have to choose. The CEP proposes
implementing 1) and 2), but nothing prevents us from also having
3) if we get to have projected views. However, I think that
projected views are a new general-purpose feature with their own
complexities, so they would deserve their own CEP, if someone is
willing to work on the implementation.



On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev
 wrote:

Is there enough support here for VIEWS to be the
implementation strategy for displaying masking functions?

It seems to me the view would have to store the query and
apply a where clause to it, so the same PK would be in play.

It has data leaking properties.

It has more use cases as it can be used to

  * construct views that filter out sensitive columns
  * apply transforms to convert units of measure

Are there more thoughts along this line?


Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-07 Thread Claude Warren via dev
I have looked through the code mentioned.  What I found in the 
ColumnSerializer was the use of VInt encoding.  Are you proposing 
switching directly to VInt encoding for sizes rather than one of the 
other encodings?  Using a -2 as the first length to signal that the new 
encoding is in use so that existing encodings can be read unchanged?
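
The sentinel idea could work roughly as sketched below. The sketch uses a 
simple LEB128-style zigzag varint purely for illustration — Cassandra's 
actual VIntCoding uses a different byte layout — so treat this only as a 
model of the -2 marker trick, not as the proposed wire format:

```python
# A length of -2 cannot occur as a real size, so a reader that sees it
# first can switch to the new decoder; old data is read unchanged.

NEW_FORMAT_SENTINEL = -2

def zigzag(n):
    # Map signed to unsigned so small negatives stay small.
    return (n << 1) ^ (n >> 63)

def unzigzag(u):
    return (u >> 1) ^ -(u & 1)

def write_vint(n):
    """Encode a signed int as a LEB128-style varint (illustrative only)."""
    u = zigzag(n) & (2**64 - 1)
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def read_vint(buf, pos=0):
    """Decode one varint; returns (value, next position)."""
    shift = u = 0
    while True:
        b = buf[pos]
        pos += 1
        u |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return unzigzag(u), pos

# Old readers would see a size; new readers see -2 and switch decoders.
payload = write_vint(NEW_FORMAT_SENTINEL) + write_vint(12345)
marker, pos = read_vint(payload)
assert marker == NEW_FORMAT_SENTINEL
```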



On 06/09/2022 16:37, Benedict wrote:

So, looking more closely at your proposal I realise what you are trying to do. 
The thing that threw me was your mention of lists and other collections. This 
will likely not work, as no index can be defined on a list (or other 
collection) within a single sstable - a list is defined over the whole on-disk 
contents, so the index is undefined within a given sstable.

Tuple and UDT are encoded inefficiently if there are many null fields, but this 
is a very localised change, affecting just one class. You should take a look at 
Columns.Serializer for code you can lift for encoding and decoding sparse 
subsets of fields.

It might be that this can be switched on or off per sstable with a header flag 
bit so that there is no additional cost for datasets that would not benefit. 
Likely we can also migrate to vint encoding for the component sizes (and 
either 1 or 0 bytes for fixed width values), no doubt saving a lot of space 
over the status quo, even for small UDT with few null entries.

Essentially at this point we’re talking about pushing through storage 
optimisations applied elsewhere to tuples and UDT, which is a very 
uncontroversial change.
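
The sparse-subset encoding referred to here can be sketched as a presence 
bitmap followed by only the non-null values. This Python model is purely 
illustrative and far simpler than what Columns.Serializer actually does; 
it just shows why many-null tuples/UDTs get cheaper:

```python
# Encode a tuple/UDT with many nulls as (presence bitmap, non-null values):
# absent fields cost one bit instead of a serialized null marker each.

def encode_sparse(fields):
    bitmap = 0
    values = []
    for i, v in enumerate(fields):
        if v is not None:
            bitmap |= 1 << i       # bit i set => field i is present
            values.append(v)
    return bitmap, values

def decode_sparse(bitmap, values, n_fields):
    out, it = [], iter(values)
    for i in range(n_fields):
        out.append(next(it) if bitmap & (1 << i) else None)
    return out

udt = [None, None, 7, None, "x", None, None, None]
bm, vals = encode_sparse(udt)
assert decode_sparse(bm, vals, len(udt)) == udt  # round-trips losslessly
```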


On 6 Sep 2022, at 07:28, Benedict  wrote:

I agree a Jira would suffice; if visibility were needed there, a DISCUSS 
thread or simply a notice sent to the list would do.

While we’re here though, while I don’t have a lot of time to engage in 
discussion it’s unclear to me what advantage this encoding scheme brings. It 
might be worth outlining what algorithmic advantage you foresee for what data 
distributions in which collection types.


On 6 Sep 2022, at 07:16, Claude Warren via dev  wrote:

I am just learning the ropes here so perhaps it is not CEP worthy.  That being 
said, it felt like there was a lot of information to put into and track in a 
ticket, particularly when I expected discussion about how best to encode, 
changes to the algorithms, etc.  It feels like it would be difficult to track. 
But if that is standard for this project I will move the information there.

As to the benchmarking, I had thought that usage and performance measures 
should be included.  Thank you for calling out the query that selects a 
subset of the data as being of particular importance.

Claude


On 06/09/2022 03:11, Abe Ratnofsky wrote:

Looking at this link: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be 
useful to include disk usage / read performance / write performance comparisons 
with the new encodings, particularly for sparse collections where a subset of 
data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not 
impact existing users, will be backwards compatible, and overall is an 
efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
significant user-facing or changes that cut across multiple subsystems”. Any 
reason why a Jira isn’t sufficient?

Abe


On Sep 5, 2022, at 1:57 AM, Claude Warren via dev  
wrote:

I have just posted a CEP covering an Enhancement for Sparse Data Serialization. 
This is in response to CASSANDRA-8959

I look forward to responses.




Committer needed for Deprecate Throwables.propagate usage

2022-09-20 Thread Claude Warren via dev
I made the necessary fixes to remove the deprecated Throwables.propagate 
calls.  However, I need a committer to review.


https://issues.apache.org/jira/browse/CASSANDRA-14218

Thank you,

Claude