Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
Is it typical for a masking feature to make no effort to prevent unmasking? I’m 
just struggling to see the value of this without such mechanisms. Otherwise 
it’s just a default formatter, and we should consider renaming the feature IMO

> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
> 
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to prevent 
> malicious users with SELECT permissions to indirectly guess the real value of 
> the masked value. This can easily be done by just trying values on the WHERE 
> clause of SELECT queries. DDM would not be a replacement for proper 
> column-level permissions.
> 
> The data served by the database is usually consumed by applications that 
> present this data to end users. These end users are not necessarily the users 
> directly connecting to the database. With DDM, it would be easy for 
> applications to mask sensitive data that is going to be consumed by the end 
> users. However, the users directly connecting to the database should be 
> trusted, provided that they have the right SELECT permissions.
> 
> In other words, DDM doesn't directly protect the data, but it eases the 
> production of protected data.
> 
> Said that, we could later go one step ahead and add a way to prevent 
> untrusted users from inferring the masked data. That could be done adding a 
> new permission required to use certain columns on WHERE clauses, different to 
> the current SELECT permission. That would play especially well with 
> column-level permissions, which is something that we still have pending. 
> 
> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>>> Applying this should prevent querying on a field, else you could leak its 
>>> contents, surely?
>> 
>> In theory, yes.  Although I could see folks doing something like this:
>> 
>> SELECT COUNT(*) FROM patients
>> WHERE year_of_birth = 2002
>> AND date_of_birth >= '2002-04-01'
>> AND date_of_birth < '2002-11-01';
>> 
>> In this case, the rows containing the masked key column(s) could be filtered 
>> on without revealing the actual data.  But again, that's probably better for 
>> a "phase 2" of the implementation.
>> 
>>> Agreed on not being a queryable field. That would also preclude secondary 
>>> indexing, right?
>> 
>> Yes, that's my thought as well. 
>> 
>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker  
>>> wrote:
>>> Agreed on not being a queryable field. That would also preclude secondary 
>>> indexing, right? 
>>> 
>>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict wrote:
>>>> Applying this should prevent querying on a field, else you could leak its 
>>>> contents, surely? This pretty much prohibits using it in a clustering key, 
>>>> and a partition key with the ordered partitioner - but probably also a 
>>>> hashed partitioner since we do not use a cryptographic hash and the hash 
>>>> function is well defined.
>>>> 
>>>> We probably also need to ensure that any ALLOW FILTERING queries on such a 
>>>> field are disabled.
>>>> 
>>>> Plausibly the data could be cryptographically jumbled before using it in a 
>>>> primary key component (or permitting filtering), but it is probably easier 
>>>> and safer to exclude for now…
 
>> On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:
>> 
> 
> Some thoughts on this one:
> 
> In a prior job, we'd give app teams access to a single keyspace, and two 
> roles: a read-write role and a read-only role.  In some cases, a 
> "privileged" application role was also requested.  Depending on the 
> requirements, I could see the UNMASK permission being applied to the RW 
> or privileged roles.  But if there's a problem on the table and the 
> operators go in to investigate, they will likely use a SUPERUSER account, 
> and they'll see that data.
> 
> How hard would it be for SUPERUSERs to *not* automatically get the UNMASK 
> permission?
> 
> I'll also echo the concerns around masking primary key components.  It's 
> highly likely that certain personal data properties would be used as a 
> partition or clustering key (ex: range query for people born within a 
> certain timeframe).  In addition to the "breaks existing" concern, I'm 
> curious about the challenges around getting that to work with the current 
> primary key implementation.
> 
> Does this first implementation only apply to payload (non-key) columns?  
> The examples in the CEP currently do not show primary key components 
> being masked. 
> 
> Thanks,
> 
> Aaron
> 
> 
>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo  
>> wrote:
>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña  
>> wrote:
 One thought: The way the CEP is currently written, it is only possible 
 to mask a column one way. You can only define one masking function for 
 a column, and since you use the original column name, you could only 
 return one versio

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
This seems to me to be a client display filter, applied at the last moment
as data are streaming back to the client.  It has no impact on any keys,
queries or secondary internal index or materialized view.  It simply
prevents the display from showing the complete value.  It does not preclude
determining what some values are by building carefully crafted queries.
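
To make the point about crafted queries concrete, here is a minimal sketch in
Python. It is a toy model, not Cassandra code: the table contents, the mask()
function and the select() helper are all hypothetical. It assumes only what is
described above, namely that predicates are evaluated against the stored
values and masking is applied as rows are returned.

```python
# Toy model: masking is applied only when results are rendered,
# while WHERE-style predicates are evaluated against the stored values.
ROWS = [
    {"id": 1, "year_of_birth": 2002},
    {"id": 2, "year_of_birth": 1987},
]


def mask(value):
    """Hypothetical masking function: hide the value entirely."""
    return "****"


def select(predicate):
    """Filter on the real values, then mask the column on the way out."""
    return [
        {"id": r["id"], "year_of_birth": mask(r["year_of_birth"])}
        for r in ROWS
        if predicate(r)
    ]


def infer_year(row_id, candidates):
    """Recover a masked value by probing one equality predicate per candidate."""
    for year in candidates:
        if select(lambda r: r["id"] == row_id and r["year_of_birth"] == year):
            return year
    return None


# The client only ever sees "****", yet the probe loop recovers the value.
recovered = infer_year(1, range(1900, 2023))
```

Every row the client receives is masked, but whether a row comes back at all
still depends on the real value, which is what the probing relies on.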





On Wed, Aug 24, 2022 at 8:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benjamin Lerer
>
> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO


The security that Dynamic Data Masking brings depends on how you use the
feature. It is somewhat like passwords: if you use a weak password, it does
not bring much security.
Masking a field like a person's gender is useless because you will be able to
determine its value in one query. On the other hand, masking credit card
numbers makes a lot of sense, as it will complicate the life of anyone trying
to gain access to them, and the queries needed to reach the information will
leave clear traces in the audit log.

Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way
to protect sensitive data like credit card numbers or passwords.
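
As a rough illustration of how the size of the value domain affects the number
of queries needed, here is a small Python sketch. The function names are
hypothetical, and the range-probe case assumes that range predicates are
allowed on the masked column, which is not settled in this thread.

```python
def worst_case_equality_probes(domain_size):
    """Worst-case number of equality probes to pin down one value.

    The last remaining candidate is implied once all others are ruled out.
    """
    return max(domain_size - 1, 0)


def worst_case_range_probes(domain_size):
    """Worst-case number of range probes using binary search.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    """
    return (domain_size - 1).bit_length() if domain_size > 1 else 0


# A two-valued field needs a single equality probe.
gender_probes = worst_case_equality_probes(2)

# A 16-digit number has 10**16 candidates under equality probing.
card_equality_probes = worst_case_equality_probes(10**16)

# Only if range predicates are permitted on the masked column (an assumption).
card_range_probes = worst_case_range_probes(10**16)
```

Under these assumptions the two-valued field falls to one query, while the
16-digit number needs a very large number of equality probes, or a few dozen
if range predicates are available.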


On Wed, 24 Aug 2022 at 09:40, Benedict wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
The PCI DSS Standard v4_0 requires that credit card numbers stored on the
system must be "rendered unreadable", thus this proposal is _NOT_ a good way
to protect credit card numbers.  In fact, for any critically sensitive data
this is not an appropriate solution.  However, there seems to be agreement
that it is appropriate for obfuscating some data in some queries by some
users.



On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:

> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>
>
> The security that Dynamic Data Masking is bringing is related to how you
> make use of the feature. It is somehow the same with passwords. If you use
> a weak password it does not bring much security.
> Masking a field like people's gender is useless because you will be able
> to determine its value in one query. On the other hand masking credit card
> numbers makes a lot of sense as it will complicate the life of the person
> trying to have access to it and the queries needed to reach the information
> will leave some clear traces in the audit log.
>
> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way
> to protect sensitive data like credit card numbers or passwords.
>
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Claude Warren, Jr via dev
This change appears to be looking at two aspects:

   1. Add metadata to columns
   2. Add functionality based on the metadata.

If the system had generic user-defined metadata and the ability to define
filter functions at the point where data are being returned to the client,
it would be possible for users to implement this filter, or any other filter,
on the data.

The concept of user defined metadata and filters could be applied to
other parts of the system as well.  For example, if the metadata were
accessible from UDFs the metadata could be used in low level filters to
remove rows from queries before they were returned.
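
A minimal sketch of that idea in Python follows. Everything in it is
hypothetical: the metadata dictionary, the filter signature and the
apply_filters() hook stand in for whatever the system would actually expose,
and none of it reflects an existing Cassandra API.

```python
# Hypothetical user-defined column metadata, keyed by column name.
COLUMN_METADATA = {
    "card_number": {"sensitivity": "high"},
}


def redact_high_sensitivity(column, value, metadata):
    """Example filter: keep only the last four characters of sensitive values."""
    if metadata.get("sensitivity") == "high":
        return "****" + str(value)[-4:]
    return value


# Filters registered by the user, applied in order to every returned column.
RESULT_FILTERS = [redact_high_sensitivity]


def apply_filters(row):
    """Run each registered filter over each column just before returning a row."""
    out = {}
    for column, value in row.items():
        metadata = COLUMN_METADATA.get(column, {})
        for result_filter in RESULT_FILTERS:
            value = result_filter(column, value, metadata)
        out[column] = value
    return out


filtered = apply_filters({"name": "Alice", "card_number": "4111111111111111"})
```

In this sketch masking is just one filter among many that a user could
register against the metadata; the system itself knows nothing about masking.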




On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
wrote:

> The PCI DSS Standard v4_0 requires
> that credit card numbers stored on the system must be "rendered
> unreadable", thus this proposal is _NOT_ a good way to protect credit card
> numbers.  In fact, for any critically sensitive data this is not an
> appropriate solution.  However, there seems to be agreement that it is
> appropriate for obfuscating some data in some queries by some users.
>
>
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benjamin Lerer
>
> The PCI DSS Standard v4_0 requires
> that credit card numbers stored on the system must be "rendered
> unreadable", thus this proposal is _NOT_ a good way to protect credit card
> numbers.


My point was simply that Dynamic Data Masking, like any other feature, makes
sense for some scenarios but not for others. I apologise if my example was a
bad one.

On Wed, 24 Aug 2022 at 10:36, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

> This change appears to be looking at two aspects:
>
>    1. Add metadata to columns
>    2. Add functionality based on the metadata.
>
> If the system had a generic user defined metadata and the ability to
> define filter functions at the point where data are being returned to the
> client it would be possible for users implement this filter, or any other
> filter on the data.
>
> The concept of user defined metadata and filters could be applied to
> other parts of the system as well.  For example, if the metadata were
> accessible from UDFs the metadata could be used in low level filters to
> remove rows from queries before they were returned.
>
>
>
>

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-24 Thread Claude Warren, Jr via dev
Should

> (**) It may seem counterintuitive, that A is being written to even after
> we've stopped reading from it. This is done in order to guarantee that by
> the time we stop writing to the node giving up the range, there is no
> coordinator that may attempt reading from it without learning about *at
> least* the epoch where it is not a part of a read set. In other words, we
> have to keep writing until there's any chance there might be a reader.


instead read:

(**) It may seem counterintuitive, that A is being written to even after
we've stopped reading from it. This is done in order to guarantee that by
the time we stop writing to the node giving up the range, there is no
coordinator that may attempt reading from it without learning about *at
least* the epoch where it is not a part of a read set. In other words, we
have to keep writing *while* there's any chance there might be a reader.

On Tue, Aug 23, 2022 at 7:13 PM Mick Semb Wever  wrote:

>
>
> I just want to say I’m really excited about this work. It’s one of the
>> last remaining major inadequacies of the project that makes it hard for
>> people to deploy, and hard for us to develop.
>>
>>
>
> Second this. And what a solid write up Sam - it's a real joy reading this
> CEP.
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
Right, but we get to decide how we offer such features and what we call them. I 
can’t imagine a good reason to call this a masking feature, especially one that 
applies differentially to certain users, when it is trivial to unmask.

I’m ok offering a feature called “default formatter” or something that applies 
some UDF to a field before returning to the client, and if users wish to “mask” 
their data in this way that’s fine. But calling it a data mask when it is 
trivial to circumvent is IMO dangerous, and I’d at least want to see evidence 
that all other equivalent features in the industry are similarly poorly named 
and offer similarly poor protection.

> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
> 
> 
>> The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
>> system must be "rendered unreadable", thus this proposal is _NOT_ a good way 
>> to protect credit card numbers.
> 
> My point was simply about the fact that Dynamic Data Masking like any other 
> feature made sense for some scenario but not for others. I apologise if my 
> example was a bad one.
> 
>> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev 
>>  a écrit :
>> This change appears to be looking at two aspects:
>> Add metadata to columns
>> Add functionality based on the metadata.
>> If the system had a generic user defined metadata and the ability to define 
>> filter functions at the point where data are being returned to the client it 
>> would be possible for users implement this filter, or any other filter on 
>> the data.
>> 
>> The concept of user defined metadata and filters could be applied to other 
>> parts of the system as well.  For example, if the metadata were accessible 
>> from UDFs the metadata could be used in low level filters to remove rows 
>> from queries before they were returned.
>> 
>> 
>> 
>> 
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr  
>>> wrote:
>>> The PCI DSS Standard v4_0 requires that credit card numbers stored on the 
>>> system must be "rendered unreadable", thus this proposal is _NOT_ a good 
>>> way to protect credit card numbers.  In fact, for any critically sensitive 
>>> data this is not an appropriate solution.  However, there seems to be 
>>> agreement that it is appropriate for obfuscating some data in some queries 
>>> by some users.   
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer  wrote:
>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms. Otherwise it’s just a default formatter, and we should
>>>>> consider renaming the feature IMO
>>>>
>>>> The security that Dynamic Data Masking is bringing is related to how you
>>>> make use of the feature. It is somehow the same with passwords. If you use
>>>> a weak password it does not bring much security.
>>>> Masking a field like people's gender is useless because you will be able
>>>> to determine its value in one query. On the other hand masking credit card
>>>> numbers makes a lot of sense as it will complicate the life of the person
>>>> trying to have access to it and the queries needed to reach the
>>>> information will leave some clear traces in the audit log.
>>>>
>>>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good way
>>>> to protect sensitive data like credit card numbers or passwords.
>>>>
>>>>> On Wed, 24 Aug 2022 at 09:40, Benedict wrote:
>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms. Otherwise it’s just a default formatter, and we should
>>>>> consider renaming the feature IMO
>>>>>
>>>>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña wrote:
>>>>>>>
>>>>>>
>>>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>>>> prevent malicious users with SELECT permissions to indirectly guess the
>>>>>> real value of the masked value. This can easily be done by just trying
>>>>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>>>>> replacement for proper column-level permissions.
>>>>>>
>>>>>> The data served by the database is usually consumed by applications that
>>>>>> present this data to end users. These end users are not necessarily the
>>>>>> users directly connecting to the database. With DDM, it would be easy
>>>>>> for applications to mask sensitive data that is going to be consumed by
>>>>>> the end users. However, the users directly connecting to the database
>>>>>> should be trusted, provided that they have the right SELECT permissions.
>>>>>>
>>>>>> In other words, DDM doesn't directly protect the data, but it eases the
>>>>>> production of protected data.
>>>>>>
>>>>>> Said that, we could later go one step ahead and add a way to prevent
>>>>>> untrusted users from inferring the masked data. That could be done
>>>>>> adding a

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
>
> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO


I'd say it's a pretty standard feature. There are two parts in the
proposal; the CQL functions and the ability to link them to columns.

The CQL functions can indeed be seen as a formatter. You can see similar
functions in MySQL, for example, in what they call "Enterprise Data
Masking and De-identification". Its doc says "MySQL provides
general-purpose masking functions that mask arbitrary strings, and
special-purpose masking functions that mask specific types of values." As
far as I know, MySQL only offers these functions, without tying them to
any permissions. Documentation is here
.

Associating masking functions with columns makes it possible to prevent the
accidental leakage of sensitive data by users who actually have access to the
data. So it can be seen as mandatory formatting; it does not prevent malicious
users with read permission from forcing their way to the clear data. You can
find disclaimers about it for example in the "Dynamic Data Masking" feature
of Azure SQL/SQL server: "The purpose of dynamic data masking is to limit
exposure of sensitive data, preventing users who shouldn't have access to
the data from viewing it. Dynamic data masking doesn't aim to prevent
database users from connecting directly to the database and running
exhaustive queries that expose pieces of the sensitive data.". Its doc even
has a specific section about this, here

.

As another example, IBM Db2 lets you create what they call masks. I don't
see any disclaimer about inferring the clear data, but its documentation
says "The application of enabled column masks does not interfere with the
operations of other clauses within the statement such as the WHERE, GROUP
BY, HAVING, SELECT DISTINCT, or ORDER BY. The rows that are returned in the
final result table remain the same, except that the values in the resulting
rows might have been masked by the column masks.", so I understand that
it's possible to infer the clear values unless one uses additional
permissions or security policies.
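
To make the inference concern concrete, here is a minimal Python sketch. It is not taken from the CEP or from any of the databases mentioned; the rows, the `mask` function and the `masked_select` helper are all hypothetical. It assumes only what the thread describes: the WHERE clause is evaluated against the clear data and the mask is applied to the returned values.

```python
from typing import Dict, List, Optional

# Hypothetical toy table; nothing here is a real database API.
ROWS = [
    {"id": 1, "name": "alice", "pin": "4071"},
    {"id": 2, "name": "bob", "pin": "9315"},
]


def mask(value: str) -> str:
    """Replace every character with '*', like a simple default mask."""
    return "*" * len(value)


def masked_select(where_column: str, where_value: str) -> List[Dict]:
    """Filter on the clear data, then mask 'pin' in the returned rows."""
    matches = [row for row in ROWS if row[where_column] == where_value]
    return [{**row, "pin": mask(row["pin"])} for row in matches]


def infer_pin(row_id: int) -> Optional[str]:
    """Recover a masked 4-digit pin by trying every candidate in WHERE."""
    for candidate in range(10_000):
        guess = f"{candidate:04d}"
        hits = masked_select("pin", guess)
        if any(row["id"] == row_id for row in hits):
            return guess
    return None
```

A reader of `masked_select` never sees a clear pin in the results, yet `infer_pin` recovers it because each query reveals whether the guess matched. The smaller the value space of the column, the fewer queries this takes.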



On Wed, 24 Aug 2022 at 09:48, Benjamin Lerer  wrote:

> The PCI DSS Standard v4_0
>> 
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.
>
>
> My point was simply about the fact that Dynamic Data Masking like any
> other feature made sense for some scenario but not for others. I apologise
> if my example was a bad one.
>
> On Wed, 24 Aug 2022 at 10:36, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> wrote:
>
>> This change appears to be looking at two aspects:
>>
>>1. Add metadata to columns
>>2. Add functionality based on the metadata.
>>
>> If the system had a generic user defined metadata and the ability to
>> define filter functions at the point where data are being returned to the
>> client it would be possible for users implement this filter, or any other
>> filter on the data.
>>
>> The concept of user defined metadata and filters could be applied to
>> other parts of the system as well.  For example, if the metadata were
>> accessible from UDFs the metadata could be used in low level filters to
>> remove rows from queries before they were returned.
>>
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>> wrote:
>>
>>> The PCI DSS Standard v4_0
>>> 
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.  In fact, for any critically sensitive data this is not an
>>> appropriate solution.  However, there seems to be agreement that it is
>>> appropriate for obfuscating some data in some queries by some users.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Here are the names of the feature in some databases out there, errors and
omissions excepted:

   - Microsoft SQL Server / Azure SQL: Dynamic data masking
   - MySQL: Enterprise data masking and de-identification
   - PostgreSQL: Dynamic masking
   - MongoDB: Data masking
   - IBM Db2: Masks
   - Oracle: Redaction
   - MariaDB/MaxScale: Data masking
   - Snowflake: Dynamic data masking


On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:

> Right, but we get to decide how we offer such features and what we call
> them. I can’t imagine a good reason to call this a masking feature,
> especially one that applies differentially to certain users, when it is
> trivial to unmask.
>
> I’m ok offering a feature called “default formatter” or something that
> applies some UDF to a field before returning to the client, and if users
> wish to “mask” their data in this way that’s fine. But calling it a data
> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
> want to see evidence that all other equivalent features in the industry are
> similarly poorly named and offer similarly poor protection.
>
> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>
> 
>
>> The PCI DSS Standard v4_0
>> 
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.
>
>
> My point was simply about the fact that Dynamic Data Masking like any
> other feature made sense for some scenario but not for others. I apologise
> if my example was a bad one.
>
> On Wed, 24 Aug 2022 at 10:36, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> wrote:
>
>> This change appears to be looking at two aspects:
>>
>>1. Add metadata to columns
>>2. Add functionality based on the metadata.
>>
>> If the system had a generic user defined metadata and the ability to
>> define filter functions at the point where data are being returned to the
>> client it would be possible for users implement this filter, or any other
>> filter on the data.
>>
>> The concept of user defined metadata and filters could be applied to
>> other parts of the system as well.  For example, if the metadata were
>> accessible from UDFs the metadata could be used in low level filters to
>> remove rows from queries before they were returned.
>>
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>> wrote:
>>
>>> The PCI DSS Standard v4_0
>>> 
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.  In fact, for any critically sensitive data this is not an
>>> appropriate solution.  However, there seems to be agreement that it is
>>> appropriate for obfuscating some data in some queries by some users.

Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-24 Thread Sam Tunnicliffe
Good catch, I'll update the doc.

Thanks, 
Sam

> On 24 Aug 2022, at 10:24, Claude Warren, Jr via dev 
>  wrote:
> 
> Should 
> 
> (**) It may seem counterintuitive, that A is being written to even after 
> we've stopped reading from it. This is done in order to guarantee that by the 
> time we stop writing to the node giving up the range, there is no coordinator 
> that may attempt reading from it without learning about at least the epoch 
> where it is not a part of a read set. In other words, we have to keep writing 
> until there's any chance there might be a reader.
> 
> instead read:
> 
> (**) It may seem counterintuitive, that A is being written to even after 
> we've stopped reading from it. This is done in order to guarantee that by the 
> time we stop writing to the node giving up the range, there is no coordinator 
> that may attempt reading from it without learning about at least the epoch 
> where it is not a part of a read set. In other words, we have to keep writing 
> while there's any chance there might be a reader.
> 
> On Tue, Aug 23, 2022 at 7:13 PM Mick Semb Wever wrote:
> 
> 
> I just want to say I’m really excited about this work. It’s one of the last 
> remaining major inadequacies of the project that makes it hard for people to 
> deploy, and hard for us to develop.
> 
> 
> 
> Second this. And what a solid write up Sam - it's a real joy reading this CEP.



unsubscribe

2022-08-24 Thread Arpit J
Regards,
Arpit Joshi


Re: unsubscribe

2022-08-24 Thread Erick Ramirez
Sorry to see you go. If you'd like to unsubscribe from the dev ML, please
email dev-unsubscr...@cassandra.apache.org. Cheers!

On Wed, 24 Aug 2022 at 23:01, Arpit J  wrote:

>
> Regards,
> Arpit Joshi
>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
I can’t tell for sure, but the documentation on Postgres’ feature suggests to 
me that it does apply the masking to all possible uses of the data, including 
joining and querying.

Snowflake’s documentation explicitly says that it does.

MySQL’s documentation suggests that it does this.

Oracle, AWS and MS SQL do not.

My inclination would be to - at least by default - forbid querying on columns 
that are masked, unless the mask permits it.


> On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
> 
> 
> Here are the names of the feature on same databases out there, errors and 
> omission excepted:
> Microsoft SQL Server / Azure SQL: Dynamic data masking
> MySQL: Enterprise data masking and de-identification
> PostgreSQL: Dynamic masking
> MongoDB: Data masking
> IBM Db2: Masks
> Oracle: Redaction
> MariaDB/MaxScale: Data masking
> Snowflake: Dynamic data masking
> 
>> On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
>> Right, but we get to decide how we offer such features and what we call 
>> them. I can’t imagine a good reason to call this a masking feature, 
>> especially one that applies differentially to certain users, when it is 
>> trivial to unmask.
>> 
>> I’m ok offering a feature called “default formatter” or something that 
>> applies some UDF to a field before returning to the client, and if users 
>> wish to “mask” their data in this way that’s fine. But calling it a data 
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least 
>> want to see evidence that all other equivalent features in the industry are 
>> similarly poorly named and offer similarly poor protection.
>> 

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Where does MySQL suggest that? As far as I can tell, MySQL only offers a set of
functions for masking. I can't see a way to force users or tables to use
those functions, and it is up to the users whether to use them. I'm
reading this documentation
.

As for broadening the scope of the proposal to prevent malicious users from
inferring the masked data, I guess that the additional rule would simply be
that a user with READ but not UNMASK permissions cannot use masked columns
in WHERE or IF clauses. That would include both SELECT and UPDATE
statements. That would differentiate us from many popular databases out
there, where data masking is usually a simpler thing.
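
The additional rule could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the permission names READ and UNMASK follow this thread, while `MASKED_COLUMNS`, `UnauthorizedError` and `check_restrictions` are hypothetical names that do not exist in Cassandra.

```python
# Hypothetical registry of (table, column) pairs that have a mask attached.
MASKED_COLUMNS = {("patients", "date_of_birth")}


class UnauthorizedError(Exception):
    """Raised when a statement restricts a masked column without UNMASK."""


def check_restrictions(table, restricted_columns, permissions):
    """Validate the columns named in the WHERE or IF clauses of a statement.

    `restricted_columns` lists the columns used in WHERE/IF, and
    `permissions` is the set of permission names the user holds on `table`.
    """
    if "UNMASK" in permissions:
        return  # the user can see clear data anyway, nothing to protect
    for column in restricted_columns:
        if (table, column) in MASKED_COLUMNS:
            raise UnauthorizedError(
                f"{table}.{column} is masked and cannot be used in WHERE or IF"
            )
```

With this check, a user holding only READ could still filter on unmasked columns such as `year_of_birth`, but a restriction on `date_of_birth` would be rejected before the query runs. The same check would apply to the IF conditions of an UPDATE.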

On Wed, 24 Aug 2022 at 14:08, Benedict  wrote:

> I can’t tell for sure, but the documentation on Postgres’ feature suggests
> to me that it does apply the masking to all possible uses of the data,
> including joining and querying.
>
> Snowflake’s documentation explicitly says that it does.
>
> MySQL’s documentation suggests that it does this.
>
> Oracle, AWS and MS SQL do not.
>
> My inclination would be to - at least by default - forbid querying on
> columns that are masked, unless the mask permits it.
>
>
> On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
>
> 
> Here are the names of the feature on same databases out there, errors and
> omission excepted:
>
>- Microsoft SQL Server / Azure SQL: Dynamic data masking
>- MySQL: Enterprise data masking and de-identification
>- PostgreSQL: Dynamic masking
>- MongoDB: Data masking
>- IBM Db2: Masks
>- Oracle: Redaction
>- MariaDB/MaxScale: Data masking
>- Snowflake: Dynamic data masking
>
>
> On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
>
>> Right, but we get to decide how we offer such features and what we call
>> them. I can’t imagine a good reason to call this a masking feature,
>> especially one that applies differentially to certain users, when it is
>> trivial to unmask.
>>
>> I’m ok offering a feature called “default formatter” or something that
>> applies some UDF to a field before returning to the client, and if users
>> wish to “mask” their data in this way that’s fine. But calling it a data
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
>> want to see evidence that all other equivalent features in the industry are
>> similarly poorly named and offer similarly poor protection.
>>
>> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>>
>> 
>>
>>> The PCI DSS Standard v4_0
>>> 
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.
>>
>>
>> My point was simply about the fact that Dynamic Data Masking like any
>> other feature made sense for some scenario but not for others. I apologise
>> if my example was a bad one.
>>
>> On Wed, 24 Aug 2022 at 10:36, Claude Warren, Jr via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> This change appears to be looking at two aspects:
>>>
>>>1. Add metadata to columns
>>>2. Add functionality based on the metadata.
>>>
>>> If the system had a generic user defined metadata and the ability to
>>> define filter functions at the point where data are being returned to the
>>> client it would be possible for users implement this filter, or any other
>>> filter on the data.
>>>
>>> The concept of user defined metadata and filters could be applied to
>>> other parts of the system as well.  For example, if the metadata were
>>> accessible from UDFs the metadata could be used in low level filters to
>>> remove rows from queries before they were returned.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Benedict
The MySQL feature is not equivalent to this proposal; it simply offers new 
transformation functions that implement this functionality, so it is up to the 
application to apply these functions to its own selects or, as most examples 
seem to do, to create a view on the data that applies the function. Of course, 
any joins or queries on such a view will operate over the result of the 
function, not its input. This permits the DBA to create roles that really do 
have no access to the unmasked data, and would have to infer information via 
other means (perhaps joins against other tables). So, it is perhaps a misnomer 
to say that “it does this” but the MySQL feature applies uniformly, and it is 
clear what access to the data a role is being granted, as there is no 
table-level masking.

Postgres appears to adopt the same approach.
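
The view-based approach can be illustrated with a small Python sketch using the standard `sqlite3` module. SQLite is used here only because it runs in-process; it is not MySQL or Postgres and has no roles, so the part where a role is granted access to the view but not to the base table is not shown. The table, view and mask format are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cards (owner TEXT, card_number TEXT);
    INSERT INTO cards VALUES ('alice', '4111111111111111');
    INSERT INTO cards VALUES ('bob',   '5500005555555559');

    -- The view exposes only the masked form of the column.
    CREATE VIEW masked_cards AS
    SELECT owner, 'XXXX-' || substr(card_number, -4) AS card_number
    FROM cards;
    """
)

# A predicate on the view is evaluated against the masked value, so guessing
# the clear card number through WHERE finds nothing.
clear_guess = conn.execute(
    "SELECT owner FROM masked_cards WHERE card_number = ?",
    ("4111111111111111",),
).fetchall()

# Only the masked form matches.
masked_match = conn.execute(
    "SELECT owner FROM masked_cards WHERE card_number = ?",
    ("XXXX-1111",),
).fetchall()
```

Because filters and joins on `masked_cards` operate over the output of the masking expression, a reader restricted to the view learns only what the mask itself leaks, here the last four digits.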


> On 24 Aug 2022, at 14:32, Andrés de la Peña  wrote:
> 
> 
> Where does MySQL suggest that? As far I can tell MySQL only offers a set of 
> functions for masking. I can't see a way to force users or tables to use 
> those functions, and is up to the users to use those functions or not. I'm 
> reading this documentation.
> 
> As for broadening the scope the proposal to prevent malicious users from 
> inferring the masked data, I guess that the additional rule would simply be 
> that a user with READ but not UNMASK permissions cannot use masked columns on 
> WHERE or IF clauses. That would include both SELECT and UPDATE statements. 
> That would differentiate us from many popular databases out there, where data 
> masking usually is a simpler thing.
> 
>> On Wed, 24 Aug 2022 at 14:08, Benedict  wrote:
>> I can’t tell for sure, but the documentation on Postgres’ feature suggests 
>> to me that it does apply the masking to all possible uses of the data, 
>> including joining and querying.
>> 
>> Snowflake’s documentation explicitly says that it does.
>> 
>> MySQL’s documentation suggests that it does this.
>> 
>> Oracle, AWS and MS SQL do not.
>> 
>> My inclination would be to - at least by default - forbid querying on 
>> columns that are masked, unless the mask permits it.
>> 
>> 

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Henrik Ingo
This is the difference between security and compliance I guess :-D

The way I see this, the attacker or threat in this concept is not the
developer with access to the database. Rather, a feature like this is just a
convenient way to apply some masking rule in a centralized way. The
protection is against an end user of the application, who should not be
able to see the personal data of someone else. Or themselves, even. As long
as the application end user doesn't have access to run arbitrary CQL,
these forms of masking prevent accidental unauthorized use/leaking of
personal data.

henrik



On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:

> Is it typical for a masking feature to make no effort to prevent
> unmasking? I’m just struggling to see the value of this without such
> mechanisms. Otherwise it’s just a default formatter, and we should consider
> renaming the feature IMO
>
> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>
> 
> As mentioned in the CEP document, dynamic data masking doesn't try to
> prevent malicious users with SELECT permissions to indirectly guess the
> real value of the masked value. This can easily be done by just trying
> values on the WHERE clause of SELECT queries. DDM would not be a
> replacement for proper column-level permissions.
>
> The data served by the database is usually consumed by applications that
> present this data to end users. These end users are not necessarily the
> users directly connecting to the database. With DDM, it would be easy for
> applications to mask sensitive data that is going to be consumed by the end
> users. However, the users directly connecting to the database should be
> trusted, provided that they have the right SELECT permissions.
>
> In other words, DDM doesn't directly protect the data, but it eases the
> production of protected data.
>
> Said that, we could later go one step ahead and add a way to prevent
> untrusted users from inferring the masked data. That could be done adding a
> new permission required to use certain columns on WHERE clauses, different
> to the current SELECT permission. That would play especially well with
> column-level permissions, which is something that we still have pending.
>
> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>
>> Applying this should prevent querying on a field, else you could leak its
>>> contents, surely?
>>>
>>
>> In theory, yes.  Although I could see folks doing something like this:
>>
>> SELECT COUNT(*) FROM patients
>> WHERE year_of_birth = 2002
>> AND date_of_birth >= '2002-04-01'
>> AND date_of_birth < '2002-11-01';
>>
>> In this case, the rows containing the masked key column(s) could be
>> filtered on without revealing the actual data.  But again, that's probably
>> better for a "phase 2" of the implementation.
>>
>> Agreed on not being a queryable field. That would also preclude secondary
>>> indexing, right?
>>
>>
>> Yes, that's my thought as well.
>>
>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker 
>> wrote:
>>
>>> Agreed on not being a queryable field. That would also preclude
>>> secondary indexing, right?
>>>
>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>>>
>>>> Applying this should prevent querying on a field, else you could leak
>>>> its contents, surely? This pretty much prohibits using it in a clustering
>>>> key, and a partition key with the ordered partitioner - but probably also a
>>>> hashed partitioner since we do not use a cryptographic hash and the hash
>>>> function is well defined.
>>>>
>>>> We probably also need to ensure that any ALLOW FILTERING queries on
>>>> such a field are disabled.
>>>>
>>>> Plausibly the data could be cryptographically jumbled before using it
>>>> in a primary key component (or permitting filtering), but it is probably
>>>> easier and safer to exclude for now…
>>>>
>>>> On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:
>>>>
>>>>
>>>>> Some thoughts on this one:
>>>>>
>>>>> In a prior job, we'd give app teams access to a single keyspace, and
>>>>> two roles: a read-write role and a read-only role.  In some cases, a
>>>>> "privileged" application role was also requested.  Depending on the
>>>>> requirements, I could see the UNMASK permission being applied to the RW or
>>>>> privileged roles.  But if there's a problem on the table and the operators
>>>>> go in to investigate, they will likely use a SUPERUSER account, and they'll
>>>>> see that data.
>>>>>
>>>>> How hard would it be for SUPERUSERs to *not* automatically get the
>>>>> UNMASK permission?
>>>>>
>>>>> I'll also echo the concerns around masking primary key components.
>>>>> It's highly likely that certain personal data properties would be used as a
>>>>> partition or clustering key (ex: range query for people born within a
>>>>> certain timeframe).  In addition to the "breaks existing" concern, I'm
>>>>> curious about the challenges around getting that to work with the current
>>>>> primary key implementation.
>>>>>
>>>>> Does this first implementation only apply to paylo

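To make the inference concern raised in this thread concrete, here is a
minimal sketch of how a user with SELECT access could recover a masked value
just by varying WHERE predicates. This is a toy in-memory model, not Cassandra
code: MaskedTable, Patient and infer_year are hypothetical names, and real CQL
restrictions (primary key rules, ALLOW FILTERING) are ignored. It also assumes
the hidden year lies within the searched range.

```python
# Toy model of the inference problem; every name here is hypothetical.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Patient:
    patient_id: int
    year_of_birth: int  # treated as the masked column in this sketch


class MaskedTable:
    """Filters on real values but returns masked values in the results."""

    def __init__(self, rows: Iterable[Patient]) -> None:
        self._rows = list(rows)

    def select(self, where: Callable[[Patient], bool]) -> List[Tuple[int, str]]:
        # The mask is applied only to returned data, after the filter ran.
        return [(r.patient_id, "****") for r in self._rows if where(r)]


def infer_year(table: MaskedTable, patient_id: int,
               lo: int = 1900, hi: int = 2100) -> Tuple[int, int]:
    """Recover a masked year by binary search over WHERE predicates.

    Assumes the real value is within [lo, hi]. Returns (year, query_count).
    """
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # Only the presence or absence of a row is observed, never the value.
        hit = table.select(
            lambda r: r.patient_id == patient_id and r.year_of_birth <= mid
        )
        queries += 1
        if hit:
            hi = mid
        else:
            lo = mid + 1
    return lo, queries
```

Each query only ever returns masked output, yet the row count alone narrows
the range by half per query. This is why the thread discusses restricting
masked columns in WHERE clauses, or a separate permission for doing so.
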
[DISCUSS] Join OpenJDK Quality Outreach program

2022-08-24 Thread Ekaterina Dimitrova
Hi everyone,
Some time ago I started an ML thread [1] around Java 17 support.
I mentioned there joining the OpenJDK Quality Outreach program [2]. I can go
ahead and do it now if no one is against it; the contact can just be our
dev mailing list, I guess.
In other news... I am back to CASSANDRA-16895, opening sub-tasks/tickets and
attacking the Java 17 support. Let me know if you want to get involved or
if you have any questions/concerns. Based on the POC done before, I started
with smaller incremental changes in our repositories in preparation to
switch from J8 & J11 to J11 & J17 in trunk later on.
[1] https://lists.apache.org/thread/hny49r5vlg4nn9d53n3fksxvjg71joqz
[2] https://wiki.openjdk.java.net/display/quality/Quality+Outreach

Best regards,
Ekaterina


[Marketing] For Review: Performance Benchmarking of Apache Cassandra in the Cloud

2022-08-24 Thread Chris Thornett
Here is Part 1 in a series of 3 on performance benchmarking in Apache
Cassandra by Daniel Seybold:
https://docs.google.com/document/d/1eMFYEOp8lNxZCYelYCWj6jXZ-VaJGNbl2YE3jLWRdOA/edit?usp=sharing

We are opening this up for 72-hour community review. Please add your amends
in the comments—thanks very much!

We are looking at 30 August for publication.

Thanks,
-- 

Chris Thornett
Senior Content Strategist, Constantia.io