Re: [DISCUSS] CEP-20: Dynamic Data Masking

Benedict Tue, 30 Aug 2022 04:55:19 -0700

Not to push the point too strongly (I don’t have a very firm view of my own), 
but if we provide this via a view feature we’re just implementing one new 
feature and we get masking for free. I don’t think it is materially more 
complicated than redefining columns for users - it might even be less so, as we 
do not have to consider how applications interpret table metadata.


Projection views are a very simple concept and pretty simple to implement I 
think, and conceptually very familiar to users. So let’s at least not prefer 
the table column modifier approach because it’s simpler or requires fewer new 
features, as I do not believe this to be the case.
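
To make the projection-view idea concrete, here is a minimal sketch in Python rather than CQL. It only illustrates the concept of "a subset of fields with some functions applied"; it is not a proposed implementation. The names `ProjectionView`, `mask_middle`, and the sample `patients` rows are hypothetical and do not come from the CEP or from Cassandra.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, object]
Mask = Callable[[object], object]


def mask_middle(value: object, keep: int = 1, pad: str = "*") -> str:
    """Hypothetical masking function: keep the first and last `keep` characters."""
    text = str(value)
    if len(text) <= 2 * keep:
        # Too short to keep anything without revealing the whole value.
        return pad * len(text)
    return text[:keep] + pad * (len(text) - 2 * keep) + text[-keep:]


class ProjectionView:
    """Read-only view exposing a subset of columns, optionally transformed."""

    def __init__(self, columns: Dict[str, Optional[Mask]]):
        # Maps each exposed column to a masking function, or None to pass through.
        self.columns = columns

    def select(self, rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            yield {
                name: (mask(row[name]) if mask is not None else row[name])
                for name, mask in self.columns.items()
            }


# Hypothetical base table contents.
patients = [
    {"id": 1, "name": "Alice Smith", "ssn": "123-45-6789"},
    {"id": 2, "name": "Bob", "ssn": "987-65-4321"},
]

# The view omits `ssn` entirely and masks `name`; the base rows are untouched.
patients_masked = ProjectionView({"id": None, "name": mask_middle})
masked_rows = list(patients_masked.select(patients))
```

In this model a role granted access only to the view never sees the base columns, and the base table's own metadata does not change.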

> On 30 Aug 2022, at 12:46, Andrés de la Peña <adelap...@apache.org> wrote:
> 
> 
>> GRANT SELECT ON foo.unmasked_name TO top_secret;
> 
> Note that Cassandra doesn't have support for column-level permissions. There 
> was an initiative to add them in 2016, CASSANDRA-12859. However, the ticket 
> has been inactive since 2017. The last comments appear to be discussions 
> about design.
> 
> Also, generated columns in PostgreSQL are always stored, so if they were used 
> for masking they would constitute static data masking, not dynamic. 
> 
> The approach to dynamic data masking that PostgreSQL suggests in its 
> documentation doesn't seem to be based on generating a masked copy of the 
> column, whether as a generated column or as a view. Instead, it uses security 
> labels to associate columns with users and masking functions. That way, the 
> same column will be seen masked or unmasked depending on the user. 
> 
> I'd say that applying the masking rule to the base column itself, and not to 
> a copy, is the most common approach among the discussed databases so far. 
> Also, it has the advantage for us of not depending on other relatively 
> complex features that we lack, such as column-level permissions or 
> non-materialized views. If someday we add those features I think they would 
> play well with what is proposed in the CEP.
> 
>> On Tue, 30 Aug 2022 at 11:46, Avi Kivity via dev <dev@cassandra.apache.org> 
>> wrote:
>> Agree with views, or alternatively, column permissions together with 
>> computed columns:
>> 
>> CREATE TABLE foo (
>>   id int PRIMARY KEY,
>>   unmasked_name text,
>>   name text GENERATED ALWAYS AS (some_mask_function(unmasked_name, 'xxx', 7)) STORED
>> )
>> 
>> (syntax from postgresql)
>> 
>> GRANT SELECT ON foo.name TO general_use;
>> GRANT SELECT ON foo.unmasked_name TO top_secret;
>> 
>>> On 26/08/2022 00.10, Benedict wrote:
>>> I’m inclined to agree that this seems a more straightforward approach that 
>>> makes fewer implied promises.
>>> 
>>> Perhaps we could deliver simple views backed by virtual tables, and model 
>>> our approach on that of Postgres, MySQL et al?
>>> 
>>> Views in C* would be very simple, just offering a subset of fields with 
>>> some UDFs applied. It would allow users to define roles with access only to 
>>> the views, or for applications to use the views for presentation purposes.
>>> 
>>> It feels like a cleaner approach to me, and we’d get two features for the 
>>> price of one. BUT I don’t feel super strongly about this.
>>> 
>>>> On 25 Aug 2022, at 20:16, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>> 
>>>> 
>>>> To make sure I understand, if I wanted to use a masked column for a 
>>>> conditional update, you're saying we would need SELECT_MASKED to use it in 
>>>> the IF clause? I worry that this proposal is increasing in complexity; I 
>>>> would actually be OK starting with something smaller in scope. Perhaps 
>>>> just providing the masking functions and not tying masking to schema would 
>>>> be sufficient for an initial goal? That wouldn't preclude additional 
>>>> permissions, schema integration, or perhaps just plain Views in the future.
>>>> 
>>>> Cheers,
>>>> 
>>>> Derek
>>>> 
>>>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña <adelap...@apache.org> 
>>>> wrote:
>>>>> I have modified the proposal adding a new SELECT_MASKED permission. Using 
>>>>> masked columns on WHERE/IF clauses would require having SELECT and either 
>>>>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the 
>>>>> query results would always require both SELECT and UNMASK.
>>>>> 
>>>>> This way we can have the best of both worlds, allowing admins to decide 
>>>>> whether they trust their immediate users or not. wdyt?
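
The permission rule described above can be written as two small predicates. This is only a sketch of the rule as stated in the message, modelling a role's permissions as a plain set of strings; the function names are hypothetical and nothing here reflects how Cassandra's authorizer is actually structured.

```python
from typing import Set


def can_filter_on_masked(perms: Set[str]) -> bool:
    """WHERE/IF on a masked column: SELECT plus either UNMASK or SELECT_MASKED."""
    return "SELECT" in perms and bool(perms & {"UNMASK", "SELECT_MASKED"})


def sees_clear_values(perms: Set[str]) -> bool:
    """Unmasked values in query results: always both SELECT and UNMASK."""
    return {"SELECT", "UNMASK"} <= perms
```

Under this rule a role holding SELECT and SELECT_MASKED may filter on a masked column but still receives masked values, while a role holding only SELECT can do neither.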
>>>>> 
>>>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo <henrik.i...@datastax.com> 
>>>>> wrote:
>>>>>> This is the difference between security and compliance I guess :-D
>>>>>> 
>>>>>> The way I see this, the attacker or threat in this concept is not the 
>>>>>> developer with access to the database. Rather a feature like this is 
>>>>>> just a convenient way to apply some masking rule in a centralized way. 
>>>>>> The protection is against an end user of the application, who should not 
>>>>>> be able to see the personal data of someone else. Or themselves, even. 
>>>>>> As long as the application end user doesn't have access to run arbitrary 
>>>>>> CQL, then these forms of masking prevent accidental unauthorized 
>>>>>> use/leaking of personal data.
>>>>>> 
>>>>>> henrik
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict <bened...@apache.org> wrote:
>>>>>>> Is it typical for a masking feature to make no effort to prevent 
>>>>>>> unmasking? I’m just struggling to see the value of this without such 
>>>>>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>>>>>> consider renaming the feature IMO
>>>>>>> 
>>>>>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña <adelap...@apache.org> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> As mentioned in the CEP document, dynamic data masking doesn't try to 
>>>>>>>> prevent malicious users with SELECT permissions from indirectly 
>>>>>>>> guessing the real value behind a masked value. This can easily be done 
>>>>>>>> by just trying values in the WHERE clause of SELECT queries. DDM would 
>>>>>>>> not be a replacement for proper column-level permissions.
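
The guessing described above can be illustrated with a toy model. In this sketch the "database" filters on the real value but returns the column masked, and a caller recovers the value simply by enumerating candidates. The table contents, the `pin` column, and both function names are made up for illustration; this is not Cassandra code.

```python
from typing import Dict, List, Optional

# Hypothetical table with a sensitive column.
TABLE = [{"id": 1, "pin": 4271}, {"id": 2, "pin": 1234}]


def masked_select(pin_equals: int) -> List[Dict[str, object]]:
    """Filter on the real value, but return the column masked."""
    return [{"id": r["id"], "pin": "****"} for r in TABLE if r["pin"] == pin_equals]


def guess_pin(row_id: int, candidates: range) -> Optional[int]:
    """Recover a masked value by trying each candidate in the filter."""
    for candidate in candidates:
        if any(r["id"] == row_id for r in masked_select(candidate)):
            return candidate
    return None


recovered = guess_pin(1, range(10000))
```

Every individual result is masked, yet whether a row comes back at all leaks the value, which is why filtering on masked columns needs its own control.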
>>>>>>>> 
>>>>>>>> The data served by the database is usually consumed by applications 
>>>>>>>> that present this data to end users. These end users are not 
>>>>>>>> necessarily the users directly connecting to the database. With DDM, 
>>>>>>>> it would be easy for applications to mask sensitive data that is going 
>>>>>>>> to be consumed by the end users. However, the users directly 
>>>>>>>> connecting to the database should be trusted, provided that they have 
>>>>>>>> the right SELECT permissions.
>>>>>>>> 
>>>>>>>> In other words, DDM doesn't directly protect the data, but it eases 
>>>>>>>> the production of protected data.
>>>>>>>> 
>>>>>>>> That said, we could later go one step further and add a way to prevent 
>>>>>>>> untrusted users from inferring the masked data. That could be done by 
>>>>>>>> adding a new permission required to use certain columns in WHERE 
>>>>>>>> clauses, different to the current SELECT permission. That would play 
>>>>>>>> especially well with column-level permissions, which is something that 
>>>>>>>> we still have pending. 
>>>>>>>> 
>>>>>>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz <aaronplo...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>>> Applying this should prevent querying on a field, else you could 
>>>>>>>>>> leak its contents, surely?
>>>>>>>>> 
>>>>>>>>> In theory, yes.  Although I could see folks doing something like this:
>>>>>>>>> 
>>>>>>>>> SELECT COUNT(*) FROM patients
>>>>>>>>> WHERE year_of_birth = 2002
>>>>>>>>> AND date_of_birth >= '2002-04-01'
>>>>>>>>> AND date_of_birth < '2002-11-01';
>>>>>>>>> 
>>>>>>>>> In this case, the rows containing the masked key column(s) could be 
>>>>>>>>> filtered on without revealing the actual data.  But again, that's 
>>>>>>>>> probably better for a "phase 2" of the implementation.
>>>>>>>>> 
>>>>>>>>>> Agreed on not being a queryable field. That would also preclude 
>>>>>>>>>> secondary indexing, right?
>>>>>>>>> 
>>>>>>>>> Yes, that's my thought as well. 
>>>>>>>>> 
>>>>>>>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker 
>>>>>>>>> <de...@chen-becker.org> wrote:
>>>>>>>>>> Agreed on not being a queryable field. That would also preclude 
>>>>>>>>>> secondary indexing, right? 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict <bened...@apache.org> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Applying this should prevent querying on a field, else you could 
>>>>>>>>>>> leak its contents, surely? This pretty much prohibits using it in a 
>>>>>>>>>>> clustering key, and a partition key with the ordered partitioner - 
>>>>>>>>>>> but probably also a hashed partitioner since we do not use a 
>>>>>>>>>>> cryptographic hash and the hash function is well defined.
>>>>>>>>>>> 
>>>>>>>>>>> We probably also need to ensure that any ALLOW FILTERING queries on 
>>>>>>>>>>> such a field are disabled.
>>>>>>>>>>> 
>>>>>>>>>>> Plausibly the data could be cryptographically jumbled before using 
>>>>>>>>>>> it in a primary key component (or permitting filtering), but it is 
>>>>>>>>>>> probably easier and safer to exclude for now…
>>>>>>>>>>> 
>>>>>>>>>>>> On 23 Aug 2022, at 18:13, Aaron Ploetz <aaronplo...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Some thoughts on this one:
>>>>>>>>>>>> 
>>>>>>>>>>>> In a prior job, we'd give app teams access to a single keyspace, 
>>>>>>>>>>>> and two roles: a read-write role and a read-only role.  In some 
>>>>>>>>>>>> cases, a "privileged" application role was also requested.  
>>>>>>>>>>>> Depending on the requirements, I could see the UNMASK permission 
>>>>>>>>>>>> being applied to the RW or privileged roles.  But if there's a 
>>>>>>>>>>>> problem on the table and the operators go in to investigate, they 
>>>>>>>>>>>> will likely use a SUPERUSER account, and they'll see that data.
>>>>>>>>>>>> 
>>>>>>>>>>>> How hard would it be for SUPERUSERs to *not* automatically get the 
>>>>>>>>>>>> UNMASK permission?
>>>>>>>>>>>> 
>>>>>>>>>>>> I'll also echo the concerns around masking primary key components. 
>>>>>>>>>>>>  It's highly likely that certain personal data properties would be 
>>>>>>>>>>>> used as a partition or clustering key (ex: range query for people 
>>>>>>>>>>>> born within a certain timeframe).  In addition to the "breaks 
>>>>>>>>>>>> existing" concern, I'm curious about the challenges around getting 
>>>>>>>>>>>> that to work with the current primary key implementation.
>>>>>>>>>>>> 
>>>>>>>>>>>> Does this first implementation only apply to payload (non-key) 
>>>>>>>>>>>> columns?  The examples in the CEP currently do not show primary 
>>>>>>>>>>>> key components being masked. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> 
>>>>>>>>>>>> Aaron
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo 
>>>>>>>>>>>> <henrik.i...@datastax.com> wrote:
>>>>>>>>>>>>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña 
>>>>>>>>>>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>>>>> One thought: The way the CEP is currently written, it is only 
>>>>>>>>>>>>>>> possible to mask a column one way. You can only define one 
>>>>>>>>>>>>>>> masking function for a column, and since you use the original 
>>>>>>>>>>>>>>> column name, you could only return one version of it in the 
>>>>>>>>>>>>>>> result set, even if you had a way to define several functions.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Right, it's one single type of mapping per the column, declared 
>>>>>>>>>>>>>> on CREATE/ALTER TABLE statements. Also, users can manually 
>>>>>>>>>>>>>> specify their own masking function in SELECT statements if they 
>>>>>>>>>>>>>> have permissions for seeing the clear data.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For those cases where the data is automatically masked for an 
>>>>>>>>>>>>>> unprivileged user, I don't see the use of including different 
>>>>>>>>>>>>>> types of masking for the same column in the same result set. 
>>>>>>>>>>>>>> Instead, we might be interested in having different types of 
>>>>>>>>>>>>>> masking associated with different roles. We could do so with 
>>>>>>>>>>>>>> dedicated CREATE/DROP/LIST MASK statements, instead of using the 
>>>>>>>>>>>>>> CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK 
>>>>>>>>>>>>>> statement would associate a masking function to a column and 
>>>>>>>>>>>>>> role. However, I'm not sure we need that type of granularity 
>>>>>>>>>>>>>> instead of the simplicity of attaching the masking to the column 
>>>>>>>>>>>>>> declaration. wdyt?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My gut feeling likewise is that this adds complexity but little 
>>>>>>>>>>>>> value.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Henrik Ingo
>>>>>>>>>>>>> +358 40 569 7354
>>>>>>>>>>>>>       
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>>> | Derek Chen-Becker                                             |
>>>>>>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>>>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>>> 
