Not to push the point too strongly (I don’t have a very firm view of my own), but if we provide this via a view feature we’re just implementing one new feature and we get masking for free. I don’t think it is materially more complicated than redefining columns for users - it might even be less so, as we do not have to consider how applications interpret table metadata.
Projection views are a very simple concept and pretty simple to implement I think, and conceptually very familiar to users. So let’s at least not prefer the table column modifier approach because it’s simpler or requires fewer new features, as I do not believe this to be the case.

> On 30 Aug 2022, at 12:46, Andrés de la Peña <adelap...@apache.org> wrote:
>
> >> GRANT SELECT ON foo.unmasked_name TO top_secret;
>
> Note that Cassandra doesn't have support for column-level permissions. There
> was an initiative to add them in 2016, CASSANDRA-12859. However, the ticket
> has been inactive since 2017. The last comments seem to be discussions about
> design.
>
> Also, generated columns in PostgreSQL are always stored, so if they were used
> for masking they would constitute static data masking, not dynamic.
>
> The approach for dynamic data masking that PostgreSQL suggests in its
> documentation doesn't seem based on generating a masked copy of the column,
> neither on a generated column nor on a view. Instead, it uses security labels
> to associate columns to users and masking functions. That way, the same
> column will be seen masked or unmasked depending on the user.
>
> I'd say that applying the masking rule to the base column itself, and not to
> a copy, is the most common approach among the discussed databases so far.
> Also, it has the advantage for us of not being based on other relatively
> complex features that we miss, such as column-level permissions or
> not-materialized views. If someday we add those features I think they would
> play well with what is proposed on the CEP.
>
>> On Tue, 30 Aug 2022 at 11:46, Avi Kivity via dev <dev@cassandra.apache.org>
>> wrote:
>> Agree with views, or alternatively, column permissions together with
>> computed columns:
>>
>> CREATE TABLE foo (
>>   id int PRIMARY KEY,
>>   unmasked_name text,
>>   name text GENERATED ALWAYS AS some_mask_function(text, 'xxx', 7)
>> )
>>
>> (syntax from postgresql)
>>
>> GRANT SELECT ON foo.name TO general_use;
>> GRANT SELECT ON foo.unmasked_name TO top_secret;
>>
>>> On 26/08/2022 00.10, Benedict wrote:
>>> I’m inclined to agree that this seems a more straightforward approach that
>>> makes fewer implied promises.
>>>
>>> Perhaps we could deliver simple views backed by virtual tables, and model
>>> our approach on that of Postgres, MySQL et al?
>>>
>>> Views in C* would be very simple, just offering a subset of fields with
>>> some UDFs applied. It would allow users to define roles with access only to
>>> the views, or for applications to use the views for presentation purposes.
>>>
>>> It feels like a cleaner approach to me, and we’d get two features for the
>>> price of one. BUT I don’t feel super strongly about this.
>>>
>>>> On 25 Aug 2022, at 20:16, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>
>>>> To make sure I understand, if I wanted to use a masked column for a
>>>> conditional update, you're saying we would need SELECT_MASKED to use it in
>>>> the IF clause? I worry that this proposal is increasing in complexity; I
>>>> would actually be OK starting with something smaller in scope. Perhaps
>>>> just providing the masking functions and not tying masking to schema would
>>>> be sufficient for an initial goal? That wouldn't preclude additional
>>>> permissions, schema integration, or perhaps just plain Views in the future.
>>>>
>>>> Cheers,
>>>>
>>>> Derek
>>>>
>>>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña <adelap...@apache.org>
>>>> wrote:
>>>>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>>>>> masked columns on WHERE/IF clauses would require having SELECT and either
>>>>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>>>>> query results would always require both SELECT and UNMASK.
>>>>>
>>>>> This way we can have the best of both worlds, allowing admins to decide
>>>>> whether they trust their immediate users or not. wdyt?
>>>>>
>>>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo <henrik.i...@datastax.com>
>>>>> wrote:
>>>>>> This is the difference between security and compliance I guess :-D
>>>>>>
>>>>>> The way I see this, the attacker or threat in this concept is not the
>>>>>> developer with access to the database. Rather a feature like this is
>>>>>> just a convenient way to apply some masking rule in a centralized way.
>>>>>> The protection is against an end user of the application, who should not
>>>>>> be able to see the personal data of someone else. Or themselves, even.
>>>>>> As long as the application end user doesn't have access to run arbitrary
>>>>>> CQL, then these forms of masking prevent accidental unauthorized
>>>>>> use/leaking of personal data.
>>>>>>
>>>>>> henrik
>>>>>>
>>>>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict <bened...@apache.org> wrote:
>>>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>>>> mechanisms.
>>>>>>> Otherwise it’s just a default formatter, and we should
>>>>>>> consider renaming the feature IMO
>>>>>>>
>>>>>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña <adelap...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>>>>>> prevent malicious users with SELECT permissions from indirectly guessing
>>>>>>>> the real value of the masked value. This can easily be done by just
>>>>>>>> trying values on the WHERE clause of SELECT queries. DDM would not be
>>>>>>>> a replacement for proper column-level permissions.
>>>>>>>>
>>>>>>>> The data served by the database is usually consumed by applications
>>>>>>>> that present this data to end users. These end users are not
>>>>>>>> necessarily the users directly connecting to the database. With DDM,
>>>>>>>> it would be easy for applications to mask sensitive data that is going
>>>>>>>> to be consumed by the end users. However, the users directly
>>>>>>>> connecting to the database should be trusted, provided that they have
>>>>>>>> the right SELECT permissions.
>>>>>>>>
>>>>>>>> In other words, DDM doesn't directly protect the data, but it eases
>>>>>>>> the production of protected data.
>>>>>>>>
>>>>>>>> That said, we could later go one step further and add a way to prevent
>>>>>>>> untrusted users from inferring the masked data. That could be done
>>>>>>>> adding a new permission required to use certain columns on WHERE
>>>>>>>> clauses, different to the current SELECT permission. That would play
>>>>>>>> especially well with column-level permissions, which is something that
>>>>>>>> we still have pending.
>>>>>>>>
>>>>>>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz <aaronplo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> Applying this should prevent querying on a field, else you could
>>>>>>>>>> leak its contents, surely?
>>>>>>>>>
>>>>>>>>> In theory, yes.
>>>>>>>>> Although I could see folks doing something like this:
>>>>>>>>>
>>>>>>>>> SELECT COUNT(*) FROM patients
>>>>>>>>> WHERE year_of_birth = 2002
>>>>>>>>> AND date_of_birth >= '2002-04-01'
>>>>>>>>> AND date_of_birth < '2002-11-01';
>>>>>>>>>
>>>>>>>>> In this case, the rows containing the masked key column(s) could be
>>>>>>>>> filtered on without revealing the actual data. But again, that's
>>>>>>>>> probably better for a "phase 2" of the implementation.
>>>>>>>>>
>>>>>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>>>>> secondary indexing, right?
>>>>>>>>>
>>>>>>>>> Yes, that's my thought as well.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker
>>>>>>>>> <de...@chen-becker.org> wrote:
>>>>>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>>>>> secondary indexing, right?
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict <bened...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>> Applying this should prevent querying on a field, else you could
>>>>>>>>>>> leak its contents, surely? This pretty much prohibits using it in a
>>>>>>>>>>> clustering key, and a partition key with the ordered partitioner -
>>>>>>>>>>> but probably also a hashed partitioner since we do not use a
>>>>>>>>>>> cryptographic hash and the hash function is well defined.
>>>>>>>>>>>
>>>>>>>>>>> We probably also need to ensure that any ALLOW FILTERING queries on
>>>>>>>>>>> such a field are disabled.
>>>>>>>>>>>
>>>>>>>>>>> Plausibly the data could be cryptographically jumbled before using
>>>>>>>>>>> it in a primary key component (or permitting filtering), but it is
>>>>>>>>>>> probably easier and safer to exclude for now…
>>>>>>>>>>>
>>>>>>>>>>>> On 23 Aug 2022, at 18:13, Aaron Ploetz <aaronplo...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Some thoughts on this one:
>>>>>>>>>>>>
>>>>>>>>>>>> In a prior job, we'd give app teams access to a single keyspace,
>>>>>>>>>>>> and two roles: a read-write role and a read-only role. In some
>>>>>>>>>>>> cases, a "privileged" application role was also requested.
>>>>>>>>>>>> Depending on the requirements, I could see the UNMASK permission
>>>>>>>>>>>> being applied to the RW or privileged roles. But if there's a
>>>>>>>>>>>> problem on the table and the operators go in to investigate, they
>>>>>>>>>>>> will likely use a SUPERUSER account, and they'll see that data.
>>>>>>>>>>>>
>>>>>>>>>>>> How hard would it be for SUPERUSERs to *not* automatically get the
>>>>>>>>>>>> UNMASK permission?
>>>>>>>>>>>>
>>>>>>>>>>>> I'll also echo the concerns around masking primary key components.
>>>>>>>>>>>> It's highly likely that certain personal data properties would be
>>>>>>>>>>>> used as a partition or clustering key (ex: range query for people
>>>>>>>>>>>> born within a certain timeframe). In addition to the "breaks
>>>>>>>>>>>> existing" concern, I'm curious about the challenges around getting
>>>>>>>>>>>> that to work with the current primary key implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Does this first implementation only apply to payload (non-key)
>>>>>>>>>>>> columns? The examples in the CEP currently do not show primary
>>>>>>>>>>>> key components being masked.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Aaron
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo
>>>>>>>>>>>> <henrik.i...@datastax.com> wrote:
>>>>>>>>>>>>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña
>>>>>>>>>>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>>>>> One thought: The way the CEP is currently written, it is only
>>>>>>>>>>>>>>> possible to mask a column one way. You can only define one
>>>>>>>>>>>>>>> masking function for a column, and since you use the original
>>>>>>>>>>>>>>> column name, you could only return one version of it in the
>>>>>>>>>>>>>>> result set, even if you had a way to define several functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, it's one single type of mapping per column, declared
>>>>>>>>>>>>>> on CREATE/ALTER TABLE statements. Also, users can manually
>>>>>>>>>>>>>> specify their own masking function in SELECT statements if they
>>>>>>>>>>>>>> have permissions for seeing the clear data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For those cases where the data is automatically masked for an
>>>>>>>>>>>>>> unprivileged user, I don't see the use of including different
>>>>>>>>>>>>>> types of masking for the same column into the same result set.
>>>>>>>>>>>>>> Instead, we might be interested in having different types of
>>>>>>>>>>>>>> masking associated with different roles. We could do so with
>>>>>>>>>>>>>> dedicated CREATE/DROP/LIST MASK statements, instead of using the
>>>>>>>>>>>>>> CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK
>>>>>>>>>>>>>> statement would associate a masking function to a column and
>>>>>>>>>>>>>> role. However, I'm not sure we need that type of granularity
>>>>>>>>>>>>>> instead of the simplicity of attaching the masking to the column
>>>>>>>>>>>>>> declaration. wdyt?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> My gut feeling likewise is that this adds complexity but little
>>>>>>>>>>>>> value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Henrik Ingo
>>>>>>>>>>>>> +358 40 569 7354
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>>> | Derek Chen-Becker |
>>>>>>>>>> | GPG Key available at https://keybase.io/dchenbecker and |
>>>>>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
>>>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Henrik Ingo
>>>>>> +358 40 569 7354
>>>>>>
>>>>
>>>> --
>>>> +---------------------------------------------------------------+
>>>> | Derek Chen-Becker |
>>>> | GPG Key available at https://keybase.io/dchenbecker and |
>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
>>>> +---------------------------------------------------------------+
>>>>
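
For anyone skimming the thread, the permission rule from the 25 Aug message (SELECT plus either UNMASK or SELECT_MASKED to use a masked column in WHERE/IF, SELECT plus UNMASK to see clear values in results) can be written down as a small Python sketch. Only the three permission names come from the thread; the enum and the two function names are made up for illustration and are not part of the CEP or of Cassandra.

```python
from enum import Enum, auto


class Permission(Enum):
    # Permission names as used in the thread.
    SELECT = auto()
    UNMASK = auto()
    SELECT_MASKED = auto()


def can_filter_on_masked_column(perms):
    """Hypothetical check: may a role use a masked column in WHERE/IF?

    Requires SELECT and either UNMASK or SELECT_MASKED.
    """
    return Permission.SELECT in perms and (
        Permission.UNMASK in perms or Permission.SELECT_MASKED in perms
    )


def can_see_clear_values(perms):
    """Hypothetical check: does a role see unmasked values in results?

    Requires both SELECT and UNMASK.
    """
    return Permission.SELECT in perms and Permission.UNMASK in perms
```

Under this reading, a role holding only SELECT and SELECT_MASKED could filter on a masked column but would still get masked values back, which is the case Derek asked about for conditional updates.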