Re: [Marketing] For Review: Learn How CommitLog Works in Apache Cassandra

2022-09-07 Thread Chris Thornett
Amends have been made, and we've closed the review period. Thanks
to everyone for your help on this one!

On Wed, Aug 31, 2022 at 3:16 PM Chris Thornett  wrote:

> Update: Some sections have been flagged by the PMC for further revision.
> The author, Alex Sorokoumov, will review on 2 September.
>
> Revising publication date to 8 September.
>
> On Fri, Aug 26, 2022 at 11:15 PM Alexander Sorokoumov <
> aleksandr.sorokou...@gmail.com> wrote:
>
>> Hey Rahul,
>>
>> Thanks for the feedback. I have changed it to just durability (without
>> mentioning ACID) to prevent confusion.
>>
>> Best,
>> Alex
>>
>> On Fri, Aug 26, 2022 at 11:53 PM Rahul Xavier Singh <
>> rahul.xavier.si...@gmail.com> wrote:
>>
>>> Added a comment about "ACID". I would recommend not saying ACID until
>>> it's there. C* has strong consistency when needed. It doesn't, for example,
>>> guarantee that two competing mutations will be executed (or be able to be
>>> rolled back to the previous state) in the exact order they were intended
>>> if they come in at the same time, especially if they are coming from two
>>> different data centers.
>>>
>>> Maybe it can be explained later that the commitlog mechanism provides
>>> ACID-like features ... ?
>>>
>>> From my understanding the Accord white paper has not been implemented
>>> into any working Cassandra code. I may be wrong.
>>>
>>>
>>> Rahul Singh
>>>
>>> On Tue, Aug 23, 2022 at 5:43 AM Sharan Foga  wrote:
>>>
>>>> Hi Chris
>>>>
>>>> I've added a few comments and suggestions. Please feel free to
>>>> use/ignore whichever ones you think :-)
>>>>
>>>> Thanks
>>>> Sharan
>>>>
>>>> On 2022/08/23 00:08:52 Chris Thornett wrote:
>>>> > Opening up Alex Sorokoumov's guide 'Learn How CommitLog Works in
>>>> > Apache Cassandra' for a 72-hr community review by lazy consensus.
>>>> >
>>>> > Please add any amends and suggestions in the comments:
>>>> >
>>>> > https://docs.google.com/document/d/1cyOi-IeU_I9GBkpQbJS6IIrmemAesEqvzLb-eeFs_rM/edit#
>>>> >
>>>> > Thanks!
>>>> >
>>>> > --
>>>> >
>>>> > Chris Thornett
>>>> > Senior Content Strategist, Constantia.io
>>>> >
>>>>
>>>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
If nobody has more concerns regarding the CEP I will start the vote
tomorrow.

On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
wrote:

>> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>
>
> I'm not sure that views should be "the" strategy for masking functions. We
> have multiple approaches here:
>
> 1) CQL functions only. Users can decide to use the masking functions of
> their own accord. I think most dbs allow this pattern of usage, which is
> quite straightforward. Obviously, it doesn't allow admins to enforce that
> users see only masked data. Nevertheless, it's still useful for trusted
> database users generating masked data that will be consumed by the end
> users of the application.
>
> 2) Masking functions attached to specific columns. This way the same
> queries will see different data (masked or not) depending on the
> permissions of the user running the query. It has the advantage of not
> requiring changes to the queries that users with different permissions run.
> The downside is that users would need to query the schema if they need to
> know whether a column is masked, unless we change the names of the returned
> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support
> applying the masking function to columns on the base table, and some of
> them also allow masking to be applied to views.
>
> 3) Masking functions as part of projected views. This way users might
> need to query the view appropriate for their permissions instead of the
> base table. This might mean changing the queries if the masking policy is
> changed by the admin. MySQL recommends this approach in a blog entry,
> although it's not part of its main documentation for data masking, and the
> implementation has security issues. Some of the other databases offering
> approach 2) as their main option also support masking on view columns.
>
> Each approach has its own advantages and limitations, and I don't think we
> necessarily have to choose. The CEP proposes implementing 1) and 2), but
> nothing prevents us from also having 3) if we get to have projected views.
> However, I think that projected views are a new general-purpose feature with
> their own complexities, so they would deserve their own CEP, if someone is
> willing to work on the implementation.
>
>
>
> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
> dev@cassandra.apache.org> wrote:
>
>> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>>
>> It seems to me the view would have to store the query and apply a where
>> clause to it, so the same PK would be in play.
>>
>> It has data leaking properties.
>>
>> It has more use cases as it can be used to
>>
>> - construct views that filter out sensitive columns
>> - apply transforms to convert units of measure
>>
>> Are there more thoughts along this line?
>>
>
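
For readers following the thread, the three approaches listed above can be contrasted with a small Python sketch. It is only an illustration under stated assumptions: it is not CQL and not the CEP-20 design, and every name in it (`mask_inner`, `COLUMN_MASKS`, `select_from_table`, `masked_view`, the sample row) is hypothetical.

```python
# Hypothetical illustration only; none of these names come from Cassandra or CEP-20.

def mask_inner(value, keep_start=1, keep_end=1, pad="*"):
    """Replace everything except the first/last characters with a padding char."""
    if len(value) <= keep_start + keep_end:
        return pad * len(value)
    middle = pad * (len(value) - keep_start - keep_end)
    return value[:keep_start] + middle + value[len(value) - keep_end:]

ROW = {"id": 1, "name": "alice", "email": "alice@example.org"}

# Approach 1: the caller applies the masking function explicitly in the query.
def select_with_function(row):
    return {"id": row["id"], "email": mask_inner(row["email"])}

# Approach 2: masks are attached to columns; the same query returns masked or
# clear data depending on the permissions of the user running it.
COLUMN_MASKS = {"email": mask_inner}

def select_from_table(row, columns, can_unmask):
    result = {}
    for column in columns:
        mask = COLUMN_MASKS.get(column)
        if mask is None or can_unmask:
            result[column] = row[column]
        else:
            result[column] = mask(row[column])
    return result

# Approach 3: a projected view applies the mask; restricted users query the
# view instead of the base table.
def masked_view(row):
    return {"id": row["id"], "name": row["name"], "email": mask_inner(row["email"])}
```

In this sketch, approach 2 keeps the query text identical for every user (only `can_unmask` differs), whereas approach 3 requires restricted users to target `masked_view` rather than the table, which is the trade-off described in the message above.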


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Benedict
I’m not convinced there’s been adequate resolution over which approach is 
adopted. I know you have expressed a preference for the table schema approach, 
but the weight of other opinion so far appears to be against this approach - 
even if it is broadly adopted by other databases. I will note that Postgres 
does not adopt this approach; it has a more sophisticated security label 
approach that has not been proposed by anybody so far.

I think extra weight should be given to the implementer’s preference, so while 
I personally do not like the table schema approach, I am happy to accept this 
is an industry norm, and leave the decision to you.

However, we should ensure the community as a whole endorses this. I think an 
indicative poll should be undertaken first, eg:

A) We should implement the table schema approach, as proposed
B) We should prefer the view approach, but I am not opposed to the implementor 
selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement the 
view approach
D) We should NOT implement the table schema approach, and should implement some 
other scheme (or not implement this feature)

Where my vote is B

> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
> 
> 
> If nobody has more concerns regarding the CEP I will start the vote tomorrow.


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Claude Warren via dev

My vote is B

On 07/09/2022 13:12, Benedict wrote:
> I’m not convinced there’s been adequate resolution over which approach
> is adopted. I know you have expressed a preference for the table
> schema approach, but the weight of other opinion so far appears to be
> against this approach - even if it is broadly adopted by other
> databases. I will note that Postgres does not adopt this approach, it
> has a more sophisticated security label approach that has not been
> proposed by anybody so far.
>
> I think extra weight should be given to the implementer’s preference,
> so while I personally do not like the table schema approach, I am
> happy to accept this is an industry norm, and leave the decision to you.
>
> However, we should ensure the community as a whole endorses this. I
> think an indicative poll should be undertaken first, eg:
>
> A) We should implement the table schema approach, as proposed
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should
> implement the view approach
> D) We should NOT implement the table schema approach, and should
> implement some other scheme (or not implement this feature)
>
> Where my vote is B




Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
The poll makes sense to me. I would slightly change it to:

A) We shouldn't prefer either approach, and I agree to the implementor
selecting the table schema approach for this CEP
B) We should prefer the view approach, but I am not opposed to the
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement
the view approach
D) We should NOT implement the view approach, and should implement
the table schema approach
E) We should NOT implement the table schema approach, and should implement
some other scheme (or not implement this feature)

Where my vote is for A.


On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:

> I’m not convinced there’s been adequate resolution over which approach is
> adopted. I know you have expressed a preference for the table schema
> approach, but the weight of other opinion so far appears to be against this
> approach - even if it is broadly adopted by other databases. I will note
> that Postgres does not adopt this approach, it has a more sophisticated
> security label approach that has not been proposed by anybody so far.
>
> I think extra weight should be given to the implementer’s preference, so
> while I personally do not like the table schema approach, I am happy to
> accept this is an industry norm, and leave the decision to you.
>
> However, we should ensure the community as a whole endorses this. I think
> an indicative poll should be undertaken first, eg:
>
> A) We should implement the table schema approach, as proposed
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is B
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Ekaterina Dimitrova
A

On Wed, 7 Sep 2022 at 9:05, Andrés de la Peña  wrote:

> The poll makes sense to me. I would slightly change it to:
>
> A) We shouldn't prefer either approach, and I agree to the implementor
> selecting the table schema approach for this CEP
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the view approach, and should implement
> the table schema approach
> E) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is for A.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Berenguer Blasi
A. I agree the implementor's preference is an important aspect to take 
into account.


On 7/9/22 15:23, Ekaterina Dimitrova wrote:
> A
>
> On Wed, 7 Sep 2022 at 9:05, Andrés de la Peña wrote:
>
>> The poll makes sense to me. I would slightly change it to:
>>
>> A) We shouldn't prefer either approach, and I agree to the
>> implementor selecting the table schema approach for this CEP
>> B) We should prefer the view approach, but I am not opposed to the
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should
>> implement the view approach
>> D) We should NOT implement the view approach, and should
>> implement the table schema approach
>> E) We should NOT implement the table schema approach, and should
>> implement some other scheme (or not implement this feature)
>>
>> Where my vote is for A.



Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Benedict
Well, I am not convinced these changes will materially impact the outcome, but 
at least we’ll have some extra fun collating the votes.


> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
> 
> 
> The poll makes sense to me. I would slightly change it to:
> 
> A) We shouldn't prefer either approach, and I agree to the implementor 
> selecting the table schema approach for this CEP
> B) We should prefer the view approach, but I am not opposed to the 
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement 
> the view approach
> D) We should NOT implement the view approach, and should implement the 
> table schema approach
> E) We should NOT implement the table schema approach, and should implement 
> some other scheme (or not implement this feature)
> 
> Where my vote is for A.
> 
> 
>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>> I’m not convinced there’s been adequate resolution over which approach is 
>> adopted. I know you have expressed a preference for the table schema 
>> approach, but the weight of other opinion so far appears to be against this 
>> approach - even if it is broadly adopted by other databases. I will note 
>> that Postgres does not adopt this approach, it has a more sophisticated 
>> security label approach that has not been proposed by anybody so far.
>> 
>> I think extra weight should be given to the implementer’s preference, so 
>> while I personally do not like the table schema approach, I am happy to 
>> accept this is an industry norm, and leave the decision to you.
>> 
>> However, we should ensure the community as a whole endorses this. I think an 
>> indicative poll should be undertaken first, eg:
>> 
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is B
>> 
 On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
 
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote 
>>> tomorrow.
>>> 
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña  
>>> wrote:
> Is there enough support here for VIEWS to be the implementation strategy 
> for displaying masking functions?
 
 I'm not sure that views should be "the" strategy for masking functions. We 
 have multiple approaches here:
 
 1) CQL functions only. Users can decide to use the masking functions on 
 their own will. I think most dbs allow this pattern of usage, which is 
 quite straightforward. Obviously, it doesn't allow admins to decide 
 enforce users seeing only masked data. Nevertheless, it's still useful for 
 trusted database users generating masked data that will be consumed by the 
 end users of the application.
 
 2) Masking functions attached to specific columns. This way the same 
 queries will see different data (masked or not) depending on the 
 permissions of the user running the query. It has the advantage of not 
 requiring changes to the queries that users with different permissions run. 
 The downside is that users would need to query the schema if they need to 
 know whether a column is masked, unless we change the names of the 
 returned columns. This is the approach offered by Azure/SQL Server, 
 PostgreSQL, IBM Db2, Oracle, MariaDB/MaxScale and Snowflake. All these 
 databases support applying the masking function to columns on the base 
 table, and some of them also allow applying masking to views.
 
 3) Masking functions as part of projected views. This way users might 
 need to query the view appropriate for their permissions instead of the 
 base table. This might mean changing the queries if the masking policy is 
 changed by the admin. MySQL recommends this approach in a blog entry, 
 although it's not part of its main documentation for data masking, and the 
 implementation has security issues. Some of the other databases offering 
 approach 2) as their main option also support masking on view columns.
 
 Each approach has its own advantages and limitations, and I don't think we 
 necessarily have to choose. The CEP proposes implementing 1) and 2), but 
 nothing prevents us from also having 3) if we get to have projected views. 
 However, I think that projected views are a new general-purpose feature 
 with their own complexities, so they would deserve their own CEP, if someone 
 is willing to work on the implementation.
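
For illustration only, approaches 1) and 2) can be contrasted in a small
Python sketch. None of this is Cassandra code: `mask_inner`, `COLUMN_MASKS`,
`select` and the `UNMASK` permission name are hypothetical stand-ins chosen
for the example, and the row is made-up data.

```python
def mask_inner(value, begin, end, pad="*"):
    """Hypothetical masking function: keep `begin` leading and `end`
    trailing characters and replace everything in between with `pad`."""
    hidden = len(value) - begin - end
    if hidden <= 0:
        return value  # too short to mask anything
    return value[:begin] + pad * hidden + value[len(value) - end:]


# Approach 2): the mask is attached to the column (hypothetical registry),
# so it is applied on the server side rather than chosen by the querying user.
COLUMN_MASKS = {"ssn": lambda v: mask_inner(v, 0, 4)}


def select(row, columns, permissions):
    """Return the requested columns, masking any column that has an attached
    mask unless the caller holds the (assumed) UNMASK permission."""
    result = {}
    for name in columns:
        value = row[name]
        mask = COLUMN_MASKS.get(name)
        if mask is not None and "UNMASK" not in permissions:
            value = mask(value)
        result[name] = value
    return result


row = {"name": "Alice", "ssn": "123-45-6789"}

# Approach 1): a trusted user calls the masking function explicitly.
explicit = mask_inner(row["ssn"], 0, 4)

# Approach 2): the same query returns different data depending on permissions.
masked = select(row, ["name", "ssn"], permissions=set())
clear = select(row, ["name", "ssn"], permissions={"UNMASK"})
```

In the sketch the query text is identical for both callers in approach 2),
which is the advantage described above. The caller cannot tell from the
result alone whether `ssn` was masked, which is the downside.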
 
 
 
> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev 
>  wrote:
> Is there enough support here f

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Jeremiah D Jordan
A

> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
> 
> Well, I am not convinced these changes will materially impact the outcome, 
> but at least we’ll have some extra fun collating the votes.
> 
> 
>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>> 
>> 
>> The poll makes sense to me. I would slightly change it to:
>> 
>> A) We shouldn't prefer either approach, and I agree to the implementor 
>> selecting the table schema approach for this CEP
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the view approach, and should implement the 
>> table schema approach
>> E) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is for A.
>> 
>> 
>> On Wed, 7 Sept 2022 at 13:12, Benedict wrote:
>> I’m not convinced there’s been adequate resolution over which approach is 
>> adopted. I know you have expressed a preference for the table schema 
>> approach, but the weight of other opinion so far appears to be against this 
>> approach - even if it is broadly adopted by other databases. I will note 
>> that Postgres does not adopt this approach, it has a more sophisticated 
>> security label approach that has not been proposed by anybody so far.
>> 
>> I think extra weight should be given to the implementer’s preference, so 
>> while I personally do not like the table schema approach, I am happy to 
>> accept this is an industry norm, and leave the decision to you.
>> 
>> However, we should ensure the community as a whole endorses this. I think an 
>> indicative poll should be undertaken first, eg:
>> 
>> A) We should implement the table schema approach, as proposed
>> B) We should prefer the view approach, but I am not opposed to the 
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should implement 
>> the view approach
>> D) We should NOT implement the table schema approach, and should implement 
>> some other scheme (or not implement this feature)
>> 
>> Where my vote is B
>> 
>>> On 7 Sep 2022, at 12:50, Andrés de la Peña wrote:
>>> 
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote 
>>> tomorrow.
>>> 
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña wrote:
>>> Is there enough support here for VIEWS to be the implementation strategy 
>>> for displaying masking functions?
>>> 
>>> I'm not sure that views should be "the" strategy for masking functions. We 
>>> have multiple approaches here:
>>> 
>>> 1) CQL functions only. Users can choose to apply the masking functions 
>>> themselves. I think most dbs allow this pattern of usage, which is 
>>> quite straightforward. Obviously, it doesn't allow admins to enforce that 
>>> users see only masked data. Nevertheless, it's still useful for trusted 
>>> database users generating masked data that will be consumed by the end 
>>> users of the application.
>>> 
>>> 2) Masking functions attached to specific columns. This way the same 
>>> queries will see different data (masked or not) depending on the 
>>> permissions of the user running the query. It has the advantage of not 
>>> requiring changes to the queries that users with different permissions run. 
>>> The downside is that users would need to query the schema if they need to 
>>> know whether a column is masked, unless we change the names of the returned 
>>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM 
>>> Db2, Oracle, MariaDB/MaxScale and Snowflake. All these databases support 
>>> applying the masking function to columns on the base table, and some of 
>>> them also allow applying masking to views.
>>> 
>>> 3) Masking functions as part of projected views. This way users might need 
>>> to query the view appropriate for their permissions instead of the base 
>>> table. This might mean changing the queries if the masking policy is 
>>> changed by the admin. MySQL recommends this approach in a blog entry, 
>>> although it's not part of its main documentation for data masking, and the 
>>> implementation has security issues. Some of the other databases offering 
>>> approach 2) as their main option also support masking on view columns.
>>> 
>>> Each approach has its own advantages and limitations, and I don't think we 
>>> necessarily have to choose. The CEP proposes implementing 1) and 2), but 
>>> nothing prevents us from also having 3) if we get to have projected views. 
>>> However, I think that projected views are a new general-purpose feature with 
>>> their own complexities, so they would deserve their own CEP, if someone is willing to 
>>> work on the implementatio

Re: [DISCUSS] CEP-23: Enhancement for Sparse Data Serialization

2022-09-07 Thread Claude Warren via dev
I have looked through the code mentioned.  What I found in the 
ColumnSerializer was the use of VInt encoding.  Are you proposing 
switching directly to VInt encoding for sizes rather than one of the 
other encodings, and using -2 as the first length to signal that the new 
encoding is in use, so that existing encodings can be read unchanged?
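
To make the question concrete, here is a rough Python sketch of what a
sentinel-based reader could look like. It is an assumption-laden
illustration, not the Cassandra implementation: it assumes the existing
layout is a sequence of fields, each prefixed by a 4-byte big-endian signed
length where a negative length means null, and it treats -2 in the first
length slot as the hypothetical marker for the new encoding. The names
`NEW_FORMAT_MARKER` and `read_fields` are made up for the example.

```python
import struct

# Hypothetical marker: -1 is assumed to already mean "null field", so -2
# cannot appear as the first length in data written by the existing encoder.
NEW_FORMAT_MARKER = -2


def read_fields(buf):
    """Return ("sparse", payload) when the buffer starts with the marker,
    otherwise decode it as the assumed existing format and return
    ("legacy", fields)."""
    if len(buf) >= 4:
        (first,) = struct.unpack_from(">i", buf, 0)
        if first == NEW_FORMAT_MARKER:
            # The remaining bytes would be handed to the new decoder.
            return "sparse", buf[4:]

    fields, pos = [], 0
    while pos < len(buf):
        (size,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if size < 0:
            fields.append(None)  # null field, no payload bytes follow
        else:
            fields.append(buf[pos:pos + size])
            pos += size
    return "legacy", fields


legacy = struct.pack(">i", 1) + b"a" + struct.pack(">i", -1)
marked = struct.pack(">i", NEW_FORMAT_MARKER) + b"\x05\x09"
```

Under these assumptions, data written with the existing encoding is still
read unchanged, while marked data is routed to the new decoder.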



On 06/09/2022 16:37, Benedict wrote:

So, looking more closely at your proposal I realise what you are trying to do. 
The thing that threw me was your mention of lists and other collections. This 
will likely not work as there is no index that is possible to define on a list 
(or other collection) within a single sstable - a list is defined over the 
whole on-disk contents, so the index is undefined within a given sstable.

Tuple and UDT are encoded inefficiently if there are many null fields, but this 
is a very localised change, affecting just one class. You should take a look at 
Columns.Serializer for code you can lift for encoding and decoding sparse 
subsets of fields.

It might be that this can be switched on or off per sstable with a header flag 
bit so that there is no additional cost for datasets that would not benefit. 
We can likely also migrate to vint encoding for the component sizes (and 
either 1 or 0 bytes for fixed width values), no doubt saving a lot of space 
over the status quo, even for small UDTs with few null entries.

Essentially at this point we’re talking about pushing through storage 
optimisations applied elsewhere to tuples and UDT, which is a very 
uncontroversial change.
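
As a rough sketch of the idea (not the Columns.Serializer code, and not
Cassandra's on-disk format), the Python below compares a dense layout with a
sparse one. The dense layout is assumed to use a 4-byte signed length per
field, with -1 for null. The sparse layout writes a presence bitmap followed
by variable-length sizes for the non-null fields only. The varint here is a
generic LEB128-style encoding used as a stand-in; Cassandra's own vint
layout is different. All function names are hypothetical.

```python
import struct


def encode_varint(n):
    """LEB128-style unsigned varint (a stand-in, not Cassandra's vint)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at `pos`; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_dense(fields):
    """Assumed baseline: 4-byte signed length per field, -1 for null."""
    return b"".join(
        struct.pack(">i", -1) if f is None else struct.pack(">i", len(f)) + f
        for f in fields
    )


def encode_sparse(fields):
    """Field count, presence bitmap, then size-prefixed non-null values."""
    bitmap, body = 0, bytearray()
    for i, f in enumerate(fields):
        if f is not None:
            bitmap |= 1 << i
            body += encode_varint(len(f)) + f
    return encode_varint(len(fields)) + encode_varint(bitmap) + bytes(body)


def decode_sparse(buf):
    """Inverse of encode_sparse."""
    count, pos = decode_varint(buf, 0)
    bitmap, pos = decode_varint(buf, pos)
    fields = []
    for i in range(count):
        if bitmap & (1 << i):
            size, pos = decode_varint(buf, pos)
            fields.append(buf[pos:pos + size])
            pos += size
        else:
            fields.append(None)
    return fields


fields = [b"a", None, None, b"xyz", None]
dense = encode_dense(fields)
sparse = encode_sparse(fields)
```

The field count is written here only to keep the sketch self-describing. For
a tuple or UDT the number of fields is already known from the type, so a
real encoding would probably not need it.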


On 6 Sep 2022, at 07:28, Benedict  wrote:

I agree a Jira would suffice, along with a DISCUSS thread or simply a notice 
sent to the list if more visibility is required.

While we’re here though: I don’t have a lot of time to engage in 
discussion, but it’s unclear to me what advantage this encoding scheme brings. It 
might be worth outlining what algorithmic advantage you foresee for what data 
distributions in which collection types.


On 6 Sep 2022, at 07:16, Claude Warren via dev  wrote:

I am just learning the ropes here so perhaps it is not CEP worthy.  That being 
said, it felt like there was a lot of information to put into and track in a 
ticket, particularly when I expected discussion about how to best encode, 
changes to the algorithms etc.  It feels like it would be difficult to track. 
But if that is standard for this project I will move the information there.

As to the benchmarking, I had thought that usage and performance measures 
should be included.  Thank you for calling out queries that select a subset 
of the data as being of particular importance.

Claude


On 06/09/2022 03:11, Abe Ratnofsky wrote:

Looking at this link: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be 
useful to include disk usage / read performance / write performance comparisons 
with the new encodings, particularly for sparse collections where a subset of 
data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not 
impact existing users, will be backwards compatible, and overall is an 
efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
significant user-facing or changes that cut across multiple subsystems”. Any 
reason why a Jira isn’t sufficient?

Abe


On Sep 5, 2022, at 1:57 AM, Claude Warren via dev  
wrote:

I have just posted a CEP covering an Enhancement for Sparse Data Serialization. 
This is in response to CASSANDRA-8959.

I look forward to responses.