[DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jacek Lewandowski
Hi,

I was working on limiting query results by their size expressed in bytes,
and some questions arose that I'd like to bring to the mailing list.

The semantics of queries (without aggregation) are that data limits are
applied to the raw data returned from replicas. While this works fine for
row-number limits, since the number of rows is unlikely to change during
post-processing, it is less accurate for size-based limits: cell sizes may
differ after post-processing (for example, due to applying a transformation
function, a projection, or the like).

We could truncate the results after post-processing to stay within the
user-provided limit in bytes, but if the result then comes in under the
limit, we will not fetch more. In that case, treating the value as an
actual upper "limit" remains valid, though it would be misleading as a page
size, because we would not fetch the maximum amount of data that fits
within the page size.

Such a problem is much more visible for "group by" queries with
aggregation. The paging and limiting mechanism is applied to the rows
rather than groups, as it has no information about how much memory a single
group uses. For now, I've approximated a group size as the size of the
largest participating row.

The open question concerns the intended interpretation of a size limit
expressed in bytes. Do we want this mechanism to let users precisely
control the size of the result set, or do we instead want it to bound the
amount of memory used internally for the data and prevent problems
(assuming a size limit and a row-count limit can be used simultaneously,
stopping when either specified limit is reached)?
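As a rough illustration of the "stop when we reach any of the specified limits" behavior, here is a minimal sketch of a fetch loop that honors a row limit and a byte limit simultaneously. All names are hypothetical; this is not Cassandra's actual implementation:

```python
# Hypothetical sketch: collect rows until EITHER the row limit or the
# byte limit would be exceeded, whichever comes first.

def fetch_page(rows, row_limit, byte_limit):
    """rows: iterable of (row_key, serialized_size_bytes) pairs.

    Returns the keys that fit and the total bytes consumed.
    """
    page, total_bytes = [], 0
    for key, size in rows:
        # Stop on whichever limit is hit first.
        if len(page) >= row_limit or total_bytes + size > byte_limit:
            break
        page.append(key)
        total_bytes += size
    return page, total_bytes

page, used = fetch_page([("a", 100), ("b", 300), ("c", 250)],
                        row_limit=10, byte_limit=500)
# "a" (100 B) and "b" (300 B) fit; adding "c" would exceed 500 bytes.
```

Note that for "group by" queries the per-group size is unknown up front, which is why the message above falls back to approximating it by the largest participating row.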

https://issues.apache.org/jira/browse/CASSANDRA-11745

thanks,
- - -- --- -  -
Jacek Lewandowski


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Benjamin Lerer
Thanks Jacek for raising that discussion.

I do not have in mind a scenario where it could be useful to specify a
LIMIT in bytes. The LIMIT clause is usually used when you know how many
rows you wish to display or use. Unless somebody has a useful scenario in
mind I do not think that there is a need for that feature.

Paging in bytes makes sense to me as the paging mechanism is transparent
for the user in most drivers. It is simply a way to optimize your memory
usage from end to end.

I do not like the approach of using both of them simultaneously: if you
request a page with a certain number of rows and do not get it, it is
really confusing and can be a problem for some use cases. We have users who
keep their session open, along with the paging information, to display
pages of data.

Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski 
a écrit :

> Hi,
>
> I was working on limiting query results by their size expressed in bytes,
> and some questions arose that I'd like to bring to the mailing list.
>
> The semantics of queries (without aggregation) - data limits are applied
> on the raw data returned from replicas - while it works fine for the row
> number limits as the number of rows is not likely to change after
> post-processing, it is not that accurate for size based limits as the cell
> sizes may be different after post-processing (for example due to applying
> some transformation function, projection, or whatever).
>
> We can truncate the results after post-processing to stay within the
> user-provided limit in bytes, but if the result is smaller than the limit -
> we will not fetch more. In that case, the meaning of "limit" being an
> actual limit is valid though it would be misleading for the page size
> because we will not fetch the maximum amount of data that does not exceed
> the page size.
>
> Such a problem is much more visible for "group by" queries with
> aggregation. The paging and limiting mechanism is applied to the rows
> rather than groups, as it has no information about how much memory a single
> group uses. For now, I've approximated a group size as the size of the
> largest participating row.
>
> The problem concerns the allowed interpretation of the size limit
> expressed in bytes. Whether we want to use this mechanism to let the users
> precisely control the size of the resultset, or we instead want to use this
> mechanism to limit the amount of memory used internally for the data and
> prevent problems (assuming restricting size and rows number can be used
> simultaneously in a way that we stop when we reach any of the specified
> limits).
>
> https://issues.apache.org/jira/browse/CASSANDRA-11745
>
> thanks,
> - - -- --- -  -
> Jacek Lewandowski
>


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Benedict
I agree that this is more suitable as a paging option, and not as a CQL
LIMIT option. If it were to be a CQL LIMIT option though, then it should be
accurate regarding result set IMO; there shouldn’t be any further results
that could have been returned within the LIMIT.

On 12 Jun 2023, at 10:16, Benjamin Lerer wrote:

> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously because if
> you request a page with a certain amount of rows and do not get it then is
> is really confusing and can be a problem for some usecases. We have users
> keeping their session open and the page information to display page of data.
>
> Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski a écrit :
>
>> Hi,
>>
>> I was working on limiting query results by their size expressed in bytes,
>> and some questions arose that I'd like to bring to the mailing list.
>>
>> The semantics of queries (without aggregation) - data limits are applied
>> on the raw data returned from replicas - while it works fine for the row
>> number limits as the number of rows is not likely to change after
>> post-processing, it is not that accurate for size based limits as the cell
>> sizes may be different after post-processing (for example due to applying
>> some transformation function, projection, or whatever).
>>
>> We can truncate the results after post-processing to stay within the
>> user-provided limit in bytes, but if the result is smaller than the limit -
>> we will not fetch more. In that case, the meaning of "limit" being an
>> actual limit is valid though it would be misleading for the page size
>> because we will not fetch the maximum amount of data that does not exceed
>> the page size.
>>
>> Such a problem is much more visible for "group by" queries with
>> aggregation. The paging and limiting mechanism is applied to the rows
>> rather than groups, as it has no information about how much memory a single
>> group uses. For now, I've approximated a group size as the size of the
>> largest participating row.
>>
>> The problem concerns the allowed interpretation of the size limit
>> expressed in bytes. Whether we want to use this mechanism to let the users
>> precisely control the size of the resultset, or we instead want to use this
>> mechanism to limit the amount of memory used internally for the data and
>> prevent problems (assuming restricting size and rows number can be used
>> simultaneously in a way that we stop when we reach any of the specified
>> limits).
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>
>> thanks,
>> - - -- --- -  -
>> Jacek Lewandowski



Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jacek Lewandowski
Limiting the amount of returned data in bytes, in addition to the row
limit, could be helpful when applied transparently by the server as a kind
of guardrail. The server could fail the query if it exceeds some
administratively imposed limit at the configuration level. WDYT?
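A minimal sketch of what such a guardrail might look like. The setting name, the threshold value, and the helper are all hypothetical, not actual Cassandra configuration:

```python
# Hypothetical guardrail: the operator sets a cluster-wide cap, and the
# server fails any query whose accumulated result size exceeds it.

QUERY_RESULT_SIZE_FAIL_THRESHOLD = 1_048_576  # 1 MiB, illustrative value


class QuerySizeExceeded(Exception):
    """Raised when a query's result grows past the administrative cap."""


def guard_result_size(accumulated_bytes, row_bytes,
                      threshold=QUERY_RESULT_SIZE_FAIL_THRESHOLD):
    """Account for one more row; fail the query if the cap is exceeded."""
    total = accumulated_bytes + row_bytes
    if total > threshold:
        raise QuerySizeExceeded(
            f"result size {total} exceeds guardrail {threshold}")
    return total
```

The key property of a guardrail, as opposed to a LIMIT, is that it is invisible to well-behaved queries and only surfaces as an error when the cap is breached.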



pon., 12 cze 2023 o 11:16 Benjamin Lerer  napisał(a):

> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously because if
> you request a page with a certain amount of rows and do not get it then is
> is really confusing and can be a problem for some usecases. We have users
> keeping their session open and the page information to display page of data.
>
> Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> a écrit :
>
>> Hi,
>>
>> I was working on limiting query results by their size expressed in bytes,
>> and some questions arose that I'd like to bring to the mailing list.
>>
>> The semantics of queries (without aggregation) - data limits are applied
>> on the raw data returned from replicas - while it works fine for the row
>> number limits as the number of rows is not likely to change after
>> post-processing, it is not that accurate for size based limits as the cell
>> sizes may be different after post-processing (for example due to applying
>> some transformation function, projection, or whatever).
>>
>> We can truncate the results after post-processing to stay within the
>> user-provided limit in bytes, but if the result is smaller than the limit -
>> we will not fetch more. In that case, the meaning of "limit" being an
>> actual limit is valid though it would be misleading for the page size
>> because we will not fetch the maximum amount of data that does not exceed
>> the page size.
>>
>> Such a problem is much more visible for "group by" queries with
>> aggregation. The paging and limiting mechanism is applied to the rows
>> rather than groups, as it has no information about how much memory a single
>> group uses. For now, I've approximated a group size as the size of the
>> largest participating row.
>>
>> The problem concerns the allowed interpretation of the size limit
>> expressed in bytes. Whether we want to use this mechanism to let the users
>> precisely control the size of the resultset, or we instead want to use this
>> mechanism to limit the amount of memory used internally for the data and
>> prevent problems (assuming restricting size and rows number can be used
>> simultaneously in a way that we stop when we reach any of the specified
>> limits).
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>
>> thanks,
>> - - -- --- -  -
>> Jacek Lewandowski
>>
>


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jacek Lewandowski
Yes, LIMIT BY  provided by the user in CQL does not make much sense
to me either


pon., 12 cze 2023 o 11:20 Benedict  napisał(a):

> I agree that this is more suitable as a paging option, and not as a CQL
> LIMIT option.
>
> If it were to be a CQL LIMIT option though, then it should be accurate
> regarding result set IMO; there shouldn’t be any further results that could
> have been returned within the LIMIT.
>
> On 12 Jun 2023, at 10:16, Benjamin Lerer  wrote:
>
> 
> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously because if
> you request a page with a certain amount of rows and do not get it then is
> is really confusing and can be a problem for some usecases. We have users
> keeping their session open and the page information to display page of data.
>
> Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> a écrit :
>
>> Hi,
>>
>> I was working on limiting query results by their size expressed in bytes,
>> and some questions arose that I'd like to bring to the mailing list.
>>
>> The semantics of queries (without aggregation) - data limits are applied
>> on the raw data returned from replicas - while it works fine for the row
>> number limits as the number of rows is not likely to change after
>> post-processing, it is not that accurate for size based limits as the cell
>> sizes may be different after post-processing (for example due to applying
>> some transformation function, projection, or whatever).
>>
>> We can truncate the results after post-processing to stay within the
>> user-provided limit in bytes, but if the result is smaller than the limit -
>> we will not fetch more. In that case, the meaning of "limit" being an
>> actual limit is valid though it would be misleading for the page size
>> because we will not fetch the maximum amount of data that does not exceed
>> the page size.
>>
>> Such a problem is much more visible for "group by" queries with
>> aggregation. The paging and limiting mechanism is applied to the rows
>> rather than groups, as it has no information about how much memory a single
>> group uses. For now, I've approximated a group size as the size of the
>> largest participating row.
>>
>> The problem concerns the allowed interpretation of the size limit
>> expressed in bytes. Whether we want to use this mechanism to let the users
>> precisely control the size of the resultset, or we instead want to use this
>> mechanism to limit the amount of memory used internally for the data and
>> prevent problems (assuming restricting size and rows number can be used
>> simultaneously in a way that we stop when we reach any of the specified
>> limits).
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>
>> thanks,
>> - - -- --- -  -
>> Jacek Lewandowski
>>
>


[DISCUSSION] Adding sonar report analysis to the Cassandra project

2023-06-12 Thread Maxim Muzafarov
Hello everyone,

I would like to make the source code of the Cassandra project more visible
to people outside of the Cassandra community, and also to highlight typical
known issues in new contributions directly in the GitHub pull request
interface. This makes it easier for those who are unfamiliar with the
accepted code style, and who just want to be part of a large and friendly
community, to contribute.

The ASF provides [1] SonarCloud facilities for open source projects, which
are free to use, and we can easily add the report building and uploading
steps to the build using GitHub Actions, with almost no maintenance cost
for us. Of course, as a recommendation-oriented quality tool, it does not
reject any changes/pull requests, so nothing will change from that
perspective.

I've prepared everything we need to do this here (we also need to
modify the default Sonar Way profile to suit our needs, which I can't
do as I don't have sufficient privileges):
https://issues.apache.org/jira/browse/CASSANDRA-18390

I look forward to hearing your thoughts on this.


Examples:

I did the same for the Apache Ignite project; here is how it may look in
the end.
For the pull-requests queue:
https://sonarcloud.io/project/pull_requests_list?id=apache_ignite

The report itself for a pull request (the aggregation is used):
https://github.com/apache/ignite/pull/10769

The main branch quality gate profile:
https://sonarcloud.io/summary/overall?id=apache_ignite


In addition to this:

A developer can configure the SonarLint IDE plugin (available for IntelliJ
IDEA and Eclipse) to retrieve Cassandra's quality profiles and configured
rules from sonarcloud.io and highlight any rule violations locally, making
it easier to develop a new patch.


[1] 
https://cwiki.apache.org/confluence/display/INFRA/SonarCloud+for+ASF+projects


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Josh McKenzie
> I do not have in mind a scenario where it could be useful to specify a LIMIT 
> in bytes. The LIMIT clause is usually used when you know how many rows you 
> wish to display or use. Unless somebody has a useful scenario in mind I do 
> not think that there is a need for that feature.
If you have rows that vary significantly in their size, your latencies could 
end up being pretty unpredictable using a LIMIT BY . Being able to 
specify a limit by bytes at the driver / API level would allow app devs to get 
more deterministic results out of their interaction w/the DB if they're looking 
to respond back to a client within a certain time frame and / or determine next 
steps in the app (continue paging, stop, etc) based on how long it took to get 
results back.

I'm seeing similar tradeoffs working on gracefully paging over tombstones; 
there's a strong desire to be able to have more confidence in the statement "If 
I ask the server for a page of data, I'll very likely get it back within time 
X".

There's an argument that it's a data modeling problem and apps should model 
differently to have more consistent row sizes and/or tombstone counts; I'm 
sympathetic to that but the more we can loosen those constraints on users the 
better their experience in my opinion.

On Mon, Jun 12, 2023, at 5:39 AM, Jacek Lewandowski wrote:
> Yes, LIMIT BY  provided by the user in CQL does not make much sense to 
> me either
> 
> 
> pon., 12 cze 2023 o 11:20 Benedict  napisał(a):
>> 
>> I agree that this is more suitable as a paging option, and not as a CQL 
>> LIMIT option. 
>> 
>> If it were to be a CQL LIMIT option though, then it should be accurate 
>> regarding result set IMO; there shouldn’t be any further results that could 
>> have been returned within the LIMIT.
>> 
>> 
>>> On 12 Jun 2023, at 10:16, Benjamin Lerer  wrote:
>>> 
>>> Thanks Jacek for raising that discussion.
>>> 
>>> I do not have in mind a scenario where it could be useful to specify a 
>>> LIMIT in bytes. The LIMIT clause is usually used when you know how many 
>>> rows you wish to display or use. Unless somebody has a useful scenario in 
>>> mind I do not think that there is a need for that feature.
>>> 
>>> Paging in bytes makes sense to me as the paging mechanism is transparent 
>>> for the user in most drivers. It is simply a way to optimize your memory 
>>> usage from end to end.
>>> 
>>> I do not like the approach of using both of them simultaneously because if 
>>> you request a page with a certain amount of rows and do not get it then is 
>>> is really confusing and can be a problem for some usecases. We have users 
>>> keeping their session open and the page information to display page of data.
>>> 
>>> Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski 
>>>  a écrit :
 Hi,
 
 I was working on limiting query results by their size expressed in bytes, 
 and some questions arose that I'd like to bring to the mailing list.
 
 The semantics of queries (without aggregation) - data limits are applied 
 on the raw data returned from replicas - while it works fine for the row 
 number limits as the number of rows is not likely to change after 
 post-processing, it is not that accurate for size based limits as the cell 
 sizes may be different after post-processing (for example due to applying 
 some transformation function, projection, or whatever). 
 
 We can truncate the results after post-processing to stay within the 
 user-provided limit in bytes, but if the result is smaller than the limit 
 - we will not fetch more. In that case, the meaning of "limit" being an 
 actual limit is valid though it would be misleading for the page size 
 because we will not fetch the maximum amount of data that does not exceed 
 the page size.
 
 Such a problem is much more visible for "group by" queries with 
 aggregation. The paging and limiting mechanism is applied to the rows 
 rather than groups, as it has no information about how much memory a 
 single group uses. For now, I've approximated a group size as the size of 
 the largest participating row. 
 
 The problem concerns the allowed interpretation of the size limit 
 expressed in bytes. Whether we want to use this mechanism to let the users 
 precisely control the size of the resultset, or we instead want to use 
 this mechanism to limit the amount of memory used internally for the data 
 and prevent problems (assuming restricting size and rows number can be 
 used simultaneously in a way that we stop when we reach any of the 
 specified limits).
 
 https://issues.apache.org/jira/browse/CASSANDRA-11745
 
 thanks,
 - - -- --- -  -
 Jacek Lewandowski


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Benjamin Lerer
>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY . Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.


Are you talking about the page size or the LIMIT? Once the LIMIT is reached
there is no "continue paging". LIMIT is also at the CQL level, not at the
driver level.
I can totally understand the need for a page size in bytes, but not for a
LIMIT.

Le lun. 12 juin 2023 à 16:25, Josh McKenzie  a écrit :

> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY . Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.
>
> I'm seeing similar tradeoffs working on gracefully paging over tombstones;
> there's a strong desire to be able to have more confidence in the statement
> "If I ask the server for a page of data, I'll very likely get it back
> within time X".
>
> There's an argument that it's a data modeling problem and apps should
> model differently to have more consistent row sizes and/or tombstone
> counts; I'm sympathetic to that but the more we can loosen those
> constraints on users the better their experience in my opinion.
>
> On Mon, Jun 12, 2023, at 5:39 AM, Jacek Lewandowski wrote:
>
> Yes, LIMIT BY  provided by the user in CQL does not make much sense
> to me either
>
>
> pon., 12 cze 2023 o 11:20 Benedict  napisał(a):
>
>
> I agree that this is more suitable as a paging option, and not as a CQL
> LIMIT option.
>
> If it were to be a CQL LIMIT option though, then it should be accurate
> regarding result set IMO; there shouldn’t be any further results that could
> have been returned within the LIMIT.
>
>
> On 12 Jun 2023, at 10:16, Benjamin Lerer  wrote:
>
> 
> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously because if
> you request a page with a certain amount of rows and do not get it then is
> is really confusing and can be a problem for some usecases. We have users
> keeping their session open and the page information to display page of data.
>
> Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> a écrit :
>
> Hi,
>
> I was working on limiting query results by their size expressed in bytes,
> and some questions arose that I'd like to bring to the mailing list.
>
> The semantics of queries (without aggregation) - data limits are applied
> on the raw data returned from replicas - while it works fine for the row
> number limits as the number of rows is not likely to change after
> post-processing, it is not that accurate for size based limits as the cell
> sizes may be different after post-processing (for example due to applying
> some transformation function, projection, or whatever).
>
> We can truncate the results after post-processing to stay within the
> user-provided limit in bytes, but if the result is smaller than the limit -
> we will not fetch more. In that case, the meaning of "limit" being an
> actual limit is valid though it would be misleading for the page size
> because we will not fetch the maximum amount of data that does not exceed
> the page size.
>
> Such a problem is much more visible for "group by" queries with
> aggregation. The paging and limiting mechanism is applied to the rows
> rather than groups, as it has no information about how much memory a single
> group uses. For now, I've approximated a group size as the size of the
> largest participating row.
>
> The problem concerns the allowed interpretation of the size limit
> expressed in bytes. Whether we want to use this mechanism to let the users
>

Re: [DISCUSSION] Adding sonar report analysis to the Cassandra project

2023-06-12 Thread Mick Semb Wever
On Mon, 12 Jun 2023 at 15:02, Maxim Muzafarov  wrote:

> Hello everyone,
>
> I would like to make the source code of the Cassandra project more
> visible to people outside of the Cassandra Community and highlight the
> typical known issues in new contributions in the GitHub pull-request
> interface as well. This makes it easier for those who are unfamiliar
> with the accepted code style and just want to be part of a large and
> friendly community to add new contributions.
>
> The ASF provides [1] the SonarClound facilities for the open source
> project, which are free to use, and we can also easily add the process
> of building and uploading reports to the build using GitHub actions
> with almost no maintenance costs for us. Of course, as a
> recommendation quality tool, it doesn't reject any changes/pull
> requests, so nothing will change from that perspective.
>
> I've prepared everything we need to do this here (we also need to
> modify the default Sonar Way profile to suit our needs, which I can't
> do as I don't have sufficient privileges):
> https://issues.apache.org/jira/browse/CASSANDRA-18390
>
> I look forward to hearing your thoughts on this.
>


Looks good.  Agree with the use of GHA, but it's worth noting that this
cannot be a pre-commit gate, as PRs are not required.  And if it came as
part of pre-commit CI, how would the feedback work (given that the Jira
ticket is our point of contact pre-commit)?

I say go for it.  Especially with the post-commit trends it will be
valuable for us to see it before further adoption and adjustment.


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jeff Jirsa
On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer  wrote:

> If you have rows that vary significantly in their size, your latencies
>> could end up being pretty unpredictable using a LIMIT BY . Being
>> able to specify a limit by bytes at the driver / API level would allow app
>> devs to get more deterministic results out of their interaction w/the DB if
>> they're looking to respond back to a client within a certain time frame and
>> / or determine next steps in the app (continue paging, stop, etc) based on
>> how long it took to get results back.
>
>
> Are you talking about the page size or the LIMIT. Once the LIMIT is
> reached there is no "continue paging". LIMIT is also at the CQL level not
> at the driver level.
> I can totally understand the need for a page size in bytes not for a LIMIT.
>

I'd only ever EXPECT to see a page size in bytes, never a LIMIT
specifying bytes.

I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Josh McKenzie
Yeah, my bad. I have paging on the brain. Seriously.

I can't think of a use-case in which a LIMIT based on # bytes makes sense from 
a user perspective.

On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
> 
> 
> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer  wrote:
>>> If you have rows that vary significantly in their size, your latencies 
>>> could end up being pretty unpredictable using a LIMIT BY . Being 
>>> able to specify a limit by bytes at the driver / API level would allow app 
>>> devs to get more deterministic results out of their interaction w/the DB if 
>>> they're looking to respond back to a client within a certain time frame and 
>>> / or determine next steps in the app (continue paging, stop, etc) based on 
>>> how long it took to get results back.
>> 
>> Are you talking about the page size or the LIMIT. Once the LIMIT is reached 
>> there is no "continue paging". LIMIT is also at the CQL level not at the 
>> driver level.
>> I can totally understand the need for a page size in bytes not for a LIMIT.
> 
> Would only ever EXPECT to see a page size in bytes, never a LIMIT specifying 
> bytes.
> 
> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.
> 


Re: [DISCUSSION] Adding sonar report analysis to the Cassandra project

2023-06-12 Thread Jeff Jirsa
On Mon, Jun 12, 2023 at 10:18 AM Mick Semb Wever  wrote:

>
>
> On Mon, 12 Jun 2023 at 15:02, Maxim Muzafarov  wrote:
>
>> Hello everyone,
>>
>> I would like to make the source code of the Cassandra project more
>> visible to people outside of the Cassandra Community and highlight the
>> typical known issues in new contributions in the GitHub pull-request
>> interface as well. This makes it easier for those who are unfamiliar
>> with the accepted code style and just want to be part of a large and
>> friendly community to add new contributions.
>>
>> The ASF provides [1] the SonarClound facilities for the open source
>> project, which are free to use, and we can also easily add the process
>> of building and uploading reports to the build using GitHub actions
>> with almost no maintenance costs for us. Of course, as a
>> recommendation quality tool, it doesn't reject any changes/pull
>> requests, so nothing will change from that perspective.
>>
>> I've prepared everything we need to do this here (we also need to
>> modify the default Sonar Way profile to suit our needs, which I can't
>> do as I don't have sufficient privileges):
>> https://issues.apache.org/jira/browse/CASSANDRA-18390
>>
>> I look forward to hearing your thoughts on this.
>>
>
>
> Looks good.  Agree with the use of GHA, but it's worth noting that this
> cannot be a pre-commit gate – as PRs are not required.  And if it came as
> part of pre-commit CI, how would the feedback then work (as it's the jira
> ticket that is our point-of-contact pre-commit) ?
>
> I say go for it.  Especially with the post-commit trends it will be
> valuable for us to see it before further adoption and adjustment.
>

I'd also say the same - Go for it, at worst people can ignore it, at best
someone sees the data and decides to take action.

If we eventually try to define a POLICY based on the feedback, I suspect
it'll be a longer conversation, but I don't see any harm in setting it up.


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jeremiah Jordan
 As long as it is valid in the paging protocol to return a short page, but
still say “there are more pages”, I think that is fine to do that.  For an
actual LIMIT that is part of the user query, I think the server must always
have returned all data that fits into the LIMIT when all pages have been
returned.

-Jeremiah
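Jeremiah's distinction can be sketched in miniature: a server may cut a page short once a byte budget is reached, as long as it signals that more pages remain. A toy sketch follows (plain Python, using string length as a stand-in for a row's serialized size -- not Cassandra's actual paging code):

```python
def build_page(rows, start, page_bytes):
    """Fill a page up to a byte budget. The page may come back short of the
    budget; `has_more` (not page fullness) tells the client whether to keep
    fetching. `len(row)` stands in for a row's serialized size."""
    page, used, i = [], 0, start
    while i < len(rows):
        size = len(rows[i])
        # Always make progress: the first row goes in even if oversized.
        if page and used + size > page_bytes:
            break
        page.append(rows[i])
        used += size
        i += 1
    return page, i, i < len(rows)  # (page rows, next cursor, has_more)
```

With a 6-byte budget over rows of sizes 2, 4, 2, 6, this yields three pages of decreasing fullness, each correctly flagged, while a user-level LIMIT would still be enforced exactly across the union of all pages.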

On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie  wrote:

> Yeah, my bad. I have paging on the brain. Seriously.
>
> I can't think of a use-case in which a LIMIT based on # bytes makes sense
> from a user perspective.
>
> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>
>
>
> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer  wrote:
>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY . Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.
>
>
> Are you talking about the page size or the LIMIT? Once the LIMIT is
> reached there is no "continue paging". LIMIT is also at the CQL level, not
> at the driver level.
> I can totally understand the need for a page size in bytes, not for a LIMIT.
>
>
> Would only ever EXPECT to see a page size in bytes, never a LIMIT
> specifying bytes.
>
> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.
>
>
>


Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Josh McKenzie
> As long as it is valid in the paging protocol to return a short page, but 
> still say “there are more pages”, I think that is fine to do that.
Thankfully the v3-v5 spec all make it clear that clients need to respect what 
the server has to say about there being more pages: 
https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L1247-L1253

>   - Clients should not rely on the actual size of the result set returned to
> decide if there are more results to fetch or not. Instead, they should always
> check the Has_more_pages flag (unless they did not enable paging for the query
> obviously). Clients should also not assert that no result will have more than
> <result_page_size> results. While the current implementation always respects
> the exact value of <result_page_size>, we reserve the right to return
> slightly smaller or bigger pages in the future for performance reasons.
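The client-side contract the spec describes -- keep fetching while Has_more_pages is set, regardless of how full each page is -- can be illustrated with a toy driver loop (hypothetical function names, not any real driver's API):

```python
def fetch_all(execute_page):
    """Drain a paged query by trusting only the has_more_pages flag.
    Pages may legally be short (or even empty) while more pages remain,
    so page size is never used to decide whether to stop."""
    rows, paging_state = [], None
    while True:
        page, paging_state, has_more = execute_page(paging_state)
        rows.extend(page)
        if not has_more:
            return rows

# A fake server that returns a deliberately empty middle page
# while still flagging that more pages exist.
pages = [([1, 2], True), ([], True), ([3], False)]

def execute_page(state):
    i = 0 if state is None else state
    page, has_more = pages[i]
    return page, i + 1, has_more
```

A loop keyed on `len(page) < page_size` would have stopped at the empty middle page; keying on the flag drains all three pages.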

On Mon, Jun 12, 2023, at 3:19 PM, Jeremiah Jordan wrote:
> As long as it is valid in the paging protocol to return a short page, but 
> still say “there are more pages”, I think that is fine to do that.  For an 
> actual LIMIT that is part of the user query, I think the server must always 
> have returned all data that fits into the LIMIT when all pages have been 
> returned.
> 
> -Jeremiah
> 
> On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie  wrote:
>> 
>> Yeah, my bad. I have paging on the brain. Seriously.
>> 
>> I can't think of a use-case in which a LIMIT based on # bytes makes sense 
>> from a user perspective.
>> 
>> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>>> 
>>> 
>>> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer  wrote:
>>>>> If you have rows that vary significantly in their size, your latencies
>>>>> could end up being pretty unpredictable using a LIMIT BY . Being
>>>>> able to specify a limit by bytes at the driver / API level would allow app
>>>>> devs to get more deterministic results out of their interaction w/the DB if
>>>>> they're looking to respond back to a client within a certain time frame and
>>>>> / or determine next steps in the app (continue paging, stop, etc) based on
>>>>> how long it took to get results back.
>>>>
>>>> Are you talking about the page size or the LIMIT? Once the LIMIT is
>>>> reached there is no "continue paging". LIMIT is also at the CQL level, not
>>>> at the driver level.
>>>> I can totally understand the need for a page size in bytes, not for a LIMIT.
>>> 
>>> Would only ever EXPECT to see a page size in bytes, never a LIMIT 
>>> specifying bytes.
>>> 
>>> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.
>>> 
>> 


[DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-12 Thread Dan Jatnieks
Hello everyone,

I would like to propose removing the non-guardrail thresholds
'keyspace_count_warn_threshold' and 'table_count_warn_threshold'
configuration settings on the trunk branch for the next major release.

These thresholds were first added with CASSANDRA-16309 in 4.0-beta4 and
have subsequently been deprecated since 4.1-alpha in CASSANDRA-17195 when
they were replaced/migrated to guardrails as part of CEP-3 (Guardrails).

I'd appreciate any thoughts about this. I will open a ticket to get started
if there is support for doing this.

Reference:
https://issues.apache.org/jira/browse/CASSANDRA-16309
https://issues.apache.org/jira/browse/CASSANDRA-17195
CEP-3: Guardrails
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails


Thanks,
Dan Jatnieks


Re: [DISCUSS] CEP-8 Drivers Donation - take 2

2023-06-12 Thread Jeremy Hanna
I'd like to close out this thread.  As Benjamin notes, we'll have a single 
subproject for all of the drivers, with 3 PMC members overseeing the 
subproject as outlined in the linked subproject governance procedures.  However, 
we'll introduce the drivers to that subproject one by one out of necessity.

I'll open up a vote thread shortly so that we can move forward on the CEP and 
subproject approach.

> On May 30, 2023, at 7:32 AM, Benjamin Lerer  wrote:
> 
> The idea was to have a single driver sub-project. Even if the code bases are 
> different we believe that it is important to keep the drivers together to 
> retain cohesive API semantics and make sure they have similar functionality 
> and feature support.
> In this scenario we would need only 3 PMC members for the governance. I am 
> willing to be one of them.
> 
> For the committers, my understanding, based on the subproject governance 
> procedures, was that they should be proposed directly to the PMC members.
> 
>> Is the vote for the CEP to be for all drivers, but we will introduce each 
>> driver one by one?  What determines when we are comfortable with one driver 
>> subproject and can move on to accepting the next ? 
> 
> The goal of the CEP is simply to ensure that the community is in favor of the 
> donation. Nothing more. 
> The plan is to introduce the drivers, one by one. Each driver donation will 
> need to be accepted first by the PMC members, as it is the case for any 
> donation. Therefore the PMC should have full control on the pace at which new 
> drivers are accepted.
>   
> 
> On Tue, May 30, 2023 at 12:22 PM, Josh McKenzie wrote:
>>> Is the vote for the CEP to be for all drivers, but we will introduce each 
>>> driver one by one?  What determines when we are comfortable with one driver 
>>> subproject and can move on to accepting the next ? 
>> Curious to hear on this as well. There's 2 implications from the CEP as 
>> written:
>> 
>> 1. The Java and Python drivers hold special importance due to their language 
>> proximity and/or project's dependence upon them 
>> (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Scope)
>> 2. Datastax is explicitly offering all 7 drivers for donation 
>> (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Goals)
>> 
>> This is the most complex contribution via CEP thus far from a governance 
>> perspective; I suggest we chart a bespoke path to navigate this. Having a 
>> top level indication of "the CEP is approved" logically separate from a 
>> per-language indication of "the project is ready to absorb this language 
>> driver now" makes sense to me. This could look like:
>> 
>> * Vote on the CEP itself
>> * Per language (processing one at a time):
>>   * Identify 3 PMC members willing to take on the governance role for the
>>     language driver
>>   * Identify 2 contributors who are active on a given driver and stepping
>>     forward for a committer role on the driver
>>   * Vote on inclusion of that language driver in the project + commit bits
>>   * Integrate that driver into the project ecosystem (build, CI, docs, etc.)
>> 
>> Not sure how else we could handle committers / contributors / PMC members 
>> other than on a per-driver basis.
>> 
>> On Tue, May 30, 2023, at 5:36 AM, Mick Semb Wever wrote:
>>> 
>>> Thank you so much Jeremy and Greg (+others) for all the hard work on this.
>>>  
>>> 
>>> At this point, we'd like to propose CEP-8 for consideration, starting the 
>>> process to accept the DataStax Java driver as an official ASF project.
>>> 
>>> 
>>> Is the vote for the CEP to be for all drivers, but we will introduce each 
>>> driver one by one?  What determines when we are comfortable with one driver 
>>> subproject and can move on to accepting the next ? 
>>> 
>>> Are there key committers and contributors on each driver that want to be 
>>> involved?  Should they be listed before the vote?
>>> We also need three PMC for the new subproject.  Are we to assign these 
>>> before the vote?  
>>> 
>>> 
>> 



Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Jacek Lewandowski
Josh, that answers my question exactly; thank you.

I will not implement limiting the result set in CQL (that is, by the LIMIT
clause) and will stay with just paging. Whether the page size is defined in
bytes or rows can be determined by a flag - there are many unused bits for
that.
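For illustration, the kind of flag Jacek mentions could look like the sketch below. The bit positions are assumptions made for the example, not the actual native-protocol layout, and the decoder is a toy, not driver code:

```python
# Hypothetical flag bits -- illustrative only, NOT the real wire format.
# The point is that one spare bit can switch the unit of the page size
# value between rows and bytes without changing the message shape.
PAGE_SIZE = 0x04           # assumed: "a page size value is present"
PAGE_SIZE_IN_BYTES = 0x40  # assumed: spare bit repurposed as the unit flag

def decode_page_size(flags, value):
    """Interpret a page-size value according to the (assumed) flag bits."""
    if not flags & PAGE_SIZE:
        return None  # no paging requested
    unit = "bytes" if flags & PAGE_SIZE_IN_BYTES else "rows"
    return value, unit
```

The same integer field then means "4096 bytes" or "100 rows" depending purely on the flag bit, so existing clients that never set the new bit keep their current row-based semantics.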

So, my other question: for aggregation with the "group by" clause, we
return an aggregated row which is computed from a group of rows. With my
current implementation, its size is approximated by the size of the
largest row in that group - I think it is the safest and simplest
approximation - wdyt?
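A minimal sketch of that approximation (plain Python; the function names and inputs are illustrative, not the patch's actual code):

```python
def group_size_estimate(member_row_sizes):
    """Charge one aggregated GROUP BY row as the size of the largest row
    that participated in the group -- the heuristic described above."""
    return max(member_row_sizes)

def page_bytes_estimate(groups):
    # Total charged against a byte-based page limit for a page of groups,
    # where each group is given as the sizes of its member rows.
    return sum(group_size_estimate(sizes) for sizes in groups)
```

It over-counts groups whose aggregate is much smaller than the widest input row, but it never under-counts a group below its largest contributor, which is the safe direction for a memory-protection limit.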


On Mon, Jun 12, 2023 at 10:55 PM, Josh McKenzie wrote:

> As long as it is valid in the paging protocol to return a short page, but
> still say “there are more pages”, I think that is fine to do that.
>
> Thankfully the v3-v5 spec all make it clear that clients need to respect
> what the server has to say about there being more pages:
> https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L1247-L1253
>
>   - Clients should not rely on the actual size of the result set returned to
> decide if there are more results to fetch or not. Instead, they should always
> check the Has_more_pages flag (unless they did not enable paging for the
> query obviously). Clients should also not assert that no result will have
> more than <result_page_size> results. While the current implementation
> always respects the exact value of <result_page_size>, we reserve the right
> to return slightly smaller or bigger pages in the future for performance
> reasons.
>
>
> On Mon, Jun 12, 2023, at 3:19 PM, Jeremiah Jordan wrote:
>
> As long as it is valid in the paging protocol to return a short page, but
> still say “there are more pages”, I think that is fine to do that.  For an
> actual LIMIT that is part of the user query, I think the server must always
> have returned all data that fits into the LIMIT when all pages have been
> returned.
>
> -Jeremiah
>
> On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie 
> wrote:
>
>
> Yeah, my bad. I have paging on the brain. Seriously.
>
> I can't think of a use-case in which a LIMIT based on # bytes makes sense
> from a user perspective.
>
> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>
>
>
> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer  wrote:
>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY . Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.
>
>
> Are you talking about the page size or the LIMIT? Once the LIMIT is
> reached there is no "continue paging". LIMIT is also at the CQL level, not
> at the driver level.
> I can totally understand the need for a page size in bytes, not for a LIMIT.
>
>
> Would only ever EXPECT to see a page size in bytes, never a LIMIT
> specifying bytes.
>
> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.