Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Štefan Miklošovič
+1 and I am happy to go over streams and at least map where we are at to
see what could be done about that. As David mentioned, the main problem is
to distinguish what counts as a hot path and what not.

On Thu, May 30, 2024 at 6:29 PM Benedict  wrote:

> Since it’s related to the logging discussion we’re already having, I have
> seen stream pipelines showing up in a lot of traces recently. I am
> surprised; I thought it was understood that they shouldn’t be used on hot
> paths as they are not typically as efficient as old skool for-each
> constructions done sensibly, especially for small collections that may
> normally take zero or one items.
>
> I would like to propose forbidding the use of streams on hot paths without
> good justification that the cost:benefit is justified.
>
> It looks like it was nominally agreed two years ago that we would include
> words to this effect in the code style guide, but I forgot to include them
> when I transferred the new contents from the Google Doc proposal. So we
> could just include the “Performance” section that was meant to be included
> at the time.
>
> lists.apache.org
> 
> [image: favicon.ico]
> 
> 
>
>
> On 30 May 2024, at 13:33, Štefan Miklošovič 
> wrote:
>
> 
> I see the feedback is overall positive. I will merge that and I will
> improve the documentation on the website along with what Benedict suggested.
>
> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  wrote:
>
>>
>>
>>
>>> Based on these findings, I went through the code and I have incorporated
>>> these rules and I rewrote it like this:
>>>
>>> 1) no wrapping in "if" if we are not logging more than 2 parameters.
>>> 2) rewritten log messages to not contain any string concatenation but
>>> moving it all to placeholders ({}).
>>> 3) wrap it in "if" if we need to execute a method(s) on parameter(s)
>>> which is resource-consuming.
>>>
>>
>>
>> +1
>>
>>
>> It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't
>> impress me.
>>
>


favicon.ico
Description: Binary data


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Benedict
My concept of hot path is simply anything we can expect to be called frequently enough in normal operation that it might show up in a profiler. If it’s a library method then it’s reasonable to assume it should be able to be used in a hot path unless clearly labelled otherwise. In my view this includes things that might normally be masked by caching but under supported workloads may not be - such as query preparation.In fact, I’d say the default assumption should probably be that a method is “in a hot path” unless there’s good argument they aren’t - such as that the operation is likely to be run at some low frequency and the slow part is not part of any loop. Repair setup messages perhaps aren’t a hot path for instance (unless many of them are sent per repair), but validation compaction or merkle tree construction definitely is.I think it’s fine to not have perfect agreement about edge cases, but if anyone in a discussion thinks something is a hot path then it should be treated as one IMO.On 30 May 2024, at 18:39, David Capwell  wrote:As a general statement I agree with you (same for String.format as well), but one thing to call out is that it can be hard to tell what is the hot path and what isn’t.  When you are doing background work (like repair) its clear, but when touching something internal it can be hard to tell; this can also be hard with shared code as it gets authored outside the hot path then later used in the hot path…Also, what defines hot path?  Is this user facing only?  What about Validation/Streaming (stuff processing a large dataset)?  On May 30, 2024, at 9:29 AM, Benedict  wrote:Since it’s related to the logging discussion we’re already having, I have seen stream pipelines showing up in a lot of traces recently. I am surprised; I thought it was understood that they shouldn’t be used on hot paths as they are not typically as efficient as old skool for-each constructions done sensibly, especially for small collections that may normally take zero or one items.I would like to propose forbidding the use of streams on hot paths without good justification that the cost:benefit is justified. It looks like it was nominally agreed two years ago that we would include words to this effect in the code style guide, but I forgot to include them when I transferred the new contents from the Google Doc proposal. So we could just include the “Performance” section that was meant to be included at the time.lists.apache.orgOn 30 May 2024, at 13:33, Štefan Miklošovič  wrote:I see the feedback is overall positive. I will merge that and I will improve the documentation on the website along with what Benedict suggested.On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  wrote:    Based on these findings, I went through the code and I have incorporated these rules and I rewrote it like this:1) no wrapping in "if" if we are not logging more than 2 parameters.2) rewritten log messages to not contain any string concatenation but moving it all to placeholders ({}).3) wrap it in "if" if we need to execute a method(s) on parameter(s) which is resource-consuming.+1It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't impress me.



Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Berenguer Blasi

+1 on avoiding streams in hot paths

On 31/5/24 9:48, Benedict wrote:
My concept of hot path is simply anything we can expect to be called 
frequently enough in normal operation that it might show up in a 
profiler. If it’s a library method then it’s reasonable to assume it 
should be able to be used in a hot path unless clearly labelled 
otherwise.


In my view this includes things that might normally be masked by 
caching but under supported workloads may not be - such as query 
preparation.


In fact, I’d say the default assumption should probably be that a 
method is “in a hot path” unless there’s good argument they aren’t - 
such as that the operation is likely to be run at some low frequency 
and the slow part is not part of any loop. Repair setup messages 
perhaps aren’t a hot path for instance (unless many of them are sent 
per repair), but validation compaction or merkle tree construction 
definitely is.


I think it’s fine to not have perfect agreement about edge cases, but 
if anyone in a discussion thinks something is a hot path then it 
should be treated as one IMO.



On 30 May 2024, at 18:39, David Capwell  wrote:

 As a general statement I agree with you (same for String.format as 
well), but one thing to call out is that it can be hard to tell what 
is the hot path and what isn’t.  When you are doing background work 
(like repair) its clear, but when touching something internal it can 
be hard to tell; this can also be hard with shared code as it gets 
authored outside the hot path then later used in the hot path…


Also, what defines hot path?  Is this user facing only?  What about 
Validation/Streaming (stuff processing a large dataset)?



On May 30, 2024, at 9:29 AM, Benedict  wrote:

Since it’s related to the logging discussion we’re already having, I 
have seen stream pipelines showing up in a lot of traces recently. I 
am surprised; I thought it was understood that they shouldn’t be 
used on hot paths as they are not typically as efficient as old 
skool for-each constructions done sensibly, especially for small 
collections that may normally take zero or one items.


I would like to propose forbidding the use of streams on hot paths 
without good justification that the cost:benefit is justified.


It looks like it was nominally agreed two years ago that we would 
include words to this effect in the code style guide, but I forgot 
to include them when I transferred the new contents from the Google 
Doc proposal. So we could just include the “Performance” section 
that was meant to be included at the time.


lists.apache.org 

	 






On 30 May 2024, at 13:33, Štefan Miklošovič 
 wrote:



I see the feedback is overall positive. I will merge that and I 
will improve the documentation on the website along with what 
Benedict suggested.


On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  
wrote:


Based on these findings, I went through the code and I have
incorporated these rules and I rewrote it like this:

1) no wrapping in "if" if we are not logging more than 2
parameters.
2) rewritten log messages to not contain any string
concatenation but moving it all to placeholders ({}).
3) wrap it in "if" if we need to execute a method(s) on
parameter(s) which is resource-consuming.



+1


It's a shame slf4j botched it with lambdas, their 2.0 fluent
api doesn't impress me.



Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread David Capwell
My concept of hot path is simply anything we can expect to be called frequently enough in normal operation that it might show up in a profiler. If it’s a library method then it’s reasonable to assume it should be able to be used in a hot path unless clearly labelled otherwise. In fact, I’d say the default assumption should probably be that a method is “in a hot path” unless there’s good argument they aren’t I think it’s fine to not have perfect agreement about edge cases, but if anyone in a discussion thinks something is a hot path then it should be treated as one IMO.Works for me. If we are going to use the term “fast path” in the style guides then we should define it similarly to this; undefined terms are hard as everyone will have their own definitions Sent from my iPhoneOn May 31, 2024, at 2:08 AM, Berenguer Blasi  wrote:

  

  
  
+1 on avoiding streams in hot paths

On 31/5/24 9:48, Benedict wrote:


  
  

My concept of hot path is simply anything we can
  expect to be called frequently enough in normal operation that
  it might show up in a profiler. If it’s a library method then
  it’s reasonable to assume it should be able to be used in a
  hot path unless clearly labelled otherwise. 


In my view this includes things that might
  normally be masked by caching but under supported workloads
  may not be - such as query preparation.



  In fact, I’d say the default assumption should
probably be that a method is “in a hot path” unless there’s
good argument they aren’t - such as that the operation is
likely to be run at some low frequency and the slow part is
not part of any loop. Repair setup messages perhaps aren’t a
hot path for instance (unless many of them are sent per
repair), but validation compaction or merkle tree
construction definitely is.
  
  
  I think it’s fine to not have perfect agreement
about edge cases, but if anyone in a discussion thinks
something is a hot path then it should be treated as one
IMO.
  
  


  On 30 May 2024, at 18:39, David
Capwell  wrote:

  


  

As a general statement I agree with you (same for
String.format as well), but one thing to call out is that it
can be hard to tell what is the hot path and what isn’t.
 When you are doing background work (like repair) its clear,
but when touching something internal it can be hard to tell;
this can also be hard with shared code as it gets authored
outside the hot path then later used in the hot path…


Also, what defines hot path?  Is this user facing only?
   What about Validation/Streaming (stuff processing a large
  dataset)?  
  

  On May 30, 2024, at 9:29 AM, Benedict
 wrote:
  
  


  
  
Since it’s related to the logging
  discussion we’re already having, I have seen
  stream pipelines showing up in a lot of traces
  recently. I am surprised; I thought it was
  understood that they shouldn’t be used on hot
  paths as they are not typically as efficient
  as old skool for-each constructions done
  sensibly, especially for small collections
  that may normally take zero or one items.


I would like to propose
  forbidding the use of streams on hot paths
  without good justification that the
  cost:benefit is justified. 


It looks like it was nominally
  agreed two years ago that we would include
  words to this effect in the code style guide,
  but I forgot to include them when I
  transferred the new contents from the Google
  Doc proposal. So we could just include the
  “Performance” section that was meant to be
  included at the time.



  


   

Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Benjamin Lerer
For me the definition of hot path is too vague. We had arguments with
Berenger multiple times and it is more a waste of time than anything else
at the end. If we are truly concerned about stream efficiency then we
should simply forbid them. That will avoid lengthy discussions about what
constitute the hot path and what does not.

Le ven. 31 mai 2024 à 11:08, Berenguer Blasi  a
écrit :

> +1 on avoiding streams in hot paths
> On 31/5/24 9:48, Benedict wrote:
>
> My concept of hot path is simply anything we can expect to be called
> frequently enough in normal operation that it might show up in a profiler.
> If it’s a library method then it’s reasonable to assume it should be able
> to be used in a hot path unless clearly labelled otherwise.
>
> In my view this includes things that might normally be masked by caching
> but under supported workloads may not be - such as query preparation.
>
> In fact, I’d say the default assumption should probably be that a method
> is “in a hot path” unless there’s good argument they aren’t - such as that
> the operation is likely to be run at some low frequency and the slow part
> is not part of any loop. Repair setup messages perhaps aren’t a hot path
> for instance (unless many of them are sent per repair), but validation
> compaction or merkle tree construction definitely is.
>
> I think it’s fine to not have perfect agreement about edge cases, but if
> anyone in a discussion thinks something is a hot path then it should be
> treated as one IMO.
>
> On 30 May 2024, at 18:39, David Capwell 
>  wrote:
>
>  As a general statement I agree with you (same for String.format as
> well), but one thing to call out is that it can be hard to tell what is the
> hot path and what isn’t.  When you are doing background work (like repair)
> its clear, but when touching something internal it can be hard to tell;
> this can also be hard with shared code as it gets authored outside the hot
> path then later used in the hot path…
>
> Also, what defines hot path?  Is this user facing only?  What about
> Validation/Streaming (stuff processing a large dataset)?
>
> On May 30, 2024, at 9:29 AM, Benedict 
>  wrote:
>
> Since it’s related to the logging discussion we’re already having, I have
> seen stream pipelines showing up in a lot of traces recently. I am
> surprised; I thought it was understood that they shouldn’t be used on hot
> paths as they are not typically as efficient as old skool for-each
> constructions done sensibly, especially for small collections that may
> normally take zero or one items.
>
> I would like to propose forbidding the use of streams on hot paths without
> good justification that the cost:benefit is justified.
>
> It looks like it was nominally agreed two years ago that we would include
> words to this effect in the code style guide, but I forgot to include them
> when I transferred the new contents from the Google Doc proposal. So we
> could just include the “Performance” section that was meant to be included
> at the time.
>
> lists.apache.org
> 
> 
> 
> 
>
>
> On 30 May 2024, at 13:33, Štefan Miklošovič 
>  wrote:
>
> 
> I see the feedback is overall positive. I will merge that and I will
> improve the documentation on the website along with what Benedict suggested.
>
> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  wrote:
>
>>
>>
>>
>>> Based on these findings, I went through the code and I have incorporated
>>> these rules and I rewrote it like this:
>>>
>>> 1) no wrapping in "if" if we are not logging more than 2 parameters.
>>> 2) rewritten log messages to not contain any string concatenation but
>>> moving it all to placeholders ({}).
>>> 3) wrap it in "if" if we need to execute a method(s) on parameter(s)
>>> which is resource-consuming.
>>>
>>
>>
>> +1
>>
>>
>> It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't
>> impress me.
>>
>
>


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Jacek Lewandowski
+1 to either forbid them entirely or not at all.


pt., 31 maj 2024, 16:05 użytkownik Benjamin Lerer 
napisał:

> For me the definition of hot path is too vague. We had arguments with
> Berenger multiple times and it is more a waste of time than anything else
> at the end. If we are truly concerned about stream efficiency then we
> should simply forbid them. That will avoid lengthy discussions about what
> constitute the hot path and what does not.
>
> Le ven. 31 mai 2024 à 11:08, Berenguer Blasi  a
> écrit :
>
>> +1 on avoiding streams in hot paths
>> On 31/5/24 9:48, Benedict wrote:
>>
>> My concept of hot path is simply anything we can expect to be called
>> frequently enough in normal operation that it might show up in a profiler.
>> If it’s a library method then it’s reasonable to assume it should be able
>> to be used in a hot path unless clearly labelled otherwise.
>>
>> In my view this includes things that might normally be masked by caching
>> but under supported workloads may not be - such as query preparation.
>>
>> In fact, I’d say the default assumption should probably be that a method
>> is “in a hot path” unless there’s good argument they aren’t - such as that
>> the operation is likely to be run at some low frequency and the slow part
>> is not part of any loop. Repair setup messages perhaps aren’t a hot path
>> for instance (unless many of them are sent per repair), but validation
>> compaction or merkle tree construction definitely is.
>>
>> I think it’s fine to not have perfect agreement about edge cases, but if
>> anyone in a discussion thinks something is a hot path then it should be
>> treated as one IMO.
>>
>> On 30 May 2024, at 18:39, David Capwell 
>>  wrote:
>>
>>  As a general statement I agree with you (same for String.format as
>> well), but one thing to call out is that it can be hard to tell what is the
>> hot path and what isn’t.  When you are doing background work (like repair)
>> its clear, but when touching something internal it can be hard to tell;
>> this can also be hard with shared code as it gets authored outside the hot
>> path then later used in the hot path…
>>
>> Also, what defines hot path?  Is this user facing only?  What about
>> Validation/Streaming (stuff processing a large dataset)?
>>
>> On May 30, 2024, at 9:29 AM, Benedict 
>>  wrote:
>>
>> Since it’s related to the logging discussion we’re already having, I have
>> seen stream pipelines showing up in a lot of traces recently. I am
>> surprised; I thought it was understood that they shouldn’t be used on hot
>> paths as they are not typically as efficient as old skool for-each
>> constructions done sensibly, especially for small collections that may
>> normally take zero or one items.
>>
>> I would like to propose forbidding the use of streams on hot paths
>> without good justification that the cost:benefit is justified.
>>
>> It looks like it was nominally agreed two years ago that we would include
>> words to this effect in the code style guide, but I forgot to include them
>> when I transferred the new contents from the Google Doc proposal. So we
>> could just include the “Performance” section that was meant to be included
>> at the time.
>>
>> lists.apache.org
>> 
>> 
>> 
>> 
>>
>>
>> On 30 May 2024, at 13:33, Štefan Miklošovič 
>>  wrote:
>>
>> 
>> I see the feedback is overall positive. I will merge that and I will
>> improve the documentation on the website along with what Benedict suggested.
>>
>> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever  wrote:
>>
>>>
>>>
>>>
 Based on these findings, I went through the code and I have
 incorporated these rules and I rewrote it like this:

 1) no wrapping in "if" if we are not logging more than 2 parameters.
 2) rewritten log messages to not contain any string concatenation but
 moving it all to placeholders ({}).
 3) wrap it in "if" if we need to execute a method(s) on parameter(s)
 which is resource-consuming.

>>>
>>>
>>> +1
>>>
>>>
>>> It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't
>>> impress me.
>>>
>>
>>


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Benedict Elliott Smith
I think I have already proposed a simple solution to this problem on the 
thread: if anyone says it’s a hot path (and cannot be persuaded otherwise), it 
should be treated as such. Saves argument, but permits an easy escape hatch if 
everyone agrees with minimal discussion.

I think this is a good general principle for raising standards in the codebase 
like this: if somebody says something is important, and cannot be convinced 
otherwise, then it should generally be treated as important. This is different 
from cases where there are simply competing approaches.

That said, if people want to be absolutist about this I won’t mind.



> On 31 May 2024, at 15:04, Benjamin Lerer  wrote:
> 
> For me the definition of hot path is too vague. We had arguments with 
> Berenger multiple times and it is more a waste of time than anything else at 
> the end. If we are truly concerned about stream efficiency then we should 
> simply forbid them. That will avoid lengthy discussions about what constitute 
> the hot path and what does not.
> 
> Le ven. 31 mai 2024 à 11:08, Berenguer Blasi  > a écrit :
>> +1 on avoiding streams in hot paths
>> 
>> On 31/5/24 9:48, Benedict wrote:
>>> My concept of hot path is simply anything we can expect to be called 
>>> frequently enough in normal operation that it might show up in a profiler. 
>>> If it’s a library method then it’s reasonable to assume it should be able 
>>> to be used in a hot path unless clearly labelled otherwise. 
>>> 
>>> In my view this includes things that might normally be masked by caching 
>>> but under supported workloads may not be - such as query preparation.
>>> 
>>> In fact, I’d say the default assumption should probably be that a method is 
>>> “in a hot path” unless there’s good argument they aren’t - such as that the 
>>> operation is likely to be run at some low frequency and the slow part is 
>>> not part of any loop. Repair setup messages perhaps aren’t a hot path for 
>>> instance (unless many of them are sent per repair), but validation 
>>> compaction or merkle tree construction definitely is.
>>> 
>>> I think it’s fine to not have perfect agreement about edge cases, but if 
>>> anyone in a discussion thinks something is a hot path then it should be 
>>> treated as one IMO.
>>> 
 On 30 May 2024, at 18:39, David Capwell  
  wrote:
 
  As a general statement I agree with you (same for String.format as 
 well), but one thing to call out is that it can be hard to tell what is 
 the hot path and what isn’t.  When you are doing background work (like 
 repair) its clear, but when touching something internal it can be hard to 
 tell; this can also be hard with shared code as it gets authored outside 
 the hot path then later used in the hot path…
 
 Also, what defines hot path?  Is this user facing only?  What about 
 Validation/Streaming (stuff processing a large dataset)?  
 
> On May 30, 2024, at 9:29 AM, Benedict  
>  wrote:
> 
> Since it’s related to the logging discussion we’re already having, I have 
> seen stream pipelines showing up in a lot of traces recently. I am 
> surprised; I thought it was understood that they shouldn’t be used on hot 
> paths as they are not typically as efficient as old skool for-each 
> constructions done sensibly, especially for small collections that may 
> normally take zero or one items.
> 
> I would like to propose forbidding the use of streams on hot paths 
> without good justification that the cost:benefit is justified. 
> 
> It looks like it was nominally agreed two years ago that we would include 
> words to this effect in the code style guide, but I forgot to include 
> them when I transferred the new contents from the Google Doc proposal. So 
> we could just include the “Performance” section that was meant to be 
> included at the time.
> 
> lists.apache.org
> 
>  
> lists.apache.org
>  
>  
> 
> 
> 
>> On 30 May 2024, at 13:33, Štefan Miklošovič 
>>   wrote:
>> 
>> 
>> I see the feedback is overall positive. I will merge that and I will 
>> improve the documentation on the website along with what Benedict 
>> suggested.
>> 
>> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever > > wrote:
>>>   
>>>  
 Based on these findings, I went through the code and I have 
 incorporated these rules and I rewrote it like this:
 
 1) no wrapping in "if" if we are not logging more than 2 parameters.
 2) rewritten log messages to n

Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Abe Ratnofsky
+1 to forbidding Stream usage entirely; the convenience of using them outside 
of hot paths is less than the burden of figuring out whether or not a 
particular path is hot. Even for reviewers it can be difficult to tell whether 
a particular path is hot; hard-to-diagnose bugs like CASSANDRA-18110 
 were somewhat caused by 
this ambiguity.

I wonder if there are ways we can better automate tracking of 
performance-sensitive paths. I know FB's Infer claims to be able to do this[1], 
but is limited in practice.

[1]: https://fbinfer.com/docs/checker-annotation-reachability

> On May 31, 2024, at 10:19 AM, Benedict Elliott Smith  
> wrote:
> 
> I think I have already proposed a simple solution to this problem on the 
> thread: if anyone says it’s a hot path (and cannot be persuaded otherwise), 
> it should be treated as such. Saves argument, but permits an easy escape 
> hatch if everyone agrees with minimal discussion.
> 
> I think this is a good general principle for raising standards in the 
> codebase like this: if somebody says something is important, and cannot be 
> convinced otherwise, then it should generally be treated as important. This 
> is different from cases where there are simply competing approaches.
> 
> That said, if people want to be absolutist about this I won’t mind.
> 
> 
> 
>> On 31 May 2024, at 15:04, Benjamin Lerer  wrote:
>> 
>> For me the definition of hot path is too vague. We had arguments with 
>> Berenger multiple times and it is more a waste of time than anything else at 
>> the end. If we are truly concerned about stream efficiency then we should 
>> simply forbid them. That will avoid lengthy discussions about what 
>> constitute the hot path and what does not.
>> 
>> Le ven. 31 mai 2024 à 11:08, Berenguer Blasi > > a écrit :
>>> +1 on avoiding streams in hot paths
>>> 
>>> On 31/5/24 9:48, Benedict wrote:
 My concept of hot path is simply anything we can expect to be called 
 frequently enough in normal operation that it might show up in a profiler. 
 If it’s a library method then it’s reasonable to assume it should be able 
 to be used in a hot path unless clearly labelled otherwise. 
 
 In my view this includes things that might normally be masked by caching 
 but under supported workloads may not be - such as query preparation.
 
 In fact, I’d say the default assumption should probably be that a method 
 is “in a hot path” unless there’s good argument they aren’t - such as that 
 the operation is likely to be run at some low frequency and the slow part 
 is not part of any loop. Repair setup messages perhaps aren’t a hot path 
 for instance (unless many of them are sent per repair), but validation 
 compaction or merkle tree construction definitely is.
 
 I think it’s fine to not have perfect agreement about edge cases, but if 
 anyone in a discussion thinks something is a hot path then it should be 
 treated as one IMO.
 
> On 30 May 2024, at 18:39, David Capwell  
>  wrote:
> 
>  As a general statement I agree with you (same for String.format as 
> well), but one thing to call out is that it can be hard to tell what is 
> the hot path and what isn’t.  When you are doing background work (like 
> repair) its clear, but when touching something internal it can be hard to 
> tell; this can also be hard with shared code as it gets authored outside 
> the hot path then later used in the hot path…
> 
> Also, what defines hot path?  Is this user facing only?  What about 
> Validation/Streaming (stuff processing a large dataset)?  
> 
>> On May 30, 2024, at 9:29 AM, Benedict  
>>  wrote:
>> 
>> Since it’s related to the logging discussion we’re already having, I 
>> have seen stream pipelines showing up in a lot of traces recently. I am 
>> surprised; I thought it was understood that they shouldn’t be used on 
>> hot paths as they are not typically as efficient as old skool for-each 
>> constructions done sensibly, especially for small collections that may 
>> normally take zero or one items.
>> 
>> I would like to propose forbidding the use of streams on hot paths 
>> without good justification that the cost:benefit is justified. 
>> 
>> It looks like it was nominally agreed two years ago that we would 
>> include words to this effect in the code style guide, but I forgot to 
>> include them when I transferred the new contents from the Google Doc 
>> proposal. So we could just include the “Performance” section that was 
>> meant to be included at the time.
>> 
>> lists.apache.org
>> 
>>  
>> lists.apache.org
>>  

Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Brandon Williams
On Fri, May 31, 2024 at 9:35 AM Abe Ratnofsky  wrote:

> +1 to forbidding Stream usage entirely; the convenience of using them
> outside of hot paths is less than the burden of figuring out whether or not
> a particular path is hot.
>

I think I have most frequently appreciated them in tests, which I think we
could except, since these are categorically not in the hot path.

Kind Regards,
Brandon


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread David Capwell
I am cool for forbidding with a callout that tests are ok.  I am cool with 
forbidding in tests as well, but thats just for consistency reasons than 
anything.

> On May 31, 2024, at 8:12 AM, Brandon Williams  wrote:
> 
> 
> On Fri, May 31, 2024 at 9:35 AM Abe Ratnofsky  > wrote:
>> +1 to forbidding Stream usage entirely; the convenience of using them 
>> outside of hot paths is less than the burden of figuring out whether or not 
>> a particular path is hot.
> 
> I think I have most frequently appreciated them in tests, which I think we 
> could except, since these are categorically not in the hot path.
> 
> Kind Regards,
> Brandon
>  



Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Mick Semb Wever
>
> I think this is a good general principle for *raising standards* in the
> codebase like this: if somebody says something is important, and cannot be
> convinced otherwise, then it should generally be treated as important. This
> is different from cases where there are simply competing approaches.
>


FTR this is my preference too.  The code trends in one direction, there's
no ambiguity or conflict when in doubt.  And IMHO the simple rule is to
what is hot path is whether it's called more than once per request.  It's
not close to accurate, but it's such an easy line to draw (and one which
tests fall into).


Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Jacek Lewandowski
Usages of them in tests are ok I think. We have a separate checkstyle file
for the test code.

- - -- --- -  -
Jacek Lewandowski


pt., 31 maj 2024 o 19:14 David Capwell  napisał(a):

> I am cool for forbidding with a callout that tests are ok.  I am cool with
> forbidding in tests as well, but thats just for consistency reasons than
> anything.
>
> On May 31, 2024, at 8:12 AM, Brandon Williams  wrote:
>
>
> On Fri, May 31, 2024 at 9:35 AM Abe Ratnofsky  wrote:
>
>> +1 to forbidding Stream usage entirely; the convenience of using them
>> outside of hot paths is less than the burden of figuring out whether or not
>> a particular path is hot.
>>
>
> I think I have most frequently appreciated them in tests, which I think we
> could except, since these are categorically not in the hot path.
>
> Kind Regards,
> Brandon
>
>
>
>


Re: [EXTERNAL] Re: [DISCUSS] Adding experimental vtables and rules around them

2024-05-31 Thread German Eichberger via dev
Hi,

To sum where everyone is coming from: We would like to have features in a 
stable version of Cassandra which are experimental and are subject to 
non-backward compatible change. This indicates to me that the feature is not 
finished and should likely not be included in a stable release.  What benefit 
are we looking for by including it into a stable release as opposed to rolling 
it to the next release.

Thanks,
German

From: Maxim Muzafarov 
Sent: Wednesday, May 29, 2024 1:09 PM
To: dev@cassandra.apache.org 
Subject: [EXTERNAL] Re: [DISCUSS] Adding experimental vtables and rules around 
them

Hello everyone,

I like the idea of highlighting some of the experimental virtual
tables whose model might be changed in future releases.

As another option, we could add an @Experimetal annotation (or another
name) and a configuration parameter
experimental_virtula_tables_enabled (default is false). This, in turn,
means if a virtual table is experimental, it won't be registered in a
virtual keyspace unless the corresponding configuration parameter is
enabled. This also means that a user must implicitly enable an
experimental API, and prevent us from spamming the log with warnings.
All of this does not preclude us from specifying the experimental
state of some virtual tables in the documentation.

On Wed, 29 May 2024 at 21:18, Abe Ratnofsky  wrote:
>
> I agree that ClientWarning is the best way to indicate the risk of using an 
> experimental feature directly to the user. Presenting information in the 
> client application's logs directly means that the person who wrote the query 
> is most likely to see the warning, rather than an operator who sees cluster 
> logs.
>
> I don't think it's necessary to attach a ClientWarning to every single client 
> response; a ClientWarning analog to NoSpamLogger would be useful for this 
> ("warn a client at most once per day").
>
> This would also be useful for warning on usage of deprecated features.
>
> > On May 29, 2024, at 3:01 PM, David Capwell  wrote:
> >
> > We agreed a long time ago that all new features are disabled by default, 
> > but I wanted to try to flesh out what we “should” do with something that 
> > might still be experimental and subject to breaking changes; I would prefer 
> > we keep this thread specific to vtables as the UX is different for 
> > different types of things…
> >
> > So, lets say we are adding a set of vtables but we are not 100% sure what 
> > the schema should be and we learn after the release that changes should be 
> > made, but that would end up breaking the table… we currently define 
> > everything as “don’t break this” so if we publish a table that isn’t 100% 
> > baked we are kinda stuck with it for a very very long time… I would like to 
> > define a way to expose vtables that are subject to change (breaking schema 
> > changes) across different release and rules around them (only in minor?  
> > Maybe even in patch?).
> >
> > Lets try to use a concrete example so everyone is on the same page.
> >
> > Accord is disabled by default (it is a new feature), so the vtables to 
> > expose internals would be expected to be undefined and not present on the 
> > instance.
> >
> > When accord is enabled (accord.enabled = true) we add a set of vtables:
> >
> > Epochs - shows what epochs are known to accord
> > Cache - shows how the internal caches are performing
> > Etc.
> >
> > Using epochs as an example it currently only shows a single column: the 
> > long epoch
> >
> > CREATE VIRTUAL TABLE system_accord.epochs (epoch bigint PRIMARY KEY);
> >
> > Lets say we find that this table isn’t enough and we really need to scope 
> > it to each of the “stores” (threads for processing accord tasks)
> >
> > CREATE VIRTUAL TABLE system_accord.epochs (epoch bigint, store_id int, 
> > PRIMARY KEY (epoch, store_id));
> >
> > In this example the table changed the schema in a way that could break 
> > users, so this normally is not allowed.
> >
> > Since we don’t really have a way to define something experimental other 
> > than NEWS.txt, we kinda get stuck with this table and are forced to make 
> > new versions and maintain them for a long time (in this example we would 
> > have epochs and epochs_v2)… it would be nice if we could define a way to 
> > express that tables are free to be changed (modified or even deleted) and 
> > the life cycle for them….
> >
> > I propose that we allow such a case and make changes to the UX (as best as 
> > we can) to warn about this:
> >
> > 1) update NEWS.txt to denote that the feature is experimental
> > 2) when you access an experimental table you get a ClientWarning stating 
> > that this is free to change
> > 3) the tables comments starts with “[EXPERIMENTAL]”
> >
> > What do others think?
> >
> >
>


Re: [EXTERNAL] Re: [DISCUSS] Adding experimental vtables and rules around them

2024-05-31 Thread J. D. Jordan
We have already agreed in the past that having experimental features, behind 
feature flags, in stable releases is a good thing for keeping those features up 
to date, for getting feedback from end users, and many others.
The question here is about how we ensure that end users are aware something is 
experimental, because the person who installed C* and enabled the feature flag 
is likely not the same person who is using the feature.

> On May 31, 2024, at 3:23 PM, German Eichberger via dev 
>  wrote:
> 
> 
> Hi,
> 
> To sum where everyone is coming from: We would like to have features in a 
> stable version of Cassandra which are experimental and are subject to 
> non-backward compatible change. This indicates to me that the feature is not 
> finished and should likely not be included in a stable release.  What benefit 
> are we looking for by including it into a stable release as opposed to 
> rolling it to the next release.
> 
> Thanks,
> German
> From: Maxim Muzafarov 
> Sent: Wednesday, May 29, 2024 1:09 PM
> To: dev@cassandra.apache.org 
> Subject: [EXTERNAL] Re: [DISCUSS] Adding experimental vtables and rules 
> around them
>  
> Hello everyone,
> 
> I like the idea of highlighting some of the experimental virtual
> tables whose model might be changed in future releases.
> 
> As another option, we could add an @Experimetal annotation (or another
> name) and a configuration parameter
> experimental_virtula_tables_enabled (default is false). This, in turn,
> means if a virtual table is experimental, it won't be registered in a
> virtual keyspace unless the corresponding configuration parameter is
> enabled. This also means that a user must implicitly enable an
> experimental API, and prevent us from spamming the log with warnings.
> All of this does not preclude us from specifying the experimental
> state of some virtual tables in the documentation.
> 
> On Wed, 29 May 2024 at 21:18, Abe Ratnofsky  wrote:
> >
> > I agree that ClientWarning is the best way to indicate the risk of using an 
> > experimental feature directly to the user. Presenting information in the 
> > client application's logs directly means that the person who wrote the 
> > query is most likely to see the warning, rather than an operator who sees 
> > cluster logs.
> >
> > I don't think it's necessary to attach a ClientWarning to every single 
> > client response; a ClientWarning analog to NoSpamLogger would be useful for 
> > this ("warn a client at most once per day").
> >
> > This would also be useful for warning on usage of deprecated features.
> >
> > > On May 29, 2024, at 3:01 PM, David Capwell  wrote:
> > >
> > > We agreed a long time ago that all new features are disabled by default, 
> > > but I wanted to try to flesh out what we “should” do with something that 
> > > might still be experimental and subject to breaking changes; I would 
> > > prefer we keep this thread specific to vtables as the UX is different for 
> > > different types of things…
> > >
> > > So, lets say we are adding a set of vtables but we are not 100% sure what 
> > > the schema should be and we learn after the release that changes should 
> > > be made, but that would end up breaking the table… we currently define 
> > > everything as “don’t break this” so if we publish a table that isn’t 100% 
> > > baked we are kinda stuck with it for a very very long time… I would like 
> > > to define a way to expose vtables that are subject to change (breaking 
> > > schema changes) across different release and rules around them (only in 
> > > minor?  Maybe even in patch?).
> > >
> > > Lets try to use a concrete example so everyone is on the same page.
> > >
> > > Accord is disabled by default (it is a new feature), so the vtables to 
> > > expose internals would be expected to be undefined and not present on the 
> > > instance.
> > >
> > > When accord is enabled (accord.enabled = true) we add a set of vtables:
> > >
> > > Epochs - shows what epochs are known to accord
> > > Cache - shows how the internal caches are performing
> > > Etc.
> > >
> > > Using epochs as an example it currently only shows a single column: the 
> > > long epoch
> > >
> > > CREATE VIRTUAL TABLE system_accord.epochs (epoch bigint PRIMARY KEY);
> > >
> > > Lets say we find that this table isn’t enough and we really need to scope 
> > > it to each of the “stores” (threads for processing accord tasks)
> > >
> > > CREATE VIRTUAL TABLE system_accord.epochs (epoch bigint, store_id int, 
> > > PRIMARY KEY (epoch, store_id));
> > >
> > > In this example the table changed the schema in a way that could break 
> > > users, so this normally is not allowed.
> > >
> > > Since we don’t really have a way to define something experimental other 
> > > than NEWS.txt, we kinda get stuck with this table and are forced to make 
> > > new versions and maintain them for a long time (in this example we would 
> > > have epochs and epochs_v2)… it would be nice if we could define a way to 
> > > express that t

[DISCUSS] CEP-42: Constraints Framework

2024-05-31 Thread Bernardo Botella
Hello everyone,

I am proposing this CEP:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-42%3A+Constraints+Framework

And I’m looking for feedback from the community.

Thanks a lot!
Bernardo