Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.

How does having this in a sidecar change the impact on “storage performance”?  
The sidecar reading sstables will have the same impact on storage IO as the 
main process reading sstables.  Given the sidecar is running on the same node 
as the main C* process, the only real resource isolation you have is in 
heap/GC?  CPU/memory/IO are all still shared between the main C* process and 
the sidecar, and coordinating those across processes is harder than 
coordinating them within a single process.  For example, if we wanted to have 
the compaction throughput, streaming throughput, and analytics read throughput 
all tied back to a single disk IO cap, that is harder with an external process.
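To make the single-cap point concrete, here is a minimal sketch (illustrative only; the class and numbers are mine, not Cassandra's actual throttling code): one token bucket that compaction, streaming, and analytics reads could all draw from, which is trivial to share inside one process.

```java
// Illustrative token bucket for a single shared disk-IO cap; not Cassandra code.
final class SharedDiskThrottle {
    private final long bytesPerSecond;
    private long availableBytes;   // remaining budget, capped at one second's worth
    private long lastRefillNanos;

    SharedDiskThrottle(long bytesPerSecond, long nowNanos) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;
        this.lastRefillNanos = nowNanos;
    }

    // Called by compaction, streaming, and analytics reads alike; the clock is
    // injected so behaviour is deterministic and testable.
    synchronized boolean tryAcquire(long bytes, long nowNanos) {
        long refill = (nowNanos - lastRefillNanos) * bytesPerSecond / 1_000_000_000L;
        availableBytes = Math.min(bytesPerSecond, availableBytes + refill);
        lastRefillNanos = nowNanos;
        if (bytes > availableBytes) return false;
        availableBytes -= bytes;
        return true;
    }
}
```

Across processes, the equivalent shared cap needs shared memory or an RPC on every acquire, which is the coordination cost being described.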

> - Complexity. Considering the existence of the Sidecar project, it would be 
> less complex to avoid adding another (http?) service in Cassandra.

Not sure that is really very complex; running an http service is pretty easy, 
no?  We already have netty in use to instantiate one from.
I worry more about the complexity of having the matching schema for a set of 
sstables being read.  The complexity of new sstable versions/formats being 
introduced.  The complexity of having up to date data from memtables being 
considered by this API without having to flush before every query of it.  The 
complexity of dealing with the new memtable API introduced in CEP-11.  The 
complexity of coordinating compaction/streaming adding and removing files with 
these APIs reading them.  There are a lot of edge cases to consider for this 
external access to sstables that the main process considers itself the “owner” 
of.

All of this is not to say that I think separating things out into other 
processes/services is bad.  But I think we need to be very careful with how we 
do it, or end users will end up running into all the sharp edges and the 
feature will fail.

-Jeremiah

> On Mar 24, 2023, at 8:15 PM, Yifan Cai  wrote:
> 
> Hi Jeremiah, 
> 
> There are good reasons to not have these inside Cassandra. Consider the 
> following.
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.
> - Availability. If the Cassandra cluster is being bounced, using sidecar 
> would not affect the SBR/SBW functionality, e.g. SBR can still read SSTables 
> via sidecar endpoints. 
> - Compatibility. Sidecar provides stable REST-based APIs, such as uploading 
> SSTables endpoint, which would remain compatible with different versions of 
> Cassandra. The current implementation supports versions 3.0 and 4.0.
> - Complexity. Considering the existence of the Sidecar project, it would be 
> less complex to avoid adding another (http?) service in Cassandra.
> - Release velocity. Sidecar, as an independent project, can have a quicker 
> release cycle from Cassandra. 
> - The features in sidecar are mostly implemented based on various existing 
> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
> 
> Regarding authentication and authorization
> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold up 
> this CEP. It would be a feature that benefits all Sidecar endpoints.
> 
> - Yifan
> 
> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer  wrote:
>> I agree that the analytics library will need to support vnodes. To be clear, 
>> there’s nothing preventing the solution from working with vnodes right now, 
>> and no assumptions about a 1:1 topology between a token and a node. However, 
>> we don’t, today, have the ability to test vnode support end-to-end. We are 
>> working towards that, however, and should be able to remove the caveat from 
>> the released analytics library once we can properly test vnode support.
>> If it helps, I can update the CEP to say something more like “Caveat: 
>> Currently untested with vnodes - work is ongoing to remove this limitation” 
>> if that helps?
>> 
>> Doug
>> 
>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
>> > 
>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan  wrote:
>> >> 
>> >> I have concerns with the majority of this being in the sidecar and not in 
>> >> the database itself.  I think it would make sense for the server side of 
>> >> this to be a new service exposed by the database, not in the sidecar.  
>> >> That way it can be able to properly integrate with the authentication and 
>> >> authorization apis, and to make it a first class citizen in terms of 
>> >> having unit/integration tests in the main DB ensuring no one breaks it.
>> > 
>> > I

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeff Jirsa
On Tue, Mar 28, 2023 at 7:30 AM Jeremiah D Jordan 
wrote:

> - Resources isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
>
>
> How does having this in a side car change the impact on “storage
> performance”?  The side car reading sstables will have the same impact on
> storage IO as the main process reading sstables.
>

This is true.


>  Given the sidecar is running on the same node as the main C* process, the
> only real resource isolation you have is in heap/GC?  CPU/Memory/IO are all
> still shared between the main C* process and the side car, and coordinating
> those across processes is harder than coordinating them within a single
> process. For example if we wanted to have the compaction throughput,
> streaming throughput, and analytics read throughput all tied back to a
> single disk IO cap, that is harder with an external process.
>

Relatively trivial, for CPU and memory, to run them in different
containers/cgroups/etc, so you can put an exact cpu/memory limit on the
sidecar. That's different from a jmx rate limiter/throttle, but (arguably)
more precise, because it actually limits the underlying physical resource
instead of a proxy for it in a config setting.
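As a concrete sketch of what this looks like (the group name and numbers below are illustrative assumptions, and applying them requires root plus cgroup v2 mounted at /sys/fs/cgroup): a cgroup limit is just a small string written into the cgroup filesystem.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative cgroup v2 helper; "sidecar" group name and limits are
// assumptions, not part of any actual C*/Sidecar deployment.
final class CgroupLimits {
    // cpu.max is "<quota_us> <period_us>": 2.0 CPUs at the default
    // 100ms period is "200000 100000".
    static String cpuMax(double cpus, long periodMicros) {
        return (long) (cpus * periodMicros) + " " + periodMicros;
    }

    // memory.max is a plain byte count.
    static String memoryMax(long gib) {
        return Long.toString(gib * 1024L * 1024L * 1024L);
    }

    // Applying them (needs root; writes to a pre-created group directory).
    static void apply(long sidecarPid) throws Exception {
        Path group = Path.of("/sys/fs/cgroup/sidecar"); // hypothetical group
        Files.writeString(group.resolve("cpu.max"), cpuMax(2.0, 100_000));
        Files.writeString(group.resolve("memory.max"), memoryMax(4));
        Files.writeString(group.resolve("cgroup.procs"), Long.toString(sidecarPid));
    }
}
```

The kernel then enforces the cap on the physical resource itself, which is the precision argument above.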



>
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
>
>
> Not sure that is really very complex, running an http service is a pretty
> easy?  We already have netty in use to instantiate one from.
> I worry more about the complexity of having the matching schema for a set
> of sstables being read.  The complexity of new sstable versions/formats
> being introduced.  The complexity of having up to date data from memtables
> being considered by this API without having to flush before every query of
> it.  The complexity of dealing with the new memtable API introduced in
> CEP-11.  The complexity of coordinating compaction/streaming adding and
> removing files with these APIs reading them.  There are a lot of edge cases
> to consider for this external access to sstables that the main process
> considers itself the “owner” of.
>
> All of this is not to say that I think separating things out into other
> processes/services is bad.  But I think we need to be very careful with how
> we do it, or end users will end up running into all the sharp edges and the
> feature will fail.
>
> -Jeremiah
>
> On Mar 24, 2023, at 8:15 PM, Yifan Cai  wrote:
>
> Hi Jeremiah,
>
> There are good reasons to not have these inside Cassandra. Consider the
> following.
> - Resources isolation. Having the said service running within the same JVM
> may negatively impact Cassandra storage's performance. It could be more
> beneficial to have them in Sidecar, which offers strong resource isolation
> guarantees.
> - Availability. If the Cassandra cluster is being bounced, using sidecar
> would not affect the SBR/SBW functionality, e.g. SBR can still read
> SSTables via sidecar endpoints.
> - Compatibility. Sidecar provides stable REST-based APIs, such as
> uploading SSTables endpoint, which would remain compatible with different
> versions of Cassandra. The current implementation supports versions 3.0 and
> 4.0.
> - Complexity. Considering the existence of the Sidecar project, it would
> be less complex to avoid adding another (http?) service in Cassandra.
> - Release velocity. Sidecar, as an independent project, can have a quicker
> release cycle from Cassandra.
> - The features in sidecar are mostly implemented based on various existing
> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>
> Regarding authentication and authorization
> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold
> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>
> - Yifan
>
> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer  wrote:
>
>> I agree that the analytics library will need to support vnodes. To be
>> clear, there’s nothing preventing the solution from working with vnodes
>> right now, and no assumptions about a 1:1 topology between a token and a
>> node. However, we don’t, today, have the ability to test vnode support
>> end-to-end. We are working towards that, however, and should be able to
>> remove the caveat from the released analytics library once we can properly
>> test vnode support.
>> If it helps, I can update the CEP to say something more like “Caveat:
>> Currently untested with vnodes - work is ongoing to remove this limitation”
>> if that helps?
>>
>> Doug
>>
>> > On Mar 24, 2023, at 11:43 AM, Brandon Williams 
>> wrote:
>> >
>> > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan
>> >  wrote:
>> >>
>> >> I have concerns with the majority of this being in the sidecar and not
>> in the database itself.  I think it would make sense for the serv

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan


>> Given the sidecar is running on the same node as the main C* process, the 
>> only real resource isolation you have is in heap/GC?  CPU/Memory/IO are all 
>> still shared between the main C* process and the side car, and coordinating 
>> those across processes is harder than coordinating them within a single 
>> process. For example if we wanted to have the compaction throughput, 
>> streaming throughput, and analytics read throughput all tied back to a 
>> single disk IO cap, that is harder with an external process.
> 
> Relatively trivial, for CPU and memory, to run them in different 
> containers/cgroups/etc, so you can put an exact cpu/memory limit on the 
> sidecar. That's different from a jmx rate limiter/throttle, but (arguably) 
> more precise, because it actually limits the underlying physical resource 
> instead of a proxy for it in a config setting. 
> 

If we want to bring cgroups/containers/etc into the default deployment 
mechanisms of C*, great.  I am all for dividing it up into micro services, 
provided we solve all the problems I listed in the complexity section.

I am actually all for dividing C* up into multiple micro services, but the 
project needs to buy in to containers as the default mechanism for running it 
for that to be viable in my mind.

>  
>> 
>>> - Complexity. Considering the existence of the Sidecar project, it would be 
>>> less complex to avoid adding another (http?) service in Cassandra.
>> 
>> Not sure that is really very complex, running an http service is a pretty 
>> easy?  We already have netty in use to instantiate one from.
>> I worry more about the complexity of having the matching schema for a set of 
>> sstables being read.  The complexity of new sstable versions/formats being 
>> introduced.  The complexity of having up to date data from memtables being 
>> considered by this API without having to flush before every query of it.  
>> The complexity of dealing with the new memtable API introduced in CEP-11.  
>> The complexity of coordinating compaction/streaming adding and removing 
>> files with these APIs reading them.  There are a lot of edge cases to 
>> consider for this external access to sstables that the main process 
>> considers itself the “owner” of.
>> 
>> All of this is not to say that I think separating things out into other 
>> processes/services is bad.  But I think we need to be very careful with how 
>> we do it, or end users will end up running into all the sharp edges and the 
>> feature will fail.
>> 
>> -Jeremiah
>> 
>>> On Mar 24, 2023, at 8:15 PM, Yifan Cai  wrote:
>>> 
>>> Hi Jeremiah, 
>>> 
>>> There are good reasons to not have these inside Cassandra. Consider the 
>>> following.
>>> - Resources isolation. Having the said service running within the same JVM 
>>> may negatively impact Cassandra storage's performance. It could be more 
>>> beneficial to have them in Sidecar, which offers strong resource isolation 
>>> guarantees.
>>> - Availability. If the Cassandra cluster is being bounced, using sidecar 
>>> would not affect the SBR/SBW functionality, e.g. SBR can still read 
>>> SSTables via sidecar endpoints. 
>>> - Compatibility. Sidecar provides stable REST-based APIs, such as uploading 
>>> SSTables endpoint, which would remain compatible with different versions of 
>>> Cassandra. The current implementation supports versions 3.0 and 4.0.
>>> - Complexity. Considering the existence of the Sidecar project, it would be 
>>> less complex to avoid adding another (http?) service in Cassandra.
>>> - Release velocity. Sidecar, as an independent project, can have a quicker 
>>> release cycle from Cassandra. 
>>> - The features in sidecar are mostly implemented based on various existing 
>>> tools/APIs exposed from Cassandra, e.g. ring, commit sstable, snapshot, etc.
>>> 
>>> Regarding authentication and authorization
>>> - We will add it as a follow-on CEP in Sidecar, but we don't want to hold 
>>> up this CEP. It would be a feature that benefits all Sidecar endpoints.
>>> 
>>> - Yifan
>>> 
>>> On Fri, Mar 24, 2023 at 2:43 PM Doug Rohrer  wrote:
 I agree that the analytics library will need to support vnodes. To be 
 clear, there’s nothing preventing the solution from working with vnodes 
 right now, and no assumptions about a 1:1 topology between a token and a 
 node. However, we don’t, today, have the ability to test vnode support 
 end-to-end. We are working towards that, however, and should be able to 
 remove the caveat from the released analytics library once we can properly 
 test vnode support.
 If it helps, I can update the CEP to say something more like “Caveat: 
 Currently untested with vnodes - work is ongoing to remove this 
 limitation” if that helps?
 
 Doug
 
 > On Mar 24, 2023, at 11:43 AM, Brandon Williams  wrote:
 > 
 > On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
One of the explicit goals of making an official sidecar project was to
make it something the project does not break compatibility with. One of
the main issues the third-party sidecars (that handle distributed
control, backup, repair, etc ...) have is that they break constantly,
because C* constantly breaks the control interfaces (JMX and config
files in particular). If it helps with the mental model, maybe think of
the Cassandra sidecar as part of the Cassandra distribution, and we try
not to break the distribution? Just like we can't break CQL without
breaking the CQL client ecosystem, we hopefully don't break the control
interfaces of the sidecar either.

On Tue, Mar 28, 2023 at 10:30 AM Jeremiah D Jordan
 wrote:
>
> - Resources isolation. Having the said service running within the same JVM 
> may negatively impact Cassandra storage's performance. It could be more 
> beneficial to have them in Sidecar, which offers strong resource isolation 
> guarantees.
>
> How does having this in a side car change the impact on “storage 
> performance”?  The side car reading sstables will have the same impact on 
> storage IO as the main process reading sstables.  Given the sidecar is 
> running on the same node as the main C* process, the only real resource 
> isolation you have is in heap/GC?  CPU/Memory/IO are all still shared between 
> the main C* process and the side car, and coordinating those across processes 
> is harder than coordinating them within a single process.  For example if we 
> wanted to have the compaction throughput, streaming throughput, and analytics 
> read throughput all tied back to a single disk IO cap, that is harder with an 
> external process.

I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly. In addition to that, having
this in a separate process gives us access to easy-to-use OS level
protections over CPU time, memory, network, and disk via cgroups; as
well as taking advantage of the existing isolation techniques kernels
already offer to protect processes from each other e.g. CPU schedulers
like CFS [1], network qdiscs like tc-fq/tc-prio[2, 3], and io
schedulers like kyber/bfq [4].

Mixing latency sensitive point queries with throughput sensitive ones
in the same JVM just seems fraught with peril and I don't buy we will
build the same level of performance isolation that the kernel has.
Note you do not need containers to do this, the kernel by default uses
these isolation mechanisms to enforce fairness to resources, cgroups
just make it better (and can be used regardless of containerization).
This was the thinking behind backup/restore, repair, bulk operations,
etc ... living in a separate process.

As has been mentioned elsewhere, being able to run that workload on
different physical machines is even better for isolation, and I could
totally see a wonderful architecture in the future where you have the
sidecar doing incremental backups from source nodes and restores every
~10 minutes to the "analytics" nodes where spark bulk readers are
pointed. For isolation the best would be a separate process on a
separate machine, followed by a separate process on the same machine,
followed by a separate thread on the same machine (historically what
C* does) ... now that's not to say we need to go straight to the best,
but we probably shouldn't do the worst thing?

-Joey

[1] https://man7.org/linux/man-pages/man7/sched.7.html
[2] https://man7.org/linux/man-pages/man8/tc-fq.8.html
[3] https://man7.org/linux/man-pages/man8/tc-prio.8.html
[4] https://docs.kernel.org/block/index.html


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
> If we want to bring groups/containers/etc into the default deployment 
> mechanisms of C*, great.  I am all for dividing it up into micro services 
> given we solve all the problems I listed in the complexity section.
>
> I am actually all for dividing C* up into multiple micro services, but the 
> project needs to buy in to containers as the default mechanism for running it 
> for that to be viable in my mind.

I was under the impression that with CEP-1 the project did buy into
the direction of moving the workloads that are non-latency sensitive
out of the main process? At the time of the discussion folks mentioned
repair, bulk workloads, backup, restore, compaction etc ... as all
possible things we would like to extract over time to the sidecar.

I don't think we want to go full on micro services, with like 12
processes all handling one thing, but 2 seems like a good step? One
for latency sensitive requests (reads/writes - the current process),
and one for non latency sensitive requests (control plane, bulk work,
etc ... - the sidecar).

-Joey


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
Fwiw I’m sceptical of the performance angle long term. You can do a lot more 
to control QoS when you understand what each query is doing and what your 
SLOs are. You can also more efficiently apportion your resources (not leaving 
any lying fallow to ensure they’re free later).

But, we’re a long way from that.

My personal view of the sidecar is to offer these sorts of facilities more 
rapidly than we might in Cassandra proper, but that we might eventually (when 
mature enough and Cassandra is ready for it) bring them in process.

Certainly, managing consistency (repair etc) and serving bulk operations should 
*long term* live in Cassandra IMO.

But that isn’t the state of the world today, so I support a separate process.

Though, I am nervous about the issues Jeremiah raises - we need to ensure we 
are not tightly coupling things and creating new problems. Managing other 
processes reliably and promptly seeing sstable changes and memtable flushes 
isn’t something that would be pretty, and we should probably offer weak 
guarantees about what’s visible when - ideally the sidecar would rely on file 
system watch notifications, or perhaps at most some fsync like functionality 
for flushing memtables.

> On 28 Mar 2023, at 16:09, Joseph Lynch  wrote:
> 
> 
>> 
>> If we want to bring groups/containers/etc into the default deployment 
>> mechanisms of C*, great.  I am all for dividing it up into micro services 
>> given we solve all the problems I listed in the complexity section.
>> 
>> I am actually all for dividing C* up into multiple micro services, but the 
>> project needs to buy in to containers as the default mechanism for running 
>> it for that to be viable in my mind.
> 
> I was under the impression that with CEP-1 the project did buy into
> the direction of moving the workloads that are non-latency sensitive
> out of the main process? At the time of the discussion folks mentioned
> repair, bulk workloads, backup, restore, compaction etc ... as all
> possible things we would like to extract over time to the sidecar.
> 
> I don't think we want to go full on micro services, with like 12
> processes all handling one thing, but 2 seems like a good step? One
> for latency sensitive requests (reads/writes - the current process),
> and one for non latency sensitive requests (control plane, bulk work,
> etc ... - the sidecar).
> 
> -Joey



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> One of the explicit goals of making an official sidecar project was to
> try to make it something the project does not break compatibility with
> as one of the main issues the third-party sidecars (that handle
> distributed control, backup, repair, etc ...) have is they break
> constantly because C* breaks the control interfaces (JMX and config
> files in particular) constantly. If it helps with the mental model,
> maybe think of the Cassandra sidecar as part of the Cassandra
> distribution and we try not to break the distribution? Just like we
> can't break CQL and break the CQL client ecosystem, we hopefully don't
> break control interfaces of the sidecar either.

Do we have tests which enforce this?  I agree we said we won’t break stuff, 
but agreeing to something and actually doing it are different things.  We have 
for years said “we won’t break interface X in a patch release”, but we always 
end up doing it if there is no test enforcing the contract with a comment 
saying not to break it.  Without such guards, a contributor who has no clue 
about “what we said” changes it, and the reviewer misses it (and possibly also 
doesn’t know/remember “what we said” because we said it 3 years back)…

This is not impossible; we just need to make sure that we are pro-active about 
marking such things.  Maybe the answer is “running the sidecar integration 
tests” as part of C* patch CI?
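One hedged sketch of what such a guard could look like (the endpoint field names here are hypothetical stand-ins, not Sidecar's actual API): pin the published response shape in a test so an accidental rename fails CI with a message, rather than relying on reviewers remembering "what we said".

```java
import java.util.Set;

// Sketch of a contract-pinning test; the field names are hypothetical,
// not the real Sidecar snapshot API.
final class SidecarContractCheck {
    // DO NOT rename/remove without a deprecation cycle: external clients
    // (e.g. the analytics library) depend on these exact names.
    static final Set<String> SNAPSHOT_RESPONSE_FIELDS =
            Set.of("snapshotName", "keyspace", "table", "sstables");

    // Adding new optional fields is compatible; dropping or renaming a
    // pinned field is a break.
    static boolean isCompatible(Set<String> actualFields) {
        return actualFields.containsAll(SNAPSHOT_RESPONSE_FIELDS);
    }
}
```

Run against the actual serialized response in CI (C*'s or the sidecar's), the failing assertion carries the contract, not tribal memory.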

> In addition to that, having
> this in a separate process gives us access to easy-to-use OS level
> protections over CPU time, memory, network, and disk via cgroups; as
> well as taking advantage of the existing isolation techniques kernels
> already offer to protect processes from each other e.g. CPU schedulers
> like CFS [1], network qdiscs like tc-fq/tc-prio[2, 3], and io
> schedulers like kyber/bfq [4].

How do we get this tuning to be part of the default install for all users of C* 
+ sidecar?



Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Derek Chen-Becker
On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch  wrote:
...

I think we might be underselling how valuable JVM isolation is,
> especially for analytics queries that are going to pass the entire
> dataset through heap somewhat constantly.
>

Big +1 here. The JVM simply does not have significant granularity of
control for resource utilization, but this is explicitly a feature of
separate processes. Add in being able to separate GC domains and you can
avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.

Cheers,

Derek


-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
I disagree with the first claim, as the process has all the information it 
chooses to utilise about which resources it’s using and what it’s using those 
resources for.

The inability to isolate GC domains is something we cannot address, but also 
probably not a problem if we were doing everything with memory management as 
well as we could be.

But, not worth derailing this thread for. Today we do very little well on 
this front within the process, and a separate process is well justified given 
the state of play.

> On 28 Mar 2023, at 16:38, Derek Chen-Becker  wrote:
> 
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch  wrote:
> ...
> 
>> I think we might be underselling how valuable JVM isolation is,
>> especially for analytics queries that are going to pass the entire
>> dataset through heap somewhat constantly.
> 
> Big +1 here. The JVM simply does not have significant granularity of
> control for resource utilization, but this is explicitly a feature of
> separate processes. Add in being able to separate GC domains and you can
> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
> 
> Cheers,
> 
> Derek


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Yifan Cai
A lot of great discussions!

On the sidecar front, especially the role the sidecar plays in this CEP, I
feel there might be some confusion. Once the code is published, we should
have clarity.
Sidecar does not read sstables nor do any coordination for analytics
queries. It is local to the companion Cassandra instance. For bulk read, it
takes snapshots and streams sstables to Spark workers to read. For bulk
write, it imports the sstables uploaded from Spark workers. All commands
are existing JMX/nodetool functionalities from Cassandra; Sidecar adds an
HTTP interface on top of them. This may be an oversimplified description:
the complex computation is performed only in the Spark clusters.
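The data path just described can be written down as ordered steps (a purely descriptive sketch of the flow as summarized above; the class and method names are mine, not an API):

```java
import java.util.List;

// Descriptive model of the bulk read/write data paths sketched above.
final class BulkAnalyticsFlow {
    static List<String> bulkReadSteps() {
        return List.of(
            "spark worker -> sidecar: request snapshot of the token range's sstables",
            "sidecar -> cassandra: take snapshot (existing nodetool/JMX functionality)",
            "sidecar -> spark worker: stream snapshot sstable components over HTTP",
            "spark worker: deserialize sstables locally and emit rows to the job");
    }

    static List<String> bulkWriteSteps() {
        return List.of(
            "spark worker: sort rows and write sstables locally",
            "spark worker -> sidecar: upload sstable components",
            "sidecar -> cassandra: import uploaded sstables (existing import functionality)");
    }
}
```

Note the Cassandra process only ever performs operations it already exposes (snapshot, import); all heavy computation stays in the Spark cluster.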

In the long run, Cassandra might evolve into a database that does both OLTP
and OLAP. (Not what this thread aims for)
At the current stage, Spark is very suited for analytic purposes.

On Tue, Mar 28, 2023 at 9:06 AM Benedict  wrote:

> I disagree with the first claim, as the process has all the information it
> chooses to utilise about which resources it’s using and what it’s using
> those resources for.
>
> The inability to isolate GC domains is something we cannot address, but
> also probably not a problem if we were doing everything with memory
> management as well as we could be.
>
> But, not worth derailing this thread for. Today we do very little well on
> this front within the process, and a separate process is well justified
> given the state of play.
>
> On 28 Mar 2023, at 16:38, Derek Chen-Becker  wrote:
>
> 
>
> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch 
> wrote:
> ...
>
> I think we might be underselling how valuable JVM isolation is,
>> especially for analytics queries that are going to pass the entire
>> dataset through heap somewhat constantly.
>>
>
> Big +1 here. The JVM simply does not have significant granularity of
> control for resource utilization, but this is explicitly a feature of
> separate processes. Add in being able to separate GC domains and you can
> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>
> Cheers,
>
> Derek
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread J. D. Jordan
Maybe some data flow diagrams could be added to the CEP showing some example 
operations for read/write?

> On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:
> 
> A lot of great discussions!
> 
> On the sidecar front, especially what the role sidecar plays in terms of
> this CEP, I feel there might be some confusion. Once the code is published,
> we should have clarity.
> Sidecar does not read sstables nor do any coordination for analytics
> queries. It is local to the companion Cassandra instance. For bulk read, it
> takes snapshots and streams sstables to spark workers to read. For bulk
> write, it imports the sstables uploaded from spark workers. All commands
> are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the
> http interface to them. It might be an over simplified description. The
> complex computation is performed in spark clusters only.
> 
> In the long run, Cassandra might evolve into a database that does both OLTP
> and OLAP. (Not what this thread aims for)
> At the current stage, Spark is very suited for analytic purposes.



JDK 20 is now GA, JDK 21 Early-Access builds, and 2 important heads-up!

2023-03-28 Thread David Delabassee

Welcome to the latest OpenJDK Quality Outreach update!

Last week was busy as we released both Java 20 and JavaFX 20. To 
celebrate the launch, we hosted a live event focused on Java 20: Level 
Up Java Day. All the session recordings will be made available shortly 
on the YouTube Java channel.


Some recent events have shown us that it is useful to conduct tests 
using the latest early-access OpenJDK builds. This benefits the OpenJDK 
codebase as well as your own. Sometimes a failure is due to an actual 
regression introduced in OpenJDK; in that case, we obviously want to 
hear about it while we can still address it. But sometimes a failure is 
due to a subtle behaviour change... that works as expected. Regardless 
of whether it's a bug or a test that is now broken by a behaviour 
change, we want to hear from you. In the latter case, it might also 
mean that we should communicate more about those changes, even when 
they seem subtle. On that note, please make sure to check both Heads-Up 
items below: "Support for Unicode CLDR Version 42" and "New network 
interface names on Windows".


So please, let us know if you observe anything using the latest 
early-access builds of JDK 21.



## Heads-Up - JDK 20 - Support for Unicode CLDR Version 42

The JDK's locale data is based on the Unicode Consortium's Unicode 
Common Locale Data Repository (CLDR). As mentioned in the December 2022 
Quality Outreach newsletter [1], JDK 20 upgraded CLDR [2] to version 42 
[3], which was released in October 2022. This version includes a "more 
sophisticated handling of spaces" [4] that replaces regular spaces with 
non-breaking spaces (NBSP / `\u00A0`) or narrow non-breaking spaces 
(NNBSP / `\u202F`):

- in time formats between `a` and time
- in unit formats between {0} and unit
- in Cyrillic date formats before year marker such as `г`

Other noticeable changes include:
* " at " is no longer used in the standard date/time format [5]
* fix first day of week info for China (CN) [6]
* Japanese: Support numbers up to 京 [7]

As a consequence, production and test code that produces or parses 
locale-dependent strings like formatted dates and times may change 
behavior in potentially breaking ways (e.g. when a handcrafted datetime 
string with a regular space is parsed, but the parser now expects an 
NBSP or NNBSP). Issues can be hard to analyze because expected and 
actual strings look very similar or even identical in various text 
representations. To detect and fix these issues, make sure to use a text 
editor that displays different kinds of spaces differently.
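Besides using a suitable text editor, such mismatches can also be neutralized in code by normalizing the space variants before comparing or parsing. The sketch below is an illustration, not part of the newsletter: the choice of `Locale.US` and the short time style are assumptions made for the example. The same normalization makes the formatted output comparable across JDKs before and after the CLDR 42 upgrade.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class SpaceNormalization {

    // Replace NBSP (\u00A0) and NNBSP (\u202F) with a regular space so that
    // strings formatted under CLDR 42 compare equal to handcrafted ones.
    static String normalizeSpaces(String s) {
        return s.replace('\u00A0', ' ').replace('\u202F', ' ');
    }

    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter
                .ofLocalizedTime(FormatStyle.SHORT)
                .withLocale(Locale.US);  // illustrative locale choice
        String formatted = fmt.format(LocalTime.of(14, 30));
        // On JDK 20+ (CLDR 42) the separator before "PM" is an NNBSP;
        // on earlier JDKs it is a regular space. Normalizing hides the
        // difference, so this prints true on either.
        System.out.println(normalizeSpaces(formatted).equals("2:30 PM"));
    }
}
```

The reverse direction works too: normalizing a formatted string before handing it to a parser built around regular spaces avoids the hard-to-spot `DateTimeParseException`s described above.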


If the required fixes can't be implemented when upgrading to JDK 20, 
consider using the JVM argument `-Djava.locale.providers=COMPAT` to use 
the legacy locale data. Note that this limits some locale-related 
functionality, so treat it as a temporary workaround, not a proper 
solution. Moreover, the `COMPAT` option will eventually be removed in a 
future release.


It is also important to keep in mind that this kind of locale data 
evolves regularly, so programs that parse or compose locale-dependent 
strings themselves should be rechecked with each JDK release.


[1] 
https://mail.openjdk.org/pipermail/quality-discuss/2022-December/001100.html

[2] https://bugs.openjdk.org/browse/JDK-8284840
[3] https://cldr.unicode.org/index/downloads/cldr-42
[4] https://unicode-org.atlassian.net/browse/CLDR-14032
[5] https://unicode-org.atlassian.net/browse/CLDR-14831
[6] https://unicode-org.atlassian.net/browse/CLDR-11510
[7] https://unicode-org.atlassian.net/browse/CLDR-15966


## Heads-Up - JDK 21 - New network interface names on Windows

The names that the JDK assigns to network interfaces on Windows are 
changing in JDK 21 [8].


The JDK historically synthesized names for network interfaces on 
Windows. This has changed to use the names assigned by the Windows 
operating system. For example, the JDK may have historically assigned a 
name such as "eth0" for an ethernet interface and "lo" for the loopback. 
The equivalent names that Windows assigns may be names such as 
"ethernet_32768" and "loopback_0".


This change may impact code that looks up network interfaces with the 
`NetworkInterface.getByName(String name)` method. It may also be 
surprising to code that enumerates all network interfaces with the 
`NetworkInterface.networkInterfaces()` or 
`NetworkInterface.getNetworkInterfaces()` methods, as the names of the 
network interfaces will look different from previous releases. Depending 
on configuration, it is possible that enumerating all network interfaces 
will include interfaces that weren't previously enumerated because they 
didn't have an Internet Protocol address assigned. The display name 
returned by `NetworkInterface::getDisplayName` has not changed, so it 
should still facilitate identifying network interfaces when using 
Windows native tools.
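A quick way to audit what your code will see is to print both identifiers side by side, since `getName()` changes in JDK 21 on Windows while `getDisplayName()` does not. This is a minimal sketch using only the standard `java.net.NetworkInterface` API; the exact names printed depend on the machine and JDK version it runs on, so no particular output is assumed.

```java
import java.net.NetworkInterface;
import java.net.SocketException;

public class ListInterfaces {
    public static void main(String[] args) throws SocketException {
        // Print the JDK-assigned name next to the display name. On Windows
        // with JDK 21 getName() now returns the OS-assigned name (e.g.
        // "ethernet_32768" instead of "eth0"); getDisplayName() is unchanged.
        NetworkInterface.networkInterfaces().forEach(ni ->
                System.out.println(ni.getName() + " -> " + ni.getDisplayName()));
    }
}
```

Running this on both your current JDK and a JDK 21 early-access build shows exactly which hard-coded names (for example in `getByName(...)` lookups) will stop matching after the upgrade.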


[8] https://bugs.openjdk.org/browse/JDK-8303898


## JDK 20