Thank you Josh. Glad to see that our CI is getting more attention. As
no Cassandra feature will be there if we don't do proper testing,
right? Important as all the suites and tools we have. With that being
said I am glad to see Derek is volunteering to spend more time on this
as I believe this is always the main issue - ideas and willingness for
improvements are there but people are swamped with other things and we
lack manpower for something so important.
1. Tune parallelism levels per job (David and Ekaterina have insight
on this)
Question for David, do you tune only parallelism and use only xlarge?
If yes, we need to talk :D
Reading what Stefan shared as experience/feedback, I think we can
revise the current config and move to a more reasonable config that
can work for most people but there will always be someone who needs
something a bit different. With that said maybe we can add to our
scripts/menu an option to change from command line through parameters
parallelism and/or resources? For those who want further
customization? I see this as a separate additional ticket probably. In
that case we might probably skip the use of circleci config process
for that part of the menu. (but not for adding new jobs and meaningful
permanent updates)
2. Rename jobs on circle to be more indicative of their function
+0 I am probably super used to the current names but Derek brought it
to my attention that there are names which are confusing for someone
new to the cassandra world. With that said I would say we can do this
in a separate ticket, mass update.
3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see:
https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595
<https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595>)
I am against unifying per JDK workflows but I am all in for unifying
the pre-commit/separate workflows and getting back to 2 workflows as
suggested by Andres. If we think of how that will look in the UI I
think it will be super hard to follow. (the case of having unified
both jdks in one workflow)
4. Update documentation w/guidance on using circle,
.circleci/generate.sh examples, etc 4a. How to commit:
https://cassandra.apache.org/_/development/how_to_commit.html4b.
Testing: https://cassandra.apache.org/_/development/testing.html
I will open a ticket and post the guide I was working on. But it also
doesn't make sense to fully update it now if we are going to
significantly change the workflow soon. Until then I believe Andres
has updated the circleci readme and provided good usage examples.
5. Flag on generate.sh to allow auto-run on push
Auto-run on push? Can you elaborate? Like to start your whole workflow
directly without using the UI? There is an approval step in the config
file, we can probably add some flags to change pre-commit workflows to
start build without approval when we use those mentioned flags. But
having by default everything to start on push is an overkill in my
opinion. People will be forgetting it and pushing builds for nothing
on WIP branches. Talking from experience :D
6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for
all suites, default to -m, deprecate -h?) <- may not be a code-change
issue and instead be a documentation issue
If we agree except the free tier config file we want one more
reasonable config which doesn't bump resources to the max without a
need but provides balanced use of resources - absolutely. -h was kept
as there was understanding there are people in the community actively
using it.
7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE]
temporary circleci config" as the commit message
+0
I also wanted to address a few of the points David made.
"Ekaterina is probably dealing with with her JDK17 work" - if you mean
to ensure we have all jobs for all jdks properly, yes. That was my
plan. Until Derek was so good at suggesting to work on adding missing
jobs in CircleCI now so my work on that will be a bit less for certain
things. This is an effort related to the recent changes in our release
document. Ticket CASSANDRA-17950
<https://issues.apache.org/jira/browse/CASSANDRA-17950> :-) I am
helping with mentoring/reviews. Everyone is welcome to join the party.
"1) resource_class used is not because its needed… in HIGHER file we
default to xlarge but only python upgrade tests need that… reported in
CASSANDRA-17600" - one of the reasons. we had the MIDRES in the first
place as I mentioned in my other email the other day. [1]
"our current patching allows MID/HIGHER to drift as changes need new
patches else patching may do the wrong thing… reported in
CASSANDRA-17600" - I'd say the patching is annoying sometimes, indeed
but with/without the patching any changes to config mean we need to
check it by reading through diff and pushing a run to CI before
commit. With that said I am all in for automation but this will not
change the fact we need to push test runs and verify the changes did
not hurt us in a way. Same as testing patches on all branches, running
all needed tests and confirming no regressions. Nothing new or
changing here IMHO
"CI is a combinatorial problem, we need to run all jobs for all JDKs,
vnode on/of, cdc on/off, compression on/of, etc…. But this is
currently controlled and fleshed out by humans who want to add new
jobs.. we should move away from maintaining .circleci/config-2_1.yml
and instead auto-generate it. Simple example of this problem is jdk11
support… we run a subset of tests on jdk11 and say its supported… will
jdk17 have the same issue? Will it be even less tests? Why does the
burden lie on everyone to “do the right thing” when all they want is a
simple job?"
Controlled and fleshed by humans it will always be but I agree we
need to automate the steps to make it easier for people to add most of
the combinations and not to skip any because it is too much work. We
will always need a human to decide which jdks, cdc, vnodes, etc. With
that said I shared your ticket/patch with Derek as he had similar
thoughts, we need to get back to that one at some point.
(CASSANDRA-17600) Thanks for working on that!
"why do we require people to install “circleci” command to contribute?
If you rename .circleci/config-2_1.yml to .circleci/config.yml then CI
will work just fine… we don’t need to call “circleci config process”
every time we touch circle config…. Also, seems that w/e someone new
to circle config (but not cassandra) touch it they always mutate
LOW/MID/HIGH and not .circleci/config-2_1.yml… so I keep going back to
fix .circleci/config-2_1.yml…."
I'd say config-2_1.yml is mainly for those who will make permanent
changes to config (like adding/removing jobs). config-2_1.yml is
actually created as per the CircleCI automation rules - 1st we add and
reuse executors, parameters and commands but I think we can reduce
further things if we add even more parameters probably. I have to look
more into the current file. I am sure there is room for further
improvement. 2nd circleci cli tool can verify the config file for
errors and helps with debugging before we push to CircleCI. There is
circleci config validate. If we make changes manually we are on our
own to verify the long yml and also deal with duplication in
config.yml. My concern is that things that need to be almost identical
might start to diverge easier. Though I made my suggestion in point 1
for what cases probably we can add menu options that potentially will
not require using circleci cli tool. There might be more cases though.
Currently config-2_1.yml is 2256 lines while config.yml is 5793 lines.
I'd say lots of duplication there
[1] https://lists.apache.org/thread/htxoh60zt8zxc4vgxj9zh71trk0zxwhl
On Wed, 19 Oct 2022 at 17:20, David Capwell <dcapw...@apple.com> wrote:
1. Tune parallelism levels per job (David and Ekaterina have
insight on this)
+1 to this! I drastically lower our parallelism as
only python-dtest upgrade tests need many resources…
What I do for JVM unit/jvm-dtest is the following
def java_parallelism(src_dir, kind, num_file_in_worker,
include = lambda a, b: True):
d = os.path.join(src_dir, 'test', kind)
num_files = 0
for root, dirs, files in os.walk(d):
for f in files:
if f.endswith('Test.java') and
include(os.path.join(root, f), f):
num_files += 1
return math.floor(num_files / num_file_in_worker)
def fix_parallelism(args, contents):
jobs = contents['jobs']
unit_parallelism =
java_parallelism(args.src, 'unit', 20)
jvm_dtest_parallelism =
java_parallelism(args.src, 'distributed', 4, lambda full,
name: 'upgrade' not in full)
jvm_dtest_upgrade_parallelism =
java_parallelism(args.src, 'distributed', 2, lambda full,
name: 'upgrade' in full)
TL;DR - I find all test files we are going to run, and based off a
pre-defined variable that says “idea” number of files per
worker, I then calculate how many workers we need. So unit tests
are num_files / 20 ~= 35 workers. Can I be “smarter” by knowing
which files have higher cost? Sure… but the “perfect” and
the “average” are too similar that it wasn’t worth it...
2. Rename jobs on circle to be more indicative of their function
Have an example? I am not against, I just don’t know the problem
you are referring to.
3. Unify j8 and j11 workflow pairs into single
Fine by me, but we need to keep in mind j17 is coming. Also, most
developmental CI builds don’t really need to run cross every JDK,
so we need some way to disable different JDKs…
When I am testing out a patch I tend to run the following (my
script): "circleci-enable.py --no-jdk11”; this will remove the
JDK11 builds. I know I am going to run them pre-merge so I know
its safe for me.
5. Flag on generate.sh to allow auto-run on push
I really hate that we don’t do this by default… I still to this
day strongly feel you should *opt-out* of CI rather than opt-in…
seen several commits get merged as they didn’t see a error in
circle… because circle didn’t do any work…. Yes, I am fully aware
that I am beating a dead horse…
TL;DR +1
7. Consider flag on generate.sh to run and commit with "[DO NOT
MERGE] temporary circleci config" as the commit message
+0 from me… I have seen people not realize you have to commit
after typing “higher” (wrapper around my personal
circleci-enable.py script to apply my defaults to the build) but
not an issue I have… so I don’t mind if people want the tool to
integrate with git…
With all that said, I do feel there is more, and something I feel
Ekaterina is probably dealing with with her JDK17 work…
1) resource_class used is not because its needed… in HIGHER file
we default to xlarge but only python upgrade tests need that…
reported in CASSANDRA-17600
2) our current patching allows MID/HIGHER to drift as changes need
new patches else patching may do the wrong thing… reported in
CASSANDRA-17600
3) CI is a combinatorial problem, we need to run all jobs for all
JDKs, vnode on/of, cdc on/off, compression on/of, etc…. But this
is currently controlled and fleshed out by humans who want to add
new jobs.. we should move away from
maintaining .circleci/config-2_1.yml and instead auto-generate
it. Simple example of this problem is jdk11 support… we run a
subset of tests on jdk11 and say its supported… will jdk17 have
the same issue? Will it be even less tests? Why does the burden
lie on everyone to “do the right thing” when all they want is a
simple job?
4) why do we require people to install “circleci” command to
contribute? If you rename .circleci/config-2_1.yml
to .circleci/config.yml then CI will work just fine… we don’t need
to call “circleci config process” every time we touch circle
config…. Also, seems that w/e someone new to circle config (but
not cassandra) touch it they always mutate LOW/MID/HIGH and
not .circleci/config-2_1.yml… so I keep going back to
fix .circleci/config-2_1.yml….
On Oct 19, 2022, at 1:32 PM, Miklosovic, Stefan
<stefan.mikloso...@netapp.com> wrote:
1) would be nice to have. The first thing I do is that I change
the parallelism to 20. None of committed config.yaml's are
appropriate for our company CircleCI so I have to tweak this
manually. I think we can not run more that 25/30 containers in
parallel, something like that. HIGHRES has 100 and MIDRES has
some jobs having parallelism equal to 50 or so so that is not
good either. I would be happy with simple way to modify default
config.yaml on parallelism. I use "sed" to change parallelism: 4
to parallelism: 20 and leave parallelism: 1 where it does not
make sense to increase it. However I noticed that there is not
"4" set everywhere, some jobs have it set to "1" so I have to
take extra care of these cases (I consider that to be a bug, I
think there are two or three, I do not remember). Once set, I
have that config in "git stash" so I just apply it every time I
need it.
5) would be nice too.
7) is nice but not crucial, it takes no time to commit that.
________________________________________
From: Josh McKenzie <jmcken...@apache.org>
Sent: Wednesday, October 19, 2022 21:50
To: dev
Subject: [DISCUSS] Potential circleci config and workflow changes
NetApp Security WARNING: This is an external email. Do not click
links or open attachments unless you recognize the sender and
know the content is safe.
While working w/Andres on CASSANDRA-17939 a variety of things
came up regarding our circleci config and opportunities to
improve it. Figured I'd hit the list up here to see what people's
thoughts are since many of us intersect with these systems daily
and having your workflow disrupted without having a chance to
provide input is bad.
The ideas:
1. Tune parallelism levels per job (David and Ekaterina have
insight on this)
2. Rename jobs on circle to be more indicative of their function
3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see:
https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595
<https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595>)
4. Update documentation w/guidance on using circle,
.circleci/generate.sh examples, etc
4a. How to commit:
https://cassandra.apache.org/_/development/how_to_commit.html
4b. Testing: https://cassandra.apache.org/_/development/testing.html
5. Flag on generate.sh to allow auto-run on push
6. Clean up the -l, -m, -h flags (test and indicate -l
feasibility for all suites, default to -m, deprecate -h?) <- may
not be a code-change issue and instead be a documentation issue
7. Consider flag on generate.sh to run and commit with "[DO NOT
MERGE] temporary circleci config" as the commit message
Curious to see what folks think.
~Josh