Re: [DISCUSS] Potential circleci config and workflow changes

Ekaterina Dimitrova Wed, 19 Oct 2022 19:12:10 -0700

Thank you Josh. Glad to see that our CI is getting more attention. As no
Cassandra feature will be there if we don't do proper testing, right?
Important as all the suites and tools we have. With that being said I am
glad to see Derek is volunteering to spend more time on this as I believe
this is always the main issue - ideas and willingness for improvements are
there but people are swamped with other things and we lack manpower for
something so important.
1. Tune parallelism levels per job (David and Ekaterina have insight on
this)
Question for David, do you tune only parallelism and use only xlarge? If
yes, we need to talk :D

Reading what Stefan shared as experience/feedback, I think we can revise
the current config and move to a more reasonable config that can work for
most people but there will always be someone who needs something a bit
different. With that said maybe we can add to our scripts/menu an option to
change from command line through parameters parallelism and/or resources?
For those who want further customization? I see this as a separate
additional ticket probably. In that case we might probably skip the
use of circleci
config process for that part of the menu. (but not for adding new jobs and
meaningful permanent updates)

2. Rename jobs on circle to be more indicative of their function
+0 I am probably super used to the current names but Derek brought it to my
attention that there are names which are confusing for someone new to the
cassandra world. With that said I would say we can do this in a separate
ticket, mass update.

3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see:
https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595
)
I am against unifying per JDK workflows but I am all in for unifying the
pre-commit/separate workflows and getting back to 2 workflows as suggested
by Andres. If we think of how that will look in the UI I think it will be
super hard to follow. (the case of having unified both jdks in one workflow)
4. Update documentation w/guidance on using circle, .circleci/generate.sh
examples, etc 4a. How to commit:
https://cassandra.apache.org/_/development/how_to_commit.html 4b. Testing:
https://cassandra.apache.org/_/development/testing.html
I will open a ticket and post the guide I was working on. But it also
doesn't make sense to fully update it now if we are going to significantly
change the workflow soon. Until then I believe Andres has updated the
circleci readme and provided good usage examples.
5. Flag on generate.sh to allow auto-run on push
Auto-run on push? Can you elaborate? Like to start your whole workflow
directly without using the UI? There is an approval step in the config
file, we can probably add some flags to change pre-commit workflows to
start build without approval when we use those mentioned flags. But having
by default everything to start on push is an overkill in my opinion. People
will be forgetting it and pushing builds for nothing on WIP branches.
Talking from experience :D

6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all
suites, default to -m, deprecate -h?) <- may not be a code-change issue and
instead be a documentation issue
If we agree except the free tier config file we want one more reasonable
config which doesn't bump resources to the max without a need but provides
balanced use of resources - absolutely. -h was kept as there
was understanding there are people in the community actively using it.
7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE]
temporary circleci config" as the commit message
+0

I also wanted to address a few of the points David made.
"Ekaterina is probably dealing with with her JDK17 work" - if you mean to
ensure we have all jobs for all jdks properly, yes. That was my plan. Until
Derek was so good at suggesting to work on adding missing jobs in CircleCI
now so my work on that will be a bit less for certain things. This is an
effort related to the recent changes in our release document. Ticket
CASSANDRA-17950 <https://issues.apache.org/jira/browse/CASSANDRA-17950> :-)
I am helping with mentoring/reviews. Everyone is welcome to join the party.
"1) resource_class used is not because its needed… in HIGHER file we
default to xlarge but only python upgrade tests need that… reported in
CASSANDRA-17600" - one of the reasons. we had the MIDRES in the first place
as I mentioned in my other email the other day. [1]

"our current patching allows MID/HIGHER to drift as changes need new
patches else patching may do the wrong thing… reported in CASSANDRA-17600"
- I'd say the patching is annoying sometimes, indeed but with/without the
patching any changes to config mean we need to check it by reading through
diff and pushing a run to CI before commit. With that said I am all in for
automation but this will not change the fact we need to push test runs and
verify the changes did not hurt us in a way. Same as testing patches on all
branches, running all needed tests and confirming no regressions. Nothing
new or changing here IMHO

"CI is a combinatorial problem, we need to run all jobs for all JDKs, vnode
on/of, cdc on/off, compression on/of, etc…. But this is currently
controlled and fleshed out by humans who want to add new jobs.. we should
move away from maintaining .circleci/config-2_1.yml and instead
auto-generate it. Simple example of this problem is jdk11 support… we run a
subset of tests on jdk11 and say its supported… will jdk17 have the same
issue? Will it be even less tests? Why does the burden lie on everyone to
“do the right thing” when all they want is a simple job?"
 Controlled and fleshed by humans it will always be but I agree we need to
automate the steps to make it easier for people to add most of the
combinations and not to skip any because it is too much work. We will
always need a human to decide which jdks, cdc, vnodes, etc. With that said
I shared your ticket/patch with Derek as he had similar thoughts, we need
to get back to that one at some point. (CASSANDRA-17600) Thanks for working
on that!

"why do we require people to install “circleci” command to contribute? If
you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will
work just fine… we don’t need to call “circleci config process” every time
we touch circle config…. Also, seems that w/e someone new to circle config
(but not cassandra) touch it they always mutate LOW/MID/HIGH and not
.circleci/config-2_1.yml… so I keep going back to fix
.circleci/config-2_1.yml…."
I'd say config-2_1.yml is mainly for those who will make permanent changes
to config (like adding/removing jobs). config-2_1.yml is actually created
as per the CircleCI automation rules - 1st we add and reuse executors,
parameters and commands but I think we can reduce further things if we add
even more parameters probably. I have to look more into the current file. I
am sure there is room for further improvement. 2nd circleci cli tool can
verify the config file for errors and helps with debugging before we push
to CircleCI. There is circleci config validate. If we make changes manually
we are on our own to verify the long yml and also deal with duplication in
config.yml. My concern is that things that need to be almost identical
might start to diverge easier. Though I made my suggestion in point 1 for
what cases probably we can add menu options that potentially will not
require using circleci cli tool. There might be more cases though.
Currently config-2_1.yml is 2256 lines while config.yml is 5793 lines. I'd
say lots of duplication there

[1] https://lists.apache.org/thread/htxoh60zt8zxc4vgxj9zh71trk0zxwhl

On Wed, 19 Oct 2022 at 17:20, David Capwell <[email protected]> wrote:

> 1. Tune parallelism levels per job (David and Ekaterina have insight on
> this)
>
>
> +1 to this!  I drastically lower our parallelism as only python-dtest
> upgrade tests need many resources…
>
> What I do for JVM unit/jvm-dtest is the following
>
> def java_parallelism(src_dir, kind, num_file_in_worker, include = lambda
> a, b: True):
>     d = os.path.join(src_dir, 'test', kind)
>     num_files = 0
>     for root, dirs, files in os.walk(d):
>         for f in files:
>             if f.endswith('Test.java') and include(os.path.join(root, f),
> f):
>                 num_files += 1
>     return math.floor(num_files / num_file_in_worker)
>
> def fix_parallelism(args, contents):
>     jobs = contents['jobs']
>
>     unit_parallelism                = java_parallelism(args.src, 'unit',
> 20)
>     jvm_dtest_parallelism           = java_parallelism(args.src,
> 'distributed', 4, lambda full, name: 'upgrade' not in full)
>     jvm_dtest_upgrade_parallelism   = java_parallelism(args.src,
> 'distributed', 2, lambda full, name: 'upgrade' in full)
>
> TL;DR - I find all test files we are going to run, and based off a
> pre-defined variable that says “idea” number of files per worker, I then
> calculate how many workers we need.  So unit tests are num_files / 20 ~= 35
> workers.  Can I be “smarter” by knowing which files have higher cost?
> Sure… but the “perfect” and the “average” are too similar that it wasn’t
> worth it...
>
> 2. Rename jobs on circle to be more indicative of their function
>
>
> Have an example?  I am not against, I just don’t know the problem you
> are referring to.
>
> 3. Unify j8 and j11 workflow pairs into single
>
>
> Fine by me, but we need to keep in mind j17 is coming.  Also, most
> developmental CI builds don’t really need to run cross every JDK, so we
> need some way to disable different JDKs…
>
> When I am testing out a patch I tend to run the following (my script):
> "circleci-enable.py --no-jdk11”; this will remove the JDK11 builds.  I
> know I am going to run them pre-merge so I know its safe for me.
>
> 5. Flag on generate.sh to allow auto-run on push
>
>
> I really hate that we don’t do this by default… I still to this day
> strongly feel you should *opt-out* of CI rather than opt-in… seen several
> commits get merged as they didn’t see a error in circle… because circle
> didn’t do any work…. Yes, I am fully aware that I am beating a dead horse…
>
> TL;DR +1
>
> 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE]
> temporary circleci config" as the commit message
>
>
> +0 from me… I have seen people not realize you have to commit after
> typing “higher” (wrapper around my personal circleci-enable.py script to
> apply my defaults to the build) but not an issue I have… so I don’t mind if
> people want the tool to integrate with git…
>
>
> With all that said, I do feel there is more, and something I feel
> Ekaterina is probably dealing with with her JDK17 work…
>
> 1) resource_class used is not because its needed… in HIGHER file we
> default to xlarge but only python upgrade tests need that… reported in
> CASSANDRA-17600
> 2) our current patching allows MID/HIGHER to drift as changes need new
> patches else patching may do the wrong thing… reported in CASSANDRA-17600
> 3) CI is a combinatorial problem, we need to run all jobs for all JDKs,
> vnode on/of, cdc on/off, compression on/of, etc…. But this is currently
> controlled and fleshed out by humans who want to add new jobs..  we should
> move away from maintaining .circleci/config-2_1.yml and instead
> auto-generate it.  Simple example of this problem is jdk11 support… we run
> a subset of tests on jdk11 and say its supported… will jdk17 have the same
> issue?  Will it be even less tests?  Why does the burden lie on everyone
> to “do the right thing” when all they want is a simple job?
> 4) why do we require people to install “circleci” command to contribute?
> If you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will
> work just fine… we don’t need to call “circleci config process” every time
> we touch circle config…. Also, seems that w/e someone new to circle config
> (but not cassandra) touch it they always mutate LOW/MID/HIGH and
> not .circleci/config-2_1.yml… so I keep going back to
> fix .circleci/config-2_1.yml….
>
>
> On Oct 19, 2022, at 1:32 PM, Miklosovic, Stefan <
> [email protected]> wrote:
>
> 1) would be nice to have. The first thing I do is that I change the
> parallelism to 20. None of committed config.yaml's are appropriate for our
> company CircleCI so I have to tweak this manually. I think we can not run
> more that 25/30 containers in parallel, something like that. HIGHRES has
> 100 and MIDRES has some jobs having parallelism equal to 50 or so so that
> is not good either. I would be happy with simple way to modify default
> config.yaml on parallelism. I use "sed" to change parallelism: 4 to
> parallelism: 20 and leave parallelism: 1 where it does not make sense to
> increase it. However I noticed that there is not "4" set everywhere, some
> jobs have it set to "1" so I have to take extra care of these cases (I
> consider that to be a bug, I think there are two or three, I do not
> remember). Once set, I have that config in "git stash" so I just apply it
> every time I need it.
>
> 5) would be nice too.
> 7) is nice but not crucial, it takes no time to commit that.
>
> ________________________________________
> From: Josh McKenzie <[email protected]>
> Sent: Wednesday, October 19, 2022 21:50
> To: dev
> Subject: [DISCUSS] Potential circleci config and workflow changes
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> While working w/Andres on CASSANDRA-17939 a variety of things came up
> regarding our circleci config and opportunities to improve it. Figured I'd
> hit the list up here to see what people's thoughts are since many of us
> intersect with these systems daily and having your workflow disrupted
> without having a chance to provide input is bad.
>
> The ideas:
> 1. Tune parallelism levels per job (David and Ekaterina have insight on
> this)
> 2. Rename jobs on circle to be more indicative of their function
> 3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see:
> https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595
> )
> 4. Update documentation w/guidance on using circle, .circleci/generate.sh
> examples, etc
> 4a. How to commit:
> https://cassandra.apache.org/_/development/how_to_commit.html
> 4b. Testing: https://cassandra.apache.org/_/development/testing.html
> 5. Flag on generate.sh to allow auto-run on push
> 6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all
> suites, default to -m, deprecate -h?) <- may not be a code-change issue and
> instead be a documentation issue
> 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE]
> temporary circleci config" as the commit message
>
> Curious to see what folks think.
>
> ~Josh
>
>
>

Re: [DISCUSS] Potential circleci config and workflow changes

Reply via email to