Thank you Josh. Glad to see that our CI is getting more attention. As no Cassandra feature will be there if we don't do proper testing, right? Important as all the suites and tools we have. With that being said I am glad to see Derek is volunteering to spend more time on this as I believe this is always the main issue - ideas and willingness for improvements are there but people are swamped with other things and we lack manpower for something so important. 1. Tune parallelism levels per job (David and Ekaterina have insight on this) Question for David, do you tune only parallelism and use only xlarge? If yes, we need to talk :D
Reading what Stefan shared as experience/feedback, I think we can revise the current config and move to a more reasonable config that can work for most people but there will always be someone who needs something a bit different. With that said maybe we can add to our scripts/menu an option to change from command line through parameters parallelism and/or resources? For those who want further customization? I see this as a separate additional ticket probably. In that case we might probably skip the use of circleci config process for that part of the menu. (but not for adding new jobs and meaningful permanent updates) 2. Rename jobs on circle to be more indicative of their function +0 I am probably super used to the current names but Derek brought it to my attention that there are names which are confusing for someone new to the cassandra world. With that said I would say we can do this in a separate ticket, mass update. 3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see: https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595 ) I am against unifying per JDK workflows but I am all in for unifying the pre-commit/separate workflows and getting back to 2 workflows as suggested by Andres. If we think of how that will look in the UI I think it will be super hard to follow. (the case of having unified both jdks in one workflow) 4. Update documentation w/guidance on using circle, .circleci/generate.sh examples, etc 4a. How to commit: https://cassandra.apache.org/_/development/how_to_commit.html 4b. Testing: https://cassandra.apache.org/_/development/testing.html I will open a ticket and post the guide I was working on. But it also doesn't make sense to fully update it now if we are going to significantly change the workflow soon. Until then I believe Andres has updated the circleci readme and provided good usage examples. 5. Flag on generate.sh to allow auto-run on push Auto-run on push? Can you elaborate? Like to start your whole workflow directly without using the UI? There is an approval step in the config file, we can probably add some flags to change pre-commit workflows to start build without approval when we use those mentioned flags. But having by default everything to start on push is an overkill in my opinion. People will be forgetting it and pushing builds for nothing on WIP branches. Talking from experience :D 6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all suites, default to -m, deprecate -h?) <- may not be a code-change issue and instead be a documentation issue If we agree except the free tier config file we want one more reasonable config which doesn't bump resources to the max without a need but provides balanced use of resources - absolutely. -h was kept as there was understanding there are people in the community actively using it. 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] temporary circleci config" as the commit message +0 I also wanted to address a few of the points David made. "Ekaterina is probably dealing with with her JDK17 work" - if you mean to ensure we have all jobs for all jdks properly, yes. That was my plan. Until Derek was so good at suggesting to work on adding missing jobs in CircleCI now so my work on that will be a bit less for certain things. This is an effort related to the recent changes in our release document. Ticket CASSANDRA-17950 <https://issues.apache.org/jira/browse/CASSANDRA-17950> :-) I am helping with mentoring/reviews. Everyone is welcome to join the party. "1) resource_class used is not because its needed… in HIGHER file we default to xlarge but only python upgrade tests need that… reported in CASSANDRA-17600" - one of the reasons. we had the MIDRES in the first place as I mentioned in my other email the other day. [1] "our current patching allows MID/HIGHER to drift as changes need new patches else patching may do the wrong thing… reported in CASSANDRA-17600" - I'd say the patching is annoying sometimes, indeed but with/without the patching any changes to config mean we need to check it by reading through diff and pushing a run to CI before commit. With that said I am all in for automation but this will not change the fact we need to push test runs and verify the changes did not hurt us in a way. Same as testing patches on all branches, running all needed tests and confirming no regressions. Nothing new or changing here IMHO "CI is a combinatorial problem, we need to run all jobs for all JDKs, vnode on/of, cdc on/off, compression on/of, etc…. But this is currently controlled and fleshed out by humans who want to add new jobs.. we should move away from maintaining .circleci/config-2_1.yml and instead auto-generate it. Simple example of this problem is jdk11 support… we run a subset of tests on jdk11 and say its supported… will jdk17 have the same issue? Will it be even less tests? Why does the burden lie on everyone to “do the right thing” when all they want is a simple job?" Controlled and fleshed by humans it will always be but I agree we need to automate the steps to make it easier for people to add most of the combinations and not to skip any because it is too much work. We will always need a human to decide which jdks, cdc, vnodes, etc. With that said I shared your ticket/patch with Derek as he had similar thoughts, we need to get back to that one at some point. (CASSANDRA-17600) Thanks for working on that! "why do we require people to install “circleci” command to contribute? If you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will work just fine… we don’t need to call “circleci config process” every time we touch circle config…. Also, seems that w/e someone new to circle config (but not cassandra) touch it they always mutate LOW/MID/HIGH and not .circleci/config-2_1.yml… so I keep going back to fix .circleci/config-2_1.yml…." I'd say config-2_1.yml is mainly for those who will make permanent changes to config (like adding/removing jobs). config-2_1.yml is actually created as per the CircleCI automation rules - 1st we add and reuse executors, parameters and commands but I think we can reduce further things if we add even more parameters probably. I have to look more into the current file. I am sure there is room for further improvement. 2nd circleci cli tool can verify the config file for errors and helps with debugging before we push to CircleCI. There is circleci config validate. If we make changes manually we are on our own to verify the long yml and also deal with duplication in config.yml. My concern is that things that need to be almost identical might start to diverge easier. Though I made my suggestion in point 1 for what cases probably we can add menu options that potentially will not require using circleci cli tool. There might be more cases though. Currently config-2_1.yml is 2256 lines while config.yml is 5793 lines. I'd say lots of duplication there [1] https://lists.apache.org/thread/htxoh60zt8zxc4vgxj9zh71trk0zxwhl On Wed, 19 Oct 2022 at 17:20, David Capwell <dcapw...@apple.com> wrote: > 1. Tune parallelism levels per job (David and Ekaterina have insight on > this) > > > +1 to this! I drastically lower our parallelism as only python-dtest > upgrade tests need many resources… > > What I do for JVM unit/jvm-dtest is the following > > def java_parallelism(src_dir, kind, num_file_in_worker, include = lambda > a, b: True): > d = os.path.join(src_dir, 'test', kind) > num_files = 0 > for root, dirs, files in os.walk(d): > for f in files: > if f.endswith('Test.java') and include(os.path.join(root, f), > f): > num_files += 1 > return math.floor(num_files / num_file_in_worker) > > def fix_parallelism(args, contents): > jobs = contents['jobs'] > > unit_parallelism = java_parallelism(args.src, 'unit', > 20) > jvm_dtest_parallelism = java_parallelism(args.src, > 'distributed', 4, lambda full, name: 'upgrade' not in full) > jvm_dtest_upgrade_parallelism = java_parallelism(args.src, > 'distributed', 2, lambda full, name: 'upgrade' in full) > > TL;DR - I find all test files we are going to run, and based off a > pre-defined variable that says “idea” number of files per worker, I then > calculate how many workers we need. So unit tests are num_files / 20 ~= 35 > workers. Can I be “smarter” by knowing which files have higher cost? > Sure… but the “perfect” and the “average” are too similar that it wasn’t > worth it... > > 2. Rename jobs on circle to be more indicative of their function > > > Have an example? I am not against, I just don’t know the problem you > are referring to. > > 3. Unify j8 and j11 workflow pairs into single > > > Fine by me, but we need to keep in mind j17 is coming. Also, most > developmental CI builds don’t really need to run cross every JDK, so we > need some way to disable different JDKs… > > When I am testing out a patch I tend to run the following (my script): > "circleci-enable.py --no-jdk11”; this will remove the JDK11 builds. I > know I am going to run them pre-merge so I know its safe for me. > > 5. Flag on generate.sh to allow auto-run on push > > > I really hate that we don’t do this by default… I still to this day > strongly feel you should *opt-out* of CI rather than opt-in… seen several > commits get merged as they didn’t see a error in circle… because circle > didn’t do any work…. Yes, I am fully aware that I am beating a dead horse… > > TL;DR +1 > > 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] > temporary circleci config" as the commit message > > > +0 from me… I have seen people not realize you have to commit after > typing “higher” (wrapper around my personal circleci-enable.py script to > apply my defaults to the build) but not an issue I have… so I don’t mind if > people want the tool to integrate with git… > > > With all that said, I do feel there is more, and something I feel > Ekaterina is probably dealing with with her JDK17 work… > > 1) resource_class used is not because its needed… in HIGHER file we > default to xlarge but only python upgrade tests need that… reported in > CASSANDRA-17600 > 2) our current patching allows MID/HIGHER to drift as changes need new > patches else patching may do the wrong thing… reported in CASSANDRA-17600 > 3) CI is a combinatorial problem, we need to run all jobs for all JDKs, > vnode on/of, cdc on/off, compression on/of, etc…. But this is currently > controlled and fleshed out by humans who want to add new jobs.. we should > move away from maintaining .circleci/config-2_1.yml and instead > auto-generate it. Simple example of this problem is jdk11 support… we run > a subset of tests on jdk11 and say its supported… will jdk17 have the same > issue? Will it be even less tests? Why does the burden lie on everyone > to “do the right thing” when all they want is a simple job? > 4) why do we require people to install “circleci” command to contribute? > If you rename .circleci/config-2_1.yml to .circleci/config.yml then CI will > work just fine… we don’t need to call “circleci config process” every time > we touch circle config…. Also, seems that w/e someone new to circle config > (but not cassandra) touch it they always mutate LOW/MID/HIGH and > not .circleci/config-2_1.yml… so I keep going back to > fix .circleci/config-2_1.yml…. > > > On Oct 19, 2022, at 1:32 PM, Miklosovic, Stefan < > stefan.mikloso...@netapp.com> wrote: > > 1) would be nice to have. The first thing I do is that I change the > parallelism to 20. None of committed config.yaml's are appropriate for our > company CircleCI so I have to tweak this manually. I think we can not run > more that 25/30 containers in parallel, something like that. HIGHRES has > 100 and MIDRES has some jobs having parallelism equal to 50 or so so that > is not good either. I would be happy with simple way to modify default > config.yaml on parallelism. I use "sed" to change parallelism: 4 to > parallelism: 20 and leave parallelism: 1 where it does not make sense to > increase it. However I noticed that there is not "4" set everywhere, some > jobs have it set to "1" so I have to take extra care of these cases (I > consider that to be a bug, I think there are two or three, I do not > remember). Once set, I have that config in "git stash" so I just apply it > every time I need it. > > 5) would be nice too. > 7) is nice but not crucial, it takes no time to commit that. > > ________________________________________ > From: Josh McKenzie <jmcken...@apache.org> > Sent: Wednesday, October 19, 2022 21:50 > To: dev > Subject: [DISCUSS] Potential circleci config and workflow changes > > NetApp Security WARNING: This is an external email. Do not click links or > open attachments unless you recognize the sender and know the content is > safe. > > > > While working w/Andres on CASSANDRA-17939 a variety of things came up > regarding our circleci config and opportunities to improve it. Figured I'd > hit the list up here to see what people's thoughts are since many of us > intersect with these systems daily and having your workflow disrupted > without having a chance to provide input is bad. > > The ideas: > 1. Tune parallelism levels per job (David and Ekaterina have insight on > this) > 2. Rename jobs on circle to be more indicative of their function > 3. Unify j8 and j11 workflow pairs into single (for 2 and 3 see: > https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17616595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17616595 > ) > 4. Update documentation w/guidance on using circle, .circleci/generate.sh > examples, etc > 4a. How to commit: > https://cassandra.apache.org/_/development/how_to_commit.html > 4b. Testing: https://cassandra.apache.org/_/development/testing.html > 5. Flag on generate.sh to allow auto-run on push > 6. Clean up the -l, -m, -h flags (test and indicate -l feasibility for all > suites, default to -m, deprecate -h?) <- may not be a code-change issue and > instead be a documentation issue > 7. Consider flag on generate.sh to run and commit with "[DO NOT MERGE] > temporary circleci config" as the commit message > > Curious to see what folks think. > > ~Josh > > >