A couple of months ago, I trialled a new proof of concept during a bugfix
sprint to help reduce the time needed to investigate an integration failure
in CI. It was very well received internally at TQtC, so work continued on
improving its reliability. The new CI Failure Analysis Bot is now deployed
in production and will comment on most changes that fail CI, with some
intentional exceptions such as submodule updates. The bot runs in all repos.

So what is it?
The bot operates in a few stages:

  1. For each change in the failed integration, the log is gathered and the
     last 1000 or so lines are fed into an LLM (today GPT-4o) with a prompt
     to extract the most relevant failure from the log and snip it out,
     along with some other relevant data points such as error-type
     classification and relevant filenames.
  2. The bot then collects the source of the relevant files, and the test
     source if a test was involved.
  3. The diff of the change itself is also collected.
  4. Once all the needed data has been collected, the log snippet, sources,
     and change diff are fed into an LLM (today also GPT-4o), which is asked
     to produce a brief summary of the failure, along with an attempt to
     determine whether the failure was directly caused by the change being
     analysed.
  5. The resulting summary is posted to the change that failed CI.
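
For the curious, a rough sketch of what stages 1 and 4 might look like in
Python. This is illustrative only; the function names, prompts, and helper
below are assumptions, not the bot's actual code, and it simply uses the
standard OpenAI Python client:

# Illustrative sketch only, not the bot's actual implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    # One chat completion per call; the real bot targets Azure OpenAI (GPT-4o).
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_failure(log_text: str) -> str:
    # Stage 1: only the tail of the log is analysed.
    tail = "\n".join(log_text.splitlines()[-1000:])
    return ask_llm(
        "Extract the most relevant failure from this CI log, along with "
        "the error type and any filenames involved:\n\n" + tail
    )

def summarise(log_snippet: str, sources: str, diff: str) -> str:
    # Stage 4: combine the log snippet, the related sources, and the diff.
    return ask_llm(
        "Given this CI failure, the related source files, and the diff of "
        "the change under test, briefly summarise the failure and state "
        "whether the change likely caused it.\n\n"
        "Failure:\n" + log_snippet + "\n\nSources:\n" + sources +
        "\n\nDiff:\n" + diff
    )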

Why?
Historically, the shortcut to determining whether your change caused the CI
failure has been to simply restage it until it's obvious that your change
isn't passing for some reason. Running pre-checks has reduced this practice
significantly, but changes may still cause failures on platforms that can't
be tested in a standard pre-check.

This bot aims to give users a quick pointer. Namely, if it is fairly clear
that the failure was not caused by the change, it will simply say so and
suggest a restage. If there is a clear reason why the change caused the
failure, the bot will attempt to point that out, but it intentionally avoids
giving specific solutions.
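
As an illustration of that policy (the function and field names here are
hypothetical, not the bot's real schema), the posted comment might be
assembled roughly like this:

# Hypothetical sketch of how the posted comment might be assembled.
from typing import Optional

def format_comment(summary: str, caused_by_change: Optional[bool]) -> str:
    if caused_by_change is False:
        return (summary + "\n\nThe failure does not appear to be related to "
                "this change; a restage is likely sufficient.")
    if caused_by_change is True:
        return (summary + "\n\nThe failure appears to be related to this "
                "change; see the summary above for the likely cause.")
    return summary  # unclear cases get the summary only, no verdict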

Is it accurate?
Fairly accurate, but the analysis is done in a zero-shot manner; this means
that only one result is generated, and it isn't cross-checked by running the
analysis multiple times. If TQtC invests in on-premises LLM hardware, or the
price of online models continues to fall, accuracy could be improved with a
multi-shot consensus. No matter what, it's still not a human and isn't
actually intelligent. It may make mistakes or produce a misleading analysis,
so always weigh that when troubleshooting a change.
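
For reference, a multi-shot consensus boils down to running the same
analysis several times and keeping the majority answer. A minimal sketch,
assuming an ask(prompt) helper that returns one LLM completion:

# Sketch of a majority-vote consensus over repeated LLM runs.
from collections import Counter
from typing import Callable

def consensus_verdict(ask: Callable[[str], str], prompt: str,
                      runs: int = 5) -> str:
    # Each run answers the same question; the most common verdict wins.
    verdicts = [ask(prompt + "\n\nAnswer only 'caused' or 'unrelated'.")
                for _ in range(runs)]
    return Counter(v.strip().lower() for v in verdicts).most_common(1)[0][0]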

What about data privacy and copyright?
The current online LLM analysis is performed using Microsoft Azure OpenAI
services. No data is stored, and no data is used for LLM training purposes.
The service used is fully GDPR-compliant and has been cleared for use at
TQtC and with open-source Qt Project code.

What's next?
If you see an analysis that isn't correct, contact me. Either the bot didn't
find the correct error in the log, it failed to retrieve the sources
correctly, or it may simply benefit from changes to the prompts.


Best regards,
-Daniel

