[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Valeriy Savchenko via Phabricator via cfe-commits Tue, 08 Jun 2021 06:58:22 -0700

vsavchenko added a comment.

In D103605#2804421 <https://reviews.llvm.org/D103605#2804421>, @NoQ wrote:


> These are not other topics. We're discussing the overall direction into which 
> this patchset is a large step. I absolutely welcome this step and am 
> fascinated to see how it turns out but I want us to agree on the overall 
> direction first.

OK, fair point, but for further plan, we need to have an incremental model of 
how we want to achieve it.

> The reason why I come from this perspective is because that's how I think it 
> is //supposed// to work and I'm not aware of a single good reason why it 
> can't. I think the reason why it's this way is that the original authors seem 
> to have bombarded the code with extra clauses until it starts demonstrating 
> the more or less expected behavior, without trying to solve the problem on 
> paper first. Like with visitor fixpoint iteration, it sounds like an 
> unnecessary piece of complexity that we will be ultimately able to eliminate.

Right now it is all very tightly coupled.  It is a lot of moving pieces that I 
don't think anyone fully understands.  If we take time and decompose existing 
tracking into handlers, it can be a step in getting to know some implicit 
connections between different tracking activities we have right now.

> Except, well, the distant constraints example a-la @RedDocMD's problem but 
> for raw pointers. This is indeed an exceptional case which may cause two 
> tracking efforts to run in parallel. This can be probably generalized to any 
> constraint-like traits: if we want to explain a constraint, we'll have to be 
> prepared that the constraint isn't part of the value track that we already 
> have. This is the first time the above assumption was challenged and I 
> definitely need time to process this but I still think there shouldn't be any 
> other situations like this and there should be something special about this 
> situation that will allow us to incorporate it into the linear-track solution.
>
> "At the moment such refactoring is difficult to achieve" is a perfectly 
> satisfying answer to me with respect to the current state of things. But I 
> want us to agree as early as possible about whether we want to achieve this. 
> Simply so that as we make every step, we keep this in mind and evaluate every 
> step as taking us either closer to this goal or farther from this goal and 
> try to avoid the latter. And I don't think that anything prevents us from 
> discussing this goal right now; the problem we're trying to solve isn't 
> really ever going to change and we already have a fairly large amount of data.

I get it, but I think we got stuck and should put goal first and think about 
what can be done to achieve this.  Is it bad that tracking exists and the 
analyzer reverse engineers what it did itself? - No doubt!  Is note mechanism a 
good alternative? - Sure!  Is it enough right now? - No.  Let's think how we 
can get there with incremental changes.

> Ok, do I understand correctly that:
>
> - Functionality-wise, this is entirely equivalent to my NoteTag solution with 
> "`isTracked()`";

Not really, you also get the context of tracking: two nodes and options (so you 
can figure out if it's being tracked thoroughly or as a condition).

> - Boilerplate-wise it makes the `isTracked()` check implicit ("we were 
> invoked at all therefore the value of the expression is tracked by... 
> somebody") at the cost of asking the user to define a class;

I suppose we can put it this way.

> - Convenience-wise, it potentially requires the developer to reverse-engineer 
> the symbolic execution step when the logic gets more complicated;

I don't see ANY difference with the situation where we use `isTracked`.  If you 
talk about notes, you can access them here as well.  If you don't, then both 
ways require the same.

> - Flexibility-wise... I guess you can make custom trackers now, which will 
> track the value in a different manner, and communicate with the tracker in 
> order to produce different notes?

Exactly, tracker object is the thing that is guaranteed to stay whenever you go 
with tracking, and since you can customize it, you can affect every part of the 
tracking process.

> Yeah, the store handler part seems perfectly legit because *I* don't see any 
> better approaches here :) It's indeed a big problem that we can't customize 
> existing notes and this sounds like the right data structure to solve the 
> problem.

And we do need to have a central customizable object (aka Tracker) that we pass 
around the tracking process to make it happen, right?

> Ok, I don't understand! If we have anyway tracked the value back to 
> `b.get()`, why not track further through `b`? What makes `b.get()` the exact 
> place at which we want to stop tracking the value? If we wanted to know why 
> we choose the true path at the branch, shouldn't we track further through `b` 
> as well in order to answer that question? If we didn't want to know why we 
> choose the true path, why did we track the value back to `b.get()` in the 
> first place?
>
> I guess one possible difference between `a` and `b` here would be in 
> messaging, i.e. how we refer to that value.

Bingo!

> For example, we could refer to `a` as "the pointer value" (implying that this 
> is the null-dereferenced pointer value) and refer to `b` as "a pointer value" 
> (implying that it's some other value, maybe the one that participates in 
> condition later). But even then, this distinction is much more coarse-grained 
> than "I track the value" vs. "Someone else tracks the value". It's more like 
> "Buggy value" vs. "Condition value" which is a distinction we already 
> "almost" make with `TrackingKind`. Even if we wanted to identify our values 
> as "pointer #1", "pointer #2", etc., we could totally do with a sufficiently 
> advanced definition of `isTracked()`.

Correct!

>> Can we move inline defensive checks out of the BugReportVisitors.cpp with 
>> `isTracked()` only?
>
> My original thinking was along the lines of, eg.,:
>
>   void ExprEngine::processBranch(...) {
>     Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
>       if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
>         BR.markInvalid();
>     }));
>   }
>
> And this is how, basically, the entirety of `trackExpressionValue()` could be 
> decomposed into individual stateless note tags, with all the necessary state 
> stuffed into the bug report. Including the tracking process itself: if note 
> tags are sufficient to help with tracking, they're sufficient to do the 
> tracking from the start.

Maybe, or maybe not.  Maybe it takes more context than we think.  You 
essentially assume that the only thing that we need to reimplement is one 
boolean variable `isTracked` and the expression that is being tracked.  I doubt 
it.
And it is harder to debug polling approach as opposed to the callback approach.

> This code could also be moved into a checker's `checkBranchCondition()` so 
> that to make it truly checker-specific. That would come at the cost of 
> creating a lot of redundant exploded nodes just for the sake of attaching the 
> tag that will probably never be used so this is rather bad, wouldn't 
> recommend.

Why not model defensive checks in the checker, so we don't make undesired 
assumptions in the first place?

> I still think the solution that relies on well-defined end of tracking is 
> significantly more elegant. Which, again, could be implemented with note 
> tags: we could have a note tag that'd call a callback stuffed directly into 
> the bug report.

I'm not sure I understood the last thing, can you please elaborate?

> So I think it can work both ways given that the tags solution would rely on 
> having a sufficiently flexible mutable state as part of the `BugReport` 
> object.

OK, let me re-iterate to get some things straight.
Refactoring of `trackExpressionValue` is good, but we need to think further 
(and even further).  You have a dream of getting rid of visitors and tracking 
altogether so that users don't need to reverse engineer their own checkers 
(preferably using tags).  As a matter of transformation, you see that tracking 
should provide two new concepts: `isTracking` predicate (similar to the 
existing interesting-ness mechanism) and the way to detect where the tracking 
stopped.  You also don't want tracking to be a gigantic tree and have 
everything, but to have one source of value without anything else.  All other 
parts should be decoupled and maybe even given a different name.  We still want 
to remove all different parts of other checkers from `BugReporterVisitors.cpp` 
and probably can do it using just one abstraction of `isTracking` to do that.

And here is my take on things (it is a bit chaotic, sorry about that):

- If we want to move to `isTracking`, we still need a central component that 
participates in the tracking or something that is shared through all tracking 
logic.  In my patches it is a `Tracker` object.  It can be done differently by 
passing such object directly into every `trackExpressionValue` and its 
sub-visitors, but probably something more extensible would be a preference here.
- If we want to split `trackExpressionValue` and extract checker-specific 
parts, we need a good and reliable way of splitting it.  I used a time-tested 
approach of function extraction: if we have `foo(A a, B b, C c) { ... }` with N 
independent parts, we can refactor them into `foo(A a, B b, C c) { foo1(a, b, 
c); foo2(a, b, c); ... fooN(a, b, c); }`. If one of the `fooK` is big, we can 
repeat the same procedure. Instead of doing it plainly, I decided that the user 
might be interested in putting their own `fooK` into the mix.
- Can we implement `isTracked` first and then split `trackExpressionValue`?  My 
answer is no.  It implies bigger refactoring because `isTracking` doesn't 
provide all the bits of `A a, B b, C c` that the logic used to receive before.  
One example here is the `SuppressInlineDefensiveChecksVisitor`.  When we add 
it, we give it a node where the tracking of the expression was initiated, not 
the node with expression itself.  We also have a very specific comment there 
that this is a requirement and expression own node might be "too late".  If you 
use `isTracked` in the Visitor you will be automatically in the "too late" 
situation.  And it is only one example.  So, I insist that `isTracked` couldn't 
be the initial step for refactoring.
- Can we implement "tracking has stopped mechanism" first and then split 
`trackExpressionValue`?  My answer is no.  Tracking right now is a tree and it 
is very convoluted.  There is no one single point where it stops, it stops 100 
of times.  The concept of "stopped" in this situation is not useful (I'm still 
not convinced that it's useful at all, but this is a part of a different 
conversation).  In order to fix that we need to split `trackExpressionValue`.  
Contradiction.
- Can we implement `isTracked` afterwards?  Yes, we can.  Implementation of 
`isTracked` on it's own is fairly simple.  And we can have users of `isTracked` 
and `ExpressionHandler` at the same time.  Maybe, once we migrate all uses, 
`ExpressionHandler` can be safely retired.
- Can we implement "tracking has stopped mechanism".  Probably.  This is 
something that I doubt in usefulness, and don't really want to plan.
- Is there a way to get rid of tracking?  Honestly, I don't think so.  I do 
believe, though, that it is possible to stop reverse engineering ourselves.  
While the analyzer explores different paths, both the core and checkers somehow 
affect the state.  We can cal such a change of the state as "event".  Events 
can be both relevant and irrelevant (e.g. happened to a different smart 
pointer) to the issue that is found.  Many of these events are pivotal to the 
user's understanding of the issue and we should explain them with notes.  The 
way to understand that it is relevant is through tracking.  We might want to 
keep these traces as we go, but it might be some extra price that we don't want 
to pay.
- Let's get back to the events!  Can we build our entire solution around them?  
I do think so, actually.  I think that the analyzer can produce a bunch of 
events (maybe implemented as note tags).  Then each bug report can provide a 
visitor, not exploded graph visitor as it is now, but event visitor.  It will 
be automatically called for relevant events only (with tracking hidden under 
the hood).  Many events will have reasonable defaults so you don't need to 
write your own implementations every time (e.g. "a is assigned with b here"), 
but you can always override them.  This is something that I started with 
`StoreHandler` because "store" is one of those events.  It is different though 
from the perfect vision because we don't want to reverse engineer what the 
engine/the checker did before, perfectly we would have this information in the 
node (or on the edge) that "store event happened".   Tracker will be very 
trivial in this plan.
- Is it possible to be done incrementally?  I do think so.  We can extract 
different events from `trackExpressionValue` and change them into various 
`EventHandler`s (like `StoreHandler`).  Probably, in the process, we will 
understand what kind of data describes each type of events.  Another step is 
pretty hard, we need to design how we are going to generate (easier) and store 
(harder) events during analysis.  It should be pretty implicit, so it has a lot 
of sensible defaults produced by the core.  And checker developers should be 
able to generate their own events (maybe even for other checkers, if they model 
something that everyone can benefit from).  I think it will be possible to 
generate one event type at a time, and change how we handle it in the tracker, 
i.e. move from reverse engineering to detecting relevant events in the graph.  
Once we are finished with this, we can extract a separate thing called 
`EventVisitor` and migrate a big chunk of `Tracker` and `EventHandler` 
interfaces there.
- Why do we even need to introduce tracker and event handlers if we are going 
to hide/delete them in this plan?  The trick is about making this whole plan 
manageable.  One enormous refactoring is hard to plan and test.  Big work can 
demotivate and drain if you don't see the end of it.  I believe that we can 
make great things if we take a lot of baby steps.

Phew, I hope this answers it.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Reply via email to