alejandro-alvarez-sonarsource wrote:

> @alejandro-alvarez-sonarsource @steakhal What do you think about the code 
> duplication situation? What are your reasons for proposing this 
> implementation?

I tend to be conservative about existing behavior because

1) It is working!
2) We have been relying on it for long, and even a small divergence could turn 
into hundreds, if not thousands, of new positives and negatives coming and 
going on existing projects (changes that unfortunately are not caught by lit 
tests)
3) Not to mention a slight change in assumptions can cascade to other CSA 
checkers (both up and downstream) adding and/or removing another big set of 
raised issues.

Of course this is not news for you!, but this is why I have preferred to keep 
the surface of the changes small (at the cost of duplication).

Trade-offs, though, and I agree that I have probably lean too far into 
duplication this time.

> If I understand correctly the "new" type-based bounds checking differs from 
> the "old" region/extent-based logic in the following areas:
> 
> * it performs calculations with an index instead of a byte offset,
> * it uses a different logic to calculate the array size (which will be 
> compared with the index),
> * in multidimensional arrays (and probably other similar situations?) it has 
> a different (more strict) understanding about the beginning of the array,

Correct. It relies purely on the type, even if we know little or nothing about 
the memory region.

> * it reports slightly different messages (I admit that I didn't analyze the 
> differences in detail).

TBH this started a side effect of using indexes (so now "index" is used instead 
of "offset" in some issues), because I wanted to stay close to existing 
reports. Some details like using "subarray" to be more specific came later 
since now we could do it.

> To achieve these differences the current version of the PR duplicates the 
> following components of the code:
> 
> * the "main method" `performCheck` / `performCheckArrayTypeIndex`,
> * the helper class `StateUpdateReporter` / `StateIndexUpdateReporter`,
> * the various helper functions that produce the `Messages`.
> 
> Among these my rough impression is that the state update handler and message 
> generation helpers are relatively straightforward and they could be unified 
> by introducing a few well-placed additional parameters, boolean flags etc. 
> (or even a BaseStateUpdateReporter class with two subclasses that implement 
> slightly different behavior).

As a first approximation, this is where most of the duplication could be 
removed, yes.

> In the "main" methods I see that the "old" and "new" approaches begin with 
> completely different logic for calculating the index/offset value, the base 
> region (i.e. the "full array" to which the accessed location is compared) and 
> the array size. However, after that I find it very unfortunate that lots of 
> complex logic (the deeply nested `if`s that check underflow and then 
> overflow, plus past-the-end expressions, taint etc.) is duplicated between 
> the two variants. I hope that it would be possible to factor out a "check 
> whether this index/offset is between zero and this extent value" method which 
> could be shared between the "new" and "old" checks.

Makes sense.

> I see that there are some additional differences in this latter part of the 
> "main" `performCheck` methods, but I feel that these are not _neccessary_ 
> differences:
> 
> * I would guess that the new heuristics for better fake-FAM handling would be 
> valuable in the old code as well (although `getDynamicExtent` has some 
> flexible array member heuristics);
> * I don't immediately see why can the new variant omit my 
> "isObviouslyNonnegative" hack (and if you do have a better solution, then I 
> would be very happy to see it applied in the "old" check as well).

I didn't quite omit it (because it's great at suppressing FP). 
`getAsCleanArraySubscriptExpr` needs memory regions to be laid out in a 
particular way ("not modified by pointer arithmetic)", and since the new logic 
doesn't look at the memory region, but at the type as-is in the AST, I inlined 
the `E->getIdx()->getType()->isUnsignedIntegerOrEnumerationType()` check.

> Moreover I feel that we could and perhaps should reduce the amount of 
> duplicated code by **removing some parts of the old logic** and keeping only 
> the new approach:

I have to admit this didn't cross my mind.

> * Perhaps we don't need the complex old logic that handles multidimensional 
> arrays by calculating the offset from the beginning of the whole array (by 
> combining multiple `ElementRegion`s) -- IIUC that logic was introduced when 
> this checker was still using `check::Location` and `check::Bind` callbacks 
> that provided less information than the current callbacks.
> * Perhaps we could switch to using indices instead of byte offsets by default 
> -- because this would produce simpler symbolic expressions that would be 
> "more useful" for the rest of the analyzer (e.g. instead of assuming `4 * x < 
> 40` we could assume `x < 10` which can also occur as e.g. an `if` condition).

Can we? How about cases such as this lit test?

```cpp
struct two_bytes convertedArray2(void) {
  // We report this with byte offsets because the offset is not divisible by 
the element size.
  struct two_bytes a = {0, 0};
  char *p = (char*)&a;
  return *((struct two_bytes*)(p + 7));
  // expected-warning@-1 {{Out of bound access to memory after the end of 'a'}}
  // expected-note@-2 {{Access of 'a' at byte offset 7, while it holds only 2 
bytes}}
}
```

> 
> In this review I'll try to be as constructive as possible and I'm open to 
> investing work into refactoring this checker to get an end result that 
> provides the features that you need with as little code duplication as 
> possible. (Edit: I can also pragmatically accept some code duplication if 
> that yields the overall best code.)
> 
> Note that I'm also planning significant changes in this file: I'm planning to 
> reimplement `alpha.security.ReturnPtrRange` as a separate `CheckerFrontend` 
> of `ArrayBound` (which would become a checker family) because 
> `ReturnPtrRange` and `ArrayBound` check the exactly same thing (is this 
> location in bounds?) in two different situations (returned pointer value vs 
> accessed location). (In fact, I started to work on this change two days ago, 
> but I'll probably need to restart that work because this PR affects lots of 
> code within `ArrayBoundChecker.cpp`.) On a longer time horizon, I'm also 
> planning to use the `ArrayBound` logic as a "backend" for the 
> `CStringOutOfBound` checker, so perhaps we might also think about separating 
> some core logic into a library-like bounds checking source file.

It's quite a coincidence (or perhaps not!). I am also working in parallel in 
`PointerArithChecker` which is turning out to be very similar too (except it 
raises the moment the pointer is formed, and that one-after-the-end is still 
ok, as long as it is not derefenced). So indeed a "core bound checker" library 
sounds very interesting.

> In fact, I started to work on this change two days ago, but I'll probably 
> need to restart that work because this PR affects lots of code within 
> `ArrayBoundChecker.cpp`.

Sorry 😅  I am thinking, maybe you go ahead with your work on those changes, 
since you have already progressed. I don't think I have the bandwidth to move 
fast enough adapting this PR as not to block you for a few weeks. I am happy to 
revisit these changes and rebase (or even redo) them on top of yours.

And, by the way, I would also be happy to review those changes once you submit 
them.

https://github.com/llvm/llvm-project/pull/159357
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to