steakhal added a comment.

In D86874#inline-803844 <https://reviews.llvm.org/D86874#inline-803844>, @martong wrote:
> I really feel that this check would have a better place in the implementation
> of `eval`. This seems really counter-intuitive to do this stuff at the
> Checker's level. Is there any reason why we can't do this in `eval`?
> `evalBinOpNN` could return with Unknown, and the state should remain
> unchanged. Am I missing something?

Ah, sort of. To make everything clear, I have to provide examples, building blocks, reasoning and stuff, so please don't be mad if it's too long. **I hope you have a warm cup of tea to read through this - seriously - you will need it!**

Diving in
---------

It is really important to make a clear distinction between evaluating an expression according to the semantics of the //meta-language// or of the //object-language//, because we might get different answers for the same expression. In fact, `evalBinOpNN` evaluates expressions according to the semantics of the //object-language//. In our context, the //meta-language// is //mathematics//, and the //object-language// is the semantics of the abstract machine of the C/C++ language. In mathematics, there is no such thing as overflow or value ranges, unlike in C++.

Let's see an example, assuming that `x` is in range `[1,UINT_MAX]`:

- `x + 1` will be in range `[2,UINT_MAX+1]` in the //meta-language//.
- `x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]` in the //object-language//, depending on whether the type of `x` is capable of representing the incremented value and whether overflow is well-defined for that type.

Another valuable observation is that non-contiguous ranges (range sets) are not really useful. Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you want to prove that `x < 5` holds for all possible interpretations of `x`. We cannot disprove it either. So, overflow/underflow can really screw up the value ranges, preventing us from evaluating expressions over relational operators. That is technically accurate, but not satisfactory in all cases - like in the checker `ArrayBoundCheckerV2`.
---

Before describing why it is so problematic in this checker, I should give an overview of how this checker works.

Overview of the logic of the ArrayBoundCheckerV2
------------------------------------------------

The checker checks `Location` accesses (aka. pointer dereferences). We construct the `RegionRawOffsetV2` object of the access (which consists of the //base region// and the symbolic //byte offset// expression of the access).

Then we check whether we access an element //preceding// the **first valid index** of the //base// memory region. In other words, we check if the //byte offset// symbolic expression is **less than** 0:

- If YES: Report that we access an out-of-bounds element.
- If NO: Infer the range constraints on the symbol and add the constraint to make this array access valid.
- If MAYBE: Infer and constrain, just as you would do in the previous case.

Then we check whether we access an element //exceeding// the **last valid index** of the memory region. In other words, we check if the //byte offset// symbolic expression is greater than or equal to the //extent// of the region:

- If YES: Report the bug...
- If NO: Infer & constrain...
- If MAYBE: Infer & constrain...

However, in the checker we don't check `byte offset < 0` directly. We //simplify// the left-hand side of the `byte offset < 0` inequality by subtracting the constant `C` from both sides, or dividing both sides by it, if the `byte offset` is a `SymIntExpr` of the form `SymExpr OP C` over the plus or multiplication operator (`OP`). We do this //simplification// recursively. In the end, we will have a //symbolic root// expression (call it `RootSym`) and a constant `C'` on the right-hand side of the original relational operator. So, after the //simplification// process we get the `RootSym < C'` inequality, which we check. This //simplification// is the //infer// operation in the pseudo-code above, and the //constrain// step is where we assume that the negation of `RootSym < C'` holds.
**SPOILER**: This //simplification// process is only **valid** using the semantics of the //meta-language//.

ElementRegions
--------------

Pointer values which describe a particular element of a memory region can get quite complex - even more so when we reinterpret-cast a pointer to a different type. In such cases, we might wrap the `SubRegion` symbol within an `ElementRegion` with the given target type and offset `0`. E.g.:

  void foo(int x) {
    int buf[10];
    char *p = (char*)(buf + 2) + 3;
    //               ^-- 2*sizeof(int) in bytes, which is 8.
    *p = 42;          // Not UB.
    int *q = (int*)p; // unaligned pointer!
    *q = 5;           // UB.
    char *r = (char*)(buf + x) + 3;
    *r = 66;          // Might be safe, if x has a value to make it so.
  }

RegionRawOffsetV2
-----------------

In the previous example, `p` would have the `&Element{buf,11 S64b,char}` `SymbolRegionVal` (`SVal` instance) associated with it. And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`. However, the `RegionRawOffsetV2` of both `p` and `q` would be the same:

  {BaseRegion: `buf`, ByteOffset: `11`}

Therefore, `RegionRawOffsetV2` gives us a unified //view// of the memory we are dealing with. It is also useful for dealing with pointer aliasing, just as in the example with `p` and `q`.

In other cases, where we index with a **symbol**, the `ByteOffset` might contain a complex //symbolic expression// instead of a `ConcreteInt`. E.g. the `RegionRawOffsetV2` of `buf[x+1]` will be:

  {BaseRegion: `buf`, ByteOffset: `x+1`}

It's important to note that the //symbolic expression// represented by the `ByteOffset` is now an expression in the //object-language//, where we have to deal with overflows, underflows, implicit conversions, promotions, and other nasty stuff according to the //abstract machine//.
//Note: I don't know what `RegionRawOffset` is, or what that does.//

ElementRegions, again
---------------------

ElementRegions might be nested, as in this example:

  void foo(int x) {
    int buf[10][3];
    int *p = &buf[1][2];
    clang_analyzer_dump(p);
    // &Element{Element{buf,1 S64b,int [3]},2 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(p);
    // raw_offset_v2{buf,20 S64b}

    int *q = &buf[x+1][6];
    clang_analyzer_dump(q);
    // &Element{Element{buf,(reg_$0<int x>) + 1,int [3]},6 S64b,int}
    clang_analyzer_DumpRegionRawOffsetV2(q);
    // raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24}
  }

In the example, the subscript expressions are expressions in the //object-language//. So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int x>) + 1`, we should //touch// it just as the //abstract machine// would.

Calculation of the RegionRawOffsetV2
------------------------------------

Let's see how these subscript expressions are used during the calculation of the `RegionRawOffsetV2`. For multi-dimensional arrays we //fold// the index expressions of the nested `ElementRegion`s. We do that by calculating the byte offset of the nested region and adding the current level's //index*sizeof(Type)//. So, the `ByteOffset` that we build up will contain mostly multiplications by a constant OR additions of symbolic expressions (like the `x+1` in the example). During this //folding// operation we use the semantics of the //object-language// in the code <https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342>.

The resulting `RegionRawOffsetV2` for `p` and `q` will be:

  p: {BaseRegion: `buf`, ByteOffset: `20 S64b`}
  q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`}

For `q`, the `(reg_$0<int x>) + 1` part (call it `@a`) is an //object-language// expression, while the rest of the expression (`@b`) should be a //meta-language// expression.
**SPOILER**: This distinction is really important here.

So, we made an expression that we should not evaluate in either of the worlds. We should not evaluate it using the semantics of the //object-language//, since only `@a` is of that world. We should model overflow in `@a` if that expression is of //unsigned// type (OR signed with `-fwrapv`...) etc., according to the //abstract machine//. BUT modeling the possibility of an overflow in the rest of the expression (`@b`) would be a flaw. The same reasoning applies the other way around.

---

In general, such //mixed// expressions make no sense at all. Trying to evaluate them leads to esoteric bugs, just like the bug mentioned in the summary.

Simplify step, again
--------------------

Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24 < 0` inequality folds the `@b` constants (`* 12` and `+ 24`) and results in:

  reg_$0<int x> < -3 S64b

where `reg_$0<int x>` is `RootSym` and `-3 S64b` is `C'`.

- `#1`: This step is supposed to fold **ONLY** the //meta-language// parts (`@b`) of the previously acquired `ByteOffset`.
- `#2`: This step requires the expression being simplified to be of the //meta-language//, to be able to reorder the constants (aka. to fold them into the right-hand side's constant).
- `#3`: This also requires the right-hand side to be of the //meta-language//.

Cause of the bug
----------------

We don't respect the `#1` and `#3` requirements of the //simplification// step. We treat the complete expression under //simplification// as an expression of the //meta-language//. I'm not changing this here, though I probably should. Ideally, after //simplification// we should have this inequality:

  x+1 < -2

That way we would not fold any subexpression of the //object-language//, so the `#1` requirement would be preserved. The right-hand side is an expression of the //meta-language//, so it makes sense to evaluate the `operator<` with the semantics of the //meta-language//.
But the left-hand side is an expression of the //object-language//. We need some sort of //logical// conversion here. If `x+1` is //unsigned//, then there is no way that expression could result in a //negative// value, so it can never be less than `-2`. Therefore, we have proved that no out-of-bounds buffer access can happen //before// the first valid index (0). Otherwise, we evaluate the inequality according to the //object-language//. We call `evalBinOpNN` to accomplish this query. That function, however, should respect `-fwrapv`, overflow/underflow, and the other nasty things of C++, but I don't care how that is implemented. The checker's job is done at this point.

Note that if the //inequality// is of the form `x - C < C'` after the //ideological-simplification// step, we have to deal with underflow similarly. In this patch I haven't done that though, since currently we just fold `C` into `C'`... If the `LHS` is signed, ehm, I would have to think it through - but I'm too tired of writing this...

The proposed fix
----------------

I know that this doesn't fix the root of the problem, but we have to start somewhere :D

Check whether the resulting //folded constant// (`C'`) is negative. If it is //negative// and the //root expression// (`RootSym`) has an **unsigned** type, then I state that this inequality doesn't hold for any (unsigned) interpretation of `RootSym`. So just move on and check the upper bound...

---

I'm not sure if this fixes all the false positives, but it certainly removes some of them. I'm not an expert in //abstract interpretation// or //formal verification//.

I hope you enjoyed reading this; now you should take a rest too!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86874/new/

https://reviews.llvm.org/D86874