Hi Balázs, Since reviews.llvm.org is offline, I am sending my comments below, inline. Thanks for your huge effort in explaining all this!
Overall, I have a feeling that this approach targets only one specific case, which is fine. But I believe we should think about all the other possible cases, so we could get rid of other false positives too: 1) In case of multidimensional arrays, there may be a symbolic value in any dimension. 2) What if there are more symbolic values in the dimensions. Cheers, Gábor On Thu, Sep 3, 2020 at 4:52 PM Balázs Benics via Phabricator < revi...@reviews.llvm.org> wrote: > steakhal added a comment. > > In D86874#inline-803844 <https://reviews.llvm.org/D86874#inline-803844>, > @martong wrote: > > > I really feel that this check would have a better place in the > implementation of `eval`. This seems really counter-intuitive to do this > stuff at the Checker's level. Is there any reason why we can't do this in > `eval`? > > `evalBinOpNN` could return with Unknown, and the state should remain > unchanged. Am I missing something? > > Ah, sort of. > To make everything clear, I have to provide examples, building-blocks, > reasoning and stuff, so please don't be mad if it's too long. > **I hope you have a warm cup of tee to read through this - seriously - you > will need that!** > Man, this requires a warm lunch and then hot cups(!) of coffee. :D > > Diving in > --------- > > It is really important to make a clear distinction between evaluating an > expression according to the semantics of the //meta-language// or of the > //object-language//, because we might get different answers for the same > expression. > > In fact, `evalBinOpNN` evaluates expressions according to the semantics of > the //object-language//. > > In our context, meta-language is //mathematics//, and the > //object-language// is the semantics of the abstract machine of the c/c++ > language. > In mathematics, there is no such thing as overflow or value ranges, unlike > in C++. > > Let's see an example: > Assuming that `x` is in range `[1,UINT_MAX]`. > `x + 1` will be in range `[2,ULL_MAX+1]` in the //meta-language//. > `x + 1` will be in range `[0,0]u[2,UINT_MAX]` or in `[2,UINT_MAX+1]` > weather the type of `x` is capable representing the (+1) value and the > overflow is well-defined or not. > > Another valuable comment is that non-contiguous ranges (range sets) are > not really useful. > Knowing that `x` is in range `[0,0]u[2,UINT_MAX]` doesn't help much if you > want to prove that: > `x < 5` holds for all possible interpretations of `x`. > We can not disproof that either. > > So, overflow/underflow can really screw the value ranges, preventing us > from evaluating expressions over relational operators. > Which is technically accurate, but not satisfiable in all cases - like in > the checker `ArrayBoundCheckerV2`. > > --- > > Before describing why is it so problematic in this checker, I should give > an overview, how this checker works. > > Overview of the logic of the ArrayBoundCheckerV2 > ------------------------------------------------ > > The checker checks `Location` accesses (aka. pointer dereferences). > We construct the `RegionRawOffsetV2` object of the access (Which consists > of the //base region//, and the symbolic //byte offset// expression of the > access). > > Then we check, whether we access an element //preceding// the **first > valid index** of the //base// memory region. > In other words, we check if the //byte offset// symbolic expression is > **less then** 0: > > - If YES: Report that we access an out-of-bounds element. > - If NO: Infer the range constraints on the symbol and add the > constraint to make this array access valid. > - If MAYBE: Infer and constrain, just as you would do in the previous case. > > Then we check, whether we access an element //exceeding// the **last valid > index** of the memory region. > In other words, we check if the //byte offset// symbolic expression is > greater then or equal to the //extent// of the region: > > - If YES: Report the bug... > - If NO: Infer & constrain... > - If MAYBE: Infer & constrain... > > However, in the checker, we don't check `byte offset < 0` directly. > We //simplify// the left part of the `byte offset < 0` inequality by > substracting/dividing both sides with the constant `C`, if the `byte > offset` is a `SymintExpr` of `SymExpr OP C` over the plus or multiplication > operator (OP). > We do this //simplification// recursively. > In the end, we will have a //symbolic root// (call it `RootSym`) > expression and a `C'` constant over the right-hand side of the original > relational operator. > So, after the //simplification// process we get the `RootSym < C'` > inequality, which we check. > Just for the record, this is in `getSimplifiedOffsets`. > > This //simplification// is the //infer// operation in the pseudo-code. > And the //constrain// step is where we assume that the negation of > `RootSym < C'` holds. > > **SPOILER**: This //simplification// process is only **valid** using the > semantics of the //meta-language//. > > ElementRegions > -------------- > > Pointer values, which describe a particular element of a memory region, > can get quite complex. > Even more complex, when we reinterpret cast a pointer to a different type. > In such cases, we might wrap the `SubRegion` symbol within an > `ElementRegion` with the given target type and offset `0`. > Eg: > > void foo(int x) { > int buf[10]; > char *p = (char*)(buf + 2) + 3; > ^-- 2*sizeof(int) in bytes which is 8. > *p = 42; // Not UB. > int *q = (int*)p; // unaligned pointer! > *q = 5; // UB. > > char *r = (char*)(buf + x) + 3; > *r = 66; // Might be safe, if x has a value to make it so. > } > > > > RegionRawOffsetV2 > ----------------- > > In the previous example `p` would have the `&Element{buf,11 S64b,char}` > `SymbolRegionVal` (`SVal` instance) associated with. > And `q`: `&Element{Element{buf,11 S64b,char},0 S64b,int}`. > > However, the `RegionRawOffsetV2` of both `p` and `q` would be the same: > {BaseRegion: `buf`, ByteOffset: `11`} > Therefore,`RegionRawOffsetV2` gives us a unified //view// of the memory we > are dealing with. > It is also useful dealing with pointer aliasing, just in the example with > `p` and `q`. > > In other cases, where we index with a **symbol**, the `ByteOffset` might > contain a complex //symbolic expression//, instead of a `ConcreteInt`. > Eg: the `RegionRawOffsetV2` of `buf[x+1]` will be {BaseRegion: `buf`, > ByteOffset: `x+1`}. > It's important to note that now the //symbolic expression// represented by > the `ByteOffset`, is an expression in the //object-language//, where we > have to deal with overflows, underflows, implicit conversions, promotions > another nasty stuff according to the //abstract machine//. > > //Note: I don't know what `RegionRawOffset` is, or what that does.// > Seems like that is just a pair of a `MemRegion` plus a concrete int `Offset`. And this is the return type for `ElementRegion::getAsArrayOffset()` where we handle only `nonloc::ConcreteInt`s. So, this is similar to RegionRawOffsetV2, but without the possibility of the symbolic index value. > > ElementRegions, again > --------------------- > > ElementRegions are might be nested as in this example: > > void foo(int x) { > int buf[10][3]; > int *p = &buf[1][2]; > clang_analyzer_dump(p); // &Element{Element{buf,1 S64b,int [3]},2 > S64b,int} > clang_analyzer_DumpRegionRawOffsetV2(p); // raw_offset_v2{buf,20 S64b} > > int *q = &buf[x+1][6]; > clang_analyzer_dump(q); // &Element{Element{buf,(reg_$0<int x>) + > 1,int [3]},6 S64b,int} > clang_analyzer_DumpRegionRawOffsetV2(q); // > raw_offset_v2{buf,(((reg_$0<int x>) + 1) * 12) + 24} > } > > In the example, the subscript expressions are expressions in the > //object-language//. > So, when we //touch// (evaluate, simplify, reorder, etc.) the `(reg_$0<int > x>) + 1` we should //thouch// them just as the //abstract machine// would. > > Calculation of the RegionRawOffsetV2 > ------------------------------------ > > Let's see how does these subscript expressions used during the calculation > of the `RegionRawOffsetV2`: > For multi-dimension arrays we //fold// the index expressions of the nested > `ElementRegion`s. > We are doing that by simply calculating the byte-offset of the nested > region, and adding the current level's //index*sizeof(Type)//. > So, the `ByteOffset` that we build up will contain mostly multiplications > by a constant OR additions of symbolic expressions (like the `x+1` in the > example). > We have these lines in the case of handling the ElementRegion: if (!index.getAs<NonLoc>()) return RegionRawOffsetV2(); So, if the index is symbolic we give up, don't we? So, how could this work with `x+1`? What am I missing here? > During this //folding// operation we use the semantics of the > //object-language// in the code < > https://github.com/llvm/llvm-project/blob/bda3dd0d986b33c3a327c0ee0eb8ba43aa140699/clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp#L335-L342 > >. > > The resulting `RegionRawOffsetV2` for `p`, `q` will be: > p: {BaseRegion: `buf`, ByteOffset: `20 S64b`} > q: {BaseRegion: `buf`, ByteOffset: `(((reg_$0<int x>) + 1) * 12) + 24`} > ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^ > | | > `@a` This is an //object-language// expression. --/ | > / > `@b` The rest should be //meta-language// expression. -------/ > > SPOILER: This distinction is really important here. > > So, we made an expression, we should not evaluate in either of the worlds. > We should not evaluate it using the semantics of the //object-language// > since only the `@a` is of that world. > We should model overflow in `@a` if that expression is of //unsigned// > type (OR signed but `-fwrapv`...) etc. according to the //abstract > machine//. > BUT modeling the possibility of an overflow of the rest of the expression > (`@b`) would be a flaw. > Why? I'd expect that the value of the whole expression `@a@b` could overflow. > > This reasoning is the same for the other way around. > > --- > > In general, such //mixed// expressions make no sense at all. > Trying to evaluate such would lead to esoteric bugs just like the > mentioned bug in the summary. > > Simplify step, again > -------------------- > > Simplification of the `(((reg_$0<int x>) + 1) * 12) + 24` < `0` > inequality... > ^^^^^^^^^^ > `@b` > Result: `reg_$0<int x> < -3 S64b` > ^^^^^^^^^^^^^ ^^^^^^^ > `RootSym` --/ | > / > `C'` ----------------------/ > > `#1`: This step supposed to fold **ONLY** the //meta-language// expression > parts (`@b`) of the previously aquired `ByteOffset`. > `#2`: This step requires the expression under simplified to be of > //meta-language// to be able to reorder the constants. (aka. to fold into > the right hand side's constant). > `#3`: This also requires the right-hand side to be of the > //meta-language//. > Do I understand this correctly, that none of the binary operations can have a symbolic RHS, because that would mean we have a VLA? > Cause of the bug > ---------------- > > We don't respect the `#1` and the `#3` requirements of the > //simplification// step. > > We treat the complete expression under //simplification// as an expression > of the //meta-language//. > I'm not changing this, but I probably should though. > Again, I don't understand why you are sure that the value of //whole// expression cannot overflow. > > Ideally, after //simplification// we should have this inequality: `x+1 < > -2` > That way we would not fold any subexpression of the //object-language//, > so the `#1` requirement would be preserved. > The right-hand side is of an expression of the //meta-language//. > It makes sense, to evaluate the `operator<` as the semantics of the > //meta-language//. > But the left-hand side is an expression of the //object-language//. > We need some sort of //logical// conversion here. > What if the second index is also symbolic? E.g `buf[x+1][y+1]`? This would result in `(((reg_$0<int x>) + 1) * 12) + (reg_$1<int y>) + 1)` < `0` . And after simplification, the RHS cannot be interpreted as //meta//. > > If `x+1` is //unsigned// then there is no way that expression could result > a //negative// value, so it will be always less then `-2`. > Therefore, we proved that no out-of-bounds buffer access can happen > //before// the first valid index (0). > > Otherwise, we evaluate the inequality according to the //object-language//. > We call the `evalBinOpNN` to accomplish this query. > That function, however, should respect the `-fwrapv`, overflow/underflow, > and other nasty things of C++ but I don't care how that is implemented. > The checker's job is done at this point. > > Note that, if the //inequality// is of the form `x - C < C'` after the > //ideological-simplification// step, we have to deal with underflow > similarly. > In this patch, I haven't done it though. Since currently, we just fold `C` > into `C'`... > > If the `LHS` is signed, ehm, I would have to think it through - but I'm > too tired of writing this... > > The proposed fix > ---------------- > > I know that this doesn't fix the root of the problem, but we have to start > somewhere :D > > Check if the resulting //folded constant// (`C'`) is negative or not. What happens if the symbolic index is not the most inner index? E.g. `buf[6][x+1]`. Then `C` is no longer constant, and no longer //meta// ... > If it is //negative// and the type of the //root expression// (`RootSym`) > has **unsigned** type then I state that this inequality doesn't hold for > any (unsigned) interpretation of `RootSym`. > So just move on and check the upper-bound... > > --- > > I'm not sure if this fixes all the false-positives, but certainly removes > some of them. > I'm not an expert in //abstract interpretation// and in //formal > verification//. > I hope you enjoyed reading this, now you should take a rest too! > > > Repository: > rG LLVM Github Monorepo > > CHANGES SINCE LAST ACTION > https://reviews.llvm.org/D86874/new/ > > https://reviews.llvm.org/D86874 > >
_______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits