donat.nagy added a comment. In D159105#4627785 <https://reviews.llvm.org/D159105#4627785>, @steakhal wrote:
> In D159105#4627724 <https://reviews.llvm.org/D159105#4627724>, @donat.nagy > wrote: > >> I was thinking about adding an improvement like this for the sake of >> consistency, but I fear that this might cause a surprising amount of false >> positives. (Did you test it on larger codebases?) > > I'm just done backporting the first batch to our fork and scheduling a > large-scale measurement. > But my gut instinct suggests there won't be too many new reports, as I > believe "tainted" pointers shouldn't be really a thing, except for really > exotic cases, like `scanf` a pointer - which is dubious at best in presence > of ASLR. > >> As far as I understand (correct me if I'm wrong), there are situations when >> TaintChecker marks the memory region of e.g. a returned string as tainted to >> mark that the //contents// of the region are unreliable. (There may be other >> more exotic situations when even the //pointer value// or the //extent// >> becomes unreliable e.g. your testcases where you `scanf` into a pointer.) > > Taint can only bind to symbols, thus a memregion per-say cannot be tainted, > only the "conjured address of it" or the "symbol bound to the > pointee's/referred lvalue" can be tainted. Thanks for the clarification! As I dig deeper I see that the functions that look like "let's taint a MemRegion" are in fact only affecting `SymbolicRegion`s (targeting the corresponding symbol). This taint may be seen by the logic that checks the taintedness of the ElementRegion -- but as it's limited to symbolic regions it probably won't appear in the underflow test [which is skipped for symbolic regions //in unknown space//]. (There are also some calls which taint the pointee's/referred value -- that won't affect us.) > WDYM by "are unreliable" and "becomes unreliable"? "may be controlled by a malicious actor", basically a synonym of "tainted" but I used "tainted" for things that are marked as tainted by the SA code and "unreliable" for things that would be marked as tainted by an ideal algorithm. >> I fear that the overzealous handling of tainted data would produce too many >> false positives in situations when code indexes strings that contain user >> input and the SA engine cannot understand the logic that calculates the >> indices. For example if I understand it correctly a function like >> >> char f(int n) { >> char *var = getenv("FOO"); >> return var[n]; >> } >> >> would be reported as an out of bounds memory access, because the region is >> tainted and the index value is not known. This is especially problematic on >> the underflow side (which is introduced by your commit), because I'd assume >> that it's common to have numeric values (e.g. function arguments) where the >> programmer knows that they're nonnegative, but the analyzer cannot deduce >> this. > > To me, this is a TP. The envvar might not be present, and even if it's > present, the relation of the string length of `var` to `n` is not expressed > or checked; thus this deserves a report. > But I see your concern, however, I cannot think of valid counterexamples, > aka. FPs off the top of my head. I'll come back once I see the real results. Hmm, you're right that my example is a TP as written; however I could imagine similar FPs where we e.g. check that `n < strlen(var)` but don't check that `n` is nonnegative (because that's the responsibility of the caller). Anyway, let's wait for the test results, they'll be enlightening. (You mostly convinced me that your commit will work, but Murphy's laws are still standing...) >> Also note that the error message "Out of bound memory access (index is >> tainted)" becomes misleading after this patch -- but don't spend too much >> time on resolving this, because right now I'm working on a complete rewrite >> the message generation to replace the current spartan messages with useful >> error reporting. > > I agree, but I'm not sure how much more readable something more accurate like > "the location where this pointer points to potentially depends on untrusted > tainted data". (Note that this would also work for tainted indexes as well). > > I must admit, that reporting and diagnostics were low priorities for this > patch stack, so I'd focus on addressing logic concerns first for the whole > stack and then come back to reporting to make it consistent across the stack. Improving reporting and diagnostics is also on my TODO list, so we should coordinate progress in that area to avoid redundant work. If you wish to work on it, then I'm happy to review; otherwise I'll do something with it soon (perhaps after you merged these commits to avoid merge conflicts). Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D159105/new/ https://reviews.llvm.org/D159105 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits