NoQ added a comment.

Thanks for working on the taint! I really wish the taint analysis in the 
analyzer to flourish, and the part you've digged into is particularly 
sensitive, so i'd dump some thoughts here, hopefully helpful.

**What works as it should: **

> In the second example, the SVal obtained in 
> `GenericTaintChecker::getPointedToSymbol()` is a `LazyCompoundVal` so 
> `SVal::getAsSymbol()` returns a NULL SymbolRef, meaning the symbol is not 
> tainted.

Only symbols currently can, and in my opinion ever should, directly carry taint.

We have symbols, regions, concrete values, and compound values.

A symbol //$s// denotes a particular but unknown value from a certain set 
//**S**// (say, of all integers or all pointers); we are not generally sure if 
it can denote any value from //**S**//, or only from a smaller subset 
//**S'**// within //**S**// (for example, the analyzer may be unaware that 
//$s// is always positive, because the small part of the code it's looking into 
doesn't give any hints). Analyzer also gathers constraints //**C**// when he 
picks execution paths - for instance, on a branch condition "//if ($s < 5)//", 
if we pick the true branch, //**C**// gets cropped into //[-INT_MIN, 4]//, and 
the possible values of //$s// stay within //**S'**// intersected with 
//**C**//, while the analyzer thinks that possible values of //$s// are within 
//**S**// intersected with //**C**//.

A tainted symbol merely corresponds to the special case when we know that 
//**S'** = **S**//, which we normally don't - that is, the analyzer knows that 
for some reason the symbol's value is so much chaotic that it may definitely 
take any value from //**S**// in run-time. The analyzer can know it by seeing 
that the symbol comes from an "untrusted source" such as from "outside".

We do not really talk about tainted regions - instead, we talk about tainted 
pointer values, which are symbols. In this sense, in code

  char buffer[100];

the region of variable `buffer` cannot be tainted. No matter what we do, the 
buffer region itself comes from a perfectly trusted source - it's always in the 
same well-known segment of memory (stack or in global memory). In practice this 
means that the attacker cannot affect or forge the positioning of this segment 
in any way. On the other hand, if //$i// is a tainted index symbol, then region 
//buffer[$i]// of the //$i//'th element would be "said to be tainted" - in the 
sense, its pointer is known to possibly take any value, not necessarily within 
range of the buffer. Similarly, if a pointer-symbol is tainted, the segment of 
memory (region) it points to (called `SymbolicRegion`) is "said to be tainted". 
So //whenever we're talking about a "tainted region", it's always a shorthand 
for talking about a certain tainted symbol//.

Concrete values shouldn't be tainted for the same reason - they are already 
"known", their //**S**// would consist of one element, so there's little 
interest in saying that they can take any value from this set - they're bound 
to take that only value.

Now, we have (possibly lazy) compound values. The whole point of compound 
values is that they consist of multiple other values, and therefore there's 
usually little sense in even considering them as a whole. Some sub-values of 
them may be null, over-constrained, uninitialized, irrelevant (eg. padding), or 
tainted, while other sub-values may be completely different. In your example, 
you obtain a compound value of an array, which consists of all values of all 
elements of the array. There's no point in asking if the compound value itself 
is tainted; instead, you might be interested in particular elements. For 
example, `buffer[1]` may be tainted, while `buffer[2]` was overwritten with 
`'\0'` and is no longer tainted. If you had a structure, you may ask if a 
certain field is tainted, or if all fields are tainted.

**What works but not as it should:**

> In the first example, buf has it's symbol correctly extracted (via 
> `GenericTaintChecker::getPointedToSymbol())` as a 
> `SymbolDerived{conj_$N{char}, element{buf,0 S64b,char}}`.

Nope, unfortunately, in fact this is not correctly extracted either. This 
symbol denotes the value of the first element of the buffer. The taint checker 
marks the first element as tainted, and then later reads that the first element 
is tainted, and throws the warning. However, if we pass, say, `buffer+1` to 
`execl`, it breaks.

The root cause of the problem is how the analyzer denotes casted/typed 
pointers: the `SVal` for a pointer of type `T*` is said to be the element 
region with index 0 of the region behind the pointer. You can read this as "The 
pointer points to the first `T` on this segment of memory". In this case, we 
try to say that the value behind the pointer is tainted, and we forget that the 
value behind the pointer is not only the first element of the array, but the 
whole array, even though the pointer points to the first element only.

How this should work: //we should instead put the taint over the conjured 
symbol//, on which the derived symbol you mention is //based//. The conjured 
symbol denotes the unknown noise that wipes the whole array after reading 
tainted data into it, which in other words means that it denotes the tainted 
data itself. Derived symbols denote values for sections of this noise that 
correspond to sub-regions. You have already seen from the code that derived 
symbols inherit taint from parent symbols (on which they are "based", or from 
which they are "derived"), so if we taint the conjured symbol, then all the 
derived symbols for all elements will be tainted automagically; however, 
elements overwritten later will not carry taint.

Then when you try to find out if the value passed to `execl` is tainted, some 
imagination is required to conduct a proper check. You definitely shouldn't try 
to inspect the same conjured symbol - instead, this time you should try to pay 
attention to the elements. Probably the first element's being tainted is 
already a good time to panic. Probably you could improve upon this by scanning 
until a null character and see how much taint is there. So the 
taint-discovering code currently works more or less correctly, but it requires 
a different version of `getPointedToSymbol` than the taint-origination code 
discussed above.

**What doesn't work and you're trying to fix:**

Now you observe that by changing the array-to-pointer cast syntax slightly, you 
make the analyzer core drop the element region (which, as i mentioned, only 
represents pointer casts) from your lvalue - and receive the rvalue of the 
whole array (the lazy compound value of all elements), instead of the rvalue of 
the first element (the derived symbol).

How it should work? Same as above: when adding the initial taint, we should 
instead put the taint over the conjured symbol. //It is the pointed-to symbol 
you're looking for.// Similarly, when detecting tainted values, we should 
unpack the lazy value and see if its elements or fields in which we are 
interested are tainted.

**Regarding the origin regions:**

Origin regions for `SymbolRegionValue`,  and even more so for`SymbolDerived`, 
and even more so for `LazyCompoundVal`, are immaterial for most purposes. 
Usually the region isn't at fault that a certain tainted value was written into 
it.

Well, in case of `SymbolRegionValue`, he is actually at fault, because if the 
origin region is a pointer that comes from "outside", then the values that came 
with it may also have been forged; after all, `SymbolRegionValue` denotes "what 
have been there in this region before we noticed".

However, the value of `SymbolDerived` is known to be a "sub-section" of the 
parent symbol, and no matter how hard you try to forge the parent region, you 
can't change that fact. So unless the parent symbol is tainted, we shouldn't 
call the derived symbol tainted.

You may say that if the parent region's address can be an arbitrary forged 
pointer, then deriving the symbol may cause buffer overflow and therefore 
denote a forged data. Unfortunately, our implemenetation of the `Store` allows 
such examples, but ideally it shouldn't construct a `SymbolDerived` for such 
cases.

Finally, `LazyCompoundVal`'s origin (parent) region is also not at fault. The 
value only captures the contents of this region. All we know is that at some 
prior point of time contents of this region were as represented by the lazy 
compound value. But because this moment occurs during analysis, the value has 
little connection with the region; this region may have been overwritten many 
times, completely or partially, with values of completely different nature, 
both before and after taking the snapshot we call a lazy compound value.

**What to do:**

I'd suggest to consider moving taint to the invalidation-related conjured 
symbols. It wouldn't be easy because API of the Store doesn't allow obtaining 
this symbol easily, so it needs some work, which would be useful for other 
purposes as well (this is brought up often in C++ discussions, where we also 
need to work with complete structures a lot, eg. iterators).


https://reviews.llvm.org/D28445



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to