On Apr 18, 2011, at 2:01 AM, 章磊 wrote:
> Hi Ted,
>
> Sorry for this late reply, i focused on something else these days.
>
> I think it's great to keep a statistical checker working fine in IPA mode, so
> if the statistics for a checker can store in the checker instance, i'm glad
> to make it happen.
>
> IMO, statistical checker should use (1) "global" data. When i started this
> job, as you say the lifetime of a Checker object was bound to the lifetime of
> an ExprEngine. But my checker want to gather "global" data, so i creat a
> global DenseMap to store them.
>
> Later, after your reply to my patch, i thought maybe it should be done in two
> steps, first step in the LocationContexts, second step in some structure has
> the lifetime of whole translation unit analysis.
>
> Now, since the Checkers managed by CheckerManager, cam we store the "global"
> data rather than the "local" data in the Checker instance?
Yes, precisely. It's not clear what we will do when we start doing
cross-translation unit analysis, but we can cross that bridge when we come to
it.
>
> Then we can focus on how to correctly gather the statistics in specific
> checker.
>
> What i am afraid most is to miscount the statistics, this will make our
> statistical checker meaningless.
>
> here are two examples, the first one:
>
> int f1();
>
> void f2() {
> f1();
> }
>
> int main() {
> f2();
> f2();
> }
>
> So there are two f2() calls in main. When we inlining the first f2(), f1()'s
> unchecked return count +1; but when we inlining the second f2(), f1()'s
> unchecked return count should not change.
> The unchecked return count should depends on how many times the programmer
> miswrite the code, not how many times a function was called. So it likes a
> "textual" problem.
I think this is easily handled by recording in the Checker whether or not the
body of f2() was already analyzed (and thus don't record any more statistics
for calls within that function body) OR, possibly more precisely, whether or
not a callsite was previously counted when f2() was inlined the first time (or
any other previous time). Strategies like these allow the Checker to be the
decider on what statistics to collect and when. The bookkeeping for this could
be kept as global data within the Checker object itself (just like the
statistics).
>
> Then comes the second example:
>
> int f1();
> int f2(int arg);
> int f3(int arg);
>
> int f4(int arg) {
> int i = f1();
> if (arg) {
> f2(i);
> return 0;
> }
>
> if (i) {
> f3(i);
> return 1;
> }
>
> return 2;
> }
>
> There are several paths in the Example, in some paths i is checked, while not
> in others. How do we count the statistics?
That's ultimately a policy decision on how you want to collect statistics.
Here are a couple possibilities:
(1) Take the average. Count the number of *uses* (not paths) a value was
checked/unchecked within a function body, and then average. That could
represent a single "count" for the call site of the function that returned the
value. For example, there are 2 uses of 'i', one where it is never checked
("f2(i)"), and one where it is always checked ("f3(i)"). The average count
could be 0.5.
(2) Don't take the average, and count by uses instead. That skews to large
functions with many uses.
There are many options. Simple statistical counting will probably not taking
into account correlated counts; e.g. within a given function body certain
functions will consistently be used correctly/incorrectly, and that should be
weighed when doing the global tabulation of the trend.
> IMO, unless in all the paths i was checked, we do the checked count +1;
> otherwise the unchecked count +1;
That's definitely conservative, but it may penalize the amount of information
you extract from a given call site.
>
> After some more thought, i think statistical checkers(unchecked return) need
> not to deactivate IPA. Instead, we can do something like:
>
> When enter a function body in the first time, we do the unchecked return
> analysis and update the statistic.
> When enter a function body what we have analysised before, we deactivate the
> unchecked return checker.
>
That's a reasonable first cut, but even that may be overly pessimistic.
Consider the case where we look at the counts for *f3*:
void f1(int x);
void f2() {
f1(0);
f1(1);
}
int f3();
void f4(int z);
void f1(int x) {
int y = f3();
if (x == 1) {
f4(y);
return;
}
if (z) {
f4(z);
}
}
When we analyze 'f1', if we do it via inlining we may analyze it first with the
parameter being known to be '0'. This means that the return value from f3() is
considered checked. The second time we analyze f1() with the parameter known
to be '1', we return value is not checked. Thus, depending on what assumptions
we make on the input values (and the possible paths they imply), we get
different counts.
In the end, it really depends on the distribution of the data you want to
analyze. Inlining gives you (possibly) more precise information about feasible
paths, which means it could possibly make your statistics more accurate.
Alternatively, it can make your counts skewed. I think if you employ some kind
of averaging, you should be able to reconcile some of these conflicts, merging
counts and in the process making individual counts (potentially) more accurate.
At any rate, once you get a clear counting scheme that takes into account
potentially multiple contexts that might cause the same thing to be counted
multiple times, I think you will have the flexibility of looking at the
statistical results when both inlining is enabled/disabled and be able to
directly compare the results. This will allow you to decide whether inlining
provides any benefits for the statistical checker, or if it just adds noise or
skew to your inference.
Cheers,
Ted
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits