On 09/04/2017 12:12 AM, Xinliang David Li wrote:
On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 09/03/2017 11:06 PM, Xinliang David Li wrote:
I think we can look at this another way.
For modern CPU architectures that support store forwarding with
store queues, it is generally not "safe" to blindly do local
optimizations that widen loads/stores.
Why not widen stores? Generally the problem with store forwarding
is where the load is wider than the store (plus alignment issues).
True, but probably with some caveats that are target dependent.
Store widening also requires additional bit operations (and possibly
an additional load), so it is tricky to model the overall benefit.
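As a rough illustration (a hand-written sketch, not actual compiler
output, assuming the class A layout from Wei's testcase quoted below and
a little-endian target such as x86-64), widening the one-byte store
a.f3 = 3 to the full 64-bit storage unit turns it into a read-modify-write:

#include <cstdint>
#include <cstring>

// Same bitfield layout as class A below: f1:2, f2:6, f3:8, f4:4 share one
// 64-bit storage unit; on a little-endian target f3 occupies byte 1.
alignas(8) unsigned char unit[8];

// Natural-width store of a.f3 = 3: a single one-byte store.
void store_f3_narrow() {
  unit[1] = 3;
}

// Widened store of a.f3 = 3: load the whole unit, clear f3's bits, merge in
// the new value, and store the whole unit back -- an extra load plus the
// masking/merging bit operations mentioned above.
void store_f3_wide() {
  std::uint64_t w;
  std::memcpy(&w, unit, sizeof(w));
  w = (w & ~std::uint64_t{0xFF00}) | (std::uint64_t{3} << 8);
  std::memcpy(unit, &w, sizeof(w));
}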
Without sophisticated inter-procedural analysis, doing so runs the
risk of greatly reducing the performance of some programs. Keeping
the naturally aligned load/store at its natural type is safer.
Does that make sense?
It makes sense. I could, however, say the same thing about
inlining. We need to make inlining decisions locally, but they
have global impact. Nevertheless, we need to make inlining
decisions, and there's no practical way to make them in a truly
non-local way.
Speaking of inlining, we are actually thinking of ways to make the
decisions more globally optimal, but that is off topic.
Neat.
We also don't pessimize common cases to improve performance in
rare cases. In the common case, reducing pressure on the memory
units, and reducing the critical path, seem likely to me to be
optimal. If that's not true, or doing otherwise has negligible
cost (but can prevent rare downsides), we should certainly
consider those options.
Since we don't do load widening for non-bitfield cases (only the
very limited case of naturally aligned bitfields), it is hard to
say we pessimize common cases for rare ones:
1) the upside of widening such accesses is not high, judging from
experience with another compiler (which does not do so);
2) there is an obvious downside of introducing additional extract
instructions, which degrades performance (see the sketch below);
3) there is an obvious downside of severely degrading performance when
store forwarding is blocked.
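To make point 2 concrete, here is a hand-written sketch (not actual
codegen; same assumed bitfield layout and endianness as the earlier
sketch) of the extract operations a widened bitfield load implies:

#include <cstdint>
#include <cstring>

// f1:2, f2:6, f3:8, f4:4 in one 64-bit storage unit; on a little-endian
// target f3 occupies byte 1.
alignas(8) unsigned char unit[8];

// Natural-width load of a.f3: one one-byte load, no extra instructions.
unsigned long load_f3_narrow() {
  return unit[1];
}

// Widened load of a.f3: load the whole 64-bit unit, then shift and mask to
// extract the field -- the additional extract instructions in point 2.
unsigned long load_f3_wide() {
  std::uint64_t w;
  std::memcpy(&w, unit, sizeof(w));
  return (w >> 8) & 0xFF;
}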
I suspect that it's relatively rare to hit these store-to-load
forwarding issues compared to the number of times a system stores to or
loads from bitfields. In any case, I did some experiments on my Haswell
system and found that the version of Wei's benchmark in which the load
is split into two loads is 0.012% slower than the single-load version. I,
indeed, won't worry about that too much. On my P8, I couldn't measure a
difference. Obviously, this does somewhat miss the point, as the real
cost in this kind of thing comes from stressing the memory units in code
with a lot going on, not in isolated cases.
Nevertheless, I think that you've convinced me that this is a least-bad
solution. I'll want a flag preserving the old behavior. Something like
-fwiden-bitfield-accesses (modulo bikeshedding).
And none of this answers the question of whether it's better to
have the store wider or the load split and narrow.
It seems safer to do store widening more aggressively to avoid the
store-forwarding stall issue, but doing so may also introduce
other runtime overhead (an extra load, data merging, etc.).
Yes. Wei confirmed that this is slower.
Thanks again,
Hal
Thanks,
David
Thanks again,
Hal
David
On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 09/03/2017 10:38 PM, Xinliang David Li wrote:
The store-forwarding stall cost is usually much higher than that of
a load hitting the L1 cache. For instance, on Haswell, the
latter is ~4 cycles, while a store-forwarding stall costs
about 10 cycles more than a successful store forward,
roughly 15 cycles in total. In some scenarios, store-forwarding
stalls can be as high as 50 cycles. See Agner's documentation.
I understand. As I understand it, there are two potential
ways to fix this problem:
1. You can make the store wider (to match the size of the
wide load, thus permitting forwarding).
2. You can make the load smaller (to match the size of the
small store, thus permitting forwarding).
At least in this benchmark, which is a better solution?
Thanks again,
Hal
In other words, the optimizer needs to be taught to avoid
defeating the HW pipeline feature as much as possible.
David
On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 09/03/2017 03:44 PM, Wei Mi wrote:
On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:
On 08/22/2017 10:56 PM, Wei Mi via llvm-commits
wrote:
On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com> wrote:
On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <revi...@reviews.llvm.org> wrote:
chandlerc added a comment.
I'm really not a fan of the degree of complexity and subtlety that this
introduces into the frontend, all to allow particular backend
optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM
and we should instead find a way to fix this in LLVM itself. As has been
pointed out before, user code can synthesize large integers that small bit
sequences are extracted from, and Clang and LLVM should handle those just
as well as actual bitfields.

Can we see how far we can push the LLVM side before we add complexity to
Clang here? I understand that there remain challenges to LLVM's stuff, but
I don't think those challenges make *all* of the LLVM improvements off the
table, I don't think we've exhausted all ways of improving the LLVM changes
being proposed, and I think we should still land all of those and
re-evaluate how important these issues are when all of that is in place.
The main challenge of doing this in LLVM is that inter-procedural
(and possibly cross-module) analysis is needed (for store forwarding
issues).

Wei, perhaps you can provide a concrete test case to illustrate the issue
so that reviewers have a good understanding.
David
Here is a runnable testcase:
-------------------- 1.cc ------------------------
class A {
public:
  unsigned long f1:2;
  unsigned long f2:6;
  unsigned long f3:8;
  unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f3 = 3;
}

__attribute__((noinline))
void goo() {
  b = a.f3;
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}
------------------------------------------------------------
Now trunk takes about twice the running time compared with trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by
DAGCombiner) but does not shrink the load of a.f3 in goo, so store
forwarding will be blocked.
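Spelled out (an illustrative little-endian sketch of roughly what the
generated code amounts to, not the actual assembly): the shrunk store in
foo writes one byte, while the unshrunk load in goo still reads the whole
eight-byte unit that contains it, so the store cannot be forwarded.

#include <cstdint>
#include <cstring>

// The 64-bit storage unit holding a's bitfields; on a little-endian target
// f3 occupies byte 1.
alignas(8) unsigned char unit[8];

// foo() after DAGCombiner shrinks the store: a single one-byte store.
void foo_shrunk_store() {
  unit[1] = 3;  // a.f3 = 3
}

// goo() on trunk: the load is not shrunk, so it reads all eight bytes and
// extracts f3. The load is wider than (and overlaps) the one-byte store
// above, so the store queue cannot forward the data and every iteration of
// main() pays a store-forwarding stall.
unsigned long goo_unshrunk_load() {
  std::uint64_t w;
  std::memcpy(&w, unit, sizeof(w));
  return (w >> 8) & 0xFF;  // b = a.f3
}

// With the patch, the load keeps f3's natural one-byte width; it matches
// the store exactly, so forwarding succeeds.
unsigned long goo_natural_load() {
  return unit[1];  // b = a.f3
}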
I can confirm that this is true on Haswell and also on a POWER8.
Interestingly, on a POWER7, the performance is the same either way (on the
good side). I ran the test case both as presented and with f3 replaced by a
non-bitfield unsigned char member. Thinking that the POWER7 result might be
because it's big-endian, I flipped the order of the fields, and found that
the version where f3 is not a bitfield is faster than otherwise, but only by
12.5%.

Why, in this case, don't we shrink the load? It seems like we should (and it
seems like a straightforward case).
Thanks again,
Hal
Hal, thanks for trying the test.

Yes, it is straightforward to shrink the load in the test. I can change the
testcase a little to show why it is sometimes difficult to shrink the load:
class A {
public:
  unsigned long f1:16;
  unsigned long f2:16;
  unsigned long f3:16;
  unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}
For the load of a.f4 in goo, it is difficult to motivate shrinking it after
instcombine, because the comparison with a.f3 and the comparison with
a.f4 are merged:
define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 0xffffff00000000
  %3 = icmp eq i64 %2, 0xffff00000000
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}
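For contrast, a hand-written sketch (not compiler output; it assumes the
little-endian layout of this class A, with f3 in bytes 4-5 and f4 in byte 6
of the 64-bit unit) of what a shrunk-load goo would have to do instead of
the merged form above:

#include <cstdint>
#include <cstring>

// 64-bit storage unit of this class A: f1:16, f2:16, f3:16, f4:8.
alignas(8) unsigned char unit[8];

// Roughly the merged form above: one wide load, one mask, one compare
// covering both f3 and f4.
bool goo_merged() {
  std::uint64_t w;
  std::memcpy(&w, unit, sizeof(w));
  return (w & 0x00FFFFFF00000000ull) == 0x0000FFFF00000000ull;
}

// A shrunk-load form would need two separate narrow loads plus two compares
// -- extra load traffic -- in exchange for a one-byte f4 load that foo()'s
// one-byte store can be forwarded to.
bool goo_split() {
  std::uint16_t f3;
  std::uint8_t f4;
  std::memcpy(&f3, unit + 4, sizeof(f3));
  std::memcpy(&f4, unit + 6, sizeof(f4));
  return f4 == 0 && f3 == 0xFFFF;
}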
Exactly. But this behavior is desirable, locally.
There's no good answer here: We either generate extra
load traffic here (because we need to load the fields
separately), or we widen the store (which generates
extra load traffic there). Do you know, in terms of
performance, which is better in this case (i.e., is it
better to widen the store or split the load)?
-Hal
Thanks,
Wei.
The testcase shows the potential problem of store shrinking. Before
we decide to do store shrinking, we need to know that all the related loads
will be shrunk, and that requires inter-procedural analysis. Otherwise, when
load shrinking is blocked for some difficult case (like the instcombine
case described in
https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
a performance regression will happen.

Wei.
Repository:
rL LLVM
https://reviews.llvm.org/D36562
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits