On 09/04/2017 12:12 AM, Xinliang David Li wrote:


On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <hfin...@anl.gov> wrote:


    On 09/03/2017 11:06 PM, Xinliang David Li wrote:
    I think we can think about this in another way.

    For modern CPU architectures that support store forwarding with
    store queues, it is generally not "safe" to blindly do local
    optimizations that widen loads/stores

    Why not widen stores? Generally the problem with store forwarding
    is where the load is wider than the store (plus alignment issues).


True, but probably with some caveats that are target dependent. Store widening also requires additional bit operations (and possibly an additional load), so it is tricky to model the overall benefit.
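
For example, widening a byte-sized bitfield store to its 64-bit container roughly turns a single one-byte store into a load/mask/merge/store sequence. A minimal sketch (illustrative only; the names are invented and a little-endian layout with the field at bits 8..15 is assumed):

#include <cstdint>

// container word standing in for a struct full of bitfields
uint64_t container;

// Narrow store: the 8-bit field at bits 8..15 is written with a
// single one-byte store.
void store_narrow(uint8_t v) {
  reinterpret_cast<uint8_t *>(&container)[1] = v;
}

// Widened store: needs an extra load of the container plus bit
// operations to merge the new field value before the full-width store.
void store_wide(uint8_t v) {
  uint64_t old = container;              // additional load
  old &= ~(0xffull << 8);                // clear the field's bits
  old |= static_cast<uint64_t>(v) << 8;  // merge the new value
  container = old;                       // 8-byte store
}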

    without sophisticated inter-procedural analysis. Doing so runs the
    risk of greatly reducing the performance of some programs. Keeping
    naturally aligned loads/stores at their natural types is safer.

    Does it make sense?

    It makes sense. I could, however, say the same thing about
    inlining. We need to make inlining decisions locally, but they
    have global impact. Nevertheless, we need to make inlining
    decisions, and there's no practical way to make them in a truly
    non-local way.


Speaking of inlining, we are actually thinking of ways to make the decisions more globally optimal, but that is off topic.

Neat.


    We also don't pessimize common cases to improve performance in
    rare cases. In the common case, reducing pressure on the memory
    units, and reducing the critical path, seem likely to me to be
    optimal. If that's not true, or doing otherwise has negligible
    cost (but can prevent rare downsides), we should certainly
    consider those options.


Since we don't do load widening for non-bitfield cases (only the very limited case of naturally aligned bitfields), it is hard to say we pessimize common cases for rare cases:

1) The upside of widening such accesses is not high, from experience with another compiler (which does not do so).
2) There is an obvious downside of introducing additional extract instructions, which degrades performance (see the sketch below).
3) There is an obvious downside of severely degrading performance when store forwarding is blocked.
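
To make (2) concrete: a widened bitfield load has to re-extract the field from the wide value. A sketch (illustrative only; invented names, little-endian layout, and the same 8-bit field at bits 8..15 as above):

#include <cstdint>

// container word standing in for a struct full of bitfields
uint64_t container;

// Natural-width access: a single one-byte load.
uint8_t load_narrow() {
  return reinterpret_cast<const uint8_t *>(&container)[1];
}

// Widened access: an 8-byte load plus shift/mask code to extract
// the field again.
uint8_t load_wide() {
  uint64_t v = container;                        // wide load
  return static_cast<uint8_t>((v >> 8) & 0xff);  // extra extract work
}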

I suspect that it's relatively rare to hit these store-to-load forwarding issues compared to the number of times a system stores to, or loads from, bitfields. In any case, I did some experiments on my Haswell system and found that the version of Wei's benchmark in which the load is split into two loads is 0.012% slower than the single-load version. I indeed won't worry about that too much. On my P8, I couldn't measure a difference. Obviously, this does somewhat miss the point, as the real cost of this kind of thing comes from stressing the memory units in code with a lot going on, not in isolated cases.

Nevertheless, I think that you've convinced me that this is a least-bad solution. I'll want a flag preserving the old behavior. Something like -fwiden-bitfield-accesses (modulo bikeshedding).




    And none of this answers the question of whether it's better to
    have the store wider or the load split and narrow.



It seems safer to do store widening more aggressively to avoid store-forwarding stalls, but doing so may also introduce other runtime overhead (an extra load, data merging, etc.).

Yes. Wei confirmed that this is slower.

Thanks again,
Hal


Thanks,

David


    Thanks again,
    Hal



    David



    On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfin...@anl.gov> wrote:


        On 09/03/2017 10:38 PM, Xinliang David Li wrote:
        The cost of a store-forwarding stall is usually much higher
        than that of a load hitting the L1 cache. For instance, on
        Haswell, the latter is ~4 cycles, while a store-forwarding
        stall costs about 10 cycles more than a successful store
        forward, which is roughly 15 cycles. In some scenarios,
        store-forwarding stalls can be as high as 50 cycles. See
        Agner's documentation.

        I understand. As I understand it, there are two potential
        ways to fix this problem:

         1. You can make the store wider (to match the size of the
        wide load, thus permitting forwarding).
         2. You can make the load smaller (to match the size of the
        small store, thus permitting forwarding).

        At least in this benchmark, which is a better solution?

        Thanks again,
        Hal



        In other words, the optimizer needs to be taught to avoid
        defeating this HW pipeline feature as much as possible.

        David

        On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfin...@anl.gov> wrote:


            On 09/03/2017 03:44 PM, Wei Mi wrote:

                On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:

                    On 08/22/2017 10:56 PM, Wei Mi via llvm-commits
                    wrote:

                        On Tue, Aug 22, 2017 at 7:03 PM, Xinliang
                        David Li <davi...@google.com> wrote:


                            On Tue, Aug 22, 2017 at 6:37 PM,
                            Chandler Carruth via Phabricator
                            <revi...@reviews.llvm.org> wrote:

                                chandlerc added a comment.

                                I'm really not a fan of the degree
                                of complexity and subtlety that this
                                introduces into the frontend, all to
                                allow particular backend
                                optimizations.

                                I feel like this is Clang working
                                around a fundamental deficiency in
                                LLVM
                                and we should instead find a way to
                                fix this in LLVM itself.

                                As has been pointed out before, user
                                code can synthesize large integers
                                that small bit sequences are
                                extracted from, and Clang and LLVM
                                should
                                handle those just as well as actual
                                bitfields.

                                Can we see how far we can push the
                                LLVM side before we add complexity to
                                Clang here? I understand that there
                                remain challenges to LLVM's stuff,
                                but I
                                don't think those challenges make
                                *all* of the LLVM improvements off the
                                table, I don't think we've exhausted
                                all ways of improving the LLVM
                                changes
                                being proposed, and I think we
                                should still land all of those and
                                re-evaluate how important these
                                issues are when all of that is in place.


                            The main challenge of doing this in LLVM
                            is that inter-procedural (and possibly
                            cross-module) analysis is needed (for
                            store forwarding issues).

                            Wei, perhaps you can provide a concrete
                            test case to illustrate the issue so that
                            reviewers have a good understanding.

                            David

                        Here is a runnable testcase:
                        -------------------- 1.cc ------------------------
                        class A {
                        public:
                            unsigned long f1:2;
                            unsigned long f2:6;
                            unsigned long f3:8;
                            unsigned long f4:4;
                        };
                        A a;
                        unsigned long b;
                        unsigned long N = 1000000000;

                        __attribute__((noinline))
                        void foo() {
                            a.f3 = 3;
                        }

                        __attribute__((noinline))
                        void goo() {
                            b = a.f3;
                        }

                        int main() {
                            unsigned long i;
                            for (i = 0; i < N; i++) {
                              foo();
                              goo();
                            }
                        }
                        
                        ------------------------------------------------------------
                        Now trunk takes about twice the running time
                        compared with trunk + this patch. That is
                        because trunk shrinks the store of a.f3 in foo
                        (done by DAGCombiner) but does not shrink the
                        load of a.f3 in goo, so store forwarding will
                        be blocked.
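
                        Roughly, in C++ terms, the mismatch looks like
                        this (an illustrative sketch, not the actual
                        generated code; names are invented and a
                        little-endian layout is assumed, so f3 lives
                        in byte 1 of the underlying word):

                        #include <cstdint>

                        uint64_t a_word;  // stands in for the storage of 'a'
                        uint64_t b_out;   // stands in for 'b'

                        void foo_shrunk() {
                          // the store of a.f3 is shrunk to one byte
                          reinterpret_cast<uint8_t *>(&a_word)[1] = 3;
                        }

                        void goo_unshrunk() {
                          // the load stays at the full 64 bits; it
                          // overlaps the pending one-byte store, so
                          // it cannot be forwarded and stalls
                          b_out = (a_word >> 8) & 0xff;
                        }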


                    I can confirm that this is true on Haswell and
                    also on a POWER8. Interestingly, on a POWER7, the
                    performance is the same either way (on the good
                    side). I ran the test case as presented and a
                    version where I replaced f3 with a non-bitfield
                    unsigned char member. Thinking that the POWER7
                    result might be because it's big-endian, I flipped
                    the order of the fields, and found that the
                    version where f3 is not a bitfield is faster than
                    otherwise, but only by 12.5%.

                    Why, in this case, don't we shrink the load? It
                    seems like we should (and it
                    seems like a straightforward case).

                    Thanks again,
                    Hal

                Hal, thanks for trying the test.

                Yes, it is straightforward to shrink the load in the
                test. I can change the testcase a little to show why
                it is sometimes difficult to shrink the load:

                class A {
                public:
                   unsigned long f1:16;
                   unsigned long f2:16;
                   unsigned long f3:16;
                   unsigned long f4:8;
                };
                A a;
                bool b;
                unsigned long N = 1000000000;

                __attribute__((noinline))
                void foo() {
                   a.f4 = 3;
                }

                __attribute__((noinline))
                void goo() {
                   b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
                }

                int main() {
                   unsigned long i;
                   for (i = 0; i < N; i++) {
                     foo();
                     goo();
                   }
                }

                For the load of a.f4 in goo, it is difficult to
                motivate shrinking it after instcombine, because the
                comparison with a.f3 and the comparison with a.f4 are
                merged:

                define void @_Z3goov() local_unnamed_addr #0 {
                   %1 = load i64, i64* bitcast (%class.A* @a to
                i64*), align 8
                   %2 = and i64 %1, 0xffffff00000000
                   %3 = icmp eq i64 %2, 0xffff00000000
                   %4 = zext i1 %3 to i8
                   store i8 %4, i8* @b, align 1, !tbaa !2
                   ret void
                }


            Exactly. But this behavior is desirable, locally.
            There's no good answer here: We either generate extra
            load traffic here (because we need to load the fields
            separately), or we widen the store (which generates
            extra load traffic there). Do you know, in terms of
            performance, which is better in this case (i.e., is it
            better to widen the store or split the load)?
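
            For concreteness, the split-load option for this goo()
            might look roughly like the following (an illustrative
            C++ sketch, not the proposed implementation; names are
            invented and a little-endian layout is assumed, so f3
            occupies bytes 4..5 and f4 byte 6):

            #include <cstdint>
            #include <cstring>

            uint64_t a_word;  // stands in for the storage of 'a'
            bool b_out;       // stands in for 'b'

            void goo_split() {
              const uint8_t *p =
                  reinterpret_cast<const uint8_t *>(&a_word);
              // One-byte load of f4: narrow enough to be forwarded
              // from foo()'s one-byte store, at the cost of a
              // second load for f3.
              uint8_t f4 = p[6];
              uint16_t f3;
              std::memcpy(&f3, p + 4, sizeof(f3));
              b_out = (f4 == 0 && f3 == static_cast<uint16_t>(-1));
            }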

             -Hal



                Thanks,
                Wei.

                        The testcase shows the potential problem of
                        store shrinking. Before we decide to do store
                        shrinking, we need to know that all the
                        related loads will be shrunk, and that
                        requires IPA analysis. Otherwise, when load
                        shrinking is blocked for some difficult case
                        (like the instcombine case described in
                        https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
                        a performance regression will happen.

                        Wei.



                                Repository:
                                    rL LLVM

                                https://reviews.llvm.org/D36562



                        _______________________________________________
                        llvm-commits mailing list
                        llvm-comm...@lists.llvm.org
                        http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits


                    --
                    Hal Finkel
                    Lead, Compiler Technology and Programming Languages
                    Leadership Computing Facility
                    Argonne National Laboratory


            --
            Hal Finkel
            Lead, Compiler Technology and Programming Languages
            Leadership Computing Facility
            Argonne National Laboratory



        --
        Hal Finkel
        Lead, Compiler Technology and Programming Languages
        Leadership Computing Facility
        Argonne National Laboratory



    --
    Hal Finkel
    Lead, Compiler Technology and Programming Languages
    Leadership Computing Facility
    Argonne National Laboratory



--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
