[Bug target/91598] [8/9/10 regression] 60% speed drop on neon intrinsic loop

wilco at gcc dot gnu.org Fri, 30 Aug 2019 04:46:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598


Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm                         |aarch64
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-08-30
                 CC|                            |wilco at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Maxim Kuvyrkov from comment #2)
> Created attachment 46784 [details]
> Patch for 70% of the regression

Confirmed. Note this is not about auto prefetching but basic scheduling for
load latency.

The key issue is the use of asm in arm_neon.h - fixing those will improve
scheduling. It may also be a good idea to fix the scheduler so that it
schedules asm instructions. For example, always use the latencies of the input
registers and assign a fixed latency to outputs depending on the mode (e.g.
integer = 1, FP = 4, int simd = 2).
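
To make the suggested heuristic concrete, here is a minimal sketch in plain C.
It is not GCC scheduler code: the enum, the function names and the cycle model
are all hypothetical, and only the three latency values come from the
suggestion above. An asm is assumed to issue once its last input register is
ready, and its output becomes available a fixed number of cycles later,
chosen by the mode of the output.

```c
#include <stddef.h>

/* Hypothetical mode classes for illustration only; GCC's real machine
   modes are far more fine-grained. */
enum mode_class { MODE_INT, MODE_FP, MODE_INT_SIMD };

/* Fixed latency assumed for an asm output, using the example values
   from the comment (integer = 1, FP = 4, int simd = 2). */
static int asm_output_latency(enum mode_class mc)
{
    switch (mc) {
    case MODE_INT:      return 1;
    case MODE_FP:       return 4;
    case MODE_INT_SIMD: return 2;
    }
    return 1; /* fallback for unknown classes (an assumption) */
}

/* Earliest cycle at which the asm can issue: the cycle in which the
   last of its input registers becomes ready. */
static int asm_issue_cycle(const int *input_ready, size_t n)
{
    int cycle = 0;
    for (size_t i = 0; i < n; i++)
        if (input_ready[i] > cycle)
            cycle = input_ready[i];
    return cycle;
}

/* Cycle at which the asm's output is available to dependent insns. */
static int asm_result_ready(const int *input_ready, size_t n,
                            enum mode_class out_mode)
{
    return asm_issue_cycle(input_ready, n) + asm_output_latency(out_mode);
}
```

With inputs ready at cycles 3, 7 and 5 and an FP output, this model would
issue the asm at cycle 7 and make its result available at cycle 11, so a
scheduler has a reason to place independent instructions in between.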

It's not clear what the point is of the "auto prefetch" scheduling - while it
may be a good idea to order loads/stores on increasing addresses, grouping all
loads or stores together is counterproductive.

[Bug target/91598] [8/9/10 regression] 60% speed drop on neon intrinsic loop
