Siddhesh Poyarekar wrote:

> The current cost model will disable reg offset for loads as well as
> stores, which doesn't work well since loads with reg offset are faster
> for falkor.

Why is that a bad thing? With the patch as is, the testcase generates:

.L4:
        ldr     q0, [x2, x3]
        add     x5, x1, x3
        add     x3, x3, 16
        cmp     x3, x4
        str     q0, [x5]
        bne     .L4

With a change in address cost (for loads and stores) we would get:

.L4:
        ldr     q0, [x3], 16
        str     q0, [x4], 16
        cmp     x3, x5
        bne     .L4

This looks better to me, especially if there are more loads and stores
and some have offsets as well (the writeback is once per stream while
the extra add happens for every store). It may be worth trying both
possibilities on a large body of code and see which comes out
smallest/fastest.

Note using the cost model as intended means the compiler tries to use
the lowest cost possibility rather than never emitting the instruction,
not even when optimizing for size. I think it's wrong to always block a
valid instruction.

> Also, this is a very specific tweak for a specific processor, i.e. I
> don't know if there is value in splitting out the costs into loads and
> stores and further into 128-bit and lower just to set the 128 store cost
> higher. That will increase the size of the change by quite a bit and
> may not make it suitable for inclusion into gcc8 at this stage, while
> the current one still qualifies given its contained impact.

It's not clear whether it is easy to split out the costs today (it could
be done in aarch64_rtx_costs but not aarch64_address_cost, and the
latter is what IVOpt uses).

> Further, it seems like worthwhile work only if there are other parts
> that actually have the same quirk and can use this split. Do you know
> of any such cores?

Currently there are several supported CPUs which use a much higher cost
for TImode and for register offsets. So it's a common thing to want,
however I don't know whether splitting load/store address costs helps
for those.

I think a special case for Falkor in aarch64_address_cost would be
acceptable in GCC8 - that would be much smaller and cleaner than the
current patch.
If required we could improve upon this in GCC9 and add a way to
differentiate between loads and stores.

Wilco
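
To illustrate the point about using the cost model as intended, here is a minimal standalone sketch in C. It is not GCC code: the `addr_mode` enum, the `addr_costs` struct, its `reg_offset_128_penalty` field and both functions are hypothetical names made up for this example, and the cost values a caller passes in would be arbitrary. The idea it tries to show is that a core-specific penalty only makes register-offset addressing for 128-bit accesses less attractive, so a selection step that takes the lowest-cost candidate can still pick it when nothing cheaper is available.

```c
#include <stddef.h>

/* Hypothetical addressing modes; not GCC's internal representation. */
enum addr_mode {
  ADDR_IMM_OFFSET,   /* [base, #imm]  */
  ADDR_REG_OFFSET,   /* [base, xN]    */
  ADDR_POST_INDEX,   /* [base], #imm  */
  ADDR_MODE_COUNT
};

/* Hypothetical per-core cost table; lower means cheaper. */
struct addr_costs {
  int base[ADDR_MODE_COUNT];   /* base cost of each addressing mode */
  int reg_offset_128_penalty;  /* extra cost for 16-byte register-offset accesses */
};

/* Cost of one addressing mode for an access of size_bytes bytes.
   The penalty raises the cost; it never makes the mode invalid.  */
int address_cost(const struct addr_costs *c, enum addr_mode mode,
                 int size_bytes)
{
  int cost = c->base[mode];
  if (mode == ADDR_REG_OFFSET && size_bytes == 16)
    cost += c->reg_offset_128_penalty;
  return cost;
}

/* Return the cheapest of the n candidate modes (n must be at least 1).
   On a tie the earlier candidate wins.  */
enum addr_mode cheapest_mode(const struct addr_costs *c,
                             const enum addr_mode *candidates, size_t n,
                             int size_bytes)
{
  enum addr_mode best = candidates[0];
  for (size_t i = 1; i < n; i++)
    if (address_cost(c, candidates[i], size_bytes)
        < address_cost(c, best, size_bytes))
      best = candidates[i];
  return best;
}
```

With a non-zero penalty, a 16-byte access that could use either register offset or post-index would come out as post-index, while an access whose only candidate is register offset would still get it. That is the difference between a high cost and blocking the instruction outright.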