https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96461

            Bug ID: 96461
           Summary: [SVE] Use the HISTCNT instruction for simple histogram
                    loops
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 53947
  Target Milestone: ---
            Target: aarch64*-*-*

SVE2 has a HISTCNT instruction that, for each element, counts
the number of matching elements in that lane and previous lanes.
It allows things like:

void
update (uint32_t *histogram, uint32_t *records, uint32_t num_records)
{
  for (uint32_t i = 0; i < num_records; i++)
    histogram[records[i]] += 1;
}

to be vectorised using a gather load, HISTCNT and scatter store.

The instruction is intended for loops that do more than the bare
minimum above, but as far as autovec goes, we need to start somewhere.
This PR is therefore about recognising the:

    histogram[…] += 1;

gather/HISTCNT/scatter pattern and generating something like the
asm below for the function above:

        mov     x3, #0
        whilelo p0.s, xzr, x2
.loop
        ld1w    z1.s, p0/z, [x1, x3, lsl #2]
        ld1w    z2.s, p0/z, [x0, z1.s, uxtw #2]
        histcnt z0.s, p0/z, z1.s, z1.s
        add     z2.s, p0/m, z2.s, z0.s
        st1w    z2.s, p0, [x0, z1.s, uxtw #2]
        incw    x3
        whilelo p0.s, x3, x2
        b.first .loop
        ret

Once that's done we could move on to fancier cases.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

Reply via email to