On Tue, 30 Nov 2010 12:25:18 +0200
Ira Rosen <ira.ro...@linaro.org> wrote:

> On 22 November 2010 13:46, Ira Rosen <ira.ro...@linaro.org> wrote:
> > On 17 November 2010 13:21, Julian Brown <jul...@codesourcery.com>
> > wrote:
> >>> > We'd need to figure out what the RTL for such loads/stores
> >>> > should look like, and whether it can represent alignment
> >>> > constraints, or strides, or loads/stores of multiple vector
> >>> > registers simultaneously.
> >
> > Alignment info is kept in struct ptr_info_def.
> > Is it necessary to represent stride?
> > Multiple loads/stores seem the most complicated part to me. In
> > neon.md vld is implemented with output_asm_insn. Is it going to
> > change? Does this ensure consecutive (or stride-two) registers?
> >
> >>> > Getting it right might be a bit awkward, especially if we want
> >>> > to consider a scope wider than just NEON, i.e. other vector
> >>> > architectures also.
> >>>
> >>> I think we need to somehow enhance MEM_REF, or maybe generate a
> >>> MEM_REF for the first vector and a builtin after it.
> >>
> >> Yeah, keeping these things looking like memory references to most
> >> of the compiler seems like a good plan.
> >
> > Is it possible to have a list of MEM_REFs and a builtin after them:
> >
> > v0 = MEM_REF (addr)
> > v1 = MEM_REF (addr + 8B)
> > v2 = MEM_REF (addr + 16B)
> > builtin (v0, v1, v2, stride=3, reg_stride=1,...)

Would the builtin be changing the semantics of the preceding MEM_REF
codes? If so, I don't like this much (the potential for the builtin
getting "separated" from the MEM_REFs by optimisation passes and causing
subtle breakage seems too high). But if eliding the builtin would
simply cause the code to degrade into separate loads/stores, I guess
that would be OK.
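
For reference, the kind of source loop all of this is aiming at is a
stride-three interleaved access -- the sort of thing vld3 handles in one
go. A throwaway example, purely illustrative:

  /* De-interleave packed RGB data: in[] is read with stride three, which
     the vectorizer would like to map onto a single vld3 per iteration.  */
  void
  deinterleave_rgb (const unsigned char *in, unsigned char *r,
                    unsigned char *g, unsigned char *b, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        r[i] = in[3 * i];
        g[i] = in[3 * i + 1];
        b[i] = in[3 * i + 2];
      }
  }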

> > to be expanded into:
> >
> > <regular RTL mem refs> (addr)
> > NOTE (...)
> 
> I guess we can do something similar to load_multiple here (but it
> probably requires changes in neon.md as well).

Yeah, I like that idea. So we might have something like:

  (parallel
    [(set (reg) (mem addr))
     (set (reg+2) (mem (plus addr 8)))
     (set (reg+4) (mem (plus addr 16)))])

That should work fine, I think -- but how to do register allocation on
these values remains an open problem (since GCC has no direct way of
saying "allocate this set of registers contiguously"). ARM load & store
multiple are only used in a couple of places, where hard regnos are
already known, so they aren't directly comparable.

Choices I can think of are:

1. Use "big integer" modes (TImode, OImode, XImode...), as in the
   present patterns, but with (post-reload?) splitters to create
   the parallel RTX as above. So, prior to reload, the RTL would look
   different (like the existing patterns, with an UNSPEC?), so as to
   allocate the "big" integer to consecutive vector registers. This
   doesn't really gain anything, and I think it'd really be best if
   those types could be removed from the NEON backend anyway.

2. Use "big vector" modes (representing multiple vector registers --
   up to e.g. V16SImode). We'd have to make sure these *never* end up in
   core registers somehow, since that would certainly lead to reload
   failure. Then the parallel might be written something like this (for
   vld1 with four D registers):

     (parallel
       [(use (reg:V8SI 0))
        (set (subreg:V2SI (match_dup 0) 0) (mem addr))
        (set (subreg:V2SI (match_dup 0) 8) (mem (plus addr 8)))
        (set (subreg:V2SI (match_dup 0) 16) (mem (plus addr 16)))
        (set (subreg:V2SI (match_dup 0) 24) (mem (plus addr 24)))])

   Or perhaps the same but with vec_select instead of subreg (the
   patch I'm working on suggests that subreg on vector types works fine,
   most of the time). This would require altering vld*/vst* intrinsics
   -- but that's something I'm planning to do anyway, and probably
   also tweaking the way "foo.val[X]" accesses (again for intrinsics) are
   expanded in the front-ends, as a NEON-specific hack. The main
   worry is that I'm not sure how well the register allocator & reload
   will handle these large vectors. (There's a rough sketch of the
   expander side of this at the end of the mail.)

   The vectorizer would need a way of extracting elements or vectors
   from these extra-wide vectors: in terms of RTL, subreg or vec_select
   should suffice for that.

3. Treat vld* & vst* like (or even as?) libcalls. Regular function
   calls have kind-of similar constraints on register usage to these
   multi-register operations (i.e. arguments must be in consecutive
   registers), so as a hack we could reuse some of that mechanism (or
   create a similar mechanism), provided that we can live with vld* &
   vst* always working on a fixed list of registers. E.g. we'd end up
   with RTL:

   Store:

    (set (reg:V2SI d0) (...))
    (set (reg:V2SI d1) (...))
     (call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))

   Load:

    (parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
              (set (reg:V2SI d1) (...)))
    (set (...) (reg:V2SI d0))
    (set (...) (reg:V2SI d1))

   (Not necessarily with actual call_insn RTL: I just wrote it like
   that to illustrate the general idea.)

   One could envisage further hacks to lift the restriction on the
   fixed registers used, e.g. by allocating them in a round-robin
   fashion per function. Doing things this way would also require
   intrinsic-expansion changes, so isn't necessarily any easier than
   (2).
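
   For what it's worth, the fixed-register emission on the source side
   of (3) is easy enough; a rough sketch (the fake_vst1 insn itself is
   omitted, and the "+ 2" relies on each D register occupying two hard
   regnos in this port):

    /* Sketch only: move the two vectors into hard registers d0/d1 ahead
       of the fake vst1 "call".  FIRST_VFP_REGNUM is the existing arm.h
       macro; src0/src1 stand for the V2SImode values being stored.  */
    rtx d0 = gen_rtx_REG (V2SImode, FIRST_VFP_REGNUM);
    rtx d1 = gen_rtx_REG (V2SImode, FIRST_VFP_REGNUM + 2);
    emit_move_insn (d0, src0);
    emit_move_insn (d1, src1);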

I think I like the second choice best: I might experiment (from the
intrinsics side), to see how feasible it looks.
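
To make that slightly more concrete, here's a rough, untested sketch of
how an expander might build the (2)-style parallel with the usual
gen_rtx_* helpers. The function name is made up, the "big vector" mode
is assumed to be V8SImode (four D registers), and whether subreg
destinations like these survive the later passes is exactly the open
question above:

  /* Untested sketch: build the choice-(2) parallel for a four-D-register
     vld1.  OP0 is the V8SImode "big vector" destination, OP1 the memory
     operand.  */
  static rtx
  neon_build_vld1_x4 (rtx op0, rtx op1)
  {
    rtvec vec = rtvec_alloc (5);
    int i;

    /* The (use ...) ties all four sets to the single wide pseudo, as in
       the parallel sketched under (2).  */
    RTVEC_ELT (vec, 0) = gen_rtx_USE (VOIDmode, op0);

    for (i = 0; i < 4; i++)
      {
        /* Each D register is one V2SImode, 8-byte chunk of the big
           vector, loaded from the matching 8-byte chunk of memory.  */
        rtx dst = simplify_gen_subreg (V2SImode, op0, GET_MODE (op0), i * 8);
        rtx src = adjust_address (op1, V2SImode, i * 8);
        /* 4.x-era gen_rtx_SET still takes an explicit VOIDmode.  */
        RTVEC_ELT (vec, i + 1) = gen_rtx_SET (VOIDmode, dst, src);
      }

    return gen_rtx_PARALLEL (VOIDmode, vec);
  }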

Julian
