[st20] seeking implementation advice for stack architectures (long)

Dimitri Gorokhovik Mon, 04 Jul 2005 15:06:35 -0700

I plan to forward-port my ST20 port to mainline. I thought I'd first
describe the port and ask for opinions. Basically, I would like to hear
whether an implementation along these lines would be acceptable in
principle and, if not, what the main no-go points are.


I'll try to keep it short.

The ST20 CPU (a transputer derivative) is a (register-)stack
architecture and has no general-purpose registers. Our ports, too, use
memory cells from the stack frame to allocate pseudos. This makes a lot
of sense on ST20, as the first 16 words of the stack frame have special
properties and are a scarce resource, and reloading these memory cells
from other memory cells actually makes the code smaller and faster.

The first port (by Fabio Riccardi, against 2.95.2) was done in the
common way: mapping pseudos to such "memory registers", then trying to
save the situation in the MACHINE_DEPENDENT_REORG pass by combining
several insns into one, as long as the result evaluates on the
3-level-deep regstack, and then eliminating the memory registers that
become unused.

While maintaining the port to pass the test suite, I had to constantly
extend the machine-dep-reorg pass, which started to seriously resemble
the combine pass in the core. Besides, I felt the port lacked
flexibility, as one couldn't freely use 'define_expand's without also
modifying 'machine-reorg'. Thirdly, combining insns after reload
frequently yielded suboptimal use of memory registers.


The current port, done from scratch, tries to fix the above and employs
the 'combine' pass to its maximum instead of duplicating its code:

-- the .md grammar was extended to describe the separate stack
operations making up an insn (for example: load the first word onto the
stack, load the second word onto the stack, subtract, store the result).
I named the construct 'define_nonterm'. Old-style 'define_insn's are
fully supported and are usually intermixed with the new ones in the .md
file, as some insns don't use the regstack.

-- generated 'recog_insn' is essentially a BURM labeler (I modified the
algorithm from [1] to account for a finite stack depth and hacked
'genrecog' to generate it.)

-- RTL expansion uses mostly 'define_expand' from .md.

-- 'legitimate_address_p' has the usual sense before 'combine'. 

-- 'combine' combines insns into one iff the combination can be BURM-
labeled by 'recog_insn'.

-- past 'combine', 'legitimate_address_p' says OK to any RTX that
evaluates on the regstack of depth 3, whatever the form the address
parts of it might have.

-- at the point where no new insns are emitted and their relative order
won't change anymore (after reload?), the labeled insns are BURM-reduced
back to separate stack operations. This time, regstack registers are
explicitly mentioned in the RTX, and the .md must have a 'define_insn'
matching each stack op. Splitting insns becomes possible again after
this point.

-- RTX_COST returns the BURM cost of the RTL expression.
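
To make the label-then-reduce scheme above more concrete, here is a
minimal sketch in C. It is not the port's code and not the algorithm
from [1]: it uses a toy expression tree instead of RTL, a single
nonterminal, and a cost of one per stack operation. It also assumes
strict left-to-right evaluation, so a binary node needs
max(depth(left), 1 + depth(right)) regstack entries. The names
'label_tree', 'fits_regstack', 'reduce_tree' and the printed mnemonics
are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define REGSTACK_DEPTH 3   /* regstack depth, as stated in the post */

enum op { OP_LOAD, OP_ADD, OP_SUB };

struct node {
    enum op op;
    const struct node *left, *right;  /* NULL for OP_LOAD */
    int slot;                         /* frame slot, used by OP_LOAD only */
};

struct label {
    int cost;    /* number of stack operations needed */
    int depth;   /* peak number of regstack entries in use */
};

/* Bottom-up labeling: children are labeled first, then the parent
   combines their costs and depths (left operand evaluated first). */
struct label label_tree(const struct node *n)
{
    struct label res, l, r;

    assert(n != NULL);
    if (n->op == OP_LOAD) {
        res.cost = 1;
        res.depth = 1;
        return res;
    }
    l = label_tree(n->left);
    r = label_tree(n->right);
    res.cost = l.cost + r.cost + 1;
    /* While the right operand is evaluated, the left result already
       occupies one regstack entry. */
    res.depth = l.depth > r.depth + 1 ? l.depth : r.depth + 1;
    return res;
}

/* Acceptance test in the spirit of the post-combine check: the whole
   expression must evaluate within the finite regstack. */
int fits_regstack(const struct node *n)
{
    return label_tree(n).depth <= REGSTACK_DEPTH;
}

/* Reduction: walk the tree again and append one line per stack
   operation to buf. Requires *pos < size on entry.
   Returns 0 on success, -1 if buf is too small. */
int reduce_tree(const struct node *n, char *buf, size_t size, size_t *pos)
{
    int w;

    assert(n != NULL);
    if (n->op == OP_LOAD) {
        w = snprintf(buf + *pos, size - *pos, "load slot%d\n", n->slot);
    } else {
        if (reduce_tree(n->left, buf, size, pos) != 0)
            return -1;
        if (reduce_tree(n->right, buf, size, pos) != 0)
            return -1;
        w = snprintf(buf + *pos, size - *pos, "%s\n",
                     n->op == OP_ADD ? "add" : "sub");
    }
    if (w < 0 || (size_t)w >= size - *pos)
        return -1;
    *pos += (size_t)w;
    return 0;
}
```

With these assumptions, slot0 - slot1 labels with cost 3 and depth 2
and so fits, while the right-leaning sum a + (b + (c + d)) needs depth
4 and would be rejected. A real labeler would track several
nonterminals and per-rule costs, and could also consider reordering
operands to reduce the depth.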

The downside:

-- the RTX structure had to be extended to store the BURM label data.
This is per-port and may be configured in 'config.sub' via a
corresponding compilation option. In my implementation, 12 to 16 bytes
were added. I didn't attempt to compress this data as proposed in [1].

-- due to the way the substitution in 'combine' works, insns will
frequently lose their BURM label data, thus requiring relabeling (the
result of recognition is less cacheable than before). This probably
could be improved to some extent.

(I followed the posts on GCC's speed and memory consumption, and am very
well aware of the implications this would have. Please just bear in mind
that currently ST20 developers simply have no alternative to the only
compiler there is. If somebody knows a less costly algorithm for BURM-
labeling RTX, please please let me know.)

-- 'extract_operand' and some other routines expecting operands at fixed
places in an insn had to be made more intelligent, so other 'genxxx'
generators had to be modified too.


The results:

-- actually, rather encouraging. The quality of the emitted code is
largely comparable to that of the manufacturer's compiler, sometimes
better, and there is still a small margin for improvement.


[1] C. W. Fraser, D. R. Hanson, T. A. Proebsting, "Engineering a simple,
efficient code-generator generator", LOPLAS vol. 1, issue 3 (Sep. 1992)



Thanks in advance,


Dimitri
