Hello,
As I don't really know where to send this email, I thought you might be able
to redirect it to the right place.
I have some questions about vectorization, which I would like to support on
a special MIPS architecture: Allegrex with VFPU.
First, an explanation of what Allegrex is:
Allegrex is based on MIPS II with some MIPS III additions and has a
standard FPU. It has no 64-bit ISA for either the CPU or the FPU (so long
long and double have to be handled in software).
Some instructions usually found in the OPCODE3 map are in the OPCODE2 map
on Allegrex (min, max, madd(u), msub(u), etc.).
Allegrex also has a coprocessor 2 called the VFPU. This coprocessor is
quite powerful and easy to use, and can even be used as a replacement for
the FPU.
Details about the VFPU:
It has 128 32-bit single-precision floating-point registers, which
are organized in 8 banks of 4x4 matrices:
         row 0                 row 1                 row 2                 row 3
bank 0:  $0   $1   $2   $3     $32  $33  $34  $35    $64  $65  $66  $67    $96  $97  $98  $99
bank 1:  $4   $5   $6   $7     $36  $37  $38  $39    $68  $69  $70  $71    $100 $101 $102 $103
bank 2:  $8   $9   $10  $11    $40  $41  $42  $43    $72  $73  $74  $75    $104 $105 $106 $107
bank 3:  $12  $13  $14  $15    $44  $45  $46  $47    $76  $77  $78  $79    $108 $109 $110 $111
bank 4:  $16  $17  $18  $19    $48  $49  $50  $51    $80  $81  $82  $83    $112 $113 $114 $115
bank 5:  $20  $21  $22  $23    $52  $53  $54  $55    $84  $85  $86  $87    $116 $117 $118 $119
bank 6:  $24  $25  $26  $27    $56  $57  $58  $59    $88  $89  $90  $91    $120 $121 $122 $123
bank 7:  $28  $29  $30  $31    $60  $61  $62  $63    $92  $93  $94  $95    $124 $125 $126 $127
$0..$127: 32-bit scalar floats.
These registers can be accessed in scalar format, in row or column
vector format, and in matrix or transposed matrix format.
When accessing a row or a column, we can work with 2, 3 or 4
components, as long as all the registers involved are inside the same
bank.
When accessing a matrix, we can work with 2, 3 or 4 rows or
columns, as long as all the registers involved are inside the same bank.
Basically, operations on the VFPU look like this:
vadd.s S000, S010, S020 : $0 = $1 + $2
vadd.t R000, R001, C030 : {$0, $1, $2} = {$32, $33, $34} + {$3, $35, $67}
    <=> par { $0 = $32 + $3; $1 = $33 + $35; $2 = $34 + $67; }
vtfm.p R200, M000, C100 : {$8, $9} = {{$0, $1}, {$32, $33}} x {$4, $36}
    <=> par { $8 = $0 * $4 + $1 * $36; $9 = $32 * $4 + $33 * $36; }
etc.
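
To make the numbering used in these examples concrete, here is a minimal C sketch. It assumes, based only on the examples above, that the three digits of a register name are (bank, column, row), that banks are 4 register numbers apart and rows 32 apart. The function names (vfpu_reg, vfpu_row, vfpu_col, vfpu_vadd) are hypothetical helpers written for this mail; they are not part of psp-gcc or of any toolchain.

```c
#include <assert.h>

/* Register number of the element at (bank, column, row).
   Assumed layout: 4 consecutive numbers per row of a bank,
   banks 4 apart, rows 32 apart. */
int vfpu_reg(int bank, int col, int row)
{
    assert(bank >= 0 && bank < 8);
    assert(col >= 0 && col < 4 && row >= 0 && row < 4);
    return bank * 4 + col + row * 32;
}

/* Row vector of n components starting at (bank, col, row):
   walks along the columns of one row. */
void vfpu_row(int bank, int col, int row, int n, int out[])
{
    assert(n >= 1 && col + n <= 4);   /* must stay inside the bank */
    for (int k = 0; k < n; k++)
        out[k] = vfpu_reg(bank, col + k, row);
}

/* Column vector of n components starting at (bank, col, row):
   walks down the rows of one column. */
void vfpu_col(int bank, int col, int row, int n, int out[])
{
    assert(n >= 1 && row + n <= 4);   /* must stay inside the bank */
    for (int k = 0; k < n; k++)
        out[k] = vfpu_reg(bank, col, row + k);
}

/* Emulates the "par { ... }" semantics of a vector add: all sources
   are read before any destination is written. */
void vfpu_vadd(float regs[128], const int d[], const int s[],
               const int t[], int n)
{
    float tmp[4];
    assert(n >= 1 && n <= 4);
    for (int k = 0; k < n; k++)
        tmp[k] = regs[s[k]] + regs[t[k]];
    for (int k = 0; k < n; k++)
        regs[d[k]] = tmp[k];
}
```

With these helpers, vfpu_row(0, 0, 1, 3, r) yields the registers of R001 and vfpu_col(0, 3, 0, 3, c) those of C030, matching the vadd.t line above.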
Now the questions:
I would like to extend the use of the VFPU in psp-gcc, the free PSP port of
gcc, through vectorization.
* How can I make SFmode values in FPU registers and in VFPU registers
coexist in the argument list of a function? There is
no direct transfer between them: you need to go through a GPR or
a memory slot, which is very slow. I
would like to be able to distinguish them through an attribute. Are
there any examples which address this problem?
* Another way to distinguish a VFPU scalar would be to use "typedef float
__attribute__((vector_size(4))) V1SF;". Would it be difficult to make
this possible (right now, gcc refuses it)?
* Same question for V3SF: would it be difficult to make it possible?
* V2SF and V4SF are possible (they are row vectors of
two and four components respectively). If I choose to let gcc allocate only the
first 32 VFPU registers, each of the others being associated with one of
the first 32 registers, would it be difficult to have combined modes V2V1SF,
V3V1SF and V4V1SF to define column vectors of two, three or four
components, and V2V2SF, V3V3SF and V4V4SF for matrices?
Right now, a vector needs a power-of-two number of components, which
rules out autovectorization of 3D code.
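
As an illustration of the current situation, here is a small sketch using GCC's generic vector extension. The type names v2sf and v4sf and the two functions are hypothetical, chosen for this mail. Carrying a 3D vector in a 4-component type with an unused lane is only one possible workaround, shown here as an assumption about how one might cope today, not as what psp-gcc does.

```c
#include <assert.h>

/* GCC vector extension: vector_size is given in bytes. */
typedef float v2sf __attribute__((vector_size(8)));   /* 2 floats */
typedef float v4sf __attribute__((vector_size(16)));  /* 4 floats */

static_assert(sizeof(v2sf) == 2 * sizeof(float), "v2sf holds two floats");
static_assert(sizeof(v4sf) == 4 * sizeof(float), "v4sf holds four floats");

/* No 3-component type is available, so a 3D vector is carried
   in a v4sf here and the fourth lane is treated as padding. */
v4sf add3_padded(v4sf a, v4sf b)
{
    return a + b;               /* element-wise add on all four lanes */
}

float dot3_padded(v4sf a, v4sf b)
{
    v4sf p = a * b;             /* element-wise multiply */
    return p[0] + p[1] + p[2];  /* lane 3 is ignored */
}
```

The padded form wastes a lane and a quarter of the storage, which is what a real V3SF with size 3*sizeof(float) would avoid.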
V4SF means allocation of 4 contiguous registers among the first 32
registers: { i, i+1, i+2, i+3 } with 0 <= i < 32 and i%4 = 0. Its size
is 4*sizeof(float). For the VFPU, it would describe a typical 4D row vector.
V4V1SF means allocation of 1 register among the first 32 registers: {
i, i+32, i+64, i+96 } with 0 <= i < 32. But its size is 4*sizeof(float).
For the VFPU, it would describe a typical 4D column vector.
V2V2SF means allocation of 2 contiguous registers among the first 32
registers: { { i, i+32 }, { i+1, i+33 } } with 0 <= i < 32 and i%2 = 0. But
its size is 2*2*sizeof(float). For the VFPU, it would describe a typical 2D
matrix.
V3SF would be equivalent to V4SF for allocation, but its size is
3*sizeof(float) and its operations are different. For the VFPU, it would
describe a typical 3D row vector.
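
The register sets described above can be written out as a short C sketch. The function names are hypothetical and only restate the formulas from this mail; they do not correspond to anything that exists in gcc.

```c
#include <assert.h>

/* V4SF: 4 contiguous registers { i, i+1, i+2, i+3 },
   with 0 <= i < 32 and i a multiple of 4. */
void v4sf_regs(int i, int out[4])
{
    assert(i >= 0 && i < 32 && i % 4 == 0);
    for (int k = 0; k < 4; k++)
        out[k] = i + k;
}

/* V4V1SF: one allocated register i plus the associated registers
   { i+32, i+64, i+96 }, with 0 <= i < 32. */
void v4v1sf_regs(int i, int out[4])
{
    assert(i >= 0 && i < 32);
    for (int k = 0; k < 4; k++)
        out[k] = i + 32 * k;
}

/* V2V2SF: { { i, i+32 }, { i+1, i+33 } },
   with 0 <= i < 32 and i even. */
void v2v2sf_regs(int i, int out[2][2])
{
    assert(i >= 0 && i < 32 && i % 2 == 0);
    for (int c = 0; c < 2; c++)
        for (int r = 0; r < 2; r++)
            out[c][r] = i + c + 32 * r;
}
```

In every case only registers below 32 are allocated by gcc; the others follow from the allocated ones by adding multiples of 32.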
I really hope some of these are feasible, as the VFPU has a lot of operations
for 2D, 3D and 4D which could greatly speed up math code compiled with gcc.
Regards