On Wed, 20 Aug 2025 18:50:48 +0000, Ngan, Robert <[email protected]> wrote:
>I thought "All those vector instructions are better than a single ED?"
>Today, I decided to recompile the module
>
> MVC 536(6,R13),2403(R3) # TS2=119 +2403
> LLH R2,8(,R2) # U010-CO-ID
> CVD R2,1050(,R13) #
> ED 536(6,R13),1055(R13) # TS2=119
> L R2,88(,R8) # BLL_9
> MVC 8(5,R2),537(R13) # U016-CO-ID-DISPLAY TS2=119
Does anyone believe IBM is committed to vector instructions when IBM Telum has
tiny 16-byte vector registers compared to other CPUs (64, 128, 256 or 512
bytes)? I'm guessing companies won't buy IBM without vector instructions,
proving you can't fix stupid. You can certainly find a use for these
instructions, but are they the best solution? Apparently the COBOL compiler
developers found out the hard way, as Robert discovered.
How are we so gullible that we believe everything Unix programmers say? Let's
learn how vectors solve a simple problem the hard way.
Vector instructions are critical to non-zArch CPU performance. Let's consider
MOVE A TO B, because if they can't do the easy stuff right, what makes anyone
think they are doing the difficult stuff right?
1. Traditionally, CPUs have 7 usable registers. MOVE A TO B is very simple but
extremely slow:
     L    R3,0(R1)     get source data into reg
     ST   R3,0(R2)     save source data to destination
     AHI  R1,##        point to next source data
     AHI  R2,##        point to next destination
     repeat until all data moved.
4 of the 7 registers are needed to move data. Moving data is very expensive
using this method. Ask yourself why 64-bit was so important: you could move 8
bytes instead of 4 bytes, thus doubling CPU speed for moving data.
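The doubling argument in item 1 can be checked with a small C sketch. The
function names copy_words32 and copy_words64 are hypothetical, the returned
count stands in for loop passes rather than measured CPU time, and the length
is assumed to be a multiple of the word size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes 4 at a time, like the L/ST loop in item 1.
   Returns the number of loop passes. Assumes n is a multiple of 4. */
size_t copy_words32(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes = 0;
    for (size_t off = 0; off < n; off += sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s + off, sizeof w);   /* load: get source data into reg */
        memcpy(d + off, &w, sizeof w);   /* store: save to destination */
        passes++;
    }
    return passes;
}

/* Same loop with an 8-byte register. Assumes n is a multiple of 8. */
size_t copy_words64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes = 0;
    for (size_t off = 0; off < n; off += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + off, sizeof w);
        memcpy(d + off, &w, sizeof w);
        passes++;
    }
    return passes;
}
```

For the same buffer, the 8-byte version makes half as many passes as the 4-byte
version, which is the whole point of the wider register.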
2. Moving data with vector instructions works the same way, except that it
uses vector instructions and vector registers.
2a. There are 32 vector registers (same as zArch).
2b. Some vector registers are 64 bytes (512 bits, as in Intel x86 AVX-512). 64
bytes times 32 regs is 2KB. IBM Telum vector registers are only 16 bytes. 16
bytes times 32 regs is 512 bytes (25% of AVX-512).
     VL   VR1,0(R1)    get source data into reg
     VL   VR2,64(R1)   get source data into reg
     VL   VR3,128(R1)  get source data into reg
     VL   VR4,192(R1)  get source data into reg
     VST  VR1,0(R2)    save source data to destination
     VST  VR2,64(R2)   save source data to destination
     VST  VR3,128(R2)  save source data to destination
     VST  VR4,192(R2)  save source data to destination
     AHI  R1,256       point to next source data
     AHI  R2,256       point to next destination
     repeat until all data moved.
Most implementations use 4 or fewer vector registers per pass. They are moving
256 bytes at a time, which is the equivalent of a single MVC, which IBM
architected in the 1960s. Thank god for fast CPUs!
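As a rough model of the unrolled vector loop in item 2, here is a C sketch
that moves 256 bytes per pass through four 64-byte buffers. The vreg type, the
constants, and the name copy_blocks256 are hypothetical stand-ins for vector
registers, not any real intrinsic API, and the length is assumed to be a
multiple of 256.

```c
#include <stddef.h>
#include <string.h>

enum { VREG_BYTES = 64, VREGS_PER_PASS = 4 };   /* assumed, as in item 2 */

/* Stand-in for one 64-byte vector register. */
typedef struct { unsigned char b[VREG_BYTES]; } vreg;

/* Copy n bytes, 256 per pass. Returns the number of passes.
   Assumes n is a multiple of 256. */
size_t copy_blocks256(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t pass_bytes = (size_t)VREG_BYTES * VREGS_PER_PASS;
    size_t passes = 0;
    for (size_t off = 0; off < n; off += pass_bytes) {
        vreg v[VREGS_PER_PASS];
        for (int i = 0; i < VREGS_PER_PASS; i++)      /* the four loads */
            memcpy(&v[i], s + off + (size_t)i * VREG_BYTES, VREG_BYTES);
        for (int i = 0; i < VREGS_PER_PASS; i++)      /* the four stores */
            memcpy(d + off + (size_t)i * VREG_BYTES, &v[i], VREG_BYTES);
        passes++;
    }
    return passes;
}
```

Under those assumptions a 4K move takes 16 passes of this loop, each one with
its own loads, stores, and pointer updates to decode.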
3. zArch MVC, MVCL, and MVCLE are many times faster than vector instructions:
3a. MVC has existed since the 1960s and MVCL since 1970, but neither has ever
changed externally to improve performance.
3b. zArch microcode must be using hidden buffers instead of user-facing vector
registers. I'm guessing 4K or larger, which is huge compared to a vector loop
moving 256 bytes at a time (16X faster on a 4K move).
3c. zArch appears to be using a masked move (the mask being the size to move).
A few vector architectures have masked moves, but most implementations
calculate a series of power-of-2 moves that fit in the remaining size (e.g. 1
byte, 2, 4, 8, ..., 256 bytes).
3d. Only one instruction (MVCLE) has to be decoded, instead of potentially
hundreds to thousands of instructions.
3e. There's a reason the IBM Telum pipeline is only 6 instructions versus the
15 to 30 instructions of other architectures.
3f. MVCLE can easily calculate prefetch of storage to optimize the move. Only a
few non-IBM implementations use prefetch, and they simply pick a number (e.g.
756 bytes).
3g. If zArch microcode used the vector implementation internally, then that's
at least 19 instructions in L2 that don't need to be continually decoded.
3h. zArch has the vector instructions VLM and VSTM, each of which can specify
up to 16 vector registers (256 bytes), using 6 fewer instructions than the
loop above.
3i. I suspect that vector microcode resides in L2 cache because of low use.
High-use microcode probably resides permanently in L1 cache. Being microcode,
it's probably permanently decoded.
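The power-of-2 calculation that non-masked implementations fall back on (see
the masked move item in the list above) can be sketched in C. The name
split_pow2 and its interface are hypothetical, chosen only to show the
arithmetic. It is not taken from any particular library.

```c
#include <stddef.h>

/* Split n bytes into descending power-of-2 move sizes, none larger than
   max_chunk (max_chunk must itself be a power of 2). Stores at most cap
   sizes in out and returns the total number of moves needed. */
size_t split_pow2(size_t n, size_t max_chunk, size_t *out, size_t cap)
{
    size_t count = 0;
    for (size_t chunk = max_chunk; chunk > 0; chunk >>= 1) {
        /* Only the largest size can repeat; each smaller size fits at
           most once in what remains. */
        while (n >= chunk) {
            if (count < cap)
                out[count] = chunk;
            count++;
            n -= chunk;
        }
    }
    return count;
}
```

With max_chunk set to 256, a 300-byte move comes out as four separate moves of
256, 32, 8 and 4 bytes, each of which is at least one load and one store, where
a length-driven instruction would take the 300 as a single operand.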
The computer industry (e.g. Google) is delusional to ignore the power of the
IBM Telum, with each core being several times faster than any other CPU core
today because of its instruction set.