Vector instruction performance WAS: Execute-Type Instructions

Jon Perryman Thu, 21 Aug 2025 13:57:25 -0700

On Wed, 20 Aug 2025 18:50:48 +0000, Ngan, Robert <[email protected]> wrote:


>I thought "All those vector instructions are better than an a single ED? 
>Today, I decided to recompile the module 
>
>          MVC     536(6,R13),2403(R3)   #  TS2=119                  +2403
>          LLH     R2,8(,R2)             #  U010-CO-ID
>          CVD     R2,1050(,R13)         #
>          ED      536(6,R13),1055(R13)  #  TS2=119
>          L       R2,88(,R8)            #  BLL_9
>          MVC     8(5,R2),537(R13)      #  U016-CO-ID-DISPLAY      TS2=119

Does anyone believe IBM is committed to vector instructions when IBM Telum has 
tiny 16-byte vector registers compared to other CPUs (up to 64 bytes, i.e. 512 
bits, with x86 AVX-512)? I'm guessing companies won't buy IBM without vector 
instructions, proving you can't fix stupid. You can certainly find a use for 
these instructions, but are they the best solution? Apparently, the Cobol 
compiler developers found out the hard way, as Robert discovered.

How are we so gullible that we believe everything Unix programmers say? Let's 
learn how vectors solve simple problems the hard way. 

Vector instructions are critical to non-zArch CPU performance. Let's consider 
MOVE A TO B because if they can't do the easy stuff right, what makes anyone 
think they are doing the difficult stuff right? 

1. Traditionally, CPUs have 7 usable registers. MOVE A TO B is very simple but 
extremely slow:
       L    R3,0(R1)     get source data into reg
       ST   R3,0(R2)     save source data to destination
       AHI  R1,##        point to next source data
       AHI  R2,##        point to next destination
       repeat until all data moved.
4 of the 7 registers are needed to move data (the three shown plus a loop 
count). Moving data is very expensive using this method. Ask yourself why 64 
bit was so important. You could move 8 bytes instead of 4 bytes per pass, 
roughly doubling the speed of moving data.
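
For anyone who wants to count passes, here is a rough C sketch of the loop 
in item 1. copy_by_words is a hypothetical helper written only for this 
illustration, and it assumes the length is a multiple of the register width. 
It is a model of the loop, not a claim about how any compiler or CPU does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes in units of `width` bytes (4 or 8),
 * mirroring the L / ST / AHI / AHI loop. Returns the number of passes.
 * Assumption: len is a multiple of width. */
size_t copy_by_words(void *dst, const void *src, size_t len, size_t width)
{
    assert(width == 4 || width == 8);
    assert(len % width == 0);

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes = 0;

    while (len > 0) {
        uint64_t reg;              /* stands in for R3 */
        memcpy(&reg, s, width);    /* L   R3,0(R1)  get source data   */
        memcpy(d, &reg, width);    /* ST  R3,0(R2)  save to target    */
        s += width;                /* AHI R1,width  next source       */
        d += width;                /* AHI R2,width  next destination  */
        len -= width;
        passes++;
    }
    return passes;
}
```

Moving 64 bytes takes 16 passes with width 4 and 8 passes with width 8, 
which is where the "doubling" comes from.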

2. Moving data with vector instructions works the same way, except that it 
uses vector instructions and vector registers. 
2a. There are 32 vector registers (same as zArch).
2b. Some vector registers are 64 bytes (512 bits as in Intel x86 AVX-512). 64 
bytes times 32 regs is 2KB. IBM Telum vector registers are only 16 bytes. 16 
bytes times 32 regs is 512 bytes (25% of AVX-512). The loop below assumes 
64-byte registers.
       VL    VR1,0(R1)       get source data into reg
       VL    VR2,64(R1)      get source data into reg
       VL    VR3,128(R1)     get source data into reg
       VL    VR4,192(R1)     get source data into reg
       VST   VR1,0(R2)       save source data to destination
       VST   VR2,64(R2)      save source data to destination
       VST   VR3,128(R2)     save source data to destination
       VST   VR4,192(R2)     save source data to destination
       AHI   R1,256          point to next source data
       AHI   R2,256          point to next destination
       repeat until all data moved.
Most implementations use 4 or fewer vector registers per pass, so they move 
at most 256 bytes at a time. That is the equivalent of a single MVC, which 
IBM architected in the 1960s. Thank god for fast CPUs! 
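
The same loop can be modelled in C to see how many passes a move needs. This 
is only a sketch under the assumptions in item 2: four 64-byte registers per 
pass, so 256 bytes per pass. copy_by_blocks and the constant names are 
hypothetical, and the leftover bytes are handed to memcpy rather than 
handled the way a real vector routine would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    VREG_BYTES     = 64,                          /* assumed register size */
    VREGS_PER_PASS = 4,                           /* VR1..VR4              */
    BLOCK_BYTES    = VREG_BYTES * VREGS_PER_PASS  /* 256 bytes per pass    */
};

/* Hypothetical helper: copy len bytes, 256 bytes per pass, then the tail.
 * Returns the number of full 256-byte passes. */
size_t copy_by_blocks(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes = 0;

    while (len >= BLOCK_BYTES) {
        unsigned char vr[VREGS_PER_PASS][VREG_BYTES];  /* VR1..VR4 */
        for (int i = 0; i < VREGS_PER_PASS; i++)
            memcpy(vr[i], s + i * VREG_BYTES, VREG_BYTES);  /* VL  */
        for (int i = 0; i < VREGS_PER_PASS; i++)
            memcpy(d + i * VREG_BYTES, vr[i], VREG_BYTES);  /* VST */
        s += BLOCK_BYTES;                                   /* AHI R1,256 */
        d += BLOCK_BYTES;                                   /* AHI R2,256 */
        len -= BLOCK_BYTES;
        passes++;
    }
    memcpy(d, s, len);   /* tail shorter than one block */
    return passes;
}
```

A 4096-byte move makes 16 passes through this loop. A 1000-byte move makes 3 
passes and leaves a 232-byte tail.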

3. zArch MVC, MVCL, and MVCLE are far faster than vector instructions:
3a. MVC has existed since the 1960s and MVCL since 1970, but neither has 
changed externally to improve performance. 
3b. zArch microcode must be using hidden buffers instead of user-facing vector 
registers. I'm guessing at least 4K or larger. That is huge compared to a 
vector loop moving 256 bytes per pass (16X faster on a 4K move).
3c. zArch appears to be using a masked move (the mask being the size to move). 
A few vector architectures have masked moves, but most implementations use 
calculations to determine power-of-2 moves that fit in the remaining size 
(e.g. 1 byte, 2, 4, 8, ..., 256 bytes).
3d. Decode of one MVCLE instruction instead of potentially hundreds to 
thousands of instructions.
3e. There's a reason IBM Telum pipeline is only 6 instructions versus 15 to 30 
instructions of other architectures.
3f. MVCLE can easily calculate prefetch of storage to optimize the move. Only 
a few non-IBM implementations use prefetch and they simply pick a number (e.g. 
756 bytes).
3g. If zArch microcode used the vector implementation internally, then that's 
at least 19 instructions in L2 that don't need to be continually decoded.
3h. zArch has the vector instructions VLM and VSTM, which can each specify up 
to 16 vector registers (256 bytes), using 6 fewer instructions than the loop 
above.
3i. I suspect that vector microcode resides in L2 cache because of low use. 
High-use microcode probably resides permanently in L1 cache. Being microcode, 
it's probably permanently decoded.
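
To make the power-of-2 calculation in item 3 concrete, here is a small C 
sketch. split_power_of_two is a hypothetical helper for illustration only, 
and the greedy largest-piece-first order is my assumption, not taken from 
any particular library. A length-driven move such as MVCLE is handed the 
whole length in one instruction instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split len into power-of-2 pieces, largest first,
 * none larger than max_piece (which must itself be a power of 2).
 * Writes the piece sizes to pieces[] and returns how many there are. */
size_t split_power_of_two(size_t len, size_t max_piece,
                          size_t *pieces, size_t capacity)
{
    assert(max_piece != 0 && (max_piece & (max_piece - 1)) == 0);

    size_t count = 0;
    size_t piece = max_piece;

    while (len > 0) {
        while (piece > len)      /* shrink until the piece fits */
            piece >>= 1;
        assert(count < capacity);
        pieces[count++] = piece; /* one separate move of this size */
        len -= piece;
    }
    return count;
}
```

With a 256-byte maximum, a 1000-byte move breaks into 7 separate moves: 256, 
256, 256, 128, 64, 32 and 8 bytes.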

The computer industry (e.g. Google) is delusional to ignore the power of the 
IBM Telum, with each core being several times faster than any other CPU core 
today because of its instruction set.
