Issue 146564
Summary Zen3 scheduler model for the latency of VEXTRACTF128rri is probably incorrect
Labels new issue
Assignees
Reporter TiborGY
See also discussion at https://discourse.llvm.org/t/are-the-latencies-of-vextractf128-correct-for-zen2-3-in-mca/86422

LLVM MCA relies on LLVM's scheduler models to predict cycle counts. This is the predicted timeline graph for a small snippet on Zen3:
```
[0,0] DeeeeeeeeER    .    .   vmovapd       (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER .    .   vsubpd        (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER   . vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps %xmm0, %xmm0, %xmm2
```
As you can see, `vextractf128` is predicted to have 4 cycles of latency. However, this is inconsistent with both Agner Fog's latency tables (which list 3 cycles) and my own measurements with llvm-exegesis.
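
For anyone who wants to read the modeled latency off the timeline mechanically, here is a minimal Python sketch. It assumes the usual llvm-mca timeline legend (`D` dispatched, `=` waiting, `e` executing, `E` executed, `R` retired) and treats the number of lowercase `e` characters as the modeled latency. The helper names are hypothetical, and only the three shortest rows from the timeline above are included.

```python
import re

# Rows copied from the timeline above (the three shortest ones).
ROWS = [
    "[0,2]     D===========eeeER   . vmulpd        %ymm0, %ymm0, %ymm0",
    "[0,3]     D==============eeeeER vextractf128  $1, %ymm0, %xmm1",
    "[0,4]     D==============eE---R   vmovhlps %xmm0, %xmm0, %xmm2",
]

# Assumed legend: D = dispatched, '=' = waiting in the scheduler,
# 'e' = executing (one per cycle), 'E' = executed.
STAGES = re.compile(r"D=*(e*)E")


def modeled_latency(row):
    """Return (mnemonic, number of 'e' cycles) for one timeline row."""
    stages = STAGES.search(row)
    if stages is None:
        raise ValueError("no timeline found in row: " + row)
    # The mnemonic is the first lowercase word after the 'E' marker.
    name = re.search(r"[a-z]\w*", row[stages.end():])
    if name is None:
        raise ValueError("no mnemonic found in row: " + row)
    return name.group(0), len(stages.group(1))


latencies = dict(modeled_latency(row) for row in ROWS)
for mnemonic, cycles in latencies.items():
    print(mnemonic, cycles)
```

Under that assumption, the `vextractf128` row yields the 4 cycles discussed here.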

```
./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000 -min-instructions=1000  --repetition-mode=loop
---
mode: latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM0 YMM0 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM0=0x0'
cpu_name: znver3
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 1000
measurements:
  - { key: latency, value: 3.15, per_snippet_value: 3.15, validation_counters: {} }
error:           ''
info: Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F04244883C42049B80200000000000000662E0F1F840000000000C4E37D19C001C4E37D19C0014983C0FF75EEC3
...
```
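
To make the comparison explicit, here is a small Python sketch that pulls the measured value out of the `measurements` line above and checks which of the candidate integer latencies mentioned in this report it is closest to. The candidate labels and helper name are only illustrative.

```python
import re

# Measurement line copied from the llvm-exegesis report above.
MEASUREMENT = (
    "  - { key: latency, value: 3.15, per_snippet_value: 3.15, "
    "validation_counters: {} }"
)

# Candidate latencies named in this report (labels are illustrative).
CANDIDATES = {
    "LLVM znver3 model / AMD latency table": 4,
    "Agner Fog's tables": 3,
}


def measured_value(line):
    """Extract the 'value:' field; \\b skips 'per_snippet_value:'."""
    match = re.search(r"\bvalue:\s*([0-9.]+)", line)
    if match is None:
        raise ValueError("no 'value:' field in line")
    return float(match.group(1))


measured = measured_value(MEASUREMENT)
# Pick the candidate with the smallest absolute distance to the measurement.
closest = min(CANDIDATES, key=lambda name: abs(CANDIDATES[name] - measured))
print(measured, "is closest to:", closest)
```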

Confusingly, AMD's official instruction latency table for Zen3 (Family_19h_Instruction_Latencies_version_1-00.xlsx, AMD Publication No. 56665 Revision 3.00 November 2020) lists `vextractf128` as having 4 cycles of latency. Perhaps I am misinterpreting my measurement results, but I cannot see how that figure could be correct. My confidence in the accuracy of the official latency table is further eroded by the fact that the two `vextractf128` variants are both listed with empty operand fields.