On 9/4/25 11:13, Chao Liu wrote:
+/*
+ * Check whether bit i of the mask is 0 or 1.
+ *
+ * static inline int vext_elem_mask(void *v0, int index)
+ * {
+ *     int idx = index / 64;
+ *     int pos = index % 64;
+ *     return (((uint64_t *)v0)[idx] >> pos) & 1;
+ * }
+ *
+ * And
+ *
+ * if (vext_elem_mask(v0, i) != 0) {
+ *     goto label;
+ * }
+ */
+static void gen_check_vext_elem_mask(TCGLabel *label, TCGv mask, TCGv mask_offs)
+{
+    TCGv mask_offs_64 = tcg_temp_new();
+    TCGv mask_offs_rem = tcg_temp_new();
+    TCGv mask_elem = tcg_temp_new();
+
+    tcg_gen_shri_tl(mask_offs_64, mask_offs, 3);
+    tcg_gen_add_tl(mask_offs_64, mask_offs_64, mask);
+    tcg_gen_ld_i64((TCGv_i64)mask_elem, (TCGv_ptr)mask_offs_64, 0);
Each and every time you cast a TCGv, you're doing something wrong.
There are a lot of them in this patch.
Your host pointer arithmetic should be using TCGv_ptr and tcg_gen_*_ptr().
This mask_elem should itself be TCGv_i64.
+    tcg_gen_andi_tl(mask_elem, mask_elem, 1);
+    tcg_gen_brcond_tl(TCG_COND_NE, mask_elem, tcg_constant_tl(0), label);
This should be
tcg_gen_brcond_i64(TCG_COND_TSTNE, mask_elem, tcg_constant_i64(1), label);
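Untested, but with the pointer arithmetic kept in the ptr domain, the whole
helper comes out something like this (a sketch: I'm assuming the caller
passes v0 as a TCGv_ptr, e.g. via tcg_gen_addi_ptr from tcg_env, and the
mask bit index as a TCGv):

  static void gen_check_vext_elem_mask(TCGLabel *label, TCGv_ptr v0, TCGv idx)
  {
      TCGv_i32 ofs = tcg_temp_new_i32();
      TCGv_ptr addr = tcg_temp_new_ptr();
      TCGv_i64 word = tcg_temp_new_i64();
      TCGv_i64 shift = tcg_temp_new_i64();

      /* Byte offset of the aligned uint64_t containing bit idx:
         (idx / 64) * 8.  */
      tcg_gen_trunc_tl_i32(ofs, idx);
      tcg_gen_shri_i32(ofs, ofs, 6);
      tcg_gen_shli_i32(ofs, ofs, 3);

      /* Host pointer arithmetic stays in TCGv_ptr, no casts.  */
      tcg_gen_ext_i32_ptr(addr, ofs);
      tcg_gen_add_ptr(addr, addr, v0);
      tcg_gen_ld_i64(word, addr, 0);

      /* Shift bit (idx % 64) down to bit 0 and branch if set.  */
      tcg_gen_extu_tl_i64(shift, idx);
      tcg_gen_andi_i64(shift, shift, 63);
      tcg_gen_shr_i64(word, word, shift);
      tcg_gen_brcond_i64(TCG_COND_TSTNE, word, tcg_constant_i64(1), label);
  }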
+/*
+ * Simulate the strided load/store main loop:
+ *
+ * for (i = env->vstart; i < env->vl; env->vstart = ++i) {
+ *     k = 0;
+ *     while (k < nf) {
+ *         if (!vm && !vext_elem_mask(v0, i)) {
+ *             vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
+ *                               (i + k * max_elems + 1) * esz);
+ *             k++;
+ *             continue;
+ *         }
+ *         target_ulong addr = base + stride * i + (k << log2_esz);
+ *         ldst(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+ *         k++;
+ *     }
+ * }
The form of this loop causes you to do more reads of vext_elem_mask than
necessary: one per (i, k) pair instead of one per element.
Better to test once, outside of the loop over K:
  for (i in vl) {
      if (!vm && !vext_elem_mask(v0, i)) {
          for (k in nf) {
              vd[i, k] = -1;
          }
      } else {
          for (k in nf) {
              vd[i, k] = ld(addr);
          }
      }
  }
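In C terms, roughly (a sketch reusing the names from your quoted comment,
with the vma/agnostic handling as in the original):

  for (i = env->vstart; i < env->vl; env->vstart = ++i) {
      if (!vm && !vext_elem_mask(v0, i)) {
          /* One mask read covers all nf fields of element i.  */
          for (k = 0; k < nf; k++) {
              vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
                                (i + k * max_elems + 1) * esz);
          }
      } else {
          for (k = 0; k < nf; k++) {
              target_ulong addr = base + stride * i + (k << log2_esz);
              ldst(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
          }
      }
  }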
If vl_eq_vlmax, and VL is a multiple of 64, you can structure this loop like:
  i = 0;
  do {
      mask = v0[i / 64];
      do {
          if (mask & 1) {
              ...
          }
          mask >>= 1;
      } while (++i & 63);
  } while (i < vl);
If VL is a smaller power of 2, you can load smaller units of mask to match,
though beware of the big-endian host addressing fixup.
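E.g. with byte-sized mask loads, the byte index wants the same adjustment
the H1() macro does in the existing vector helpers, because the mask is
stored as host-endian uint64_t units (a sketch, in C terms):

  /* Bit i of the mask lives in byte (i / 8) of a uint64_t array,
     so flip the byte index within the word on big-endian hosts.  */
  size_t ofs = i / 8;
  #if HOST_BIG_ENDIAN
  ofs ^= 7;
  #endif
  bool set = (((uint8_t *)v0)[ofs] >> (i % 8)) & 1;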
If VM, you should fuse I and K into a single loop over all elements.
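Something like (a sketch; the div/mod is written out for clarity and the
vstart bookkeeping is elided):

  /* VM set: every element is active, so one flat loop over vl * nf.  */
  for (idx = env->vstart * nf; idx < env->vl * nf; idx++) {
      i = idx / nf;
      k = idx % nf;
      target_ulong addr = base + stride * i + (k << log2_esz);
      ldst(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
  }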
r~