On 9/4/25 11:13, Chao Liu wrote:
+/*
+ * Check whether the i bit of the mask is 0 or 1.
+ *
+ * static inline int vext_elem_mask(void *v0, int index)
+ * {
+ *     int idx = index / 64;
+ *     int pos = index % 64;
+ *     return (((uint64_t *)v0)[idx] >> pos) & 1;
+ * }
+ *
+ * And
+ *
+ * if (vext_elem_mask(v0, i) != 0) {
+ *     goto label;
+ * }
+ */
+static void gen_check_vext_elem_mask(TCGLabel *label, TCGv mask, TCGv mask_offs)
+{
+    TCGv mask_offs_64 = tcg_temp_new();
+    TCGv mask_offs_rem = tcg_temp_new();
+    TCGv mask_elem = tcg_temp_new();
+
+    tcg_gen_shri_tl(mask_offs_64, mask_offs, 3);
+    tcg_gen_add_tl(mask_offs_64, mask_offs_64, mask);
+    tcg_gen_ld_i64((TCGv_i64)mask_elem, (TCGv_ptr)mask_offs_64, 0);

Each and every time you cast a TCGv, you're doing something wrong.
There are a lot of them in this patch.

Your host pointer arithmetic should be using TCGv_ptr and tcg_gen_*_ptr().
This mask_elem should itself be TCGv_i64.

+    tcg_gen_andi_tl(mask_elem, mask_elem, 1);
+    tcg_gen_brcond_tl(TCG_COND_NE, mask_elem, tcg_constant_tl(0), label);

This should be

    tcg_gen_brcond_i64(TCG_COND_TSTNE, mask_elem, tcg_constant_i64(1), label);


+/*
+ * Simulate the strided load/store main loop:
+ *
+ * for (i = env->vstart; i < env->vl; env->vstart = ++i) {
+ *     k = 0;
+ *     while (k < nf) {
+ *         if (!vm && !vext_elem_mask(v0, i)) {
+ *             vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
+ *                               (i + k * max_elems + 1) * esz);
+ *             k++;
+ *             continue;
+ *         }
+ *         target_ulong addr = base + stride * i + (k << log2_esz);
+ *         ldst(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+ *         k++;
+ *     }
+ * }

The form of this loop reads vext_elem_mask once per field K, which is more
often than necessary, since the result depends only on I.

Better to test once outside of the loop over K:

    for (i in vl) {
        if (!vm && !vext_elem_mask(v0, i)) {
            for (k in nf) {
                vd[i, k] = -1;
            }
        } else {
            for (k in nf) {
                vd[i, k] = ld(addr);
            }
        }
    }
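
As a rough illustration of the restructuring above, here is a plain C model
(not TCG code) of a strided segment load of byte elements. The function name,
the byte-sized elements, the vd[i + k * max_elems] layout and the returned
read counter are all assumptions made for the sketch; masked-off elements are
simply filled with all ones, as in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i of the mask register, as in the quoted vext_elem_mask(). */
static int mask_bit(const uint64_t *v0, size_t i)
{
    return (v0[i / 64] >> (i % 64)) & 1;
}

/*
 * Hypothetical model of a strided segment load of byte elements.
 * Field k of element i lives at vd[i + k * max_elems].
 * Returns the number of mask reads, so the saving is visible.
 */
static size_t strided_load_model(uint8_t *vd, const uint8_t *mem,
                                 const uint64_t *v0, int vm,
                                 size_t vl, size_t nf,
                                 size_t max_elems, size_t stride)
{
    size_t mask_reads = 0;

    assert(max_elems >= vl);
    for (size_t i = 0; i < vl; i++) {
        int active = 1;

        if (!vm) {
            active = mask_bit(v0, i);   /* one read per element i */
            mask_reads++;
        }
        if (!active) {
            for (size_t k = 0; k < nf; k++) {
                vd[i + k * max_elems] = 0xff;   /* all ones */
            }
        } else {
            for (size_t k = 0; k < nf; k++) {
                vd[i + k * max_elems] = mem[stride * i + k];
            }
        }
    }
    return mask_reads;
}
```

With vl = 4 and nf = 2 this performs 4 mask reads instead of the 8 that the
per-K test would need.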

If vl_eq_vlmax, and VL is a multiple of 64, you can structure this loop like:

    i = 0;
    do {
        mask = v0[i / 64];
        do {
            if (mask & 1) {
                ...
            }
            mask >>= 1;
        } while (++i & 63);
    } while (i < vl);
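
For reference, the same loop shape written out as compilable C. This is only
a host-side model of the control flow, not the TCG that the translator would
emit; the callback type, the tally struct and the returned word count are
hypothetical names added so the behaviour can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element callback type, for illustration only. */
typedef void (*elem_fn)(size_t i, void *opaque);

/*
 * Visit each active element, reading one 64-bit mask word per 64
 * elements. Requires vl to be a non-zero multiple of 64.
 * Returns the number of mask words read.
 */
static size_t for_each_active(const uint64_t *v0, size_t vl,
                              elem_fn fn, void *opaque)
{
    size_t i = 0, words = 0;

    assert(vl != 0 && vl % 64 == 0);
    do {
        uint64_t mask = v0[i / 64];
        words++;
        do {
            if (mask & 1) {
                fn(i, opaque);
            }
            mask >>= 1;
        } while (++i & 63);
    } while (i < vl);
    return words;
}

/* Example callback: count the active elements and sum their indices. */
struct tally {
    size_t count;
    size_t index_sum;
};

static void tally_elem(size_t i, void *opaque)
{
    struct tally *t = opaque;

    t->count++;
    t->index_sum += i;
}
```

The inner loop ends when the low six bits of i wrap to zero, so no separate
inner counter is needed.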

If VL is a smaller power of 2, you can load smaller units of mask to match,
though beware of the big-endian host addressing fixup.
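
A minimal sketch of what that fixup looks like for byte-sized mask loads,
assuming the mask register is stored as host-endian uint64_t words. The
helper names are made up for this example, and the endianness probe is done
at run time only to keep the code self-contained; the XOR with 7 is the usual
trick for addressing a byte inside a 64-bit word on a big-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Runtime host endianness probe, to keep the sketch self-contained. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;

    return *(const uint8_t *)&probe == 0;
}

/*
 * Load the 8 mask bits covering elements [8 * n, 8 * n + 7]. On a
 * big-endian host the byte order inside each 64-bit word is reversed,
 * so the low three bits of the byte index are flipped.
 */
static uint8_t load_mask_byte(const uint64_t *v0, size_t n)
{
    const uint8_t *bytes = (const uint8_t *)v0;
    size_t idx = host_is_big_endian() ? (n ^ 7) : n;

    return bytes[idx];
}

/* Count active elements; requires vl to be a non-zero multiple of 8. */
static size_t count_active_by_bytes(const uint64_t *v0, size_t vl)
{
    size_t i = 0, active = 0;

    assert(vl != 0 && vl % 8 == 0);
    do {
        uint8_t mask = load_mask_byte(v0, i / 8);
        do {
            active += mask & 1;
            mask >>= 1;
        } while (++i & 7);
    } while (i < vl);
    return active;
}
```

Without the index flip, a big-endian host would pick up the mask bits for
elements 56..63 when it meant to read elements 0..7.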

If VM is set (unmasked), you should fuse I and K into a single loop over all
elements.
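
One possible reading of the fused loop, again as a plain C model with byte
elements. The function name, the vd[i + k * max_elems] layout and the flat
counter n are assumptions for illustration; the point is only that with no
mask to test there is a single loop with a single exit condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical unmasked strided segment load: one loop over all
 * vl * nf elements, stepping k fastest and i on every wrap of k.
 */
static void strided_load_unmasked(uint8_t *vd, const uint8_t *mem,
                                  size_t vl, size_t nf,
                                  size_t max_elems, size_t stride)
{
    size_t total = vl * nf;
    size_t i = 0, k = 0;

    assert(nf != 0 && max_elems >= vl);
    for (size_t n = 0; n < total; n++) {
        vd[i + k * max_elems] = mem[stride * i + k];
        if (++k == nf) {
            k = 0;
            i++;
        }
    }
}
```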


r~
