https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119253
Bug ID: 119253
Summary: RISC-V GCC auto-vectorizes unaligned memory access
even if -mvector-strict-align is enabled
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: [email protected]
Target Milestone: ---
The following code produces a bus error on the Banana Pi BPI F3 regardless of
whether -mvector-strict-align and/or -mstrict-align are passed to the compiler:
```
/* Compiled with:
 *   /path/to/toolchain/riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv
 *     -mvector-strict-align -mstrict-align vector_test.c -o vector_test.o
 */
#include <stdint.h> // uint8_t, uint16_t, uint32_t, int32_t
#include <stdio.h>
// This function replicates the unaligned 32-bit read causing the bus error:
void replicate_bus_error(
    uint8_t *flag_buf,
    int32_t bit_pos,
    int32_t stride,
    uint16_t *no_filter_flags,
    int32_t ctb_size)
{
    for(int32_t row = 0; row < ctb_size; row++)
    {
        no_filter_flags[row] =
            (*(uint32_t *)(flag_buf + (bit_pos >> 3))) >> (bit_pos & 7);
        bit_pos += stride;
    }
}
void fixed_bus_error(
    uint8_t *flag_buf,
    int32_t bit_pos,
    int32_t stride,
    uint16_t *no_filter_flags,
    int32_t ctb_size)
{
    for(int32_t row = 0; row < ctb_size; row++)
    {
        uint8_t *base = flag_buf + (bit_pos >> 3);
        uint32_t val = (uint32_t)base[0] |
                       ((uint32_t)base[1] << 8) |
                       ((uint32_t)base[2] << 16) |
                       ((uint32_t)base[3] << 24);
        val >>= (bit_pos & 7);
        no_filter_flags[row] = (uint16_t)val;
        bit_pos += stride;
    }
}
int main(void)
{
    uint8_t buffer[32];
    for(int i = 0; i < 32; i++)
    {
        buffer[i] = (uint8_t)i; // Fill with some test data
    }
    uint16_t out_flags[9] = {0}; // one slot per row written (ctb_size is 9)
    int32_t start_bit_pos = 5;
    int32_t stride = 8;
    int32_t ctb_size = 9;
    replicate_bus_error(buffer, start_bit_pos, stride, out_flags, ctb_size);
    for(int i = 0; i < 8; i++)
    {
        printf("out_flags[%d] = %u\n", i, out_flags[i]);
    }
    return 0;
}
```
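To make the misalignment concrete, the byte offsets used by the loop in
replicate_bus_error can be checked for the inputs in main. This is only a
sketch: load_is_aligned and count_misaligned_loads are hypothetical helper
names, and alignment is judged relative to flag_buf, whose own alignment is
not guaranteed for a uint8_t array.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: returns 1 if the 32-bit load for this bit position
// starts at a byte offset that is a multiple of 4 relative to flag_buf.
static int load_is_aligned(int32_t bit_pos)
{
    assert(bit_pos >= 0);
    return ((bit_pos >> 3) % 4) == 0;
}

// Hypothetical helper: walks the same bit positions as the reproducer's loop
// and counts the loads whose byte offset is not a multiple of 4.
static int count_misaligned_loads(int32_t bit_pos, int32_t stride,
                                  int32_t ctb_size)
{
    int misaligned = 0;
    for (int32_t row = 0; row < ctb_size; row++)
    {
        if (!load_is_aligned(bit_pos))
        {
            misaligned++;
        }
        bit_pos += stride;
    }
    return misaligned;
}
```

With start_bit_pos = 5, stride = 8 and ctb_size = 9, the byte offsets are
0, 1, 2, ..., 8, so count_misaligned_loads(5, 8, 9) returns 6. Because the
offsets are consecutive, at most every fourth load can be 4-byte aligned
whatever the alignment of the buffer itself.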
The example comes from a video decoding library; I simplified it to narrow
down the issue. It also includes fixed_bus_error, a version of the function
that does practically the same thing but assembles the value from individual
byte loads, so it is not vectorized and every memory access is valid.
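Another common way to express the same unaligned read without casting the
byte pointer is to go through memcpy. The sketch below is not part of the
original library; load_u32_unaligned and flags_via_memcpy are hypothetical
names. It reads in native byte order, so it matches fixed_bus_error only on
little-endian targets, and I have not checked what code GCC generates for it
with -mstrict-align or -mvector-strict-align.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: reads 4 bytes from an arbitrarily aligned address.
// memcpy places no alignment requirement on src; the compiler chooses the
// actual load instructions.
static uint32_t load_u32_unaligned(const uint8_t *src)
{
    assert(src != NULL);
    uint32_t val;
    memcpy(&val, src, sizeof val);
    return val;
}

// Hypothetical variant of the loop from the reproducer, using the helper
// above instead of dereferencing a cast pointer.
static void flags_via_memcpy(const uint8_t *flag_buf, int32_t bit_pos,
                             int32_t stride, uint16_t *no_filter_flags,
                             int32_t ctb_size)
{
    for (int32_t row = 0; row < ctb_size; row++)
    {
        uint32_t val = load_u32_unaligned(flag_buf + (bit_pos >> 3));
        no_filter_flags[row] = (uint16_t)(val >> (bit_pos & 7));
        bit_pos += stride;
    }
}
```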
Interestingly, -mvector-strict-align doesn't change the resulting assembly.
Also, clang seems to vectorize this example and is able to produce valid
assembly code (when tested on the Banana Pi BPI F3):
https://godbolt.org/z/d56ofvqz5
I tested with gcc-trunk configured in the following way:
Target: riscv64-unknown-linux-gnu
Configured with: /home/dstojkovic/Projects/riscv-gnu-toolchain/gcc/configure
--target=riscv64-unknown-linux-gnu
--prefix=/home/dstojkovic/opt/riscv-autovec-12032025
--with-sysroot=/home/dstojkovic/opt/riscv-autovec-12032025/sysroot
--with-pkgversion=g90f5dabddbd --with-system-zlib --enable-shared --enable-tls
--enable-languages=c,c++,fortran --disable-libmudflap --disable-libssp
--disable-libquadmath --disable-libsanitizer --disable-nls --disable-bootstrap
--src=.././gcc --disable-default-pie --disable-multilib --with-abi=lp64d
--with-arch=rv64gcv --with-tune=rocket --with-isa-spec=20191213
'CFLAGS_FOR_TARGET=-O2 -mcmodel=medlow' 'CXXFLAGS_FOR_TARGET=-O2
-mcmodel=medlow'
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.1 20250312 (experimental) (g90f5dabddbd)