[Bug middle-end/111933] memcpy on Xtensa not optimized when n == sizeof(uint32_t) or sizeof(uint64_t)

rsaxvc at gmail dot com via Gcc-bugs Thu, 22 Aug 2024 18:05:17 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111933


rsaxvc at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsaxvc at gmail dot com

--- Comment #3 from rsaxvc at gmail dot com ---
(In reply to Davide Bettio from comment #2)

> ...I was writing a function for reading uint32_t and uint64_t values at any 
> address...

I believe memcpy() is the right approach, as dereferencing a misaligned pointer
is undefined behaviour.

My suspicion is that treating unaligned access as unsafe is intentional for
ESP32, because some of the internal memories, such as IRAM, require strict
alignment, though most do not. Quoting from
https://blog.espressif.com/esp32-programmers-memory-model-259444d89387 ,

"...IRAM has access limitations in terms of alignment of address and size. If
an unaligned access is made, it results into an exception. The ESP-IDF, after
release 4.2, handles these exceptions transparently to provide load/store as
desired by the caller. As these unaligned accesses result in exception, the
access is slower than the DRAM access. Typically each exception handling
requires approximately 167 CPU cycles (i.e. 0.7 usec per access at 240 MHz or 1
usec per access at 160 MHz)."

It does look like the equivalent 16-bit unaligned load could be faster:

#include <stdint.h>
#include <string.h>

uint16_t from_unaligned_u16(void *p){
    uint16_t ret;
    memcpy(&ret, p, sizeof(ret));
    return ret;
}

readU16: //round-trips through the stack
        entry   sp, 48
        l8ui    a8, a2, 0
        l8ui    a2, a2, 1
        s8i     a8, sp, 0
        s8i     a2, sp, 1
        l16ui   a2, sp, 0
        retw.n

uint32_t from_unaligned_u16_seq(uint8_t *p){
    uint32_t p1 = p[1];
    uint32_t p0 = p[0];
    return p0 | p1 << 8;
}

readU16Seq: //works in registers
        entry   sp, 32
        l8ui    a8, a2, 1
        l8ui    a2, a2, 0
        slli    a8, a8, 8
        or      a2, a8, a2
        retw.n

But for the 32-bit version I couldn't get anything shorter than what GCC did.
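
For anyone who wants to repeat the 32-bit comparison, here is a minimal sketch
of the two variants. The function names are hypothetical (they are not from the
bug report), and the byte-wise form assumes little-endian byte order, as the
16-bit example above does. It makes no claim about the code GCC generates.

```c
#include <stdint.h>
#include <string.h>

/* memcpy form: well-defined for any alignment of p. */
uint32_t from_unaligned_u32(const void *p){
    uint32_t ret;
    memcpy(&ret, p, sizeof(ret));
    return ret;
}

/* Byte-wise form: builds a little-endian value from four byte loads.
   Each byte is widened to uint32_t before shifting so that the
   shift by 24 cannot overflow a signed int. */
uint32_t from_unaligned_u32_seq(const uint8_t *p){
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

Both take a pointer of any alignment, e.g. from_unaligned_u32_seq(buf + 1).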
