Issue 146483
Summary [Integer division]Optimization missed in looped division
Labels new issue
Assignees
Reporter bionukg
    I learned that, for any `uint32_t i`, `i/3` is equivalent to `(uint64_t(i)*0xAAAAAAAB)>>33`. This is how modern compilers typically optimize integer division by constants.

To verify this equivalence, I've written the following code:

```cpp
#include <iostream>
#include <cstdint>
#include <chrono>
//[[gnu::noinline]] 
uint32_t div3_v0(uint32_t a) { return a / 3; }
//[[gnu::noinline]] 
uint32_t div3_v1(uint32_t a) { return (static_cast<uint64_t>(a) * 0xAAAAAAAB) >> 33; }

uint64_t test0()
{
    uint64_t res0 = 0;
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res0 += div3_v0(i);
 }
    return res0;
}

uint64_t test1()
{
    uint64_t res1 = 0;
 for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res1 += div3_v1(i);
    }
    return res1;
}

int main()
{
    // Iterate through all possible uint32_t values
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        // Calculate result using regular division
        uint32_t division_result = div3_v0(i);

        // Calculate result using multiplication and right shift
        uint32_t multiplication_shift_result = div3_v1(i);

        // Verify if both results are equal
        if (division_result != multiplication_shift_result)
        {
            std::cout << "i = " << i << std::endl;
            std::cout << "i/3 = " << division_result << std::endl;
            std::cout << "(uint64_t(i)*0xAAAAAAAB)>>33 = " << multiplication_shift_result << std::endl;
            return 1;
        }
 }
    std::cout << "for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33" << std::endl;
    
    // Test time cost
 auto tp0 = std::chrono::high_resolution_clock::now();
    uint64_t res0 = test0();
    auto tp1 = std::chrono::high_resolution_clock::now();

 auto tp2 = std::chrono::high_resolution_clock::now();
    uint64_t res1 = test1();
    auto tp3 = std::chrono::high_resolution_clock::now();
    
 std::cout << "res0 = " << res0 << std::endl;
    std::cout << "res1 = " << res1 << std::endl;
    std::cout << "duration0 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp1 - tp0).count() << " ms" << std::endl;
    std::cout << "duration1 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp3 - tp2).count() << " ms" << std::endl;

    return 0;
}
```

When compiled with `-O3` on my system, the output is:

```
for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33
res0 = 3074457343470774955
res1 = 3074457343470774955
duration0 = 1514 ms
duration1 = 1095 ms
```

The performance discrepancy between `duration0` and `duration1` is unexpected, as both functions should theoretically be optimized to the same machine code.

Upon examining the generated assembly code, I noticed that in `test1`, the multiplication `static_cast<uint64_t>(i) * 0xAAAAAAAB` has undergone strength reduction optimization and been converted into a sequence of addition operations. However, in `test0`, while the division was optimized into a multiplication, the subsequent multiplication was not further optimized into addition operations. This appears to be a missed optimization opportunity.

Additionally, even though `div3_v0` and `div3_v1` produce identical assembly code, they are treated as distinct functions with separate memory addresses. This is another missed optimization, as the compiler should recognize their semantic equivalence and merge them.

Compilation command used: `clang++ -march=x86-64 -O3 -g -mno-sse -mno-avx -mno-avx512f`

This issue can be reproduced with the following compilers:
1. `x86-64 clang 20.1.0` on `godbolt.org`
2. On my WSL environment:
```
Ubuntu clang version 19.1.7 (++20250114103320+cd708029e0b2-1~exp1~20250114103432.75)
Target: x86_64-pc-linux-gnu
```
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to