functionality difference-performance postgreSQLv14-GCC-llvm-clang
Hi,

I compiled PostgreSQL v14 with LLVM/Clang v13 and with GCC v11, and captured performance using HammerDB v4.3 (TPC-H). I observed a functional difference in the profiles: the LLVM/Clang v13 build triggers heapgetpage instead of XidInMVCCSnapshot, and vice versa with GCC. I would like to understand this difference, i.e. the function call "heapgetpage vs XidInMVCCSnapshot" in GCC vs LLVM/Clang v13.

I also observed a performance difference: GCC performs better than LLVM/Clang 13 (query execution time, smaller is better) on the same bare-metal machine with the same hardware and DB configuration.

perf data, top hot functions:

LLVM-Clang v13: TPC-H query completed in 50.526 seconds

  Overhead  Symbol
  19.41%    [.] tts_buffer_heap_getsomeattrs
  17.75%    [.] heapgetpage
   9.46%    [.] bpchareq
   5.86%    [.] ExecEvalScalarArrayOp
   5.85%    [.] ExecInterpExpr
   4.50%    [.] ReadBuffer_common
   3.02%    [.] heap_getnextslot

GCC v11: TPC-H query completed in 41.593 seconds

  Overhead  Symbol
  21.13%    [.] tts_buffer_heap_getsomeattrs
  11.58%    [.] XidInMVCCSnapshot
  10.87%    [.] bpchareq
   7.07%    [.] ExecEvalScalarArrayOp
   5.93%    [.] ExecInterpExpr
   5.16%    [.] ReadBuffer_common
   3.61%    [.] heapgetpage

Regards
Arjun
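To make the comparison in the message above concrete, the following is a minimal sketch that computes the relative slowdown from the two reported completion times and ranks the symbols by how far their perf overhead differs between the builds. The numbers are copied from the message; the helper names (`relative_slowdown`, `overhead_deltas`) are hypothetical, and treating a symbol that is missing from one listing as 0% is an assumption, since it may simply have fallen below the top entries shown.

```python
# Completion times (seconds) and perf overheads (%) copied from the message.
clang_seconds = 50.526
gcc_seconds = 41.593

clang_profile = {
    "tts_buffer_heap_getsomeattrs": 19.41,
    "heapgetpage": 17.75,
    "bpchareq": 9.46,
    "ExecEvalScalarArrayOp": 5.86,
    "ExecInterpExpr": 5.85,
    "ReadBuffer_common": 4.50,
    "heap_getnextslot": 3.02,
}
gcc_profile = {
    "tts_buffer_heap_getsomeattrs": 21.13,
    "XidInMVCCSnapshot": 11.58,
    "bpchareq": 10.87,
    "ExecEvalScalarArrayOp": 7.07,
    "ExecInterpExpr": 5.93,
    "ReadBuffer_common": 5.16,
    "heapgetpage": 3.61,
}


def relative_slowdown(slow: float, fast: float) -> float:
    """Return how much longer `slow` took than `fast`, in percent."""
    return (slow - fast) / fast * 100.0


def overhead_deltas(a: dict, b: dict) -> list:
    """Rank symbols by |overhead in a - overhead in b|, largest first.

    ASSUMPTION: a symbol absent from one listing counts as 0.0 there.
    """
    symbols = set(a) | set(b)
    deltas = {s: a.get(s, 0.0) - b.get(s, 0.0) for s in symbols}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


slowdown = relative_slowdown(clang_seconds, gcc_seconds)
deltas = overhead_deltas(clang_profile, gcc_profile)

print(f"Clang build is {slowdown:.1f}% slower than the GCC build")
for symbol, delta in deltas:
    print(f"{delta:+7.2f}  {symbol}")
```

Note that the percentages in each profile are relative to that build's own total samples, so the deltas only indicate where to look first; they are not directly comparable amounts of time.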
PostgreSQLv14 TPC-H performance GCC vs Clang
Hi,

I built PostgreSQL v14 from source with GCC v11.2 and with Clang v12 (without JIT), using optimisation flags such as -O3, and tested both builds with HammerDB. On TPC-H, GCC performs better than Clang (without JIT). The performance difference is ~22%, and I also noticed differences in the generated assembly code between GCC and Clang (e.g. GCC inlines more functionality than Clang).

Environment details:
OS: RHEL 8.4
Bare metal: Apple/AMD EPYC/IBM
Test (TPC-H) benchmark environment: HammerDB

Is the performance difference mainly due to the point below?
1. Data overflow and calculations such as int128 (int128.c) and C arithmetic operations (functions included in float.h, e.g. float4_mul)

Please also suggest any other functionality or code points that I should check regarding the performance difference.
Re: PostgreSQLv14 TPC-H performance GCC vs Clang
Hi @imre:

Thank you for sharing the links on "Phoronix has tested PostgreSQL 13". I compared my test results with the Phoronix Test Suite results. They deviate too much (possibly because of the hardware environment and PostgreSQL version). I think PostgreSQL v13 may have issues with autovacuum, and I am currently using PostgreSQL v14.

In my environment GCC performs better than Clang (LLVM). The reason could be that int128 performs better in GCC than in Clang:
1. Clang (__int128) requires 4 additional functions, "__divti3, __modti3, __udivti3, __umodti3", and these are not required in GCC. So it may lead to a performance drop in Clang.
2. __int128 aligned on 16-byte boundaries (MAXALIGN) is supported in GCC, and this may not be supported in Clang.

@postgresql-performance: kindly let me know your view on those two points.

On Wednesday, November 3, 2021, Imre Samu wrote:

> > .. optimisation flags like O3
> > And please suggest ... to check on the performance difference
>
> Phoronix has tested PostgreSQL 13 with Clang 12 + GCC 11.1 on
> Xeon Ice Lake:
> "The CFLAGS/CXXFLAGS set throughout testing were "-O3 -march=native
> -flto" as would be common for HPC systems when building performance
> sensitive code."
> and the results:
> https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4
> ( see ~ bottom of the page )
> only the Postgres ( GCC 11 vs. LLVM Clang 12 Benchmarks On Xeon Ice Lake )
> https://openbenchmarking.org/result/2105299-IB-COMPILERT91&sgm=1&ppt=D&sor&sgm=1&ppt=D&oss=Postgresql
> maybe you can replicate the Phoronix results ( but this is only gcc11.1 ! )
> "Compare your own system(s) to this result file with the Phoronix Test
> Suite by running the command: phoronix-test-suite benchmark
> 2105299-IB-COMPILERT91"
>
> Regards.
> Imre
>
> arjun shetty wrote (Tue, Nov 2, 2021, 18:13):
>
>> Hi
>> PostgreSQLv14 source code build with GCCv11.2 and Clangv12(without JIT)
>> with optimisation flags like O3 and tested with HammerDB
>> [...]
PostgreSQLv14 TPC-H performance GCC vs Clang
Yes, I am currently focusing on the affected queries as well. In the meantime, from hardware-level analysis and sample examples, I noticed:
1. GCC performs better than Clang on int128.
2. Clang performs better than GCC on long long. Reference example:
https://stackoverflow.com/questions/63029428/why-is-int128-t-faster-than-long-long-on-x86-64-gcc
3. GCC is enabled with "-fexcess-precision=standard" (precision cast for floating point).

Can these 3 points cause the performance difference between GCC and Clang in PostgreSQL v14 in the Apple/AMD environment (the Intel environment still needs to be checked)? In these environments int128 is enabled in PostgreSQL v14.

On Friday, November 5, 2021, Tomas Vondra wrote:

> Hi,
>
> IMO this thread provides so little information it's almost impossible to
> answer the question. There's almost no information about the hardware,
> scale of the test, configuration of the Postgres instance, the exact build
> flags, differences in generated asm code, etc.
>
> I find it hard to believe merely switching from clang to gcc yields 22%
> speedup - that's way higher than any differences we've seen in the past.
>
> In my experience, the speedup is unlikely to be "across the board". There
> will be a handful of affected queries, while most remaining queries will be
> about the same. In that case you need to focus on those queries, see if the
> plans are the same, do some profiling, etc.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: Lock contention high
Hi Ashkil

PostgreSQL utilizes lightweight locks (LWLocks) to synchronize and control access to the buffer content. A process acquires an LWLock in shared mode to read from the buffer and in exclusive mode to write to the buffer. Therefore, while holding an exclusive lock, a process prevents other processes from acquiring a shared or exclusive lock. A shared lock, by contrast, can be acquired concurrently by other processes. The issue starts when many processes acquire an exclusive lock on buffer content. As a result, LWLockAcquire is seen as the top hot function in profiling.

Here you need to understand whether LWLockAcquire being the top function in the profile reflects lock contention or CPU time spent inside the function itself.

This can be analysed from the "LwStatus" log (presumably the output of a build with LWLOCK_STATS defined), using parameters like ex-acquire-count (exclusive mode), sh-acquire-count, block-count and spin-delay-count:

Total lock acquisition requests = ex-acquire-count + sh-acquire-count
Lock contention % = block-count / total lock acquisition requests

A low lock contention % may indicate that most of the CPU time is spent inside the function rather than spinning/waiting for the lock.

On Friday, November 12, 2021, Ashkil Dighin wrote:

> Hi
> I suspect lock contention and performance issues with __int128. And I
> would like to check the performance by forcibly disabling
> int128 (MAXALIGN 16 bytes) and enabling long long (MAXALIGN 8 bytes).
> Is it possible to disable int128 in PostgreSQL?
>
> On Thursday, October 28, 2021, Andres Freund wrote:
>
>> Hi,
>>
>> On October 27, 2021 2:44:56 PM PDT, Ashkil Dighin wrote:
>> >Hi,
>> >Yes, lock contention reduced with postgresqlv14.
>> >Lock acquire reduced 18% to 10%
>> >10.49 %postgres postgres[.] LWLockAcquire
>> >5.09% postgres postgres[.] _bt_compare
>> >
>> >Is lock contention can be reduced to 0-3%?
>>
>> Probably not, or at least not easily. Because of the atomic instructions
>> the locking also includes some other costs (e.g. cache misses, serializing
>> store buffers,...).
>>
>> There's a good bit we can do to increase the cache efficiency around
>> buffer headers, but it won't get us quite that low I'd guess.
>>
>> >On pg-stat-activity shown LwLock as “BufferCounter” and “WalInsert”
>>
>> Without knowing what proportion they have to each and to non-waiting
>> backends that unfortunately doesn't help that much..
>>
>> Andres
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
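The contention estimate described in the reply at the top of this message (block count divided by the total number of acquisition requests) can be sketched as follows. This is only an illustration: the `LockStats` class and the sample counter values are hypothetical and not taken from any real log; the field names simply mirror the counters named in the message.

```python
from dataclasses import dataclass


@dataclass
class LockStats:
    """Per-lock counters as named in the message (values are supplied by the caller)."""
    ex_acquire_count: int   # acquisitions in exclusive mode
    sh_acquire_count: int   # acquisitions in shared mode
    block_count: int        # times a backend had to block waiting for the lock
    spin_delay_count: int   # spin delays while taking the lock's internal mutex

    def total_requests(self) -> int:
        # Total lock acquisition requests = exclusive + shared
        return self.ex_acquire_count + self.sh_acquire_count

    def contention_pct(self) -> float:
        # Lock contention % = block count / total requests
        total = self.total_requests()
        return 0.0 if total == 0 else 100.0 * self.block_count / total


# HYPOTHETICAL sample values, for illustration only.
sample = LockStats(
    ex_acquire_count=2_000,
    sh_acquire_count=8_000,
    block_count=500,
    spin_delay_count=40,
)

print(f"total requests: {sample.total_requests()}")
print(f"contention:     {sample.contention_pct():.1f}%")
```

With a ratio like this per lock, a small percentage alongside a hot LWLockAcquire would point towards CPU cost inside the function (atomics, cache misses) rather than waiting, which is the distinction the message asks for.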
Re: Lock contention high
1. How can I check which NUMA node a PostgreSQL process is fetching memory from?
2. Is the following NUMA configuration a good choice for PostgreSQL?

vm.zone_reclaim_mode = 0
numactl --interleave=all /init.d/ PostgreSQL start
kernel.numa_balancing = 0

On Wednesday, November 17, 2021, arjun shetty wrote:

> Hi Askhil
>
> PostgreSQL utilizes lightweight locks(LWLocks) to synchronize and
> control access to the buffer content. A process acquires an LWLock in a
> shared mode to read from the buffer and an exclusive mode to write to
> the buffer.
> [...]
PostgreSQLv14 performance client-server-HammerDB
Hi,

I built PostgreSQL v14 from source with GCC v11.1 and ran the binaries in two setups: on a single machine, and in a client-server setup across two machines. Compared with the single Milan machine, the NOPM is more or less half with the client-server method.

I checked the network bandwidth in the client-server setup: the bandwidth is similar in both directions (transmit and receive), and the TCP/UDP ports show the same bandwidth. The only difference in the client-server setup is the RAM size and the caches (L1/L2/L3). Could this cause the drop in NOPM? Are there other recommended configurations or parameters that I should check via HammerDB v4.x?

Client-server model (HammerDB v4.x runs on the client and PostgreSQL v14 runs on the server):
12 VU: NOPM 431811

Single machine (both HammerDB v4.x and PostgreSQL v14 run on the same machine):
12 VU: NOPM 728825
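To quantify "more or less half", a small sketch using only the two NOPM figures reported above; the variable names are illustrative.

```python
# NOPM figures copied from the message (12 virtual users in both runs).
virtual_users = 12
nopm_client_server = 431_811
nopm_single_machine = 728_825

# Client-server throughput as a fraction of single-machine throughput.
ratio = nopm_client_server / nopm_single_machine

# Relative drop when moving from the single machine to client-server.
drop_pct = (1.0 - ratio) * 100.0

# Per-virtual-user throughput, to compare the two setups at equal load.
per_vu_client_server = nopm_client_server / virtual_users
per_vu_single_machine = nopm_single_machine / virtual_users

print(f"client-server / single machine: {ratio:.3f}")
print(f"drop: {drop_pct:.1f}%")
print(f"NOPM per VU: {per_vu_client_server:.0f} vs {per_vu_single_machine:.0f}")
```

The drop works out to roughly 41%, so the client-server run reaches a bit under 60% of the single-machine throughput.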
Re: PostgreSQLv14 TPC-H performance GCC vs Clang
Hi All,

I checked with LLVM/Clang 14.0 on x86-64 with -O3 in the Mac/AMD EPYC environment, but I still see GCC performing better than Clang 14.

Clang 14: https://github.com/llvm/llvm-project (main branch, pulled at commit ID 3f3fe4a5cfa1797..)

[image: image.png]

Preliminary analysis, GCC vs Clang:
(1) GCC inlines more functionality than Clang in PostgreSQL.
(2) A few functions are not inlined by GCC but are considered for inlining by Clang:
    postgresqlv14/src/include/utils/float.h: float8_mul(), float8_div() (arithmetic functions)
    postgresqlv14/src/backend/utils/adt/geo_ops.c: point_xxx()
(3) GCC performs better than Clang on the int128 datatype (this needs to be cross-checked at the instruction level/assembly code on the hardware).
(4) Following point (2), I removed the inline keyword in the source code from those functions in float.h and geo_ops.c, and observed a 6% performance improvement with Clang compared to the inlined version.

regards,
Arjun

On Fri, Dec 10, 2021 at 11:51 PM Imre Samu wrote:

> > GCC vs Clang
>
> related:
> As I see - with LLVM/Clang 14.0 ( X86_64 -O3 ) ~12% performance increase
> expected with the new optimisation ( probably adapted from gcc )
> - https://twitter.com/djtodoro/status/1466808507240386560
> - https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Clang-14-Hoist-Load
>
> regards,
> Imre
>
> arjun shetty wrote (Tue, Nov 16, 2021, 11:10):
>
>> Yes, currently focusing affects queries as well.
>> In meanwhile on analysis(hardware level) and sample examples noticed
>> 1. GCC performance better than Clang on int128 .
>> 2. Clang performance better than GCC on long long
>> [...]
