functionality difference-performance postgreSQLv14-GCC-llvm-clang

2022-07-11 Thread arjun shetty
Hi,

I compiled PostgreSQL v14 with LLVM/Clang v13 and with GCC v11, and
captured performance using HammerDB v4.3 (TPC-H).
I observed a difference in the hot functions: the LLVM/Clang v13 build
shows heapgetpage where the GCC build shows XidInMVCCSnapshot.
I would like to understand what causes this difference, i.e. why
heapgetpage is hot with LLVM/Clang v13 but XidInMVCCSnapshot is hot with
GCC.
I also observed a performance difference: GCC gives a better query
execution time (lower is better) than LLVM/Clang v13 on the same bare
metal with the same hardware and database configuration.

perf data, top hot functions:

LLVM-Clangv13
=============
TPC-H query completed in 50.526 seconds

 Overhead  Symbol
   19.41%  [.] tts_buffer_heap_getsomeattrs
  *17.75%  [.] heapgetpage*
    9.46%  [.] bpchareq
    5.86%  [.] ExecEvalScalarArrayOp
    5.85%  [.] ExecInterpExpr
    4.50%  [.] ReadBuffer_common
    3.02%  [.] heap_getnextslot

GCCv11
======
TPC-H query completed in 41.593 seconds

 Overhead  Symbol
   21.13%  [.] tts_buffer_heap_getsomeattrs
  *11.58%  [.] XidInMVCCSnapshot*
   10.87%  [.] bpchareq
    7.07%  [.] ExecEvalScalarArrayOp
    5.93%  [.] ExecInterpExpr
    5.16%  [.] ReadBuffer_common
    3.61%  [.] heapgetpage
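
To put the two profiles side by side, here is a minimal Python sketch
that uses only the figures listed above. The function names are made up
for illustration, and a symbol that is missing from one listing is
treated as 0.0 there, which only means it fell outside the listed top
entries, not that it never ran.

```python
# Query times and top symbols copied from the two perf listings above.
CLANG_SECONDS = 50.526
GCC_SECONDS = 41.593

clang_profile = {
    "tts_buffer_heap_getsomeattrs": 19.41,
    "heapgetpage": 17.75,
    "bpchareq": 9.46,
    "ExecEvalScalarArrayOp": 5.86,
    "ExecInterpExpr": 5.85,
    "ReadBuffer_common": 4.50,
    "heap_getnextslot": 3.02,
}
gcc_profile = {
    "tts_buffer_heap_getsomeattrs": 21.13,
    "XidInMVCCSnapshot": 11.58,
    "bpchareq": 10.87,
    "ExecEvalScalarArrayOp": 7.07,
    "ExecInterpExpr": 5.93,
    "ReadBuffer_common": 5.16,
    "heapgetpage": 3.61,
}


def slowdown_percent(slow, fast):
    """How much longer `slow` takes than `fast`, in percent."""
    return (slow - fast) / fast * 100.0


def profile_delta(a, b):
    """Per-symbol overhead difference (a - b) in percentage points."""
    symbols = sorted(set(a) | set(b))
    return {s: round(a.get(s, 0.0) - b.get(s, 0.0), 2) for s in symbols}


if __name__ == "__main__":
    pct = slowdown_percent(CLANG_SECONDS, GCC_SECONDS)
    print(f"Clang run takes {pct:.1f}% longer than the GCC run")
    delta = profile_delta(clang_profile, gcc_profile)
    # Largest differences first.
    for sym, d in sorted(delta.items(), key=lambda kv: -abs(kv[1])):
        print(f"{d:+7.2f}  {sym}")
```

One possible reading of the largest deltas is that the compilers inline
the visibility check differently, so perf attributes similar work to
different symbols. That is an assumption to verify against the generated
assembly; the profiles alone do not prove it.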

Regards
Arjun


PostgreSQLv14 TPC-H performance GCC vs Clang

2021-11-02 Thread arjun shetty
Hi,
I built PostgreSQL v14 from source with GCC v11.2 and with Clang v12
(without JIT), using optimisation flags such as -O3, and tested both
builds with HammerDB.
On TPC-H, the GCC build performs better than the Clang build (without
JIT). The performance difference is ~22%, and I also noticed differences
in the generated assembly between GCC and Clang (e.g. GCC inlines more
code than Clang).

Environment details:
OS: RHEL 8.4
Bare metal: Apple/AMD EPYC/IBM
Test (TPC-H) benchmark environment: HammerDB

Is the performance difference mainly due to the following?
1. Data overflow handling and calculations on int128 (int128.c) and C
arithmetic operations (functions in float.h, e.g. float4_mul)

Please also suggest any other functionality or code paths I should check
regarding the performance difference.


Re: PostgreSQLv14 TPC-H performance GCC vs Clang

2021-11-05 Thread arjun shetty
Hi,

@imre: Thank you for sharing the links on "Phoronix has been tested the
PostgreSQL 13".
I compared my test results with the Phoronix Test Suite results. They
deviate considerably (maybe because of the hardware environment and the
PostgreSQL version).
I think PostgreSQL v13 may have issues with autovacuum, and I am currently
using PostgreSQL v14.


In my environment GCC performs better than Clang (LLVM). The reason may be
that "int128" performs better with GCC than with Clang:
1. With Clang, __int128 requires four additional functions (__divti3,
__modti3, __udivti3, __umodti3), which are not required with GCC. This may
lead to a performance drop with Clang.
2. __int128 aligned on 16-byte boundaries (MAXALIGN) is supported in GCC
and may not be supported in Clang.

@postgresql-performance: kindly let me know your view on these two points.





On Wednesday, November 3, 2021, Imre Samu  wrote:

> > .. optimisation flags like O3
> > And please suggest ...  to check on the performance difference
>
> Phoronix has tested PostgreSQL 13 with Clang 12 and GCC 11.1 on
> Xeon Ice Lake:
>   "The CFLAGS/CXXFLAGS set throughout testing were "-O3 -march=native
>   -flto" as would be common for HPC systems when building performance
>   sensitive code."
> and the results:
>   https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4
>   (see ~ bottom of the page)
> only the Postgres results (GCC 11 vs. LLVM Clang 12 Benchmarks On Xeon
> Ice Lake):
>   https://openbenchmarking.org/result/2105299-IB-COMPILERT91&sgm=1&ppt=D&sor&sgm=1&ppt=D&oss=Postgresql
> Maybe you can replicate the Phoronix results (but this is only
> gcc11.1!):
>   "Compare your own system(s) to this result file with the Phoronix Test
>   Suite by running the command: phoronix-test-suite benchmark
>   2105299-IB-COMPILERT91"
>
> Regards.
>   Imre
>


PostgreSQLv14 TPC-H performance GCC vs Clang

2021-11-16 Thread arjun shetty
Yes, I am currently focusing on the affected queries as well.
In the meantime, from analysis at the hardware level and from sample
examples, I noticed:
1. GCC performs better than Clang on int128.
2. Clang performs better than GCC on long long.
   Reference example:
   https://stackoverflow.com/questions/63029428/why-is-int128-t-faster-than-long-long-on-x86-64-gcc
3. GCC is enabled with "-fexcess-precision=standard" (precision of casts
   for floating point).

Can these three points cause the performance difference between GCC and
Clang with PostgreSQL v14 in the Apple/AMD/() environment (the Intel
environment still needs to be checked)?
In these environments int128 is enabled in PostgreSQL v14.

On Friday, November 5, 2021, Tomas Vondra 
wrote:

> Hi,
>
> IMO this thread provides so little information it's almost impossible to
> answer the question. There's almost no information about the hardware,
> scale of the test, configuration of the Postgres instance, the exact build
> flags, differences in generated asm code, etc.
>
> I find it hard to believe merely switching from clang to gcc yields 22%
> speedup - that's way higher than any differences we've seen in the past.
>
> In my experience, the speedup is unlikely to be "across the board". There
> will be a handful of affected queries, while most remaining queries will be
> about the same. In that case you need to focus on those queries, see if the
> plans are the same, do some profiling, etc.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Lock contention high

2021-11-16 Thread arjun shetty
Hi Ashkil,

PostgreSQL uses lightweight locks (LWLocks) to synchronize and control
access to buffer content. A process acquires an LWLock in shared mode to
read from the buffer and in exclusive mode to write to the buffer.
While holding an exclusive lock, a process therefore prevents other
processes from acquiring a shared or exclusive lock, whereas a shared lock
can be held concurrently by several processes. The problem starts when
many processes acquire an exclusive lock on buffer content. As a result,
LWLockAcquire shows up as the top hot function in profiling.
Here you need to determine whether the time in LWLockAcquire is lock
contention or CPU time spent inside the function itself (it being the top
function in the profile).

This can be analysed from the "LwStatus" log, using parameters such as
ex-acquire-count (exclusive mode), sh-acquire-count, block-count and
spin-delay-count:

Total lock acquisition requests = ex-acquire-count + sh-acquire-count
Lock contention % = (block-count / total lock acquisition requests) x 100

The lock contention figure may show that most of the CPU time is spent
inside the function rather than spinning or waiting for the lock.
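
The two ratios above can be sketched in a few lines of Python. This is
only an illustration: the function name and the sample counter values
are made up, and the counters are assumed to have already been read from
the log by some other means.

```python
def lock_contention_percent(ex_acquire_count, sh_acquire_count, block_count):
    """Share of lock acquisition requests that had to block, in percent.

    total requests = exclusive acquisitions + shared acquisitions
    contention %   = block_count / total requests * 100
    """
    total = ex_acquire_count + sh_acquire_count
    if total == 0:
        # No acquisitions recorded, so there is nothing to be contended.
        return 0.0
    return 100.0 * block_count / total


if __name__ == "__main__":
    # Hypothetical counters for a single lock, not measured values.
    pct = lock_contention_percent(
        ex_acquire_count=2000, sh_acquire_count=8000, block_count=500
    )
    print(f"lock contention: {pct:.1f}%")
```

A low percentage alongside a high LWLockAcquire share in the profile
would point towards CPU time spent inside the function rather than
waiting on the lock.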

On Friday, November 12, 2021, Ashkil Dighin 
wrote:

> Hi
> I suspect lock contention and performance issues with __int128. And I
> would like to check the performance by forcibly disabling
> int128(Maxalign16bytes) and enable like long long(maxlign 8bytes).
>  Is it possible to disable int128 in PostgreSQL?
>
> On Thursday, October 28, 2021, Andres Freund  wrote:
>
>> Hi,
>>
>> On October 27, 2021 2:44:56 PM PDT, Ashkil Dighin <
>> [email protected]> wrote:
>> >Hi,
>> >Yes, lock contention reduced with postgresqlv14.
>> >Lock acquire reduced 18% to 10%
>> >10.49 %postgres  postgres[.] LWLockAcquire
>> >5.09%  postgres  postgres[.] _bt_compare
>> >
>> >Is lock contention can be reduced to 0-3%?
>>
>> Probably not, or at least not easily. Because of the atomic instructions
>> the locking also includes  some other costs (e.g. cache misses, serializing
>> store buffers,...).
>>
>> There's a good bit we can do to increase the cache efficiency around
>> buffer headers, but it won't get us quite that low I'd guess.
>>
>>
>> >On pg-stat-activity shown LwLock as “BufferCounter” and “WalInsert”
>>
>> Without knowing what proportion they have to each and to non-waiting
>> backends that unfortunately doesn't help that much..
>>
>> Andres
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>


Re: Lock contention high

2021-11-29 Thread arjun shetty
1. How can I check which NUMA node a PostgreSQL process is fetching its
memory from?

2. Is the following NUMA configuration good for PostgreSQL?
   vm.zone_reclaim_mode = 0
   numactl --interleave=all /init.d/ PostgreSQL start
   kernel.numa_balancing = 0
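
For the first question, one possible approach on Linux is to read
/proc/<pid>/numa_maps, where each mapping carries N<node>=<pages> fields.
The Python sketch below sums those fields per node. The helper names and
the sample text are hypothetical, and the counts are in pages whose size
can differ per mapping (see kernelpagesize_kB), so treat the totals as a
rough indication only.

```python
import re
from collections import Counter

# A whole whitespace-separated token such as "N0=1234".
NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings in the text."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = NODE_FIELD.fullmatch(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def read_numa_maps(pid):
    """Read the numa_maps file of a running process (Linux only)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()


# Hypothetical sample in the numa_maps layout, not real output.
SAMPLE = (
    "00400000 default file=/usr/local/pgsql/bin/postgres mapped=120 "
    "N0=100 N1=20 kernelpagesize_kB=4\n"
    "7f2c00000000 interleave:0-1 anon=4096 dirty=4096 "
    "N0=2048 N1=2048 kernelpagesize_kB=4\n"
)

if __name__ == "__main__":
    print(pages_per_node(SAMPLE))
```

For a live backend, pages_per_node(read_numa_maps(pid)) could be called
with the backend's process ID, given sufficient permissions.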







PostgreSQLv14 performance client-server-HammerDB

2021-12-16 Thread arjun shetty
Hi,

I built PostgreSQL v14 from source with GCC v11.1 and ran the binaries in
two setups: on a single machine, and on separate client and server
machines.

Compared with the single Milan machine, NOPM with the client-server setup
is more or less half (about 59%, see the figures below).

I checked the network bandwidth between the client and server machines:
transmit and receive bandwidth are similar, and the TCP/UDP ports show the
same bandwidth.

The only difference in the client-server setup is RAM size and cache
(L1/L2/L3). Could this cause the drop in NOPM?

Are there any other recommended configurations or parameters I should
check in HammerDB v4.x?

Client-server model (HammerDB v4.x runs on the client and PostgreSQL v14
runs on the server):

12 VU: NOPM 431811

Single machine (both HammerDB v4.x and PostgreSQL v14 run on the same
machine):

12 VU: NOPM 728825


Re: PostgreSQLv14 TPC-H performance GCC vs Clang

2022-01-18 Thread arjun shetty
Hi All,

I checked with LLVM/Clang 14.0 on x86-64 with -O3 in the Mac/AMD EPYC
environment, but I still see GCC performing better than Clang 14.
Clang 14: https://github.com/llvm/llvm-project (main branch, pull or
commit ID: 3f3fe4a5cfa1797..)

Preliminary analysis, GCC vs Clang:
(1) GCC inlines more code than Clang in PostgreSQL.
(2) A few functions are not inlined by GCC but are inlined by Clang:
    postgresqlv14/src/include/utils/float.h: float8_mul(), float8_div()
    (arithmetic functions)
    postgresqlv14/src/backend/utils/adt/geo_ops.c: point_xxx()
(3) GCC performs better than Clang on the int128 data type (this needs to
    be cross-checked at the instruction/assembly level on the hardware).
(4) For the functions in point (2), I removed inline in the source code
    (float.h and geo_ops.c) and observed a 6% performance improvement
    with Clang compared to the inlined version.

regards,
Arjun


On Fri, Dec 10, 2021 at 11:51 PM Imre Samu  wrote:

> > GCC vs Clang
>
> related:
> As I see - with LLVM/Clang 14.0 ( X86_64 -O3 )   ~12% performance increase
> expected with the new optimisation ( probably adapted from gcc  )
> - https://twitter.com/djtodoro/status/1466808507240386560
> - https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Clang-14-Hoist-Load
>
> regards,
>  Imre
>
>
>