to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Nico Heller

Good day,

consider the following query:

WITH aggregation AS (
    SELECT
        a.*,
        (SELECT array_agg(b.*) FROM b WHERE b.a_id = a.id) as "bs",
        (SELECT array_agg(c.*) FROM c WHERE c.a_id = a.id) as "cs",
        (SELECT array_agg(d.*) FROM d WHERE d.a_id = a.id) as "ds",
        (SELECT array_agg(e.*) FROM e WHERE e.a_id = a.id) as "es"
    FROM a WHERE a.id IN (...)
)
SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;

Imagine that for each "a" there exist between 5-100 "b", "c", "d" and
"e", which makes the result of this pretty big (worst case: around
300 kB when saved to a text file).
I noticed that adding the "to_jsonb" increases the query time by 100%, 
from 9-10ms to 17-23ms on average.
This may not seem slow at all, but this query has another issue: on an 
AWS Aurora Serverless V2 instance we are seeing RAM usage of around 
30-50 GB under high load (> 1000 queries / sec), compared to < 10 GB 
with a simple LEFT JOINed query. Furthermore, the CPU usage is quite 
high.


Is there anything I could improve? I am open to other solutions, but I 
am wondering if I ran into an edge case of "to_jsonb" for "anonymous 
records" (these are just rows without a defined UDT) - this is just a 
wild guess though.
I am mostly looking to decrease the load (CPU and memory) on Postgres 
itself. Furthermore, I would like to know why the memory usage is so 
significant. Any tips on how to analyze this issue are appreciated as 
well - my knowledge here is limited to being average at interpreting 
EXPLAIN ANALYZE results.


Here's a succinct list of the whys, what I have found out so far, and 
solutions I have already tried or don't want to consider:


- LEFT JOINing potentially creates a huge result set because of the 
cartesian product, that's a no-no
- not using "to_jsonb" is sadly also not possible, as Postgres' array + 
record syntax is very unfriendly and hard to parse in my application 
(it's barely documented, if at all, the quoting rules are cumbersome, 
and the array lacks column names, which would make the parsing 
sensitive to future table changes and thus cumbersome to maintain)
- I know I could solve this with a separate query for a, b, c, d and e 
while "joining" the result in my application, but I am looking for 
another way to do this (bear with me, treat this as an academic 
question :))
- I am using "to_jsonb" to simply map the result to my data model via a 
JSON mapper
- EXPLAIN ANALYZE is not showing anything special when using "to_jsonb" 
vs. not using it, the outermost (hash) join just takes more time - is 
there a more granular EXPLAIN that shows me the runtime of functions 
like "to_jsonb"?
- I tried an approach where b, c, d, e were array columns of UDTs: UDTs 
are not well supported by my application stack (JDBC) and are generally 
undesirable for me (because of a lack of migration possibilities)
- I don't want to duplicate my data into another table (e.g. one that 
has jsonb columns)
- MATERIALIZED VIEWs are also undesirable: they require a manual 
refresh, and the refresh is non-incremental, which would make it take a 
long time on a big data set
- splitting the query into chunks to reduce the IN()-statement list 
size makes no measurable difference
- I don't want to use JSONB columns for b, c, d and e because future 
changes to b, c, d or e's structure (e.g. new fields, changing a 
datatype) are harder to achieve with JSONB, and it lacks constraint 
checks on insert (e.g. not null on column b.xy)


Kind regards and thank you for your time,
Nico Heller

P.S.: Sorry for the long list of "I don't want to do this"; some of 
them are not possible because of other requirements







Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Justin Pryzby
What version of postgres ?

I wonder if you're hitting the known memory leak involving jit.
Try with jit=off or jit_inline_above_cost=-1.
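
For example, to rule that out for a single session (these are standard
Postgres GUCs; I can't say how Aurora handles them):

    SET jit = off;
    -- or keep jit enabled but skip the expensive inlining step:
    SET jit_inline_above_cost = -1;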

-- 
Justin




Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Nico Heller
I knew I forgot something: We are currently on 13.6. When was this issue 
fixed?


Am 12.08.2022 um 20:56 schrieb Justin Pryzby:

What version of postgres ?

I wonder if you're hitting the known memory leak involving jit.
Try with jit=off or jit_inline_above_cost=-1.





Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Nico Heller

Am 12.08.2022 um 21:02 schrieb Rick Otten:




On Fri, Aug 12, 2022 at 2:50 PM Nico Heller  wrote:

Good day,

consider the following query: [...]

- You do have an index on `b.a_id` and `c.a_id`, etc...?  You didn't 
say...

Yes there are indices on all referenced columns of the subselect (they 
are all primary keys anyway)

- Are you sure it is the `to_jsonb` that is making this query slow?
Yes, EXPLAIN ANALYZE shows a doubling of execution time - I don't have 
numbers on the memory usage difference though


- Since you are serializing this for easy machine readable consumption 
outside of the database, does it make a difference if you use 
`to_json` instead?


Using to_json vs. to_jsonb makes no difference with regard to runtime; 
I will check if the memory consumption is different on Monday - thank 
you for the idea!


Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Justin Pryzby
On Fri, Aug 12, 2022 at 07:02:36PM +, Nico Heller wrote:
> I knew I forgot something: We are currently on 13.6. When was this issue
> fixed?

There's a WIP/proposed fix, but the fix is not released.
I asked about your version because jit was disabled by default in v11.
But it's enabled by default in v12.

https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items#Older_bugs_affecting_stable_branches

-- 
Justin




Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Andres Freund
Hi,

On 2022-08-12 18:49:58 +, Nico Heller wrote:
> WITH aggregation AS (
>     SELECT
>         a.*,
>         (SELECT array_agg(b.*) FROM b WHERE b.a_id = a.id) as "bs",
>         (SELECT array_agg(c.*) FROM c WHERE c.a_id = a.id) as "cs",
>         (SELECT array_agg(d.*) FROM d WHERE d.a_id = a.id) as "ds",
>         (SELECT array_agg(e.*) FROM e WHERE e.a_id = a.id) as "es"
>     FROM a WHERE a.id IN (...)
> )
> SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;

> Imagine that for each "a" there exists between 5-100 "b", "c", "d" and "e"
> which makes the result of this pretty big (worst case: around 300kb when
> saved to a text file).
> I noticed that adding the "to_jsonb" increases the query time by 100%, from
> 9-10ms to 17-23ms on average.

Could we see the explain?

Have you tried using json[b]_agg()?
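
For example, something along these lines (untested sketch, just your
query with the aggregates swapped; the outer to_jsonb still wraps the
whole row):

    WITH aggregation AS (
        SELECT
            a.*,
            (SELECT jsonb_agg(b) FROM b WHERE b.a_id = a.id) AS "bs",
            (SELECT jsonb_agg(c) FROM c WHERE c.a_id = a.id) AS "cs",
            (SELECT jsonb_agg(d) FROM d WHERE d.a_id = a.id) AS "ds",
            (SELECT jsonb_agg(e) FROM e WHERE e.a_id = a.id) AS "es"
        FROM a WHERE a.id IN (...)
    )
    SELECT to_jsonb(aggregation.*) AS "value" FROM aggregation;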


> This may not seem slow at all but this query has another issue: on an AWS
> Aurora Serverless V2 instance we are running into a RAM usage of around
> 30-50 GB compared to < 10 GB when using a simple LEFT JOINed query when
> under high load (> 1000 queries / sec). Furthermore the CPU usage is quite
> high.

We can't say much about aurora. It's a heavily modified fork of postgres.  Did
you reproduce this with vanilla postgres? And if so, do you have it in a form
that somebody could try out?

Greetings,

Andres Freund




Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Nico Heller


Am 12.08.2022 um 21:15 schrieb Rick Otten:

[...]


One other thought.  Does it help if you convert the arrays to json 
first before you convert the whole row?  ie, add some to_json()'s 
around the bs, cs, ds, es columns in the CTE.  I'm wondering if 
breaking the json conversions up into smaller pieces will let the 
outer to_json() have less work to do and overall run faster.  You 
could even separately serialize the elements inside the array too.  I 
wouldn't think it would make a huge difference, you'd be making a 
bunch of extra to_json calls, but maybe it avoids some large memory 
structure that would otherwise have to be constructed to serialize all 
of those objects in all of the arrays all at the same time.
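
Roughly what I have in mind, as an untested sketch (table and column
names are just the ones from your query):

    WITH aggregation AS (
        SELECT
            a.*,
            (SELECT to_json(array_agg(b.*)) FROM b WHERE b.a_id = a.id) AS "bs",
            (SELECT to_json(array_agg(c.*)) FROM c WHERE c.a_id = a.id) AS "cs",
            (SELECT to_json(array_agg(d.*)) FROM d WHERE d.a_id = a.id) AS "ds",
            (SELECT to_json(array_agg(e.*)) FROM e WHERE e.a_id = a.id) AS "es"
        FROM a WHERE a.id IN (...)
    )
    SELECT to_jsonb(aggregation.*) AS "value" FROM aggregation;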


Using jsonb_agg plus another to_jsonb at the end (it's still needed to 
create one value at the end and to include the columns "a.*") worsens 
the query performance by 100%. I can't speak for the memory usage 
because I would have to push these changes to preproduction - will try 
this on Monday, thanks.


Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Nico Heller
Here are the query plans (I hope my anonymization didn't break them). I 
ran every query a couple of times before copying the plan to avoid 
timing issues caused by disk access.
Ignore the sequential scan on one of the tables: it's very small (this 
will change in the future), so Postgres opts for a faster sequential 
scan - the other sequential scan is on the IN()-statement, which uses a 
VALUES list in the actual query (using a non-VALUES list makes no 
difference).
Overall the plan is quite optimal for me and performs really well 
considering the amount of rows it extracts and converts to JSON.


Notice how removing to_jsonb improves the query performance 
significantly (see the last query plan) and how the cost is attributed 
to the hash join.
Using to_json instead of to_jsonb or json_agg instead of jsonb_agg 
makes no difference in query plan or execution time.


I used random ids so I don't know how big the result got, but it 
shouldn't matter for the query plan:



array_agg, then to_jsonb (my initially posted query):

Hash Semi Join  (cost=5.00..15947.39 rows=200 width=32) (actual time=0.266..18.128 rows=200 loops=1)
  Hash Cond: (a.id = "*VALUES*".column1)
  ->  Seq Scan on a  (cost=0.00..41.02 rows=502 width=422) (actual time=0.013..0.268 rows=502 loops=1)
  ->  Hash  (cost=2.50..2.50 rows=200 width=32) (actual time=0.091..0.092 rows=200 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 21kB
        ->  Values Scan on "*VALUES*"  (cost=0.00..2.50 rows=200 width=32) (actual time=0.001..0.040 rows=200 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=42.20..42.21 rows=1 width=32) (actual time=0.020..0.020 rows=1 loops=200)
          ->  Bitmap Heap Scan on b  (cost=4.38..42.17 rows=12 width=156) (actual time=0.012..0.017 rows=12 loops=200)
                Recheck Cond: (a_id = a.id)
                Heap Blocks: exact=382
                ->  Bitmap Index Scan on fk_b_idx  (cost=0.00..4.37 rows=12 width=0) (actual time=0.008..0.008 rows=14 loops=200)
                      Index Cond: (a_id = a.id)
  SubPlan 2
    ->  Aggregate  (cost=27.68..27.69 rows=1 width=32) (actual time=0.012..0.012 rows=1 loops=200)
          ->  Bitmap Heap Scan on c  (cost=4.35..27.66 rows=9 width=98) (actual time=0.009..0.010 rows=5 loops=200)
                Recheck Cond: (a_id = a.id)
                Heap Blocks: exact=169
                ->  Bitmap Index Scan on fk_c_idx  (cost=0.00..4.35 rows=9 width=0) (actual time=0.007..0.007 rows=5 loops=200)
                      Index Cond: (a_id = a.id)
  SubPlan 3
    ->  Aggregate  (cost=8.30..8.31 rows=1 width=32) (actual time=0.009..0.010 rows=1 loops=200)
          ->  Index Scan using fk_d_idx on d  (cost=0.28..8.29 rows=1 width=81) (actual time=0.008..0.008 rows=1 loops=200)
                Index Cond: (a_id = a.id)
  SubPlan 4
    ->  Aggregate  (cost=1.27..1.28 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=200)
          ->  Seq Scan on e  (cost=0.00..1.26 rows=1 width=76) (actual time=0.004..0.004 rows=0 loops=200)
                Filter: (a_id = a.id)
                Rows Removed by Filter: 21
Planning Time: 0.520 ms
Execution Time: 18.650 ms

jsonb_agg instead of array_agg, then to_jsonb:

Hash Semi Join  (cost=5.00..15947.39 rows=200 width=32) (actual time=0.338..23.921 rows=200 loops=1)
  Hash Cond: (a.id = "*VALUES*".column1)
  ->  Seq Scan on a  (cost=0.00..41.02 rows=502 width=422) (actual time=0.012..0.244 rows=502 loops=1)
  ->  Hash  (cost=2.50..2.50 rows=200 width=32) (actual time=0.090..0.091 rows=200 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 21kB
        ->  Values Scan on "*VALUES*"  (cost=0.00..2.50 rows=200 width=32) (actual time=0.001..0.040 rows=200 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=42.20..42.21 rows=1 width=32) (actual time=0.050..0.050 rows=1 loops=200)
          ->  Bitmap Heap Scan on b  (cost=4.38..42.17 rows=12 width=156) (actual time=0.012..0.018 rows=12 loops=200)
                Recheck Cond: (a_id = a.id)
                Heap Blocks: exact=382
                ->  Bitmap Index Scan on fk_b_idx  (cost=0.00..4.37 rows=12 width=0) (actual time=0.008..0.008 rows=14 loops=200)
                      Index Cond: (a_id = a.id)
  SubPlan 2
    ->  Aggregate  (cost=27.68..27.69 rows=1 width=32) (actual time=0.028..0.028 rows=1 loops=200)
          ->  Bitmap Heap Scan on c  (cost=4.35..27.66 rows=9 width=98) (actual time=0.009..0.011 rows=5 loops=200)
                Recheck Cond: (a_id = a.id)
                Heap Blocks: exact=169
                ->  Bitmap Index Scan on fk_c_idx  (cost=0.00..4.35 rows=9 width=0) (actual time=0.007..0.007 rows=5 loops=200)
                      Index Cond: (a_id = a.id)
  SubPlan 3
    ->  Aggregate  (cost=8.30..8.31 rows=1 width=32) (actual time=0.014..0.014 rows=1 loops=200)
          ->  Index Scan using fk_d_idx on d  (cost=0.28..8.29 rows=1 width=81) (actual time=0.008..0.008 rows=1 loops=200)
                Index Cond: (a

Re: to_jsonb performance on array aggregated correlated subqueries

2022-08-12 Thread Rick Otten
On Fri, Aug 12, 2022 at 3:02 PM Rick Otten  wrote:

>
>
> On Fri, Aug 12, 2022 at 2:50 PM Nico Heller  wrote:
>
>> Good day,
>>
>> consider the following query:
>>
>> WITH aggregation AS (
>>     SELECT
>>         a.*,
>>         (SELECT array_agg(b.*) FROM b WHERE b.a_id = a.id) as "bs",
>>         (SELECT array_agg(c.*) FROM c WHERE c.a_id = a.id) as "cs",
>>         (SELECT array_agg(d.*) FROM d WHERE d.a_id = a.id) as "ds",
>>         (SELECT array_agg(e.*) FROM e WHERE e.a_id = a.id) as "es"
>>     FROM a WHERE a.id IN (...)
>> )
>> SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;
>>
>>
> - You do have an index on `b.a_id` and `c.a_id`, etc... ?  You didn't
> say...
>
> - Are you sure it is the `to_jsonb` that is making this query slow?
>
> - Since you are serializing this for easy machine readable consumption
> outside of the database, does it make a difference if you use `to_json`
> instead?
>
>
To follow up here a little.  I ran some quick tests on my database and
found that `to_json` is consistently, slightly, faster than `to_jsonb`
when you are just serializing the result set for consumption.  I fed in
some arrays of 1,000,000 elements for testing.  While both JSON
serializers are slower than just sending back the result set, the
difference wasn't significant on my machine with simple object types
(about 3% slower).

Are any of your objects in "b.*", etc., complex data structures, deeper
arrays, GIS shapes, or strange data types that might be hard to
serialize?  I'm wondering if there is something hidden in those ".*"
row sets that is particularly problematic and compute-intensive to
process.
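
For reference, a self-contained sketch of that kind of comparison (not
my exact test, just the shape of it; run in psql with \timing on):

    SELECT array_agg(g)           FROM generate_series(1, 1000000) AS g;  -- baseline: plain array
    SELECT to_json(array_agg(g))  FROM generate_series(1, 1000000) AS g;  -- json serialization
    SELECT to_jsonb(array_agg(g)) FROM generate_series(1, 1000000) AS g;  -- jsonb serialization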