Re: NUMA settings

2020-05-05 Thread Laurenz Albe
On Tue, 2020-05-05 at 07:56 +0200, Marc Rechté wrote:
> Thanks for answers. Further readings make me think that we should *not* 
> start postgres with numactl --interleave=all: this may have a 
> counterproductive effect on backends' anon memory (heap, stack). IMHO, what 
> is important is to use Huge Pages for shared buffers: they are allocated 
> (reserved) by the kernel at boot time and spread evenly on all nodes. On 
> top of that they never swap.
> 
> My (temp) conclusions are as follows:
> vm.zone_reclaim_mode = 0
> kernel.numa_balancing = 0 (still not sure with that choice)
> vm.swappiness = 60 (default)
> start postgres as usual (no numactl)
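
For reference, the huge-page part of that recipe might look roughly like
this, assuming 2 MB huge pages and about 8 GB of shared buffers (both
figures are placeholders, not taken from the original post):

# /etc/sysctl.conf -- pages are reserved at boot and spread across the NUMA nodes
vm.nr_hugepages = 4200

# postgresql.conf
shared_buffers = 8GB
huge_pages = on         # fail at startup rather than silently fall back to regular pages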

Thanks for sharing your insights.

I think that "vm.swappiness" should be 0.
PostgreSQL does its own memory management, any swapping by the kernel
would go against that.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





Re: Please help! Query jumps from 1s -> 4m

2020-05-05 Thread Laurenz Albe
On Mon, 2020-05-04 at 20:12 +0100, James Thompson wrote:
> The change is abrupt, on the 10th execution (but I hadn't spotted it was 
> always after the
> same number of executions until your suggestion - thanks for pointing me in 
> that direction).
> 
> I don't see any custom configuration on our end that changes the threshold 
> for this from 5->10.
> Debugging the query call I also see that PgConnection has the 
> prepareThreshold set to 5.
> 
> Additionally, the execution plans for the 10th + following queries look fine, 
> they have the
> same structure as if I run the query manually. It's not that the query plan 
> switches,
> it seems as though the same query plan is just > 200X slower than usual.
> 
> As for the heap fetches -> as far as I can tell, on both occasions the 
> fetches are relatively
> low and shouldn't account for minutes of execution (even if one is lower than 
> the other).
> Looking through one days logs I do find cases with lower heap fetches too, 
> for example as
> below which has 1977 fetches instead of the previous 6940 but took approx the 
> same time:
> ->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 
> table1alias1  (cost=0.56..17.25 rows=1 width=60) (actual 
> time=56.858..120893.874 rows=67000 loops=1)
> Index Cond: (col20 = $2005)
> Filter: (((col3 = $2004) OR (col3 IS NULL)) AND ((col8)::text = ANY 
> ((ARRAY[$1004, ..., $2003])::text[])))
> Rows Removed by Filter: 2662793
> Heap Fetches: 1977
> Buffers: shared hit=84574 read=3522
> 
> Would you agree the statement threshold / heap fetches seems unlikely to be 
> causing this? Any other thoughts?

It does sound suspiciously like custom plans vs. generic plan.

If you are using JDBC, then the cut-off of 10 would make sense:
the JDBC driver switches to (server-side) prepared statements only from
the fifth execution on, and the server switches a prepared statement to
a generic plan only after its fifth execution, which together puts the
change at the tenth execution overall.
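
Outside JDBC, the server-side half of this can be reproduced directly in
psql; a minimal sketch (using the table and column names from your plan,
with a made-up parameter value):

PREPARE q(bigint) AS SELECT count(*) FROM table1 WHERE col20 = $1;
EXECUTE q(12345);  -- executions 1 to 5 are planned with the actual value
EXECUTE q(12345);
EXECUTE q(12345);
EXECUTE q(12345);
EXECUTE q(12345);
EXECUTE q(12345);  -- from here on the server may switch to a generic plan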

It would be good to see the execution plan from the third, seventh
and thirteenth execution.  You could use "auto_explain" for that.
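
A possible auto_explain setup, loaded per session here (for the JDBC
connections it would rather go into session_preload_libraries or
shared_preload_libraries); the threshold is just an example:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';  -- log plans of statements slower than this
SET auto_explain.log_analyze = on;         -- include actual row counts and timings
SET auto_explain.log_buffers = on;         -- include buffer usage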

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





Re: NUMA settings

2020-05-05 Thread Marc Rechté

> On Tue, 2020-05-05 at 07:56 +0200, Marc Rechté wrote:
> > Thanks for answers. Further readings make me think that we should *not*
> > start postgres with numactl --interleave=all: this may have a
> > counterproductive effect on backends' anon memory (heap, stack). IMHO, what
> > is important is to use Huge Pages for shared buffers: they are allocated
> > (reserved) by the kernel at boot time and spread evenly on all nodes. On
> > top of that they never swap.
> > 
> > My (temp) conclusions are as follows:
> >  vm.zone_reclaim_mode = 0
> >  kernel.numa_balancing = 0 (still not sure with that choice)
> >  vm.swappiness = 60 (default)
> >  start postgres as usual (no numactl)
> 
> Thanks for sharing your insights.
> 
> I think that "vm.swappiness" should be 0.
> PostgreSQL does its own memory management, any swapping by the kernel
> would go against that.
> 
> Yours,
> Laurenz Albe

As said in the post, we wish the idle backends to be swapped out if 
necessary. Therefore lowering swappiness would produce the opposite 
effect: swapping out Linux file cache rather than backends memory.





Re: NUMA settings

2020-05-05 Thread Laurenz Albe
On Tue, 2020-05-05 at 10:11 +0200, Marc Rechté wrote:
> > I think that "vm.swappiness" should be 0.
> > PostgreSQL does its own memory management, any swapping by the kernel
> > would go against that.
> > 
> > Yours,
> > Laurenz Albe
> > 
> As said in the post, we wish the idle backends to be swapped out if 
> necessary. Therefore lowering swappiness would produce the opposite 
> effect: swapping out Linux file cache rather than backends memory.

I see.  Sorry for not paying attention.

An idle backend consumes only a few MB of RAM, though.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





Re: Please help! Query jumps from 1s -> 4m

2020-05-05 Thread Justin Pryzby
On Mon, May 04, 2020 at 02:12:01PM -0500, Justin Pryzby wrote:
> On Mon, May 04, 2020 at 08:07:07PM +0100, Jamie Thompson wrote:
> > Additionally, the execution plans for the 10th + following queries look
> > fine, they have the same structure as if I run the query manually. It's not
> > that the query plan switches, it seems as though the same query plan is
> > just > 200X slower than usual.
> 
> Are you able to reproduce the problem manually ?
> 
> With/without PREPARE ?
> https://www.postgresql.org/docs/current/sql-prepare.html

Also, you should be able to check if that's the problem by doing either:
plan_cache_mode = force_generic_plan;
Or (I would think) DISCARD PLANS;

https://www.postgresql.org/docs/12/runtime-config-query.html
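
A sketch of what that test could look like in a session (plan_cache_mode
exists since v12; this only illustrates the suggestion above):

SET plan_cache_mode = force_generic_plan;  -- prepared statements now always use the generic plan
-- re-run the prepared statement once: if it is slow immediately, the generic plan is the problem
DISCARD PLANS;                             -- drops the session's cached plans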

-- 
Justin




Re: Please help! Query jumps from 1s -> 4m

2020-05-05 Thread James Thompson
I've managed to replicate this now with prepared statements. Thanks for all
the guidance so far.

The slowness occurs when the prepared statement changes to a generic plan.

Initial plan:
->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
table1alias2  (cost=0.56..2549.70 rows=70 width=36) (actual
time=1.901..45.256 rows=65000 loops=1)
Output: table1alias2.uniqueid
Index Cond: ((table1alias2.col20 = '12345'::bigint) AND (table1alias2.
col8 = ANY ('{c5986b02-3a02-4639-8147-f286972413ba,...
98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))
Heap Fetches: 10
Buffers: shared hit=5048

after 5 executions of the statement:
->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
table1alias2  (cost=0.56..17.23 rows=1 width=36) (actual
time=125.344..126877.822 rows=65000 loops=1)
Output: table1alias2.uniqueid
Index Cond: (table1alias2.col20 = $1001)
Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,
$1000])::text[]))
Rows Removed by Filter: 2670023
Heap Fetches: 428
Buffers: shared hit=45933 read=42060 dirtied=4

The second plan looks worse to me as it's applying a filter rather than
using an index condition? I don't understand why it's not part of the
condition and also why this is so much slower though.
If I force a retrieval of all index rows for col20 = '12345' using an
ad-hoc query (below, which in my mind is what the 'bad' plan is doing),
that only takes 2s (2.7 mil rows). Where's the difference?

EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT COUNT(DISTINCT id) FROM table1
WHERE datatype='TypeA' AND col20 = 12345;
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
(cost=0.56..2762.95 rows=55337 width=8) (actual time=3.976..1655.645
rows=2735023 loops=1)
 Index Cond: (col20 = 12345)
 Heap Fetches: 417
 Buffers: shared hit=43843 read=44147 dirtied=8

>You could do: SET track_io_timing TO on;
I've not tried this yet, and haven't used it before - sounds like there are
some risks associated with running it on a production server / clocks going
backwards?

>Also, you should be able to check if that's the problem by doing either:
>plan_cache_mode = force_generic_plan;
>Or (I would think) DISCARD PLANS;
I think plan_cache_mode is just for pg12+ unfortunately? We're on 11
currently.
Just tested DISCARD PLANS locally, it didn't switch back from the generic
plan. Was that your expectation?


Re: Please help! Query jumps from 1s -> 4m

2020-05-05 Thread Justin Pryzby
On Tue, May 05, 2020 at 10:10:18PM +0100, James Thompson wrote:
> I've managed to replicate this now with prepared statements. Thanks for all
> the guidance so far.
> 
> The slowness occurs when the prepared statement changes to a generic plan.
> 
> Initial plan:
> ->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 
> table1alias2  (cost=0.56..2549.70 rows=70 width=36) (actual 
> time=1.901..45.256 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: ((table1alias2.col20 = '12345'::bigint) AND 
> (table1alias2.col8 = ANY 
> ('{c5986b02-3a02-4639-8147-f286972413ba,...98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))

The rowcount is off by a factor of 1000x.

> after 5 executions of the statement:
> ->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 
> table1alias2  (cost=0.56..17.23 rows=1 width=36) (actual 
> time=125.344..126877.822 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: (table1alias2.col20 = $1001)
> Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,$1000])::text[]))
> Rows Removed by Filter: 2670023

And the generic plan is cheaper than the previous, custom plan; but slower,
undoubtedly due to rowcount mis-estimate.

> The second plan looks worse to me as it's applying a filter rather than
> using an index condition? I don't understand why it's not part of the
> condition and also why this is so much slower though.
> If I force a retrieval of all index rows for col20 = '12345' using an
> ad-hoc query (below, which in my mind is what the 'bad' plan is doing),
> that only takes 2s (2.7 mil rows). Where's the difference?
> 
> EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT COUNT(DISTINCT id) FROM table1
> WHERE datatype='TypeA' AND col20 = 12345;
> -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
> (cost=0.56..2762.95 rows=55337 width=8) (actual time=3.976..1655.645 
> rows=2735023 loops=1)

I see you're querying on datatype, which I think isn't in the original query
(but I'm not sure if it's just renamed?).

Underestimate usually means that the conditions are redundant or correlated.
You could mitigate that either by creating an index on both columns (or add
datatype to the existing index), or CREATE STATISTICS ndistinct on those
columns.  Or maybe you just need to increase the stats target for col20 (?).
Then re-analyze.
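
A rough sketch of the first two of those (the statistics object name is
only illustrative):

CREATE INDEX ON table1 (col20, datatype);           -- composite index on both columns
CREATE STATISTICS table1_col20_datatype (ndistinct)
    ON col20, datatype FROM table1;                 -- extended n-distinct statistics
ANALYZE table1;                                     -- rebuild statistics either way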

-- 
Justin




Re: Please help! Query jumps from 1s -> 4m

2020-05-05 Thread Tom Lane
James Thompson  writes:
> The slowness occurs when the prepared statement changes to a generic plan.

> Initial plan:
> ->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
> table1alias2  (cost=0.56..2549.70 rows=70 width=36) (actual
> time=1.901..45.256 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: ((table1alias2.col20 = '12345'::bigint) AND (table1alias2.
> col8 = ANY ('{c5986b02-3a02-4639-8147-f286972413ba,...
> 98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))
> Heap Fetches: 10
> Buffers: shared hit=5048

> after 5 executions of the statement:
> ->  Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
> table1alias2  (cost=0.56..17.23 rows=1 width=36) (actual
> time=125.344..126877.822 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: (table1alias2.col20 = $1001)
> Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,
> $1000])::text[]))
> Rows Removed by Filter: 2670023
> Heap Fetches: 428
> Buffers: shared hit=45933 read=42060 dirtied=4

Yeah, this is a dynamic we've seen before.  The rowcount estimate, and
hence the cost estimate, for the plan with explicit parameter values is
way off; but the estimate for the generic plan is even more way off,
causing the system to falsely decide that the latter is cheaper.

I've speculated about refusing to believe generic cost estimates if they are
more than epsilon less than the concrete cost estimate, but it's not quite
clear how that should work or whether it'd have its own failure modes.

The one thing that is totally clear is that these rowcount estimates are
crappy.  Can you improve them by increasing the stats target for that
table?  Maybe with less-garbage-y inputs, the system would make the right
plan choice here.
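
If raising the statistics target, that could be, for instance (the target
value is only a guess; the default is 100):

ALTER TABLE table1 ALTER COLUMN col20 SET STATISTICS 1000;
ALTER TABLE table1 ALTER COLUMN col8 SET STATISTICS 1000;
ANALYZE table1;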

regards, tom lane




Re: 600 million rows of data. Bad hardware or need partitioning?

2020-05-05 Thread Arya F
On Mon, May 4, 2020 at 12:44 AM David Rowley  wrote:
> How long does it take if you first do:
>
> SET enable_nestloop TO off;

I tried this, but it takes much longer

Time: 318620.319 ms (05:18.620)

Below is the EXPLAIN ANALYZE


  QUERY PLAN

 Update on test_table  (cost=639.96..4581378.80 rows=1 width=160)
(actual time=290593.159..290593.159 rows=0 loops=1)
   ->  Hash Join  (cost=639.96..4581378.80 rows=1 width=160)
(actual time=422.313..194430.318 rows=1 loops=1)
 Hash Cond: (test_table.id = subquery.id)
 ->  Seq Scan on test_table  (cost=0.00..4200967.98
rows=101238898 width=92) (actual time=296.970..177731.611
rows=101189271 loops=1)
 ->  Hash  (cost=514.96..514.96 rows=1 width=56) (actual
time=125.312..125.312 rows=1 loops=1)
   Buckets: 16384  Batches: 1  Memory Usage: 959kB
   ->  Subquery Scan on subquery  (cost=0.00..514.96
rows=1 width=56) (actual time=0.030..123.031 rows=1 loops=1)
 ->  Limit  (cost=0.00..414.96 rows=1
width=16) (actual time=0.024..121.014 rows=1 loops=1)
   ->  Seq Scan on test_table test_table_1
(cost=0.00..4200967.98 rows=101238898 width=16) (actual
time=0.021..120.106 rows=1 loops=1)
 Planning Time: 0.304 ms
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.178 ms, Inlining 155.980 ms, Optimization
100.611 ms, Emission 39.481 ms, Total 298.250 ms
 Execution Time: 290595.448 ms
(15 rows)


> If you find it's faster then you most likely have random_page_cost set
> unrealistically low. In fact, I'd say it's very unlikely that a nested
> loop join will be a win in this case when random pages must be read
> from a mechanical disk, but by all means, try disabling it with the
> above command and see for yourself.

It's much slower with SET enable_nestloop TO off. Any other suggestions?




Re: 600 million rows of data. Bad hardware or need partitioning?

2020-05-05 Thread Arya F
On Mon, May 4, 2020 at 5:21 AM Justin Pryzby  wrote:

> I mentioned in February and March that you should plan to set shared_buffers
> to fit the indexes currently being updated.
>

The following command gives me

select pg_size_pretty (pg_indexes_size('test_table'));
 pg_size_pretty
----------------
 5216 MB
(1 row)


So right now, the indexes on that table are taking about 5.2 GB, if a
machine has 512 GB of RAM and SSDs, is it safe to assume I can achieve
the same update that takes 1.5 minutes in less than 5 seconds while
having 600 million rows of data without partitioning?




Re: 600 million rows of data. Bad hardware or need partitioning?

2020-05-05 Thread Justin Pryzby
On Tue, May 05, 2020 at 08:31:29PM -0400, Arya F wrote:
> On Mon, May 4, 2020 at 5:21 AM Justin Pryzby  wrote:
> 
> > I mentioned in February and March that you should plan to set shared_buffers
> > to fit the indexes currently being updated.
> 
> The following command gives me
> 
> select pg_size_pretty (pg_indexes_size('test_table'));
>  pg_size_pretty
> ----------------
>  5216 MB
> 
> So right now, the indexes on that table are taking about 5.2 GB, if a
> machine has 512 GB of RAM and SSDs, is it safe to assume I can achieve
> the same update that takes 1.5 minutes in less than 5 seconds while
> having 600 million rows of data without partitioning?

I am not prepared to guarantee server performance..

But, to my knowledge, you haven't configured shared_buffers at all.  Which I
think might be the single most important thing to configure for loading speed
(with indexes).

Couple months ago, you said your server had 4GB RAM, which isn't much, but if
shared_buffers is ~100MB, I think that deserves attention.

If you get good performance with a million rows and 32MB buffers, then you
could reasonably hope to get good performance (at least initially) with
100million rows and 320MB buffers.  Scale that up to whatever you expect your
index size to be.  Be conservative since you may need to add indexes later, and
you can expect they'll become bloated, so you may want to run a reindex job.

shared_buffers is frequently set to ~25% of RAM, and if you need to efficiently
use indexes larger than what that supports, then you should add RAM, or
implement partitioning.
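
As an illustration only, with the 5.2 GB of indexes mentioned above that
might look like (the number is a placeholder, not a recommendation):

# postgresql.conf
shared_buffers = 8GB    # holds the ~5.2 GB of indexes being updated, with headroom for growth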

-- 
Justin