Performance implications of 8K pread()s

2023-07-11 Thread Dimitrios Apostolou

Hello list,

I have noticed that the performance during a SELECT COUNT(*) command is
much slower than what the device can provide. Parallel workers improve the
situation but for simplicity's sake, I disable parallelism for my
measurements here by setting max_parallel_workers_per_gather to 0.

Strace'ing the PostgreSQL backend process shows that all reads happen as
offset'ed 8 KB pread() calls:

  pread64(172, ..., 8192, 437370880) = 8192

The read rate I see on the device is only 10-20 MB/s. My case is special
though, as this is on a zstd-compressed btrfs filesystem, on a very fast
(1GB/s) direct attached storage system. Given the decompression ratio is around
10x, the above rate corresponds to about 100 to 200 MB/s of data going into the
postgres process.

Can the 8K block size cause slowdown? Here are my observations:

+ Reading a 1GB postgres file using dd (which uses read() internally) in
   8K and 32K chunks:

 # dd if=4156889.4 of=/dev/null bs=8k
 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.18829 s, 174 MB/s

 # dd if=4156889.4 of=/dev/null bs=8k    # 2nd run, data is cached
 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.287623 s, 3.7 GB/s

 # dd if=4156889.8 of=/dev/null bs=32k
 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02688 s, 1.0 GB/s

 # dd if=4156889.8 of=/dev/null bs=32k   # 2nd run, data is cached
 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.264049 s, 4.1 GB/s

   The rates displayed are after decompression (the fs does it
   transparently) and the results have been verified with multiple runs.

   Notice that the read rate with bs=8k is 174 MB/s (I see ~20 MB/s on the
   device): slow, and similar to what PostgreSQL gave us above. With bs=32k
   the rate increases to 1 GB/s (I see ~80 MB/s on the device, but the run
   is too short to register properly).

  The cached reads are fast in both cases.
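
The block-size effect can also be sketched outside dd. The following minimal
Python example (the temp file and the 64 MB size are arbitrary choices, not
from the original post) reads the same data in 8 KB and 32 KB chunks; because
the freshly written file is cached, the gap it shows is mostly per-system-call
overhead rather than device throughput:

```python
# Read a cached test file in 8 KB vs 32 KB chunks and report MB/s.
# The temp file is a stand-in for a PostgreSQL segment file.
import os
import tempfile
import time

def read_all(fd, blocksize):
    """Read the whole file in fixed-size pread() chunks, like a backend."""
    offset = 0
    while True:
        chunk = os.pread(fd, blocksize, offset)
        if not chunk:
            break
        offset += len(chunk)
    return offset

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * (64 * 1024 * 1024))   # 64 MB test file
    for bs in (8192, 32768):
        start = time.monotonic()
        total = read_all(fd, bs)
        elapsed = time.monotonic() - start
        print(f"bs={bs:5d}: {total / elapsed / 1e6:.0f} MB/s")
finally:
    os.close(fd)
    os.unlink(path)
```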

Note that I suspect my particular setup is a factor here (btrfs compression
behaving suboptimally), since the raw device can give me up to 1 GB/s. It is
nevertheless evident that reading in bigger chunks would mitigate such setup
inefficiencies. On a system where reads are already optimal and the read rate
stays the same, a bigger block size would still likely reduce the system time
PostgreSQL consumes, because of the fewer system calls.

So would it make sense for postgres to perform reads in bigger blocks? Is it
easy-ish to implement (where would one look for that)? Or must the I/O unit be
tied to postgres' page size?

Regards,
Dimitris





Entire index scanned, but only when in SQL function?

2023-07-11 Thread Philip Semanchuk
Hi there,
I’m on Postgres 13.11 and I'm seeing a situation where an INSERT...SELECT 
statement seq scans an index, but only when wrapped in a SQL function. When 
invoked directly (via psql) or when called via a PL/pgSQL function, it only 
reads the index tuples it needs, resulting in much better performance. I can 
solve my problem by writing the function in PL/pgSQL, but I'm curious why the 
pure SQL version behaves the way it does.

Here's my table --

\d documents
+---------------+----------+----------------------------------------+
| Column        | Type     | Modifiers                              |
|---------------+----------+----------------------------------------|
| document_id   | integer  | not null generated always as identity  |
| product_id    | integer  | not null                               |
| units_sold    | integer  | not null                               |
| sale_date     | date     | not null                               |
| ... some other columns ...                                        |
+---------------+----------+----------------------------------------+

CREATE INDEX idx_philip_tmp on documents (document_id, product_id);

Here's the SQL function which will use that index --

CREATE OR REPLACE FUNCTION fn_create_tasks(product_ids int[])
RETURNS void
AS $$
-- Create processing tasks for documents related to these products
INSERT INTO
processing_queue (document_id)
SELECT
DISTINCT document_id
FROM
documents
JOIN unnest(product_ids::int[]) AS product_id USING (product_id)
;

$$ LANGUAGE sql VOLATILE PARALLEL SAFE;

96498 is a product_id that has one associated document_id. When I copy/paste 
this statement into psql, it executes quickly, and 
pg_stat_user_indexes.idx_tup_read reports 2 tuples read for the index.

INSERT INTO
processing_queue (document_id)
SELECT
DISTINCT document_id
FROM
documents
JOIN unnest(ARRAY[96498]::int[]) AS product_id USING (product_id)
;

When I copy/paste this into psql, I expect it to perform just as quickly but it 
does not. pg_stat_user_indexes.idx_tup_read reports 64313783 tuples read (which 
is the entire index).

SELECT fn_create_tasks(ARRAY[96498]::int[])

If I rewrite fn_create_tasks() in PL/pgSQL, it behaves as I expect (executes 
quickly, pg_stat_user_indexes.idx_tup_read = 2).

SELECT fn_create_tasks_plpgsql(ARRAY[96498]::int[])
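
(Philip doesn't include the PL/pgSQL version, but a sketch along these lines,
reconstructed from the SQL function above with the same statement in the body,
illustrates what is meant:

```sql
CREATE OR REPLACE FUNCTION fn_create_tasks_plpgsql(product_ids int[])
RETURNS void
AS $$
BEGIN
    -- Same statement as in the SQL function body.
    INSERT INTO
        processing_queue (document_id)
    SELECT
        DISTINCT document_id
    FROM
        documents
        JOIN unnest(product_ids::int[]) AS product_id USING (product_id);
END;
$$ LANGUAGE plpgsql VOLATILE PARALLEL SAFE;
```
)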

My rule of thumb is that SQL functions always perform as well as or better than 
a PL/pgSQL equivalent, but this is a case where that's not true. If anyone can 
give me some clues as to what's happening here, I'd appreciate it.

Thanks
Philip



Re: Performance implications of 8K pread()s

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou  wrote:
> Note that I suspect my setup being related, (btrfs compression behaving
> suboptimally) since the raw device can give me up to 1GB/s rate. It is however
> evident that reading in bigger chunks would mitigate such setup 
> inefficiencies.
> On a system that reads are already optimal and the read rate remains the same,
> then bigger block size would probably reduce the sys time postgresql consumes
> because of the fewer system calls.

I don't know about btrfs but maybe it can be tuned to prefetch
sequential reads better...

> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

It is hard to implement.  But people are working on it.  One of the
problems is that the 8KB blocks that we want to read data into aren't
necessarily contiguous so you can't just do bigger pread() calls
without solving a lot more problems first.  The project at
https://wiki.postgresql.org/wiki/AIO aims to deal with the
"clustering" you seek plus the "gathering" required for non-contiguous
buffers by allowing multiple block-sized reads to be prepared and
collected on a pending list up to some size that triggers merging and
submission to the operating system at a sensible rate, so we can build
something like a single large preadv() call.  In the current
prototype, if io_method=worker then that becomes a literal preadv()
call running in a background "io worker" process, but it could also be
OS-specific stuff (io_uring, ...) that starts an asynchronous IO
depending on settings.  If you take that branch and run your test you
should see 128KB-sized preadv() calls.
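
For readers unfamiliar with scatter/gather I/O, here is a minimal Python
sketch of the syscall shape Thomas describes (Linux-only, since os.preadv
wraps the preadv() syscall; the temp file and the sixteen buffers are
illustrative stand-ins for a relation segment and non-contiguous shared
buffers, not code from the AIO branch):

```python
# One preadv() call fills sixteen separately allocated 8 KB buffers
# from contiguous file offsets, replacing sixteen pread() calls.
import os
import tempfile

BLCKSZ = 8192      # PostgreSQL's default page size
NBLOCKS = 16       # 16 x 8 KB = one 128 KB I/O

# Stand-in for a relation segment file.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (BLCKSZ * NBLOCKS))

# Non-contiguous buffers, like scattered shared_buffers pages.
buffers = [bytearray(BLCKSZ) for _ in range(NBLOCKS)]

# A single system call fills all sixteen blocks (a "scatter" read).
n = os.preadv(fd, buffers, 0)
print(f"filled {NBLOCKS} buffers with one preadv(): {n} bytes")

os.close(fd)
os.unlink(path)
```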




Re: Performance implications of 8K pread()s

2023-07-11 Thread Thomas Munro
On Wed, Jul 12, 2023 at 5:12 AM Thomas Munro  wrote:
> "gathering"

(Oops, for reads, that's "scattering".  As in scatter/gather I/O but I
picked the wrong one...).