Improving Performance of Query ~ Filter by A, Sort by B
Hi all,
I'm having a good bit of trouble making production-ready a query that I
wouldn't have thought would be too challenging.
Below are the relevant portions of my table, which is partitioned on values
of part_key from 1-5. The number of rows is on the order of 10^7, and the
number of unique "parent_id" values is on the order of 10^4. "tmstmp" is
more or less continuous. Neither parent_id nor tmstmp is nullable. The
table receives a fair number of INSERTs and also some UPDATEs.
db=> \d a
                          Table "public.a"
  Column   |           Type           | Collation | Nullable |            Default
-----------+--------------------------+-----------+----------+-------------------------------
 id        | integer                  |           |          | nextval('a_id_seq'::regclass)
 parent_id | integer                  |           | not null |
 tmstmp    | timestamp with time zone |           | not null |
 part_key  | integer                  |           |          |
Partition key: LIST (part_key)
Number of partitions: 5 (Use \d+ to list them.)
db=> \d a_partition1
                     Table "public.a_partition1"
  Column   |           Type           | Collation | Nullable |            Default
-----------+--------------------------+-----------+----------+-------------------------------
 id        | integer                  |           |          | nextval('a_id_seq'::regclass)
 parent_id | integer                  |           | not null |
 tmstmp    | timestamp with time zone |           | not null |
 part_key  | integer                  |           |          |
Partition of: a FOR VALUES IN (1)
Indexes:
    "a_pk_idx1" UNIQUE, btree (id)
    "a_tmstmp_idx1" btree (tmstmp)
    "a_parent_id_idx1" btree (parent_id)
Check constraints:
    "a_partition_check_part_key1" CHECK (part_key = 1)
Foreign-key constraints:
    "a_partition_parent_id_fk_b_id1" FOREIGN KEY (parent_id) REFERENCES b(id) DEFERRABLE INITIALLY DEFERRED
db=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
         relhassubclass, reloptions, pg_table_size(oid)
     FROM pg_class
     WHERE relname LIKE 'a_partition%';
   relname    | relpages |  reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
--------------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
 a_partition1 |   152146 |      873939 |        106863 | r       |       26 | f              |            |  287197233152
 a_partition2 |   669987 | 3.62268e+06 |             0 | r       |       26 | f              |            |  161877745664
 a_partition3 |   562069 | 2.94414e+06 |        213794 | r       |       26 | f              |            |  132375994368
 a_partition4 |   729880 | 3.95513e+06 |         69761 | r       |       26 | f              |            |  188689047552
 a_partition5 |   834132 |  4.9748e+06 |         52622 | r       |       26 | f              |            |  218596630528
(5 rows)
I'm interested in filtering by parent_id (an indexed foreign-key
relationship, though I think that shouldn't matter in this context), then
sorting the result by a timestamp field (also indexed). I only need 20
rows at a time.
The naive query works well for a very small number of ids (up to 3):
EXPLAIN (ANALYZE, BUFFERS)
SELECT "a"."id"
FROM "a"
WHERE "a"."parent_id" IN (
37066,41174,28843
)
ORDER BY "a"."tmstmp" DESC
LIMIT 20;
 Limit  (cost=9838.23..9838.28 rows=20 width=12) (actual time=13.307..13.307 rows=0 loops=1)
   Buffers: shared hit=29 read=16
   ->  Sort  (cost=9838.23..9860.12 rows=8755 width=12) (actual time=13.306..13.306 rows=0 loops=1)
         Sort Key: a_partition1.tmstmp DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=29 read=16
         ->  Append  (cost=0.43..9605.26 rows=8755 width=12) (actual time=13.302..13.302 rows=0 loops=1)
               Buffers: shared hit=29 read=16
               ->  Index Scan using a_parent_id_idx1 on a_partition1  (cost=0.43..4985.45 rows=4455 width=12) (actual time=4.007..4.007 rows=0 loops=1)
                     Index Cond: (parent_id = ANY ('{37066,41174,28843}'::integer[]))
                     Buffers: shared hit=6 read=3
               ->  Index Scan using a_parent_id_idx2 on a_partition2  (cost=0.43..1072.79 rows=956 width=12) (actual time=3.521..3.521 rows=0 loops=1)
                     Index Cond: (parent_id = ANY ('{37066,41174,28843}'::integer[]))
                     Buffers: shared hit=5 read=4
               ->  Index Scan using a_parent_id_idx3 on a_partition3  (cost=0.43..839.30
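For this filter-then-sort-then-LIMIT pattern, a common remedy is a
multicolumn index matching both the filter and the sort. This is only a
sketch: the index name below is made up, one such index would be needed per
partition, and whether it actually beats the plan above depends on the data
distribution.

-- Hypothetical multicolumn index (repeat per partition): rows for a given
-- parent_id are kept in tmstmp order, so each partition can return its
-- newest matching rows without a separate sort and LIMIT 20 can stop early.
CREATE INDEX a_parent_id_tmstmp_idx1
    ON a_partition1 (parent_id, tmstmp DESC);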
performance statistics monitoring without spamming logs
I'm looking for a way of gathering performance stats in a more usable way
than turning on `log_statement_stats` (or other related modules). The
problem I have with the log_*_stats family of modules is that they log
every single query, which makes them unusable in production. Aside from
consuming space, there's also the problem that the log system wouldn't be
able to keep up with the rate.

There are a couple of ideas that come to mind that would make these stats
more usable:

1. Only log when the statement would otherwise already be logged, such as
due to the `log_statement` or `log_min_duration_statement` settings.

2. Make the stats available in `pg_stat_statements` (or an alternate view
that could be joined on). The block stats are already available there, but
others like CPU usage, page faults, and context switches are not.

To answer why I want this data: I want to be able to identify queries
which consume large amounts of CPU time, so that I can either optimize the
query or optimize the application making the query, and free up CPU
resources on the database. The `pg_stat_statements` view provides the
`total_time` metric, but many things other than CPU usage can contribute
to query time, and CPU usage is my primary concern at the moment.

Do these seem like reasonable requests? And if so, what's the procedure
for getting them implemented? Any thoughts on whether they would be hard
to implement? I'm unfamiliar with the PostgreSQL code base, but I might be
willing to attempt an implementation if it wouldn't be terribly difficult.

-Patrick
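As a baseline, the timing side of this is already queryable today. A sketch
against pg_stat_statements (assuming the extension is installed; the exact
column set varies by PostgreSQL version):

-- Rank statements by total execution time. Note total_time conflates CPU
-- with I/O and lock waits, which is exactly the limitation described above;
-- blk_read_time/blk_write_time are only populated with track_io_timing = on.
SELECT query, calls, total_time, blk_read_time, blk_write_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;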
Re: performance statistics monitoring without spamming logs
On Tue, Jul 10, 2018 at 01:54:12PM -0400, Patrick Hemmer wrote:
> I'm looking for a way of gathering performance stats in a more usable
> way than turning on `log_statement_stats` (or other related modules).
> The problem I have with the log_*_stats family of modules is that they
> log every single query, which makes them unusable in production. Aside
> from consuming space, there's also the problem that the log system
> wouldn't be able to keep up with the rate.
>
> There are a couple ideas that pop into mind that would make these stats
> more usable:
> 1. Only log when the statement would otherwise already be logged. Such
> as due to the `log_statement` or `log_min_duration_statement` settings.
Did you see this (added Adrien to Cc)?
https://commitfest.postgresql.org/18/1691/
I don't think the existing patch does what you want, but perhaps all that's
needed is this:
 	if (save_log_statement_stats)
+		if (log_sample_rate == 1 || was_logged)
 		ShowUsage("EXECUTE MESSAGE STATISTICS");
In any case, I'm thinking that your request could/should be considered by
whatever future patch implements sampling (if not implemented/included in the
patch itself).
If that doesn't do what's needed, that patch might still be a good crash course
in how to start implementing what you need (perhaps on top of that patch).
> 2. Make stats available in `pg_stat_statements` (or alternate view that
> could be joined on). The block stats are already available here, but
> others like CPU usage, page faults, and context switches are not.
pg_stat_statements is ./contrib/pg_stat_statements/pg_stat_statements.c which
is 3k LOC.
getrusage stuff and log_*_stat stuff is in src/backend/tcop/postgres.c
Justin
Special bloom index of INT, BIGINT, BIT, VARBIT for bitwise operation
Hello,

I have a table with billions of rows, and I want to select from it with
bitwise operations like the following:

=# CREATE TABLE IF NOT EXISTS t_bitwise (
     id            INTEGER NOT NULL
    ,status_bigint BIGINT  NOT NULL
    ,status_bit    BIT(32) NOT NULL
   );

=# INSERT INTO t_bitwise (id, status_bigint, status_bit)
   SELECT
     id
    ,(random() * 4294967295)::BIGINT
    ,(random() * 4294967295)::BIGINT::BIT(32)
   FROM generate_series(1, 30) AS t(id);

=# SELECT * FROM t_bitwise
   WHERE status_bigint & 170 = 170
      OR status_bigint & 256 = 256;

=# SELECT * FROM t_bitwise
   WHERE status_bit & b'10101010'::BIT(32) = b'10101010'::BIT(32)
      OR status_bit & b'0001'::BIT(32) = b'0001'::BIT(32);

Yes, these SELECT statements scan all rows. I guess the possible index
types are:

- Expression indexes?
- Partial indexes?
- GIN?
- GiST?
- A bloom index?

I googled, but I feel there is no good solution, and it would be good if I
had a "bloom index specific to bitwise operations".

In the case of a normal bloom index, a value is hashed into a few bits
which are mapped onto a signature (80 bits by default). This is a lossy
representation of the original value, and as such is prone to reporting
false positives, which requires a "Recheck" step at SELECT time. The more
rows, or the more complex the condition, the longer the execution time.

My idea is that, in the case of an index for bitwise operations, each bit
should be mapped to exactly the same bit in the signature (a one-to-one
mapping). No false positives, and no "Recheck" step would be required. If
the target column is BIT(32), a signature length of just 32 bits is enough.

Is there any index module like this? Since I am not familiar with C and
PostgreSQL internals, I cannot write my own module. Any help would be
great for me.

Thanks,
Takao
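Until such an index type exists, one of the options listed above (partial
indexes) can serve as a workaround. A sketch only: the index names are
hypothetical, and this only pays off when the tested masks are known in
advance.

-- Hypothetical partial indexes, one per frequently-tested mask. The
-- planner can use them for the exact WHERE clauses they match, avoiding
-- the full scan, but each distinct mask needs its own index.
CREATE INDEX t_bitwise_mask_170_idx ON t_bitwise (id)
    WHERE status_bigint & 170 = 170;
CREATE INDEX t_bitwise_mask_256_idx ON t_bitwise (id)
    WHERE status_bigint & 256 = 256;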
