Improving Performance of Query ~ Filter by A, Sort by B

2018-07-10 Thread Lincoln Swaine-Moore
Hi all,


I'm having a good bit of trouble making production-ready a query I wouldn't
have thought would be too challenging.

Below are the relevant portions of my table, which is partitioned on values
of part_key from 1-5. The number of rows is on the order of 10^7, and the
number of unique "parent_id" values is on the order of 10^4. "tmstmp" is
continuous (more or less). Neither parent_id nor tmstmp is nullable. The
table receives a fair number of INSERTs and also some UPDATEs.

db=> \d a
                                   Table "public.a"
  Column   |           Type           | Collation | Nullable |            Default
-----------+--------------------------+-----------+----------+-------------------------------
 id        | integer                  |           |          | nextval('a_id_seq'::regclass)
 parent_id | integer                  |           | not null |
 tmstmp    | timestamp with time zone |           | not null |
 part_key  | integer                  |           |          |
Partition key: LIST (part_key)
Number of partitions: 5 (Use \d+ to list them.)

db=> \d a_partition1
                             Table "public.a_partition1"
  Column   |           Type           | Collation | Nullable |            Default
-----------+--------------------------+-----------+----------+-------------------------------
 id        | integer                  |           |          | nextval('a_id_seq'::regclass)
 parent_id | integer                  |           | not null |
 tmstmp    | timestamp with time zone |           | not null |
 part_key  | integer                  |           |          |
Partition of: a FOR VALUES IN (1)
Indexes:
    "a_pk_idx1" UNIQUE, btree (id)
    "a_tmstmp_idx1" btree (tmstmp)
    "a_parent_id_idx1" btree (parent_id)
Check constraints:
    "a_partition_check_part_key1" CHECK (part_key = 1)
Foreign-key constraints:
    "a_partition_parent_id_fk_b_id1" FOREIGN KEY (parent_id) REFERENCES b(id) DEFERRABLE INITIALLY DEFERRED

db=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
            relhassubclass, reloptions, pg_table_size(oid)
     FROM pg_class WHERE relname like 'a_partition%';
   relname    | relpages |  reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
--------------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
 a_partition1 |   152146 |      873939 |        106863 | r       |       26 | f              |            |  287197233152
 a_partition2 |   669987 | 3.62268e+06 |             0 | r       |       26 | f              |            |  161877745664
 a_partition3 |   562069 | 2.94414e+06 |        213794 | r       |       26 | f              |            |  132375994368
 a_partition4 |   729880 | 3.95513e+06 |         69761 | r       |       26 | f              |            |  188689047552
 a_partition5 |   834132 |  4.9748e+06 |         52622 | r       |       26 | f              |            |  218596630528
(5 rows)



I'm interested in filtering by parent_id (an indexed foreign-key
relationship, though I think that shouldn't matter in this context), then
sorting the result by a timestamp field (also indexed). I only need 20
rows at a time.


The naive query works well for a very small number of ids (up to 3):

EXPLAIN (ANALYZE, BUFFERS)
SELECT "a"."id"
FROM "a"
WHERE "a"."parent_id" IN (
37066,41174,28843
)
ORDER BY "a"."tmstmp" DESC
LIMIT 20;

 Limit  (cost=9838.23..9838.28 rows=20 width=12) (actual time=13.307..13.307 rows=0 loops=1)
   Buffers: shared hit=29 read=16
   ->  Sort  (cost=9838.23..9860.12 rows=8755 width=12) (actual time=13.306..13.306 rows=0 loops=1)
         Sort Key: a_partition1.tmstmp DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=29 read=16
         ->  Append  (cost=0.43..9605.26 rows=8755 width=12) (actual time=13.302..13.302 rows=0 loops=1)
               Buffers: shared hit=29 read=16
               ->  Index Scan using a_parent_id_idx1 on a_partition1  (cost=0.43..4985.45 rows=4455 width=12) (actual time=4.007..4.007 rows=0 loops=1)
                     Index Cond: (parent_id = ANY ('{37066,41174,28843}'::integer[]))
                     Buffers: shared hit=6 read=3
               ->  Index Scan using a_parent_id_idx2 on a_partition2  (cost=0.43..1072.79 rows=956 width=12) (actual time=3.521..3.521 rows=0 loops=1)
                     Index Cond: (parent_id = ANY ('{37066,41174,28843}'::integer[]))
                     Buffers: shared hit=5 read=4
               ->  Index Scan using a_parent_id_idx3 on a_partition3  (cost=0.43..839.30
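
The plan above scans each partition's parent_id index and then re-sorts
the combined result, which is fine only while the IN list stays tiny. A
common remedy for this filter-by-one-column, sort-by-another, LIMIT-N
shape is a composite index per partition; the following is only a sketch
with hypothetical index names, not advice taken from the thread:

-- Hypothetical: with (parent_id, tmstmp DESC), each parent's matching
-- rows can be read in timestamp order, so far fewer rows need to be
-- fetched and sorted before LIMIT 20 is satisfied. Repeat for
-- a_partition2 through a_partition5.
CREATE INDEX CONCURRENTLY a_parent_id_tmstmp_idx1
    ON a_partition1 (parent_id, tmstmp DESC);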

performance statistics monitoring without spamming logs

2018-07-10 Thread Patrick Hemmer
I'm looking for a way of gathering performance stats in a more usable
way than turning on `log_statement_stats` (or other related modules).
The problem I have with the log_*_stats family of modules is that they
log every single query, which makes them unusable in production. Aside
from consuming space, there's also the problem that the log system
wouldn't be able to keep up with the rate.

There are a couple of ideas that come to mind that would make these stats
more usable:
1. Only log the stats when the statement would otherwise already be logged,
such as under the `log_statement` or `log_min_duration_statement` settings
(a configuration sketch follows this list).
2. Make stats available in `pg_stat_statements` (or alternate view that
could be joined on). The block stats are already available here, but
others like CPU usage, page faults, and context switches are not.
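
A configuration sketch of idea 1 (values are illustrative only; today
these two settings are independent, and log_statement_stats fires for
every statement):

-- Illustrative settings: under idea 1, per-statement resource stats
-- would be emitted only for statements that these settings already
-- cause to be logged.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
ALTER SYSTEM SET log_statement_stats = on;
SELECT pg_reload_conf();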

To answer why I want this data: I want to be able to identify queries
which are consuming large amounts of CPU time so that I can either
optimize the query or optimize the application making the query, and
free up CPU resources on the database. The `pg_stat_statements` view
provides the `total_time` metric, but many things can contribute to
query time other than CPU usage, and CPU usage is my primary concern at
the moment.
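
For comparison, a query showing roughly what pg_stat_statements exposes
today (column names as of PostgreSQL 10); a CPU-time analogue of
total_time is exactly what is missing:

-- Rank statements by cumulative execution time. total_time conflates
-- CPU, I/O waits, and lock waits, which is the limitation noted above.
SELECT queryid, calls, total_time,
       shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;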

Do these seem like reasonable requests? And if so, what's the procedure
for getting them implemented?
Any thoughts on whether they would be hard to implement? I'm unfamiliar
with the PostgreSQL code base, but might be willing to attempt an
implementation if it wouldn't be terribly difficult.

-Patrick


Re: performance statistics monitoring without spamming logs

2018-07-10 Thread Justin Pryzby
On Tue, Jul 10, 2018 at 01:54:12PM -0400, Patrick Hemmer wrote:
> I'm looking for a way of gathering performance stats in a more usable
> way than turning on `log_statement_stats` (or other related modules).
> The problem I have with the log_*_stats family of modules is that they
> log every single query, which makes them unusable in production. Aside
> from consuming space, there's also the problem that the log system
> wouldn't be able to keep up with the rate.
> 
> There are a couple ideas that pop into mind that would make these stats
> more usable:
> 1. Only log when the statement would otherwise already be logged. Such
> as due to the `log_statement` or `log_min_duration_statement` settings.

Did you see this? (Added Adrien to Cc.)
https://commitfest.postgresql.org/18/1691/

I don't think the existing patch does what you want, but perhaps all that's
needed is this:

if (save_log_statement_stats)
+   if (log_sample_rate==1 || was_logged)
ShowUsage("EXECUTE MESSAGE STATISTICS");

In any case, I'm thinking that your request could/should be considered by
whatever future patch implements sampling (if not implemented/included in the
patch itself).

If that doesn't do what's needed, that patch might still be a good crash course
in how to start implementing what you need (perhaps on top of that patch).

> 2. Make stats available in `pg_stat_statements` (or alternate view that
> could be joined on). The block stats are already available here, but
> others like CPU usage, page faults, and context switches are not.

pg_stat_statements is ./contrib/pg_stat_statements/pg_stat_statements.c which 
is 3k LOC.

getrusage stuff and log_*_stat stuff is in src/backend/tcop/postgres.c
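
(ShowUsage() there is essentially a getrusage() delta taken around
statement execution; a minimal standalone sketch of that pattern, not
actual PostgreSQL code:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    /* ... run the work being measured ... */
    getrusage(RUSAGE_SELF, &after);

    /* User CPU seconds, major page faults, and voluntary context
     * switches consumed in between: the kind of per-statement numbers
     * log_statement_stats prints. */
    double user_cpu =
        (after.ru_utime.tv_sec - before.ru_utime.tv_sec) +
        (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1e6;

    printf("user CPU: %.6f s, majflt: %ld, nvcsw: %ld\n",
           user_cpu,
           after.ru_majflt - before.ru_majflt,
           after.ru_nvcsw - before.ru_nvcsw);
    return 0;
}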

Justin



Special bloom index of INT, BIGINT, BIT, VARBIT for bitwise operation

2018-07-10 Thread Takao Magoori
Hello,

I have a table with billions of rows, and I want to select from it with
bitwise operations like:

=# CREATE TABLE IF NOT EXISTS t_bitwise (
  id INTEGER NOT NULL
  ,status_bigint BIGINT NOT NULL
  ,status_bit BIT(32) NOT NULL
);

=# INSERT INTO t_bitwise (id, status_bigint, status_bit) SELECT
  id
  ,(random() * 4294967295)::BIGINT
  ,(random() * 4294967295)::BIGINT::BIT(32)
FROM generate_series(1, 30) as t(id);

=# SELECT * FROM t_bitwise WHERE
  status_bigint & 170 = 170
  OR status_bigint & 256 = 256;

=# SELECT * FROM t_bitwise WHERE
  status_bit & b'00000000000000000000000010101010'::BIT(32) =
    b'00000000000000000000000010101010'::BIT(32)
  OR status_bit & b'00000000000000000000000100000000'::BIT(32) =
    b'00000000000000000000000100000000'::BIT(32);

Yes, these SELECT statements scan all rows. I guess the possible index
types are (a sketch of the first two follows the list):

- Expression indexes ?
- Partial Indexes ?
- GIN ?
- GIST ?
- bloom index ?
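
For what it's worth, a minimal sketch of the first two options, assuming
the tested masks are known in advance (index names are hypothetical):

-- An expression index matches the exact predicate it indexes:
CREATE INDEX t_bitwise_mask170_expr_idx
    ON t_bitwise ((status_bigint & 170));

-- A partial index goes further and stores only the qualifying rows:
CREATE INDEX t_bitwise_mask170_part_idx
    ON t_bitwise (id)
    WHERE status_bigint & 170 = 170;

-- Either can serve "WHERE status_bigint & 170 = 170", but a separate
-- index is needed for every mask queried, so this does not generalize.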

I googled, but I feel there is no good solution; it would be good if I had
a "bloom index specific for bitwise operation".

With a normal bloom index, each value is hashed into a few bits which are
set in a signature (80 bits by default). This is a lossy representation of
the original value, and as such is prone to reporting false positives,
which require a "Recheck" pass at SELECT time. The more rows, or the more
complex the condition, the longer the execution time.
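
For reference, stock contrib/bloom is used roughly like this (adapted
from its documentation; table and column names are placeholders, and it
only ships operator classes for int4 and text, so it cannot be pointed
at status_bigint or status_bit directly):

CREATE EXTENSION bloom;

-- An 80-bit signature with 2 bits set per indexed column value.
CREATE INDEX t_bloom_idx ON t USING bloom (i1, i2)
    WITH (length = 80, col1 = 2, col2 = 2);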

My idea is that, in the case of an index for bitwise operations, each bit
should be mapped to exactly the same bit in the signature (a one-to-one
mapping). No false positives, so no "Recheck" pass is required. If the
target column is BIT(32), a signature length of just 32 bits is enough.

Is there any index module like this? Since I am not familiar with C and
PostgreSQL, I cannot write my own module.

Any help would be greatly appreciated.

Thanks,
Takao