How to make runtime partition pruning work?

2019-10-11 Thread Markus Heiden
I partitioned a table "data_table" by the key "import_id" to reduce the 
number of partitions to be loaded in my queries.
I used list partitions, each containing usually just one "import_id". I 
used a primary key (id, import_id)
But PostgreSQL does not consider partition keys to avoid loading not 
needed partitions.


My query:
SELECT SUM(something) FROM data_table WHERE import_id IN (SELECT id FROM 
import_table WHERE ...)

My problem:
The query takes too long, because PostgreSQL uses a hash join over all 
partitions of "data_table" with the "import_table", instead of pruning 
the "data_table" partitions by the import_ids at runtime.

Static pruning (when using ... IN (1, 2, 3, 4)) works fine though.

What am I doing wrong that runtime partition pruning with PostgreSQL 
11.5 does not work in my case?


Thanks,
Markus





Re: How to make runtime partition pruning work?

2019-10-15 Thread Markus Heiden




Notice that only subqueries and parameterized nested loop joins are
mentioned.  The above text does not really go into the detail of which
types of subqueries can be used, but I can confirm that they must be
subqueries that can only return a scalar value. e.g WHERE x = (SELECT
y FROM ...).  The executor would raise an error if that subquery
returned more than 1 row. The IN() clause you have is not eligible.
This will be converted into a semi-join during planning, and even if
it wasn't, the executor wouldn't be raising an error if it returned
multiple rows.

Unfortunately, to do what you mention with a hash join, we'd need to
scan through the entire hash table and incrementally build the set of
partitions which could match each value in the table.  I'm sure there
are times where that would be well worth the trouble, but I imagine in
the average case we might find that the cost of scanning the entire
table this way would be more than just executing the query without any
partition pruning. I don't see any good way to know in advance if it
would be worthwhile or not.


First thanks for the detailed answer. Now at least I known that I can't 
make it work this way.


What I do not understand is, why for the IN() clause a hash join will be 
used in this case (instead of a semi join too)?


In this case (x is the partition key) using a semi join would be optimal 
for IN() clauses too IMHO.
This way only the query providing the y's has to executed first to 
provide the partition keys to filter.
In other cases where x is not the partition key, I agree with your 
argumentation.