Re: Performance problems with Postgres JDBC 42.4.2
On Mon, 6 Nov 2023 at 09:59, Jose Osinde wrote:
>
> Dear all,
>
> I'm running a query from Java on a postgres database:
>
> Java version: 17
> JDBC version: 42.4.2
> Postgres version: 13.1
>
> In parallel I'm testing the same queries from pgAdmin 4 version 6.13
>
> The tables I'm using contain more than 10 million rows each, and I have two
> questions here:
>
> 1. I need to extract the path of a file without the file name itself. For
> this I use two alternatives, as I found that statement "A" is much faster
> than statement "B":
>
> "A" sentence:
>
> SELECT DISTINCT ( LEFT(opf.file_path, length(opf.file_path) - position('/'
> in reverse(opf.file_path))) ) AS path
>FROM product AS op JOIN product_file AS opf ON
> opf.product_id = op.id
>WHERE op.proprietary_end_date <= CURRENT_DATE
> AND op.id LIKE 'urn:esa:psa:%'
>
> "B" sentence:
>
> SELECT DISTINCT ( regexp_replace(opf.file_path, '(.*)\/(.*)$', '\1') ) AS
> path
>FROM product AS op JOIN product_file AS opf ON
> opf.product_id = op.id
>WHERE op.proprietary_end_date <= CURRENT_DATE
> AND op.id LIKE 'urn:esa:psa:%'
>
> 2. Running statement "A" from the pgAdmin client takes 4-5 minutes to
> finish, but running it from a Java program never ends. This is still the
> case when I limit the output to the first 100 rows, so I assume this is
> not a problem with the amount of data being transferred but with the way
> postgres resolves the query. To make it work in Java I had to define a
> postgres function that I call from the Java code instead of running the
> query directly.
>
> I had a similar problem in the past with a query that performed very
> poorly from a Java client while it was fine from pgAdmin or a Python
> script. In that case it was a matter of a column type (citext) not being
> compatible with JDBC, resulting in an implicit cast that either prevented
> the postgres engine from using a given index or forced it to cast all the
> values of that column before using them; I'm not sure now. But I don't
> think that is the case here.
>
> Could anyone help me again?
>
Can you share your Java code?
If you are using a PreparedStatement, the driver will use the extended
protocol, which may be slower. Statements use the simple query protocol,
which is faster and more like what pgAdmin does. See "Issuing a Query and
Processing the Result" in the pgJDBC documentation (postgresql.org).
Dave
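
One way to test the extended-protocol theory without any Java involved is
to force a generic plan for a prepared statement from psql or pgAdmin. A
minimal sketch, reusing the tables from the original post and assuming the
slowdown comes from planning the LIKE pattern as a bound parameter instead
of a literal:

    -- With a bound parameter the planner cannot use the 'urn:esa:psa:'
    -- prefix of the LIKE pattern; forcing a generic plan mimics the worst
    -- case a reused PreparedStatement can hit.
    SET plan_cache_mode = force_generic_plan;  -- PostgreSQL 12+

    PREPARE path_query(text) AS
    SELECT DISTINCT left(opf.file_path,
        length(opf.file_path) - position('/' in reverse(opf.file_path))) AS path
    FROM product AS op
    JOIN product_file AS opf ON opf.product_id = op.id
    WHERE op.proprietary_end_date <= CURRENT_DATE
      AND op.id LIKE $1;

    -- Compare this plan with the one pgAdmin produces for the literal query.
    EXPLAIN (ANALYZE, BUFFERS) EXECUTE path_query('urn:esa:psa:%');

    RESET plan_cache_mode;
    DEALLOCATE path_query;

If the generic plan turns out to be the slow one, the documented pgJDBC
connection parameters prepareThreshold=0 or preferQueryMode=simple are
worth trying before resorting to a server-side function.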
Awkward Join between generate_series and long table
Hi all--
I'm having a performance problem in 12.16 that I'm hoping someone can help
with.
I have a table shaped roughly like this:
Table "public.data"
Column |Type | Collation | Nullable |
Default
-+-+---+--+
id | integer | | not null |
nextval('data_seq'::regclass)
timestamp | timestamp without time zone | | not null |
sn | character varying(36) | | not null |
Indexes:
"data_pkey" PRIMARY KEY, btree (id)
"data_multicol_sn_and_timestamp_desc_idx" btree (sn, "timestamp" DESC)
with something in the 10M to 100M range in terms of number of rows.
I have a query more or less like:
WITH periods AS (
    SELECT
        *
    FROM
        (
            SELECT
                s AT TIME ZONE 'utc' AS period_start,
                LEAD(s) OVER (
                    ORDER BY
                        s
                ) AT TIME ZONE 'utc' AS period_end
            FROM
                generate_series(
                    ('2023-01-01T00:00:00-08:00')::timestamptz,
                    ('2023-11-01T00:00:00-07:00')::timestamptz,
                    ('1 day')::interval
                ) s
        ) with_junk_period
    WHERE
        period_end IS NOT NULL
)
SELECT
    p.period_start,
    p.period_end,
    COUNT(DISTINCT d.id)
FROM
    periods p
    LEFT JOIN data d
        ON d."timestamp" >= p.period_start
        AND d."timestamp" < p.period_end
        AND d.sn = 'BLAH'
GROUP BY
    p.period_start,
    p.period_end
ORDER BY
    p.period_start ASC;
This worked fine on a smaller, but same-shaped, set of data on a staging
database, where the query plan was:
GroupAggregate  (cost=14843021.48..15549022.41 rows=200 width=24) (actual time=1311.052..1515.344 rows=303 loops=1)
  Group Key: with_junk_period.period_start, with_junk_period.period_end
  ->  Sort  (cost=14843021.48..15019521.21 rows=70599893 width=20) (actual time=1305.969..1384.676 rows=463375 loops=1)
        Sort Key: with_junk_period.period_start, with_junk_period.period_end
        Sort Method: external merge  Disk: 13632kB
        ->  Nested Loop Left Join  (cost=60.26..2329833.01 rows=70599893 width=20) (actual time=355.379..917.049 rows=463375 loops=1)
              ->  Subquery Scan on with_junk_period  (cost=59.83..92.33 rows=995 width=16) (actual time=355.307..358.978 rows=303 loops=1)
                    Filter: (with_junk_period.period_end IS NOT NULL)
                    Rows Removed by Filter: 1
                    ->  WindowAgg  (cost=59.83..82.33 rows=1000 width=24) (actual time=355.302..358.723 rows=304 loops=1)
                          ->  Sort  (cost=59.83..62.33 rows=1000 width=8) (actual time=355.265..355.510 rows=304 loops=1)
                                Sort Key: s.s
                                Sort Method: quicksort  Memory: 39kB
                                ->  Function Scan on generate_series s  (cost=0.00..10.00 rows=1000 width=8) (actual time=355.175..355.215 rows=304 loops=1)
              ->  Index Scan using data_multicol_sn_and_timestamp_desc_idx on data d  (cost=0.43..1631.90 rows=70955 width=12) (actual time=0.042..1.516 rows=1529 loops=303)
                    Index Cond: (((sn)::text = 'BLAH'::text) AND ("timestamp" >= with_junk_period.period_start) AND ("timestamp" < with_junk_period.period_end))
Planning Time: 0.283 ms
JIT:
  Functions: 20
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.570 ms, Inlining 29.051 ms, Optimization 192.084 ms, Emission 134.022 ms, Total 356.727 ms
Execution Time: 1523.983 ms
But on the production database, the query is no longer runnable for the
same time frame, and when we restrict it, we can see that the index is no
longer being fully utilized:
GroupAggregate  (cost=56901238.84..58446602.98 rows=200 width=24) (actual time=3652.892..3653.669 rows=4 loops=1)
  Group Key: with_junk_period.period_start, with_junk_period.period_end
  ->  Sort  (cost=56901238.84..57287579.38 rows=154536214 width=20) (actual time=3652.544..3652.765 rows=5740 loops=1)
        Sort Key: with_junk_period.period_start, with_junk_period.period_end
        Sort Method: quicksort  Memory: 641kB
        ->  Nested Loop Left Join  (cost=60.40..32259766.02 rows=154536214 width=20) (actual time=172.908..3651.658 rows=5740 loops=1)
              Join Filter: ((d."timestamp" >= with_junk_period.period_start) AND (d."timestamp" < with_junk_period.period_end))
              Rows Removed by Join Filter: 5310656
              ->  Subquery Scan on with_junk_period  (cost=59.83..92.33 rows=995 width=16) (actual time=152.963..153.014 rows=4 loops=1)
                    Filter: (with_junk_period.period_end IS NOT NULL)
                    Rows Removed by Filter
Re: Awkward Join between generate_series and long table
On Wed, Nov 8, 2023 at 6:26 PM Lincoln Swaine-Moore wrote:
> SELECT
>     s at time zone 'utc' AS period_start,
>     LEAD(s) OVER (
>         ORDER BY
>             s
>     ) at time zone 'utc' AS period_end

Maybe doesn't help overall, but this can be equivalently written as:

    s + '1 day'::interval AS period_end

Resorting to a window function here is expensive waste; the lead() value
can be computed, not queried.

> SELECT
>     p.period_start,
>     p.period_end,
>     COUNT (distinct d.id)
> FROM
>     periods p
>     LEFT JOIN data d
>         ON d.timestamp >= (p.period_start)
>         AND d."timestamp" < (p.period_end)
>         AND d.sn = 'BLAH'

This seems better written (semantically, not sure about execution
dynamics) as:

    FROM periods AS p
    LEFT JOIN LATERAL (SELECT count(distinct? d.id)
                       FROM data AS d
                       WHERE d.timestamp >= p.period_start
                         AND d.timestamp < p.period_end
                         AND d.sn = 'BLAH') AS cnt_d
    -- NO grouping required at this query level

David J.
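
Assembled into a complete statement, that suggestion might look like the
sketch below (untested; the window-function period generation is kept from
the original post, the output alias cnt_d.n is assumed, and the ON true
that a LEFT JOIN LATERAL requires is added):

    WITH periods AS (
        SELECT
            s AT TIME ZONE 'utc' AS period_start,
            LEAD(s) OVER (ORDER BY s) AT TIME ZONE 'utc' AS period_end
        FROM
            generate_series(
                '2023-01-01T00:00:00-08:00'::timestamptz,
                '2023-11-01T00:00:00-07:00'::timestamptz,
                '1 day'::interval
            ) s
    )
    SELECT
        p.period_start,
        p.period_end,
        cnt_d.n
    FROM
        periods p
        LEFT JOIN LATERAL (
            -- Aggregating inside the lateral subquery ties the count to
            -- the current period row, so no outer GROUP BY is needed.
            SELECT count(DISTINCT d.id) AS n
            FROM data d
            WHERE d."timestamp" >= p.period_start
              AND d."timestamp" < p.period_end
              AND d.sn = 'BLAH'
        ) cnt_d ON true
    WHERE
        p.period_end IS NOT NULL
    ORDER BY
        p.period_start ASC;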
Re: Awkward Join between generate_series and long table
> Maybe doesn't help overall, but this can be equivalently written as:
>     s + '1 day'::interval AS period_end

Ah, so I've glossed over a detail here, which is that I'm relying on some
timezone-specific behavior, and not actually on generate_series itself. If
you're curious, the details are here:

https://www.postgresql.org/message-id/2582288.1696428710%40sss.pgh.pa.us

I think that makes the window function necessary, or at least something a
little more sophisticated than the addition of a day (though I'd be happy
to be wrong about that).

> LEFT JOIN LATERAL (SELECT

Oh wow, this seems to get the index used! That's wonderful news--thank you.
I'd be super curious if anyone has any intuition about why the planner is
so much more successful there--most of what I see online about LATERAL
JOINs is focused, as you said, on semantics, not performance. But in terms
of solving my problem, this seems to do the trick. Thanks again!

--
Lincoln Swaine-Moore
