PGSQL Phriday #001 - Two truths and a lie about PostgreSQL

2022-10-04 Thread Ryan Booz
Hello everyone!

Many months ago, at PGConf NYC 2021 and on the pgsql-advocacy mailing list, I
talked about starting a monthly blogging event for the PostgreSQL
community. Two weeks ago I wrote a blog post explaining how the monthly
event will work and introducing the community members who will host it for
the first six-plus months. (https://www.softwareandbooz.com/introducing-psql-phriday/)

I'm pleased to share here that the first invite has been posted on my blog
and I'd be thrilled to have anyone contribute a blog post to the initiative
this Friday, October 7. The "rules" for contributing a post are outlined in
the invite, but please feel free to reach out to me if you have any
questions.

https://www.softwareandbooz.com/pgsql-phriday-001-invite/

Regards,
Ryan Booz


Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (it could be interpreted
as trying to "sell" something), but it feels like an opportunity to talk about
DISTINCT queries. Because you already have that index, this is relevant:
TimescaleDB 2.3 added a "Skip Scan" query planner node that works on regular
BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
data at all). In this case, your distinct query would likely run in a few
milliseconds given the counts you mention (170 stations, 3 channels per
station), and then the outer aggregation would do the GROUP BY. So, you
**could** add the TimescaleDB extension to your database (or a copy of it) and
give it a try. You don't actually need to use any other TimescaleDB features.
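
For illustration, a minimal sketch of what I mean - assuming timescaledb is
already available in shared_preload_libraries, and using the table and column
names you described, so treat it as a starting point rather than an exact
recipe:

-- Enable the extension in this database (or in a copy of it)
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Check whether the planner now uses the SkipScan node for a
-- single-column DISTINCT over the leading column of your index
EXPLAIN (ANALYZE)
SELECT DISTINCT ON (station) station
FROM data;

-- If it kicks in, you'll see a "Custom Scan (SkipScan)" node in the plan.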


   - A writeup of why this kind of DISTINCT query is slow in PostgreSQL
   (for now) and what we did to overcome it:
   
https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
   - Plans for a similar feature in PostgreSQL proper that we'd totally
   support but hasn't made forward progress yet:
   https://commitfest.postgresql.org/19/1741/

Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
>
>> In the future, please share the plan returned by explain analyze, and
>> some data about how many rows in the involved tables,
>>
>
> I believe we consider it acceptable to link to an explain viewer, which is
> what the OP did.  Reading explain output in email has its own challenges,
> and I'd rather have the website than a text attachment.
>
>
>> How does the below work? It should do a very simple index scan only, then
>> aggregate the relative few rows after the fact.
>>
>> select station, array_agg(distinct(channel)) as channels
>> FROM(
>> SELECT station,channel FROM data GROUP BY station,channel
>> ) AS sub
>> group by station;
>>
>
> Yeah, am pondering this too, though seems like the queries should be
> identical so the plan/execution should be the same either way.
>
>
>> If there is correlation between station & channel, then you might look at
>> creating a multivariate statistics object and analyzing the table so the
>> planner can make better choices
>>
>
> There is no where clause so I'm doubtful there is much to be gained going
> down this path.  The Index-Only scan seems like an optimal way to obtain
> this data and the existing query already does that.  The aggregation path
> might vary though it seems like that shouldn't be the case here.
>
> David J.
>


Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Cool. I'd be interested to see the explain on it if you ever try it again.
On that cardinality, I'd expect it to be really fast, so I'm interested to
see if the (SkipScan) nodes were actually used.

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster 
wrote:

>
> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>
> [Timescale Dev Advocate here]
> I realize this might not be the most accepted answer (could be interpreted
> as trying to "sell" something), but feels like an opportunity to talk about
> DISTINCT queries and opportunities. Because you have that index,
> Timescale 2.3 added a "Skip Scan" query planner node that works on regular
> BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
> data at all). In this case, your distinct query would likely run in a few
> milliseconds based on the counts you mention (170 stations, 3 channels per
> station), and then the outer aggregation would do the GROUP BY. So, you
> **could** add the TimescaleDB extension to your database (or a copy of) and
> give it a try. You don't actually need to use any TimescaleDB features
> otherwise.
>
>
> I had actually already done that, as I was considering, in spite of past
> negative experiences with timescaledb, experimenting with it on this DB to
> see if it worked any better with this data. Out of curiosity, I tried
> removing the timescaledb extension, whereupon the query in question took
> roughly twice as long. So you are right that installing timescaledb speeds
> things up, even when not using any timescaledb specific functions. So that
> was a good call. Thanks!
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>
>
>- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
>(for now) and what we did to overcome it:
>
> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>- Plans for a similar feature in PostgreSQL proper that we'd totally
>support but hasn't made forward progress yet:
>https://commitfest.postgresql.org/19/1741/
>
> Anyway, it might be worth a shot. HTH
>
> Ryan B
>
> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
>>
>>> In the future, please share the plan returned by explain analyze, and
>>> some data about how many rows in the involved tables,
>>>
>>
>> I believe we consider it acceptable to link to an explain viewer, which
>> is what the OP did.  Reading explain output in email has its own
>> challenges, and I'd rather have the website than a text attachment.
>>
>>
>>> How does the below work? It should do a very simple index scan only,
>>> then aggregate the relative few rows after the fact.
>>>
>>> select station, array_agg(distinct(channel)) as channels
>>> FROM(
>>> SELECT station,channel FROM data GROUP BY station,channel
>>> ) AS sub
>>> group by station;
>>>
>>
>> Yeah, am pondering this too, though seems like the queries should be
>> identical so the plan/execution should be the same either way.
>>
>>
>>> If there is correlation between station & channel, then you might look
>>> at creating a multivariate statistics object and analyzing the table so the
>>> planner can make better choices
>>>
>>
>> There is no where clause so I'm doubtful there is much to be gained going
>> down this path.  The Index-Only scan seems like an optimal way to obtain
>> this data and the existing query already does that.  The aggregation path
>> might vary though it seems like that shouldn't be the case here.
>>
>> David J.
>>
>
>


Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Ah, I didn't realize that. If SkipScan had been chosen, you'd actually see it as
one of the execution nodes. I also realize I was making a few assumptions
about your data: are channels shared among stations, or are all channels
unique (like an ID) per station? That would affect the index and the approach.

Something like:

station | channel
--------|--------
1       | 1
1       | 2
2       | 3
2       | 4

or:
station | channel
--------|--------
1       | 1
1       | 2
2       | 1
2       | 2




On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster 
wrote:

> On Sep 22, 2021, at 1:50 PM, Ryan Booz  wrote:
>
> Cool. I'd be interested to see the explain on it if you ever try it again.
> On that cardinality, I'd expect it to be really fast, so I'm interested to
> see if the (SkipScan) nodes were actually used.
>
>
> With timescaledb extension installed, the explain is what I posted in the
> original message (https://explain.depesz.com/s/mtxB#html). Without
> timescaledb installed, the explain looks the same, except it takes twice as
> long to run.
>
> Unless I missed something in your message, i.e. some sort of tweak to the
> query to get it to use the timescaledb features?
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>
> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster 
> wrote:
>
>>
>> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>>
>> [Timescale Dev Advocate here]
>> I realize this might not be the most accepted answer (could be
>> interpreted as trying to "sell" something), but feels like an opportunity
>> to talk about DISTINCT queries and opportunities. Because you have that
>> index, Timescale 2.3 added a "Skip Scan" query planner node that works on
>> regular BTree indexes (it doesn't have to be time-series/TimescaleDB
>> Hypertable data at all). In this case, your distinct query would likely run
>> in a few milliseconds based on the counts you mention (170 stations, 3
>> channels per station), and then the outer aggregation would do the GROUP
>> BY. So, you **could** add the TimescaleDB extension to your database (or a
>> copy of) and give it a try. You don't actually need to use any TimescaleDB
>> features otherwise.
>>
>>
>> I had actually already done that, as I was considering, in spite of past
>> negative experiences with timescaledb, experimenting with it on this DB to
>> see if it worked any better with this data. Out of curiosity, I tried
>> removing the timescaledb extension, whereupon the query in question took
>> roughly twice as long. So you are right that installing timescaledb speeds
>> things up, even when not using any timescaledb specific functions. So that
>> was a good call. Thanks!
>>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>
>>
>>
>>- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
>>(for now) and what we did to overcome it:
>>
>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>- Plans for a similar feature in PostgreSQL proper that we'd totally
>>support but hasn't made forward progress yet:
>>https://commitfest.postgresql.org/19/1741/
>>
>> Anyway, it might be worth a shot. HTH
>>
>> Ryan B
>>
>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
>> david.g.johns...@gmail.com> wrote:
>>
>>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis 
>>> wrote:
>>>
>>>> In the future, please share the plan returned by explain analyze, and
>>>> some data about how many rows in the involved tables,
>>>>
>>>
>>> I believe we consider it acceptable to link to an explain viewer, which
>>> is what the OP did.  Reading explain output in email has its own
>>> challenges, and I'd rather have the website than a text attachment.
>>>
>>>
>>>> How does the below work? It should do a very simple index scan only,
>>>> then aggregate the relative few rows after the fact.
>>>>
>>>> select station, array_agg(distinct(channel)) as channels
>>>> FROM(
>>>> SELECT station,channel FROM data GROUP BY station,channel
>>>> ) AS s

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Sorry - break for dinner! So much happens on a global scale in a few
hours.  :-)!

I took a few minutes and created a simple example here of what I imagine
you have on that table. I only inserted ~80 million rows of test data, but
hopefully, it's somewhat representative.

TimescaleDB's current implementation of SkipScan only allows DISTINCT on
one column, and because of where we can place the planner hook that reads the
query cost, a LATERAL JOIN (or similar) can't be used to get both columns the
way you want. So, to outsmart the planner, you can wrap one of the DISTINCT
queries in a function. I realize this is getting a bit specific, so it might
not be an exact fit for you, but this query comes back on my 78 million rows
in 67ms. YMMV

Step 1: Create a function that returns the array of channels per station

CREATE FUNCTION channel_array(TEXT) RETURNS TEXT[]
AS $$
    SELECT array_agg(channel) FROM (
        SELECT DISTINCT ON (channel) channel FROM stations WHERE station = $1
    ) a
$$
LANGUAGE SQL;

Step 2: Use a CTE for the distinct stations, querying the function for each
station

WITH s1 AS (
    SELECT DISTINCT ON (station) station FROM stations
)
SELECT station, channel_array(station) AS channel
FROM s1;

If it's using the index, you should see something like:

Subquery Scan on s1  (cost=0.57..16.22 rows=19 width=34) (actual time=0.580..4.809 rows=19 loops=1)
  ->  Unique  (cost=0.57..11.28 rows=19 width=2) (actual time=0.043..0.654 rows=19 loops=1)
        ->  Custom Scan (SkipScan) on stations  (cost=0.57..11.23 rows=19 width=2) (actual time=0.042..0.647 rows=19 loops=1)
              ->  Index Only Scan using idx_station_channel on stations  (cost=0.57..1807691.34 rows=7632 width=2) (actual time=0.040..0.641 rows=19 loops=1)
                    Index Cond: (station > NULL::text)
                    Heap Fetches: 19


HTH,
Ryan


On Wed, Sep 22, 2021 at 6:22 PM Israel Brewster 
wrote:

> On Sep 22, 2021, at 2:05 PM, Ryan Booz  wrote:
>
> Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it
> as one of the execution nodes. I also realize I was making a few
> assumptions about your data, are channels shared among stations, or are all
> channels unique (like an ID) per station? That would impact the index and
> approach.
>
>
> Ok, that may be a good point: “channel” is currently a varchar column,
> containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful
> of possible channels that I am currently aware of, which are shared among
> stations - most stations have a ‘BHZ’ channel, for example. That would be
> fairly simple to normalize out if that would help.
>
>
> Something like:
>
> station | channel
> --------|--------
> 1       | 1
> 1       | 2
> 2       | 3
> 2       | 4
>
> or:
> station | channel
> --------|--------
> 1       | 1
> 1       | 2
> 2       | 1
> 2       | 2
>
>
>
>
> On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster 
> wrote:
>
>> On Sep 22, 2021, at 1:50 PM, Ryan Booz  wrote:
>>
>> Cool. I'd be interested to see the explain on it if you ever try it
>> again. On that cardinality, I'd expect it to be really fast, so I'm
>> interested to see if the (SkipScan) nodes were actually used.
>>
>>
>> With timescaledb extension installed, the explain is what I posted in the
>> original message (https://explain.depesz.com/s/mtxB#html). Without
>> timescaledb installed, the explain looks the same, except it takes twice as
>> long to run.
>>
>> Unless I missed something in your message, i.e. some sort of tweak to the
>> query to get it to use the timescaledb features?
>>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>
>>
>> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster 
>> wrote:
>>
>>>
>>> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>>>
>>> [Timescale Dev Advocate here]
>>> I realize this might not be the most accepted answer (could be
>>> interpreted as trying to "sell" something), but feels like an opportunity
>>> to talk about DISTINCT queries and opportunities. Because you have that
>>> index, Timescale 2.3 added a "Skip Scan" query planner node that works on
>>> regular BTree indexes (it doesn't have to be time-series/TimescaleDB
>>> Hypertable data at all). In this case, your distinct query would likely run
>>> in a few milliseconds based on the counts you mention (170 stations, 3
>>> channels pe

Re: Faster distinct query?

2021-09-23 Thread Ryan Booz
Heh, I honestly forgot about the recursive CTE. Certainly worth a try and
wouldn't require installing other extensions.

This is what depesz is referring to:
https://wiki.postgresql.org/wiki/Loose_indexscan
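
For reference, a rough sketch of that technique applied to both columns of the
data(station, channel) index from the original post. This is untested against
the OP's schema (table and column names are taken from the thread), so consider
it a starting point rather than a drop-in answer:

-- Walk the (station, channel) index one distinct pair at a time,
-- instead of reading all ~800 million index entries.
WITH RECURSIVE pairs AS (
    (SELECT station, channel
     FROM data
     ORDER BY station, channel
     LIMIT 1)                         -- the smallest (station, channel) pair
    UNION ALL
    SELECT n.station, n.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT d.station, d.channel
        FROM data d
        WHERE (d.station, d.channel) > (p.station, p.channel)
        ORDER BY d.station, d.channel
        LIMIT 1                       -- jump to the next distinct pair via the index
    ) n
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;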

On Thu, Sep 23, 2021 at 3:04 AM hubert depesz lubaczewski 
wrote:

> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> > I was wondering if there was any way to improve the performance of this
> query:
> >
> > SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP
> BY station;
> >
> > The explain execution plan can be found here:
> > https://explain.depesz.com/s/mtxB#html <
> https://explain.depesz.com/s/mtxB#html>
> >
> > and it looks pretty straight forward. It does an index_only scan,
> followed by an aggregate, to produce a result that is a list of stations
> along with a list of channels associated with each (there can be anywhere
> from 1 to 3 channels associated with each station). This query takes around
> 5 minutes to run.
> >
> > To work around the issue, I created a materialized view that I can
> update periodically, and of course I can query said view in no time flat.
> However, I’m concerned that as the dataset grows, the time it takes to
> refresh the view will also grow (correct me if I am wrong there).
> >
> > This is running PostgreSQL 13, and the index referenced is a two-column
> index on data(station, channel)
>
> It looks that there is ~ 170 stations, and ~ 800 million rows int he
> table.
>
> can you tell us how many rows has this:
>
> select distinct station, channel from data;
>
> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.
>
> Best regards,
>
> depesz
>
>
>
>


Re: Growth planning

2021-10-04 Thread Ryan Booz
As for clustering, unfortunately it's a one-time operation in Postgres (as
far as I'm aware), so you'd have to re-run CLUSTER every time data is
inserted or updated. If the table is partitioned, I presume CLUSTER can be run
on each partition's index individually - but I'm not sure.
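
For anyone following along, this is roughly what that looks like in SQL (the
table and index names here are made up for the example):

-- Physically rewrite the table in the order of the chosen index
-- (takes an ACCESS EXCLUSIVE lock while it runs).
ALTER TABLE data CLUSTER ON data_datetime_idx;
CLUSTER data;

-- New rows are not kept in that order, so after later bulk writes
-- the table has to be re-clustered:
CLUSTER data;   -- reuses the index remembered from ALTER TABLE ... CLUSTER ON
CLUSTER;        -- or re-cluster every previously clustered table you own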

On Mon, Oct 4, 2021 at 6:05 PM Rob Sargent  wrote:

> On 10/4/21 3:37 PM, Israel Brewster wrote:
>
> On Oct 4, 2021, at 1:21 PM, Rob Sargent  wrote:
>
> My "strict" table per station suggestion was meant as an option to avoid
> the partitioning pain point entirely if it wasn't going to buy you
> anything. Namely querying more than one station's data.
>
>
> Ah, so in theory making “strict” tables for each would be easier than
> creating partitions for each? Something to consider for sure if so.
>
>
> In a write-once scenario such as this,  would a "clustered index" on
> datetime be stable, performant?  Seems a read-for-export could put the head
> down at time point A and just go?
>
> That’s beyond my level of DB admin knowledge, unfortunately :) I can
> certainly read up on it and give it a try though!
>
>
> I was hoping one of the smart people would chime in;)
>


Re: DBeaver does not show all tables in a Schema

2021-10-29 Thread Ryan Booz
In a recent update (not sure which one), DBeaver's default seems to have
changed so that the navigator view is set to "simple" rather than "advanced",
which is the view that shows all objects.

Right-click the server -> edit connection -> Select "General" -> verify
"Navigator View"



On Fri, Oct 29, 2021 at 9:48 AM Shaozhong SHI 
wrote:

> It is a new installation with all permissions.  Nothing has done to it.
>
> Regards,
>
> David
>
> On Fri, 29 Oct 2021 at 14:14, negora  wrote:
>
>> Are you sure that you're not applying a filter in the "Database
>> Navigator" panel of DBeaver?
>>
>> Sometimes, it occurs to me that I apply a filter for certain database, I
>> forget to remove it, then open another database, and some or all tables
>> do not appear in the navigator.
>>
>>
>> On 29/10/2021 12:46, Shaozhong SHI wrote:
>> > I used a DBeaver to connect to postgres but it does not show all
>> > tables in a schema.
>> >
>> > Can anyone shed light on this?
>> >
>> > Regards,
>> >
>> > David
>>
>>
>>