Should from_collapse be switched off? (queries 10 times faster)

2018-03-23 Thread Peter

My queries get up to 10 times faster when I disable from_collapse
(setting from_collapse_limit=1).

After this finding, the pragmatic solution is easy: it needs to be
switched off.

BUT:
I found this purely by accident (after the queries had been running
for years), and this raises some questions about documentation and
best practices.

I could not find any documentation or evaluation saying that
from_collapse can have detrimental effects, much less which types of
queries may suffer from that.

Since we cannot experimentally try out all kinds of options for all of
our queries just in case they might have significant (negative)
effects, my understanding now is that, as a best practice,
from_collapse should be switched off by default, and only after
development should it be tested whether activating it brings an
improvement.
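
For reference, a minimal sketch of what "switching it off" means in
practice - shown here at session level; the same setting can equally be
put into postgresql.conf or attached via ALTER DATABASE/ROLE ... SET:

  -- at the minimum value of 1, pulling subqueries up into the outer
  -- FROM list (from-collapsing) is effectively disabled
  SET from_collapse_limit = 1;
  -- the related knob for explicit JOIN syntax, if desired
  SET join_collapse_limit = 1;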

Sadly, my knowledge does not reach into the internals. I can
understand which *logical* result I should expect from an SQL
statement, but I do not know how this is achieved internally.
So I have a very hard time trying to understand the output of
EXPLAIN, or making an educated guess about how the design of a
query may influence the execution strategy. I am usually happy when
I have found some SQL that correctly produces the results I need.
In short: I lack the experience to do manual optimization, or to
see where manual optimization might be feasible.

The manual section "Controlling the Planner with Explicit JOIN
Clauses" gives a little discussion on the issue. But it seems only
concerned about an increasing amount of cycles used for the
planning activity, not about bad results from the optimization.
Worse, it creates the impression that giving the planner maximum
freedom is usually a good thing (at least until it takes too much
cycles for the planner to evaluate all possibilities).

In my case, planning uses 1 or 2% of the cycles needed for
execution; that seems alright to me. 
And, as said above, I cannot see why my queries might be an
atypical case (I don't think they are).

If somebody would like to get a hands-on look at the actual
case, I'd be happy to put it online.

rgds,
PMc



Re: Should from_collapse be switched off? (queries 10 times faster)

2018-03-23 Thread Peter
On Fri, Mar 23, 2018 at 12:41:35PM +0100, Laurenz Albe wrote:

! https://www.postgresql.org/docs/current/static/explicit-joins.html
! states towards the end of the page that the search tree grows
! exponentially with the number of relations, and from_collapse_limit
! can be set to control that.

Yes, I read that page.

! > In my case, planning uses 1 or 2% of the cycles needed for
! > execution; that seems alright to me. 
! > And, as said above, I cannot see why my queries might be an
! > atypical case (I don't think they are).
! > 
! > If somebody would like to get a hands-on look onto the actual
! > case, I'd be happy to put it online.
! 
! It seems like you are barking up the wrong tree.
! 
! Your query does not take long because of the many relations in the
! FROM list, but because the optimizer makes a wrong choice.

Exactly! 
And I am working hard in order to understand WHY this happens.

! The correct solution is *not* to set from_collapse_limit = 1, but
! to find and fix the problem that causes the optimizer to make a
! wrong choice.
! 
! If you send the query and the output of
! EXPLAIN (ANALYZE, BUFFERS) SELECT ...
! we have a chance of telling you what's wrong.

Your viewpoint would be preferable; only I am lacking any idea of
where there could be such a problem that would constitute a root cause.

I will gladly follow Your suggestion; data is underway. 

P.



Re: Should from_collapse be switched off? (queries 10 times faster)

2018-03-23 Thread Peter

The problem appeared when I found the queries suddenly taking longer
than usual. Investigation showed that execution time greatly depends
on the way the queries are invoked.
Consider fn(x) simply a macro containing a plain SQL SELECT statement
returning SETOF (further detail follows below):

# SELECT fn(x);
-> 6.3 sec.

# SELECT a from fn(x) as a;
-> 1.3 sec.

Further investigation with auto_explain shows different plans being
chosen. The slower one uses an Index Only Scan, which seems to perform
badly. Slightly increasing random_page_cost solves this, but that seems
the wrong way, because we are on SSD+ZFS, where random_page_cost
should actually be DEcreased, as there is no difference between random
and sequential access.
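
For illustration, this is the knob in question (the value 1.1 is merely
a common example for low-latency storage, not a recommendation from this
thread):

  -- default is 4.0; on storage where random reads cost about the same
  -- as sequential reads it is commonly lowered, e.g.:
  SET random_page_cost = 1.1;
  -- or persistently for a given tablespace:
  ALTER TABLESPACE pg_default SET (random_page_cost = 1.1);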

During this effort I accidentally came upon from_collapse_limit,
and disabling it significantly changed things:

# SET from_collapse_limit = 1;

# SELECT fn(x);
-> 0.6 sec.

# SELECT a from fn(x) as a;
-> 1.2 sec.

The plans look different now (obviously), and again the difference
between the two invocations comes from an Index Only Scan, but
this time the Index Only Scan is faster. So now we can reduce
random_page_cost in order to better reflect the physical circumstances,
and then both invocations will be fast.

From here it looks like from_collapse is the problem.


Now for the details:

VACUUM ANALYZE is up to date, and all respective configurations are as
default.

The query itself contains three nested SELECTs, all working on the same
table. The table is 400'000 rows, 36 MB. (The machine is a Pentium III,
which is my router - so don't be surprised about the comparatively long
execution times.)

This is the (critical part of the) query - let $1 be something like
'2017-03-03':

  SELECT MAX(quotes.datum) AS ratedate, aktkurs.*
FROM quotes, wpnames, places,
 (SELECT quotes.datum, close, quotes.wpname_id, places.waehrung
FROM quotes, wpnames, places,
 (SELECT MAX(datum) AS datum, wpname_id
FROM quotes
WHERE datum <= $1
GROUP BY wpname_id) AS newest
WHERE newest.datum = quotes.datum
  AND newest.wpname_id = quotes.wpname_id
  AND quotes.wpname_id = wpnames.id
  AND wpnames.place_id = places.id) AS aktkurs
WHERE quotes.wpname_id = wpnames.id
  AND wpnames.place_id = places.id AND places.platz = 'WAEHR'
  AND wpnames.nummer = aktkurs.waehrung
  AND quotes.datum <= aktkurs.datum
GROUP BY aktkurs.datum, aktkurs.close, aktkurs.wpname_id,
  aktkurs.waehrung

Here are the (respective parts of the) tables:

CREATE TABLE public.quotes -- rows = 405466, 36 MB
(
  id integer NOT NULL DEFAULT nextval('quotes_id_seq'::regclass),
  wpname_id integer NOT NULL,
  datum date NOT NULL,
  close double precision NOT NULL,
  CONSTRAINT quotes_pkey PRIMARY KEY (id),
  CONSTRAINT fk_rails_626c320689 FOREIGN KEY (wpname_id)
  REFERENCES public.wpnames (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
) 
CREATE INDEX quotes_wd_idx -- 8912 kB
  ON public.quotes
  USING btree
  (wpname_id, datum);

CREATE TABLE public.wpnames -- rows = 357, 40 kB
(
  id integer NOT NULL DEFAULT nextval('wpnames_id_seq'::regclass),
  place_id integer NOT NULL,
  nummer text NOT NULL,
  name text NOT NULL,
  CONSTRAINT wpnames_pkey PRIMARY KEY (id),
  CONSTRAINT fk_rails_18eae07552 FOREIGN KEY (place_id)
  REFERENCES public.places (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
)
  
CREATE TABLE public.places -- rows = 11, 8192 b
(
  id integer NOT NULL DEFAULT nextval('places_id_seq'::regclass),
  platz text NOT NULL,
  text text,
  waehrung character varying(3) NOT NULL,
  CONSTRAINT places_pkey PRIMARY KEY (id)
)

Hint: the quotes table contains daily stock quotes AND forex quotes,
and what the thing does is fetch the newest quotes before a given
date (innermost SELECT), fetch the respective currency ("waehrung") from
wpnames+places (next SELECT), and fetch the (date of the) respective
newest forex quote (last SELECT). (A final, outermost fourth SELECT
will then put it all together, but that's not part of the problem.)

Finally, the execution plans:

6 sec. index only scan with from_collapse:
https://explain.depesz.com/s/IPaT

1.3 sec. seq scan with from_collapse:
https://explain.depesz.com/s/Bxys

1.2 sec. seq scan w/o from_collapse:
https://explain.depesz.com/s/V02L

0.6 sec. index only scan w/o from_collapse:
https://explain.depesz.com/s/8Xh


Addendum: from the Guides for the mailing list, supplemental
information as requested. As this concerns planner strategy, which is
influenced by statistics, it appears difficult to me to create a
proper test-case, because I would need to know from where the planner
fetches the decision-relevant information - which is exactly my
question: how does it get the clue to choose the bad plans?

 CPU: Intel Pentium III (945.02-MHz 686-class CPU)
 avail memory = 2089263104 (1992 MB)
 FreeBSD 11.1-RELEASE-p7
 PostgreSQL 9.5.7 on i386-portbld-freebsd11.1, compiled

Re: Should from_collapse be switched off? (queries 10 times faster)

2018-03-23 Thread Peter
On Fri, Mar 23, 2018 at 10:14:19AM -0400, Tom Lane wrote:

! It's conceivable that the OP's problem is actually planning time
! (if the query joins sufficiently many tables) and that restricting
! the cost of the join plan search is really what he needs to do.

Negative. Planning time 10 to 27 ms. Execution time 600 to 6300 ms.

! Lacking any further information about the problem, we can't say.
! We can, however, point to
! https://wiki.postgresql.org/wiki/Slow_Query_Questions
! concerning how to ask this type of question effectively.

I strongly hope the data that I sent as a followup will now
meet Your expectations.

rgds,
PMc



functions: VOLATILE performs better than STABLE

2018-03-23 Thread Peter

Given an arbitrary function fn(x) returning numeric.

Question: how often is the function executed?


A. 
select fn('const'), fn('const');

Answer:
Twice. 

This is not a surprise.


B.
select v,v from fn('const') as v;  [1]

Answer:
Once.


C.
select v.v,v.v from (select fn('const') as v) as v;

Answer:
Once if declared VOLATILE.
Twice if declared STABLE.

Now this IS a surprise. It is clear that the system is not allowed to
execute the function twice when declared VOLATILE. It IS ALLOWED to
execute it twice when STABLE - but to what end, except prolonging
execution time?

Overall, VOLATILE performs better than STABLE.
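
For reference, a minimal sketch of the kind of declaration being compared
(fn here is a stand-in with a placeholder body, not the actual function
from this thread):

  CREATE FUNCTION fn(text) RETURNS numeric AS
  $$ SELECT length($1)::numeric $$   -- placeholder body
  LANGUAGE sql VOLATILE;             -- vs. LANGUAGE sql STABLE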


[1] I seem to remember that I was not allowed to do this when I coded
my SQL, because expressions in the from clause must return SETOF, not
a single value. Now it seems to work.



Re: Should from_collapse be switched off? (queries 10 times faster)

2018-03-26 Thread Peter
Laurenz,

thank You very much for Your comments!

On Sun, Mar 25, 2018 at 07:12:08AM +0200, Laurenz Albe wrote:

! Your reported execution times don't match the time reported in the
! EXPLAIN output...

Should these match? 
It seems the EXPLAIN (ANALYZE, BUFFERS) does additional things, not 
just execute the query. 

! The cause of the long execution time is clear:
! 
! The row count of the join between "places" (WHERE platz = 'WAEHR'),
! "wpnames" and "places AS places_1" is underestimated by a factor of 10
! (1 row instead of 10).
! 
! The nested loop join that is chosen as a consequence is now executed
! 10 times instead of the estimated 1 time, which is where almost all the
! execution time is spent.

I've seen this, but do not fully understand it yet.
 
! Setting "enable_nestloop = off" is as coarse as forcing "from_collapse = 1"
! and will negatively impact other queries - if it helps at all.

Since this query is already put into a function, I found I can easily
set from_collapse_limit = 1 only for this function, by means of "ALTER
FUNCTION ... SET ...", so it only influences this query.
It seems this is the most straightforward solution here.
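
For illustration, that per-function override looks roughly like this
(function name and signature are placeholders):

  -- the setting is applied only while this function is executed
  ALTER FUNCTION fn(date) SET from_collapse_limit = 1;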
 
rgds,
P.



Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Peter
On Sun, Apr 14, 2019 at 05:19:11PM -0400, Tom Lane wrote:
! Gunther  writes:
! > For weeks now, I am banging my head at an "out of memory" situation. 
! > There is only one query I am running on an 8 GB system, whatever I try, 
! > I get knocked out on this out of memory. It is extremely impenetrable to 
! > understand and fix this error. I guess I could add a swap file, and then 
! > I would have to take the penalty of swapping. But how can I actually 
! > address an out of memory condition if the system doesn't tell me where 
! > it is happening?

Well, exactly with a swap space. No offense intended, but if You
don't have a swap space, You should not complain about unintelligible
out-of-memory situations.
Swap space is not usually used to run applications from (that would
indeed give horrible performance); it is used to not get out-of-memory
errors. With swap space, the out-of-memory situation will persist,
and so one has time to take measurements and analyze system
behaviour, and from that one can better understand what is causing
the problem and decide what actions should be taken, on an informed
basis (e.g. correct flaws in the system tuning, fix a bad query, buy
more memory, or whatever may be applicable).

If I remember correctly, I even saw a DTRACE flag in my build,
so what more could one wish for? :))

P.




pgaudit and create postgis extension logs a lot inserts

2018-01-18 Thread Svensson Peter
Hi,

I have installed pgaudit, and configured as:
pgaudit.log = 'ddl,role'
pgaudit.log_level = 'log'  (default)

Versions:  postgresql96 (9.6.6) , pgaudit96 (1.0.4), postgis 2.3.2,  Rhel 7.4

When I then install the postgis extension in a database, it writes a huge
amount of log output, which slows down the server a lot.
Not only table and function creation are logged; even all the inserts into
spatial_ref_sys are written to the audit log.

LOG:  AUDIT: SESSION,1,1,DDL,CREATE FUNCTION,,,"
..
INSERT INTO ""spatial_ref_sys"" (""srid"",""auth_name"


This behaviour makes pgaudit useless in our environment due to the
overhead of log-file writes.
I have tried different combinations of pgaudit.log settings
(role,-functions), (role), and also changed pgaudit.log_level to warning,
but it was no better.

Does anybody have useful pgaudit settings which do not overflow the log
files, even when installing postgis or other extensions?

I also noticed that setting the session log to none (set pgaudit.log='none';)
overrides the parameter from postgresql.conf, but is not itself logged, so
you can then do whatever you want without any audit.
I would have supposed this change of the audit session log parameter should
be logged to file?


Regards,
Peter


SV: pgaudit and create postgis extension logs a lot inserts

2018-01-19 Thread Svensson Peter

A test creating the postgis extension made 4 rsyslog processes run for
several minutes with high CPU utilization, and when you have only 8 CPUs
this takes a lot of resources.
The create command also has to wait until all the log output is written,
so the impact is great.
The log file grew to 16 GB just for this.

We have several databases on the same server, some of them with postgis.
Those databases are maintained by different people, and telling them to
disable pgaudit every time they do something that can generate a lot of
log output would create bad behaviour, especially since we cannot see in
the logs that they have disabled pgaudit.

I think the postgis extension is not the only extension that creates
tables and functions and inserts data, but if there were a way to
configure pgaudit so you get rid of the inserts, maybe that would be a
way to handle it.

/Peter

From: Joe Conway [[email protected]]
Sent: 18 January 2018 17:54
To: Svensson Peter; [email protected]
Subject: Re: pgaudit and create postgis extension logs a lot inserts

On 01/18/2018 04:12 AM, Svensson Peter wrote:
> When I then install  postgis extension in a database it writes a huge
> amount of logs which slow down the server a lot.
> Not only table creation and functions are logged,  even  all inserts in
> spatial_ref_sys are written to the audit-log.
>
> LOG:  AUDIT: SESSION,1,1,DDL,CREATE FUNCTION,,,"
> ..
> INSERT INTO ""spatial_ref_sys"" (""srid"",""auth_name"
> 
>
> This behaviour make pgaudit useless in our environment due to the
> overhead in log-file write.

How often do you intend to install PostGIS? Disable pgaudit, install
PostGIS, enable pgaudit?

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




SV: pgaudit and create postgis extension logs a lot inserts

2018-01-22 Thread Svensson Peter
OK,  thanks a lot. 

Regards,
Peter

From: David Steele [[email protected]]
Sent: 19 January 2018 14:41
To: Magnus Hagander; Joe Conway
Cc: Svensson Peter; [email protected]
Subject: Re: pgaudit and create postgis extension logs a lot inserts

On 1/19/18 6:05 AM, Magnus Hagander wrote:
>
>
> On Thu, Jan 18, 2018 at 6:54 PM, Joe Conway  <mailto:[email protected]>> wrote:
>
> On 01/18/2018 04:12 AM, Svensson Peter wrote:
> > When I then install  postgis extension in a database it writes a huge
> > amount of logs which slow down the server a lot.
> > Not only table creation and functions are logged,  even  all inserts in
> > spatial_ref_sys are written to the audit-log.
> >
> > LOG:  AUDIT: SESSION,1,1,DDL,CREATE FUNCTION,,,"
> > ..
> > INSERT INTO ""spatial_ref_sys"" (""srid"",""auth_name"
> > 
> >
> > This behaviour make pgaudit useless in our environment due to the
> > overhead in log-file write.
>
> How often do you intend to install PostGIS? Disable pgaudit, install
> PostGIS, enable pgaudit?
>
>
> Would it make sense for pgaudit to, at least by option, not include DDL
> statements that are generated as "sub-parts" of a CREATE EXTENSION? It
> should still log the CREATE EXTENSION of course, but not necessarily all
> the contents of it, since that's actually defined in the extension
> itself already?

That's doable, but I think it could be abused if it was always on and
installing extensions is generally not a daily activity.

It seems in this case the best action is to disable pgaudit before
installing postgis or install postgis first.
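
As a rough sketch of that workaround (this uses exactly the unaudited
override discussed upthread, so it needs a superuser and a note in the
change log):

  SET pgaudit.log = 'none';     -- suppress session auditing temporarily
  CREATE EXTENSION postgis;
  RESET pgaudit.log;            -- restore the postgresql.conf setting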

Regards,
--
-David
[email protected]



Re: blending fast and temp space volumes

2018-02-21 Thread Peter Geoghegan
On Wed, Feb 21, 2018 at 7:53 AM, Rick Otten  wrote:
> side note:  The disadvantage of local SSD is that it won't survive "hitting
> the virtual power button" on an instance, nor can it migrate automatically
> to other hardware.  (We have to hit the power button to add memory/cpu to
> the system, and sometimes the power button might get hit by accident.)  This
> is OK for temp space.  I never have my database come up automatically on
> boot, and I have scripted the entire setup of the temp space volume and data
> structures.  I can run that script before starting the database.   I've done
> some tests and it seems to work great.  I don't mind rolling back any
> transaction that might be in play during a power failure.

It sounds like you're treating a temp_tablespaces tablespace as
ephemeral, which IIRC can have problems that an ephemeral
stats_temp_directory does not have.

-- 
Peter Geoghegan



Re: blending fast and temp space volumes

2018-02-21 Thread Peter Geoghegan
On Wed, Feb 21, 2018 at 12:07 PM, Claudio Freire  wrote:
> On Wed, Feb 21, 2018 at 4:50 PM, Peter Geoghegan  wrote:
>> On Wed, Feb 21, 2018 at 7:53 AM, Rick Otten  wrote:
>>> side note:  The disadvantage of local SSD is that it won't survive "hitting
>>> the virtual power button" on an instance, nor can it migrate automatically
>>> to other hardware.  (We have to hit the power button to add memory/cpu to
>>> the system, and sometimes the power button might get hit by accident.)  This
>>> is OK for temp space.  I never have my database come up automatically on
>>> boot, and I have scripted the entire setup of the temp space volume and data
>>> structures.  I can run that script before starting the database.   I've done
>>> some tests and it seems to work great.  I don't mind rolling back any
>>> transaction that might be in play during a power failure.
>>
>> It sounds like you're treating a temp_tablespaces tablespace as
>> ephemeral, which IIRC can have problems that an ephemeral
>> stats_temp_directory does not have.
>
> For instance?
>
> I've been doing that for years without issue. If you're careful to
> restore the skeleton directory structure at server boot up, I haven't
> had any issues.

Then you clearly know what I mean already. That's not documented as
either required or safe anywhere.

-- 
Peter Geoghegan



Re: creating hash indexes

2022-12-14 Thread Peter Geoghegan
On Wed, Dec 14, 2022 at 12:03 PM Rick Otten  wrote:
> Assuming I can live with the slower inserts, is there any parameter in 
> particular I can tweak that would make the time it takes to create the hash 
> index closer to the btree index creation time?  In particular if I wanted to 
> try this on a several billion row table in a busy database?

No. B-Tree index builds are parallelized, and are far better optimized
in general.
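
For what it's worth, the degree of parallelism for such a build is
controlled per session (names and values below are merely illustrative):

  SET max_parallel_maintenance_workers = 4;   -- PostgreSQL 11 and later
  CREATE INDEX my_btree_idx ON my_table (my_column);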

> -  As long as the index fits in memory, varchar btree isn't really that much 
> slower in postgresql 14 (the way it was a few years ago), so we'll probably 
> just live with that for the forseeable future given the complexity of 
> changing things at the moment.

The other things to consider are 1.) the index size after retail
inserts, 2.) the index size following some number of updates and
deletes.

Even if you just had plain inserts for your production workload, the
picture will not match your test case (which I gather just looked at
the index size after a CREATE INDEX ran). I think that B-Tree indexes
will still come out ahead if you take this growth into account, and by
quite a bit, but probably not due to any effect that your existing test case
exercises.

B-Tree indexes are good at accommodating unpredictable growth, without
ever getting terrible performance on any metric of interest. So it's
not just that they tend to have better performance on average than
hash indexes (though they do); it's that they have much more
*predictable* performance characteristics as conditions change.

--
Peter Geoghegan




Re: time sorted UUIDs

2023-04-17 Thread peter plachta
Hi Tim -- I am looking at the issue of random IDs (ie, UUIDs) as well. Did
you have a chance to try time sorted UUIDs as was suggested in one of the
responses?

On Mon, Apr 17, 2023 at 5:23 PM Tim Jones  wrote:

> Hi,
>
> could someone please comment on this article
> https://vladmihalcea.com/uuid-database-primary-key/ specifically re the
> comments (copied below) in regards to a Postgres database.
>
> ...
>
> But, using a random UUID as a database table Primary Key is a bad idea for
> multiple reasons.
>
> First, the UUID is huge. Every single record will need 16 bytes for the
> database identifier, and this impacts all associated Foreign Key columns as
> well.
>
> Second, the Primary Key column usually has an associated B+Tree index to
> speed up lookups or joins, and B+Tree indexes store data in sorted order.
>
> However, indexing random values using B+Tree causes a lot of problems:
>
>- Index pages will have a very low fill factor because the values come
>randomly. So, a page of 8kB will end up storing just a few elements,
>therefore wasting a lot of space, both on the disk and in the database
>memory, as index pages could be cached in the Buffer Pool.
>- Because the B+Tree index needs to rebalance itself in order to
>maintain its equidistant tree structure, the random key values will cause
>more index page splits and merges as there is no pre-determined order of
>filling the tree structure.
>
> ...
>
>
> Any other general comments about time sorted UUIDs would be welcome.
>
>
>
> Thanks,
>
> *Tim Jones*
>
>
>


High QPS, random index writes and vacuum

2023-04-17 Thread peter plachta
Hi all

The company I work for has a large (50+ instances, 2-4 TB each) Postgres
install. One of the key problems we are facing in vanilla Postgres is
vacuum behavior on high QPS (20K writes/s), random index access on UUIDs.

In one case the table is 50 GB and has 3 indexes which are also 50 GB
each. It takes 20 hours to vacuum the entire thing, where the bulk of the
time is spent doing 'index vacuuming'. The table is then instantly
vacuumed again. I increased work_mem to 2 GB, decreased the sleep
threshold to 2 ms and increased the IO limit to 2000. I also changed the
autovacuum thresholds for this table.
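
For context, per-table overrides of that kind are typically expressed
along these lines (the table name and values are hypothetical, not the
poster's actual settings):

  ALTER TABLE my_big_table SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead rows
    autovacuum_vacuum_cost_delay   = 2,     -- throttling sleep, in ms
    autovacuum_vacuum_cost_limit   = 2000   -- work allowed between sleeps
  );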

I understand that doing random index writes is not a good strategy, but 20
hours to vacuum 200 GB is excessive.

My question is: what is the recommended strategy to deal with such cases in
Postgres?

Thanks very much!!


Re: High QPS, random index writes and vacuum

2023-04-17 Thread Peter Geoghegan
On Mon, Apr 17, 2023 at 5:35 PM peter plachta  wrote:
> My question is: what is the recommended strategy to deal with such cases in 
> Postgres?

You didn't say what version of Postgres you're using...

-- 
Peter Geoghegan




Re: High QPS, random index writes and vacuum

2023-04-17 Thread peter plachta
Thank you Tom.
Version: I sheepishly admit it's 9.6, 10 and 11 (it's Azure Single Server,
that's another story).

I am definitely looking at redoing the way we do UUIDs... but that's not a
trivial change given the volume of data we have + the 24/7 workload.

I was trying to understand whether there are any known workarounds for
random access + index vacuums. Are my vacuum times 'normal'?

On Mon, Apr 17, 2023 at 7:01 PM Tom Lane  wrote:

> peter plachta  writes:
> > The company I work for has a large (50+ instances, 2-4 TB each) Postgres
> > install. One of the key problems we are facing in vanilla Postgres is
> > vacuum behavior on high QPS (20K writes/s), random index access on UUIDs.
>
> Indexing on a UUID column is an antipattern, because you're pretty much
> guaranteed the worst-case random access patterns for both lookups and
> insert/delete/maintenance cases.  Can you switch to timestamps or
> the like?
>
> There are proposals out there for more database-friendly ways of
> generating UUIDs than the traditional ones, but nobody's gotten
> around to implementing that in Postgres AFAIK.
>
> regards, tom lane
>


Re: High QPS, random index writes and vacuum

2023-04-17 Thread peter plachta
Thank you David -- I increased this to 1GB as well (seeing as that was the
max). We are doing mostly single passes now.

On Mon, Apr 17, 2023 at 7:40 PM David Rowley  wrote:

> On Tue, 18 Apr 2023 at 12:35, peter plachta  wrote:
> > I increased work_mem to 2Gb
>
> maintenance_work_mem is the configuration option that vacuum uses to
> control how much memory it'll make available for storage of dead
> tuples. I believe 1GB would allow 178,956,970 tuples to be stored
> before multiple passes would be required. The chunk of memory for dead
> tuple storage is capped at 1GB.
>
> David
>


Re: High QPS, random index writes and vacuum

2023-04-17 Thread Peter Geoghegan
On Mon, Apr 17, 2023 at 7:43 PM peter plachta  wrote:
> Version: I sheepishly admit it's 9.6, 10 and 11 (it's Azure Single Server, 
> that's another story).

If you can upgrade to 14, you'll find that there is much improved
management of index updates on that version:

https://www.postgresql.org/docs/current/btree-implementation.html#BTREE-DELETION

But it's not clear what the problem really is here. If the problem is
that you're dependent on vacuum to get acceptable response times by
holding back index bloat, then an upgrade could easily help a lot. But
an upgrade might not make VACUUM take less time, given that you've
already tuned it fairly aggressively. It depends.

An upgrade might make VACUUM go faster if you set
vacuum_cost_page_miss to 2, which is the default on later versions
anyway -- looks like you didn't touch that. And, as Thomas said, later
versions do have parallel VACUUM, though that cannot be used by
autovacuum workers.
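
A sketch of the setting referred to (2 became the default in PostgreSQL
14; older versions default to 10):

  -- cluster-wide, via postgresql.conf or ALTER SYSTEM:
  ALTER SYSTEM SET vacuum_cost_page_miss = 2;
  SELECT pg_reload_conf();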

-- 
Peter Geoghegan




Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-12 Thread Peter Geoghegan
On Mon, Jun 12, 2023 at 1:17 PM benoit  wrote:
> Is there a misusage of my indexes?
>
> Is there a limitation when using ANY or IN operators and ordered LIMIT behind?

It's complicated. Do you find that you get satisfactory performance if
you force a bitmap index scan? In other words, what is the effect of
"set enable_indexscan = off" on your original query? Does that speed
up execution at all? (I think that this approach ought to produce a
plan that uses a bitmap index scan in place of the index scan, without
changing anything else.)
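
Concretely, the suggested experiment is along these lines (session-local,
just for testing; the SELECT stands in for the original query):

  SET enable_indexscan = off;               -- discourage plain index scans
  EXPLAIN (ANALYZE, BUFFERS) SELECT ...;    -- re-run the original query
  RESET enable_indexscan;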

-- 
Peter Geoghegan




Results of experiments with UUIDv7, UUIDv8

2023-07-30 Thread peter plachta
Hi all

My colleague and I did some experiments to see what effect using UUIDs as
secondary index keys has on index IO. The context is that by default ORM
frameworks will use UUIDs as index keys, which I found to be a major factor
in performance issues at Celonis. I suspect this isn't specific to Celonis.
The secondary factor is that random IO on Azure Single Server can be slow
as a dog -- thus for large enough indexes that aren't cached, and workloads
doing insert/delete at a high enough QPS, this really hurts.

We found that using UUID v7 (which has a longer time-based prefix than v8)
gave 30% IO savings on index access and roughly the same in index size
after an I/D workload; v8 was ~24%. We simulated slow, random IO by running
this on a USB key, which seemed to match Azure performance pretty well. SSD
was maybe 2x better.
This is relative to UUID v3, which is essentially random (actually, pretty
good random distribution on a 500 GB table).

This isn't as much as I expected but, again, for large indexes with slow
IO it was significant.

 peter


Table copy with SERIALIZABLE is incredibly slow

2023-07-30 Thread peter plachta
Hi all

Background: we're trying pg_repack-like functionality to compact a 500 GB
table with two 145 GB indexes, from which we deleted 80% of the rows.
Offline is not an option. The table has a moderate (let's say 100 QPS) I/D
workload running.

The typical procedure for this type of thing is basically CDC:

1. create 'log' table/create trigger
2. under SERIALIZABLE: select * from current_table insert into new_table
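
A very rough sketch of those two steps, with entirely hypothetical object
names (the trigger function, the replay of the log table and primary-key
handling are omitted):

  -- 1. capture ongoing changes into a log table
  CREATE TABLE big_table_log (LIKE big_table);
  CREATE TRIGGER big_table_cdc
    AFTER INSERT OR UPDATE OR DELETE ON big_table
    FOR EACH ROW EXECUTE PROCEDURE log_change();  -- function not shown

  -- 2. copy the live rows under a serializable snapshot
  BEGIN ISOLATION LEVEL SERIALIZABLE;
  INSERT INTO big_table_new SELECT * FROM big_table;
  COMMIT;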

What we're finding is that for the first 30 minutes the rate is 10Gb/s,
then it drops to 1Mb/s and stays there; 22 hours later the copy is still
going, and by now the log table is huge, so we know the replay will also
take a very long time.

===

Q: what are some ways in which we could optimize the copy?

Btw this is Postgres 9.6

(we tried unlogged table (that did nothing), we tried creating indexes
after (that helped), we're experimenting with RRI)

Thanks!


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-28 Thread Peter Geoghegan
On Mon, Aug 28, 2023 at 5:33 PM jayaprabhakar k  wrote:
> REINDEX requires a full table scan
>
> Roughly create a new index, rename index, drop old index.
> REINDEX is not incremental. running reindex frequently does not reduce the 
> future reindex time.

You didn't say which Postgres version you're on. Note that Postgres 14
can deal with index bloat a lot better than earlier versions could.
This is known to work well with partial indexes. See:

https://www.postgresql.org/message-id/flat/CAL9smLAjt9mZC2%3DqBeJwuNPq7KMAYGTWWQw_hvA-Lfo0b3ycow%40mail.gmail.com

-- 
Peter Geoghegan




Re: Trying to understand why a query is filtering when there is a composite index

2024-08-18 Thread Peter Geoghegan
On Sun, Aug 18, 2024 at 9:56 PM Stephen Samuel (Sam)  wrote:
> My question is, why isn't it using the index for column b? Is this expected? 
> And why is it doing heap lookups for every row,.

This has been fixed for Postgres 17:

https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans

-- 
Peter Geoghegan




Re: Trying to understand why a query is filtering when there is a composite index

2024-08-18 Thread Peter Geoghegan
On Sun, Aug 18, 2024 at 10:01 PM Stephen Samuel (Sam)  wrote:
> Oh as simple as upgrading!
> Ok great, appreciate the quick reply. Will have to wait for AWS to support 17 
> :)

It is possible to use index quals for both a and b on earlier
versions, with certain restrictions. You might try setting
random_page_cost to a much lower value, to see if that allows the
planner to use such a plan with your real query.

In my experience it's very unlikely that the planner will do that,
though, even when coaxed. At least when there are this many IN()
constants. So you probably really will need to upgrade to 17.

-- 
Peter Geoghegan




Re: Trying to understand why a query is filtering when there is a composite index

2024-08-18 Thread Peter Geoghegan
On Sun, Aug 18, 2024 at 10:50 PM Tom Lane  wrote:
> I think it's a good bet that this query would be *slower* if
> it were done the other way.  The filter condition is eliminating
> only one of the 11 rows matching "a = 662028765".  If we did what
> you think you want, we'd initiate ten separate index descents
> to find the other ten rows.

True - on versions prior to Postgres 17.

On 17 the number of index descents will be minimal.  If there are less
than a few hundred index tuples with the value a = , then
there'll only be one descent.

> Yeah, that part is a weakness I've wanted to fix for a long
> time: it could do the filter condition by fetching b from the
> index, but it doesn't notice that and has to go to the heap
> to get b.

It was fixed? At least on 17.

-- 
Peter Geoghegan




Re: Trying to understand why a query is filtering when there is a composite index

2024-08-19 Thread Peter Geoghegan
On Mon, Aug 19, 2024 at 12:06 AM Tom Lane  wrote:
> > It was fixed? At least on 17.
>
> Oh, sorry, I was thinking of a related problem that doesn't apply
> here: matching indexes on expressions to fragments of a filter
> condition.  However, the fact that the OP's EXPLAIN shows heap
> fetches from a supposedly all-visible table suggests that his
> IN isn't getting optimized that way.

As you pointed out, the number of tuples filtered out by the filter
qual is only a small proportion of the total in this particular
example (wasn't really paying attention to that aspect myself). I
guess that that factor makes the Postgres 17 nbtree SAOP work almost
irrelevant to the exact scenario shown, since even if true index quals
could be used they'd only save at most one heap page access.

I would still expect the 17 work to make the query slightly faster,
since my testing showed that avoiding expression evaluation is
slightly faster. Plus it would *definitely* make similar queries
faster by avoiding heap access entirely -- cases where the use of true
index quals can eliminate most heap page accesses.

> No heap fetches, so it must have done the filter from the index.
> Why not in the original case?

My guess is that that's due to some kind of naturally occurring
correlation. The few unset-in-VM pages are disproportionately likely
to become heap fetches.

The difficulty at predicting this kind of variation argues for an
approach that makes as many decisions as possible at runtime. This is
particularly true of how we skip within the index scan. I wouldn't
expect skipping to be useful in the exact scenario shown, but why not
be open to the possibility? If the planner only has one choice then
there are no wrong choices.

-- 
Peter Geoghegan




Re: Has gen_random_uuid() gotten much slower in v17?

2024-09-11 Thread Peter Eisentraut

On 10.09.24 15:58, David Mullineux wrote:
I'm getting a bit concerned by the slow performance of generating uuids
on the latest dev code versus older versions. Here I compare the time to
generate 50k random uuids. Both runs are on the same machine.

I must be missing something.


Are you sure that the 18devel installation isn't compiled with 
assertions enabled?


The underlying code for gen_random_uuid() is virtually unchanged between 
PG14 and current.






Re: Trigger overhead/performance and alternatives?

2018-07-07 Thread Peter Eisentraut
On 01.07.18 11:31, AJG wrote:
> Where is mentioned trigger overhead, and provided an alternative solution
> (logical replication slot monitoring).
> 
> My 2 part question is.
> 
> 1) Does anybody have any benchmarks re: trigger overhead/performance or have
> any experience to give a sort of indication, at all?

That really depends on a lot of things, how you write the triggers, what
they do, etc.  You should probably measure that yourself.

> 2) Is anybody aware of any other clever alternatives, pg extensions or
> github code etc as an alternative to using triggers?

Maybe wal2json will give you a starting point.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: How can sort performance be so different

2019-01-30 Thread Peter Geoghegan
On Wed, Jan 30, 2019 at 3:57 AM Bob Jolliffe  wrote:
> (i) whether the sort order makes sense for the Laos names; and
> (ii) what the locale settings are on the production server where the
> problem was first reported.
>
> There will be some turnaround before I get this information.  I am
> guessing that the database is using "en_US" rather than anything Laos
> specific.  In which case "C" would probably be no worse re sort order.
> But will know better soon.
>
> This has been a long but very fruitful investigation.  Thank you all for 
> input.

If you can find a way to use an ICU collation, it may be possible to
get Laotian sort order with performance that's a lot closer to the
performance you see with the C locale. The difference that you're
seeing is obviously explainable in large part by the C locale using
the abbreviated keys technique. The system glibc's collations cannot
use this optimization.

I believe that some locales have inherently more expensive
normalization processes (comparisons) than others, but if you can
effectively amortize the cost per key by building an abbreviated key, it
may not matter that much. And, ICU may be faster than glibc anyway.
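
A sketch of what using an ICU collation could look like on PostgreSQL 10
or later (column and index names are hypothetical; a collation such as
"lo-x-icu" is normally pre-created at initdb time when the server is built
with ICU support):

  -- sort with the ICU Lao collation instead of the database's libc locale
  SELECT name FROM people ORDER BY name COLLATE "lo-x-icu";

  -- or bake it into an index so sorted output can come from the index
  CREATE INDEX people_name_lo_idx ON people (name COLLATE "lo-x-icu");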

-- 
Peter Geoghegan



Re: Transaction size and Wal2Json

2019-02-07 Thread Peter Eisentraut
On 07/02/2019 11:38, Mai Peng wrote:
> We use the plugin Wal2Json in order to catch every modification on database. 
> We’ve got an issue : WAL were growing very fast, the state of 
> pg_stat_replication still on ‘catchup’ , an error:pg_recvlogical: unexpected 
> termination of replication stream: ERROR: out of memory DETAIL: Cannot 
> enlarge string buffer. It seems Wal2Json can not handle very big transactions 
> ( more than 1 Gb).
> How could we measure the size of a transaction ?
> Could we increase this limitation ? 

You should send a bug report to wal2json.

It's plausible that some naive coding would run into the limitation that
you describe, but a bit of effort can probably solve it.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: How can sort performance be so different

2019-02-20 Thread Peter Geoghegan
On Wed, Feb 20, 2019 at 1:42 PM Bob Jolliffe  wrote:
> It seems not to be (completely) particular to the installation.
> Testing on different platforms we found variable speed difference
> between 100x and 1000x slower, but always a considerable order of
> magnitiude.  The very slow performance comes from sorting Lao
> characters using en_US.UTF-8 collation.

I knew that some collations were slower, generally for reasons that
make some sense. For example, I was aware that ICU's use of Japanese
standard JIS X 4061 is particularly complicated and expensive, but
produces the most useful possible result from the point of view of a
Japanese speaker. Apparently glibc does not use that algorithm, and so
offers less useful sort order (though it may actually be faster in
that particular case).

I suspect that the reasons why the Lao locale sorts so much slower may
also have something to do with the intrinsic cost of supporting more
complicated rules. However, it's such a ridiculously large difference
that it also seems likely that somebody was disinclined to go to the
effort of optimizing it. The ICU people found that to be a tractable
goal, but they may have had to work at it. I also have a vague notion
that there are special cases that are more or less only useful for
sorting French. These complicate the implementation of UCA style
algorithms.

I am only speculating, based on what I've heard about other cases --
perhaps this explanation is totally wrong. I know a lot more about
this stuff than most people on this mailing list, but I'm still far
from being an expert.

-- 
Peter Geoghegan



Re: How can sort performance be so different

2019-02-20 Thread Peter Geoghegan
On Wed, Feb 20, 2019 at 2:25 PM Peter Geoghegan  wrote:
> I suspect that the reasons why the Lao locale sorts so much slower may
> also have something to do with the intrinsic cost of supporting more
> complicated rules.

I strongly suspect that it has something to do with the issue
described here specifically:

http://userguide.icu-project.org/collation/concepts#TOC-Thai-Lao-reordering
-- 
Peter Geoghegan



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread Peter Geoghegan
On Sat, Feb 23, 2019 at 1:06 PM Gunther  wrote:
> I thought to keep my index tight, I would define it like this:
>
> CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;
>
> so that only pending jobs are in that index.
>
> When a job is done, follow up work is often inserted into the Queue as 
> pending, thus adding to that index.

How many distinct jobIds are there in play, roughly? Would you say
that there are many fewer distinct Jobs than distinct entries in the
index/table? Is the number of jobs fixed at a fairly low number, that
doesn't really grow as the workload needs to scale up?

-- 
Peter Geoghegan



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Peter Geoghegan
On Sun, Feb 24, 2019 at 10:02 AM Gunther  wrote:
> David Rowley  and Justin Pryzby suggested things about autovacuum. But I 
> don't think autovacuum has any helpful role here.

My suspicion is that this has something to do with the behavior of
B-Tree indexes with lots of duplicates. See also:

https://www.postgresql.org/message-id/flat/CAH2-Wznf1uVBguutwrvR%2B6NcXTKYhagvNOY3-dg9dzcYiu_vKw%40mail.gmail.com#993f152a41a1e2c257d12d118aa7ebfc

I am working on a patch to address the problem which is slated for
Postgres 12. I give an illustrative example of one of the problems
that my patch addresses here (it actually addresses a number of
distinct issues all at once):

https://postgr.es/m/cah2-wzmf0fvvhu+sszpgw4qe9t--j_dmxdx3it5jcdb8ff2...@mail.gmail.com

Do you think that that could be a significant factor here? I found
your response to my initial questions unclear.

Using a Postgres table as a queue is known to create particular
problems with bloat, especially index bloat. Fixing the underlying
behavior in the nbtree code would likely sharply limit the growth in
index bloat over time, though it still may not make queue-like tables
completely painless to operate. The problem is described in high level
terms from a user's perspective here:

https://brandur.org/postgres-queues

--
Peter Geoghegan



Failure to reordering in case of a lateral join in combination with a left join (not inner join) resulting in suboptimal nested loop plan

2019-04-30 Thread Peter Billen
Hi,

The queries in what follows can be executed on the following fiddle:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=64542f2d987d3ce0d85bbc40ddadf7d6
Please note that the queries/functions might look silly/pointless; I
extracted the performance issue I am seeing into a minimal reproducible
example.

I have the following schema:

create table parent
(
id   intprimary key
);
create table child
(
id   intprimary key,
parent_idintreferences parent(id)
);
create index on child(parent_id);

Let's start with the following inlinable setof-returning function which
basically returns the child rows for a given parent identifier:

create function child_get_1(int) returns table(a int) as
$$
selectchild.id
from  child
left join parent
onparent.id = child.parent_id
where child.parent_id = $1;
$$
language sql stable;

Note that the left join branch is intentionally not used, and thus could be
eliminated by the planner.

When executing the following query, I get a satisfying hash (left) join
(and the left join to parent is indeed eliminated):

explain analyze
select ch.* from parent p, child_get_1(p.id) ch;

++
| Hash Join  (cost=3.25..17163.23 rows=00 width=4) (actual
time=0.025..194.279 rows=00 loops=1)  |
|   Hash Cond: (child.parent_id = p.id)
  |
|   ->  Seq Scan on child  (cost=0.00..14424.00 rows=00 width=8)
(actual time=0.005..47.215 rows=00 loops=1) |
|   ->  Hash  (cost=2.00..2.00 rows=100 width=4) (actual time=0.016..0.016
rows=100 loops=1) |
| Buckets: 1024  Batches: 1  Memory Usage: 12kB
  |
| ->  Seq Scan on parent p  (cost=0.00..2.00 rows=100 width=4)
(actual time=0.001..0.007 rows=100 loops=1)   |
++


Now, I introduce a convenience function, also inlinable, which fetches the
child rows by its parent id:

create function t(int) returns setof child as
$$
select child.* from child where child.parent_id = $1;
$$
language sql stable;

I refactor `child_get_1(int)` from above as following:

create function child_get_2(int) returns table(a int) as
$$
selectchild.id
from  t($1) child
left join parent
onparent.id = child.parent_id;
$$
language sql stable;

explain analyze
select ch.* from parent p, child_get_2(p.id) ch;

Now, I get a nested loop, which as expected performs quite badly:

++
| Nested Loop  (cost=189.92..493990.48 rows=00 width=4) (actual
time=1.519..713.680 rows=00 loops=1) |
|   ->  Seq Scan on parent p  (cost=0.00..2.00 rows=100 width=4) (actual
time=0.004..0.081 rows=100 loops=1) |
|   ->  Bitmap Heap Scan on child  (cost=189.92..4739.90 rows= width=4)
(actual time=1.365..6.332 rows= loops=100)   |
| Recheck Cond: (parent_id = p.id)
 |
| Heap Blocks: exact=442476
  |
| ->  Bitmap Index Scan on child_parent_id_idx  (cost=0.00..187.42
rows= width=0) (actual time=0.838..0.838 rows= loops=100) |
|   Index Cond: (parent_id = p.id)
 |
++

For some reason I cannot explain, we now end up with a nested loop instead
of a hash join. The fairly trivial introduction of `t(int)` messes up the
reordering, but I fail to see why. I manually upped the from and join
collapse limits to 32 - just to be sure - but to no effect. Also, the left
join branch could not be eliminated. I believe this is related to the usage
of the implicit lateral join to `child_get_2(p.id)` in the main query,
which somehow messes up the reordering of `from t($1) as child` in
`child_get_2(int)`, though I am not 100% sure.

Also, note that when we apply an inner join instead of a left join, the
problem goes away. The planner now manages to end up with a hash join in
both cases.

I am seeing this on v10 and v11.

Any ideas?

Thank you. Best regards.


Re: GCC 8.3.0 vs. 9.0.1

2019-05-07 Thread Peter Geoghegan
On Tue, May 7, 2019 at 10:06 AM Tom Lane  wrote:
> Given the described test setup, I'd put basically no stock in these
> numbers.  It's unlikely that this test case's performance is CPU-bound
> per se; more likely, I/O and lock contention are dominant factors.
> So I'm afraid whatever they're measuring is a more-or-less chance
> effect rather than a real system-wide code improvement.

Or a compiler bug. Link-time optimizations give the compiler a view of
the program as a whole, not just a single TU at a time. This enables
it to perform additional aggressive optimization.

-- 
Peter Geoghegan




Re: Temporarily very slow planning time after a big delete

2019-05-21 Thread Peter Geoghegan
On Tue, May 21, 2019 at 11:12 AM Walter Smith  wrote
> I did a VACUUM overnight and got the following. The thing that stands out to 
> me is that one index (index_unproc_notifications_on_notifiable_type) took 
> 100x longer to scan than the others. That's not the index used in the slow 
> query, though.

What columns are indexed by
index_unproc_notifications_on_notifiable_type, and what are their
datatypes?

-- 
Peter Geoghegan




Re: Temporarily very slow planning time after a big delete

2019-05-21 Thread Peter Geoghegan
On Tue, May 21, 2019 at 11:16 AM Walter Smith  wrote:
> It occurs to me that is a somewhat unusual index -- it tracks unprocessed 
> notifications so it gets an insert and delete for every row, and is normally 
> almost empty.

Is it a very low cardinality index? In other words, is the total
number of distinct keys rather low? Not just at any given time, but
over time?

-- 
Peter Geoghegan




Re: Temporarily very slow planning time after a big delete

2019-05-21 Thread Peter Geoghegan
On Tue, May 21, 2019 at 11:27 AM Walter Smith  wrote:
> Very low. Probably less than ten over all time. I suspect the only use of the 
> index is to rapidly find the processed=false rows, so the notifiable_type 
> value isn’t important, really. It would probably work just as well on any 
> other column.

This problem has been fixed in Postgres 12, which treats heap TID as a
tiebreaker column within B-Tree indexes. It sounds like you have the
right idea about how to work around the problem.

VACUUM will need to kill tuples in random locations in the low
cardinality index, since the order of tuples is unspecified between
duplicate tuples -- it is more or less random. VACUUM will tend to
dirty far more pages than is truly necessary in this scenario, because
there is no natural temporal locality that concentrates dead tuples in
one or two particular places in the index. This has a far more
noticeable impact on VACUUM duration than you might expect, since
autovacuum is throttled by delays that vary according to how many
pages were dirtied (and other such factors).

-- 
Peter Geoghegan




Re: UUID v1 optimizations...

2019-05-25 Thread Peter Eisentraut
On 2019-05-25 15:45, Ancoron Luciferis wrote:
> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

It seems unlikely that we would do that, because that would break
existing stored UUIDs, and we'd also need to figure out a way to store
non-v1 UUIDs under this different scheme.  The use case is pretty narrow
compared to the enormous effort.

It is well understood that using UUIDs or other somewhat-random keys
perform worse than serial-like keys.

Btw., it might be nice to rerun your tests with PostgreSQL 12beta1.  The
btree storage has gotten some improvements.  I don't think it's going to
fundamentally solve your problem, but it would be useful feedback.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Incorrect index used in few cases..

2019-06-18 Thread Peter Geoghegan
On Tue, Jun 18, 2019 at 2:08 PM Tom Lane  wrote:
> Are you perhaps running with non-default values for any planner cost
> parameters?  Or it's not a stock build of Postgres?
>
> If you could find a way to adjust the attached example so that it
> produces the same misbehavior you see with live data, that would be
> very interesting ...

FWIW, if you move the CREATE INDEX statements before the INSERT, and
compared earlier versions of Postgres to 12, you'll see that the size
of some of the indexes are a lot smaller on 12.

v11 (representative of 9.6):

pg@tc:5411 [1067]=# \di+ i_*
   List of relations
 Schema │  Name   │ Type  │ Owner │ Table │ Size  │ Description
┼─┼───┼───┼───┼───┼─
 public │ i_agi_tc_tcn│ index │ pg│ tc│ 74 MB │
 public │ i_cid_agid_tcn  │ index │ pg│ tc│ 82 MB │
 public │ i_tc_adid_tid   │ index │ pg│ tc│ 57 MB │
 public │ i_tc_advertiser_id  │ index │ pg│ tc│ 27 MB │
 public │ i_tc_campaign_id│ index │ pg│ tc│ 28 MB │
 public │ i_tc_lud_agi│ index │ pg│ tc│ 57 MB │
 public │ i_tc_uniqueness_hash_v2 │ index │ pg│ tc│ 21 MB │
(7 rows)

v12/master:

pg@regression:5432 [1022]=# \di+ i_*
   List of relations
 Schema │  Name   │ Type  │ Owner │ Table │ Size  │ Description
┼─┼───┼───┼───┼───┼─
 public │ i_agi_tc_tcn│ index │ pg│ tc│ 69 MB │
 public │ i_cid_agid_tcn  │ index │ pg│ tc│ 78 MB │
 public │ i_tc_adid_tid   │ index │ pg│ tc│ 36 MB │
 public │ i_tc_advertiser_id  │ index │ pg│ tc│ 20 MB │
 public │ i_tc_campaign_id│ index │ pg│ tc│ 24 MB │
 public │ i_tc_lud_agi│ index │ pg│ tc│ 30 MB │
 public │ i_tc_uniqueness_hash_v2 │ index │ pg│ tc│ 21 MB │
(7 rows)

Note that i_tc_lud_agi is 30 MB, not 57MB, and that i_tc_adid_tid is
36 MB, not 57 MB.

I can see that both i_tc_lud_agi and i_tc_adid_tid consistently use
the "split after new tuple" optimization on v12.

-- 
Peter Geoghegan


Re: Perplexing, regular decline in performance

2019-06-26 Thread Peter Geoghegan
On Tue, Jun 25, 2019 at 8:49 AM Hugh Ranalli  wrote:
> What we continued to notice was a milder but still definite trend of 
> increased query times, during the course of each week, from the mid to high 
> 200 ms, to the high 300 ms to low 400 ms. Some years ago, someone had noticed 
> that as the number of "raw_page" columns in a particular table grew, 
> performance would decline. They wrote a script that once a week locks the 
> table, deletes the processed large columns (they are not needed after 
> processing), copies the remaining data to a backup table, truncates the 
> original table, then copies it back. When this script runs we see an 
> immediate change in performance, from 380 ms in the hour before the drop, to 
> 250 ms in the hour of the drop. As rows with these populated columns are 
> added during the course of a week, the performance drops, steadily, until the 
> next week's cleaning operation. Each week the performance increase is clear 
> and significant.

Can you show us the definition of the table, including its indexes?
Can you describe the data and distribution of values within the
columns, particularly where they're indexed?

-- 
Peter Geoghegan




Re: Perplexing, regular decline in performance

2019-06-26 Thread Peter Geoghegan
On Wed, Jun 26, 2019 at 12:02 PM Hugh Ranalli  wrote:
> I'm sorry, but I'm not sure what you mean by the "distribution of values 
> within the columns." Can you clarify or provide an link to an example?

I would mostly just like to see the schema of the table in question,
including indexes, and a high-level description of the nature of the
data in the table. Ideally, you would also include pg_stats.*
information for all columns in the table. That will actually let us
see a summary of the data. Though you should be careful about leaking
sensitive information that happens to be contained in the statistics,
such as the most common values.

-- 
Peter Geoghegan




Re: UUID v1 optimizations...

2019-07-07 Thread Peter Geoghegan
Please don't top post -- trim your response down so that only
still-relevant text remains.

On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
 wrote:
> Primary key indexes after an ANALYZE:
> table_name | bloat  | index_mb | table_mb
> ---++--+--
>  uuid_v1   | 767 MiB (49 %) | 1571.039 | 1689.195
>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>  uuid_seq  | 759 MiB (49 %) | 1562.766 | 1689.195
>  uuid_serial   | 700 MiB (47 %) | 1504.047 | 1689.195
>
> OK, sadly no reclaim in any of them.

I don't know how you got these figures, but most likely they don't
take into account the fact that the FSM for the index has free blocks
available. You'll only notice that if you have additional page splits
that can recycle that space. Or, you could use pg_freespacemap to get
some idea.

> 5.) REINDEX
> Table: uuid_v1  Time: 21549.860 ms (00:21.550)
> Table: uuid_v1_timestampTime: 27367.817 ms (00:27.368)
> Table: uuid_seq Time: 19142.711 ms (00:19.143)
> Table: uuid_serial  Time: 16889.807 ms (00:16.890)
>
> Even in this case it looks as if my implementation is faster than
> anything else - which I really don't get.

Sorting already-sorted data is faster. CREATE INDEX is mostly a big
sort operation in the case of B-Tree indexes.

> I might implement a different opclass for the standard UUID to enable
> time-wise index sort order. This will naturally be very close to
> physical order but I doubt that this is something I can tell PostgreSQL, or?

PostgreSQL only knows whether or not your page splits occur in the
rightmost page in the index -- it fills the page differently according
to whether or not that is the case.

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 11:56 AM Justin Pryzby  wrote:
> I think it's because some heap pages are being visited many times, due to the
> index tuples being badly "fragmented".  Note, I'm not talking about
> fragmentation of index *pages*, which is what pgstattuple reports (which
> wouldn't have nearly so detrimental effect).  I could probably say that the
> index tuples are badly "correlated" with the heap.

But this is a unique index, and Scott indicates that the problem seems
to go away for a while following a REINDEX.

> In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash
> affecting its progress reporting, fix to be included in v12.1).

PG v12 will store B-Tree duplicates in heap TID order, so if that's
the problem then upgrading to v12 (and REINDEXing if the upgrade was
performed using pg_upgrade) will fix it for good.
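
For example, on v12+ the rebuild can be done without blocking writes (the
index name below is only a placeholder for the unique index in question):

REINDEX INDEX CONCURRENTLY the_unique_index;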

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 12:32 PM Jeff Janes  wrote:
> Could there be a long-open transaction, which is preventing hint-bits from 
> getting on set on the table rows, as well on the index rows?

Contention on a small number of rows may also be a factor.

> A reindex would not by itself fix the problem if it were the long open 
> transaction.  But  if the long open transaction held a sufficient lock on the 
> table, then the reindex would block until the transaction went away on its 
> own, at which point the problem would go away on its own, so it might 
> **appear** to have fixed the problem.

That seems like the simplest and most likely explanation to me, even
though it isn't particularly simple.

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 12:38 PM Scott Rankin  wrote:
> Definitely no long-running transactions on this table; in fact, this table is 
> pretty infrequently updated – on the order of a few tens of rows updated per 
> day.

But a long running transaction will have an impact on all tables --
not just the tables that happen to have been accessed so far in the
long running transaction. This is necessary because nothing stops the
long running transaction from SELECTing data from any table at any
time -- we need to pessimistically keep around the data required to
make that work.
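
A quick way to check for such a transaction (standard pg_stat_activity
columns, nothing specific to this table):

SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;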

-- 
Peter Geoghegan




Re: Consecutive Query Executions with Increasing Execution Time

2019-12-16 Thread Peter Geoghegan
On Mon, Dec 16, 2019 at 9:28 AM Shijia Wei  wrote:
> 1st Query:

>  Buffers: shared hit=17074 read=16388

> 20th Query:

>  Buffers: shared hit=17037 read=16390

Why do the first and the twentieth executions of the query have almost
identical "buffers shared/read" numbers? That seems odd.

-- 
Peter Geoghegan




Re: Very slow Query compared to Oracle / SQL - Server

2021-05-07 Thread Peter Geoghegan
On Fri, May 7, 2021 at 9:16 AM Justin Pryzby  wrote:
> In pg13, indexes are de-duplicated by default.
>
> But I suspect the performance is better because data was reload, and the
> smaller indexes are a small, additional benefit.

That's a very reasonable interpretation, since the bitmap index scans
themselves just aren't doing that much I/O -- we see that there is
much more I/O for the heap scan, which is likely to be what the
general picture looks like no matter how much bloat there is.

However, I'm not sure if that reasonable interpretation is actually
correct. The nbtinsert.c code that handles deleting LP_DEAD index
tuples no longer relies on having a page-level garbage item flag set
in Postgres 13 -- it just scans the line pointer array for LP_DEAD
items each time. VACUUM has a rather unhelpful tendency to unset the
flag when it shouldn't, which we're no longer affected by. So that's
one possible explanation.

Another possible explanation is that smaller indexes (due to
deduplication) are more likely to get index scans, which leads to
setting the LP_DEAD bit of known-dead index tuples in passing more
often (bitmap index scans won't do the kill_prior_tuple optimization).
There could even be a virtuous circle over time. (Note that the index
deletion stuff in Postgres 14 pretty much makes sure that this
happens, but it is probably at least much more likely in Postgres 13
compared to 12.)

I could easily be very wrong about all of this in this instance,
though, because the behavior I've described is highly non-linear and
therefore highly unpredictable in general (not to mention highly
sensitive to workload characteristics). I'm sure that I've thought
about this stuff way more than any other individual Postgres
contributor, but it's easy to be wrong in any given instance. The real
explanation might be something else entirely. Though it's hard not to
imagine that what really matters here is avoiding all of that bitmap
heap scan I/O.

-- 
Peter Geoghegan




Re: Very slow Query compared to Oracle / SQL - Server

2021-05-07 Thread Peter Geoghegan
On Fri, May 7, 2021 at 2:28 PM Peter Geoghegan  wrote:
> That's a very reasonable interpretation, since the bitmap index scans
> themselves just aren't doing that much I/O -- we see that there is
> much more I/O for the heap scan, which is likely to be what the
> general picture looks like no matter how much bloat there is.
>
> However, I'm not sure if that reasonable interpretation is actually
> correct. The nbtinsert.c code that handles deleting LP_DEAD index
> tuples no longer relies on having a page-level garbage item flag set
> in Postgres 13 -- it just scans the line pointer array for LP_DEAD
> items each time.

BTW, I am pointing all of this out because I've heard informal reports
of big improvements following an upgrade to Postgres 13 that seem
unlikely to be related to the simple fact that indexes are smaller
(most of the time you cannot save that much I/O by shrinking indexes
without affecting when and how TIDs/heap tuples are scanned).

It's necessary to simulate the production workload to have *any* idea
if LP_DEAD index tuple deletion might be a factor. If the OP is just
testing this one query on Postgres 13 in isolation, without anything
bloating up (or cleaning up) indexes, then that doesn't really tell us
anything about how Postgres 13 compares to Postgres 12. As you said,
simply shrinking the indexes is nice, but not enough -- we'd need some
sort of second-order effect to get acceptable performance over time
and under real world conditions.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-21 Thread Peter Geoghegan
On Wed, Jul 21, 2021 at 4:19 PM [email protected]
 wrote:
> As you can see, the V13.3 execution is about 10x slower.
>
> It may be hard for me to create a whole copy of the database on 11.12 and 
> check that environment by itself. I'd want to do it on the same machine to 
> control variables, and I don't have much extra disk space at the moment.

I imagine that this has something to do with the fact that the hash
aggregate spills to disk in Postgres 13.

You might try increasing hash_mem_multiplier from its default of 1.0,
to 2.0 or even 4.0. That way you'd be able to use 2x or 4x more memory
for executor nodes that are based on hashing (hash join and hash
aggregate), without also affecting other kinds of nodes, which are
typically much less sensitive to memory availability. This is very
similar to increasing work_mem, except that it is better targeted.

It might even make sense to *decrease* work_mem and increase
hash_mem_multiplier even further than 4.0. That approach is more
aggressive, though, so I wouldn't use it until it actually proved
necessary.
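
A minimal sketch of that kind of targeted tuning (the numbers are purely
illustrative, not a recommendation for this workload):

SET work_mem = '64MB';
SET hash_mem_multiplier = 4.0;  -- hash agg/hash join may now use up to 4x work_mem
SHOW hash_mem_multiplier;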

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
> it's quite new after all.  Can you extract a self-contained test case that
> behaves this way?

I wonder if this has something to do with the way that the input data
is clustered. I recall noticing that that could significantly alter
the behavior of HashAggs as of Postgres 13.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:21 AM Tom Lane  wrote:
> Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise
> me a bit that we seem to be eating another 20% on top of the nominal
> limit.

MAX_KILOBYTES is the max_val for the work_mem GUC itself, and has been
for many years. The function get_hash_mem() returns a work_mem-style
int that callers refer to as hash_mem -- the convention is that
callers pretend that there is a work_mem style GUC (called hash_mem)
that they must access by calling get_hash_mem().

I don't see how it's possible for get_hash_mem() to be unable to
return a hash_mem value that could be represented by work_mem
directly. MAX_KILOBYTES is an annoyingly low limit on Windows, where
sizeof(long) is 4. But that's nothing new.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:42 AM Tom Lane  wrote:
> Right.  The point here is that before v13, hash aggregation was not
> subject to the work_mem limit, nor any related limit.  If you did an
> aggregation requiring more than 2GB-plus-slop, it would work just fine
> as long as your machine had enough RAM.  Now, the performance sucks and
> there is no knob you can turn to fix it.  That's unacceptable in my book.

Oh! That makes way more sense.

I suspect David's theory about hash_agg_set_limits()'s ngroup limit is
correct. It certainly seems like a good starting point.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:53 AM Peter Geoghegan  wrote:
> I suspect David's theory about hash_agg_set_limits()'s ngroup limit is
> correct. It certainly seems like a good starting point.

I also suspect that if Laurent set work_mem and/or hash_mem_multiplier
*extremely* aggressively, then eventually the hash agg would be
in-memory. And without actually using all that much memory.

I'm not suggesting that that is a sensible resolution to Laurent's
complaint. I'm just pointing out that it's probably not fundamentally
impossible to make the hash agg avoid spilling through tuning these
GUCs. At least I see no evidence of that right now.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:11 AM Tom Lane  wrote:
> No, he already tried, upthread.  The trouble is that he's on a Windows
> machine, so get_hash_mem is quasi-artificially constraining the product
> to 2GB.  And he needs it to be a bit more than that.  Whether the
> constraint is hitting at the ngroups stage or it's related to actual
> memory consumption isn't that relevant.

Somehow I missed that part.

> What I'm wondering about is whether it's worth putting in a solution
> for this issue in isolation, or whether we ought to embark on the
> long-ignored project of getting rid of use of "long" for any
> memory-size-related computations.  There would be no chance of
> back-patching something like the latter into v13, though.

+1. Even if we assume that Windows is a low priority platform, in the
long run it'll be easier to make it more like every other platform.

The use of "long" is inherently suspect to me. It signals that the
programmer wants something wider than "int", even if the standard
doesn't actually require that "long" be wider. This seems to
contradict what we know to be true for Postgres, which is that in
general it's unsafe to assume that long is int64. It's not just
work_mem related calculations. There is also code like logtape.c,
which uses long for block numbers -- that also exposes us to risk on
Windows.

By requiring int64 be used instead of long, we don't actually increase
risk for non-Windows platforms to any significant degree. I'm pretty
sure that "long" means int64 on non-Windows 64-bit platforms anyway.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:33 AM [email protected]
 wrote:
> Damn... I know Windows is a lower priority, and this is yet another issue, 
> but in Healthcare, Windows is so prevalent everywhere...

To be clear, I didn't actually say that. I said that it doesn't matter
either way, as far as addressing this long standing "int64 vs long"
issue goes.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:35 AM Tom Lane  wrote:
> Well, what we really ought to be using is size_t (a/k/a Size), at least
> for memory-space-related calculations.  I don't have an opinion right
> now about what logtape.c ought to use.  I do agree that avoiding "long"
> altogether would be a good ultimate goal.

I assume that we often use "long" in contexts where a signed integer
type is required. Maybe this is not true in the case of the work_mem
style calculations. But I know that it works that way in logtape.c,
where -1 is a sentinel value.

We already use int64 (not size_t) in tuplesort.c for roughly the same
reason: LACKMEM() needs to work with negative values, to handle
certain edge cases.

> In the short term though, the question is whether we want to regard this
> hashagg issue as something we need a fix for in v13/v14.  The fact that
> it's Windows-only makes it slightly less pressing in my mind, but it's
> still a regression that some people are going to hit.

True. I worry about the potential for introducing new bugs on Windows
by backpatching a fix for this. Technically this restriction existed
in every other work_mem consumer on Windows. Of course this won't
matter much to users like Laurent.

-- 
Peter Geoghegan




Re: Better, consistent instrumentation for postgreSQL using a similar API as Oracle

2021-10-04 Thread Peter Geoghegan
On Mon, Oct 4, 2021 at 6:51 PM Mladen Gogala  wrote:
> Haughty lectures about "Oracle has it" not being good enough could
> hardly be more out of place here. To put it as politely as is possible
> in this case, shut your pie hole. What Jeff is asking for is not
> something that "Oracle has", it's something that customers want. That
> was the case few years ago when I was asking for the optimizer hints. I
> was castigated by the former pastry baker turned Postgres guru and my
> reaction was simple: I threw Postgres out of the company that I was a
> working for as the lead DBA. You see, customer is always right, whether
> the database is open source or not. Needless to say, Postgres has
> optimizer hints these days. It still has them in "we do not want" part
> of the Wiki, which is hilarious.

In all sincerity: Chill out. I don't think that this is worth getting
into an argument over. I think that there is a good chance that you'd
have had a much better experience if the conversation had been in
person. Text has a way of losing a lot of important nuance.

I have personally met and enjoyed talking to quite a few people that
personally worked on Oracle, in various capacities -- the world of
database internals experts is not huge. I find Tanel Poder very
interesting, too -- never met the man, but we follow each other on
Twitter. Oracle is a system that has some interesting properties in
general (not just as a counterpoint to PostgreSQL), and I definitely
respect it. Same with SQL Server.

There are lots of smart and accomplished people in the world,
including Jeff. I think that it's worth understanding these
differences in perspective. There is likely to be merit in all of the
specific points made by both Laurenz and Jeff. They may not be
irreconcilable, or anything like it.

-- 
Peter Geoghegan




Re: Better, consistent instrumentation for postgreSQL using a similar API as Oracle

2021-10-04 Thread Peter Geoghegan
On Mon, Oct 4, 2021 at 9:04 PM Mladen Gogala  wrote:
> What angered me was the presumptuous tone of voice directed to an Oracle
> legend. I have probably talked to many more Oracle people than you,
> including Tanel, whom I have met personally. I am not on Twitter,
> unfortunately I am older than 20. Before you ask, I am not on Instagram,
> Facebook or Tiktok. I am not on OnlyFans either. I have never understood
> the need to share one's every thought in real time. Being rather private
> person has probably stymied my career of an internet influencer. I'll
> never rival Kim Kardashian.

You do seem shy.

> As for Jeff Holt, I believe that a person of his stature needs to be
> taken seriously and not lectured "how are things done in Postgres
> community".

I haven't met Jeff Holt either, but I believe that he is also older
than 20. I have to imagine that he doesn't particularly need you to
defend his honor.

> I  am rather confused by the thinly veiled hostility toward
> Oracle. In my opinion, Postgres community should be rather welcoming to
> Oracle people like Frits Hoogland, Frank Pachot or Jeff Holt. After all,
> we are using Postgres and telling us "you can't have what you used to
> get from Oracle" is not either encouraging or smart.

I agree with all that. I am also friendly with Frank, as it happens.

I think that Laurenz was just trying to establish common terms of
reference for discussion -- it's easy for two groups of people with
similar but different terminology to talk past each other. For
example, I think that there may be confusion about what is possible
with a tool like eBPF today, and what might be possible in an ideal
world.

-- 
Peter Geoghegan




Re: Better, consistent instrumentation for postgreSQL using a similar API as Oracle

2021-10-05 Thread Peter Geoghegan
On Fri, Oct 1, 2021 at 1:06 PM Jeff Holt  wrote:
> Now looking closely at postgreSQL, I see an opportunity to more quickly 
> implement Oracle's current feature list.
>
> I've come to this point because I see many roadblocks for users who want to 
> see a detailed "receipt" for their response time.

I have heard of method R. Offhand it seems roughly comparable to
something like the Top-down Microarchitecture Analysis Method that low
level systems programmers sometimes use, along with Intel's pmu-tools
-- at least at a very high level. The point seems to be to provide a
workflow that can plausibly zero in on low-level bottlenecks, by
providing high level context. Many tricky real world problems are in
some sense a high level problem that is disguised as a low level
problem. And so all of the pieces need to be present on the board, so
to speak.

Does that sound accurate?

One obvious issue with much of the Postgres instrumentation is that it
makes it hard to see how things change over time. I think that that is
often *way* more informative than static snapshots.
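
As a rough sketch of what I mean (assuming pg_stat_statements is installed;
column names as of Postgres 13):

-- snapshot the cumulative counters periodically, so that deltas between
-- snapshots show how activity changes over time
CREATE TABLE IF NOT EXISTS stmt_snapshot (
    captured_at timestamptz NOT NULL DEFAULT now(),
    queryid bigint,
    calls bigint,
    total_exec_time double precision
);

INSERT INTO stmt_snapshot (queryid, calls, total_exec_time)
SELECT queryid, calls, total_exec_time FROM pg_stat_statements;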

I can see why you'd emphasize the need for PostgreSQL to more or less
own the end to end experience for something like this. It doesn't
necessarily follow that the underlying implementation cannot make use
of infrastructure like eBPF, though. Fast user space probes provably
have no overhead, and can be compiled-in by distros that can support
it. There hasn't been a consistent effort to make that stuff
available, but I doubt that that tells us much about what is possible.
The probes that we have today are somewhat of a grab-bag, that aren't
particularly useful -- so it's a chicken-and-egg thing.

It would probably be helpful if you could describe what you feel is
missing in more general terms -- while perhaps giving specific
practical examples of specific scenarios that give us some sense of
what the strengths of the model are. ISTM that it's not so much a lack
of automation in PostgreSQL. It's more like a lack of a generalized
model, which includes automation, but also some high level top-down
theory.

-- 
Peter Geoghegan




Re: Lock contention high

2021-10-13 Thread Peter Geoghegan
On Tue, Oct 12, 2021 at 12:45 AM Ashkil Dighin  wrote:
> Lock contention observed high in PostgreSQLv13.3
> The source code compiled with GNC(GCCv11.x)
> PostgreSQL version: 13.3
> Operating system:   RHEL8.3
> Kernel name:4.18.0-305.10.2.el8_4.x86_64
> RAM Size:512GB
> SSD: 1TB
> The environment used IBM metal and test benchmark environment HammerDbv4.2
> Test case :TPC-C

You didn't say how many TPC-C warehouses you used. In my experience,
people sometimes run TPC-C with relatively few, which will tend to
result in extreme contention on certain B-Tree leaf pages. (My
experiences are with BenchmarkSQL, but I can't imagine HammerDB is too
much different.)

Assuming that's the case here, for you, then it's not clear that you
have a real problem. You're really not supposed to run the benchmark
in that way, per the TPC-C spec, which strictly limits the number of
transactions per minute per warehouse -- for better or worse, valid
results generally require that you use lots of warehouses to get a
very large database (think terabytes). If you run the benchmark with
100 warehouses or less, on a big server, then the contention you'll
see will be out of all proportion to what you're ever likely to see in
the real world.

-- 
Peter Geoghegan




Re: Lock contention high

2021-10-13 Thread Peter Geoghegan
On Wed, Oct 13, 2021 at 6:54 PM Jeremy Schneider
 wrote:
> only a half GB memory for autovac? (it will have a mandatory run as soon
> as you hit 200 mil XIDs, seems like you'd want the full max 1GB for it)

While anti-wraparound vacuums will become a problem for TPC-C (unless
you tune for it), it's not too sensitive to mwm. You just don't end up
accumulating too many TIDs to delete from indexes in practice, even
though the overhead from VACUUM is a concern. The new autovacuum
instrumentation in Postgres 14 makes this far clearer.

-- 
Peter Geoghegan




Advice needed: query performance deteriorates by 2000% within 1 minute

2022-02-24 Thread Peter Adlersburg
t,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.107 ms
 Execution Time: 0.258 ms


*
*** the first 'bad' run (one minute later ***
*

 LOG: | 2022-02-24 13:48:01.840362+01 |  9653

 LIMIT 10:

 Limit  (cost=0.00..804.97 rows=10 width=22) (actual
time=23970.845..25588.432 rows=1 loops=1)
   ->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual
time=23970.843..25588.429 rows=1 loops=1)
 Filter: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Rows Removed by Filter: 9652
 Planning Time: 0.430 ms
 Execution Time: 25588.448 ms

 LIMIT 100:

 Limit  (cost=788.37..965.63 rows=48 width=22) (actual time=0.900..0.902
rows=1 loops=1)
   ->  Bitmap Heap Scan on "order"  (cost=788.37..965.63 rows=48 width=22)
(actual time=0.900..0.901 rows=1 loops=1)
 Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Heap Blocks: exact=1
 ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..788.36
rows=48 width=0) (actual time=0.894..0.895 rows=1 loops=1)
   Index Cond: (jsonb_to_tsvector('english'::regconfig,
content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.187 ms
 Execution Time: 0.919 ms

 NO LIMIT:

 Bitmap Heap Scan on "order"  (cost=788.37..965.63 rows=48 width=22)
(actual time=0.442..0.442 rows=1 loops=1)
   Recheck Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..788.36
rows=48 width=0) (actual time=0.438..0.438 rows=1 loops=1)
 Index Cond: (jsonb_to_tsvector('english'::regconfig, content,
'["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
 Planning Time: 0.151 ms
 Execution Time: 0.453 ms




The table in question isn't that big:

  oid   | table_schema | table_name | row_estimate | total  | index  | toast | table
--------+--------------+------------+--------------+--------+--------+-------+-------
 155544 | orderstore   | order      |         9649 | 210 MB | 108 MB | 91 MB | 10 MB



Table DDL:

CREATE TABLE orderstore."order" (
pk_id bigint DEFAULT nextval('orderstore.order_pk_id_seq'::regclass)
NOT NULL,
version integer NOT NULL,
content jsonb NOT NULL,
manipulation_history jsonb NOT NULL,
CONSTRAINT chk_external_id_not_null CHECK (((content ->>
'externalId'::text) IS NOT NULL)),
CONSTRAINT chk_id_not_null CHECK (((content ->> 'id'::text) IS NOT
NULL))
);

DDL of the index used (one amongst many others that exist):

--
-- Name: idx_fulltext_content; Type: INDEX; Schema: orderstore; Owner:
orderstore
--

CREATE INDEX idx_fulltext_content ON orderstore."order" USING gin
(jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb));


The record in pg_stat_all_tables before the manual vacuum:

relid   | 155544
schemaname  | orderstore
relname | order
seq_scan| 249
seq_tup_read| 2209150
idx_scan| 24696
idx_tup_fetch   | 1155483
n_tup_ins   | 87
n_tup_upd   | 1404
n_tup_del   | 0
n_tup_hot_upd   | 0
n_live_tup  | 9653
n_dead_tup  | 87
n_mod_since_analyze | 152
n_ins_since_vacuum  | 4
last_vacuum | 2022-02-24 10:44:34.524241+01
last_autovacuum |
last_analyze| 2022-02-24 03:20:05.79219+01
last_autoanalyze|
vacuum_count| 3
autovacuum_count| 0
analyze_count   | 8
autoanalyze_count   | 0

The entry in pg_stat_all_tables after the manual vacuum:

relid   | 155544
schemaname  | orderstore
relname | order
seq_scan| 249
seq_tup_read| 2209150
idx_scan| 24753
idx_tup_fetch   | 1155561
n_tup_ins   | 87
n_tup_upd   | 1404
n_tup_del   | 0
n_tup_hot_upd   | 0
n_live_tup  | 9476
n_dead_tup  | 0
n_mod_since_analyze | 152
n_ins_since_vacuum  | 0
last_vacuum | 2022-02-24 14:32:16.083692+01
last_autovacuum |
last_analyze| 2022-02-24 03:20:05.79219+01
last_autoanalyze|
vacuum_count| 4
autovacuum_count| 0
analyze_count   | 8
autoanalyze_count   | 0


Can someone provide any hints on how to deal with this issue? What am I
missing?

In case you need additional informations pls let me know.


kind regards,

peter


Re: Advice needed: query performance deteriorates by 2000% within 1 minute

2022-02-28 Thread Peter Adlersburg
Hello,

Michael, Tom: thanks for all the insights and informations in your previous
mails.

A quick update of the explain outputs (this time using explain (analyze,
buffers, verbose))

*The good: *

*LOG Time: | 2022-02-28 09:30:01.400777+01 | order rows: |  9668*



 Limit  (cost=616.37..653.30 rows=10 width=22) (actual time=1.062..1.063 rows=1 loops=1)
   Output: version, content
   Buffers: shared hit=154
   ->  Bitmap Heap Scan on orderstore."order"  (cost=616.37..793.63 rows=48 width=22) (actual time=1.061..1.062 rows=1 loops=1)
         Output: version, content
         Recheck Cond: (jsonb_to_tsvector('english'::regconfig, "order".content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
         Heap Blocks: exact=1
         Buffers: shared hit=154
         ->  Bitmap Index Scan on idx_fulltext_content  (cost=0.00..616.36 rows=48 width=0) (actual time=1.053..1.053 rows=1 loops=1)
               Index Cond: (jsonb_to_tsvector('english'::regconfig, "order".content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
               Buffers: shared hit=153
 Planning:
   Buffers: shared hit=50
 Planning Time: 0.408 ms
*Execution Time: 1.079 ms*

*pg_stat_all_tables: *


n_tup_ins   | 102
*n_tup_upd   | 1554*
n_tup_del   | 0
n_tup_hot_upd   | 0
n_live_tup  | 9668
*n_dead_tup  | 69*
n_mod_since_analyze | 61
n_ins_since_vacuum  | 8
last_vacuum | 2022-02-25 07:54:46.196508+01
last_autovacuum |
last_analyze| 2022-02-28 03:20:38.761482+01
last_autoanalyze|


*The bad: *


*LOG Time: | 2022-02-28 09:45:01.662702+01 | order rows: |  9668*



LIMIT 10:



Limit  (cost=0.00..805.63 rows=10 width=22) (actual time=24175.964..25829.767 rows=1 loops=1)
   Output: version, content
   Buffers: shared hit=26284 read=12550 dirtied=4
   ->  Seq Scan on orderstore."order"  (cost=0.00..3867.01 rows=48 width=22) (actual time=24175.962..25829.763 rows=1 loops=1)
         Output: version, content
         Filter: (jsonb_to_tsvector('english'::regconfig, "order".content, '["all"]'::jsonb) @@ '''1.20709841'''::tsquery)
         Rows Removed by Filter: 9667
         Buffers: shared hit=26284 read=12550 dirtied=4
Planning:
   Buffers: shared hit=50
Planning Time: 0.377 ms

*Execution Time: 25829.778 ms*

*pg_stat_all_tables:*

n_tup_ins   | 102
*n_tup_upd   | 1585*
n_tup_del   | 0
n_tup_hot_upd   | 0
n_live_tup  | 9668
*n_dead_tup  | 100*
n_mod_since_analyze | 92
n_ins_since_vacuum  | 8
last_vacuum | 2022-02-25 07:54:46.196508+01
last_autovacuum |
last_analyze| 2022-02-28 03:20:38.761482+01
last_autoanalyze|


*The ugly:*


It should be mentioned that the table in question mainly lives in toast
land (but I have no idea if this also influences the query planner):


  oid   | table_schema | table_name | row_estimate | total_bytes | index_bytes | toast_bytes | table_bytes | total  | index  | toast | table
--------+--------------+------------+--------------+-------------+-------------+-------------+-------------+--------+--------+-------+-------
 155544 | orderstore   | order      |         9570 |   229826560 |   120184832 |    98557952 |    11083776 | 219 MB | 115 MB | 94 MB | 11 MB


Since tinkering with the text search functions is out of the question we
came up with three possibilities on how to deal with this issue:

- significantly increase the limit clause or omit it at all (meh ...)
- use 'set random_page_cost = 0.5' in the transaction in order to convince
the query planner to prefer the index (tested and works; see the sketch just
after this list)
- schedule an hourly vacuum job for  the table (the most likely solution we
will settle on since it comes with the least implementation effort)
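
A minimal sketch of the second option, using SET LOCAL so that the setting
only lasts for that one transaction (the query text is reconstructed from the
plans above, so the exact original statement may differ):

BEGIN;
SET LOCAL random_page_cost = 0.5;  -- only affects this transaction
SELECT version, content
FROM orderstore."order"
WHERE jsonb_to_tsvector('english'::regconfig, content, '["all"]'::jsonb)
      @@ '''1.20709841'''::tsquery
LIMIT 10;
COMMIT;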

None of these seems very elegant or viable in the long run ... we'll see.

Ah, yes: our global settings for random_page_cost and autovacuum/analyze
are set to the defaults.

Will JSON processing see improvements in pg14/15? We are about to upgrade
to 14 in the near future; our devs say this topic is the main trigger for
doing so.

Any further thoughts on the case are very much appreciated.

kr p.


Am Do., 24. Feb. 2022 um 17:10 Uhr schrieb Tom Lane :

> Peter Adlersburg  writes:
> >  Limit  (cost=0.00..804.97 rows=10 width=22) (actual
> > time=23970.845..25588.432 rows=1 loops=1)
> >->  Seq Scan on "order"  (cost=0.00..3863.86 rows=48 width=22) (actual
> > time=23970.843..25588.429 rows=1 loops=1)
> >  Filter: (jsonb_to_tsvector('english'::regconfig, content,
> > '["all"]'::jsonb) @@ '''1.20709841'''::tsquery

Re: High System CPU Usage on Selects Seemingly Caused By Vacuum of Same Table

2025-01-31 Thread Peter Geoghegan
On Fri, Jan 31, 2025 at 5:28 PM Joshua Banton  wrote:
> The issue mostly manifests near the end of the "scanning heap" phase of 
> vacuuming of one of our largest tables, we'll call table1. RDS Performance 
> Insights reports that selects on table1 start to wait on cpu, where 
> previously it didn't even show up in the top 25 queries by wait. It doesn't 
> always happen, but if there is a larger than usual number of selects on 
> table1 it is more likely to happen.

Does this database also have many tables? As in thousands of tables?

I am reminded of this issue:

https://www.postgresql.org/message-id/flat/da3205c4-5b07-a65c-6c26-a293c6464fdb%40postgrespro.ru

I've heard of this happening when an aggressive VACUUM updates
relfrozenxid on a larger table.
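
This isn't a fix, but one cheap way to correlate the slowdowns with
aggressive vacuums is to keep an eye on the frozen-XID ages, e.g.:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;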

-- 
Peter Geoghegan




Re: Any risk or overhead considerations for frequently executing queries against catalog tables?

2025-01-25 Thread peter plachta
DataDog — which implements such metrics for Postgres - has run into multiple
issues doing this type of thing. You may be able to search their bugs / repo
to see what they were. I just can't remember them off hand, it's been a while.

Sent from my iPhone

On Jan 25, 2025, at 12:01 PM, Frits Hoogland  wrote:

I am looking at whether sampling key database catalog information per second
would have any drawback whatsoever. I think you're saying that you think that
isn't the case, except maybe for pg_database, and I figure that is because of
the frozen and multi xact fields per database.

If the database client application is too unpredictable to know what SQL it
will produce, then having runtime data available at that granularity, so that
it can reasonably be reconstructed what is going on, is very convenient and
allows tremendous insight. It would also allow usage of the wait events to
spot any weird behavior, such as short-lived peaks (pg_stat_statements can do
that on a busy database, for example).

And if there is no known drawback, if such a low interval can be organized:
why not? I am not saying you are doing it wrong; this is about trying to
figure out what the borders are of what would be technically possible without
unreasonably affecting the database, a thought experiment. Of course the
gathered data needs to be organized so that you don't swamp in it, and it
shouldn't lead to the monitoring data swamping the system, either in memory
or on disk, but that is a given.

Why would per second be too much for locks? Is there overhead to select from
pg_locks, or pg_blocking_pids()?

Again, please realise I am happy and appreciative of the time you take, I am
toying with the above described idea.

Frits Hoogland


On 25 Jan 2025, at 19:18, Pavel Stehule  wrote:

Hi

On Sat, 25 Jan 2025 at 18:00, Frits Hoogland  wrote:

Thank you Pavel, that is really useful. I can imagine other people thinking
about getting fine grained data from postgres might wonder the same as I do
about this. And really, from a computer's perspective, I would say that once
a second isn't really a high frequency?

I usually work with minute sampling and usually it is good enough (statistics
are cumulative, so you can lose the timestamp, but you never lose data).
Only when we try to investigate some special case, then I use second
sampling. When you investigate lock issues, then seconds are too much.

Regards

Pavel

If I time the amount of time that these queries take, it's around 20ms (local
connection), so there is a relatively long time in which all the objects,
including pg_database, are not actively queried. I git grepped the source
code; it seems that there is a RowExclusive lock for pg_database manipulation
in case of addition, removal and change of a database in dbcommands.c, but I
do think your reasoning is based on the columns datfrozenxid and datminmxid?
There is a lock for updating the frozenxid and mxid for a database
(vacuum.c:LockDatabaseFrozenIds, ExclusiveLock), but it seems a select should
play nice with that? Btw, it's interesting to see that both datfrozenxid and
datminmxid are updated in place, with no read consistency provided.

Frits Hoogland


On 25 Jan 2025, at 14:32, Pavel Stehule  wrote:

Hi

On Sat, 25 Jan 2025 at 12:23, Frits Hoogland  wrote:

For monitoring database behaviour and trying to build a history of activity,
if I would create an application that creates a single connection and
executes something like:

select * from pg_stat_activity;
select * from pg_stat_database;
select * from pg_stat_bgwriter;
select * from pg_stat_wal;
select * from pg_settings;
select * from pg_database;

for which the query is prepared, and execute that every 1 second, would there
be any realistic danger or overhead that should be considered? My thinking is
that the data for these catalogs is all in shared memory and, when executed
serially, does not cause any significant resources to be taken?

The queries to all tables excluding pg_database every 1 sec will have
probably zero impact to performance. I am not sure about pg_database - it is
a very important table, and your query can block operations that need an
exclusive lock on this table. So theoretically, there can be some impact to
performance.

Regards

Pavel

Thanks,

Frits Hoogland







Re: Read-only connections optimizations

2025-01-25 Thread peter plachta
You can still block vacuum from running if you have long-running (or very
aggressive) read transactions. I don't think read-only connections are
particularly helpful or more performant from a Postgres engine perspective.
They can be helpful in application development, because mutations attempted
inside a read-only transaction will fail (from what I recall).

Sent from my iPhone

> On Jan 25, 2025, at 10:01 AM, Laurenz Albe  wrote:
> 
> On Sat, 2025-01-25 at 14:55 +, Edson Richter wrote:
>> -Connections are established using the jdbc "readonly" attribute.
>> 
>> Does PostgreSQL perform any optimization on queries in this scenario to avoid
>> establishing locks? Or are these queries treated like any other?
> 
> The only difference that I am aware of is that read-only transactions at the
> SERIALIZABLE isolation level can release predicate locks earlier, which can
> benefit performance.
> 
> But I don't think that you need to worry: reading transactions only take an
> ACCESS SHARE lock on tables, which won't conflict with data modifications.
> 
> Yours,
> Laurenz Albe
> 
> 




Re: Efficient pagination using multi-column cursors

2025-02-26 Thread Peter Geoghegan
On Wed, Feb 26, 2025 at 9:29 AM  wrote:
> Without being familiar the internals of the query planner, I *think* there 
> *should* be a way to come up with WHERE conditions that results in the 
> "perfect" plan.

There is a fundamental trade-off involved here. The simple, fast
"WHERE (col_1, col_2, col_3) > (10, 20, 29)" query returns whatever
tuples are stored immediately after "(10, 20, 29)" in the index.
Naturally, they're returned in index order, which is usually the most
useful order (simple ASC order or simple DESC order for all columns).

The B-Tree code can physically traverse your mixed-ASC-and-DESC order
index in almost the same way. But it is much less useful, since the
matching index tuples won't be physically located together as exactly
one contiguous group of tuples. And so (with your "handwritten" row
comparison) you get a filter qual that filters out non-matching tuples
using lower-order index columns. The index scan actually just returns
"Index Cond: (col_1 >= 10)" (which still returns a contiguous group of
index tuples), while a filter condition excludes those tuples returned
by the index scan node that don't satisfy the later/lower-order column
condition.
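
To make that concrete, a small sketch (the table and column names here just
follow the example used elsewhere in this thread):

-- with an all-ASC index, the row compare below selects one contiguous
-- range of index tuples, so no lower-order-column filtering is needed
CREATE INDEX data_idx_asc ON data (col_1, col_2, col_3);

SELECT col_1, col_2, col_3
FROM data
WHERE (col_1, col_2, col_3) > (10, 20, 29)
ORDER BY col_1, col_2, col_3
LIMIT 100;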

The book "Relational Database Index Design and the Optimizers"
proposes a vocabulary for the trade-offs in this area -- the 3 star
model. When creating the best possible index for certain queries it is
sometimes inherently necessary to choose between what it calls the
first star (which means avoiding a sort) and the second star (which
means having the thinnest possible "row slice"). Sometimes those
things are in tension, which makes sense when you imagine how the
index must be physically traversed.

-- 
Peter Geoghegan




Re: Efficient pagination using multi-column cursors

2025-02-26 Thread Peter Geoghegan
On Wed, Feb 26, 2025 at 10:40 AM  wrote:
> My understanding is that given this "mixed order" index:
> CREATE INDEX data_index_desc ON data (col_1, col_2 DESC, col_3);
>
> The index tuples are physically organized exactly in this way:
> ORDER BY col_1, col_2 DESC, col_3
>
> So that I should be able to write a query that reads a continuous range from 
> this index without filtering.

Yes, you can. For example, if you use the predicate "col_1 >= 10", it
can work in this way, even with a mixed-asc-and-desc order multicolumn
index -- without any filtering. Same thing if you don't have any
predicate at all -- there's no lower-order columns to filter on
because there's no columns to filter on at all.

The actual predicate that you're interested in isn't like that. It
cannot use an index that both returns rows in the mixed-ASC-and-DESC
order that you want (and so terminate early with a LIMIT and whatnot),
while at the same time accessing all the tuples as exactly one
contiguous group. You have to pick which is more important. It sounds
very much like having the most useful sort order is important.

> Does this mean that it is not possible to come up with a plan that has the 
> same performance as "WHERE (col_1, col_2, col_3) > (10, 20, 29)" using 
> "handwritten" filters, or only for "mixed order"? Or not a theoretical 
> limitation but a limitation of the current implementation of the query 
> planner?

Perhaps the query planner should be taught to rewrite the query in
such a way as to make it unnecessary for you to do so -- I think that
that's what MySQL is doing for you. That is beside the point.

Again, think about how things are physically laid out in an index
which mixes ASC and DESC order. It is inevitable that the scan has to
traverse over non-matching tuples in order to read all of the matching
tuples (or to read a given number of matching tuples). This has
nothing to do with the query planner.

> Aka. "Good, Fast, Cheap — Pick Any Two" ;)

It's not like that. Often it just isn't necessary to pick any 2 -- you
can have all 3, because the requirements of the query allow it. (Plus
it would never make sense to pick the first and second stars over the
third.)

-- 
Peter Geoghegan




Re: Poor performance with row wise comparisons

2025-10-17 Thread Peter Geoghegan
On Fri, Feb 7, 2025 at 2:05 AM Jon Emord  wrote:
> My expectation is that the following two queries would have roughly the same 
> performance.
> They both use the same index only scans and return the same 100 rows of data.
> The main difference I see in the explain output is that the row wise 
> comparison has 3,000 times
> the shared buffer hits, but it is unclear why there would need to be more 
> hits.

I agree that this doesn't make sense.

The problem here is that the row compare condition that terminates the
scan (namely "(ROW(data_model_id, primary_key) <= ROW(123,
'DEF'::text))") was unable to recognize that we've reached the end of
all matching tuples upon reaching the first tuple that's > "(123,
'DEF')". The scan would only terminate upon reaching the first tuple
whose data_model_id was > 123. Which (in this particular case) meant
that the scan read far more index leaf pages than necessary. Note that
this wouldn't have mattered very much if there weren't so many
irrelevant tuples that were "data_model_id = 123 AND > '(123, 'DEF')'"
-- but there were.

I fixed this problem in passing, in the context of a bug fix that went
into Postgres 18 (see commit bd3f59fd, in particular the part about
marking lower-order subkeys as required to continue the scan,
described towards the end of the commit message). You should look into
upgrading to Postgres 18 if this issue is important to you.

-- 
Peter Geoghegan




Re: Index Searches higher than expected for skip scan

2025-11-06 Thread Peter Geoghegan
On Thu, Nov 6, 2025 at 2:01 PM Michael Christofides
 wrote:
> I'm trying to understand the new Index Searches field in Postgres 18 explain 
> analyze output. I've put together a super simple test case (below) expecting 
> a skip scan with 2 Index Searches, one for each value in the leading 
> (boolean) column of the index.

That's a good question. You're right that in this specific case, with
a boolean column, skip scan performs 4 index searches, when in
principle it only really needs to do 3.

> In reality, instead of 2 Index Searches, I got 4 (query plan below).

During work on skip scan (and on the 17 work), I went to quite a lot
of effort to build custom instrumentation that would show me exactly
what an index scan was doing. Including details of the scan's array
keys, and how they advance as the scan advances through the index.

Attached is its output when I run your test query. The issue here is
that skip scan thinks that there are 4 distinct skip array values that
it must use:

1. SK_BT_MINVAL
2. false
3. true
4. SK_ISNULL

SK_BT_MINVAL is a sentinel value that represents the lowest possible
value that can be stored by the data type. That's the same thing as
"false", which is a little bit unfortunate with a datatype like bool,
where the difference might actually matter -- here we waste an access
to the leftmost leaf page to "advance" the skip array from
SK_BT_MINVAL to false, instead of making false the lowest skip array
value directly, from the very start.

This is highly unlikely to matter with a data type like integer,
though: in practice it's very unlikely that the value INT_MIN is
actually stored in the index, so having 2 distinct representations for
the same thing (SK_BT_MINVAL and INT_MIN) is equally unlikely to
result in the scan reading any more leaf pages than strictly necessary
due to this implementation deficiency. We'll have to go to the
leftmost leaf page to determine what the real lowest value in the
index is either way -- doesn't matter if you start from SK_BT_MINVAL
or from INT_MIN (unless there really are hundreds of tuples that have
the value INT_MIN, when the difference between those 2 things starts
to matter, as with your bool test case).

I did actually think about making this work. In fact, there were
revisions of the skip scan patch where it actually did work.
Ultimately I judged that it wasn't worth the trouble of introducing
this special case. Better to have types with skip support (like
boolean and integer) behave exactly the same as all other types, by
also using these SK_BT_MINVAL/SK_BT_MAXVAL sentinels (we don't use
SK_BT_MAXVAL at all here because the highest skip array value is
SK_ISNULL, but we would if the index was declared NULLS FIRST).

That just leaves SK_ISNULL. We cannot assume that the index doesn't
have any NULLs (unless the query uses IS NOT NULL directly). We might
end up having to read the rightmost leaf page anyway, in which case
there won't be an extra search for SK_ISNULL, but we'll likely need an
extra search for this too (just like the leftmost leaf page, with
SK_BT_MINVAL). This isn't an implementation deficiency -- it's
strictly necessary for correctness.

-- 
Peter Geoghegan
pg@regression:5432 [1116003]=# SELECT boolean_field FROM example WHERE 
integer_field = 5432;
LOG:
👾  btbeginscan to begin scan of index "bool_int_idx" in worker -1
♻️  btrescan
btrescan: BTScanPosInvalidate() called for markPos
_bt_preprocess_keys:  scan->keyData[0]: [ strategy: = , attno: 
2/"integer_field", func: int84eq, flags: [] ]
_bt_preprocess_keys:so->keyData[0]: [ strategy: = , attno: 
1/"boolean_field", func: booleq, flags: [SK_SEARCHARRAY, SK_BT_REQFWD, 
SK_BT_REQBKWD, SK_BT_SKIP] ]
so->keyData[1]: [ strategy: = , attno: 
2/"integer_field", func: int84eq, flags: [SK_BT_REQFWD, SK_BT_REQBKWD] ]
_bt_preprocess_keys: scan->numberOfKeys is 1, so->numberOfKeys on output is 2, 
so->numArrayKeys on output is 1

➕ ➕ ➕
_bt_first: sk could not be formed, so descending to leftmost leaf page in whole 
index
_bt_readpage: 🍀  1 with 122 offsets/tuples (leftsib 0, rightsib 2) ➡️
 _bt_readpage first: (boolean_field, integer_field)=(f, 0), TID='(171,68)', 
0x7fffaeaa9fc8, from non-pivot offnum 2 started page
 _bt_readpage pstate.startikey: 0, with 2 scan keys
  _bt_checkkeys: comparing (boolean_field, integer_field)=(f, 0) with TID 
(171,68), 0x7fffaeaa9fc8
_bt_advance_array_keys, sktrig (required): 0, tuple: (boolean_field, 
integer_field)=(f, 0), 0x7fffaeaa9fc8

  - sk: 0, sk_attno: 1, cur_elem:   -1, num_elems:   -1, val: ? 
SK_BT_MINVAL<--

  + sk: 0, sk_attno: 1, cur_elem:   -1, num_elems:   -1, val: f

 _bt_readpage final: (boolean_field, integer_field)=(f, 0), TID='(171,68)', 
0x7fffaeaa9fc8, from non-pivot offnum 2 set so-&

Re: Index Searches higher than expected for skip scan

2025-11-06 Thread Peter Geoghegan
On Thu, Nov 6, 2025 at 2:54 PM Peter Geoghegan  wrote:
> That just leaves SK_ISNULL. We cannot assume that the index doesn't
> have any NULLs (unless the query uses IS NOT NULL directly).

Actually, that won't work here. Because the optimizer recognizes that
the leading boolean column isn't nullable, and "helpfully"
optimizes-away the IS NOT NULL qualification. But if the column *was*
nullable, adding IS NOT NULL would cut the number of index searches by
1.

> We might end up having to read the rightmost leaf page anyway, in which case
> there won't be an extra search for SK_ISNULL, but we'll likely need an
> extra search for this too (just like the leftmost leaf page, with
> SK_BT_MINVAL).

I said the wrong thing here. In fact, it's very likely that we will
have to read the rightmost leaf page with a query that has no qual on
the leading column, no matter the datatype. Here's why:

Suppose we only store tuples with the values 1 - 100 in the leading
column "a", for a query "WHERE b = 5432", with an index on (a, b).
Once skip scan reaches the point where it returns "(a, b) = (100,
5432)" to the scan (or once it sees that there's no such match in the
index), it'll guess that the next "a" value is 101. We'll then try and
fail to find a match "(a, b) = (101, 5432)".
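
That scenario is easy enough to reproduce, for example (a hypothetical table,
just to make it concrete; whether the planner actually picks a skip scan
here depends on the statistics and costs):

CREATE TABLE t (a int NOT NULL, b int NOT NULL);
INSERT INTO t SELECT (i % 100) + 1, i FROM generate_series(1, 100000) AS i;
CREATE INDEX t_a_b_idx ON t (a, b);
VACUUM ANALYZE t;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE b = 5432;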

If the column "a" has no NULLs, then the scan will already be at the
rightmost position of its rightmost leaf page -- so we're done then
and there. Notice that there hasn't been an extra search for SK_ISNULL
here. Because we visited the leaf page where NULLs would have been,
had there been any, and found that there was none at all (which is the
common case, contrary to what I said earlier).

If, on the other hand, the column "a" has many NULLs, then this failed
attempt to find "(a, b) = (101, 5432)" will advance the array on "a"
from 101 to SK_ISNULL, and then perform another search, this time for
"(a, b) = (NULL, 5432)". Again, notice that this also isn't wasteful
-- we simply have no better way to reliably determine that there is no
non-NULL value after 100 in this index.

It's perhaps worth noting that your original boolean example shows
that skip scan *is* directly aware that the next value after false is
SK_ISNULL -- even though (as I said in my first email from earlier) it
cannot do the similar trick of knowing that the lowest value really
should be false (this is knowable before the first scan/search even
begins). There are very obscure performance reasons for this
inconsistency, though it might seem a bit arbitrary.

-- 
Peter Geoghegan




Re: Multicolumn index scan efficiency

2025-11-09 Thread Peter Geoghegan
On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn  wrote:
> I am wondering about 2 things:
> 1) Does anyone know which specific change / version made it fast?
> 2) What was the proper way to do a range index scan like WHERE (a,b,c) 
> between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> Note that my tests can mostly be rewritten as equality at least for some 
> columns (and this is what we'll do), but sometimes we do need a range scan 
> like above, so understanding it would be important. Also I am curious :).

This improvement you're seeing here is down to work in commit
bd3f59fd. The short version is that the way we used to decide when a
condition like "WHERE (a,b,c) <= (x2,y2,z2)" ends the scan was needlessly
conservative. If there were many "a" values equal to x2, we'd have to
scan the index until we got to the next distinct/non-equal "a" value
-- without realizing that we're already past the point where there
cannot possibly be any more matches.

See the discussion on this thread which complained about the problem,
particularly my response to the complaint:

https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69

--
Peter Geoghegan




Re: Multicolumn index scan efficiency

2025-11-10 Thread Peter Geoghegan
On Mon, Nov 10, 2025 at 12:12 AM Vitalii Tymchyshyn  wrote:
> Thank you so much for both clarifying and fixing it!

FWIW the problem is limited to row compares/row constructor
comparisons that are used to decide when to end the scan. Note in
particular that row compares that decide where in the index (what leaf
page) the scan should *begin* from were never affected -- only those
that determine where the scan should end. In other words, for a
forwards scan, > and >= row compares aren't affected (but < and <= row
compares are). For backwards scans/with ORDER BY a DESC, b DESC, it's
exactly the other way around (it's > and >= row compares that'll end
the scan/that had this problem).

My guess is that this issue wasn't noticed sooner because in practice
a lot of users of row compares only use them to determine where each
scan begins from, in the context of applying row compares to implement
keyset pagination [1]. I think that it's typical to use an ORDER BY
... LIMIT, or a FETCH FIRST ... ROWS WITH TIES to limit the size of
the result set on each individual query. It was a nasty and surprising
issue, but it didn't actually come up all that often.
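
The typical keyset pagination query looks something like this (generic names,
not from this thread; the literals stand in for the last row already returned
to the client):

SELECT a, b
FROM t
WHERE (a, b) > (42, 7)   -- the row compare only decides where the scan begins
ORDER BY a, b
LIMIT 50;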

After all, if you use a < or a <= condition to end each scan, the
total number of rows that'll be returned each time is unpredictable --
and potentially very large. That isn't generally desirable with keyset
pagination; what users usually do is have Postgres return a more or
less uniform number of rows for each individual query that fetches the
next portion of the "total result set". That's kinda the natural way
to do it.

[1] 
https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
-- 
Peter Geoghegan




Re: Index Searches higher than expected for skip scan

2025-11-07 Thread Peter Geoghegan
On Fri, Nov 7, 2025 at 6:16 AM Michael Christofides
 wrote:
> Thank you for the incredibly helpful (and fast) replies Peter.

You're welcome.

> Nice idea. Once it sunk in, I realised I could try the explicit "AND 
> boolean_field IN (true, false)" and got it down to 2 index searches:
>
> EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
> SELECT boolean_field FROM example WHERE integer_field = 5432 AND 
> boolean_field IN (true, false);

That's using the Postgres 17 work. You could also write the query as
"SELECT boolean_field FROM example WHERE integer_field = 5432 AND
boolean_field BETWEEN false AND true" and get 2 index searches. That
variant uses what I've called "range skip scan", which is new in
Postgres 18.

-- 
Peter Geoghegan




LISTEN NOTIFY sometimes huge delay

2022-04-28 Thread Peter Eser HEUFT [Germany]

Hi all,

I have a table with time series data and on this table a trigger for 
notifies:


containers_notify AFTER INSERT ON containers FOR EACH ROW EXECUTE 
PROCEDURE containers_notify('containers_notify_collector')


and the function does:

PERFORM pg_notify(CAST(TG_ARGV[0] AS text), row_to_json(NEW)::text);

so that another application (java) fetches every inserted row as a JSON 
for further processing every half second:


...listenStatement.execute("LISTEN 'containers_notify_collector'");
...PGNotification notifications[] = 
((org.postgresql.PGConnection)notifyPGCon.getUnderlyingConnection()).getNotifications(); 



This works like a charm, but occasionally (I think with more load on the 
system) the notifications are received a long time (up to hours!) after 
the INSERTs.
Nevertheless, no notifications get lost; they are only very late! The 
delay grows, as if a queue were growing, but the Java process fetches the 
notifications fairly fast,

so there should be no queue building up..

Versions:
PostgreSQL 10.12 on x86_64-pc-linux-gnu, compiled by 
x86_64-pc-linux-gnu-gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit

JDBC 42.2.23

The commit of the application inserting the data is ok/fast. So the 
insert of the data is not slowed down.

Are the notifications delivered asynchronously to the commit/trigger?

Thanks for any help,

Peter