Huge shared hit for small table

2019-11-04 Thread Scott Rankin
Hello all,

We are trying to debug some slow performance in our production environment 
(Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN node 
that seems… weird.  This is a very large query involving a number of joins, but 
it performs pretty well in our staging environment (which has roughly the same 
data set as production, with a few tweaks).  However, there is one node in the 
EXPLAIN plan that is wildly different:

In the staging environment, we get this:

Index Scan using "programPK" on public.program prog  (cost=0.29..0.35 rows=1 
width=16) (actual time=0.002..0.003 rows=1 loops=21965)
  Output: prog.id, prog.version, prog.active, prog.created_date, 
prog.last_modified_date, prog.created_by, prog.last_modified_by, 
prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
prog.custom_fields, prog.setup_complete, prog.setup_messages, 
prog.legacy_program_type
  Index Cond: (prog.id = per.program)
  Buffers: shared hit=87860

In the production environment, we get this:

Index Scan using "programPK" on public.program prog  (cost=0.29..0.36 rows=1 
width=16) (actual time=0.017..4.251 rows=1 loops=21956)
   Output: prog.id, prog.version, prog.active, prog.created_date, 
prog.last_modified_date, prog.created_by, prog.last_modified_by, 
prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
prog.custom_fields, prog.setup_complete, prog.setup_messages, 
prog.legacy_program_type
   Index Cond: (prog.id = per.program)
   Buffers: shared hit=25437716

The tables in both environments are about the same size (18MB) and the indexes 
are about the same size (360kb/410kb) – and the shared hits are pretty much the 
same on the other nodes of the query between the two environments.

Lastly, if we take out the join to the “program” table, the query performs much 
faster in production and the timing between staging and production is similar.

This has happened one time before, and we did a “REINDEX” on the program table 
– and that made the problem mostly go away.  Now it seems to be back, and I’m 
not sure what to make of it.

Thanks in advance for any help you can offer!

Scott

SCOTT RANKIN
VP, Technology
Motus, LLC
Two Financial Center, 60 South Street, Boston, MA 02111
617.467.1900 (O) | [email protected]





Re: Huge shared hit for small table

2019-11-04 Thread Andres Freund
Hi,

On 2019-11-04 19:38:40 +, Scott Rankin wrote:
> In the staging environment, we get this:
> 
> Index Scan using "programPK" on public.program prog  (cost=0.29..0.35 rows=1 
> width=16) (actual time=0.002..0.003 rows=1 loops=21965)
>   Output: prog.id, prog.version, prog.active, prog.created_date, 
> prog.last_modified_date, prog.created_by, prog.last_modified_by, 
> prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
> prog.custom_fields, prog.setup_complete, prog.setup_messages, 
> prog.legacy_program_type
>   Index Cond: (prog.id = per.program)
>   Buffers: shared hit=87860
> 
> In the production environment, we get this:
> 
> Index Scan using "programPK" on public.program prog  (cost=0.29..0.36 rows=1 
> width=16) (actual time=0.017..4.251 rows=1 loops=21956)
>Output: prog.id, prog.version, prog.active, prog.created_date, 
> prog.last_modified_date, prog.created_by, prog.last_modified_by, 
> prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
> prog.custom_fields, prog.setup_complete, prog.setup_messages, 
> prog.legacy_program_type
>Index Cond: (prog.id = per.program)
>Buffers: shared hit=25437716
> 
> The tables in both environments are about the same size (18MB) and the 
> indexes are about the same size (360kb/410kb) – and the shared hits are 
> pretty much the same on the other nodes of the query between the two 
> environments.

It'd be worthwhile to look at the index stats using pgstatindex. Also,
could you show the definition of those indexes?
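
Something along these lines would do it (just a sketch; it assumes the
pgstattuple extension can be installed, and uses the index name from the plan
above):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('"programPK"');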


> [standard Motus, LLC confidentiality and IRS disclaimer footer]

GNGNGGRR.

Greetings,

Andres Freund




Re: Huge shared hit for small table

2019-11-04 Thread Scott Rankin
The index is exceedingly simple:


CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);

From pg_stat_user_indexes:

Staging:

idx_scan: 5826745
idx_tup_read: 52715470
idx_tup_fetch: 52644465

Production:

idx_scan : 7277919087
idx_tup_read: 90612605047
idx_tup_fetch: 5207807880

From: Andres Freund 
Date: Monday, November 4, 2019 at 2:46 PM
To: Scott Rankin 
Cc: "[email protected]" 

Subject: Re: Huge shared hit for small table

Hi,

On 2019-11-04 19:38:40 +, Scott Rankin wrote:
> In the staging environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.35 rows=1 
> width=16) (actual time=0.002..0.003 rows=1 loops=21965)
> Output: prog.id, prog.version, 
> prog.active, prog.created_date, prog.last_modified_date, 
> prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, 
> prog.name, prog.legacy_group_id, prog.custom_fields, 
> prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=87860
>
> In the production environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.36 rows=1 
> width=16) (actual time=0.017..4.251 rows=1 loops=21956)
> Output: prog.id, prog.version, 
> prog.active, prog.created_date, prog.last_modified_date, 
> prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, 
> prog.name, prog.legacy_group_id, prog.custom_fields, 
> prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=25437716
>
> The tables in both environments are about the same size (18MB) and the 
> indexes are about the same size (360kb/410kb) – and the shared hits are 
> pretty much the same on the other nodes of the query between the two 
> environments.

It'd be worthwhile to look at the index stats using pgstatindex. Also,
could you show the definition of those indexes?


> [standard Motus, LLC confidentiality and IRS disclaimer footer]

GNGNGGRR.

Greetings,

Andres Freund





Re: Huge shared hit for small table

2019-11-04 Thread Justin Pryzby
On Mon, Nov 04, 2019 at 07:38:40PM +, Scott Rankin wrote:
> In the staging environment, we get this:
> 
> Index Scan using "programPK" on public.program prog  (cost=0.29..0.35 rows=1 
> width=16) (actual time=0.002..0.003 rows=1 loops=21965)
>   Output: prog.id, prog.version, prog.active, prog.created_date, 
> prog.last_modified_date, prog.created_by, prog.last_modified_by, 
> prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
> prog.custom_fields, prog.setup_complete, prog.setup_messages, 
> prog.legacy_program_type
>   Index Cond: (prog.id = per.program)
>   Buffers: shared hit=87860
> 
> In the production environment, we get this:
> 
> Index Scan using "programPK" on public.program prog  (cost=0.29..0.36 rows=1 
> width=16) (actual time=0.017..4.251 rows=1 loops=21956)
>Output: prog.id, prog.version, prog.active, prog.created_date, 
> prog.last_modified_date, prog.created_by, prog.last_modified_by, 
> prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, 
> prog.custom_fields, prog.setup_complete, prog.setup_messages, 
> prog.legacy_program_type
>Index Cond: (prog.id = per.program)
>Buffers: shared hit=25437716
> 
> The tables in both environments are about the same size (18MB) and the 
> indexes are about the same size (360kb/410kb) – and the shared hits are 
> pretty much the same on the other nodes of the query between the two 
> environments.

I think it's because some heap pages are being visited many times, due to the
index tuples being badly "fragmented".  Note, I'm not talking about
fragmentation of index *pages*, which is what pgstattuple reports (which
wouldn't have nearly so detrimental effect).  I could probably say that the
index tuples are badly "correlated" with the heap.

I'm guessing there are perhaps 25437716/87860 ≈ 290 index tuples per page, and
they rarely point to same heap page as their siblings.  "Hit" means that this
affects you even though it's cached (by postgres).  So this is apparently slow
due to reading each page ~300 times rather than once to get its tuples all at
once.

> This has happened one time before, and we did a “REINDEX” on the program 
> table – and that made the problem mostly go away.  Now it seems to be back, 
> and I’m not sure what to make of it.

..which is consistent with my hypothesis.

You can use pg_repack or CREATE INDEX+DROP+RENAME hack (which is what pg_repack
-i does).  In a fresh index, its tuples are sorted by heap TID.  You could
CLUSTER the table itself (or pg_repack -t) on that index column.
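
For example, a sketch of that index-swap approach (assuming "programPK" is a
plain unique index rather than one backing a PRIMARY KEY/UNIQUE constraint;
the "_new" name is just illustrative):

CREATE UNIQUE INDEX CONCURRENTLY "programPK_new" ON program (id);
DROP INDEX CONCURRENTLY "programPK";
ALTER INDEX "programPK_new" RENAME TO "programPK";

Or, to rewrite the heap itself in index order (takes an exclusive lock while it
runs):

CLUSTER program USING "programPK";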

In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash
affecting its progress reporting, fix to be included in v12.1).

Justin




Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 11:56 AM Justin Pryzby  wrote:
> I think it's because some heap pages are being visited many times, due to the
> index tuples being badly "fragmented".  Note, I'm not talking about
> fragmentation of index *pages*, which is what pgstattuple reports (which
> wouldn't have nearly so detrimental effect).  I could probably say that the
> index tuples are badly "correlated" with the heap.

But this is a unique index, and Scott indicates that the problem seems
to go away for a while following a REINDEX.

> In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash
> affecting its progress reporting, fix to be included in v12.1).

PG v12 will store B-Tree duplicates in heap TID order, so if that's
the problem then upgrading to v12 (and REINDEXing if the upgrade was
performed using pg_upgrade) will fix it for good.

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Scott Rankin
Thanks to Justin for the clarification around pgstatindex:

Staging:

version: 2
tree_level: 1
index_size: 425984
root_block_no: 3
internal_pages: 1
leaf_pages: 50
empty_pages: 0
deleted_pages: 0
avg_leaf_density: 70.86
leaf_fragmentation: 16

Production:

version: 2
tree_level: 1
index_size: 360448
root_block_no: 3
internal_pages: 1
leaf_pages: 41
empty_pages: 0
deleted_pages: 1
avg_leaf_density: 60.44
leaf_fragmentation: 39.02

On 11/4/19, 3:07 PM, "Peter Geoghegan"  wrote:

On Mon, Nov 4, 2019 at 11:56 AM Justin Pryzby  wrote:
> I think it's because some heap pages are being visited many times, due to the
> index tuples being badly "fragmented".  Note, I'm not talking about
> fragmentation of index *pages*, which is what pgstattuple reports (which
> wouldn't have nearly so detrimental effect).  I could probably say that the
> index tuples are badly "correlated" with the heap.

But this is a unique index, and Scott indicates that the problem seems
to go away for a while following a REINDEX.

> In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash
> affecting its progress reporting, fix to be included in v12.1).

PG v12 will store B-Tree duplicates in heap TID order, so if that's
the problem then upgrading to v12 (and REINDEXing if the upgrade was
performed using pg_upgrade) will fix it for good.

--
Peter Geoghegan





Re: Huge shared hit for small table

2019-11-04 Thread Andres Freund
Hi,

On 2019-11-04 19:56:57 +, Scott Rankin wrote:
> The index is exceedingly simple:
>
>
> CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);
>
> From pg_stat_user_indexes:
>
> Staging:
>
> idx_scan: 5826745
> idx_tup_read: 52715470
> idx_tup_fetch: 52644465
>
> Production:
>
> idx_scan : 7277919087
> idx_tup_read: 90612605047
> idx_tup_fetch: 5207807880

I was basically asking for SELECT * FROM pgstatindex('"programPK"');
with pgstatindex() being from the pgstattuple extension
(https://www.postgresql.org/docs/current/pgstattuple.html),
not the pg_stat_user_indexes entry...

Greetings,

Andres Freund




Re: Huge shared hit for small table

2019-11-04 Thread Jeff Janes
On Mon, Nov 4, 2019 at 2:38 PM Scott Rankin  wrote:

> Hello all,
>
>
>
> We are trying to debug some slow performance in our production environment
> (Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN
> node that seems… weird.  This is a very large query involving a number of
> joins, but it performs pretty well in our staging environment (which has
> roughly the same data set as production, with a few tweaks).  However,
> there is one node in the EXPLAIN plan that is wildly different:
>

Could there be a long-open transaction, which is preventing hint-bits from
getting set on the table rows, as well as on the index rows?

...


> The tables in both environments are about the same size (18MB) and the
> indexes are about the same size (360kb/410kb) – and the shared hits are
> pretty much the same on the other nodes of the query between the two
> environments.
>

If this table has more turn-over than those other tables (as measured in
rows, not in percentage of the table), this would not be inconsistent with
my theory.


> This has happened one time before, and we did a “REINDEX” on the program
> table – and that made the problem mostly go away.  Now it seems to be back,
> and I’m not sure what to make of it.
>


A reindex would not by itself fix the problem if it were the long open
transaction.  But  if the long open transaction held a sufficient lock on
the table, then the reindex would block until the transaction went away on
its own, at which point the problem would go away on its own, so it might
**appear** to have fixed the problem.
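
A quick way to look for such a transaction (a sketch against pg_stat_activity,
using column names that exist in 9.6):

SELECT pid, state, xact_start, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;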

Cheers,

Jeff



Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 12:32 PM Jeff Janes  wrote:
> Could there be a long-open transaction, which is preventing hint-bits from
> getting set on the table rows, as well as on the index rows?

Contention on a small number of rows may also be a factor.

> A reindex would not by itself fix the problem if it were the long open 
> transaction.  But  if the long open transaction held a sufficient lock on the 
> table, then the reindex would block until the transaction went away on its 
> own, at which point the problem would go away on its own, so it might 
> **appear** to have fixed the problem.

That seems like the simplest and most likely explanation to me, even
though it isn't particularly simple.

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Scott Rankin
Definitely no long-running transactions on this table; in fact, this table is 
pretty infrequently updated – on the order of a few tens of rows updated per 
day.

From: Jeff Janes 
Date: Monday, November 4, 2019 at 3:32 PM
To: Scott Rankin 
Cc: "[email protected]" 

Subject: Re: Huge shared hit for small table

On Mon, Nov 4, 2019 at 2:38 PM Scott Rankin  wrote:
Hello all,

We are trying to debug some slow performance in our production environment 
(Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN node 
that seems… weird.  This is a very large query involving a number of joins, but 
it performs pretty well in our staging environment (which has roughly the same 
data set as production, with a few tweaks).  However, there is one node in the 
EXPLAIN plan that is wildly different:

Could there be a long-open transaction, which is preventing hint-bits from 
getting set on the table rows, as well as on the index rows?

...

The tables in both environments are about the same size (18MB) and the indexes 
are about the same size (360kb/410kb) – and the shared hits are pretty much the 
same on the other nodes of the query between the two environments.

If this table has more turn-over than those other tables (as measured in rows, 
not in percentage of the table), this would not be inconsistent with my theory.

This has happened one time before, and we did a “REINDEX” on the program table 
– and that made the problem mostly go away.  Now it seems to be back, and I’m 
not sure what to make of it.


A reindex would not by itself fix the problem if it were the long open 
transaction.  But  if the long open transaction held a sufficient lock on the 
table, then the reindex would block until the transaction went away on its own, 
at which point the problem would go away on its own, so it might **appear** to 
have fixed the problem.

Cheers,

Jeff



Re: Huge shared hit for small table

2019-11-04 Thread Peter Geoghegan
On Mon, Nov 4, 2019 at 12:38 PM Scott Rankin  wrote:
> Definitely no long-running transactions on this table; in fact, this table is 
> pretty infrequently updated – on the order of a few tens of rows updated per 
> day.

But a long running transaction will have an impact on all tables --
not just the tables that happen to have been accessed so far in the
long running transaction. This is necessary because nothing stops the
long running transaction from SELECTing data from any table at any
time -- we need to pessimistically keep around the data required to
make that work.
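
One way to see which sessions are holding back the global xmin horizon,
regardless of which tables they have touched (a sketch; backend_xmin has been
in pg_stat_activity since 9.4):

SELECT pid, datname, usename, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;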

-- 
Peter Geoghegan




Re: Huge shared hit for small table

2019-11-04 Thread Scott Rankin
I think we have a winner.  I looked in and found a process that was 'idle in 
transaction' for a couple days - and once I killed it, query performance went 
back to normal.
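
For anyone hitting the same thing, the fix looked roughly like this (a sketch;
the pid here is just a placeholder):

-- find sessions sitting idle inside an open transaction
SELECT pid, usename, state, xact_start
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- then terminate the offending backend
SELECT pg_terminate_backend(12345);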

Thank you all for the very quick responses on this.

On 11/4/19, 3:41 PM, "Peter Geoghegan"  wrote:

On Mon, Nov 4, 2019 at 12:38 PM Scott Rankin  wrote:
> Definitely no long-running transactions on this table; in fact, this table is
> pretty infrequently updated – on the order of a few tens of rows updated per day.

But a long running transaction will have an impact on all tables --
not just the tables that happen to have been accessed so far in the
long running transaction. This is necessary because nothing stops the
long running transaction from SELECTing data from any table at any
time -- we need to pessimistically keep around the data required to
make that work.

--
Peter Geoghegan





Re: Huge shared hit for small table

2019-11-04 Thread Jeff Janes
On Mon, Nov 4, 2019 at 3:38 PM Scott Rankin  wrote:

> Definitely no long-running transactions on this table;
>

Any long running transactions at all?  The lock on the table is only
necessary to explain why the problem would have gone away at the same time
as the reindex finished.  If there is a long running transaction which
doesn't touch this table, it would still cause the problem. It is just that
the reindex would not solve the problem (because the
not-entirely-dead-yet tuples would have to be copied into the new index),
and with no lock there is no reason for them to be correlated in time,
other than sheer dumb luck.

Does another reindex solve the problem again?

>  in fact, this table is pretty infrequently updated – on the order of a
few tens of rows updated per day.

That would seem to argue against this explanation, but all the other ones
too, I think.  But a few tens of rows per day and a transaction left open
for a few tens of days, and you could get enough zombie tuples to add up to
trouble.  Particularly if there is one row (as defined by prog.id) which is
seeing both most of those updates and most of the index-scan activity.

But now I am curious, if it is a small table and the index scan is going to
be invoked 21,956 times in one query, it seems like it should hash it
instead.  Does it misestimate how often that index scan is going to get
invoked? (assuming the index scan is the 2nd child of a nested loop, what
is the expected and actual row count of the 1st child of that loop?)

Cheers,

Jeff



FPGA optimization ...

2019-11-04 Thread Gunther

The time has come.

FPGA optimization is in the palm of our hands (literally a 2 TB 40 GB/s 
IO PostgreSQL server fits into less than a shoe box), and on Amazon AWS 
F1 instances.


Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.


But a lot more could be done. How about linear sort performance at O(N)? 
https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. 
And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, where 
the following four sorting accelerators are used:


 * Tuple Sorter : Sorts an N-tuple using a sorting network.
 * Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples
   in on-chip memory.
 * Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
 * Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in
   flash.

Order of magnitude speed improvements? Better than Hadoop clusters on a 
single chip? 40 GB/s I/O throughput massive full table scan, blazing 
fast sort-merge joins? Here it is. Anybody working more on that? Should 
be an ideal project for a student or a group of students.


Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue University 
who might take this on. Interest?


regards,
-Gunther



Re: FPGA optimization ...

2019-11-04 Thread Tomas Vondra

On Mon, Nov 04, 2019 at 06:33:15PM -0500, Gunther wrote:

The time has come.

FPGA optimization is in the palm of our hands (literally a 2 TB 40 
GB/s IO PostgreSQL server fits into less than a shoe box), and on 
Amazon AWS F1 instances.


Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.


But a lot more could be done. How about linear sort performance at 
O(N)? https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. 
And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, where 
the following four sorting accelerators are used:


* Tuple Sorter : Sorts an N-tuple using a sorting network.
* Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples
  in on-chip memory.
* Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
* Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in
  flash.

Order of magnitude speed improvements? Better than Hadoop clusters on 
a single chip? 40 GB/s I/O throughput massive full table scan, blazing 
fast sort-merge joins? Here it is. Anybody working more on that? 
Should be an ideal project for a student or a group of students.




For the record, this is not exactly a new thing. Netezza (a PostgreSQL
fork started in 1999, later acquired by IBM) used FPGAs. Now there's swarm64 [1], another
PostgreSQL fork, also using FPGAs with newer PostgreSQL releases.

Those are proprietary forks, though. The main reason why the community
itself is not working on this directly (at least not on pgsql-hackers)
is exactly that it requires specialized hardware, which the devs
probably don't have, making development impossible, and the regular
customers are not asking for it either (one of the reasons being limited
availability of such hardware, especially for customers running in the
cloud and not being even able to deploy custom appliances).

I don't think this will change, unless the access to systems with FPGAs
becomes much easier (e.g. if AWS introduces such instance type).

Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue 
University who might take this on. Interest?




I don't think there's any such non-profit managing/funding development.
At least I'm not aware of it. There are various non-profits around the
world, but those are organizing events and local communities.

I'd say the best way to do something like this is to either talk to one
of the companies participating in PostgreSQL development (pgsql-hackers is
probably a good starting point), or - if you absolutely need to go
through a non-profit - approach a university (which does not mean people
from pgsql-hackers can't be involved, of course). I've been involved in
a couple of such research projects in Europe, not sure what exactly is
the situation/rules in US.

regards

[1] https://swarm64.com/netezza-replacement/

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: FPGA optimization ...

2019-11-04 Thread Gunther

Hi Tomas, you said:


For the record, this is not exactly a new thing. Netezza (a PostgreSQL
fork started in 1999, later acquired by IBM) used FPGAs. Now there's swarm64 [1], another
PostgreSQL fork, also using FPGAs with newer PostgreSQL releases.


yes, I found the swarm thing on Google, and heard about Netezza years 
ago from the Indian consulting contractor that had worked on it (their 
price point was way out of the range that made sense for the academic 
place where I worked then).


But there is good news, better than you thought when you wrote:


Those are proprietary forks, though. The main reason why the community
itself is not working on this directly (at least not on pgsql-hackers)
is exactly that it requires specialized hardware, which the devs
probably don't have, making development impossible, and the regular
customers are not asking for it either (one of the reasons being limited
availability of such hardware, especially for customers running in the
cloud and not being even able to deploy custom appliances).

I don't think this will change, unless the access to systems with FPGAs
becomes much easier (e.g. if AWS introduces such instance type).


It already has changed! Amazon F1 instances. And Xilinx has already 
packaged a demo https://aws.amazon.com/marketplace/pp/B07BVSZL51. This 
demo appears very limited though (only for TPC-H query 6 and 12 or so).


Even the hardware to hold in your hand is now much cheaper. I know a guy 
who's marketing a board with 40 GB/s throughput. I don't have a price, but 
I can't imagine the board plus 1 TB disk to be much outside of US$ 2k. I 
could sponsor that if someone wants to have a serious shot at it.


Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
exempt? I can donate and possibly find some people at Purdue 
University who might take this on. Interest?




I don't think there's any such non-profit managing/funding development.
At least I'm not aware of it. There are various non-profits around the
world, but those are organizing events and local communities.

I'd say the best way to do something like this is to either talk to one
of the companies participating in PostgreSQL development (pgsql-hackers is
probably a good starting point), or - if you absolutely need to go
through a non-profit - approach a university (which does not mean people
from pgsql-hackers can't be involved, of course). I've been involved in
a couple of such research projects in Europe, not sure what exactly is
the situation/rules in US.


Yes, might work with a University directly. Although I will contact the 
PostgreSQL foundation in the US also.


regards,
-Gunther