Hash Index on Partitioned Table

2023-05-31 Thread peter.boris...@kartographia.com

Dear PostgreSQL Community,
 
I have a rather large database with ~250 billion records in a partitioned 
table. The database has been running and ingesting data continuously for about 
3 years.
 
I had a "regular" BTree index on one of the fields (a unique bigint column) but 
it was getting too big for the disk it was on. The index was consuming 6.4 TB 
of disk space.
 
I created a new disk with double the size, dropped the original index, and 
started to generate a new index.
 
After doing some research I decided to try to create a hash index instead of a 
BTree. For my purposes, the index is only used to find specific numbers ("=" 
and "IN" queries). From what I read, the hash index should run a little faster 
than btree for my use case and should use less disk space.
 
After 115 hours, the hash index is still generating and is using significantly 
more disk space than the original BTree index (8.4 TB vs 6.4 TB). I don't know 
how to check the status of the index creation task so I can't really estimate 
how much longer it will take or how much disk space it will consume.
 
Questions:
 
(1) Why is the hash index consuming more disk space than the btree index? Is it 
because the hashes of the bigint values are larger than the bigints stored in 
the btree?
(2) Are there any known issues having a hash index on this many records?
(3) Are there any known issues having a hash index on partitioned tables?
(4) Is there any way to estimate when the index process will complete?
 
Server info:
 - PostgreSQL 13
 - Ubuntu 20.04.6 LTS
 - 64 cores (only 1 is ever used during index creation)
 - Memory usage is steady at 58GB/188GB
 - All disks are high speed NVMe drives
 - 1,686 partitions in the partitioned table
 
Thanks in advance,
Peter

Re: Hash Index on Partitioned Table

2023-05-31 Thread peter.boris...@kartographia.com

Hi Tom,
Thanks so much for your quick response. As luck would have it, the index 
FINALLY finished about an hour ago. For a size comparison:
 
BTree: 6,433 GB
Hash: 8,647 GB
 
Although I don't have a proper benchmark to compare performance, I can say the 
hash index is performing as well as, if not faster than, the BTree for my use 
case (web application).
 
I guess I was getting a little nervous waiting for the index to complete and 
seeing such a huge difference in file size but I'm ok now :-)
 
Thanks again,
Peter
 
 
-Original Message-
From: "Tom Lane" 
Sent: Wednesday, May 31, 2023 10:07am
To: "peter.boris...@kartographia.com" 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: Hash Index on Partitioned Table



"peter.boris...@kartographia.com"  writes:
> I have a rather large database with ~250 billion records in a partitioned 
> table. The database has been running and ingesting data continuously for 
> about 3 years.

> I had a "regular" BTree index on one of the fields (a unique bigint column) 
> but it was getting too big for the disk it was on. The index was consuming 
> 6.4 TB of disk space.

That's ... really about par for the course. Each row requires an 8-byte
index entry, plus 12 bytes index overhead. If I'm doing the math right
then the index is physically about 78% full which is typical to good for
a b-tree. Reindexing would remove most of the extra space, but only
temporarily.

> After doing some research I decided to try to create a hash index instead of 
> a BTree. For my purposes, the index is only used to find specific numbers 
> ("=" and "IN" queries). From what I read, the hash index should run a little 
> faster than btree for my use case and should use less disk space.

I'm skeptical. The thing to bear in mind is that btree is the mainstream
use-case and has been refined and optimized far more than the hash index
logic.

> (1) Why is the hash index consuming more disk space than the btree index? Is 
> it because the hashes of the bigint values are larger than the bigints stored 
> in the btree?

From memory, the index entries will be the same size in this case,
but hash might have more wasted space.
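For what it's worth, the on-disk size of each index is easy to check directly 
(the index names below are placeholders for whatever yours are called):

```sql
-- Compare the on-disk size of two indexes by name.
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('my_btree_idx', 'my_hash_idx');
```

For a partitioned table you'd sum the per-partition index sizes, since each 
partition carries its own physical index.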

> (4) Is there any way to estimate when the index process will complete?

An index on a partitioned table isn't a single object, it's one index per
partition. So you should be able to look at how many partitions have
indexes so far. You might have to drill down to the point of counting how
many files in the database's directory, if the individual indexes aren't
showing up as committed catalog entries yet.
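Also, on v12 and later (so including your v13) each running CREATE INDEX 
reports its progress in a system view, which saves counting files by hand:

```sql
-- One row per backend currently running CREATE INDEX; the phase plus the
-- block/tuple counters give a rough sense of progress on the partition
-- being built right now.
SELECT pid, phase,
       blocks_done, blocks_total,
       tuples_done, tuples_total
FROM pg_stat_progress_create_index;
```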

 regards, tom lane

Inserting into foreign table with sequences and default values

2023-12-08 Thread peter.boris...@kartographia.com

Hello,
   I have a few questions about inserting data using Foreign Data Wrappers 
(FDW). 
Consider this simple example.
 
On PostgreSQL Database A (remote):

CREATE TABLE APP.TEST (
ID BIGSERIAL NOT NULL,
FIRST_NAME text,
LAST_NAME text,
STATUS integer NOT NULL DEFAULT 1,
CONSTRAINT PK_USER PRIMARY KEY (ID)
);
 
When I connect directly to this database I can insert without any issues. 
Example:

insert into app.test(first_name) values('Peter');
INSERT 0 1
---
 
On PostgreSQL Database B (local):

CREATE SERVER remote_server FOREIGN DATA WRAPPER postgres_fdw ...
IMPORT FOREIGN SCHEMA APP LIMIT TO (TEST)
FROM SERVER remote_server INTO public;
---
 
When I try to do a simple insert into the test table on database B I get an 
error:
 
insert into test(first_name) values('Mark');
ERROR:  null value in column "id" of relation "test" violates not-null 
constraint
 
If I manually set the id, I get another error
 
insert into test(id, first_name) values(2, 'Mark');
ERROR:  null value in column "status" of relation "test" violates not-null 
constraint
 
Only after I set both the id and status fields can I successfully insert.
 
insert into test(id, first_name, status) values(2, 'Mark', 1);
INSERT 0 1
 
I guess the FDW is not "seeing" the default value for status or the bigserial 
sequence for the id column. Is there any way around this? Is there an option I 
missed when I called IMPORT FOREIGN SCHEMA? Or is there a hack using views or 
triggers?
 
Thanks in advance,
Peter
 
 
 
 

Re: Inserting into foreign table with sequences and default values

2023-12-08 Thread peter.boris...@kartographia.com

Thanks Tom,
   Your reply helped point me in the right direction. With a little trial and 
error I came up with a hack to solve my issue.
 
First off, I create a shared sequence for the user_id with a technique 
described here:
https://paquier.xyz/postgresql-2/global-sequences-with-postgres_fdw-and-postgres-core/
 
This involved creating a sequence on the remote server and a view of that 
sequence, also on the remote server. I also updated the "test" table on the 
remote server to populate the "id" column from the sequence via a trigger.
 
On the local server, I created a foreign table that references the view on the 
remote server and a function that returns a sequence value from the foreign 
table.
 
Finally, on the local server I created a trigger on the foreign "test" table 
that was imported earlier via the "IMPORT FOREIGN SCHEMA" command.
 
It's all a little hacky but it works :-)
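As an aside, postgres_fdw can also import DEFAULT expressions at schema-import 
time, which might have avoided part of this. A sketch (untested against my 
setup), with the caveat that a bigserial default expands to nextval() on a 
sequence that only exists on the remote server, so it errors when evaluated 
locally, whereas a literal default like the status column's 1 imports fine:

```sql
-- Import column DEFAULTs along with the table definition.
-- Works for literal defaults (e.g. STATUS integer DEFAULT 1);
-- a serial/bigserial default references a remote-only sequence
-- and will fail when the local server tries to evaluate it.
IMPORT FOREIGN SCHEMA APP LIMIT TO (TEST)
FROM SERVER remote_server INTO public
OPTIONS (import_default 'true');
```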
 
Full steps below for anyone that's interested (sorry if it gets mangled via 
email).
 
Thanks Again,
Peter
 
 

On Remote:


CREATE SEQUENCE user_id;

CREATE VIEW user_id_view AS SELECT nextval('user_id') as a;

CREATE FUNCTION user_id_nextval() RETURNS bigint AS
'SELECT a FROM user_id_view;' LANGUAGE SQL;


CREATE TABLE APP.TEST (
ID bigint NOT NULL,
FIRST_NAME text,
LAST_NAME text,
STATUS integer NOT NULL DEFAULT 1,
CONSTRAINT PK_USER PRIMARY KEY (ID)
);

CREATE OR REPLACE FUNCTION test_function()
RETURNS trigger AS
$BODY$
BEGIN
  NEW.id := CASE WHEN NEW.id IS NULL THEN user_id_nextval() ELSE NEW.id END;
  NEW.status := CASE WHEN NEW.status IS NULL THEN 1 ELSE NEW.status END;
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;


CREATE TRIGGER test_trigger
BEFORE INSERT
ON app.test
FOR EACH ROW
EXECUTE PROCEDURE test_function();



On Local



CREATE FOREIGN TABLE user_id_foreign_table (a bigint)
SERVER culmen OPTIONS (table_name 'user_id_view');

CREATE FUNCTION user_id_nextval() RETURNS bigint AS
'SELECT a FROM user_id_foreign_table;' LANGUAGE SQL;


IMPORT FOREIGN SCHEMA APP LIMIT TO (TEST)
FROM SERVER culmen INTO public;


CREATE OR REPLACE FUNCTION test_function()
RETURNS trigger AS
$BODY$
BEGIN
  NEW.id := CASE WHEN NEW.id IS NULL THEN user_id_nextval() ELSE NEW.id END;
  NEW.status := CASE WHEN NEW.status IS NULL THEN 1 ELSE NEW.status END;
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;


CREATE TRIGGER test_trigger
BEFORE INSERT
ON test
FOR EACH ROW
EXECUTE PROCEDURE test_function();



INSERT INTO test(first_name) VALUES ('Bob');
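A possibly simpler variant of the local half, skipping the trigger entirely: 
postgres_fdw applies DEFAULTs declared on the foreign table locally whenever 
the column is omitted from the INSERT, so the defaults can live in the foreign 
table definition itself. A sketch reusing the same server and function names 
as above (not tested against this exact setup):

```sql
-- Local foreign table with locally-evaluated defaults; "test2" is a
-- placeholder name to avoid clashing with the imported "test" table.
CREATE FOREIGN TABLE test2 (
    id bigint DEFAULT user_id_nextval(),
    first_name text,
    last_name text,
    status integer DEFAULT 1
) SERVER culmen OPTIONS (schema_name 'app', table_name 'test');

INSERT INTO test2(first_name) VALUES ('Alice');
```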


 
 
 

Re: Inserting into foreign table with sequences and default values

2023-12-08 Thread peter.boris...@kartographia.com

Correction: the trigger on the remote table isn't going to work. I was trying 
so many different things today that I confused myself. Sorry...
 
On the bright side, the remote sequence works great and I can now insert 
records from my database into the remote database, which is a step forward.
 
I'll report back if I make any additional progress.