Question regarding fast-hashing in PGSQL
Hey there; I have a weird use case where I am basically taking data from many different sources and merging it into a single table, while trying to avoid duplicates as much as possible.

None of the sources share any kind of primary key, but I have determined 3 columns that, together, will almost always be unique, so I am planning on using those 3 columns as a composite primary key. Two of those columns are integers, which is great. The third column is a UTF-8 string, which may be quite long (though probably no longer than 50 characters; on average probably around 10 - 30 characters). The strings could be practically anything, and they absolutely will not be unique on their own. These three data points are basically x, y coordinates and then some name: for a given x, y coordinate there may be multiple names, but the likelihood of the same name at the same x, y is almost 0.

I really don't want to do a string comparison if possible, because this DB will be getting very large, very quickly -- 20 million or so rows anticipated in the near future (i.e. next few weeks), with possible growth up to around 200 million (1+ year later).

My idea was to hash the string to a bigint, because the likelihood of all 3 columns colliding is almost 0, and if a duplicate does crop up, it isn't the end of the world. However, PostgreSQL doesn't seem to have any 'native' hashing calls that result in a bigint. The closest I've found is pgcrypto's 'digest' call -- I could theoretically take an md5 hash and just use the first 8 bytes of it to make a bigint. HOWEVER, there is no straightforward way to do this. The most straightforward way I've seen is md5 -> hex string -> substring -> bigint. This is ridiculous to me: I'm basically converting binary to text, then converting text back to binary. But you can't convert a bytea to a bigint in any fashion that I can tell, so I have to eat a bunch of overhead for fun.

What would be the fastest way to do this? I will be generating potentially a LOT of these keys, so I want to do it the least dumb way. I am using Digital Ocean's hosted PostgreSQL, so I can't use my own C code -- but I can use PL/pgSQL, PL/Perl, or any of these extensions: https://www.digitalocean.com/docs/databases/postgresql/resources/supported-extensions/

If my concerns about string comparisons are unfounded and I'm working way too hard to avoid something that doesn't matter, feel free to tell me that as well. Basically, PostgreSQL performance guys, how would you tackle this one?

Thanks,
Stephen
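[For reference, a minimal sketch of the md5 -> hex string -> substring -> bigint chain described above; the ('x' || ...)::bit(64)::bigint cast is one standard hex-to-bigint idiom, and the sample string is arbitrary:]

    -- take the first 16 hex digits (8 bytes) of the md5 and cast them to bigint
    SELECT ('x' || substring(md5('some name'), 1, 16))::bit(64)::bigint;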
Re: Question regarding fast-hashing in PGSQL
I've had a similar issue in the past. I used the md5 hash function and stored it in a UUID column for my comparisons. That's bigger than a bigint, but still much faster than direct string comparisons for my use case.

A UUID works fine for storing md5 hashes (both are 128 bits) and gives you the ability to piggyback on all the index support built for the type.

Hope that helps,
-Adam
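[A minimal sketch of that approach; the cast works because md5() returns 32 hex digits, which is exactly a UUID's worth, and the sample string is arbitrary:]

    -- store/compare the 128-bit hash instead of the raw string
    SELECT md5('some name')::uuid AS name_hash;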
Re: Question regarding fast-hashing in PGSQL
This should work perfectly for me. Thank you so much!

On Wed, Sep 18, 2019 at 12:50 PM Adam Brusselback wrote:
> I've had a similar issue in the past. I used the md5 hash function and
> stored it in a UUID column for my comparisons. That's bigger than a
> bigint, but still much faster than direct string comparisons for my use
> case.
>
> A UUID works fine for storing md5 hashes (both are 128 bits) and gives
> you the ability to piggyback on all the index support built for the type.
>
> Hope that helps,
> -Adam
pg12 - migrate tables to partitions structure
Hey,

Thanks to the new partitioning features in pg12 (foreign keys referencing a partitioned table are now possible), I was trying to migrate some of my tables into a partitioned structure. Let's assume I have the following non-partitioned structure:

Product(id int PK, vendor int references Vendor(id), price int)
ProductPic(picId int PK, product int references Product(id))
Vendor(id int PK, name text)

...plus more tables that have references to Product(id).

I understand that the PK on the Product table must also include the partition column in order to ensure uniqueness across all the partitions. However, that means I'll need to add the partition column (vendor) to all the tables that have a reference to Product(id), and then update that column with the relevant data. This type of maintenance requires a lot of time, because I have a lot of references to the Product table.

Is there any other option in PG12 that allows references to a partitioned table?
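[A sketch of the migration being described, assuming Product is list-partitioned by vendor; identifiers follow the schema above (lowercased), and the composite FK in product_pic is exactly the maintenance burden in question:]

    CREATE TABLE vendor (id int PRIMARY KEY, name text);

    -- the PK of a partitioned table must include the partition key
    CREATE TABLE product (
        id     int,
        vendor int REFERENCES vendor(id),
        price  int,
        PRIMARY KEY (id, vendor)
    ) PARTITION BY LIST (vendor);

    CREATE TABLE product_v1 PARTITION OF product FOR VALUES IN (1);

    -- every referencing table now needs a vendor column as well
    CREATE TABLE product_pic (
        pic_id  int PRIMARY KEY,
        product int,
        vendor  int,
        FOREIGN KEY (product, vendor) REFERENCES product (id, vendor)
    );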
Re: pg12 - migrate tables to partitions structure
How many rows are you dealing with currently? What are your queries like? Have you looked at doing a hash partition on product.id? Is this on a test system or destined for a production environment in the near future? I ask because PG12 is still in beta.
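[For reference, a minimal sketch of the hash-partitioning option being suggested; the modulus of 4 and the column set are illustrative:]

    CREATE TABLE product (
        id     int,
        vendor int,
        price  int,
        PRIMARY KEY (id)   -- valid here: the partition key (id) is part of the PK
    ) PARTITION BY HASH (id);

    CREATE TABLE product_p0 PARTITION OF product FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE product_p1 PARTITION OF product FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE product_p2 PARTITION OF product FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE product_p3 PARTITION OF product FOR VALUES WITH (MODULUS 4, REMAINDER 3);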
Re: pg12 - migrate tables to partitions structure
Hey Michael, first of all thanks for the quick response.

Right now the production env is on a different version (10); I'm doing all my tests on a test environment. I'm familiar with hash partitions, but my queries don't involve product.id, therefore it isn't relevant. All the queries use the product's vendor, and that's why this column is a perfect fit as a partition column. My main table (Product) is big (10M+ rows), but other tables can also be big (1M+).
Re: pg12 - migrate tables to partitions structure
> All the queries use the product's vendor, and that's why this column is
> a perfect fit as a partition column. My main table (Product) is big
> (10M+ rows), but other tables can also be big (1M+).

I assume you have query performance problems and are hoping partitioning will help? Are you read heavy, or write intensive, or both? 10 million rows, especially if they aren't super wide, doesn't seem like a huge number to me. Do you have example queries with explain plans that you think would benefit from the system being partitioned?

I just know that as an engineer, sometimes I like to make use of new tools even when it isn't the best solution for the problem I am actually experiencing. How confident are you that you NEED partitions is my real question.
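[For what it's worth, output of the form below is the kind of evidence that would settle it; the query shape and vendor value are hypothetical:]

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, price FROM product WHERE vendor = 42;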
Re: pg12 - migrate tables to partitions structure
Well, if you have 10M rows, all your queries filter on the same column, and the data can be split pretty evenly between the partitions, is there any specific reason not to use it?

An index lookup has complexity O(log n), while a partition + index lookup is O(log m), where m = rows per partition and n = total rows. For example, with n = 10M split evenly across 100 partitions (m = 100K), that's roughly log2(10M) ≈ 23 comparisons versus log2(100K) ≈ 17 per lookup.
Re: Question regarding fast-hashing in PGSQL
Stephen Conley writes:
> My idea was to hash the string to a bigint, because the likelihood of all 3
> columns colliding is almost 0, and if a duplicate does crop up, it isn't
> the end of the world.
> However, Postgresql doesn't seem to have any 'native' hashing calls that
> result in a bigint.

regression=# \df hashtext*
                                 List of functions
   Schema   |       Name       | Result data type | Argument data types | Type
------------+------------------+------------------+---------------------+------
 pg_catalog | hashtext         | integer          | text                | func
 pg_catalog | hashtextextended | bigint           | text, bigint        | func
(2 rows)

The "extended" hash API has only been there since v11, so you couldn't rely on it if you need portability to old servers. But otherwise it seems to respond precisely to your question.

If you do need portability ... does the text string's part of the hash *really* have to be 64 bits wide? Why not just concatenate it with a 32-bit hash of the other fields?

regards, tom lane
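[Usage is a one-liner on PostgreSQL 11+; the seed of 0 and the sample string are arbitrary:]

    -- returns a bigint suitable for the key scheme described upthread
    SELECT hashtextextended('some name', 0);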
Re: pg12 - migrate tables to partitions structure
Is this being done because it can be, or is it solving a real-life pain point? Just wondering what the perspective is here. Much of partitioning strategy seems to me to revolve around how the system is used, and not just the schema and what is possible.

For instance, you can mimic primary and foreign key behavior with triggers as described here, and that would bypass some of the restrictions on what can be done:

https://www.depesz.com/2018/10/02/foreign-key-to-partitioned-table/

This would allow you to change out the primary key for a simple index perhaps, and partition however you want. Just because something can be done, does not mean it should be.
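[A highly simplified sketch of the trigger-based check the linked post describes; the function and table names here are illustrative, and a full emulation also needs triggers on the referenced side to handle deletes/updates, as the post explains:]

    CREATE OR REPLACE FUNCTION check_product_exists() RETURNS trigger AS $$
    BEGIN
        -- reject rows pointing at a product id that doesn't exist
        IF NOT EXISTS (SELECT 1 FROM product WHERE id = NEW.product) THEN
            RAISE EXCEPTION 'product % does not exist', NEW.product;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER product_pic_fk_check
        BEFORE INSERT OR UPDATE ON product_pic
        FOR EACH ROW EXECUTE FUNCTION check_product_exists();

This lets product keep a plain index on id instead of a composite primary key, while still catching dangling references on insert.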
