Question regarding fast-hashing in PGSQL

2019-09-18 Thread Stephen Conley
Hey there;

I have a weird use case where I am basically taking data from many
different sources and merging it into a single table, while trying to avoid
duplicates as much as possible.  None of them share any kind of primary
key, but I have determined 3 columns that, together, will almost always be
unique so I am planning on using those 3 columns as a composite primary key.

Two of those columns are integers, which is great.  The third column is a
string, UTF-8, which may be quite long (though probably no longer than 50
characters ... on average probably around 10 - 30 characters).  The strings
could be practically anything, and they absolutely will not be unique on
their own (these three data points are basically x, y coordinates and then
some name... for a given x, y coordinate there may be multiple names, but the
likelihood of the same name at the same x, y is almost 0).

I really don't want to do a string comparison if possible because this DB
will be getting very large, very quickly -- 20 million or so rows
anticipated in the near future (i.e. next few weeks), with possible growth
up to around 200 million (1+ year later).

My idea was to hash the string to a bigint, because the likelihood of all 3
columns colliding is almost 0, and if a duplicate does crop up, it isn't
the end of the world.

However, Postgresql doesn't seem to have any 'native' hashing calls that
result in a bigint.  The closest I've found is pgcrypto's 'digest' call --
I could theoretically take an md5 hash, and just use the first 8 bytes of
it to make a bigint.

HOWEVER... there is no straightforward way to do this.  The most straightforward
way I've seen is md5 -> hex string -> substring -> bigint.  This is
ridiculous to me -- I'm basically converting binary to text, then
converting text back to binary.  However, you can't convert a bytea to a bigint
in any fashion that I can tell, so I have to eat a bunch of overhead for fun.
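For reference, the hex-string route described above can at least be done in a single expression, using the usual hex-text-to-bigint idiom (a sketch; the literal is just a stand-in for the name column):

```sql
-- md5() returns 32 hex characters; take the first 16 (8 bytes) and
-- reinterpret them as a 64-bit integer via the bit(64) cast idiom.
SELECT ('x' || substring(md5('some name'), 1, 16))::bit(64)::bigint;
```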

What would be the fastest way to do this?  I will be generating potentially
a LOT of these keys so I want to do it the least dumb way.   I am using
Digital Ocean's hosted PostgreSQL so I can't use my own C code -- but I can
use PL/Psql, PL/Perl or any of these extensions:

https://www.digitalocean.com/docs/databases/postgresql/resources/supported-extensions/

If my concerns about string comparisons are unfounded and I'm working way
too hard to avoid something that doesn't matter ... feel free to tell me
that as well.  Basically, PostgreSQL performance guys, how would you tackle
this one?


Thanks,

Stephen


Re: Question regarding fast-hashing in PGSQL

2019-09-18 Thread Adam Brusselback
I've had a similar issue in the past.

I used the md5 hash function and stored it in a UUID column for my
comparisons. Bigger than a bigint, but still much faster than string
comparisons directly for my use case.
UUID works fine for storing md5 hashes and gives you the ability to
piggyback on all the index support built for them.
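For illustration, the trick works because md5() returns exactly 32 hex characters, the width of a uuid, so PostgreSQL will cast the hash text directly (a sketch; the concatenated literal stands in for the real key columns):

```sql
-- Store the md5 of the composite key as a uuid; 16 bytes, fixed
-- width, and fully supported by btree and hash indexes.
SELECT md5('1|2|some name')::uuid;
```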

Hope that helps,
-Adam


Re: Question regarding fast-hashing in PGSQL

2019-09-18 Thread Stephen Conley
This should work perfectly for me.  Thank you so much!

On Wed, Sep 18, 2019 at 12:50 PM Adam Brusselback wrote:

> I've had a similar issue in the past.
>
> I used the md5 hash function and stored it in a UUID column for my
> comparisons. Bigger than a bigint, but still much faster than string
> comparisons directly for my use case.
> UUID works fine for storing md5 hashes and gives you the ability to
> piggyback on all the index support built for them.
>
> Hope that helps,
> -Adam
>


pg12 - migrate tables to partitions structure

2019-09-18 Thread Mariel Cherkassky
Hey,
Thanks to the new partitioning features in pg12 (referencing a partitioned
table is now possible), I am trying to migrate some of my tables to a
partitioned structure.

Let's assume I have the following non-partitioned structure:

Product(id int PK, vendor int references Vendor(id), price int)
ProductPic(picId int PK, product int references Product(id))
Vendor(id int PK, name text)
... and more tables that have references to Product(id).

I understand that the PK on the Product table must also include the
partition column in order to ensure uniqueness across all the
partitions. However, on the other hand, I'll need to add the partition
column (vendor) to all the tables that have a reference to Product(id), and
update that column with the relevant data. This type of maintenance
requires a lot of time because I have a lot of references to the Product
table. Is there any other option in PG12 to allow references to a partitioned
table?
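The constraint being described can be sketched in DDL like this, using the example schema's names (PG12 syntax; details such as the partition method are assumed):

```sql
-- The partition key (vendor) must be part of the primary key,
-- assuming Vendor already exists as in the example schema.
CREATE TABLE Product (
    id     int,
    vendor int REFERENCES Vendor(id),
    price  int,
    PRIMARY KEY (id, vendor)
) PARTITION BY LIST (vendor);

-- Consequently, every referencing table must carry vendor as well
-- so its foreign key can match the composite primary key.
CREATE TABLE ProductPic (
    picId   int PRIMARY KEY,
    product int,
    vendor  int,
    FOREIGN KEY (product, vendor) REFERENCES Product (id, vendor)
);
```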


Re: pg12 - migrate tables to partitions structure

2019-09-18 Thread Michael Lewis
 How many rows are you dealing with currently? What are your queries like?
Have you looked at doing a hash partition on product.id? Is this on a test
system or destined for a production environment in the near future? I ask
because PG12 is still in beta.


Re: pg12 - migrate tables to partitions structure

2019-09-18 Thread Mariel Cherkassky
Hey Michael,
first of all thanks for the quick response.
Right now the production env is on a different version (10). I'm doing all
my tests on a test environment. I'm familiar with hash partitions, but
my queries don't involve product.id, therefore it isn't relevant. All
the queries use the product's vendor, and that's why this column is a perfect
fit as a partition column.
My main table (Product) is big (10M+ rows), but other tables can also be
big (1M+).



Re: pg12 - migrate tables to partitions structure

2019-09-18 Thread Michael Lewis
>
> All the queries uses the vendor product and thats why this column is a
> perfect fit as a partition column.
> My main table is big (10M+) (Product), but other tables can also be
> big(1M+)..
>

I assume you have query performance problems and are hoping partitioning
will help? Are you read heavy, or write intensive, or both? 10 million
rows, especially if they aren't super wide, doesn't seem like a huge number
to me. Do you have example queries with explain plans that you think would
benefit from the system being partitioned? I just know that as an engineer,
sometimes I like to make use of new tools, even when it isn't the best
solution for the problem I am actually experiencing. How confident are you
that you NEED partitions is my real question.


Re: pg12 - migrate tables to partitions structure

2019-09-18 Thread Mariel Cherkassky
Well, if you have 10M rows, all your queries use the same column in the
query, and the data can split pretty evenly between the partitions, any
specific reason not to use it? An index will get you a complexity of
O(log n), while partition + index can be O(log m), where m = rows in a
partition and n = total rows.



Re: Question regarding fast-hashing in PGSQL

2019-09-18 Thread Tom Lane
Stephen Conley  writes:
> My idea was to hash the string to a bigint, because the likelihood of all 3
> columns colliding is almost 0, and if a duplicate does crop up, it isn't
> the end of the world.

> However, Postgresql doesn't seem to have any 'native' hashing calls that
> result in a bigint.

regression=# \df hashtext*
                                List of functions
   Schema   |       Name       | Result data type | Argument data types | Type 
------------+------------------+------------------+---------------------+------
 pg_catalog | hashtext         | integer          | text                | func
 pg_catalog | hashtextextended | bigint           | text, bigint        | func
(2 rows)

The "extended" hash API has only been there since v11, so you
couldn't rely on it if you need portability to old servers.
But otherwise it seems to respond precisely to your question.
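Usage is direct (the second argument is a 64-bit seed; 0 works, and the literal here is just a placeholder for the name column):

```sql
-- hashtextextended(text, seed bigint) returns bigint, v11+.
SELECT hashtextextended('some name', 0);
```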

If you do need portability ... does the text string's part of the
hash *really* have to be 64 bits wide?  Why not just concatenate
it with a 32-bit hash of the other fields?
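One way to sketch that concatenation, with hypothetical columns x and name (the bit layout is an arbitrary choice):

```sql
-- Pack one 32-bit integer column into the high half of a bigint and
-- a 32-bit hashtext() of the name (masked to unsigned) into the low half.
SELECT (x::bigint << 32) | (hashtext(name)::bigint & 4294967295)
FROM (VALUES (7, 'some name')) AS t(x, name);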

regards, tom lane




Re: pg12 - migrate tables to partitions structure

2019-09-18 Thread Michael Lewis
Is this being done because it can be, or is it solving a real-life pain
point? Just wondering what the perspective is here.

Much of partitioning strategy seems to me to revolve around how the system
is used, and not just the schema and what is possible. For instance, you
can mimic primary and foreign key behavior with triggers as described here,
and that would bypass some of the restrictions on what can be done.
https://www.depesz.com/2018/10/02/foreign-key-to-partitioned-table/
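A minimal sketch of that trigger idea, simplified from the linked post (table and column names assumed; no handling of concurrency or cascading deletes):

```sql
-- FK-like existence check on insert/update, enforced by trigger
-- instead of a declared foreign key constraint.
CREATE FUNCTION check_product_exists() RETURNS trigger AS $$
BEGIN
    PERFORM 1 FROM product WHERE id = NEW.product;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'product % does not exist', NEW.product;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER product_fk_check
    BEFORE INSERT OR UPDATE ON productpic
    FOR EACH ROW EXECUTE FUNCTION check_product_exists();
```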

This would allow you to change out the primary key for a simple index
perhaps, and partition however you want. Just because something can be
done, does not mean it should be.
