Re: bigint out of range

2019-05-19 Thread Peter J. Holzer
On 2019-05-18 19:16:19 -0500, Ron wrote:
> On 5/18/19 5:39 PM, Peter J. Holzer wrote:
> > On 2019-05-18 17:14:59 -0500, Ron wrote:
> > > On 5/18/19 3:49 PM, Peter J. Holzer wrote:
> > >  On 2019-05-18 15:19:22 -0500, Ron wrote:
> > >  On 5/18/19 2:27 PM, Peter J. Holzer wrote:
> > >  On 2019-05-18 10:49:53 -0700, David G. Johnston wrote:
> > > 
> > >  You don’t perform math on a hash
> > > 
> > >  That's not generally true. Hashes are used for further computation,
> > >  for example in hash tables or in cryptography.
> > > 
> > >  How is it "using math" to use a hash key in a hash lookup table?
> > > 
> > >  hash modulo table size.
> > > 
> > > 
> > > I've seen that used when the tablespace is pre-allocated, and you hash
> > > modulo the tablespace page number.  (Yes, performance tanks when you
> > > start filling up pages.)  How do you hash on the (ever growing) table
> > > size?
> > The hash function returns a number in a range much larger than the
> > possible number of buckets. 64 bits is a good choice today.
> > 
> > To determine the bucket you need to reduce this number to something in
> > the range [0, nr_buckets). This is where modulo comes in:
> > 
> > i = h % nr_buckets
> > 
> > If the table fills up, you increase nr_buckets, reallocate and
> > rehash all entries.
> 
> Ouch.  Response time on a big table would take a serious hit if that rehash
> happened in the middle of the day on a big OLTP system.

So that might be a reason not to use hash indexes on tables where the
worst-case insert time is a concern (the average insert time is still O(1)).

But please be aware that I answered your question 'How is it "using
math" to use a hash key?', not 'How are hash indexes in PostgreSQL
implemented?'. So my answer covered the most simple and generic
implementation.
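
To make that generic scheme concrete, here is a minimal C sketch (my own
illustration, not PostgreSQL code): a bucket is picked with a simple modulo,
and growing the table means allocating a larger bucket array and rehashing
every entry into it - which is exactly the expensive step you are worried
about.

    /* Minimal sketch of the generic scheme above (illustration only):
     * pick a bucket with "hash % nr_buckets"; when the table gets too
     * full, allocate a bigger bucket array and rehash every entry. */
    #include <stdint.h>
    #include <stdlib.h>

    struct entry {
        uint64_t      hash;   /* full 64-bit hash of the key */
        struct entry *next;   /* chaining within a bucket; payload omitted */
    };

    struct table {
        struct entry **buckets;
        size_t         nr_buckets;
        size_t         nr_entries;
    };

    static size_t bucket_of(uint64_t hash, size_t nr_buckets)
    {
        return hash % nr_buckets;        /* the "math" in question */
    }

    /* The expensive step: O(n), touches every entry at once. */
    static void rehash(struct table *t)
    {
        size_t         new_nr   = t->nr_buckets * 2;
        struct entry **new_bkts = calloc(new_nr, sizeof(*new_bkts));

        for (size_t i = 0; i < t->nr_buckets; i++) {
            for (struct entry *e = t->buckets[i], *next; e; e = next) {
                next = e->next;
                size_t b = bucket_of(e->hash, new_nr);
                e->next = new_bkts[b];
                new_bkts[b] = e;
            }
        }
        free(t->buckets);
        t->buckets    = new_bkts;
        t->nr_buckets = new_nr;
    }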

You can split buckets lazily, and in fact PostgreSQL does this. I just
looked at the code briefly, but I don't really understand how it works.
Guess I would have to read Margo Seltzer's paper for that.

Still, even with that optimization (or tradeoff - it may make lookups
slower), the README notes that splits are expensive.
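
For what it's worth, the lazy scheme is, as I understand it, based on linear
hashing (Litwin's algorithm, which Seltzer's package builds on): only one
bucket is split per growth step, so no single insert ever rehashes the whole
table. A rough sketch of the core idea - emphatically not the actual
PostgreSQL code:

    /* Plain linear hashing, sketched: with 'level' low-order hash bits
     * in use and 'split' pointing at the next bucket due to be split,
     * the table has (1 << level) + split buckets in total. */
    #include <stdint.h>

    struct lh_state {
        unsigned level;      /* low-order hash bits currently in use */
        uint64_t split;      /* next bucket due to be split */
    };

    static uint64_t lh_bucket(const struct lh_state *s, uint64_t hash)
    {
        uint64_t b = hash & ((UINT64_C(1) << s->level) - 1);

        /* Buckets below the split pointer have already been split, so
         * for them one more hash bit decides old vs. new bucket. */
        if (b < s->split)
            b = hash & ((UINT64_C(1) << (s->level + 1)) - 1);
        return b;
    }

    /* When the table gets too full, split just bucket 'split': its
     * entries are redistributed between 'split' and 'split + 2^level'
     * using the extra hash bit, then the pointer advances. */
    static void lh_advance(struct lh_state *s)
    {
        s->split++;
        if (s->split == (UINT64_C(1) << s->level)) {
            s->level++;
            s->split = 0;
        }
    }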

hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at        | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson




Re: bigint out of range

2019-05-19 Thread Ron

On 5/19/19 5:43 AM, Peter J. Holzer wrote:
[snip]

But please be aware that I answered your question 'How is it "using
math" to use a hash key?', not 'How are hash indexes in PostgreSQL
implemented?'. So my answer covered the most simple and generic
implementation.


I understand.

--
Angular momentum makes the world go 'round.




Re: bigint out of range

2019-05-19 Thread Stephen Frost
Greetings,

* David G. Johnston (david.g.johns...@gmail.com) wrote:
> On Thu, May 16, 2019 at 8:31 AM Daulat Ram wrote:
> 
> > url_hash| bigint  |   | not null |
> 
> Change the type of url_hash; make it text instead of bigint.

Making it text wastes a bunch of space actually, since it's really a
binary value.  I tend to recommend using a bytea for storing hashes.
Given that hashes are really fixed width, it'd be nice if we had a set
of proper datatypes for them, perhaps not so much to avoid the 1-byte
overhead from storing as a variable-length bytea, but because we could
then avoid having a 7-byte hole due to padding if the hash is followed
by a bigint or such.
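
The 7-byte hole is the same alignment effect you see in C structs: a 9-byte
field (1-byte length header plus 8 hash bytes) followed by an 8-byte-aligned
field leaves 7 bytes of padding.  A small illustration of that arithmetic
(an analogy only, not the actual tuple layout code; sizes assume a typical
64-bit platform):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    struct row_like {
        unsigned char hash[9];  /* 1-byte header + 8 hash bytes */
        int64_t next_col;       /* "bigint" needing 8-byte alignment */
    };

    int main(void)
    {
        /* offsetof(next_col) is 16, not 9: 7 bytes lost to padding. */
        printf("offset of next_col: %zu, sizeof struct: %zu\n",
               offsetof(struct row_like, next_col),
               sizeof(struct row_like));
        return 0;
    }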

Thanks,

Stephen




Re: bigint out of range

2019-05-19 Thread Stephen Frost
Greetings,

* Peter J. Holzer (hjp-pg...@hjp.at) wrote:
> On 2019-05-18 19:16:19 -0500, Ron wrote:
> > > If the table fills up, you increase nr_buckets, reallocate and
> > > rehash all entries.
> > 
> > Ouch.  Response time on a big table would take a serious hit if that rehash
> > happened in the middle of the day on a big OLTP system.

As noted below, that isn't actually how it works with PG's hash indexes.

> So that might be a reason not to use hash indexes on tables where the
> worst-case insert time is a concern (the average insert time is still O(1)).

The worst-case insert time is certainly worse than the average but it's
not nearly as bad as being imagined here.

> You can split buckets lazily, and in fact PostgreSQL does this. I just
> looked at the code briefly, but I don't really understand how it works.
> Guess I would have to read Margo Seltzer's paper for that.
> 
> Still, even with that optimization (or tradeoff - it may make lookups
> slower), the README notes that splits are expensive.

There are multiple levels, really, from "best case" where the
new item can just be placed directly on to a page that has free space,
to "slightly worse case" where an overflow page has to be added, to
"even worse case" where a prior page split has to be finished, to
"probably worst case" where a full page split has to happen.  There
might even be some pathological cases where a prior page split has to be
finished and then an overflow page has to be added and then another page
split has to be done, or some such.  Even so, for equality-based
lookups (where the data set has few or zero duplicates), hash indexes
can work quite well, but it's of course good to understand that inserts
aren't always going to be exactly the same speed every time.

Thanks,

Stephen

