Re: Optimizing `WHERE x IN` query

2019-07-09 Thread Omar Roth
The suggested query indeed appears to be faster. Thank you.

> Did you try if a properly normalized model performs better?

I've tested the schema below, which doesn't appear to perform much better but
has a couple of other advantages for my application:

```
create table subscriptions (
    email text references users(email),
    ucid  text,
    primary key (email, ucid)
);

explain (analyze, buffers)
select cv.*
from channel_videos cv, subscriptions s
where cv.ucid = s.ucid and s.email = $1;
```

Is there something else here I'm missing?
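
[Editor's note: one possibility not mentioned in the thread. Since the original feed query also sorts by `published DESC`, an index covering both the join key and the sort column might help. This is only a sketch; it assumes `channel_videos` has a `published` timestamp column, as implied by the earlier query:

```sql
-- Hypothetical supporting index for the join + ORDER BY published DESC;
-- column names are taken from the queries quoted in this thread.
create index if not exists channel_videos_ucid_published_idx
    on channel_videos (ucid, published desc);
```

Whether the planner uses it depends on selectivity, so checking with `explain (analyze, buffers)` as above is advisable.]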

Re: Optimizing `WHERE x IN` query

2019-07-09 Thread Nicolas Charles

Le 07/07/2019 à 16:33, Thomas Kellerer a écrit :

Omar Roth schrieb am 07.07.2019 um 15:43:

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos
WHERE ucid IN (SELECT unnest(subscriptions) FROM users WHERE email = $1)
ORDER BY published DESC;
```


You could try an EXISTS query without unnest:

select cv.*
from channel_videos cv
where exists (select *
              from users u
              where cv.ucid = any(u.subscriptions)
                and u.email = $1);

Did you try if a properly normalized model performs better?



Hi


We had big performance issues with queries like that, and we modified
them to use && (see
https://www.postgresql.org/docs/current/functions-array.html ),
resulting in a big performance boost.


So, with your model, the query could be:

```
select cv.*
from channel_videos cv
inner join users u on array[cv.ucid] && u.subscriptions
where u.email = $1;
```

or

```
select cv.*
from channel_videos cv
inner join (select subscriptions from users where email = $1) as u
  on array[cv.ucid] && u.subscriptions;
```

(Disclaimer: I didn't try these queries; they may contain typos.)
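
[Editor's note: for readers unfamiliar with `&&`, it is PostgreSQL's array-overlap operator, true when the two arrays share at least one element (see the array functions page linked above). Since `ucid` is a scalar, it has to be wrapped in an array for `&&` to apply. A minimal illustration, runnable in psql:

```sql
-- && is true when the arrays share at least one element
select array['a','b'] && array['b','c'];  -- true
select array['a','b'] && array['c','d'];  -- false
-- wrapping a scalar to test membership via overlap:
select array['a'] && array['a','x'];      -- true
```
]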


Regards

Nicolas

Re: UUID v1 optimizations...

2019-07-09 Thread Ancoron Luciferis
On 08/07/2019 02:26, Peter Geoghegan wrote:
> Please don't top post -- trim your response down so that only
> still-relevant text remains.
> 
> On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
>  wrote:
>> Primary key indexes after an ANALYZE:
>>
>>  table_name        | bloat          | index_mb | table_mb
>> -------------------+----------------+----------+----------
>>  uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
>>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>>  uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
>>  uuid_serial      | 700 MiB (47 %) | 1504.047 | 1689.195
>>
>> OK, sadly no reclaim in any of them.
> 
> I don't know how you got these figures, but most likely they don't
> take into account the fact that the FSM for the index has free blocks
> available. You'll only notice that if you have additional page splits
> that can recycle that space. Or, you could use pg_freespacemap to get
> some idea.
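
[Editor's note: the pg_freespacemap contrib extension mentioned above can be queried roughly like this (a sketch; `uuid_v1_pkey` is a placeholder for the actual primary key index name):

```sql
-- pg_freespace(regclass) returns one row per block with its free space.
create extension if not exists pg_freespacemap;

select count(*) as blocks_with_free_space,
       pg_size_pretty(sum(avail)::bigint) as total_free
from pg_freespace('uuid_v1_pkey')
where avail > 0;
```
]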

Hm, I've already read quite a bit about the internals of the PG B-tree
index implementation, but I still can't see how I could influence that
on my end: I want to stay compatible with the standard UUID data storage
but need time-sorting support.

Anyway, I've made a bit of progress in testing and now have the full
tests executing unattended with the help of a script:
https://github.com/ancoron/pg-uuid-test

I've uploaded one of the test run results here:
https://gist.github.com/ancoron/d5114b0907e8974b6808077e02f8d109

After the first mass deletion, I can now see quite some savings for both
the serial case and my new time-sorted index:

 table_name  | bloat           | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1500 MiB (48 %) | 3106.406 | 3378.383
 uuid_serial | 800 MiB (33 %)  | 2406.453 | 3378.383
 uuid_v1_ext | 800 MiB (33 %)  | 2406.453 | 3378.383

...but in a second case (DELETE old + INSERT new), the savings are gone
again in both cases:
 table_name  | bloat           | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1547 MiB (49 %) | 3153.859 | 3378.383
 uuid_serial | 1402 MiB (47 %) | 3008.055 | 3378.383
 uuid_v1_ext | 1403 MiB (47 %) | 3008.055 | 3378.383

So, the question for me would be: is there any data layout that is
optimal for space savings in a rolling (e.g. keep the last X rows) scenario?
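
[Editor's note: one approach commonly used for rolling last-X retention, not suggested in this thread but relevant to the question, is time-based partitioning: dropping an old partition discards its heap and index pages together, leaving no bloat behind. An illustrative sketch with made-up names (PostgreSQL 10+ declarative partitioning):

```sql
-- Hypothetical rolling-retention layout: range-partitioned by month.
create table events (
    id uuid        not null,
    ts timestamptz not null
) partition by range (ts);

create table events_2019_07 partition of events
    for values from ('2019-07-01') to ('2019-08-01');

-- Retiring a month drops heap and index space at once (assuming the
-- partition exists), instead of leaving half-empty B-tree pages:
-- drop table events_2019_06;
```
]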

> 
>> 5.) REINDEX
>> Table: uuid_v1           Time: 21549.860 ms (00:21.550)
>> Table: uuid_v1_timestamp Time: 27367.817 ms (00:27.368)
>> Table: uuid_seq          Time: 19142.711 ms (00:19.143)
>> Table: uuid_serial       Time: 16889.807 ms (00:16.890)
>>
>> Even in this case it looks as if my implementation is faster than
>> anything else - which I really don't get.
> 
> Sorting already-sorted data is faster. CREATE INDEX is mostly a big
> sort operation in the case of B-Tree indexes.

Understood, this seems to be confirmed by my time-sorted index in the
new tests:
uuid_v1: 27632.660 ms (00:27.633)
uuid_serial: 20519.363 ms (00:20.519) x1.35
uuid_v1_ext: 23846.474 ms (00:23.846) x1.16
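
[Editor's note: the effect of pre-sorted input on index build time is easy to reproduce in psql (an illustrative sketch, not from the thread; table names are made up and timings will vary by machine):

```sql
-- Building a B-tree over already-sorted input needs little sorting work;
-- random input forces a full sort.
create table sorted_t as
    select g as v from generate_series(1, 1000000) g;
create table random_t as
    select (random() * 1000000)::int as v from generate_series(1, 1000000);

\timing on
create index on sorted_t (v);  -- typically completes faster
create index on random_t (v);
```
]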

> 
>> I might implement a different opclass for the standard UUID to enable
>> time-wise index sort order. This will naturally be very close to
>> physical order but I doubt that this is something I can tell PostgreSQL, or?
> 
> PostgreSQL only knows whether or not your page splits occur in the
> rightmost page in the index -- it fills the page differently according
> to whether or not that is the case.
> 

Now that I've implemented the new opclass and the new tests show the
results, I think I can say that the time-sorted behavior, as opposed to
essentially random order, really benefits overall performance, which is
what I actually care about most.


Cheers,

Ancoron

Re: Custom opclass for column statistics?

2019-07-09 Thread Ancoron Luciferis
On 06/07/2019 15:58, Tom Lane wrote:
> Ancoron Luciferis  writes:
>> I've been wondering whether it is possible somehow to have the standard
>> column statistics to respect a certain operator class?
> 
> In principle, pg_statistic can represent stats for a non-default opclass.
> Teaching ANALYZE to collect such stats when appropriate, and then teaching
> the planner to use them when appropriate, is left as an exercise for the
> reader.

Hehe, now that you say it, I realize what I was actually asking
for with this... ;)

> 
> I think the "when appropriate" bit is actually the hardest part of that.
> Possibly, if you were satisfied with a relatively manual approach,
> you could proceed by using CREATE STATISTICS to declare interest in
> keeping standard stats for a non-default sort order.  Not sure what to
> do if you want it to be automatic, because I don't think people would
> hold still for having ANALYZE collect stats for any random non-default
> opclass automatically.  Maybe a new opclass property?

I totally agree with the complications around all that.

Now I think that if I want better statistics and better plans for my new
time-sorted index, I will need a new data type for which I can set the
opclass as the default, which would also give users the guarantee that
they'll get what they expect.

Thanx and cheers,

Ancoron