Optimizing `WHERE x IN` query

2019-07-07 Thread Omar Roth
Hi!

I am attempting to replicate YouTube's subscription feed. Each user has a list 
of channel IDs (as text[]) that they are subscribed to. The `users` table 
looks like:

```
=# \d users
 Table "public.users"
  Column   |   Type   | Collation | Nullable | Default
---+--+---+--
+-
 email | text |   | not null |
 subscriptions | text[]   |   |  |
 feed_needs_update | boolean  |   |  |
Indexes:
"users_pkey" PRIMARY KEY, btree (email)
"email_unique_idx" UNIQUE, btree (lower(email))
"users_email_key" UNIQUE CONSTRAINT, btree (email)
```

For storing videos, I have a table `channel_videos` from which I want to 
select all videos where the video's `ucid` (channel ID) is in the user's 
subscriptions. `channel_videos` looks like:

```
=# \d channel_videos
 Table "public.channel_videos"
   Column   |   Type   | Collation | Nullable | 
Default
+--+---+--
+-
 id | text |   | not null |
 title  | text |   |  |
 published  | timestamp with time zone |   |  |
 updated| timestamp with time zone |   |  |
 ucid   | text |   |  |
 author | text |   |  |
 views  | bigint   |   |  |
Indexes:
"channel_videos_id_key" UNIQUE CONSTRAINT, btree (id)
"channel_videos_published_idx" btree (published)
"channel_videos_ucid_idx" btree (ucid)
```

In case it might help with indexing, a UCID always begins with `UC`, then 22 
random characters in [a-zA-Z0-9_-] (Base64-urlsafe).

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM 
users WHERE email = $1) ORDER BY published DESC;
```

This works great when `subscriptions` and `channel_videos` are both fairly 
small. Unfortunately, at this point `channel_videos` contains roughly 28M 
rows. Attempting to generate a feed for a user with 177 subscriptions takes 
around 40-70s, here's the relevant `EXPLAIN (ANALYZE, BUFFERS)`:

```
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN 
(SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth') ORDER BY 
published DESC;
  QUERY 
PLAN
--
 Sort  (cost=478207.59..478506.73 rows=119656 width=142) (actual 
time=68599.562..68613.824 rows=46456 loops=1)
   Sort Key: channel_videos.published DESC
   Sort Method: external merge  Disk: 6352kB
   Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794 
written=797
   ->  Nested Loop  (cost=55.94..459526.49 rows=119656 width=142) (actual 
time=0.555..68531.550 rows=46456 loops=1)
 Buffers: shared hit=453 read=13454 dirtied=13 written=456
 ->  HashAggregate  (cost=10.18..11.18 rows=100 width=32) (actual 
time=0.417..0.850 rows=177 loops=1)
   Group Key: unnest(users.subscriptions)
   Buffers: shared hit=39 read=7
   ->  ProjectSet  (cost=0.41..8.93 rows=100 width=32) (actual 
time=0.305..0.363 rows=177 loops=1)
 Buffers: shared hit=39 read=7
 ->  Index Scan using users_email_key on users  
(cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1)
   Index Cond: (email = 'omarroth'::text)
   Buffers: shared hit=5 read=4
 ->  Bitmap Heap Scan on channel_videos  (cost=45.76..4583.18 
rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177)
   Recheck Cond: (ucid = (unnest(users.subscriptions)))
   Heap Blocks: exact=12808
   Buffers: shared hit=414 read=13447 dirtied=13 written=456
   ->  Bitmap Index Scan on channel_videos_ucid_idx  
(cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263 
loops=177)
 Index Cond: (ucid = (unnest(users.subscriptions)))
 Buffers: shared hit=291 read=762 written=27
 Planning Time: 1.463 ms
 Execution Time: 68619.316 ms
(23 rows)

```

Since a feed is expected to be fairly consistent to what would appear on 
YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is 
vacuumed fairly frequently:

```
=# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos';
 relid | schemaname |relname

Re: Optimizing `WHERE x IN` query

2019-07-07 Thread Thomas Kellerer

Omar Roth schrieb am 07.07.2019 um 15:43:

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
users WHERE email = $1) ORDER BY published DESC;
```


You could try an EXISTS query without unnest:

select cv.*
from channel_videos cv
where exists ucid (select *
   from users u
   where cv.ucid = any(u.subscriptions)
 and u.email = $1);

Did you try if a properly normalized model performs better?





Re: Custom opclass for column statistics?

2019-07-07 Thread Tom Lane
Ancoron Luciferis  writes:
> I've been wondering whether it is possible somehow to have the standard
> column statistics to respect a certain operator class?

In principle, pg_statistic can represent stats for a non-default opclass.
Teaching ANALYZE to collect such stats when appropriate, and then teaching
the planner to use them when appropriate, is left as an exercise for the
reader.

I think the "when appropriate" bit is actually the hardest part of that.
Possibly, if you were satisfied with a relatively manual approach,
you could proceed by using CREATE STATISTICS to declare interest in
keeping standard stats for a non-default sort order.  Not sure what to
do if you want it to be automatic, because I don't think people would
hold still for having ANALYZE collect stats for any random non-default
opclass automatically.  Maybe a new opclass property?

regards, tom lane




Re: UUID v1 optimizations...

2019-07-07 Thread Peter Geoghegan
Please don't top post -- trim the your response down so that only
still-relevant text remains.

On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
 wrote:
> Primary key indexes after an ANALYZE:
> table_name | bloat  | index_mb | table_mb
> ---++--+--
>  uuid_v1   | 767 MiB (49 %) | 1571.039 | 1689.195
>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>  uuid_seq  | 759 MiB (49 %) | 1562.766 | 1689.195
>  uuid_serial   | 700 MiB (47 %) | 1504.047 | 1689.195
>
> OK, sadly no reclaim in any of them.

I don't know how you got these figures, but most likely they don't
take into account the fact that the FSM for the index has free blocks
available. You'll only notice that if you have additional page splits
that can recycle that space. Or, you could use pg_freespacemap to get
some idea.

> 5.) REINDEX
> Table: uuid_v1  Time: 21549.860 ms (00:21.550)
> Table: uuid_v1_timestampTime: 27367.817 ms (00:27.368)
> Table: uuid_seq Time: 19142.711 ms (00:19.143)
> Table: uuid_serial  Time: 16889.807 ms (00:16.890)
>
> Even in this case it looks as if my implementation is faster than
> anything else - which I really don't get.

Sorting already-sorted data is faster. CREATE INDEX is mostly a big
sort operation in the case of B-Tree indexes.

> I might implement a different opclass for the standard UUID to enable
> time-wise index sort order. This will naturally be very close to
> physical order but I doubt that this is something I can tell PostgreSQL, or?

PostgreSQL only knows whether or not your page splits occur in the
rightmost page in the index -- it fills the page differently according
to whether or not that is the case.

-- 
Peter Geoghegan