Re: Index for range queries on JSON (user defined fields)

2020-12-04 Thread Nick Cleaton
On Fri, 4 Dec 2020 at 15:39, Marco Colli  wrote:

> Hello!
>
> We have a multi-tenant service where each customer has millions of users
> (total: ~150M rows). Now we would like to let each customer define some
> custom columns for his users and then let the customer search his users
> efficiently based on these columns.
>
> This problem seems really hard to solve with PostgreSQL:
>
> https://stackoverflow.com/questions/5106335/how-to-design-a-database-for-user-defined-fields
>
> In particular the easiest way would be to add a JSON field on the users
> table (e.g. user metadata). However the PostgreSQL GIN index only supports
> exact matches and not range queries. This means that a query on a range
> (e.g. age > 30) would be extremely inefficient and would result in a table
> scan.
>

You could have a table of (tenant, customer, setting_name, setting_value)
so that a btree index on (tenant, setting_name, setting_value) would work
for "select customer from my_table where tenant=$1 and setting_name='age'
and setting_value > 30"

That doesn't deal with setting values having a variety of types, but you
could have a distinct user defined settings table for each setting value
type that you want to support.
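
For example, one table per value type, each carrying the same style of index (names invented again):

    create table user_settings_int
        (tenant int8, customer int8, setting_name text, setting_value int8);
    create table user_settings_text
        (tenant int8, customer int8, setting_name text, setting_value text);

    create index on user_settings_int  (tenant, setting_name, setting_value);
    create index on user_settings_text (tenant, setting_name, setting_value);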


Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 09:11, Les  wrote:

> ->  Index Only Scan using oo_file_idx_relpath on media.oo_file f2  (cost=0.55..108499.27 rows=5381 width=0) (actual time=564.756..564.756 rows=0 loops=45)
>       Filter: (f2.relpath ~~ (f.relpath || '%'::text))
>       Rows Removed by Filter: 792025
>       Heap Fetches: 768960
>       Buffers: shared hit=7014130
> Planning Time: 0.361 ms
> Execution Time: 25415.088 ms


> ->  Seq Scan on media.oo_file of2  (cost=0.00..144714.70 rows=86960 width=0) (actual time=0.044..0.044 rows=1 loops=1)
>       Filter: (of2.relpath ~~ 'Felhasználók%'::text)
>       Rows Removed by Filter: 15
>       Buffers: shared hit=2
> Planning Time: 0.290 ms
> Execution Time: 0.076 ms
>
> In other words, I could write a pl/pgsql function with a nested loop instead
> of the problematic query, and it would be 1000 times faster.
>
> What am I missing?

In the fast case, the 'Felhasználók%' pattern is known at query planning
time, so the planner can turn it into a prefix search.

In the slow case, the planner doesn't know what the pattern will be; it
could be something that starts with '%', for example.
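
Roughly speaking, with a constant prefix the planner can rewrite the pattern match as an index-friendly range condition (given a compatible collation or a text_pattern_ops index):

    where relpath like 'Felhasználók%'
    -- can be planned, approximately, as:
    where relpath >= 'Felhasználók' and relpath < 'Felhasználól'

No such rewrite is possible when the prefix only becomes known at run time.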

Also, your logic looks a bit unsafe: the query as written would include
files under every top-level folder whose name merely starts with
Felhasználók, so you could accidentally merge in files from folders
called Felhasználókfoo and Felhasználókbar, for example.
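
If the intent is "everything under this folder", anchoring the pattern on the path separator would avoid that; assuming '/' is the separator:

    where f2.relpath like f.relpath || '/%'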




Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 10:09, Les  wrote:
>
> Oh I see, the query planner does not know that there will be no % characters 
> in file and folder names.
>
> But what is the solution then? It just seems wrong that I can speed up a
> query 1000 times by replacing it with a nested loop in a pl/pgsql function :(

You don't need a nested loop; doing it in two stages in pl/pgsql would
be enough, I think: first get the folder name, then construct a new
query that uses it as a constant.

I'd use SELECT FOR SHARE when getting the folder name, so that no
other process can change it underneath you before you run your second
query.
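
A minimal sketch of the two-stage approach (the media.oo_folder table and the function name are assumptions on my part); format() with %L embeds the fetched path as a literal, so the planner sees a constant prefix when the second query is planned:

    create or replace function media.file_ids_under_folder(p_folder_id int8)
    returns setof int8
    language plpgsql
    as $$
    declare
        v_relpath text;
    begin
        -- Stage 1: fetch the folder path, locking the row (FOR SHARE) so
        -- no other process can change it before the second query runs.
        select fo.relpath into strict v_relpath
          from media.oo_folder fo
         where fo.id = p_folder_id
           for share;

        -- Stage 2: the prefix is now a constant in the statement text,
        -- so the planner can treat the LIKE as a prefix search.
        return query execute format(
            'select fi.id from media.oo_file fi
              where fi.is_active and fi.relpath like %L',
            v_relpath || '/%');
    end;
    $$;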

With the ^@ operator, my guess is that because the planner knows
nothing about the folder name value, it has to allow for it being the
empty string, which is a prefix of everything.




Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 12:27, Les  wrote:

> PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan for 
> A, I, C letters (with the "like" query).

That's interesting.

Does it help if you create an additional index on relpath with the
text_pattern_ops modifier, e.g.

CREATE INDEX ... USING btree (relpath text_pattern_ops);
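
Spelled out with an invented index name, that would be something like:

    CREATE INDEX oo_file_relpath_pattern_idx
        ON media.oo_file USING btree (relpath text_pattern_ops);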




Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 13:07, Les  wrote:
>
>>
>> > PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan 
>> > for A, I, C letters (with the "like" query).
>>
>> That's interesting.
>>
>> Does it help if you create an additional index on relpath with the
>> text_pattern_ops modifier, e.g.
>>
>> CREATE INDEX ... USING btree (relpath text_pattern_ops);
>
>
> It does not help.

What if you try applying the C collation to the values from the table:

where fi.is_active and fi.relpath collate "C" ^@ 'A'
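
That is, keeping the shape of your query, something like:

    select fi.id
      from media.oo_file fi
     where fi.is_active
       and fi.relpath collate "C" ^@ 'A'
     limit 1;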




Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 13:21, Les  wrote:
>
>> What if you try applying the C collation to the values from the table:
>>
>> where fi.is_active  and fi.relpath collate "C" ^@ 'A'
>
>
> Slow

What about this:

fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")




Re: Terribly slow query with very good plan?

2022-02-04 Thread Nick Cleaton
On Fri, 4 Feb 2022 at 14:07, Les  wrote:
>
>
>
>> > Slow
>>
>> What about this:
>>
>> fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")
>
>
> It uses index scan.

> Although the same with 'Természettudomány' uses seq scan:
>
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> select fi.id from media.oo_file fi
> where fi.is_active
> and fi.relpath between
> ('Természettudomány' collate "C")
> and ('Természettudomány'||chr(255) collate "C")
> limit 1;
>
> QUERY PLAN
> ----------
> Limit  (cost=0.00..2.13 rows=1 width=8) (actual time=7521.531..7521.532 rows=0 loops=1)
>   Output: id
>   Buffers: shared hit=17018 read=150574
>   ->  Seq Scan on media.oo_file fi  (cost=0.00..188195.39 rows=88290 width=8) (actual time=7521.528..7521.529 rows=0 loops=1)
>         Output: id
>         Filter: (fi.is_active AND (fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudományÿ'::text COLLATE "C"))
>         Rows Removed by Filter: 1075812
>         Buffers: shared hit=17018 read=150574
> Planning Time: 8.918 ms
> Execution Time: 7521.560 ms

That may be because it's expecting to get 88290 matching rows from the
sequential scan, and the "limit 1" makes the planner expect the scan to
stop early: on average it should only need to read about 1/88290 of the
table before finding the first match (hence the 2.13 total cost on the
Limit node, roughly 188195.39 / 88290), and then it can stop.

Try it without the "limit 1"
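
For example:

    EXPLAIN (ANALYZE, BUFFERS)
    select fi.id from media.oo_file fi
    where fi.is_active
      and fi.relpath between
          ('Természettudomány' collate "C")
          and ('Természettudomány'||chr(255) collate "C");

Without the LIMIT the planner expects to return all ~88290 estimated
rows, which should make the index look cheaper than a full seq scan.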