Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNER Query )

Virendra Kumar Thu, 07 May 2020 12:31:26 -0700

Here is my thought on why row is not limiting when joined vs why it is limiting 
when not joined.
When not joined and where clause is having IN, it is using index 
process_activity_process_instance_id_app_id_created_idx which has columns 
process_instance_id, created which is in order by and hence no additional 
ordering is required and a direct rows limit can be applied here.


When in join condition it has to fetch rows according to filter clause, join 
them and then order ( sort node in plan) hence it cannot limit rows while 
fetching it first time from the table.
You are also missing pi.user_id = '317079413683604' in exists clause. It is 
worth trying to put there and run explain again and see where it takes. But to 
your point row limitation cannot happen in case of join as such in the query.

Regards,
Virendra 

    On Thursday, May 7, 2020, 11:52:00 AM PDT, Amarendra Konda 
<[email protected]> wrote:  
 
 Hi Virendra,
Thanks for your time. 
Here is the table and index structure
 process_activity
                            Table "public.process_activity"
       Column       |            Type             |         Modifiers          
--------------------+-----------------------------+----------------------------
 process_activity_id         | bigint                      | not null default 
next_id()
 process_activity_type       | smallint                    | not null
 voice_url          | text                        | 
 process_activity_user_id    | bigint                      | not null
 app_id             | bigint                      | not null
 process_instance_id    | bigint                      | not null
 alias              | text                        | not null
 read_by_user       | smallint                    | default 0
 source             | smallint                    | default 0
 label_category_id  | bigint                      | 
 label_id           | bigint                      | 
 csat_response_id   | bigint                      | 
 process_activity_fragments  | jsonb                       | 
 created            | timestamp without time zone | not null
 updated            | timestamp without time zone | 
 rule_id            | bigint                      | 
 marketing_reply_id | bigint                      | 
 delivered_at       | timestamp without time zone | 
 reply_fragments    | jsonb                       | 
 status_fragment    | jsonb                       | 
 internal_meta      | jsonb                       | 
 interaction_id     | text                        | 
 do_not_translate   | boolean                     | 
 should_translate   | integer                     | 
 in_reply_to        | jsonb                       | 
Indexes:
    "process_activity_pkey" PRIMARY KEY, btree (process_activity_id)
    "fki_process_activity_konotor_user_user_id" btree 
(process_activity_user_id) WITH (fillfactor='70')
    "process_activity_process_instance_id_app_id_created_idx" btree 
(process_instance_id, app_id, created) WITH (fillfactor='70')
    "process_activity_process_instance_id_app_id_read_by_user_created_idx" 
btree (process_instance_id, app_id, read_by_user, created) WITH 
(fillfactor='70')
    "process_activity_process_instance_id_idx" btree (process_instance_id) WITH 
(fillfactor='70')
 



process_instance
                             Table "public.process_instance"
         Column          |            Type             |          Modifiers     
     
-------------------------+-----------------------------+-----------------------------
 process_instance_id     | bigint                      | not null default 
next_id()
 process_instance_alias  | text                        | not null
 app_id                  | bigint                      | not null
 user_id                 | bigint                      | not null
 
Indexes:
    "process_instance_pkey" PRIMARY KEY, btree (process_instance_id)
    "fki_conv_konotor_user_user_id" btree (user_id) WITH (fillfactor='70')

Regards, Amarendra
On Fri, May 8, 2020 at 12:01 AM Virendra Kumar <[email protected]> wrote:

Sending table structure with indexes might help little further in understanding.

Regards,
Virendra
    On Thursday, May 7, 2020, 11:08:14 AM PDT, Amarendra Konda 
<[email protected]> wrote:  
 
 Hi David,
In earlier reply, Over looked another condition, hence please ignore that one
Here is the correct one with all the needed conditions. According to the latest 
one, exists also not limiting rows from the process_activity table.

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  SELECT pa.process_activity_id AS 
pa_process_activity_id  FROM process_activity pa WHERE pa.app_id = 
'126502930200650' AND pa.created > '1970-01-01 00:00:00'  AND EXISTS ( SELECT 1 
FROM process_instance pi where pi.app_id = pa.app_id AND pi.process_instance_id 
= pa.process_instance_id  AND pi.user_id = '137074931866340') ORDER BY 
pa.process_instance_id,  pa.created limit 50;
                                                                                
                                                                                
                                          
      QUERY PLAN                                                                
                                                                                
                                          
                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
 Limit  (cost=1079.44..1079.52 rows=32 width=24) (actual time=85.747..85.777 
rows=50 loops=1)
   Output: pa.process_activity_id, pa.process_instance_id, pa.created
   Buffers: shared hit=43070
   ->  Sort  (cost=1079.44..1079.52 rows=32 width=24) (actual 
time=85.745..85.759 rows=50 loops=1)
         Output: pa.process_activity_id, pa.process_instance_id, pa.created
         Sort Key: pa.process_instance_id, pa.created
         Sort Method: top-N heapsort  Memory: 28kB
         Buffers: shared hit=43070
         ->  Nested Loop  (cost=1.14..1078.64 rows=32 width=24) (actual 
time=0.025..72.115 rows=47011 loops=1)
               Output: pa.process_activity_id, pa.process_instance_id, 
pa.created
               Buffers: shared hit=43070
               ->  Index Scan using fki_conv_konotor_user_user_id on 
public.process_instance pi  (cost=0.43..2.66 rows=1 width=16) (actual 
time=0.010..0.015 rows=2 loops=1)
                     Output: pi.app_id, pi.process_instance_id
                     Index Cond: (c.user_id = '137074931866340'::bigint)
                     Filter: (c.app_id = '126502930200650'::bigint)
                     Buffers: shared hit=5
               ->  Index Scan using 
process_activity_process_instance_id_app_id_created_idx on 
public.process_activity pa  (cost=0.70..1061.62 rows=1436 width=32) (actual 
time=0.011..20.320 rows=23506 loops=2)
                     Output: pa.process_activity_id, pa.process_activity_type, 
pa.voice_url, pa.process_activity_user_id, pa.app_id, pa.process_instance_id, 
pa.alias, pa.read_by_user, pa.source, pa.label_category_id, pa.label_id, 
pa.csat_respons
e_id, pa.process_activity_fragments, pa.created, pa.updated, pa.rule_id, 
pa.marketing_reply_id, pa.delivered_at, pa.reply_fragments, pa.status_fragment, 
pa.internal_meta, pa.interaction_id, pa.do_not_translate, pa.should_tr
anslate, pa.in_reply_to
                     Index Cond: ((m.process_instance_id = 
pi.process_instance_id) AND (m.app_id = '126502930200650'::bigint) AND 
(m.created > '1970-01-01 00:00:00'::timestamp without time zone))
                     Buffers: shared hit=43065
 Planning time: 0.455 ms
 Execution time: 85.830 ms

On Thu, May 7, 2020 at 11:19 PM Amarendra Konda <[email protected]> wrote:

Hi David,
Thanks for the reply.This has optimized number of rows. 
Can you please explain, why it is getting more columns in output, even though 
we have asked for only one column ? 

 EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  SELECT pa.process_activity_id AS 
pa_process_activity_id  FROM process_activity pa WHERE pa.app_id = 
'126502930200650' AND pa.created > '1970-01-01 00:00:00'  AND EXISTS ( SELECT 1 
FROM process_instance pi where pi.app_id = pa.app_id  AND pi.user_id = 
'137074931866340') ORDER BY pa.process_instance_id,m.created limit 50;
                                                                                
                                                                                
                                          
   QUERY PLAN                                                                   
                                                                                
                                          
                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------
 Limit  (cost=1.14..37.39 rows=50 width=24) (actual time=821.283..891.629 
rows=50 loops=1)
   Output: pa.process_activity_id, pa.process_instance_id, pa.created
   Buffers: shared hit=274950
   ->  Nested Loop Semi Join  (cost=1.14..266660108.78 rows=367790473 width=24) 
(actual time=821.282..891.607 rows=50 loops=1)
         Output: pa.process_activity_id, pa.process_instance_id, pa.created
         Buffers: shared hit=274950
         ->  Index Scan using 
process_activity_process_instance_id_app_id_created_idx on 
public.process_activity pa  (cost=0.70..262062725.21 rows=367790473 width=32) 
(actual time=821.253..891.517 rows=50 loops=1)
               Output: pa.process_activity_id, pa.process_activity_type, 
pa.voice_url, pa.process_activity_user_id, pa.app_id, pa.process_instance_id, 
pa.alias, pa.read_by_user, pa.source, pa.label_category_id, pa.label_id, 
pa.csat_response_id, 
m.process_activity_fragments, pa.created, pa.updated, pa.rule_id, 
pa.marketing_reply_id, pa.delivered_at, pa.reply_fragments, pa.status_fragment, 
pa.internal_meta, pa.interaction_id, pa.do_not_translate, pa.should_translat
e, pa.in_reply_to
               Index Cond: ((m.app_id = '126502930200650'::bigint) AND 
(m.created > '1970-01-01 00:00:00'::timestamp without time zone))
               Buffers: shared hit=274946
         ->  Materialize  (cost=0.43..2.66 rows=1 width=8) (actual 
time=0.001..0.001 rows=1 loops=50)
               Output: pi.app_id
               Buffers: shared hit=4
               ->  Index Scan using fki_conv_konotor_user_user_id on 
public.process_instance pi  (cost=0.43..2.66 rows=1 width=8) (actual 
time=0.020..0.020 rows=1 loops=1)
                     Output: pi.app_id
                     Index Cond: (pi.user_id = '137074931866340'::bigint)
                     Filter: (pi.app_id = '126502930200650'::bigint)
                     Buffers: shared hit=4
 Planning time: 0.297 ms
 Execution time: 891.686 ms
(20 rows)

On Thu, May 7, 2020 at 9:17 PM David G. Johnston <[email protected]> 
wrote:

On Thu, May 7, 2020 at 7:40 AM Adrian Klaver <[email protected]> wrote:

On 5/7/20 4:19 AM, Amarendra Konda wrote:
> Hi,
> 
> PostgreSQL version : PostgreSQL 9.6.2 on x86_64-pc-linux-gnu, compiled 
> by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
> 
> We have noticed huge difference interms of execution plan ( response 
> time) , When we pass the direct values  Vs  inner query to IN clause.
> 
> High level details of the use case are as follows
> 
>   * As part of the SQL there are 2 tables named Process_instance
>     (master) and Process_activity ( child)
>   * Wanted to fetch TOP 50 rows from  Process_activity table for the
>     given values of the Process_instance.
>   * When we used Inner Join / Inner query ( query1)  between parent
>     table and child table , LIMIT is not really taking in to account.
>     Instead it is fetching more rows and columns that required, and
>     finally limiting the result

It is doing what you told it to do which is SELECT all 
process_instance_i's for user_id='317079413683604' and app_id = 
'427380312000560' and then filtering further. I am going to guess that 
if you run the inner query alone you will find it returns ~23496 rows.
You might have better results if you an actual join between 
process_activity and process_instance. Something like below(obviously 
not tested):


What the OP seems to want is a semi-join:
(not tested)
SELECT pa.process_activity_id  FROM process_activity pa WHERE pa.app_id = 
'427380312000560' AND pa.created > '1970-01-01 00:00:00'AND EXISTS (  SELECT 1 
FROM process_instance pi WHERE pi.app_id = pa.app_id AND pi.user_id = 
'317079413683604')
ORDER BY pa.process_instance_id,pa.created limit 50;
I'm unsure exactly how this will impact the plan choice but it should be an 
improvement, and in any case more correctly defines what it is you are looking 
for.
David J.

Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNER Query )

Reply via email to