On Fri, Aug 10, 2012 at 01:00:52AM +0400, Sergei Petrunia wrote: > Subject: Re: Cost analysis: checking if join optimization costs match the > reality. > Hi! > > I've ran this experiment: I took the scalar subquery out of the query and then > tried to compare optimizer costs with execution times. > > == The query == > explain extended > select sql_calc_found_rows > s_name, s_address > from nation straight_join supplier > where s_suppkey in (select ps_suppkey from partsupp > where ps_partkey in (select p_partkey from part > where p_name like 'forest%') > ) > and s_nationkey = n_nationkey > and n_name = 'CANADA' > order by s_name > limit 10; > +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+ > |id|select_type|table |type ||key |key_len|ref > |rows|Extra | > +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+ > | 1|PRIMARY |nation |ref ||n_name |26 |const | > 1|Using where; Using index; Using temporary; Using filesort| > | 1|PRIMARY |supplier|ref ||i_s_nationkey|5 |nation.n_nationkey | > 207| | > | 1|PRIMARY |partsupp|ref ||i_ps_suppkey |4 |supplier.s_suppkey | > 50|Using index | > | 1|PRIMARY |part |eq_ref||PRIMARY |4 |partsupp.ps_partkey| > 1|Using where; FirstMatch(supplier) | > +--+-----------+--------+------++-------------+-------+-------------------+----+---------------------------------------------------------+ > > +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+ > |id|select_type |table |type ||key |key_len|ref > |rows|Extra | > +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+ > | 1|PRIMARY |nation |ref ||n_name |26 |const > | 1|Using where; Using index; Using temporary; Using filesort| > | 1|PRIMARY |supplier |ref ||i_s_nationkey|5 > |nation.n_nationkey| 207| > | > | 1|PRIMARY |<subquery2>|eq_ref||distinct_key |4 |func > | 1| | > | 2|MATERIALIZED|part |range ||p_name |58 |NULL > |2387|Using where; Using index | > | 2|MATERIALIZED|partsupp |ref ||i_ps_partkey |4 |part.p_partkey > | 2|Using index | > +--+------------+-----------+------++-------------+-------+------------------+----+---------------------------------------------------------+ > > > === Execution times === > FirstMatch: 0.20 - 0.30 sec. > Materialization: 0.20 sec - 0.30 sec > > run times for both fluctuate between 0.20 and 0.30 sec, FirstMatch seems to be > somewhat slower (dunno if the difference is statistically meaningful) > > == Optimizer costs === > FirstMatch: 11029.4 > Materialization: 3594.5
Besides that, we know: Cost of scalar subquery execution (in choose_plan()): (gdb) p join->best_read $144 = 4.5989999999999993 = 4.6 === Scalar subquery costs for FirstMatch === Number of evaluations includes the selectivity of partsupp that's not show in EXPLAIN: fm_n_evaluations= 1 * 207 * 50 * 1 * 0.0118 = 122 fm_total_subq_cost = 4.6 * 122 = 562 fm_adjusted_total_cost= 11029.4 + 562= 11591.4 === Scalar subquery costs for Materialization === sjm_n_evaluations = 2387*1 = 2387 sjm_total_subq_cost = 2387 * 4.6 = 10980. sjm_adjusted_total_cost= 3594.5 + 10980= 14574.5 That is, sjm_adjusted_total_cost > fm_total_subq_cost which means it should choose FirstMatch! == Difference is not very satisfying, though == The result is not fully satisfying though, because cost numbers are close, while we have this data on execution: <spetrunia> Mine: <spetrunia> Materialization: 5.30 sec (3x) <spetrunia> First-Match, with 'fixed' scalar predicate: 1.72 sec (1x) <spetrunia> FirstMatch, original: ~15 sec (8x) <spetrunia> Timour: <spetrunia> FirstMatch, mdev-83: 1x <spetrunia> Materialization: 5x <spetrunia> FirstMatch, original: 8.3x that is "fixed FirstMatch" is 3x-5x faster than Materialization, while the costs differ by a factor of 14574.5/11591.4 = 1.25 On the other hand, see email "Cost analysis: optimizer statistics vs real dataset properties": - supplier.rows =207 ,while in reality it's 412, precise stats gives 400. - for Materialization plan, partsupp.rows=2, precise stats gives 4. - for FirstMatch partsupp.rows=50, while precise stats (and data we will hit) shows it should be 80. Last but not the least: that email shows that the scalar-context subquery will scan 3 rows. select (select count(*) from lineitem) / (select count(distinct l_partkey, l_suppkey) from lineitem); gives 7.5, which means subquery is 2x more expensive than the optimizer thinks it is. === Conclusion === - Precise stats will cause different cost numbers. I'm not sure what they will be, but there is some hope FirstMatch may still win. BR Sergei -- Sergei Petrunia, Software Developer Monty Program AB, http://askmonty.org Blog: http://s.petrunia.net/blog _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : [email protected] Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp

