unoptimized nested loops
Hello Everyone,

I am stumped as to what I am doing wrong. I have two tables:

metadata: parent table, 1.28m records
data: child table, 1.24m records

metadata contains descriptions of the records in data. data has two fields of concern: the id field, which is a foreign key to an identical field in metadata, and the content field, which contains text ranging from a few hundred to a few thousand characters. The id fields are alphanumeric for historical reasons. Table descriptions below, some fields omitted for brevity:

c_db=> \d metadata
              Table "public.metadata"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 id     | character varying(60) | not null
 author | character varying(90) |
Indexes:
    "metadata_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "data" CONSTRAINT "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)

c_db=> \d data
             Table "public.data"
 Column  |         Type          | Modifiers
---------+-----------------------+-----------
 id      | character varying(30) | not null
 content | text                  |
Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)

I regularly need to search the content field in data for short text segments, and I can reduce the number of rows that need to be searched by looking for particular types of entries in metadata. However, postgresql apparently refuses to take advantage of that reduction, and insists on searching the entire content column in the data table for the text segment of interest. The supposedly reduced search takes more time than searching the entire data table straight up.

To be specific, with two approaches:

c_db=> select count(id) from data;
  count
---------
 1248954
(1 row)

Time: 648.358 ms

c_db=> select count(id) from data where content like '%some text%';
  count
---------
 1167810
(1 row)

Time: 180144.251 ms

c_db=> select count(id) from metadata where author like '%Kelly%';
 count
-------
  3558
(1 row)

Time: 1625.455 ms

c_db=> select count(id) from data where data.content like '%some text%' and data.id in (select id from metadata where metadata.author like '%Kelly%');
 count
-------
   152
(1 row)

Time: 211010.598 ms

c_db=> explain analyze select count(id) from data where data.content like '%some text%' and data.id in (select id from metadata where metadata.author like '%Kelly%');
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=213021.968..213021.969 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=0.00..48202.99 rows=125 width=124) (actual time=51392.697..213021.848 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.176..183040.366 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 213022.028 ms
(8 rows)

Time: 213024.273 ms

Alternatively, using an inner join:

c_db=> select count(id) from data inner join metadata on data.id = metadata.id where data.content like '%some text%' and metadata.author like '%Kelly%';
 count
-------
   152
(1 row)

Time: 212211.047 ms

c_db=> explain analyze select count(id) from data inner join metadata on data.id = metadata.id where data.content like '%some text%' and metadata.author like '%Kelly%';
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=212800.026..212800.026 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..48202.99 rows=125 width=124) (actual time=22463.361..212799.911 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.542..182952.708 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 212800.076 ms
(8 rows)

Time: 212805.008 ms

I do not see any evidence that the nested loop is trying to reduce overhead by starting from the smaller set. It seems to want to scan data first in both cases.
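
One idea I have not yet been able to test properly, so take it only as a sketch: as I understand it, 8.4 treats a WITH clause as an optimization fence, so materializing the constrained ids first should force the metadata filter to be evaluated before data is touched:

with kelly as (
    select id from metadata where author like '%Kelly%'
)
select count(data.id)
from data
inner join kelly on data.id = kelly.id
where data.content like '%some text%';

If the fence holds, the planner has to produce the small id list first and can then probe data through data_pkey, applying the content filter only to those rows. I would be interested to hear whether that is expected to work here.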
Re: unoptimized nested loops
Thank you to David, Jeff and Tom for your responses. Tom's response has made me rethink my question. I may have provided too much information in the effort to anticipate suggestions. Let me rephrase:

I have two tables, a parent (named "metadata") and a child (named "data"). Each table has two fields of interest. One field is shared (the "id", an alphanumeric field). The other field in each table is, respectively, the constraint (named "author" in metadata) and the target (named "content" in data). Each table has about 1.25m records/rows. There are about 1500 orphaned child records, but all parent records have a child record.

When I do a search, as the "human in charge" (HIC) I know that the constraint from the parent table will yield a very small number of child table records in which the target needs to be found. A typical search yields a few hundred to a few thousand parent records, whose child records should take only milliseconds to search for the target. Searching all of the child records for the target and then intersecting the result with the constrained parent records is very inefficient.

I found discussions from ten or more years ago about how the order of query arguments could affect optimization: depending on the database, the HIC would place a known driving constraint either first or last among the arguments. That practice has been reasonably abandoned as the default, but I have been unable to find out how to override the current practice of letting the planner do the work when the planner is getting it wrong. Something that should take only one or two seconds is now taking three to four minutes.

So, using an even more egregious scenario I found:

c_db=> select count(id) from metadata where author like '%Tim%';
 count
-------
   261
(1 row)

Time: 650.753 ms

c_db=> select count(id) from data where data.content like '%some text%' and data.id in (select id from metadata where metadata.author like '%Tim%');
 count
-------
    31
(1 row)

Time: 207354.109 ms

Just as a reminder, this is almost 30 seconds longer than it takes to search all 1.25m records for the target:

c_db=> select count(id) from data where content like '%some text%';
  count
---------
 1167810
(1 row)

Time: 180144.251 ms

To address David's suggestion of turning off enable_nestloop: this resulted in about a 10% improvement in speed.
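
For the record, I turned it off only for the one transaction; as I understand it, SET LOCAL keeps the change from leaking into the rest of the session:

begin;
set local enable_nestloop = off;
select count(id) from data
where data.content like '%some text%'
  and data.id in (select id from metadata where metadata.author like '%Tim%');
commit;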
I found no appreciable difference in time from setting statistics, although there was some ambiguity in the suggestion; I assume the statistics are supposed to be set on the content column:

ALTER TABLE data ALTER COLUMN content SET STATISTICS 1000;

To address Jeff's inquiry about planning on the smaller set:

c_db=> explain analyze select count(id) from metadata where author like '%Tim%';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=114040.65..114040.66 rows=1 width=11) (actual time=681.639..681.639 rows=1 loops=1)
   ->  Seq Scan on metadata  (cost=0.00..114040.64 rows=5 width=11) (actual time=3.053..681.591 rows=261 loops=1)
         Filter: ((author)::text ~~ '%Tim%'::text)
 Total runtime: 681.685 ms
(4 rows)

Time: 682.142 ms

For completeness:

c_db=> explain analyze select count(id) from metadata inner join data on metadata.id = data.id where author like '%Tim%' and content like '%some text%';
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.00..48203.01 rows=1 width=11) (actual time=208239.776..208239.777 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..48202.99 rows=5 width=11) (actual time=34102.795..208239.754 rows=31 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=11) (actual time=4.714..179369.126 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.024..0.024 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Tim%'::text)
 Total runtime: 208239.847 ms
(8 rows)

Time: 208247.698 ms

As for the version:

$ psql --version
psql (PostgreSQL) 8.4.1
contains support for command-line editing

Let's describe this system as "legacy"; updating is not an option. If the planner in this version was less capable than in more recent versions, the need for an override is even greater. However, I am very reluctant to believe the version is at the heart of the problem. I believe I am missing something, and perhaps failing to properly explain my need. How do I override the planner and instruct the computer to do what I say, regardless of what the planner estimates?
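
P.S. One detail I am not certain of regarding the statistics suggestion: as I understand it, a new statistics target is not actually used until the table is re-analyzed, so the full sequence should presumably be the following (raising the target on metadata.author as well is my own guess at the intent):

ALTER TABLE data ALTER COLUMN content SET STATISTICS 1000;
ALTER TABLE metadata ALTER COLUMN author SET STATISTICS 1000;
-- the new targets only take effect once fresh statistics are gathered
ANALYZE data;
ANALYZE metadata;

If the ANALYZE step is the missing piece, that might explain why I saw no difference.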
Re: unoptimized nested loops
I resolved the problem by eliminating the planner from the decision making altogether, using a (permanently) temporary table populated with the subset of data records selected by an initial query against metadata, then searching only that temp table for the actual data records I want. I wrapped this code in a function, sxa ("search by author"). In my particular case I am looking for the same snippet of text over and over again, so I hardwired it into the function. The function uses 'like' instead of '=' on the author, so the timings below actually compare an exact match using the inner join against a pattern match in the function.

c_db=> select count(id) from data;
  count
---------
 1248954
(1 row)

Time: 587.325 ms

c_db=> select count(id) from data inner join metadata on data.id = metadata.id where author = 'Powers';
 count
-------
   347
(1 row)

Time: 519.435 ms

c_db=> select count(id) from data inner join metadata on data.id = metadata.id where author = 'Powers' and content like '%some text%';
 count
-------
    14
(1 row)

Time: 209895.655 ms

c_db=> select count(id) from sxa('Powers');
 count
-------
    14
(1 row)

Time: 1794.600 ms

The above function run time includes deleting the previous search results and creating 347 records in the temp table. Depending on the search (due to variation in content length), the function sometimes runs even faster, at less than half the above time:

c_db=> select count(id) from sxa('Zelazny');
 count
-------
    13
(1 row)

Time: 790.551 ms

The variance in run time across inner join searches is greater than the total time for any search with the function: total run time for any inner join search was always 209 to 211 seconds, as all of the content values appear to be searched every time. Use of a temp table reduced the search time over 1.25m records from 3 1/2 minutes to less than two seconds, and in some cases to less than one.

In summary, the answer to how to overcome a bad decision by the postgresql planner appears to be: move the subset of data into a temporary table and force postgresql to look in it instead.

The version of postgresql I am using is 8.4.1, admittedly old. If there is a newer version of postgresql that has fixed this, please point me to it and I will see if I can upgrade to it instead of crafting functions. Also, if someone could refresh my memory on how relational databases are supposed to work by default, I would appreciate it. I seem to have gotten confused somewhere.

tim

c_db=> \d t
             Table "public.t"
 Column  |         Type          | Modifiers
---------+-----------------------+-----------
 id      | character varying(30) | not null
 content | text                  |
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)

CREATE OR REPLACE FUNCTION public.sxa(author character varying)
 RETURNS TABLE(id character varying)
 LANGUAGE sql
AS $function$
    -- clear out the previous search's rows
    delete from t;
    -- stage the current author's records; ids qualified to avoid
    -- an ambiguous column reference between data.id and metadata.id
    insert into t
        select data.id, data.content
        from metadata inner join data on metadata.id = data.id
        where metadata.author like $1;
    -- search the hardwired text in the staged rows only
    select id from data where id in (select id from t where content like '%some text%');
$function$;
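
For anyone adapting this, here is an untested sketch of a generalized variant (the names sxa2, auth and pat are mine) that takes the search pattern as a second parameter and reads straight from t; the extra trip back through data in sxa above is not strictly necessary, since t already holds the content:

CREATE OR REPLACE FUNCTION public.sxa2(auth character varying, pat text)
 RETURNS TABLE(id character varying)
 LANGUAGE sql
AS $function$
    -- same staging approach as sxa, with the pattern passed in as $2
    delete from t;
    insert into t
        select data.id, data.content
        from metadata inner join data on metadata.id = data.id
        where metadata.author like $1;
    -- t holds at most a few thousand rows, so this scan is cheap
    select t.id from t where t.content like $2;
$function$;

Usage would be, for example: select count(id) from sxa2('Powers', '%some text%');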