Slow query and wrong row estimates for CTE

2021-02-15 Thread Dane Foster
Hello,

I'm seeking guidance on how to improve the performance of a slow query, and
I'd like some other sets of eyes to confirm that what I wrote does what I
intend.

According to the PostgreSQL wiki there is a set of metadata that I should
provide to help you help me. So let's begin there.

PostgreSQL version: PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc
(GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

Full table and index schema:

CREATE TABLE attempt_scores (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  attempt_report_id bigint NOT NULL,
  score_value double precision NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  attempt_report_updated_at timestamptz NOT NULL,
  student_id int NOT NULL,
  course_id int NOT NULL,
  assignment_id int NOT NULL,
  score_name citext NOT NULL CHECK (length(trim(score_name)) > 0),
  attempted_by citext NOT NULL CHECK (length(trim(attempted_by)) > 0),
  CONSTRAINT for_upsert UNIQUE (attempt_report_id, score_name)
);
CREATE INDEX ON attempt_scores (attempt_report_updated_at);
COMMENT ON TABLE attempt_scores IS
  $$The collection of assignment scores extracted from the LMS database.$$;
COMMENT ON COLUMN attempt_scores.attempt_report_id IS
  $$Each assignment attempt has an associated attempt report (attempt_reports)
where the scores of the attempt are recorded. This column is the pk value from
that table.$$;
COMMENT ON COLUMN attempt_scores.score_value IS $$The score's value.$$;
COMMENT ON COLUMN attempt_scores.created_at IS $$The timestamp the
record was created.$$;
COMMENT ON COLUMN attempt_scores.student_id IS $$The student's ID.$$;
COMMENT ON COLUMN attempt_scores.course_id IS $$The course's primary
key in the LMS database.$$;
COMMENT ON COLUMN attempt_scores.assignment_id IS $$The assignment's
primary key in the LMS database.$$;
COMMENT ON COLUMN attempt_scores.score_name IS $$The source/name of
the score.$$;
COMMENT ON COLUMN attempt_scores.attempted_by IS 'The users.role column in LMS';
COMMENT ON COLUMN attempt_scores.attempt_report_updated_at IS
 $$The timestamp value of attempt_reports.updated_at on the LMS side. We use it
to find new rows added since the last time we exported to Salesforce.$$;


Table metadata:
    relname     | relpages | reltuples | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
----------------+----------+-----------+---------------+---------+----------+----------------+------------+---------------
 attempt_scores |   130235 |   9352640 |             0 | r       |       10 | f              | [NULL]     |    1067180032
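
For reference, that output matches the pg_class metadata query the wiki
suggests, i.e. something along these lines:

SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts,
       relhassubclass, reloptions, pg_table_size(oid)
FROM pg_class
WHERE relname = 'attempt_scores';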

Other context: The PostgreSQL database is an Amazon RDS instance.

Next up is the query and a description of what it's supposed to do.

-- What this query is supposed to do is compute averages for a set of scoring/learning
-- metrics, but it's not so straightforward. There is an umbrella metric that summarises
-- the others, called the student performance index (SPI), and the folks who want this
-- info want the averages to be driven by the SPI. So the basic algorithm is: for each
-- student/assignment pair, find the attempt that has the highest SPI, then use that
-- attempt's report to collect and average the component metrics.
EXPLAIN (ANALYZE, BUFFERS)
WITH max_spi AS (
  SELECT student_id, assignment_id, max(score_value) spi
  FROM attempt_scores
  WHERE score_name = 'student_performance_index'
  GROUP BY student_id, assignment_id
  HAVING max(score_value) > 0
), reports AS (
  SELECT max(attempt_report_id) attempt_report_id, max(score_value) spi
  FROM max_spi m NATURAL JOIN attempt_scores
  WHERE score_value = m.spi
  GROUP BY student_id, assignment_id
)
SELECT
  avg(spi) spi,
  avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
  avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM reports NATURAL JOIN attempt_scores;

Finally, the EXPLAIN output and some links.
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=672426.02..672426.03 rows=1 width=24) (actual time=903359.923..903368.957 rows=1 loops=1)
   Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
   I/O Timings: read=839121.853
   ->  Nested Loop  (cost=672389.80..672425.91 rows=8 width=37) (actual time=36633.920..885232.956 rows=7034196 loops=1)
         Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
         I/O Timings: read=839121.853
         ->  GroupAggregate  (cost=672389.37..672389.39 rows=1 width=24) (actual time=36628.945..41432.502 rows=938244 loops=1)
               Group Key: attempt_scores_2.student_id, attempt_scores_2.assignment_id
               Buffers: shared hit=191072 read=329960, temp read=99551 written=99678
               I/O Timings: read=18210.866

Re: Slow query and wrong row estimates for CTE

2021-02-15 Thread Justin Pryzby
On Mon, Feb 15, 2021 at 12:49:29PM -0500, Dane Foster wrote:
> PostgreSQL version: PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc
> (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

> EXPLAIN (ANALYZE, BUFFERS)
> WITH max_spi AS (

Since v12, CTEs are usually inlined by default.
I suspect it doesn't help, but as an experiment you could try
WITH .. AS MATERIALIZED.
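Concretely, that's just the keyword added to each of your CTEs:

WITH max_spi AS MATERIALIZED (
  SELECT student_id, assignment_id, max(score_value) spi
  FROM attempt_scores
  WHERE score_name = 'student_performance_index'
  GROUP BY student_id, assignment_id
  HAVING max(score_value) > 0
), reports AS MATERIALIZED (
  SELECT max(attempt_report_id) attempt_report_id, max(score_value) spi
  FROM max_spi m NATURAL JOIN attempt_scores
  WHERE score_value = m.spi
  GROUP BY student_id, assignment_id
)
SELECT
  avg(spi) spi,
  avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
  avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM reports NATURAL JOIN attempt_scores;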

You could instead try CREATE TEMPORARY TABLE + ANALYZE, which gives the planner
statistics that "WITH" CTEs don't have (like the row count after GROUPing).
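
A sketch of that, reusing the body of your max_spi CTE (temp tables aren't
visited by autovacuum, hence the explicit ANALYZE):

CREATE TEMPORARY TABLE max_spi AS
SELECT student_id, assignment_id, max(score_value) spi
FROM attempt_scores
WHERE score_name = 'student_performance_index'
GROUP BY student_id, assignment_id
HAVING max(score_value) > 0;

ANALYZE max_spi;

-- then join the rest of the query against max_spi instead of the CTE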

>  Aggregate  (cost=672426.02..672426.03 rows=1 width=24) (actual time=903359.923..903368.957 rows=1 loops=1)
>    Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
>    I/O Timings: read=839121.853

This shows that most of the time is spent in I/O (839s out of 903s).

>    ->  Nested Loop  (cost=672389.80..672425.91 rows=8 width=37) (actual time=36633.920..885232.956 rows=7034196 loops=1)
>          Buffers: shared hit=6167172 read=4199539, temp read=99551 written=99678
...
>          ->  Hash Join  (cost=424676.58..671389.26 rows=1 width=24) (actual time=25169.930..34121.825 rows=833436 loops=3)
>                Hash Cond: ((attempt_scores_1.student_id = attempt_scores_2.student_id) AND (attempt_scores_1.assignment_id = attempt_scores_2.assignment_id) AND (attempt_scores_1.score_value = (max(attempt_scores_2.score_value))))

This shows that it estimated 1 row but got 833k, so the plan may be no good.
As another quick experiment, you could try SET enable_nestloop=off.
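
That's a session-local setting, so it's safe to flip just for the experiment:

SET enable_nestloop = off;
-- rerun the EXPLAIN (ANALYZE, BUFFERS) query here
RESET enable_nestloop;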

> ->  Index Scan using for_upsert on attempt_scores  (cost=0.43..36.42 rows=8 width=37) (actual time=0.394..0.896 rows=7 loops=938244)
>       Index Cond: (attempt_report_id = (max(attempt_scores_1.attempt_report_id)))
>       Buffers: shared hit=5976100 read=3869579
>       I/O Timings: read=820910.987

This shows where most of your I/O time comes from.
I think you could improve this by clustering the table on for_upsert and then
re-analyzing.  Very possibly your "id" and "time" columns are all correlated.
They might already/automatically be correlated - you can check the correlation
stat:
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
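
For example (note that CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites the
table, so schedule it accordingly):

-- check how well each column's values track the physical row order
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'attempt_scores';

-- if attempt_report_id correlation is low, rewrite the table in index order
CLUSTER attempt_scores USING for_upsert;
ANALYZE attempt_scores;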

Without looking closely, an index might help: student_id,assignment_id
That'd avoid the sort, and maybe change the shape of the whole plan.
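
i.e.:

CREATE INDEX ON attempt_scores (student_id, assignment_id);

If that changes the plan for the better, a partial index matched to the
max_spi scan (WHERE score_name = 'student_performance_index') might be worth
testing too, but that's speculation without seeing the new plan.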

-- 
Justin