Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
Hi all/Justin,
As suggested, I created an index on res_users.res_employee_id, and the link
below shows the resulting explain plan.

Link: https://explain.depesz.com/s/hoct

The cost of the previous query was 92,129, and the cost of the modified
query after creating the above index is 91,462. The good news is that we can
see a small improvement!
Please find the table definitions used in the query (including
tms_workflow_history, which you asked for).
1. tms_timesheet_details:

amp_test=# \d tms_timesheet_details
                              Table "public.tms_timesheet_details"
       Column        |            Type             |                     Modifiers
---------------------+-----------------------------+---------------------------------------------------
 id                  | integer                     | not null default nextval('tms_timesheet_details_id_seq'::regclass)
 status              | character varying           |
 create_uid          | integer                     |
 effort_hours        | double precision            |
 work_order_no       | character varying           |
 res_employee_id     | character varying           |
 wsr_header_id       | integer                     |
 remarks             | character varying           |
 write_date          | timestamp without time zone |
 timesheet_header_id | integer                     |
 date                | date                        |
 create_date         | timestamp without time zone |
 write_uid           | integer                     |
 release_no          | character varying           |
 project_id          | character varying           |
 loc_name            | character varying           |
 user_id             | integer                     |
 ao_emp_id           | character varying           |
Indexes:
    "tms_timesheet_details_pkey" PRIMARY KEY, btree (id)
    "tms_timesheet_details_uniq_res_employee_id_efforts" UNIQUE, btree (res_employee_id, work_order_no, release_no, date, project_id)
    "timesheet_detail_inx" btree (wsr_header_id, timesheet_header_id)
    "tms_timesheet_details_all_idx" btree (wsr_header_id, work_order_no, release_no, date, effort_hours)
    "tms_timesheet_details_id_idx" btree (id) WHERE wsr_header_id IS NOT NULL
    "ts_detail_date_idx" btree (date)
    "ts_detail_hdr_id_idx" btree (timesheet_header_id)
    "ts_detail_release_no_idx" btree (release_no)
    "work_order_no_idx" btree (work_order_no)
2. tms_workflow_history:

amp_test=# \d tms_workflow_history
                            Table "public.tms_workflow_history"
      Column       |            Type             |                    Modifiers
-------------------+-----------------------------+--------------------------------------------------
 id                | integer                     | not null default nextval('tms_workflow_history_id_seq'::regclass)
 create_uid        | integer                     |
 current_activity  | character varying           |
 user_id           | integer                     |
 sequence          | integer                     |
 is_final_approver | boolean                     |
 wsr_id            | integer                     |
 write_uid         | integer                     |
 timesheet_id      | integer                     |
 state             | character varying           |
 write_date        | timestamp without time zone |
 remarks           | character varying           |
 create_date       | timestamp without time zone |
 group_id          | integer                     |
 active            | boolean                     |
Indexes:
    "tms_workflow_history_pkey" PRIMARY KEY, btree (id)
    "curract_state_isfinal_app_idx" btree (current_activity, state, is_final_approver)
    "timesheet_id_group_id_active_idx" btree (timesheet_id, group_id, active)
    "tms_wkf_his_active_is_final_approveridx" btree (active, is_final_approver)
    "tms_wkf_his_group_id_idx" btree (group_id)
    "tms_wkf_his_timesheet_id_idx" btree (timesheet_id)
    "tms_wkf_hist_current_activity_idx" btree (current_activity)
    "tms_wkf_hist_state_idx" btree (state)
    "wsr_id_idx" btree (wsr_id)
3. res_users:

                         Table "public.res_users"
   Column    |            Type             |               Modifiers
-------------+-----------------------------+----------------------------------------
 id          | integer                     | not null default nextval('res_users_id_seq'::regclass)
 active      | boolean                     | default true
 login       | character varying           | not null
 password    | character varying           |
 company_id  | integer                     | not null
 partner_id  | integer                     | not null
 create_date | timestamp without time zone |
 share       | boolean                     |
 write_uid
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
Hi Pavan,

That's quite a big query. I can see that the generate_series function is
getting repeatedly called, and the planner estimates for this subquery are
out by a factor of 66. You might try to re-write it using a WITH query.

I am assuming that you have already analyzed all the tables and also added
appropriate indexes on the join/query columns.
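To make the WITH idea concrete, something along these lines might work (a
rough, untested sketch; the table names and the id=127 period are the ones
visible in your plan, and the employee id is just a placeholder for what is
a correlated reference in your full query):

    -- Materialize the period's dates once at the top of the statement;
    -- the WITH query is evaluated a single time, so generate_series no
    -- longer runs inside every scalar subselect.
    WITH period_dates AS (
        SELECT d::date AS date
        FROM generate_series(
                 (SELECT start_date FROM tms_timesheet_period WHERE id = 127),
                 (SELECT end_date   FROM tms_timesheet_period WHERE id = 127),
                 '1 day'::interval) AS d
    )
    SELECT SUM(floor(effort_hours) + (effort_hours - floor(effort_hours)) / 0.6)
    FROM tms_timesheet_details
    WHERE work_order_no != 'CORPORATE'
      AND res_employee_id = '14145'   -- placeholder; correlated in the real query
      AND date IN (SELECT date FROM period_dates);

Regards,
Matthew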
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
On Wed, May 23, 2018 at 12:01:06AM -0700, pavan95 wrote:
> As said, created index on the res_users.res_employee_id and the below link
> is the explain plan result.
>
> Link: https://explain.depesz.com/s/hoct
>
> And the cost of Previous query is 92,129 and the cost of current modified
> query after creating the above said index is 91,462. But good thing is we

Forget the cost - that's postgres' *model* of the combined IO+CPU. If the
model is off, that may cause bad plans and could be looked into further.

In any case, that index cut your runtime from 75sec to 60sec (in spite of
the modelled cost). It looks like you resolved the bad estimate on the users
table?

> 2. tms_workflow_history:
> Indexes:
>     "tms_workflow_history_pkey" PRIMARY KEY, btree (id)
>     "curract_state_isfinal_app_idx" btree (current_activity, state, is_final_approver)
>     "timesheet_id_group_id_active_idx" btree (timesheet_id, group_id, active)
>     "tms_wkf_his_active_is_final_approveridx" btree (active, is_final_approver)
>     "tms_wkf_his_group_id_idx" btree (group_id)
>     "tms_wkf_his_timesheet_id_idx" btree (timesheet_id)
>     "tms_wkf_hist_current_activity_idx" btree (current_activity)
>     "tms_wkf_hist_state_idx" btree (state)
>     "wsr_id_idx" btree (wsr_id)

How big is the table? And curract_state_isfinal_app_idx? Have these been
reindexed (or pg_repacked) recently?

It seems to me that the remaining query optimization is to improve this:

> Bitmap Heap Scan on tms_workflow_history history (cost=193.19..1,090.50
> rows=6,041 width=12) (actual time=3.692..15.714 rows=11,351 loops=1)

I think you could consider clustering (or repacking) the table on
curract_state_isfinal_app_idx (but you'll have to judge if that's okay and
won't negatively affect other queries).

But, what's your target runtime? Improvements here could cut at most 15sec
off the total 60sec. If you're hoping to save more than that, you'll need to
(also) look further than the query:

 - postgres parameters: what are shared_buffers, work_mem,
   effective_cache_size?
   + https://wiki.postgresql.org/wiki/Server_Configuration
 - are there other DBs/applications running on the server/VM?
 - kernel tuning (readahead, VM parameters, write cache, scheduler, THP, etc.)
 - server hardware (what OS? storage? RAM? filesystem?)
 - how does the storage perform outside of postgres?
   + something like this: /usr/sbin/bonnie++ -f -n0 -x4 -d /var/lib/pgsql
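For concreteness, the clustering step would be something like this
(untested; weigh the exclusive lock for yourself):

    -- CLUSTER rewrites the table in the index's order and takes an
    -- ACCESS EXCLUSIVE lock while doing so; the ordering is not
    -- maintained for future writes, so it may need periodic re-runs.
    CLUSTER tms_workflow_history USING curract_state_isfinal_app_idx;
    ANALYZE tms_workflow_history;

Justin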
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
Hi Matthew,
Yes, you're right! I have analyzed the entire database and also created
appropriate indexes for the columns used in WHERE/JOIN clauses.

I will provide just the fourth UNION part of the query, which is easier to
analyze (it is not that big). Please find that part of the query below, and
refer to the table definitions in my previous posts.
Query:
select 0 as id,
       header.id as header_id,
       '0' as emp_id,
       0 as sno,
       users.alias_id as alias,
       partner.name as name,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where work_order_no != 'CORPORATE'
           and res_employee_id = users.res_employee_id
           and date in (select date::date
                          from generate_series((select start_date from tms_timesheet_period where id = 127),
                                               (select end_date from tms_timesheet_period where id = 127),
                                               '1 day'::interval) date)) billed_hrs,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where release_no = 'unbillable_time'
           and res_employee_id = users.res_employee_id
           and date in (select date::date
                          from generate_series((select start_date from tms_timesheet_period where id = 127),
                                               (select end_date from tms_timesheet_period where id = 127),
                                               '1 day'::interval) date)) as unbilled_hrs,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where release_no = 'paid_time_off'
           and res_employee_id = users.res_employee_id
           and date in (select date::date
                          from generate_series((select start_date from tms_timesheet_period where id = 127),
                                               (select end_date from tms_timesheet_period where id = 127),
                                               '1 day'::interval) date)) as paid_time_off,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where release_no = 'unpaid_leave'
           and res_employee_id = users.res_employee_id
           and date in (select date::date
                          from generate_series((select start_date from tms_timesheet_period where id = 127),
                                               (select end_date from tms_timesheet_period where id = 127),
                                               '1 day'::interval) date)) as unpaid_leave,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where release_no = 'bereavement_time'
           and res_employee_id = users.res_employee_id
           and date in (select date::date
                          from generate_series((select start_date from tms_timesheet_period where id = 127),
                                               (select end_date from tms_timesheet_period where id = 127),
                                               '1 day'::interval) date)) as breavement_time,
       (select SUM(floor(effort_hours) + ((effort_hours - floor(effort_hours)) / 0.6))
          from tms_timesheet_details
         where res_employee_id = users.res_employee_id
           and date >= '2018-04-16' and date <= '2018-04-30'
           and release_no in ('sick_leave', 'casual_leave', 'privilege_leave', 'optional_holiday')) as leave,
       'Not Submitted' state,
       header.res_employee_id as header_emp_id,
       'Not Submitted' as header_status
  from res_users users,
       res_partner partner,
       tms_timesheet_status status,
       tms_timesheet_header header
 where header.res_employee_id = users.res_employee_id
   and status.id = header.status_id
   and users.partner_id = partner.id
   and status.name = 'Draft'
   and header.timesheet_period_id = 127
and header.res_employee_id in ('14145', '14147',
'ON-14148', '11331', '11332', '11333', 'ON-11334', '65432', '65416',
'54643', '23266', '4681', '56464', '64649', '89564', '98798', '1',
'44466', '87852', '65464', '65464', '44655', '8201', '65465', 'ON-78785',
'13233', 'ON-5544', 'ON-54654', '23131', '98765', '25134', '13218', '84645',
'4687', '6546', '4988', '89796', '79878', '7198', '15726', '2132', '5310',
'13056', '4446', '16825', '16740', '3912', '19601', '13200', '12981',
'ON-3332', '13166', 'ON-3144', 'ON-1251', 'ON-2799', 'ON-2338', '7286',
'ON-2381', 'ON-3102', 'ON-2938', '64782', '5407', '54641', '46379',
'G151151', '5007', '6011', '5050', '20869', '20204', '12410', '10488',
'14582', '13574', '12982', '7884', '7788', '13417', '7922', '16744',
'16746', '16756', '8292', '16745', '19989', '8297', '5020', '14184',
'17161', '20767', '20753', '20289', '19979', '19975', '20272', '4292',
'G9341010', '14791', '5121', 'ON-1767', 'ON-581', 'ON-700', 'ON-437',
'ON-562', 'ON-1726', 'OFF-1060', 'ON-147', 'OFF-612', 'OFF-635', 'OFF-857',
'ON-900280', 'ON-1934', 'ON-1922', 'ON-2258', 'OFF-2537', 'ON-2872',
'ON-
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
Hi Justin,

> How big is the table? And curract_state_isfinal_app_idx?
> Have these been reindexed (or pg_repacked) recently?

The size of the table 'tms_workflow_history' is 7600 kB (which is pretty
small). Yes, those indexes were dropped and recreated.

> It looks like you resolved the bad estimate on the users table?

Yeah, I think the same.

Please find the explain plan, whose cost has increased vastly again. Is this
because of the increase in rows?

Link: https://explain.depesz.com/s/Ifr

The above is the explain plan taken from the production server, and this is
the main plan to tune. Please let me know where I'm going wrong. Thank you
in advance!
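For anyone repeating the rebuild, a roughly equivalent one-step alternative
to dropping and recreating is standard PostgreSQL REINDEX (note it blocks
writes to the table while it runs):

    -- Rebuild a single index in place:
    REINDEX INDEX curract_state_isfinal_app_idx;
    -- or rebuild all of the table's indexes at once:
    REINDEX TABLE tms_workflow_history;

Regards,
Pavan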
Re: dsa_allocate() faliure
>> At which commit ID?

83fcc615020647268bb129cbf86f7661feee6412 (5/6)

>> do you mean that these were separate PostgreSQL clusters, and they were
>> all running the same query and they all crashed like this?

A few worker nodes; a table is hash partitioned by "aTable.did" by Citus,
and further partitioned by PG10 by time range on field "ts". As far as I
could tell, Citus just does a query rewrite and executes the same type of
queries on all nodes.

>> so this happened at the same time or at different times?

At the same time. The queries are simple count and sum queries; here is the
relevant part from one of the worker nodes:

2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS sum FROM
(SELECT subquery.avg AS worker_column_1 FROM (SELECT aTable.did,
avg((aTable.sum OPERATOR(pg_catalog./) (aTable.count)::double precision)) AS
avg FROM public.aTable_102117 aTable WHERE ((aTable.ts
OPERATOR(pg_catalog.>=) '2018-04-25 00:00:00+00'::timestamp with time zone)
AND (aTable.ts OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp
with time zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)

>> a parallel worker process

I think this is more of a PG10 parallel bg worker issue. I don't think Citus
just lets each worker PG server do its own planning. I will try to do more
experiments on this and see if there is any specific query that causes the
parallel query execution to fail. As far as I can tell, the level of
concurrency triggered this issue: while executing 10s of queries as shown on
the worker nodes, depending on the stats, the PG10 core may or may not spawn
more bg workers.

Thanks for your time!

On Tue, May 22, 2018 at 9:44 PM, Thomas Munro wrote:
> On Wed, May 23, 2018 at 4:10 PM, Sand Stone wrote:
>>> dsa_allocate could not find 7 free pages
>> I just hit this error message again on all of my worker nodes (I am using
>> Citus 7.4 rel). The PG core is my own build of release_10_stable (10.4)
>> out of GitHub on Ubuntu.
>
> At which commit ID?
>
> All of your worker nodes... so this happened at the same time or at
> different times? I don't know much about Citus -- do you mean that these
> were separate PostgreSQL clusters, and they were all running the same
> query and they all crashed like this?
>
>> What's the best way to debug this? I am running pre-production tests
>> for the next few days, so I could gather info if necessary (I cannot
>> pinpoint a query to repro this yet, as we have 10K queries running
>> concurrently).
>
> Any chance of an EXPLAIN plan for the query that crashed like this?
> Do you know if it's using multiple Gather[Merge] nodes and parallel
> bitmap heap scans? Was it a regular backend process or a parallel
> worker process (or a Citus worker process, if that is a thing?) that
> raised the error?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
On Wed, May 23, 2018 at 07:03:18AM -0700, pavan95 wrote:
> Please find the explain plan which got increased again vastly. Is this
> because of the increase in rows?
>
> Link : https://explain.depesz.com/s/Ifr

That's explain without "analyze", so not very useful.

There's a handful of open questions:

On Wed, May 23, 2018 at 08:43:22AM -0500, Justin Pryzby wrote:
> - postgres parameters: what are shared_buffers, work_mem,
>   effective_cache_size?
>   + https://wiki.postgresql.org/wiki/Server_Configuration
> - are there other DBs/applications running on the server/VM?
> - kernel tuning (readahead, VM parameters, write cache, scheduler, THP, etc.)
> - server hardware (what OS? storage? RAM? filesystem?)
> - how does the storage perform outside of postgres?
>   + something like this: /usr/sbin/bonnie++ -f -n0 -x4 -d /var/lib/pgsql
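If it's easier, the three parameters can be pulled with one query against
the standard pg_settings view:

    -- Report current values, units, and where each setting came from:
    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size');

Justin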
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
Hi Justin,

Please find the below explain plan link.

Link: https://explain.depesz.com/s/owE

Any help is appreciated. Thanks in advance.

Regards,
Pavan
Re: Help me in reducing the CPU cost for the high cost query below, as it is hitting production seriously!!
On Wed, May 23, 2018 at 10:20:42PM -0700, pavan95 wrote:
> Hi Justin,
>
> Please find the below explain plan link.
>
> Link: https://explain.depesz.com/s/owE

That's explain analyze, but explain(analyze,buffers) is better.

Is this on a completely different server than the previous plans?

This rowcount misestimate appears to be a significant part of the problem:

  Merge Join (cost=228.77..992.11 ROWS=20 width=22) (actual time=4.353..12.439 ROWS=343 loops=1)
    Merge Cond: (history_2.timesheet_id = header_2.id)

You could look at the available stats for that table's column in pg_stats.
Is there a "most common values" list? Maybe you need to ALTER TABLE .. SET
STATISTICS 999 (or some increased value) and re-analyze?

You can see these are also taking a large component of the query time:

  Bitmap Index Scan on ts_detail_release_no_idx (cost=0.00..33.86 rows=1,259 width=0) (actual time=0.304..0.304 rows=1,331 LOOPS=327)
    Index Cond: ((release_no)::text = 'paid_time_off'::text)
  ...
  Bitmap Index Scan on ts_detail_release_no_idx (cost=0.00..33.86 rows=1,259 width=0) (actual time=0.304..0.304 rows=1,331 LOOPS=343)
    Index Cond: ((release_no)::text = 'paid_time_off'::text)

I wonder whether it would help to:

  CREATE INDEX ON tms_timesheet_details(timesheet_header_id)
      WHERE ((release_no)::text = 'paid_time_off'::text);

In addition to the other settings I asked about, it might be interesting to
SHOW effective_io_concurrency;

You're at the point where I can't reasonably contribute much more.
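Spelled out, the SET STATISTICS step would look something like this (the
column choice is my guess, based on the misestimated merge condition above):

    -- Collect a larger MCV list and histogram for the join column, then
    -- re-analyze so the planner sees the new statistics:
    ALTER TABLE tms_workflow_history
        ALTER COLUMN timesheet_id SET STATISTICS 999;
    ANALYZE tms_workflow_history;

Justin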
