zabbix on postgresql - very slow delete of events

2019-07-23 Thread Kristian Ejvind
Hi

This will be a rather lengthy post, just to give the full (I hope) picture. 
We're using Zabbix for monitoring and I'm having problems
understanding why the deletion of rows in the events table is so slow.

Zabbix: 4.2 (never mind the name of the db - it is 4.2)
new values per second: ~400
hosts: ~600
items: ~45000

OS: CentOS Linux release 7.6.1810 (Core)
Postgresql was installed from the yum repo on postgresql.org

zabbix_34=> select version();
 version
-
PostgreSQL 10.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 
(Red Hat 4.8.5-36), 64-bit
(1 row)

The database is analyzed + vacuumed nightly. The server runs Zabbix and the 
database, has 16 GB memory, 4 vCPUs (modern hardware).
Some parameters:

shared_buffers = 3GB
work_mem = 10MB(I also tested with work_mem = 128MB - no difference)
effective_cache_size = 6 GB
effective_io_concurrency = 40
checkpoint_timeout = 5 min (default)
max_wal_size = 1 GB (default)
checkpoint_completion_target = 0.8

pg_wal is already on a separate device.

events table: ~25 million rows / 2.9 GB
event_recovery table: ~12 million rows / 550 MB
alerts table: ~60 rows / 530 MB

Generally the database is quite snappy and shows no indication of problems. But 
now I've seen that housekeeping of events is
very slow - a single (normally hourly) run can take more than one day to 
finish, so events keep stacking up in the table. A typical slow
delete statement, from the postgres log:

postgresql-10-20190717-031404.log:2019-07-17 03:37:43 CEST [80965]: [4-1] 
user=zabbix,db=zabbix_34,app=[unknown],client=[local]: LOG: duration: 
27298798.930 ms statement: delete from events where (eventid between 5580621 
and 5580681 or eventid between 5580689 and 5580762 or eventid between 5580769 
and 5580844 or eventid between 5580851 and 5580867 or eventid between 5580869 
and 5580926 or eventid between 5580933 and 5580949 or eventid between 5580963 
and 5581024
--- 8< --- a lot of similar eventids snipped away -
or eventid between 5586799 and 5586839 or eventid in 
(5581385,5581389,5581561,5581563,5581564,5581580,5 
581582,5581584,5581585,5581635))

I've analyzed the deletion of a single row in events. First, some table 
information:


zabbix_34=> \d events
 Table "zabbix.events"
Column|  Type   | Collation | Nullable |Default
--+-+---+--+---
eventid  | numeric |   | not null |
source   | bigint  |   | not null | '0'::bigint
object   | bigint  |   | not null | '0'::bigint
objectid | numeric |   | not null | '0'::numeric
clock| bigint  |   | not null | '0'::bigint
value| bigint  |   | not null | '0'::bigint
acknowledged | bigint  |   | not null | '0'::bigint
ns   | bigint  |   | not null | '0'::bigint
name | character varying(2048) |   | not null | ''::character 
varying
severity | integer |   | not null | 0
Indexes:
"idx_29337_primary" PRIMARY KEY, btree (eventid)
"events_1" btree (source, object, objectid, clock)
"events_2" btree (source, object, clock)
"events_clk_3" btree (clock)
Referenced by:
TABLE "acknowledges" CONSTRAINT "c_acknowledges_2" FOREIGN KEY (eventid) 
REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE
TABLE "alerts" CONSTRAINT "c_alerts_2" FOREIGN KEY (eventid) REFERENCES 
events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE
TABLE "alerts" CONSTRAINT "c_alerts_5" FOREIGN KEY (p_eventid) REFERENCES 
events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_1" FOREIGN KEY 
(eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_2" FOREIGN KEY 
(r_eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_recovery" CONSTRAINT "c_event_recovery_3" FOREIGN KEY 
(c_eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_suppress" CONSTRAINT "c_event_suppress_1" FOREIGN KEY 
(eventid) REFERENCES events(eventid) ON DELETE CASCADE
TABLE "event_tag" CONSTRAINT "c_event_tag_1" FOREIGN KEY (eventid) 
REFERENCES events(eventid) ON DELETE CASCADE
TABLE "problem" CONSTRAINT "c_problem_1" FOREIGN KEY (eventid) REFERENCES 
events(eventid) ON DELETE CASCADE
TABLE "problem" CONSTRAINT "c_problem_2" FOREIGN KEY (r_eventid) REFERENCES 
events(eventid) ON DELETE CASCADE



zabbix_34=> \d event_recovery
  Table "zabbix.event_recovery"
Column |  Type  | Collation | Nullable | Default
---++---+--+

Re: zabbix on postgresql - very slow delete of events

2019-07-23 Thread Kenneth Marshall
On Tue, Jul 23, 2019 at 08:07:55AM +, Kristian Ejvind wrote:
> Hi
> 
> This will be a rather lengthy post, just to give the full (I hope) picture. 
> We're using Zabbix for monitoring and I'm having problems
> understanding why the deletion of rows in the events table is so slow.
> 
> Zabbix: 4.2 (never mind the name of the db - it is 4.2)
> new values per second: ~400
> hosts: ~600
> items: ~45000
> 

Hi Kristian,

Time series databases like Zabbix work poorly with the Housekeeper
service. We had many similar sorts of problems as our Zabbix usage
grew. Once we partitioned the big tables, turned off the Housekeeper,
and cleaned up by dropping partitions instead everything worked much,
much, much better. When we started using partitioning, we used the
old inheiritance style. Now you can use the native partitioning.

Regards,
Ken




Re: zabbix on postgresql - very slow delete of events

2019-07-23 Thread Kristian Ejvind
Thanks Kenneth. In fact we've already partitioned the largest history* and 
trends* tables
and that has been running fine for a year. Performance was vastly improved. But 
since you
can't have a unique index on a partitioned table in postgres 10, we haven't 
worked on that.

Regards
Kristian


?On 2019-07-23, 14:58, "Kenneth Marshall"  wrote:

Hi Kristian,

Time series databases like Zabbix work poorly with the Housekeeper
service. We had many similar sorts of problems as our Zabbix usage
grew. Once we partitioned the big tables, turned off the Housekeeper,
and cleaned up by dropping partitions instead everything worked much,
much, much better. When we started using partitioning, we used the
old inheiritance style. Now you can use the native partitioning.

Regards,
Ken






Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

Ekslingan 8
Box 222 09, SE-25467 Helsingborg

Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: [email protected]
Webb: http://www.resursbank.se





Re: zabbix on postgresql - very slow delete of events

2019-07-23 Thread Kenneth Marshall
On Tue, Jul 23, 2019 at 01:41:53PM +, Kristian Ejvind wrote:
> Thanks Kenneth. In fact we've already partitioned the largest history* and 
> trends* tables
> and that has been running fine for a year. Performance was vastly improved. 
> But since you
> can't have a unique index on a partitioned table in postgres 10, we haven't 
> worked on that.
> 
> Regards
> Kristian

Hi Kristian,

Why are you not partitioning the events and alerts tables as well? That
would eliminate this problem and you already have the infrastructure in
place to support the management since you are using it for the history
and trends tables.

Regards,
Ken