This is still an issue in at-3.1.16:

"Abrupt reboot of a server, from a crash for example, can leave malformed at job files, which can cause atd to go into high cpu utilization after boot.
Logs can be flooded with such errors as:

# grep atd /var/log/messages|tail
Aug 26 11:24:29 isp2264 atd[30648]: Job 62696 (a0f4e8016e2e2b) - userid 250 does not match file uid 0
Aug 26 11:24:30 isp2264 atd[30649]: Job 62696 (a0f4e8016e2e2b) - userid 250 does not match file uid 0
Aug 26 11:24:30 isp2264 atd[30652]: Job 62696 (a0f4e8016e2e2b) - userid 250 does not match file uid 0
Aug 26 11:24:30 isp2264 atd[30654]: Job 62696 (a0f4e8016e2e2b) - userid 250 does not match file uid 0
(...)"

I'm attaching a patch that reschedules failed jobs after a pause, in order to avoid flooding the logs with error messages. It also tries to prevent the creation of corrupted job files by passing the 'O_SYNC' flag to the open() system call.

Best regards
Kristyna Streitova

Index: at-3.1.13/at.c
===================================================================
--- at-3.1.13.orig/at.c
+++ at-3.1.13/at.c
@@ -319,7 +319,8 @@ writefile(time_t runtimer, char queue)
      * bit.  Yes, this is a kluge.
      */
     cmask = umask(S_IRUSR | S_IWUSR | S_IXUSR);
-    if ((fd = open(atfile, O_CREAT | O_EXCL | O_TRUNC | O_WRONLY, S_IRUSR)) == -1)
+    if ((fd = open(atfile,
+                   O_CREAT | O_EXCL | O_TRUNC | O_WRONLY | O_SYNC, S_IRUSR)) == -1)
         perr("Cannot create atjob file %.500s", atfile);
 
     if ((fd2 = dup(fd)) < 0)
Index: at-3.1.13/atd.c
===================================================================
--- at-3.1.13.orig/atd.c
+++ at-3.1.13/atd.c
@@ -103,6 +103,7 @@ int selinux_enabled=0;
 
 #define BATCH_INTERVAL_DEFAULT 60
 #define CHECK_INTERVAL 3600
+#define RETRY_INTERVAL CHECK_INTERVAL
 
 /* Global variables */
 
@@ -845,12 +846,17 @@ run_loop()
 
             /* Something went wrong the last time this was executed.
              * Let's remove the lockfile and reschedule.
+             *
+             * To prevent pointless CPU heating with permanent errors,
+             * next execution is scheduled with RETRY_INTERVAL inserted.
              */
             strncpy(lock_name, dirent->d_name, sizeof(lock_name)-1);
             lock_name[sizeof(lock_name)-1] = 0;
             lock_name[0] = '=';
             unlink(lock_name);
-            next_job = now;
+            if (next_job > now + RETRY_INTERVAL) {
+                next_job = now + RETRY_INTERVAL;
+            }
             nothing_to_do = 0;
         }
         continue;
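
For anyone reviewing the run_loop() hunk, the new scheduling rule can be read in isolation as a clamp on the next wake-up time. Below is a minimal sketch of that rule only. The helper name clamp_next_job() and its parameters are made up for illustration and do not exist in at; the interval value is copied from the patch.

```c
#include <assert.h>
#include <time.h>

/* Values taken from the patch: the retry pause reuses CHECK_INTERVAL. */
#define CHECK_INTERVAL 3600
#define RETRY_INTERVAL CHECK_INTERVAL

/*
 * Hypothetical helper (not part of at): given the currently planned
 * wake-up time and the current time, return the wake-up time to use
 * after a job was found in a failed state.
 *
 * Before the patch the equivalent logic was "return now;", which makes
 * atd retry immediately and spin on a permanently broken job file.
 * With the patch, the wake-up is pulled in to at most
 * now + retry_interval; an earlier planned wake-up is left unchanged.
 */
time_t clamp_next_job(time_t next_job, time_t now, time_t retry_interval)
{
    assert(retry_interval >= 0);

    if (next_job > now + retry_interval) {
        return now + retry_interval;
    }
    return next_job;
}
```

With now = 1000 and the default interval, a wake-up planned for 10000 is moved to 4600, while one planned for 2000 stays at 2000, since it is already earlier than the retry deadline.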