Source: bacula
Severity: important
Tags: upstream

Hello,
Under a specific type of configuration, a Bacula job may sometimes corrupt a previously written volume, losing all data on it. The following circumstances have been identified:

- Multiple concurrent jobs are started at the same time, all using the same Schedule and Pool
- The Pool must have a Volume Use Duration that is higher than the frequency of the Jobs in the Schedule (for example, hourly backups with a VUD of 2 hours)
- The Pool uses a Device which uses File media

Once these conditions are met, a job may randomly corrupt a volume, typically when that volume is marked as "Used". This has the following consequences:

- The Job status is "OK -- with warnings"
- The Job log includes the following error from the "mount.c" file: "Hey!!!!! WroteVol non-zero !!!!!"
- One of the previously written volumes is marked in Error
- The size of that volume on the filesystem drops below 1 kB (it is effectively erased)
- Attempting to restore files from a volume in Error fails (ending in a mismatch)

=== Steps to Reproduce ===

Configure a Bacula cluster with the following conditions:

- A Device must use "Media Type = File"
- A Pool must have a Volume Use Duration set (for example, 2 hours)
- A Schedule must run jobs at a higher frequency than the Pool's Volume Use Duration (for example, every hour)
- Multiple Jobs must use this Schedule and Pool
- The Jobs must run concurrently

Under these conditions, a job will eventually corrupt a previously written volume.

=== Additional Information ===

This bug happens on various releases of Bacula from the official Debian packages (5.2, 7.4 and 9.4 are affected).

This bug happens on multiple separate Bacula clusters (nothing is shared between them).

In case it matters, the FDs use PKI Signatures and Encryption.

This bug does not happen if the Volume Use Duration is set lower than the backup frequency, ensuring a given Volume is never re-used between "batches" of backups (this is our current workaround).

This bug did not
happen before we implemented Concurrent Jobs.

The bug has been reported upstream: https://bugs.bacula.org/view.php?id=2664

-- System Information:
Debian Release: 10.11
  APT prefers oldstable-updates
  APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-18-amd64 (SMP w/32 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
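For reference, the triggering setup described above can be sketched as a minimal Bacula configuration. All resource names, paths, and timings below are illustrative placeholders, not taken from the affected clusters:

```
# Hypothetical minimal configuration matching the reproduction conditions.

# Storage daemon (bacula-sd.conf): a file-backed device
Device {
  Name = FileDev
  Media Type = File            # required condition: File media
  Archive Device = /srv/bacula/volumes
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
}

# Director (bacula-dir.conf): Pool whose Volume Use Duration (2 hours)
# exceeds the job frequency (hourly), so volumes get re-used between runs
Pool {
  Name = HourlyPool
  Pool Type = Backup
  Volume Use Duration = 2 hours
  Label Format = "Hourly-"
}

Schedule {
  Name = HourlyCycle
  Run = Level=Incremental Pool=HourlyPool hourly at 0:05
}

# Several Jobs share this Schedule and Pool and are allowed to run
# concurrently (Maximum Concurrent Jobs must also be raised in the
# Director, Storage, and Client resources for jobs to overlap)
JobDefs {
  Name = HourlyDefaults
  Type = Backup
  Schedule = HourlyCycle
  Pool = HourlyPool
  Messages = Standard
  Maximum Concurrent Jobs = 4
}
```

Setting the Pool's Volume Use Duration below the schedule interval (e.g. "Volume Use Duration = 50 minutes" for hourly jobs) corresponds to the workaround described above, since each batch of jobs then labels fresh volumes instead of re-using the previous batch's.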