Source: bacula
Severity: important
Tags: upstream

Hello,
Under a specific type of configuration, a Bacula job may sometimes corrupt a previously written volume, losing all data on it. The following circumstances have been identified:

- Multiple concurrent jobs are started at the same time, all using the same Schedule and Pool
- The Pool must have a Volume Use Duration that is higher than the frequency of the Jobs in the Schedule (for example, hourly backups with a VUD of 2 hours)
- The Pool uses a Device which uses File media

Once these conditions are met, a job may randomly corrupt a volume, typically when that volume is marked as "Used". This has the following consequences:

- The Job status is "OK -- with warnings"
- The Job log includes the following error from the "mount.c" file: "Hey!!!!! WroteVol non-zero !!!!!"
- One of the previously written volumes is marked in Error
- The size of that volume on the filesystem drops below 1 kB (it is effectively erased)
- Attempting to restore files from a volume in Error fails (ending in a mismatch)

=== Steps to Reproduce ===

Configure a Bacula cluster with the following conditions:

- A Device must use "Media Type = File"
- A Pool must have a Volume Use Duration set (for example, 2 hours)
- A Schedule must run jobs at a higher frequency than the Pool's Volume Use Duration (for example, every hour)
- Multiple Jobs must use this Schedule and Pool
- The Jobs must run concurrently

Under these conditions, a job will eventually corrupt a previously written volume.

=== Additional Information ===

This bug happens on various releases of Bacula from the official Debian packages (5.2, 7.4 and 9.4 are affected).

This bug happens on multiple separate Bacula clusters (nothing is shared between them).

In case it matters, the FDs use PKI Signatures and Encryption.

This bug does not happen if the Volume Use Duration is set lower than the backup frequency, ensuring a given Volume is never re-used between "batches" of backups (this is our current workaround).

This bug did not
happen before we implemented Concurrent Jobs.

The bug has been reported upstream: https://bugs.bacula.org/view.php?id=2664

-- System Information:
Debian Release: 10.11
  APT prefers oldstable-updates
  APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-18-amd64 (SMP w/32 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
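For reference, the triggering setup described above can be sketched as a minimal Bacula configuration. All resource names, paths, and timings below are illustrative placeholders, not taken from the affected clusters:

```
# Hypothetical minimal configuration matching the reproduction conditions.

# Storage daemon (bacula-sd.conf): a file-backed device
Device {
  Name = FileDev
  Media Type = File            # required condition: File media
  Archive Device = /srv/bacula/volumes
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
}

# Director (bacula-dir.conf): Pool whose Volume Use Duration (2 hours)
# exceeds the job frequency (hourly), so volumes get re-used between runs
Pool {
  Name = HourlyPool
  Pool Type = Backup
  Volume Use Duration = 2 hours
  Label Format = "Hourly-"
}

Schedule {
  Name = HourlyCycle
  Run = Level=Incremental Pool=HourlyPool hourly at 0:05
}

# Several Jobs share this Schedule and Pool and are allowed to run
# concurrently (Maximum Concurrent Jobs must also be raised in the
# Director, Storage, and Client resources for jobs to overlap)
JobDefs {
  Name = HourlyDefaults
  Type = Backup
  Schedule = HourlyCycle
  Pool = HourlyPool
  Messages = Standard
  Maximum Concurrent Jobs = 4
}
```

Setting the Pool's Volume Use Duration below the schedule interval (e.g. "Volume Use Duration = 50 minutes" for hourly jobs) corresponds to the workaround described above, since each batch of jobs then labels fresh volumes instead of re-using the previous batch's.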