jmsperu commented on issue #12899:
URL: https://github.com/apache/cloudstack/issues/12899#issuecomment-4163753464

   @abh1sar Thank you for the detailed and thoughtful review. This is exactly 
the kind of feedback that will make the implementation solid. Addressing each 
point:
   
   **1. `nas.backup.full.interval`**
   
   You're right — "days" is misleading since backups can be hourly or ad-hoc. 
The intent was to control how often a full backup is taken vs incremental. 
We'll change this to a **count-based** approach: `nas.backup.full.every` 
(default: 10) — meaning every 10th backup is a full, regardless of schedule. 
This works for hourly, daily, and ad-hoc backups alike.
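
   As a sketch (variable names are assumptions, not actual CloudStack code), the count-based decision is a simple modulo check:

   ```shell
   # Sketch: nas.backup.full.every=10 -- every 10th backup is a full.
   # backup_count would come from backup_details; hardcoded here.
   full_every=10
   backup_count=23
   if [ $(( backup_count % full_every )) -eq 0 ]; then
     mode="full"
   else
     mode="incremental"
   fi
   echo "$mode"
   ```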
   
   **2. Use `backup-begin` for full backups too**
   
   Agreed. We'll use `backup-begin` (libvirt's backup API) for both full and 
incremental backups of running VMs. For full backups, simply omit the 
`<incremental>` element from the backup XML:
   
   ```xml
   <domainbackup mode="push">
     <disks>
       <disk name="vda" backup="yes" type="file">
        <target file="/mnt/nas/backups/vm-uuid/backup-full-{timestamp}.qcow2"/>
        <driver type="qcow2"/>
       </disk>
     </disks>
   </domainbackup>
   ```
   
   This keeps the approach consistent and lets libvirt handle all the heavy 
lifting.
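
   For reference, a hypothetical invocation (domain name and XML path are assumptions): `virsh backup-begin` takes the backup XML via `--backupxml`, and for a full backup no checkpoint XML is passed:

   ```shell
   # Sketch: start a push-mode full backup. On the host this command would
   # actually be executed; here we only assemble it for illustration.
   domain="i-2-42-VM"
   backup_xml="/tmp/backup-full.xml"
   cmd="virsh backup-begin ${domain} --backupxml ${backup_xml}"
   echo "$cmd"
   ```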
   
   **3. Timestamp-based bitmap names**
   
   Agreed. We'll use epoch timestamps (e.g., `backup-1711584000`) instead of 
date-based names. This avoids collisions on hourly/ad-hoc schedules and sorts 
naturally.
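
   A minimal sketch of the naming scheme:

   ```shell
   # Sketch: epoch-based bitmap/checkpoint names -- unique per second and
   # lexicographically sortable while the epoch stays 10 digits (until 2286).
   ts=$(date +%s)
   bitmap="backup-${ts}"
   echo "$bitmap"   # e.g. backup-1711584000
   ```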
   
   **4. No explicit bitmap manipulation**
   
   Good point — `backup-begin` manages bitmaps itself: the `<incremental>` 
element in the backup XML names the previous checkpoint, and the checkpoint 
XML passed alongside it defines the new one. We'll remove the manual 
`block-dirty-bitmap-add/remove` commands and let the API handle it.
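
   As a sketch (checkpoint names assumed), the incremental backup XML only names the previous checkpoint; the new checkpoint is defined in the separate checkpoint XML handed to `backup-begin`:

   ```xml
   <domainbackup mode="push">
     <incremental>backup-1711580400</incremental>
     <disks>
       <disk name="vda" backup="yes" type="file">
         <target file="/mnt/nas/backups/vm-uuid/backup-1711584000.qcow2"/>
         <driver type="qcow2"/>
       </disk>
     </disks>
   </domainbackup>
   ```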
   
   **5. Rebase after each backup**
   
   This is a great simplification. We'll run `qemu-img rebase` immediately 
after each incremental backup completes. This way:
   - Each backup file carries its chain reference
   - Restore becomes: flatten from the target point backward (no manual chain 
assembly)
   - Simpler code, fewer edge cases
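
   A sketch of the rebase step (paths are assumptions); `-u` rewrites only the qcow2 header, so no data is copied:

   ```shell
   # Sketch: record the parent in the incremental's qcow2 header right
   # after the backup completes. Assembled, not executed, in this sketch.
   parent="/mnt/nas/backups/vm-uuid/backup-1711580400.qcow2"
   child="/mnt/nas/backups/vm-uuid/backup-1711584000.qcow2"
   cmd="qemu-img rebase -u -F qcow2 -b ${parent} ${child}"
   echo "$cmd"
   ```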
   
   **6. Stopped VMs**
   
   For stopped VMs, `backup-begin` is not available (no running QEMU process). 
We'll use a different approach:
   
   - **Full backup**: `qemu-img convert` (same as today — copy the entire disk 
image)
   - **Incremental backup**: `qemu-img compare` between the current disk and 
the last backup, then export only changed blocks using `qemu-img convert` with 
a backing file reference
   
   This is slower than the live bitmap approach but functionally correct. The 
`backup_details` table will track whether a backup used the live (bitmap) or 
offline (compare) method, so restore logic can handle both.
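
   A sketch of the offline incremental (paths assumed): `qemu-img convert -B` writes only the clusters that differ from the given backing file, which is what makes the output a delta:

   ```shell
   # Sketch: offline incremental for a stopped VM -- assembled, not executed.
   disk="/var/lib/libvirt/images/vm-uuid.qcow2"
   last="/mnt/nas/backups/vm-uuid/backup-1711580400.qcow2"
   out="/mnt/nas/backups/vm-uuid/backup-1711584000.qcow2"
   cmd="qemu-img convert -O qcow2 -B ${last} ${disk} ${out}"
   echo "$cmd"
   ```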
   
   Alternatively, we could require that incremental backups are only available 
for running VMs and fall back to full backup for stopped VMs. This is simpler 
and avoids the `qemu-img compare` complexity. Would the community prefer this 
approach?
   
   **7. Delete backup cascading**
   
   Proposed delete behavior:
   
   - **Delete a middle incremental**: Rebase its child onto the deleted 
backup's parent (a safe `qemu-img rebase` copies the clusters the child still 
needs into the child), then remove the file. The chain remains valid.
   - **Delete a full backup with children**: Refuse with an error — user must 
delete all children first (youngest to oldest), or we could offer a "delete 
chain" option that removes the full + all its incrementals.
   - **Delete the newest incremental**: Simple file removal + update chain 
metadata.
   
   We'll add a `chain_position` field in `backup_details` to track where each 
backup sits in the chain, making delete validation straightforward.
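
   For the middle-incremental case, a sketch (file names assumed) with chain full <- A <- B <- C, deleting B: a safe rebase of C onto A folds the clusters C still needs from B into C:

   ```shell
   # Sketch: delete middle incremental B from chain full <- A <- B <- C.
   # Safe rebase (no -u) copies needed clusters from B into C first.
   a="/mnt/nas/backups/vm-uuid/backup-100.qcow2"
   b="/mnt/nas/backups/vm-uuid/backup-200.qcow2"
   c="/mnt/nas/backups/vm-uuid/backup-300.qcow2"
   cmd="qemu-img rebase -F qcow2 -b ${a} ${c}"
   echo "$cmd"   # then B can be removed and chain_position updated
   ```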
   
   **8. Checkpoint recreation on VM restart**
   
   This is the trickiest issue. Since CloudStack recreates the domain XML on 
stop/start, all checkpoints are lost. Our plan:
   
   On VM start (in `LibvirtComputingResource` or the start command wrapper):
   1. Query `backup_details` for the VM's latest bitmap name
   2. If an active incremental chain exists, recreate the bitmap using:
      ```
      virsh qemu-monitor-command $DOMAIN \
        '{"execute": "block-dirty-bitmap-add",
          "arguments": {"node": "drive-virtio-disk0",
                        "name": "{bitmap_name}",
                        "persistent": true}}'
      ```
   3. Mark the bitmap as "recreated" — the next incremental will capture ALL 
changes since the VM restarted (slightly larger than if the bitmap had 
survived, but correct). Since `<incremental>` references a libvirt checkpoint, 
not just a QEMU bitmap, we may also need to restore the checkpoint metadata 
with `virsh checkpoint-create --redefine`.
   
   This means the first incremental after a restart will be larger (contains 
all writes since restart, not just since last backup), but subsequent 
incrementals return to normal size. This is an acceptable trade-off and matches 
how other backup solutions handle the same problem.
   
   ---
   
   **Updated implementation plan based on your feedback:**
   
   1. Use `backup-begin` API for both full and incremental (running VMs)
   2. Use `qemu-img convert`/`compare` for stopped VMs (full always, 
incremental optional)
   3. Timestamp-based bitmap names
   4. Let libvirt manage bitmaps via backup XML (no manual bitmap commands)
   5. Rebase immediately after each incremental
   6. Store chain metadata in `backup_details` (provider-agnostic, per 
@JoaoJandre's feedback)
   7. Recreate bitmaps on VM start
   8. Add integration test cases in `smokes/test_backup_recovery_nas.py`
   9. Single PR targeting 4.23
   
   Will start working on this. Thank you both @abh1sar and @JoaoJandre for the 
guidance — it's significantly improved the design.

