Skip to content

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843

Open
jmsperu wants to merge 1 commit intoapache:4.20from
jmsperu:fix/nasbackup-timeout-spacecheck-trap
Open

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843
jmsperu wants to merge 1 commit intoapache:4.20from
jmsperu:fix/nasbackup-timeout-spacecheck-trap

Conversation

@jmsperu
Copy link

@jmsperu jmsperu commented Mar 17, 2026

Summary

  • Add BACKUP_TIMEOUT (default 6 hours) to prevent indefinitely stuck backup jobs; aborts via domjobabort when exceeded
  • Add EXIT trap with cleanup() that resumes paused VMs, removes temp dirs, and unmounts NFS on any exit (error, signal, or normal) — prevents orphan mounts from accumulating
  • Add check_free_space() pre-flight check (default 1 GB minimum) before writing backup data
  • Add -q/--quiesce flag for optional fsfreeze/thaw via qemu-guest-agent for application-consistent backups
  • Use set -eo pipefail for stricter error handling
  • Fix mount_operation(): proper if mount instead of broken $? check after pipe
  • Quote all variable expansions to prevent word splitting
  • Remove manual umount/rmdir from functions (now handled by trap)

Motivation

The current nasbackup.sh has several reliability issues observed in production:

  1. No timeout — backup polling loop (until domjobinfo --completed) runs forever if backup stalls, blocking the agent
  2. No cleanup on failure — failed backups leave orphan NFS mounts (/tmp/csbackup.XXXXX) and paused VMs that never resume (related: KVM NAS backup: VM remains paused indefinitely when backup job fails #12821)
  3. No space check — backups silently fail or produce corrupt output when the NAS is full
  4. No quiesce — running VM backups are crash-consistent only, even when qemu-guest-agent is available
  5. Broken error handling$? after a pipe always returns the pipe's exit status, not the command's

Test plan

  • Backup a running VM with sufficient space — verify backup completes normally
  • Backup a running VM with BACKUP_TIMEOUT=30 — verify timeout and domjobabort
  • Backup with NAS at <1GB free — verify pre-flight rejection
  • Kill backup mid-run — verify cleanup resumes VM and unmounts NFS
  • Backup with -q flag and qemu-guest-agent installed — verify fsfreeze/thaw in logs
  • Backup with -q flag without qemu-guest-agent — verify graceful fallback
  • Delete backup — verify cleanup trap handles unmount
  • Backup a stopped VM — verify space check and error handling on disk convert

…ror handling

- Add BACKUP_TIMEOUT (default 6h) to prevent indefinitely stuck backup jobs;
  aborts via domjobabort when exceeded
- Add EXIT trap with cleanup() that resumes paused VMs, removes temp dirs,
  and unmounts NFS on any exit (error, signal, or normal)
- Add check_free_space() pre-flight check (default 1 GB minimum)
- Add -q/--quiesce flag for optional fsfreeze/thaw via qemu-guest-agent
- Use set -eo pipefail for stricter error handling
- Fix mount_operation: proper if/then instead of broken $? after pipe
- Quote all variable expansions to prevent word splitting
- Remove manual umount/rmdir from functions (handled by trap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.25%. Comparing base (61afb4c) to head (937e646).

Additional details and impacted files
@@            Coverage Diff            @@
##               4.20   #12843   +/-   ##
=========================================
  Coverage     16.24%   16.25%           
- Complexity    13411    13412    +1     
=========================================
  Files          5664     5664           
  Lines        500463   500463           
  Branches      60779    60779           
=========================================
+ Hits          81308    81333   +25     
+ Misses       410059   410035   -24     
+ Partials       9096     9095    -1     
Flag Coverage Δ
uitests 4.15% <ø> (ø)
unittests 17.10% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@abh1sar
Copy link
Contributor

abh1sar commented Mar 18, 2026

@jmsperu can you please check on 4.21?
There were lots of improvement done to the nasbackup.sh script
Quiesce support was also added (which can also be opt out of)

@jmsperu
Copy link
Author

jmsperu commented Mar 18, 2026

@jmsperu can you please check on 4.21? There were lots of improvement done to the nasbackup.sh script Quiesce support was also added (which can also be opt out of)
Thanks for the heads up! I see 4.22 already has quiesce and improved error handling. I'll rebase onto 4.22 and update the PR to only include what's still missing: backup timeout, trap handler with VM resume, and free space check. The other PRs (compression, verification, bandwidth throttle, encryption, parallel execution) are still new features on top of 4.22.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants