I’m designing my backup. So far in the documentation, I’ve read two options: application snapshot and application backup, both are writing to the local disk.
Let’s put aside the configuration backup as it’s less than 1 GB. The real challenge comes with backing up repos.
In an on-prem infrastructure, backups are stored in the backup infrastructure, with VTL and so on. There’s no way I can request to double the size of the repo disk just to store a consistent backup that I will have, then, to transfer to the backup infrastructure.
In a cloud infrastructure, the backup would go directly to the object storage such as S3 Glacier. Neither would we rent a disk space used only during backup, though it might be easier to do in a cloud environment.
In addition to the backup and snapshot methods from the documentation, I should add the option of disk snapshot, either from the guest OS or from the disk array (only for on-prem infrastructure). These would provide a stable file system onto which the backup software must run (as snapshots are not backups). There is also the option to snapshot the entire appliance from the hypervisor and hope for the best.
Now, let’s say I’ve got 30 TB of logs I need to backup, which accounts for about one year of logs.
When I look at the documentation options, I would have to schedule a daily backup with the last day of data (but it doesn’t seem dynamic in the options), which would copy the files under /opt/makalu/backup/repos I assume, and then use SFTP to fetch these and find a way to inject them in the backup software. That doesn’t seem convenient at all.
We need to add to the previous thoughts the restoration use cases:
- The deletion of repo data by fucking up the retention configuration (I think it’s not possible to delete a repo that contains data)
- A corruption of the filesystem or destruction of the appliance
In case 1, we’d need to restore only a repo. In case 2, we need to restore everything.
When I look at the filesystem under /opt/makalu/storage, it looks neatly organised by year, month and day folder, then we’ve got unix timestamp filenames with in .gz (which triggers another question about the use of ZFS compression if the files are gzipped anyway). So, for case 1, if I could restore the appropriate folders of lost past data, I should be good. For case 2, if I restore the entire filesystem, I should be as good as possible. Maybe the filesystem won’t have the last commit on disk, but it shouldn’t be corrupted.
So, why is there all this backup thing in place that duplicates data locally for log repos? What am I missing that could just prevent good restoration if I backup the filesystem (that the hypervisor backup software should be able to quiesce externally and backup)?
Would the following be consistent?
- Run the configuration backup with LP job (because it looks like there’s a database and so on)
- Get the hypervisor to tell the guest OS to quiesce the filesystem, then snapshot the disks
- Backup the disks with whatever differential backup the backup software can do
Of course, it’s unlikely sysadmins will want to backup the 30 TB logs every day, because I predict the deduplication on an encrypted filesystem won’t be good (ha, I didn’t say but for regulation compliance the repos are encrypted by ZFS). Still, it could be a valid scenario.
How do customers backup large volume of logs?