Question

Backup best practices


Hello,

I’m designing my backup. So far in the documentation, I’ve read about two options: application snapshot and application backup, both of which write to the local disk.

Let’s put aside the configuration backup as it’s less than 1 GB. The real challenge comes with backing up repos.

In an on-prem infrastructure, backups are stored in the backup infrastructure, with VTL and so on. There’s no way I can request to double the size of the repo disk just to store a consistent backup that I would then have to transfer to the backup infrastructure.

In a cloud infrastructure, the backup would go directly to object storage such as S3 Glacier. Nor would we rent disk space used only during backups, even though that might be easier to do in a cloud environment.

In addition to the backup and snapshot methods from the documentation, I should add the option of a disk snapshot, either from the guest OS or from the disk array (only for on-prem infrastructure). These would provide a stable filesystem on which the backup software can then run (since snapshots are not backups). There is also the option of snapshotting the entire appliance from the hypervisor and hoping for the best.
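For the guest-OS variant, what I have in mind is something along these lines (a rough sketch only; I don’t know how the appliance lays out its volumes, so the LVM volume group, logical volume and snapshot size below are made up):

```python
# Rough sketch of a guest-OS disk snapshot used as a stable source for the
# backup software. Assumes the repo filesystem sits on an LVM logical volume;
# the VG/LV names and the snapshot size are made up for illustration.
import subprocess

VG, LV, SNAP = "vg_repo", "lv_repo", "lv_repo_presnap"
MOUNT_POINT = "/mnt/repo_snapshot"

def run(cmd):
    subprocess.run(cmd, check=True)

def snapshot_and_mount():
    # Reserve copy-on-write space for writes that happen while the backup runs.
    run(["lvcreate", "--snapshot", "--size", "50G",
         "--name", SNAP, f"/dev/{VG}/{LV}"])
    run(["mkdir", "-p", MOUNT_POINT])
    # Mount read-only so the backup agent sees a frozen view of the repos.
    run(["mount", "-o", "ro", f"/dev/{VG}/{SNAP}", MOUNT_POINT])

def cleanup():
    run(["umount", MOUNT_POINT])
    run(["lvremove", "-f", f"/dev/{VG}/{SNAP}"])

if __name__ == "__main__":
    snapshot_and_mount()
    try:
        pass  # point the backup software at MOUNT_POINT here
    finally:
        cleanup()
```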

Now, let’s say I’ve got 30 TB of logs I need to back up, which accounts for about one year of logs.

When I look at the documentation options, I would have to schedule a daily backup with the last day of data (but that doesn’t seem dynamic in the options), which I assume would copy the files under /opt/makalu/backup/repos, and then use SFTP to fetch them and find a way to inject them into the backup software. That doesn’t seem convenient at all.
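Just to make that concrete, the fetch-and-inject step would be something like this (a sketch using paramiko; the hostname, credentials and staging directory are placeholders, and /opt/makalu/backup/repos is the path from the documentation):

```python
# Rough sketch of the "fetch daily backup files over SFTP" step.
# Hostname, credentials and the local staging path are placeholders;
# /opt/makalu/backup/repos is the path mentioned in the documentation.
import os
import paramiko

APPLIANCE = "logpoint.example.internal"   # placeholder
REMOTE_DIR = "/opt/makalu/backup/repos"
STAGING_DIR = "/backup/staging/logpoint"  # directory the backup software ingests

def fetch_daily_backups():
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(APPLIANCE, username="backupuser",
                   key_filename="/root/.ssh/id_ed25519")
    sftp = client.open_sftp()
    os.makedirs(STAGING_DIR, exist_ok=True)
    for name in sftp.listdir(REMOTE_DIR):
        local = os.path.join(STAGING_DIR, name)
        if not os.path.exists(local):          # naive "only new files" check
            sftp.get(f"{REMOTE_DIR}/{name}", local)
    sftp.close()
    client.close()

if __name__ == "__main__":
    fetch_daily_backups()
```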

 

On top of the previous thoughts, we need to add the restoration use cases:

  1. Deletion of repo data after botching the retention configuration (I believe it’s not possible to delete a repo that contains data)
  2. Corruption of the filesystem or destruction of the appliance

In case 1, we’d need to restore only a repo. In case 2, we need to restore everything.

When I look at the filesystem under /opt/makalu/storage, it looks neatly organised by year, month and day folders, with Unix-timestamp filenames ending in .gz (which triggers another question about the use of ZFS compression if the files are gzipped anyway). So, for case 1, if I could restore the appropriate folders of lost past data, I should be good. For case 2, if I restore the entire filesystem, I should be as good as possible. Maybe the filesystem won’t have the last commit on disk, but it shouldn’t be corrupted.
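For case 1, the restore I picture is roughly this (sketch only; the restore source is wherever the backup software lands the files, and I’m assuming each repo keeps the year/month/day layout I observed):

```python
# Sketch of a per-repo, per-day restore for case 1 (data lost to a bad
# retention change). Assumes the year/month/day layout observed under
# /opt/makalu/storage; RESTORE_SOURCE and the repo name are placeholders,
# and folder names may be zero-padded on a real system.
import os
import subprocess
from datetime import date, timedelta

STORAGE_ROOT = "/opt/makalu/storage"
RESTORE_SOURCE = "/mnt/restore/makalu/storage"  # placeholder

def restore_days(repo: str, first: date, last: date):
    day = first
    while day <= last:
        rel = f"{repo}/{day.year}/{day.month}/{day.day}"
        os.makedirs(f"{STORAGE_ROOT}/{rel}", exist_ok=True)
        # --ignore-existing avoids overwriting data that survived the incident.
        subprocess.run(
            ["rsync", "-a", "--ignore-existing",
             f"{RESTORE_SOURCE}/{rel}/", f"{STORAGE_ROOT}/{rel}/"],
            check=True)
        day += timedelta(days=1)

if __name__ == "__main__":
    restore_days("firewall_repo", date(2023, 1, 1), date(2023, 1, 31))
```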

 

So, why is this whole backup mechanism in place, duplicating log repo data locally? What am I missing that would prevent a good restoration if I simply back up the filesystem (which the hypervisor backup software should be able to quiesce externally and back up)?

 

Would the following be consistent?

  1. Run the configuration backup with an LP job (because it looks like there’s a database and so on)
  2. Get the hypervisor to tell the guest OS to quiesce the filesystem, then snapshot the disks (see the sketch after this list)
  3. Back up the disks with whatever differential backup the backup software can do
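For step 2, this is roughly what I have in mind, assuming a VMware environment driven with pyVmomi (the vCenter host, credentials and VM name below are placeholders, and other hypervisors have equivalent calls):

```python
# Sketch of step 2: ask the hypervisor to take a quiesced snapshot of the
# appliance VM. Assumes VMware/pyVmomi; host, credentials and VM name are
# placeholders, and other hypervisors offer equivalent calls.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

VCENTER = "vcenter.example.internal"   # placeholder
VM_NAME = "logpoint-appliance"         # placeholder

def quiesced_snapshot():
    ctx = ssl._create_unverified_context()   # lab only; use proper certs
    si = SmartConnect(host=VCENTER, user="backup@vsphere.local",
                      pwd="********", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == VM_NAME)
        # quiesce=True asks the guest tools to flush and freeze the
        # filesystems before the snapshot is taken; no memory snapshot.
        WaitForTask(vm.CreateSnapshot_Task(
            name="pre-backup",
            description="quiesced snapshot before differential disk backup",
            memory=False, quiesce=True))
        # Step 3 (the differential disk backup) would run now, after which
        # the snapshot should be removed.
    finally:
        Disconnect(si)

if __name__ == "__main__":
    quiesced_snapshot()
```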

Of course, it’s unlikely that sysadmins will want to back up the 30 TB of logs every day, because I predict that deduplication on an encrypted filesystem won’t be good (ah, I didn’t mention it, but for regulatory compliance the repos are encrypted by ZFS). Still, it could be a valid scenario.

 

How do customers back up large volumes of logs?


2 replies


I’m trying to bump this up, as we’re seeing the same issues.

Is there a recommended backup solution or technique that doesn’t back up the data to the same disk?


Hi,

First of all, the snapshot feature has been deprecated along with the ZFS filesystem. The default filesystem is now ext4, and we compress the log data with a compression job that runs once a day and gzips it. This is also why the backup of log data can only be scheduled for data that is at least one day older than the current date.
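Purely to illustrate that behaviour (this is not the actual job), the daily pass is conceptually along these lines:

```python
# Illustration of the behaviour described above (not the actual job): log
# files older than one day are gzip-compressed in place, which is why backups
# can only cover data at least one day older than the current date.
import gzip
import os
import shutil
import time

STORAGE_ROOT = "/opt/makalu/storage"   # layout as described in the question
ONE_DAY = 24 * 3600

def compress_old_files(root=STORAGE_ROOT):
    cutoff = time.time() - ONE_DAY
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".gz") or os.path.getmtime(path) > cutoff:
                continue  # already compressed, or too recent
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
```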

As of now, this feature is not as mature as customers would expect, for the very reasons you pointed out. So, for such cases we provide a custom script that uses SFTP, RSYNC or SCP, depending on the customer’s choice, to back up data to an off-site or on-site location, or to cloud storage.
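Conceptually, such a script does something like the following (the real script comes from support; the destination, directory layout and schedule here are placeholders):

```python
# Conceptual sketch of what such a script does (the real one comes from
# support): push yesterday's repo folders to an off-site target with rsync
# over SSH. Destination host and base paths are placeholders; the layout is
# assumed to be <year>/<month>/<day> under the storage root.
import subprocess
from datetime import date, timedelta

STORAGE_ROOT = "/opt/makalu/storage"
DESTINATION = "backupuser@offsite.example.internal:/backups/logpoint"  # placeholder

def push_yesterday():
    d = date.today() - timedelta(days=1)
    rel = f"{d.year}/{d.month}/{d.day}"
    # --relative with the /./ marker recreates the year/month/day path
    # under the destination.
    subprocess.run(
        ["rsync", "-az", "--relative",
         f"{STORAGE_ROOT}/./{rel}/", DESTINATION],
        check=True)

if __name__ == "__main__":
    push_yesterday()
```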

A more robust backup method is in the pipeline, and the product team is currently assessing it. It will come as a backup and archival feature in the product itself, which we believe should solve the very problems we have right now.
We also suggest that customers use their own backup and archival software (if they own one) to back up the entire filesystem using the features that software provides.
Similarly, if you want the backup to go to a secondary disk when you generate one with Logpoint’s Backup feature, the support team can help you mount the backup location (/opt/immune/backup) on a secondary disk, on cloud storage or on NFS. The backup will then be written to the secondary location instead of the same disk, so that if the first disk is corrupted or the entire appliance is destroyed, the data will still be intact in the secondary location.
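As a quick check that the mount is actually in effect (i.e. the backup path really lives on a different device than the log storage), something along these lines can be used:

```python
# Quick sanity check that /opt/immune/backup is on a different device than
# the log storage, i.e. the secondary-disk/NFS mount is in effect.
# The storage path is the one mentioned in the question.
import os

BACKUP_PATH = "/opt/immune/backup"
STORAGE_PATH = "/opt/makalu/storage"

def backup_is_on_separate_disk() -> bool:
    return os.stat(BACKUP_PATH).st_dev != os.stat(STORAGE_PATH).st_dev

if __name__ == "__main__":
    if backup_is_on_separate_disk():
        print("OK: backups are written to a separate device")
    else:
        print("WARNING: backups share a device with the log storage")
```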

We also have an HA feature for some of the cases you mentioned, such as filesystem corruption or the loss of a single repo. It maintains a shadow copy on a highly available system elsewhere so that you don’t lose your valuable data. It is not an actual backup but a fault-tolerance method to avoid data loss.

Regarding better documentation, we already have an internal ticket tracking these very issues.
