Faster drive cloning for backup
I make a bootable, bit-for-bit clone of my desktop machine weekly as a backup to complement daily higher-level point-in-time incremental backups with bup. I could use bup for cloning too; so far it has done everything it promised and has already been there in my hour of need once. However, it’ll take time to earn my full trust and I don’t want all my eggs in its one fairly new basket. After scripting bup I wondered what I could use to get bup-like backup speed in a full-disk clone.
Run a Linux software RAID (md) mirrored array with the clone drive as a member that is usually missing but gets reattached for backups. Track changes using md’s poorly advertised changed block bitmap; when the clone drive is reconnected, adding it back to the array will automatically resynchronise only those changed blocks. Be careful to avoid an equally poorly advertised bug in the Linux block layer else you may well corrupt the block device you’re backing up. For my desktop filesystems this makes weekly cloning around four times faster.
Picking a block driver
This needs a block driver that’s happy to sit between the file system and any other block device, and which:
- passes all requests through transparently
- records runs that get modified (dirty block map)
- returns and clears the list of runs at backup time
- ideally, preserves the dirty block map across reboots
- ideally, has had enough testing to be in the vanilla kernel
I considered and rejected the following:
- LVM snapshots: always copy-on-write — can’t just record the fact that the block is modified, meaning you need to predict the size of and reserve storage for a copy you’re never going to use
- lvmsync: requires LVM snapshots; use case is quick backups over the network immediately after taking snapshots
- DRBD with the local host acting as both the primary and the secondary: I somehow misread the docs as saying the dirty block map was only kept in memory (which would have meant rebooting/crashing forced a full resync). I also didn’t like the idea of new modules, layers and userspace tools just to record dirty blocks, and I imagine syncing over the loopback interface would be slower than it needs to be.
- fr1 (Fast RAID 1): states that it records dirty blocks but it never entered the mainline kernel and is now unmaintained.
md actually implements this, but having only ever dipped into mdadm(8) with specific problems in mind I’d never noticed its references to a “write-intent bitmap”:
If an array is using a write-intent bitmap, then devices which have been removed can be re-added in a way that avoids a full reconstruction but instead just updates the blocks that have changed since the device was removed.
md(4) (which I never knew existed until it came up in a Google search for something else) describes it more fully:
Secondly, when a drive fails and is removed from the array, md stops clearing bits in the intent log. If that same drive is re-added to the array, md will notice and will only recover the sections of the drive that are covered by bits in the intent log that are set. This can allow a device to be temporarily removed and reinserted without causing an enormous recovery cost.
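Once an array has a write-intent bitmap you can inspect it on any member with mdadm’s --examine-bitmap (aka -X). A guarded sketch, using this article’s example partition (needs root and a real md member to show anything interesting):

```shell
# Inspect the write-intent bitmap on a member device. /dev/sda4 is the clone
# partition used elsewhere in this article; adjust to taste.
if [ -r /dev/sda4 ] && command -v mdadm >/dev/null 2>&1; then
    mdadm --examine-bitmap /dev/sda4 || echo "not an md member with a bitmap"
else
    echo "mdadm or /dev/sda4 not available; skipping"
fi
```

The output includes the bitmap chunk size and how many chunks are currently dirty, which is a handy sanity check before and after a backup.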
Digression: Bugs and fixes
The Linux kernel has a long-standing block layer bug that can easily bite you in this setup.
A block device accepts I/O requests for chunks of data up to a certain maximum size. Larger requests are rejected with an error. The thing that performs I/O on a block device (usually a filesystem like ext2) queries this maximum size either every time it performs I/O (e.g. ext2fs) or only when it first opens the device (e.g. device mapper/LVM).
A block device composed of multiple physical devices (e.g. a RAID1) has to use the smallest limit of any of its component devices, so its chunk size limit changes as component devices come and go.
When one block device is layered upon another, such as LVM/Device Mapper/dm-crypt on top of software RAID, there’s no standard mechanism for lower layers to communicate changes in the chunk size limit. Unless every layer asks every time it performs I/O, the chunk size limit can change without that layer knowing. As LVM asks only once, adding a USB drive to a RAID with LVM on top can reduce its chunk size limit without Device Mapper or the filesystem knowing.
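To make the arithmetic concrete: the array’s effective limit is simply the minimum over its members. A tiny sketch with stand-in numbers (on a real system, read them from /sys/block/<dev>/queue/max_sectors_kb):

```shell
# A RAID1's maximum request size is the smallest of its members' limits.
# 512 and 120 are stand-in values for an internal SATA drive and a USB
# clone drive respectively.
effective_limit() {
    printf '%s\n' "$@" | sort -n | head -n 1
}

effective_limit 512 120   # prints 120: adding the USB drive shrinks the limit
```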
The next time the filesystem (e.g. ext3) attempts a large read/write, it passes
it to Device Mapper which passes it to RAID, which refuses it because it’s too
big. In an ideal world, Device Mapper passes this failure back up to the file
system which passes it on to the requesting process, which handles it
gracefully. In the real world not all processes (and perhaps not even all
filesystems) check the return value of
write() and handle failures
gracefully; some or all of the data might be lost or left inconsistent, often
with the only clue being a kernel log line like:
bio too big device md0 (240 < 256)
bash fragment to test for this (but I make no promises):
MIRROR_ARRAY=md0
CLONE_DISK=sda   # queue limits live on the whole disk, not the partition
CURRENT_MAX_WRITE=$(< /sys/block/$MIRROR_ARRAY/queue/max_sectors_kb )
NEW_MAX_WRITE=$(< /sys/block/$CLONE_DISK/queue/max_sectors_kb )
HOLDERS=$( ls -1 /sys/block/$MIRROR_ARRAY/holders/ | wc -l )
if [ "$NEW_MAX_WRITE" -lt "$CURRENT_MAX_WRITE" ] && [ "$HOLDERS" -gt 0 ]; then
    echo "New member with smaller maximum write might cause corruption."
else
    echo "New member has a safe-looking maximum write."
fi
With care this situation can be avoided. The safest but most intrusive workaround is to reboot into a rescue system or single user mode before adding/removing the drive, and never start LVM/Device Mapper/whatever layers sit above RAID. This has the benefit of ensuring a consistent backup.
A less intrusive but more fragile workaround is to assume a chunk size limit and enforce it at boot time, after RAID1 starts but before LVM starts. That way backups can be made without rebooting. There are trade-offs:
- backups will only be as consistent as the array would be after a power cut
- adding another device with a yet smaller chunk size limit to the array will still cause problems
- smaller-than-necessary chunk sizes have a performance impact
- any non-standard change like this requires testing and maintenance
Those are acceptable to me. If backups aren’t quick and easy, they don’t happen.
To add this workaround to the initial RAM disk of an Ubuntu system (tested on 12.04 “Precise Pangolin”) I put an initramfs-tools hook script in /etc/initramfs-tools/hooks/md-write-size and a udev rule in /root/84-md-write-size.rules.
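A minimal sketch of what such a rule might look like (the match conditions and the 120 KiB figure are assumptions, not my tested values; the right number is the smallest max_sectors_kb of any drive you might ever add to the array):

```
# /root/84-md-write-size.rules -- sketch only
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", \
    ATTR{queue/max_sectors_kb}="120"
```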
mdadm 3.2.3-2ubuntu1 in Ubuntu 12.04 has a bug that causes dirty block bitmaps to be written incorrectly; symptoms include mdadm silently exiting non-zero when adding members to an array with bitmaps, and kernel panics.
I built a custom mdadm with the cherry-picked changes from upstream, but older versions apparently aren’t affected (not sure which) and the fixes have already made it into newer Ubuntu packages.
As you’re still reading, you’re probably comfortable migrating to or modifying a RAID1. Assuming it has two members and that the new drive is at least as big as them, add a bitmap and add the clone drive’s block device as a third member:
MIRROR_ARRAY=md0
CLONE_DRIVE=sda4
# Don't forget you need a patched mdadm for this in Ubuntu 12.04.
mdadm --add /dev/$MIRROR_ARRAY /dev/$CLONE_DRIVE
mdadm --grow /dev/$MIRROR_ARRAY --bitmap=internal --raid-devices=3
From here, and after every future backup, there are three steps:
- wait for the (re)sync to finish
- clear the dirty block bitmap
- remove the drive logically
Waiting is simple:
while [ $(</sys/block/$MIRROR_ARRAY/md/degraded) -ne 0 ]; do
    sleep 30
done
Bit clearing is suspended while the array is degraded, for obvious reasons, so we turn it back on in case it was off. The dirty block bitmap is a persistent on-disk structure that is flushed only periodically to save I/O (see Documentation/md.txt), hence the sleep to give it time to clear.
echo true >/sys/block/$MIRROR_ARRAY/md/bitmap/can_clear
sleep $(( 4 * $(</sys/block/$MIRROR_ARRAY/md/bitmap/time_base) ))
The kernel will automatically set
can_clear to false and start tracking dirty
blocks as soon as the array becomes degraded.
mdadm --manage /dev/$MIRROR_ARRAY \
      --set-faulty /dev/$CLONE_DRIVE \
      --remove /dev/$CLONE_DRIVE
Then disconnect the drive physically.
Next time you physically reconnect the drive to perform a backup, re-add it logically:
mdadm --manage /dev/$MIRROR_ARRAY --re-add /dev/$CLONE_DRIVE
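For reference, the whole per-backup cycle can be collected into one script. This is a sketch, not my tested tooling: by default it only echoes the commands (set DRY_RUN=0 and run as root to execute them), and the 60-second sleep is an assumption standing in for 4 * time_base on your system:

```shell
#!/bin/sh
# One backup cycle: re-add, wait for resync, clear the bitmap, detach.
# Prints the commands by default; run with DRY_RUN=0 (as root) to execute.
MIRROR_ARRAY=${MIRROR_ARRAY:-md0}
CLONE_DRIVE=${CLONE_DRIVE:-sda4}

run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

run mdadm --manage /dev/$MIRROR_ARRAY --re-add /dev/$CLONE_DRIVE
# wait for the resync to finish
run sh -c "while [ \$(cat /sys/block/$MIRROR_ARRAY/md/degraded) -ne 0 ]; do sleep 30; done"
# allow bit clearing again and give the periodic flush time to run
run sh -c "echo true > /sys/block/$MIRROR_ARRAY/md/bitmap/can_clear"
run sleep 60   # assumption: comfortably more than 4 * time_base here
run mdadm --manage /dev/$MIRROR_ARRAY --set-faulty /dev/$CLONE_DRIVE \
    --remove /dev/$CLONE_DRIVE
```

Then disconnect the drive physically as before.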
The block layer bug report links to some work Kent Overstreet has done at Google on the block layer that might solve it, so hopefully the initramfs workaround is temporary.
Update 2014-04: The bug is marked fixed but it looks like Kent is still working on getting the bio-splitting patches in for 3.15. A proper test would be good though; presumably it wouldn’t require more than:
- setting a tiny max_sectors_kb on a loopback device
A full blind sync takes 4 hours; an incremental one takes a little under an hour.
Both for peace of mind and true reliability I’d recommend running a check every few backups while the intermittent member is in the array:
echo "check" >/sys/block/$MIRROR_ARRAY/md/sync_action
Watch /proc/mdstat during the check if you’re bored; at the end, check that /sys/block/$MIRROR_ARRAY/md/mismatch_cnt is still 0. If not, you’ll need to repair the array. I’ve had no problems so far.