Faster drive cloning for backup

Posted: 2012-08-15

I make a bootable, bit-for-bit clone of my desktop machine weekly as a backup, to complement daily higher-level point-in-time incremental backups with bup. I could use bup for cloning too; so far it has done everything it promised and has already been there in my hour of need once. However, it’ll take time to earn my full trust and I don’t want all my eggs in its one fairly new basket. After scripting bup I wondered what I could use to get bup-like backup speed in a full-disk clone.

Summary

Run a Linux software RAID (md) mirrored array with the clone drive as a member that is usually missing but gets reattached for backups. Track changes using md’s poorly advertised changed block bitmap; when the clone drive is reconnected, adding it back to the array will automatically resynchronise only those changed blocks. Be careful to avoid an equally poorly advertised bug in the Linux block layer, or you may well corrupt the block device you’re backing up. For my desktop filesystems this makes weekly cloning around four times faster.

Picking a block driver

This needs a block driver that’s happy to sit between the file system and any other block device, and which keeps track of which blocks have changed so that only those need copying to bring the clone up to date.

I considered and rejected several other approaches.

md actually implements this, but having only ever dipped into mdadm(8) with specific problems in mind I’d never noticed its references to a “write-intent bitmap”:

If an array is using a write-intent bitmap, then devices which have been removed can be re-added in a way that avoids a full reconstruction but instead just updates the blocks that have changed since the device was removed.

md(4) (which I never knew existed until it came up in a Google search for something else) describes it more fully:

Secondly, when a drive fails and is removed from the array, md stops clearing bits in the intent log. If that same drive is re-added to the array, md will notice and will only recover the sections of the drive that are covered by bits in the intent log that are set. This can allow a device to be temporarily removed and reinserted without causing an enormous recovery cost.

I’ve seen other people ask about using RAID1 this way but I didn’t see a good guide in the first couple of FWSE pages.

Digression: Bugs and fixes

Block Layer

The Linux kernel has a long-standing block layer bug that can easily bite you in this setup.

A block device accepts I/O requests for chunks of data up to a certain maximum size. Larger requests are rejected with an error. The thing that performs I/O on a block device (usually a filesystem like ext2) queries this maximum size either every time it performs I/O (e.g. ext2fs) or only when it first opens the device (e.g. device mapper/LVM).

A block device composed of multiple physical devices (e.g. a RAID1) has to use the smallest limit of any of its component devices, so its chunk size limit changes as component devices come and go.

When one block device is layered upon another, such as LVM/Device Mapper/dm-crypt on top of software RAID, there’s no standard mechanism for lower layers to communicate changes in the chunk size limit. Unless every layer asks every time it performs I/O, the chunk size limit can change without that layer knowing. As LVM asks only once, adding a USB drive to a RAID with LVM on top can reduce the RAID device’s chunk size limit without Device Mapper or the filesystem knowing.

The next time the filesystem (e.g. ext3) attempts a large read/write, it passes it to Device Mapper which passes it to RAID, which refuses it because it’s too big. In an ideal world, Device Mapper passes this failure back up to the file system which passes it on to the requesting process, which handles it gracefully. In the real world not all processes (and perhaps not even all filesystems) check the return value of write() and handle failures gracefully; some or all of the data might be lost or left inconsistent, often with the only clue being a kernel log line like:

bio too big device md0 (256 > 240)
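
Each layer’s current limit is visible in sysfs, which makes the mismatch easy to eyeball for a whole LVM-on-RAID stack once you know to look. The device names below are only examples:

for dev in sdb md0 dm-0; do
    printf '%s: %s KB\n' "$dev" "$(< /sys/block/$dev/queue/max_sectors_kb)"
done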

Here’s a bash fragment to test for this (but I make no promises):

MIRROR_ARRAY=md0
CLONE_DRIVE=sda4
# Queue limits are exposed per disk rather than per partition, so strip
# any partition number when looking up the clone drive's limit in sysfs.
CLONE_DISK=${CLONE_DRIVE%%[0-9]*}

# Largest write (in KB) the array currently accepts.
CURRENT_MAX_WRITE=$(< /sys/block/$MIRROR_ARRAY/queue/max_sectors_kb )

# Largest write the prospective member accepts.
NEW_MAX_WRITE=$(< /sys/block/$CLONE_DISK/queue/max_sectors_kb )

# Anything already holding the array open (LVM/Device Mapper) may have
# cached the old, larger limit.
HOLDERS=$( ls -1 /sys/block/$MIRROR_ARRAY/holders/ | wc -l )

if [ "$NEW_MAX_WRITE" -lt "$CURRENT_MAX_WRITE" -a "$HOLDERS" -gt 0 ]; then
    echo "New member with smaller maximum write might cause corruption."
else
    echo "New member has a safe-looking maximum write."
fi

With care this situation can be avoided. The safest but most intrusive workaround is to reboot into a rescue system or single user mode before adding/removing the drive, and never start LVM/Device Mapper/whatever layers sit above RAID. This has the benefit of ensuring a consistent backup.

A less intrusive but more fragile workaround is to assume a chunk size limit and enforce it at boot time, after RAID1 starts but before LVM starts. That way backups can be made without rebooting. There are trade-offs: the limit has to be guessed in advance, it only protects you if no member you later add advertises an even smaller limit, and the smaller limit applies all the time, not just while the clone drive is attached.

Those are acceptable to me. If backups aren’t quick and easy, they don’t happen.

To add this workaround to the initial RAM disk of an Ubuntu system (tested on 12.04 “Precise Pangolin”) I put an initramfs-tools hook script in /etc/initramfs-tools/hooks/md-write-size and a udev rule in /root/84-md-write-size.rules.
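
I won’t reproduce them verbatim here. The hook script is essentially the standard initramfs-tools boilerplate that copies the udev rule into the initial RAM disk so udev applies it at early boot; a minimal sketch along those lines (the rule’s contents, which presumably write the assumed limit into the array’s queue settings, are omitted) looks like:

#!/bin/sh
# /etc/initramfs-tools/hooks/md-write-size (sketch, not the original script)
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

. /usr/share/initramfs-tools/hook-functions
# Ship the udev rule inside the initramfs so it takes effect before LVM starts.
copy_file rule /root/84-md-write-size.rules /lib/udev/rules.d/84-md-write-size.rules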

mdadm

mdadm 3.2.3-2ubuntu1 in Ubuntu 12.04 has a bug that causes dirty block bitmaps to be written incorrectly; it can make mdadm silently exit non-zero when adding members to an array with a bitmap, and can even lead to kernel panics.

I built a custom mdadm with the fixes cherry-picked from upstream, but older versions apparently aren’t affected (I’m not sure which ones) and the fixes have already made it into newer Ubuntu packages.
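
If you’re unsure what you’re running, checking the installed version first is cheap (my own suggestion; the dpkg query is Debian/Ubuntu-specific):

mdadm --version
dpkg-query -W -f='${Version}\n' mdadm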

Using it

As you’re still reading, you’re probably comfortable migrating to or modifying a RAID1. Assuming it has two members and that the new drive is at least as big as them, add a bitmap and add the clone drive’s block device as a third member:

MIRROR_ARRAY=md0
CLONE_DRIVE=sda4

# Don't forget you need a patched mdadm for this in Ubuntu 12.04.
mdadm --add /dev/$MIRROR_ARRAY /dev/$CLONE_DRIVE
mdadm --grow /dev/$MIRROR_ARRAY --bitmap=internal --raid-devices=3
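
It’s worth confirming the bitmap really took effect (my own habit, not part of the original steps):

# /proc/mdstat should now show a "bitmap:" line for the array, and
# --detail should report "Intent Bitmap : Internal".
grep -A 3 "^$MIRROR_ARRAY" /proc/mdstat
mdadm --detail /dev/$MIRROR_ARRAY | grep -i bitmap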

From here, and after every future backup, there are three steps:

  1. wait for the array to finish resynchronising
  2. let the write-intent bitmap clear
  3. mark the clone drive faulty, remove it from the array and disconnect it

Waiting is simple:

# md/degraded counts members that are missing or still recovering.
while [ $(</sys/block/$MIRROR_ARRAY/md/degraded) -ne 0 ]; do
    sleep 30
done

Bit clearing is suspended while the array is degraded, for obvious reasons, so now that the resync has finished we turn it back on in case it was off. The dirty block bitmap is a persistent on-disk structure and is only flushed periodically to save I/O (see Documentation/md.txt), hence the sleep to give it time to clear.

echo true >/sys/block/$MIRROR_ARRAY/md/bitmap/can_clear
sleep $(( 4 * $(</sys/block/$MIRROR_ARRAY/md/bitmap/time_base) ))

The kernel will automatically set can_clear to false and start tracking dirty blocks as soon as the array becomes degraded.

mdadm --manage /dev/$MIRROR_ARRAY --set-faulty /dev/$CLONE_DRIVE \
                                  --remove     /dev/$CLONE_DRIVE

Then disconnect the drive physically.
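
If the clone is on an external USB or eSATA disk, I’d also tell the kernel to let go of it before pulling the cable (my own habit, not part of the original procedure; sda here is just an example name for the clone disk):

sync
echo 1 >/sys/block/sda/device/delete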

Next time you physically reconnect the drive to perform a backup, re-add it logically:

mdadm --manage /dev/$MIRROR_ARRAY --re-add     /dev/$CLONE_DRIVE
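
Because the write-intent bitmap limits recovery to the blocks that changed while the drive was away, the resync should be short; you can watch its progress if you like (purely optional):

watch -n 5 cat /proc/mdstat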

Future

The block layer bug report links to some work Google’s Kent Overstreet has done on the block layer that might solve it, so hopefully the initramfs workaround is temporary.

Update 2014-04: The bug is marked fixed but it looks like Kent is still working on getting the bio-splitting patches in for 3.15. A proper test would be good though; presumably it wouldn’t require more than:

  1. setting a tiny max_sectors_kb on a loopback device
  2. dd with oflag=direct.

Evaluation

A full blind sync takes 4 hours; an incremental one takes a little under an hour.

Verification

Both for peace of mind and true reliability I’d recommend running a check every few backups while the intermittent member is in the array:

echo "check" >/sys/block/$MIRROR_ARRAY/md/sync_action

Watch /proc/mdstat during the check if you’re bored; at the end, check that /sys/block/$MIRROR_ARRAY/md/mismatch_cnt is still 0. I’ve had no problems so far, but if it isn’t you’ll need to repair the array.
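
md can rewrite the mismatched blocks itself, through the same sysfs interface as the check above. Note that it simply picks one copy to propagate to the others, so this restores consistency rather than guaranteeing the “right” data won:

echo "repair" >/sys/block/$MIRROR_ARRAY/md/sync_action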