It's Pi all the way down...


In an earlier series of posts (Playing with Blocks 1, 2, 3, and 4) we had a look at customizing Ubuntu images for the Raspberry Pi by playing around with various block device transformations. However, we never covered what a block device actually is. This is going to be a vaguely important topic in an upcoming series of posts on NBD booting of Ubuntu, so I figured some brief coverage of the topic is warranted.


Who am I kidding?


In the UNIX world a block device is a very simple concept: it’s a storage device which provides access to its contents as an ordered sequence of identically sized blocks. That’s it. No files, no directories, no permissions, nothing more than a numbered sequence of blocks all with the same size. Typically, the blocks are some whole multiple of an underlying device’s sector size, which over most of computing history has been 512 bytes. However, the user may access block devices byte-by-byte if they wish:

For instance, a read of a few bytes (illustrated on the left in green) may read a full 512-byte sector, throw away the bytes outside the requested range, and return that which was requested [1]. A write of a few bytes (illustrated in the middle in red) is a bit more complex; the kernel needs to read the full sector from the underlying device, change those bytes we requested to write in the middle, then write the whole sector back to the underlying device [2].
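To make that read-modify-write dance concrete, here’s a minimal Python sketch. The `SectorDevice` class and the `read_bytes` / `write_bytes` helpers are invented for illustration (they are not kernel interfaces), and the caching described in the footnotes is left out entirely.

```python
SECTOR_SIZE = 512  # the traditional sector size


class SectorDevice:
    """A toy in-memory device that can only transfer whole sectors."""

    def __init__(self, sector_count):
        self.sectors = [bytes(SECTOR_SIZE) for _ in range(sector_count)]

    def read_sector(self, index):
        return self.sectors[index]

    def write_sector(self, index, data):
        assert len(data) == SECTOR_SIZE
        self.sectors[index] = bytes(data)


def read_bytes(dev, offset, length):
    """Read a byte range: fetch every sector it touches, then trim."""
    first = offset // SECTOR_SIZE
    last = (offset + length - 1) // SECTOR_SIZE
    data = b"".join(dev.read_sector(i) for i in range(first, last + 1))
    start = offset - first * SECTOR_SIZE
    return data[start:start + length]


def write_bytes(dev, offset, payload):
    """Write a byte range using read-modify-write on each sector touched."""
    position = offset
    remaining = bytes(payload)
    while remaining:
        index, within = divmod(position, SECTOR_SIZE)
        chunk = remaining[:SECTOR_SIZE - within]
        sector = bytearray(dev.read_sector(index))   # read the full sector
        sector[within:within + len(chunk)] = chunk   # change the bytes we want
        dev.write_sector(index, sector)              # write the whole thing back
        position += len(chunk)
        remaining = remaining[len(chunk):]


dev = SectorDevice(4)
write_bytes(dev, 510, b"hello")   # deliberately straddles sectors 0 and 1
assert read_bytes(dev, 510, 5) == b"hello"
```

Note that the five-byte write touches two sectors, so it costs two sector reads and two sector writes on the underlying device.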

Crucially, block devices are random access. We can access any block, anywhere, at any time, without having to access all blocks prior to it. In other words, block devices make storage look and “feel” like RAM. This is as opposed to the “byte-stream” view of files which owes more to the tape drives of old (“seeking” the head to a particular location, then reading or writing from that point).


The simplicity of block devices leads to them being wonderfully flexible things that can often be converted into new block devices in useful ways. Probably the most basic block device transformation that everyone is familiar with is “partitioning”, which simply sub-divides one block device into multiple new ones (minus some header space). This transformation effectively just re-numbers the blocks but doesn’t change their content or ordering.
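The “re-numbering” idea can be sketched in a few lines of Python. The `Disk` and `Partition` classes below are hypothetical, and the layout (four blocks of header followed by two partitions) is made up purely for the example.

```python
class Disk:
    """A toy block device: a numbered sequence of equally sized blocks."""

    def __init__(self, block_count, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(block_count)]

    def read_block(self, n):
        return self.blocks[n]

    def write_block(self, n, data):
        assert len(data) == self.block_size
        self.blocks[n] = bytes(data)


class Partition:
    """A window onto a parent device: block n here is block start + n there."""

    def __init__(self, parent, start, length):
        self.parent = parent
        self.start = start
        self.length = length
        self.block_size = parent.block_size

    def _translate(self, n):
        if not 0 <= n < self.length:
            raise IndexError("block outside partition")
        return self.start + n

    def read_block(self, n):
        return self.parent.read_block(self._translate(n))

    def write_block(self, n, data):
        self.parent.write_block(self._translate(n), data)


sda = Disk(block_count=100)
# Assumed layout: the first 4 blocks stand in for the partition table.
sda1 = Partition(sda, start=4, length=32)
sda2 = Partition(sda, start=36, length=64)

sda2.write_block(0, b"x" * 512)
assert sda.read_block(36) == b"x" * 512   # same block, different number
```

Because a `Partition` exposes the same `read_block` / `write_block` interface as the `Disk` it sits on, anything that consumes a block device can consume a partition without knowing the difference.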

More complex transformations are also available.

The LVM sub-system (explored previously) gathers multiple block devices into a “volume group”. The volume group can then be used to produce other block devices called “logical volumes”. These are similar to partitions, but have no need to be contiguous on the underlying block devices, and can even span multiple underlying devices.

A drive (sda) is divided into two partitions (sda1, sda2), the latter of which is further divided into three block devices with LVM (root, home, and swap). The home device has been expanded at some point in its history, as its storage is non-contiguous.
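One way to picture a logical volume is as an ordered list of contiguous runs on the underlying devices. The sketch below is a loose model rather than LVM’s actual metadata format: `Segment` and `LogicalVolume` are invented names, the device names and offsets are arbitrary, and it counts in blocks where real LVM allocates in larger units called extents.

```python
from collections import namedtuple

# One contiguous run of blocks on an underlying device (illustrative only).
Segment = namedtuple("Segment", "device start length")


class LogicalVolume:
    def __init__(self, segments):
        self.segments = list(segments)

    def size(self):
        return sum(seg.length for seg in self.segments)

    def extend(self, segment):
        """Grow the volume; the new run can live anywhere with free space."""
        self.segments.append(segment)

    def locate(self, n):
        """Translate logical block n into (device, physical block)."""
        if n < 0:
            raise IndexError("negative block number")
        for seg in self.segments:
            if n < seg.length:
                return seg.device, seg.start + n
            n -= seg.length
        raise IndexError("block outside logical volume")


home = LogicalVolume([Segment("sda2", 1000, 500)])
home.extend(Segment("sda2", 4000, 250))   # grown later, so non-contiguous
home.extend(Segment("sdb1", 0, 250))      # and spilling onto a second device

assert home.locate(499) == ("sda2", 1499)
assert home.locate(500) == ("sda2", 4000)
assert home.locate(750) == ("sdb1", 0)
```

The consumer of `home` only ever sees blocks 0 to 999 in order; where they physically live is the volume’s business.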

Speaking of multiple underlying devices, RAID systems (like mdadm) consume multiple block devices and produce a single block device. The blocks in the produced device are duplicated to the underlying block devices (in the simplest mirror case; more complex transforms are involved in things like RAID5). Corruption or even loss of one of the underlying block devices, as in the failure of a drive, can be entirely hidden from the consumer of the produced block device.

Four drives (sda, sdb, sdc, and sdd) are partitioned such that each has a small initial partition (sda1 etc.) and a larger partition (sda2 etc.); the small initial partitions are combined as a RAID1 mirror into a boot block device. The larger partitions are combined as a RAID10 device (a stripe of mirrors) as a root device.
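The simplest mirror case is easy to sketch in Python. `Member` and `Mirror` are hypothetical classes, not anything from mdadm, and the sketch skips everything a real RAID implementation has to do when a failed device comes back (resynchronising the blocks it missed, for instance).

```python
class Member:
    """A toy underlying device that can be marked as failed."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.failed = False

    def read_block(self, n):
        if self.failed:
            raise IOError("device failed")
        return self.blocks[n]

    def write_block(self, n, data):
        if self.failed:
            raise IOError("device failed")
        self.blocks[n] = data


class Mirror:
    """Duplicate every write to all members; read from any healthy one."""

    def __init__(self, members):
        self.members = members

    def write_block(self, n, data):
        written = 0
        for member in self.members:
            try:
                member.write_block(n, data)
                written += 1
            except IOError:
                pass                      # carry on with the survivors
        if written == 0:
            raise IOError("all members failed")

    def read_block(self, n):
        for member in self.members:
            try:
                return member.read_block(n)
            except IOError:
                continue                  # try the next copy
        raise IOError("all members failed")


drives = [Member(8) for _ in range(4)]
boot = Mirror(drives)
boot.write_block(0, b"kernel")

drives[0].failed = True                   # lose a drive...
assert boot.read_block(0) == b"kernel"    # ...and the consumer never notices
```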

The LUKS encryption mechanism (also explored in previous posts) consumes one block device and produces a slightly smaller one (minus the size of the LUKS header) that scrambles (and unscrambles) the content of blocks as they are written (and read).

The Hierarchy

A file-system is just another kind of transformation of a block device; one which converts the flat sequence of blocks into a tree-like hierarchy of files (optionally with attributes like ownership, groups, permissions, and all the rest of that gubbins).

File-systems can be relatively simple affairs, like the ancient but ubiquitous FAT, which has no concept of file ownership, UNIX modes, or symlinks. Alternatively they can be complex, like the common ext4 file-system, which has journaling data recovery, the full suite of UNIX ownership, attributes, and linkage, and numerous tunable options.

Ultimately, the file-systems produced from various block devices are all grafted together (“mounted” in UNIX parlance) onto a “virtual” file-system which is what all userland processes (and you, the user) generally consider “the file-system”; the place you find all the files on your computer.

Two partitions (sda1 and sda2) are formatted as FAT and ext4 respectively. The vfat driver converts the sda1 block device into a typical boot file-system for the Pi (containing bootcode.bin, start4.elf, config.txt, etc). The ext4 driver converts the sda2 block device into a typical Linux root file-system (containing bin, etc, lib, usr, sbin directories). The virtual file-system then “mounts” the ext4 hierarchy under / and the FAT hierarchy under /boot/firmware

A common Linux storage layout is shown above. The drive (/dev/sda) has two partitions. The first partition (/dev/sda1) is formatted as FAT, which the kernel’s vfat driver transforms into a file-system. The second partition (/dev/sda2) is formatted as ext4, which the ext4 driver transforms into another file-system. The kernel has mounted the ext4 file-system as the root of the “virtual” file-system, and the FAT file-system under the /boot/firmware mount-point.
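As a rough model of the grafting, deciding which file-system owns a given path boils down to finding the longest mount-point that prefixes it. The `resolve` function and the mount table below are assumptions made for illustration; the kernel’s real path lookup is considerably more involved.

```python
import posixpath


def resolve(mounts, path):
    """Return (file-system, path within it) for an absolute path."""
    path = posixpath.normpath(path)
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the deepest (longest) matching mount-point.
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise LookupError("nothing mounted for " + path)
    return mounts[best], posixpath.relpath(path, best)


# The layout from the diagram, as a hypothetical mount table.
mounts = {
    "/": "ext4 on /dev/sda2",
    "/boot/firmware": "vfat on /dev/sda1",
}

assert resolve(mounts, "/etc/fstab") == ("ext4 on /dev/sda2", "etc/fstab")
assert resolve(mounts, "/boot/firmware/config.txt") == (
    "vfat on /dev/sda1",
    "config.txt",
)
```

A path like `/boot/vmlinuz` still lands on the ext4 file-system, because only `/boot/firmware` and below has been handed over to FAT.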

Holy CoW!

Some modern file-systems provide a copy-on-write facility (most notably btrfs and zfs). This allows rapid (near instantaneous) cloning of an entire file-system hierarchy. The clone doesn’t initially take any storage of its own, and reads of the clone actually go to the original files. However, writes to the clone allocate new storage for the changed content.

However, the facility is not limited to modern file-systems. LVM also provides such a facility, initially intended to provide snapshots of file-systems at a point in time (for backup, testing, or other purposes).

The “root” block device contains a number of filled blocks in green, while the “snap” block device has an equivalent number of “empty” blocks. Two reads, indicated by arrows, are made against the “snap” device but pass through to the underlying “root” device. Below this an equivalent layout is shown, but now one of the “filled” blocks is “red” indicating a change in content. Above it, the equivalent block in the “snap” device is green indicating it contains the original content. A read against this block comes directly from the “snap” device instead of passing through to the underlying “root” device.

The illustration above shows a “root” block device, presumably containing a root file-system. We create a snapshot of the “root” block device called “snap”. Initially, the “snap” block device is empty, and reads to it pass through to the underlying “root” block device. However, when a block in the “root” device is written to, its original content is first copied to the “snap” device. Subsequent reads of this block in the “snap” device will go to this copied block instead. Further writes to the same block in the “root” device are ignored by the “snap” device, because it only cares about the content at the point in time it was created.
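Here’s a minimal Python sketch of that copy-before-write behaviour. `Origin` and `Snapshot` are invented names, the snapshot is treated as read-only for simplicity, and the sketch ignores how much storage the snapshot has available.

```python
class Origin:
    """The live device ("root"); it knows about its snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def read_block(self, n):
        return self.blocks[n]

    def write_block(self, n, data):
        # Copy the original content out to each snapshot *before* overwriting.
        for snap in self.snapshots:
            snap.preserve(n, self.blocks[n])
        self.blocks[n] = data


class Snapshot:
    """A point-in-time view ("snap"); initially holds no blocks at all."""

    def __init__(self, origin):
        self.origin = origin
        self.saved = {}

    def preserve(self, n, original):
        # Only the first write after the snapshot matters; later ones are ignored.
        self.saved.setdefault(n, original)

    def read_block(self, n):
        if n in self.saved:
            return self.saved[n]            # served from the snapshot's copy
        return self.origin.read_block(n)    # unchanged, so pass through


root = Origin([b"a", b"b", b"c"])
snap = root.snapshot()
root.write_block(1, b"B")
root.write_block(1, b"BB")                  # a second write to the same block

assert root.read_block(1) == b"BB"
assert snap.read_block(1) == b"b"           # still the content at snapshot time
assert snap.read_block(0) == b"a"           # passes through to the origin
```

Nothing here looks inside the blocks, which is why the scheme works whatever file-system happens to be on the device.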

Note that one of the beauties of this system is that we don’t care what file-system is on the “root” device. It could be FAT, ext4, XFS, even (redundantly) one of the more modern systems that supports this internally. However, a limitation of this snapshot system is that we need to allocate space for it up front [4]. Can we come up with something in which the snapshot allocates blocks dynamically?

At the top, the “image” block device (read-only) contains a number of filled blocks in green. Above it, a clone, “clone1”, exists with the same number of blocks but all are empty. A read of two blocks in “clone1”, indicated by arrows, passes through to the corresponding blocks in “image”. Below, one of the blocks in “clone1” is now red, indicating it has been changed from the original content in “image”. Another arrow, indicating a read of that block, now comes straight from “clone1” and doesn’t pass through. Another clone, “clone2”, is below it with entirely empty blocks. A read of the same block which is changed in “clone1” passes through “clone2” to the underlying “image” device.

The illustration above shows “thinly provisioned” snapshots in LVM. This operates a little like the regular snapshots in reverse. An underlying block device called “image” presumably contains an OS image. We have created a “thinly provisioned” snapshot of this called “clone1”. Reads of an unchanged block pass through to the underlying (read-only) “image”. Writes to blocks in “clone1” occur within its storage, with subsequent reads to that block no longer passing through. Finally, we create another clone, “clone2”, which reads the original block from the underlying “image”.

It’s worth noting that most of the blocks in the clones are empty. Provided they stay that way (i.e. the clones don’t change too many blocks of the underlying image), there’s little sense in actually allocating them. This is the “thin” in “thinly provisioned”. The blocks of the clones are allocated on demand so a clone initially takes up no space. This allocate-on-demand facility can implement one form of “over provisioning” of storage. Each clone thinks it has all the storage allocated to the original image, but there’s no need to have all the storage available (unless the clones grow to require it, naturally).
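The allocate-on-demand idea can be sketched with a dictionary standing in for the clone’s storage. `ThinClone` is a hypothetical class; it models the behaviour described above, not LVM’s actual bookkeeping.

```python
class ThinClone:
    """A writable clone of a shared, read-only image."""

    def __init__(self, image):
        self.image = image    # shared by every clone, never modified
        self.changed = {}     # block number -> new content, filled on demand

    def read_block(self, n):
        if n in self.changed:
            return self.changed[n]    # our own copy
        return self.image[n]          # untouched, so pass through to the image

    def write_block(self, n, data):
        if not 0 <= n < len(self.image):
            raise IndexError("block outside clone")
        self.changed[n] = data        # storage is only allocated here

    def apparent_size(self):
        return len(self.image)

    def allocated(self):
        return len(self.changed)


image = tuple(b"os-block-%d" % i for i in range(1000))
clone1 = ThinClone(image)
clone2 = ThinClone(image)

clone1.write_block(2, b"changed")
assert clone1.read_block(2) == b"changed"
assert clone2.read_block(2) == image[2]     # clone2 still sees the original

promised = clone1.apparent_size() + clone2.apparent_size()
used = clone1.allocated() + clone2.allocated()
assert (promised, used) == (2000, 1)        # over-provisioned, but barely used
```

Creating a clone costs nothing but an empty dictionary, which is the toy equivalent of spinning up an OS image near-instantaneously.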

Hopefully this gives some idea of how modern cloud systems can spin up clones of an OS image nearly instantaneously, and how they can seemingly provide many gigabytes of storage to containers or VMs, without necessarily having installed all the storage they apparently provide.

Layer Cake

Finally, a brief illustration of just how silly this can get. Here’s a diagram of a storage layout I’ve used on servers in the past:

A spider’s web of storage. Four boxes represent four hard drives at the bottom. Above this a larger box represents the RAID device built from the drives. Above this a hexagonal volume group is built from the large RAID device. From the volume group spring numerous block devices, for “tmp”, “root”, “home”, and “images”. From the first three spring regular file-systems, but from “images” spring several more block devices named “base” and from that “munin”, “docs”, and “git” (OS images running in containers).
  • Four hard drives exist, each containing a small boot partition and a large partition occupying the rest of the capacity
  • The four boot partitions are combined with a RAID1 mirror into a boot block device, “md1” [3].
  • The “md1” block device is formatted as FAT and contains the “/boot” hierarchy.
  • The four large partitions are combined with RAID5 into a large block device, “md2” (RAID5 implies it has the capacity of n-1, i.e. three, of the underlying devices, and can survive the loss of any one device).
  • The “md2” device is consumed by the “raidvg” volume group from which are produced various block devices: the “root”, “home”, “tmp”, and “images” volumes.
  • “root”, “home”, and “tmp” are formatted with a mix of file-systems, ext4 and XFS.
  • The “images” volume is a “thin pool” from which the “base” volume is derived.
  • The “munin”, “docs”, and “git” volumes are all thinly provisioned snapshots of the “base” volume.
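The RAID5 claims in the list above (capacity of n-1 devices, survival of any single loss) both fall out of XOR parity, which can be sketched as follows. The function names are invented for the example, and the parity block is always kept in the last slot, whereas real RAID5 rotates it across the devices.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_capacity(device_count, device_size):
    """One device's worth of space goes to parity, leaving n-1."""
    return (device_count - 1) * device_size


def make_stripe(data_blocks):
    """Lay out n-1 data blocks plus one parity block across n devices."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost):
    """Recover the block on a lost device by XOR-ing all the survivors."""
    survivors = [block for i, block in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)


# Four devices of 2 (arbitrary) units each give 6 units of usable space.
assert usable_capacity(4, 2) == 6

stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
for lost in range(4):                        # lose any one device in turn...
    assert rebuild(stripe, lost) == stripe[lost]   # ...and get its block back
```

Since the XOR of all four blocks in a stripe is zero, any one of them equals the XOR of the other three; losing two devices leaves nothing to reconstruct from.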

This may seem like a complicated stack of transformations, but consider that in essence most amount to little more than re-numbering and re-ordering of blocks (the RAID5 transform is a little more involved, admittedly). Hence, for very little performance cost, we’ve got redundancy (RAID), a flexible volume creation system with snapshotting (LVM), and copy-on-write cloning for VMs or containers launched on the server (thin provisioning).

By this point you should have a reasonable understanding of what a block device is, and what can be accomplished with the various transforms that are available for them under Linux. Next time we’ll take a look at the issues with netbooting from NFS and how NBD (or block devices in general) can mitigate some of these issues.

[1] In practice it’s unlikely the bytes will actually be thrown away; the sector requested will be stored in a cache, in case a future read requests more bytes from the same sector (very likely when reading through a file sequentially).
[2] Again, the changed block will wind up in a cache so future modifications can skip the initial read step.
[3] Simple RAID1 mirroring is used to ensure that the BIOS can use any of the drives as its initial boot device without caring about the RAID layout (which it doesn’t understand). Note: This carries risks as it means the BIOS and bootloader must treat the boot partition as strictly read-only (and that’s not necessarily guaranteed).
[4] The snapshot doesn’t require all the space to be allocated, as the diagram suggests. In fact, assuming the underlying device changes infrequently, it’s common to only allocate ~5% of the origin’s storage. However, you do still need to provide some up front.