It's Pi all the way down...

This is part 1 of an exploration of netbooting Raspberry Pis, with an emphasis on NBD over the more traditional NFS. While obviously we’re going to be looking at Ubuntu in this series, I’m hoping this should be generic enough to be useful on a variety of distributions.

The trouble I’ve seen

The obvious question is “why NBD?” or, put another way, “what’s wrong with NFS?”. After all, NFS is the current “best practice” for booting Raspberry Pis, and widely used, including by some very large installations (e.g. Mythic Beasts’ fantastic Pi cloud).

Ultimately, the reason is that the Network File System … is a file-system.

What a ridiculous objection, of course it’s a file-system! The clue’s in the name! And surely you want a file-system? More typically, though, you actually have a root block device, which your kernel will then transform into a file-system, rather than being given a file-system directly.

What problems arise from the lack of (access to) an underlying block device? The crux of the issue is what the kernel can assume about the file-system, most particularly whether it can assume it has exclusive access to it, and whether it can trivially access the underlying blocks of certain files.

Consider the common “root is a block device” case:

The kernel on your machine runs the transformation (the file system driver) that converts that block device into a file-system. Crucially, it can assume that it is the only entity accessing this block device (that it has exclusive access), and that for certain operations it can (with some jiggery-pokery) bypass the file-system and treat a file as a block device [1], which makes certain things nice and simple.

Need to allocate a large contiguous file, for example as a swap file? No problem! The file-system driver implements this [2], and can even give the kernel the contiguous blocks for use in the swap system (remember how block devices “make storage look like RAM”? Foreshadowing!).
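As a rough illustration of the allocation half of that paragraph, here is a minimal Python sketch that asks the file-system to reserve space up front. The function name preallocate and the file name swapfile are made up for the example, and note that this call only reserves the space; it does not by itself promise the blocks are contiguous, nor does it make the file usable as swap.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> int:
    """Reserve `size` bytes for `path` up front; return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the file-system to allocate the blocks now rather than lazily
        # on first write. (Hypothetical helper; Linux/POSIX only.)
        os.posix_fallocate(fd, 0, size)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    allocated = preallocate(os.path.join(tmp, "swapfile"), 16 * 1024 * 1024)
    print(allocated)
```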

Need to lock a file? The kernel knows nothing else is mounting file systems from that block device, so it knows all the locks that exist on it and doesn’t need to coordinate with anything else. Likewise, caching is simple [3]. The kernel can assume it has absolute knowledge of which blocks are dirty and need writing back because nothing else can be producing file systems from that block device.

Need some temporary file space, private to your process? Create a unique temporary file and delete it, leaving the file handle open, then use the file as a temporary store. The space won’t be reclaimed because, even though there are no links to the file, the kernel knows there’s still an open file handle.
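The trick described above can be sketched in a few lines of Python. The helper name private_scratch is invented for this example; in practice, Python’s own tempfile.TemporaryFile does much the same thing for you on POSIX systems.

```python
import os
import tempfile


def private_scratch(data: bytes) -> bytes:
    """Write `data` to an already-unlinked temporary file and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        # Remove the only link to the file. The open descriptor keeps the
        # storage alive until it is closed.
        os.unlink(path)
        os.write(fd, data)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(data))
    finally:
        # Closing the last descriptor is what actually frees the space.
        os.close(fd)


result = private_scratch(b"scratch data")
print(result)
```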

Now the “root is a file-system” case:

In this case, there’s a block device somewhere, but your kernel has no visibility of it, and must assume that other entities can change its blocks without notification.

This interferes with several of the scenarios above. Historically, it was fatal to some of them: neither fallocate (for cheap allocation of large files) nor swap worked over NFS. Whilst these features do work today, it’s notable that kernel support needed to be added for swap [4], and the NFS protocol extended specifically for fallocate.

Caching becomes tricky because files are complicated and messy things that have data and “meta-data”. Here’s an excerpt from the NFS mount options:

ac / noac

Selects whether the client may cache file attributes. If neither option is specified (or if ac is specified), the client caches file attributes.

To improve performance, NFS clients cache file attributes. Every few seconds, an NFS client checks the server’s version of each file’s attributes for updates. Changes that occur on the server in those small intervals remain undetected until the client checks the server again. The noac option prevents clients from caching file attributes so that applications can more quickly detect file changes on the server.

Yikes.

The practice of using unlinked temporary files also works over NFS but again … required workarounds because NFS is stateless [1] so open file handles locally do not correspond to open file handles on the server.

All this doesn’t matter too much when the portion of the virtual file-system being mounted over NFS is relatively limited, as in the common case of mounting user home directories over NFS. However, when the entire root file-system is NFS, quite a few applications can start having “difficulty”. One of the more prominent, on Ubuntu especially, is snapd which doesn’t like running on an NFS root. Like it or not, it’s a pretty integral part of the Ubuntu eco-system at this point (providing Firefox on the desktop and LXD on the server), so it would be nice to have a netboot system that can support it too [5].

Okay, we’ve established there are some issues with running NFS as root. What’re the alternatives?

Sometimes I’m up

There are several daemons and protocols that support serving block devices over the network, and they differ in some quite interesting ways:

iSCSI

iSCSI can be trivially summarized as: SCSI commands over TCP/IP. This is by far the most popular method of serving block devices. It has the advantages that it doesn’t require expensive equipment (unlike Fibre Channel, its major competitor in the enterprise), and can be routed over multiple networks (as it’s built on IP). However, it’s not entirely trivial to configure (it is very flexible, but for our purposes here I wanted something simpler to start out with).

AoE

ATA over Ethernet. Like iSCSI, the name says it all. The client simply passes ATA disk commands over Ethernet to the server. Note that this does indeed run over Ethernet only (layer 2), not TCP/IP, so it’s not routable over the Internet, only local networks. It’s very simple to set up (more so than iSCSI), but two things made me skip it here. The first is that being limited to layer 2 may be perfectly sufficient for many use-cases, but there are several others where it’ll be a limiting factor.

The second is that aoetools, the package for AoE in Ubuntu, is an extremely “mature” package. Specifically, the upstream version in Ubuntu hasn’t changed since Xenial (16.04, 7 years ago at the time of writing). That isn’t to say it’s bad or entirely unmaintained, but there’s no active work on it as best as I can tell (and that usually doesn’t bode too well from a security point of view).

NBD

Network Block Device differs in that it doesn’t implement the commands from an existing disk protocol (SCSI or ATA). Instead, it uses its own protocol which, at its core, is almost laughably simple.

It also operates over TCP/IP, avoiding the layer 2 limitations of AoE, and is an actively maintained project which optionally includes facilities for TLS encryption. I won’t be using those here, but you’ll have some options for better security down the line.

Simplicity is another argument in favour of serving block devices instead of file-systems. Compare the two scenarios:

They don’t look that different but consider what serving files over a network means, versus serving block devices. What operations can be performed against a file? Opening, closing, reading, writing, truncating, locking, linking, touching, the list goes on. All this must be handled by the protocol to implement even a bare-bones network file-system. The bare-bones case for block devices (as noted above) is radically simpler.

The baseline portion of the NBD Protocol consists of commands to “read some bytes”, “write some bytes”, and “disconnect”. That’s it. There are some other commands which may optionally tell the server to trim, flush, or cache blocks, and some other messages for option negotiation at connection time, but the core of the protocol really is that simple. I like simple.
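To give a feel for just how small that core is, here is a Python sketch of the request and simple-reply headers used once a connection is established. The magic numbers, command values, and field layout are as I recall them from the NBD protocol document, so do check them against the specification before relying on them; the helper names are made up for the example, and the option negotiation that precedes this phase is omitted entirely.

```python
import struct

# Values as recalled from the NBD protocol document (verify before use).
NBD_REQUEST_MAGIC = 0x25609513
NBD_SIMPLE_REPLY_MAGIC = 0x67446698
NBD_CMD_READ, NBD_CMD_WRITE, NBD_CMD_DISC = 0, 1, 2

# All fields are big-endian.
REQUEST = struct.Struct(">IHHQQI")   # magic, flags, command, cookie, offset, length
SIMPLE_REPLY = struct.Struct(">IIQ")  # magic, error, cookie


def pack_request(command: int, cookie: int, offset: int = 0,
                 length: int = 0, flags: int = 0) -> bytes:
    """Build the fixed-size header for a single request."""
    return REQUEST.pack(NBD_REQUEST_MAGIC, flags, command, cookie, offset, length)


def unpack_simple_reply(header: bytes) -> tuple[int, int]:
    """Return (error, cookie) from a simple reply header."""
    magic, error, cookie = SIMPLE_REPLY.unpack(header)
    if magic != NBD_SIMPLE_REPLY_MAGIC:
        raise ValueError("not an NBD simple reply")
    return error, cookie


# "Read 4096 bytes from offset 0". A write uses the same header followed by
# the data itself; a disconnect is the header alone.
read_4k = pack_request(NBD_CMD_READ, cookie=1, offset=0, length=4096)
print(len(read_4k), read_4k.hex())
```

The cookie is simply an opaque value the server echoes back, which lets the client match replies to outstanding requests.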

Sometimes I’m down

So far, we’ve looked at how the root file-system will be handled when netbooting. However, the root file-system only matters after the Linux kernel has started. How do we obtain the Linux kernel itself (and other sundry boot resources) at system start? This involves neither NFS nor NBD.

When netbooting, the Pi first requests an IPv4 address from the local network via DHCP. The local router responds with a DHCP offer, and our netboot server tacks a (minimal) PXE boot menu on the end suggesting where the client may find a TFTP server for booting.

This is typical jargon-laden nonsense, so let’s translate a bit:

| Jargon | Raspberry Pi | Router | Netboot Server |
|---|---|---|---|
| DHCP DISCOVER | “Hello? Can anybody give me an IPv4 address? By the way, I’m a netboot client” | | |
| DHCP OFFER | | “Sure, would you like to be 192.168.0.200?” | |
| DHCP Proxy Option 43 PXE Boot Menu | | | “By the way, for ‘Raspberry Pi Boot’ see TFTP server at 192.168.0.4” |
| DHCP REQUEST | “Okay, I’d like to be 192.168.0.200, please?” | | |
| DHCP ACK | | “Right, you are 192.168.0.200 for the next 12 hours, see me again after that” | |
| TFTP RRQ | “Can you send me the content of SERIAL/start.elf, please?” | | |
| TFTP OACK | | | “Sure, it’s going to be 225065 bytes long, and I’ll send it in chunks of 1468 bytes” |
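For the curious, the last two rows of that exchange look roughly like this on the wire. This is a Python sketch based on the TFTP RFCs (opcode 1 for a read request, opcode 6 for an option acknowledgement, NUL-terminated strings throughout). The exact options the Pi’s bootloader sends are an assumption on my part; the tsize and blksize values are simply chosen to match the conversation above, and the function names are hypothetical.

```python
import struct

OP_RRQ = 1   # read request
OP_OACK = 6  # option acknowledgement


def build_rrq(filename: str, mode: str = "octet", options=None) -> bytes:
    """Build a read request: opcode, then NUL-terminated strings."""
    fields = [filename, mode]
    for name, value in (options or {}).items():
        fields += [name, str(value)]
    body = b"".join(field.encode("ascii") + b"\x00" for field in fields)
    return struct.pack(">H", OP_RRQ) + body


def parse_oack(packet: bytes) -> dict[str, str]:
    """Return the option names and values the server agreed to."""
    (opcode,) = struct.unpack(">H", packet[:2])
    if opcode != OP_OACK:
        raise ValueError("not an OACK packet")
    fields = packet[2:].split(b"\x00")[:-1]  # drop the empty tail after the last NUL
    return {k.decode(): v.decode() for k, v in zip(fields[0::2], fields[1::2])}


# "Can you send me SERIAL/start.elf?" (tsize=0 asks the server for the size)
rrq = build_rrq("1234abcd/start.elf", options={"tsize": 0, "blksize": 1468})

# "Sure, it's 225065 bytes, in chunks of 1468 bytes"
sample_oack = struct.pack(">H", OP_OACK) + b"\x00".join(
    [b"tsize", b"225065", b"blksize", b"1468"]) + b"\x00"
agreed = parse_oack(sample_oack)
print(rrq, agreed)
```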

The important things to note here are as follows:

  1. We need a DHCP server. This is pretty much taken-as-read on any network these days.
  2. We need a netboot server with a DHCP proxy and a TFTP server. This is fairly simple. Any Ubuntu server can install dnsmasq (if it hasn’t already) to obtain this.
  3. We need the Raspberry Pi’s serial number.

This last point may seem a bit strange, but it’s because TFTP is, as the name suggests, trivial. The protocol provides no means for the client to identify itself to the server, so how are we to know which boot partition we should read files from?

Identification by MAC address is one possibility [6], but that’s not an option for us (unsupported by the TFTP server in dnsmasq). Instead we rely upon the Pi identifying itself by the sequence of files it initially attempts to request. When netbooting, a Pi (more specifically a Pi 4 or later) will attempt the following sequence of files [1]:

  • SERIAL/start4.elf
  • SERIAL/start.elf
  • start.elf

If the bootloader finds files with the SERIAL/ [7] prefix, all subsequent requests will also have that prefix, allowing us to easily determine which OS image files should be served from.
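As a small sketch of how the SERIAL prefix is derived (see footnote 7) and how it shapes those first requests, consider the following Python. The sample cpuinfo text is invented for illustration, and both function names are hypothetical.

```python
def serial_from_cpuinfo(cpuinfo: str) -> str:
    """Return the last 8 characters of the Serial field, in lower case."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Serial":
            return value.strip().lower()[-8:]
    raise ValueError("no Serial line found")


def firmware_candidates(serial: str) -> list[str]:
    """The files a Pi 4 tries first, in order, per the list above."""
    return [f"{serial}/start4.elf", f"{serial}/start.elf", "start.elf"]


# Made-up sample of the tail end of /proc/cpuinfo on a Pi.
SAMPLE = "Serial\t\t: 10000000ABCD1234\nModel\t\t: Raspberry Pi 4 Model B\n"

serial = serial_from_cpuinfo(SAMPLE)
candidates = firmware_candidates(serial)
print(serial, candidates)
```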

The Pi then proceeds to request (over TFTP):

  • The rest of the tertiary bootloader (e.g. start4.elf, fixup4.dat, and its configuration files like config.txt).
  • The device-tree for the specific board (e.g. bcm2711-rpi-4-b.dtb) and any overlays required by the configuration, or by devices that are plugged in (e.g. overlays/dwc2.dtbo, overlays/ov5647.dtbo).
  • The kernel and initramfs requested by the configuration, and its command line (e.g. vmlinuz, initrd.img, cmdline.txt).

With all this loaded into appropriate locations in memory, the bootloader hands over to the Linux kernel, which mounts the initramfs as its initial root file-system, and launches the /init binary within it.

In the case of Ubuntu, this is the usual initramfs you’ll find on pretty much any Ubuntu installation. It’ll search the kernel command line for the “real” root device, attempt to mount it, and “pivot” the root to that mount.

For the NFS case, the kernel command line would include something like nfsroot=server:/exports/ubuntu-jammy root=/dev/nfs. For the NBD case, the kernel command line would include something like nbdroot=server/ubuntu-jammy root=/dev/nbd0p2.
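To illustrate the kind of decision the initramfs makes from those parameters, here is a toy Python sketch. The real initramfs scripts are shell and handle far more cases (quoting included); the function names here are invented, and the logic is only meant to show the idea.

```python
def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into key/value pairs (no quoting support)."""
    params = {}
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        params[key] = value if sep else ""
    return params


def root_kind(params: dict[str, str]) -> str:
    """Guess how the real root will be reached (toy logic for illustration)."""
    if "nbdroot" in params:
        return "nbd"
    if "nfsroot" in params:
        return "nfs"
    return "local"


nbd = parse_cmdline("nbdroot=server/ubuntu-jammy root=/dev/nbd0p2")
nfs = parse_cmdline("nfsroot=server:/exports/ubuntu-jammy root=/dev/nfs")
print(root_kind(nbd), nbd["root"])
print(root_kind(nfs), nfs["root"])
```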

At this point you should have a basic understanding of the Pi’s netboot process. Let’s explore what the server side configuration for TFTP, NFS, and NBD can look like from a high level. We’ll get into specifics in the next post; this is just to give you an idea of the considerations and possibilities involved.

Glory, hallelujah!

A typical server TFTP configuration (whether subsequently NFS, NBD, or anything else) is to have the boot partition of an OS image unpacked or mounted under a particular path, and then make a symlink from the Raspberry Pi’s serial number to that path. Re-writing the symlink is then enough to switch which Pi boots which image.

While this much does not differ between the NFS and NBD cases, there is the question of how to make the boot files available.

In the case of NFS, as the server is serving a file-system it is typical to simply unpack the entire OS image, both the root and boot file-systems, into a directory and call that the “image” that is served. The symlink for the Pi’s serial number then points to the /boot/firmware directory within the unpacked image.

For example, if we have two OS images, ubuntu-jammy and ubuntu-mantic, and two Raspberry Pis with serial numbers 1234abcd and 4567cdef we might lay out our files like so:

/
├─ …
├─ srv/
│  ├─ ubuntu-jammy/
│  │  ├─ bin/
│  │  ├─ boot/
│  │  │  ├─ firmware/
│  │  │  └─ …
│  │  └─ …
│  ├─ ubuntu-mantic/
│  │  ├─ bin/
│  │  ├─ boot/
│  │  │  ├─ firmware/
│  │  │  └─ …
│  │  └─ …
│  └─ boot/
│     ├─ 1234abcd->/srv/ubuntu-jammy/boot/firmware
│     └─ 4567cdef->/srv/ubuntu-mantic/boot/firmware
└─ …

The two images are completely unpacked under /srv/ubuntu-jammy and /srv/ubuntu-mantic, then symlinks under /srv/boot point to the /boot/firmware directories of the unpacked images. Our TFTP server is configured to serve /srv/boot, and our NFS server to export /srv.
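Switching a Pi from one image to another is then just a matter of re-pointing its symlink. A minimal Python sketch follows, using a throwaway directory in place of /srv; the helper name point_serial_at and the “.new” staging suffix are my own inventions. It creates the new link under a temporary name and renames it into place so that the TFTP server never sees a missing link.

```python
import os
import tempfile


def point_serial_at(boot_dir: str, serial: str, target: str) -> None:
    """Create or replace boot_dir/serial as a symlink to target."""
    link = os.path.join(boot_dir, serial)
    staging = link + ".new"
    if os.path.lexists(staging):
        os.unlink(staging)
    os.symlink(target, staging)
    # Renaming over the old link replaces the link itself, in one step.
    os.replace(staging, link)


with tempfile.TemporaryDirectory() as srv:
    boot = os.path.join(srv, "boot")
    os.makedirs(boot)
    firmware = {}
    for image in ("ubuntu-jammy", "ubuntu-mantic"):
        firmware[image] = os.path.join(srv, image, "boot", "firmware")
        os.makedirs(firmware[image])

    point_serial_at(boot, "1234abcd", firmware["ubuntu-jammy"])
    point_serial_at(boot, "1234abcd", firmware["ubuntu-mantic"])  # switch image
    final_target = os.readlink(os.path.join(boot, "1234abcd"))
    print(final_target)
```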

The advantages are that it’s a relatively simple setup, requiring no special mounts on the server side, and that (assuming all unpacked images are in the same file-system on the server), all available space is shared between all netbooting Pis. The disadvantage (other than it being an NFS boot setup) is that all available space is shared between all netbooting Pis so one Pi can use up all available space for everyone, unless additional steps like quotas are taken.

In the case of NBD, which we’ll explore more in the next post, I would suggest a simple setup is to leave the OS image as it is (in a file) and simply expand the file to the desired size. A loop device can be created for the image file, with a partition scan to find the boot partition, which can then be mounted. I would still recommend using a symlink to point to the mount for ease of changing later.
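The “expand the file” step can be sketched in Python as below. The function name expand_image and the sizes are purely illustrative. Bear in mind this only grows the file itself; it does nothing to the partition table or the file-systems inside the image.

```python
import os
import tempfile


def expand_image(path: str, new_size: int) -> int:
    """Grow an image file to new_size bytes; return the size afterwards."""
    current = os.path.getsize(path)
    if new_size < current:
        raise ValueError("refusing to shrink an image")
    # Extending with truncate adds zeroed space at the end; on most
    # file-systems that space is sparse until written to.
    os.truncate(path, new_size)
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as srv:
    image = os.path.join(srv, "ubuntu-jammy.img")
    with open(image, "wb") as f:
        f.write(b"\x00" * 1024)  # stand-in for a real OS image
    expanded_size = expand_image(image, 8 * 1024 * 1024)
    print(expanded_size)
```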

Let’s consider the scenario from before (two images, jammy and mantic, two Pis with known serial numbers) in this setup:

/
├─ …
├─ srv/
│  ├─ ubuntu-jammy.img
│  ├─ ubuntu-mantic.img
│  ├─ mnt/
│  │  ├─ ubuntu-jammy/  (mount of ubuntu-jammy.img partition 1)
│  │  └─ ubuntu-mantic/  (mount of ubuntu-mantic.img partition 1)
│  └─ boot/
│     ├─ 1234abcd->/srv/mnt/ubuntu-jammy
│     └─ 4567cdef->/srv/mnt/ubuntu-mantic
└─ …

The two images are simply placed under /srv. A loop device is created for each, and the first partition of each is mounted under the appropriate directory under /srv/mnt. Symlinks under /srv/boot link Pi serial numbers to the appropriate mount. Our TFTP server is configured to serve /srv/boot as before, and our NBD server is configured to export each /srv/*.img file as a block device.

The advantages are that this is a trivial setup (the image doesn’t even need unpacking), and that each image is necessarily limited to its own storage. The disadvantage is that storage isn’t shared between images. However, we saw in the prior post that there’s all manner of things we can do with block devices, so we’ll explore some possibilities there in a future post too.

I should warn there’s also a nicely hidden, but quite serious, issue with this setup which we’ll look at next time when we actually build it (and then fix it … obviously). Anyway, that’s all for now!


[1] This is a gross over-simplification (or an outright lie), but serves to make the point.
[2] Alright, some of them don’t, or only recently added the functionality, or it’s still experimental (cough btrfs), but they’re usually the file-systems that have their own rules for block mapping.
[3] Caching is never simple.
[4] No other file system has a “make swap work on this file-system” kernel config item.
[5] I’m told Docker overlays also have issues with NFS but it’s not something I’ve played with directly and I’ve not had the time to verify that for this article.
[6] Though not for “security” (MACs can be trivially spoofed).
[7] The SERIAL portion is the serial number of the Pi (which can be found at the end of the output of cat /proc/cpuinfo), in lower-case hexadecimal format. If the serial number is longer than 8 characters, only the last 8 characters are used.