It's Pi all the way down...

In the prior post we looked at the issues with using NFS as our netboot root, and some of the alternatives. This time we’ll go through setting up the server and client of an NBD netboot system. If you want to play along at home you’ll need:

  • A Raspberry Pi. I’ll be using a Pi 4B [1].
  • A real [2] server running Ubuntu. I’m going to be using another Pi for this, but any old PC should be fine too (whatever server you use will benefit from as much IO bandwidth, both disk and network, as you can get).
  • An ethernet network. You can’t netboot a Pi over wifi, so you’ll need a wired ethernet connection for both the Pi and the server.

Before we begin you should ensure your server is up and running, and that you have remote SSH access to it. In the following instructions I’m assuming that your server’s hostname is server, and that you have an ubuntu user on it which can sudo to root. Adjust according to your setup!

We’re going to start with the client side of things, oddly, but there is method in the madness. Firstly, we need to ensure that the Pi’s bootloader is configured to attempt network booting (which none are by default). And secondly, we need to do some surgery on the initramfs for later.

Some Kind of Magic

We’re going to be using the current Ubuntu LTS (“jammy”, 22.04.3) throughout this guide, both on the server later, and for the Pi’s client image, to keep things relatively simple. Fire up rpi-imager and flash Ubuntu 22.04 server onto an SD card, then boot that SD card on your chosen Pi.

Warning

Do not be tempted to upgrade packages at this point. Specifically, the kernel package must not be upgraded yet.

Now we need to ensure that the Pi is configured to attempt network booting. This is a one-time change which will be stored in the EEPROM of the Pi in question. On the Pi, extract the current boot configuration from the EEPROM, modify the existing BOOT_ORDER= line (or append a new one if none is present), and apply the modified configuration:

$ sudo rpi-eeprom-config > boot.conf
$ cat boot.conf
[all]
BOOT_UART=1
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
BOOT_ORDER=0xf41
$ sed -i -e '/^BOOT_ORDER=/d' boot.conf
$ echo BOOT_ORDER=0xf21 >> boot.conf
$ cat boot.conf
[all]
BOOT_UART=1
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
BOOT_ORDER=0xf21
$ sudo rpi-eeprom-config --apply boot.conf
Updating bootloader EEPROM
 ...
$ sudo reboot

Obviously, feel free to fire up your favourite editor and just change the BOOT_ORDER= line yourself, instead of messing with sed. The mysterious 0xf21 value is explained fully in the BOOT_ORDER documentation on the Raspberry Pi website, but simply means try the SD card first (1), followed by the network (2), instead of USB boot (4) previously, and if both fail then repeat (f) [3]. The digits are specified in reverse order for $reasons.

The reboot at the end is required to apply the new configuration to the boot EEPROM. You can run sudo rpi-eeprom-config after rebooting to check the newly applied configuration.
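If the right-to-left reading of BOOT_ORDER is hard to keep in your head, here is a small sketch that decodes a value into the order the bootloader tries things. The function name decode_boot_order is my own invention, and it only knows the digits discussed in this post (1, 2, 4 and f); anything else is reported as unknown, so consult the BOOT_ORDER documentation for the full list.

```shell
# Decode a BOOT_ORDER value such as 0xf21 into the sequence of boot modes
# the bootloader attempts. Only the digits mentioned in this post are named.
decode_boot_order() {
    value=${1#0x}                       # strip the leading "0x"
    while [ -n "$value" ]; do
        digit=${value#"${value%?}"}     # last character: digits are read right to left
        value=${value%?}                # drop that character
        case $digit in
            1) echo "SD card" ;;
            2) echo "network" ;;
            4) echo "USB boot" ;;
            f|F) echo "repeat from the start" ;;
            *) echo "unknown ($digit)" ;;
        esac
    done
}

decode_boot_order 0xf21
```

For 0xf21 this lists the SD card, then the network, then the repeat, matching the description above.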

Next, we need to install the linux-modules-extra-raspi package for the currently running kernel version. The reason is that the nbd kernel module was moved out of the default linux-modules-raspi package for efficiency. We specifically need the version matching the running kernel version because installing this package will regenerate the initramfs (initrd.img). We’ll be copying that regenerated file into the image we’re going to netboot and it must match the kernel version in that image. This is why it was important not to upgrade any packages after the first boot.

We also need to install the NBD client package. This will add the nbd-client executable to the initramfs, along with some scripts to call it if the kernel command line specifies an NBD device as root [4]:

$ sudo apt install linux-modules-extra-$(uname -r) nbd-client
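Before copying the initramfs anywhere, it may be worth checking that the regenerated file really does contain both pieces. The following is a rough sketch: check_nbd_in_listing is a hypothetical helper of mine that reads a file listing on stdin, and I’m assuming the module file is named nbd.ko (optionally with a compression suffix) and the client binary nbd-client.

```shell
# Read an initramfs file listing on stdin and report whether the two pieces
# needed for an NBD root are present: the nbd kernel module and nbd-client.
check_nbd_in_listing() {
    listing=$(cat)
    status=0
    if printf '%s\n' "$listing" | grep -Eq '/nbd\.ko(\.[a-z0-9]+)?$'; then
        echo "nbd kernel module: found"
    else
        echo "nbd kernel module: MISSING"
        status=1
    fi
    if printf '%s\n' "$listing" | grep -Eq '(^|/)nbd-client$'; then
        echo "nbd-client binary: found"
    else
        echo "nbd-client binary: MISSING"
        status=1
    fi
    return $status
}

# On the Pi (lsinitramfs ships with initramfs-tools):
#   lsinitramfs /boot/firmware/initrd.img | check_nbd_in_listing
```

The function returns non-zero if either piece is missing, so it can gate the scp step in a script.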

We need to gather one piece of information about the client Pi for use later on the server: its serial number. We’ll store this in a file and copy it and the initrd.img to the server. Finally, we’ll shut down the Pi and move to the server side of things:

$ grep Serial /proc/cpuinfo > pi-ident.txt
$ cat pi-ident.txt
Serial          : 1000000089025d75
$ scp -q pi-ident.txt ubuntu@server:
$ scp -q /boot/firmware/initrd.img ubuntu@server:
$ sudo poweroff

Why’d you have to be so good?

The first thing to do on the server is get the image [5] we want to serve, and do a little surgery on it. We flashed Ubuntu 22.04.3 so we set up a directory under /srv to hold the image, wget it, and check the SHA256 checksum. Note that we’re going to perform most of these steps as root:

$ sudo -i
Password:
# mkdir /srv/images
# cd /srv/images
# wget http://cdimage.ubuntu.com/releases/22.04.3/release/ubuntu-22.04.3-preinstalled-server-arm64+raspi.img.xz
 ...
# wget http://cdimage.ubuntu.com/releases/22.04.3/release/SHA256SUMS
 ...
# sha256sum --check --ignore-missing SHA256SUMS
ubuntu-22.04.3-preinstalled-server-arm64+raspi.img.xz: OK
# rm SHA256SUMS

Now we’re going to unpack the image (it’s no good mounting something that’s XZ compressed), rename it to something more manageable, and expand the image file to the full size of SD card we want to emulate (I’m using 8GB here, but change the fallocate command accordingly):

# unxz ubuntu-22.04.3-preinstalled-server-arm64+raspi.img.xz
# mv ubuntu-22.04.3-preinstalled-server-arm64+raspi.img jammy.img
# ls -lh
-rw-rw-r-- 1 root root 4.0G Oct  5 14:12 jammy.img
# fallocate -l 8G jammy.img
# ls -lh
-rw-rw-r-- 1 root root 8.0G Oct  5 14:51 jammy.img

We’ve expanded the image, but naturally the partitions inside it haven’t been changed in size, and nor have the file-systems inside those partitions. However, that’s fine. This is exactly what a freshly flashed 8GB SD card looks like. The device (in this case the image file) is 8GB, but the root partition inside is a mere 3.7 GB. We can use fdisk to see this:

# fdisk -l jammy.img
Disk jammy.img: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x542d34fa

Device     Boot  Start     End Sectors  Size Id Type
jammy.img1 *      2048  526335  524288  256M  c W95 FAT32 (LBA)
jammy.img2      526336 8320243 7793908  3.7G 83 Linux

On first boot of an 8GB SD card, the cloud-init service checks its configuration, sees that it should expand the root, re-writes the root partition, and expands the file-system. Exactly the same will happen here when we first boot this image.

Next, we need to overwrite the initrd.img in the boot partition of this image with the one we generated on our client (and which we copied to the server earlier). In order to do so we need to mount this image. Normally, if the image file contained only the file-system we wanted to manipulate, this would be as trivial as running mount -o loop jammy.img some-path. But this image contains partitions, so the file-system we wish to mount isn’t at the start of the image.

To get around this, instead of having a loop device created implicitly (with mount’s -o loop option), we need to make our own loop-device and tell the kernel to scan it for partitions. Then we’ll create a mount-point and mount the first partition there. Finally, we’ll copy our customized initrd.img into the mount-point:

# losetup --find --show --partscan jammy.img
/dev/loop5
# ls -l /dev/loop5*
brw-rw---- 1 root disk   7, 5 Oct  5 21:20 /dev/loop5
brw-rw---- 1 root disk 259, 0 Oct  5 21:20 /dev/loop5p1
brw-rw---- 1 root disk 259, 1 Oct  5 21:20 /dev/loop5p2
# mkdir boot
# mkdir boot/jammy
# mount /dev/loop5p1 /srv/images/boot/jammy
# cp ~ubuntu/initrd.img boot/jammy/

Warning

The loop device on your system will likely have a different number; adjust /dev/loop5 references accordingly.
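For a one-off mount there is an alternative to scanning for partitions: tell mount where in the image the file-system starts, using its offset= option. The sketch below just does the arithmetic, turning the Start column from fdisk into a byte offset. partition_offset is a made-up helper name, and the 512-byte default is an assumption taken from the fdisk output above.

```shell
# Convert a partition's start sector (the "Start" column of fdisk -l) into
# the byte offset that mount's offset= option expects.
partition_offset() {
    start_sector=$1
    sector_size=${2:-512}    # fdisk reported 512-byte sectors for this image
    echo $(( start_sector * sector_size ))
}

partition_offset 2048        # boot partition of jammy.img

# Hypothetical use, as root (not run here):
#   mount -o loop,offset=$(partition_offset 2048) jammy.img /srv/images/boot/jammy
```

I’m sticking with the explicit loop device in this post, since having a stable /dev/loopNp1 name to refer to turns out to be handy later.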

Next, we should also customize the cloud-init initial configuration to ensure the image installs the same packages that we installed on the client earlier:

# cat << EOF >> boot/jammy/user-data
package_update: true
packages:
- avahi-daemon
- nbd-client
- linux-modules-extra-raspi
EOF

If we don’t do this, the next time our netbooted client refreshes its initramfs, it would generate it without the NBD client (and would naturally fail at the next reboot).

Now we need to edit the kernel command line to tell it that its root device is an NBD share. The kernel command line is one long line of text with space-separated portions. We’re going to split those space-separated bits into individual lines to make it easier to manipulate, replace the existing root=LABEL=writable portion [6], and insert the following portions instead:

  • ip=dhcp obtains an IP address via DHCP (we need one before we can find the root device)
  • nbdroot=server/jammy sets up an NBD client and connects to the jammy share on the host server (adjust this to match your server’s name or IP address)
  • root=/dev/nbd0p2 finds the actual root on the second partition of the connected NBD device

Instead of doing my usual confusing one-liner, we’ll step through the actions below, but feel free to fire up your favourite text editor and just edit cmdline.txt directly if you find that easier:

# cat boot/jammy/cmdline.txt
console=serial0,115200 dwc_otg.lpm_enable=0 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait fixrtc quiet splash
# cat boot/jammy/cmdline.txt | tr ' ' '\n' > /tmp/cmdline.txt
# cat /tmp/cmdline.txt
console=serial0,115200
dwc_otg.lpm_enable=0
console=tty1
root=LABEL=writable
rootfstype=ext4
rootwait
fixrtc
quiet
splash
# sed -i -e '/^root=/ s@=.*$@=/dev/nbd0p2@' /tmp/cmdline.txt
# sed -i -e '/^root=/ i ip=dhcp' /tmp/cmdline.txt
# sed -i -e '/^root=/ i nbdroot=server/jammy' /tmp/cmdline.txt
# cat /tmp/cmdline.txt
console=serial0,115200
dwc_otg.lpm_enable=0
console=tty1
ip=dhcp
nbdroot=server/jammy
root=/dev/nbd0p2
rootfstype=ext4
rootwait
fixrtc
quiet
splash
# paste -s -d ' ' /tmp/cmdline.txt > boot/jammy/cmdline.txt
# cat boot/jammy/cmdline.txt
console=serial0,115200 dwc_otg.lpm_enable=0 console=tty1 ip=dhcp nbdroot=server/jammy root=/dev/nbd0p2 rootfstype=ext4 rootwait fixrtc quiet splash
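If you end up preparing more than one image, the steps above can be folded into a filter. This is only a sketch of the same split, substitute, and rejoin approach: nbd_cmdline is a hypothetical name, it takes the server and share as arguments, and it assumes the root partition is always the second one on the NBD device, as it is here.

```shell
# Rewrite a kernel command line (read from stdin) so the root device is an
# NBD share. $1 is the NBD server's name or IP address, $2 the share name.
nbd_cmdline() {
    tr -s ' ' '\n' |                       # one parameter per line
        awk -v target="$1/$2" '
            /^root=/ {                     # swap the old root= for the NBD trio
                print "ip=dhcp"
                print "nbdroot=" target
                print "root=/dev/nbd0p2"
                next
            }
            { print }' |
        paste -s -d ' ' -                  # back to a single line
}

echo 'console=tty1 root=LABEL=writable rootfstype=ext4 rootwait' |
    nbd_cmdline server jammy
```

To apply it for real you would write the output to a temporary file first and then copy that over boot/jammy/cmdline.txt, rather than redirecting straight onto the file being read.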

Now it’s time to configure the DHCP proxy and NBD server we talked about in the last article. We’ll start with the packages we’re going to need: the NBD server itself, and the ubiquitous dnsmasq daemon which will be handling DHCP, and TFTP for our netbooting clients.

Note

Don’t worry if you’ve already got a DHCP server on your network. I’ve assumed that you almost certainly do, so we’ll be configuring the DHCP portion of dnsmasq in DHCP “proxy” mode, where it simply steps in to augment the options transmitted by the authoritative DHCP server.

To put it another way: you shouldn’t have to dismantle or reconfigure your network to play along!

Install the required packages:

# apt install dnsmasq nbd-server

During this installation you may see several warnings about dnsmasq being unable to start due to the address already being in use. This is normal and occurs because systemd-resolved is already listening on port 53 (the DNS port) for the loopback address, so it can cache DNS requests. We now configure dnsmasq to only listen on port 53 of the ethernet NIC, to act as a DHCP proxy, and TFTP server:

# cat << EOF >> /etc/dnsmasq.conf
interface=eth0
bind-interfaces
dhcp-range=192.168.255.255,proxy
pxe-service=0,"Raspberry Pi Boot"
enable-tftp
tftp-root=/srv/images/boot
EOF
# systemctl restart dnsmasq

Note

Adjust the reference to eth0 if your Ethernet NIC is named something else. If your network’s broadcast address is not 192.168.255.255, adjust the dhcp-range line accordingly.
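The address on the dhcp-range line has to match your network. Assuming that what’s wanted there is the network’s broadcast address (which is how the Raspberry Pi network boot documentation describes it), the following sketch works it out from the server’s IP address and prefix length, as shown by ip addr. broadcast_addr is a hypothetical helper, and the addresses in the example are made up.

```shell
# Compute the IPv4 broadcast address for an address and prefix length.
# The body runs in a subshell so the IFS change doesn't leak out.
broadcast_addr() (
    ip=$1
    prefix=$2
    IFS=.
    set -- $ip               # split the address into its four octets
    IFS=' '
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    hostmask=$(( (1 << (32 - prefix)) - 1 ))
    bcast=$(( addr | hostmask ))
    echo "$(( (bcast >> 24) & 255 )).$(( (bcast >> 16) & 255 )).$(( (bcast >> 8) & 255 )).$(( bcast & 255 ))"
)

broadcast_addr 192.168.1.10 24    # prints 192.168.1.255
```

A server on a /16 network starting 192.168 would get 192.168.255.255, the value used in the configuration above.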

Next up is the NBD server, which simply needs to point the share “jammy” at our “jammy.img”. However, we also need to remember to change ownership of our image so the unprivileged “nbd” user can write to it:

# chown nbd:nbd jammy.img
# ls -lh
total 8.1G
drwxr-xr-x 3 root root 4.0K Oct 30 16:49 boot
-rw-r--r-- 1 nbd  nbd  8.0G Oct 30 16:55 jammy.img
# cat << EOF >> /etc/nbd-server/conf.d/jammy.conf
[jammy]
exportname = /srv/images/jammy.img
EOF
# systemctl restart nbd-server

Finally, we link our Pi’s serial number (or more precisely, the last 8 digits of it, if it’s longer than that) with the mounted boot partition.

# cat ~ubuntu/pi-ident.txt
Serial          : 1000000089025d75
# piserial=$(sed -e '1s/^Serial.*\([0-9a-f]\{8\}\)$/\1/' ~ubuntu/pi-ident.txt)
# echo $piserial
89025d75
# ln -s jammy boot/$piserial
# ls -l boot
total 3
lrwxrwxrwx 1 root root    5 Oct 30 16:49 89025d75 -> jammy
drwxr-xr-x 3 root root 2560 Jan  1  1970 jammy
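If the sed expression above makes your eyes water, here is an equivalent sketch in awk which also copes with being fed the whole of /proc/cpuinfo rather than just the Serial line. serial_suffix is a name I’ve made up for illustration.

```shell
# Print the last 8 characters of the Pi's serial number, given the "Serial"
# line from /proc/cpuinfo (or the whole file) on stdin.
serial_suffix() {
    awk '/^Serial/ { s = $NF; print substr(s, length(s) - 7); exit }'
}

echo 'Serial          : 1000000089025d75' | serial_suffix    # prints 89025d75

# Hypothetical use on the server, from /srv/images:
#   ln -s jammy "boot/$(serial_suffix < ~ubuntu/pi-ident.txt)"
```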

Keeps Me From Runnin’

You’re now at a point where you can try netbooting your client Pi. Remove its SD card, then plug the power back in. You should see the “rainbow” boot screen appear fairly quickly, but there’ll be a long pause on that screen. The reason is that your Pi is transferring initrd.img (which is now much larger than normal due to our installation of linux-modules-extra) over TFTP, which is not an efficient protocol without certain extensions that the Pi’s bootloader doesn’t implement. However, eventually you should be greeted by the typical Linux kernel log scrolling by and reach a typical “booted” state the same as you would with an SD card.

If you hit any snags here, the following things are worth checking:

  • Pay attention to any errors shown on the Pi’s bootloader screen. In particular, you should be able to see the Pi obtaining an IP address via DHCP and various TFTP request attempts.
  • Run journalctl -f --unit dnsmasq.service on your server to follow the dnsmasq log output. Again, if things are working, you should be seeing several TFTP requests here. If you see nothing, double check the address on the dhcp-range line of the dnsmasq configuration, and that any firewall on the server is permitting inbound traffic to port 69 (the TFTP port).
  • You will see numerous “Early terminate” TFTP errors in the dnsmasq log output. This is normal, and appears to be how the Pi’s bootloader operates (my guess would be it’s attempting to determine the size of a file with the tsize extension, terminating the transfer, allocating RAM for the file, then starting the transfer again).
  • If cloud-init’s final phase running apt update and apt install avahi-daemon linux-modules-extra-raspi nbd-client fails (it seems to randomly on my test Pi), just log in and run them manually.

At this point you should have a fully booted system with a block device as the root. All is good! We can use anything we would on a regular Pi with an SD card, including snapd, docker, or anything else. But there’s a rather serious problem waiting silently for us. If things have worked correctly, nbd-client and linux-modules-extra-raspi will have been installed. This will have re-built the initramfs, and likely also have upgraded the kernel package. If you attempt to reboot at this point, you’ll likely find the next boot fails on the rainbow screen.

Spot the problem? Two things are accessing the boot partition’s block device. First, the TFTP server (dnsmasq) is reading the boot partition via a loop device. Second, the nbd-server is serving the boot partition directly from the image.

Recall that mounts of block devices assume they have exclusive access to the underlying block device. But here, the vfat driver on the server does not have exclusive access to the boot partition. What happens if we change the boot partition on the client? Does the server notice?

Try the following experiment:

  • On the client: sudo touch /boot/firmware/foo
  • On the server: ls /srv/images/boot/jammy/foo

You should find that, even if you sudo sync on the client, the foo file simply won’t show up on the server’s boot mount. The boot mount needs remounting to make it re-read the necessary blocks from the underlying image. This bodes badly for anything that writes to the boot partition, as happens when installing a new kernel, or anything that causes the initramfs to be rebuilt.

Thus, to resurrect your netbooting Pi do the following on the server:

# umount /srv/images/boot/jammy
# mount /dev/loop5p1 /srv/images/boot/jammy
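If you find yourself doing this by hand more than once, a tiny wrapper saves some typing. This is a sketch only: remount_boot and the DRY_RUN variable are my own inventions, and it does a full unmount followed by a fresh mount, since -o remount is not enough. With DRY_RUN=1 it prints the commands instead of running them, which is what the example does.

```shell
# Fully unmount and re-mount the boot partition so the server's vfat driver
# re-reads blocks the client changed over NBD. $1 is the loop partition,
# $2 the mount-point. Set DRY_RUN=1 to print the commands instead.
remount_boot() {
    device=$1
    mountpoint=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then run=echo; else run=; fi
    $run umount "$mountpoint" && $run mount "$device" "$mountpoint"
}

DRY_RUN=1 remount_boot /dev/loop5p1 /srv/images/boot/jammy
```

Without DRY_RUN it needs to run as root, and ideally while the client is shut down so nothing is writing to the partition mid-remount.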

Well, this sucks. Solutions?

  1. Use NFS for the boot partition mount instead of NBD. This will work, but there are several issues here. Firstly … it requires an NFS server! Secondly, you need to customize the fstab in the image before first boot (urgh). Thirdly, you need to extract the content of the boot partition to an exported directory and point TFTP to that (which wastes space), or you can stick with the looped mount, but then you must be absolutely certain that you mount the boot partition with NFS on the client, not NBD.
  2. Try and remount the boot partition when the client reboots. This is risky as it’s almost impossible to guarantee in practice. If you think you can get clever with DHCP hook scripts in dnsmasq to remount just as the machine is booting (because DHCP comes before the first TFTP request), you’re wrong [7]. dnsmasq scripts run asynchronously and you can’t unmount a “busy” partition. It’s still possible to do this manually, but that’s boring! [8]
  3. Write our own TFTP server that directly accesses the image without ever mounting it! Hmmm…

Join me next time for the conclusion of this series where … well, you can probably guess where this is going …


[1]The Pi 2B, 3B, and 3B+ can also netboot, but the instructions differ on each, and this article is long enough as it is. See Network boot for more information, and drop me a question in the comments if you need more!
[2]Your server needs to be a “real” server, or at the very least a full virtual machine, not a container. Specifically, you need to be able to create loop devices and mount images (neither of which is typically possible in a container environment).
[3]If you just want to add network boot instead of replacing the USB boot option, that’s fine too; you can use 0xf421 or 0xf241 here. Also, the reason we’ve left SD card boot in the order is simply for safety (there are recovery methods available even if you remove it but it’s simpler to just leave it there).
[4]In mantic, the nbd module moved back to linux-modules-raspi, so you can skip installing linux-modules-extra there if you want, but you’ll still need nbd-client. In noble, I’m intending to seed nbd-client in the image, so no modifications at all should be required. It’s a pretty small install (172KB) so it constitutes little bloat, and the ability to netboot out of the box (with an appropriately configured Pi) seems to justify the change to me.
[5]If you’re cunning, you can avoid downloading the image again by extracting it from the rpi-imager cache. On Linux at least, this is under ~/.cache/Raspberry Pi/Imager/lastdownload.cache. Because the filename has lost its extension I usually use file to determine what it was (e.g. if it’s “XZ compressed data, checksum CRC64” I know it was .img.xz).
[6]Unfortunately, the initramfs isn’t intelligent enough to go searching for a label-based root in an NBD device, even though technically this is valid.
[7]Ask me how I know this!
[8]Before anyone asks, no -o remount isn’t sufficient. It needs to be completely unmounted and re-mounted.