Updated: Minor punctuation changes, and a small clarification of why the current fallback probably isn’t a fallback at all
The way Ubuntu boots on the Raspberry Pi is changing in questing. Here’s the story behind the changes (along with my usual copious tangents, and finishing up with some details on how to avoid these changes if you really, really need to).
Boots an’ all
Our current boot setup is… far from optimal. To understand it we should first have a look at how the Pi boots, and how Ubuntu uses this (skip this section if you’re familiar with the Pi’s bootloader!).
The Pi’s native bootloader is split into three parts which we’ll simply call Stages 1, 2, and 3.
Stage 1 is always on the SoC (on all models). Its main job (as far as we’re concerned) is to find and load stage 2. Its configuration is the autoboot.txt file. If this is present, it must be on the first partition of the boot media. Due to stage 1’s very limited resources (there’s no RAM available at this stage), this file is limited to 512 bytes, and only understands a few directives. The most important is boot_partition which tells it which partition contains the rest of the boot assets.
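Given that 512 byte limit, it can be handy to check the size of autoboot.txt before trusting it. The following is a minimal sketch only; check_autoboot_size is a hypothetical helper name, not part of any Pi or Ubuntu tooling:

```shell
#!/bin/sh
# Sketch: report whether an autoboot.txt fits within the 512 byte limit.
# check_autoboot_size is a hypothetical helper, invented for illustration.
check_autoboot_size() {
    file=$1
    limit=512
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
    # wc -c counts bytes; tr strips the padding some wc implementations add
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$limit" ]; then
        echo "$file is $size bytes (limit $limit)" >&2
        return 1
    fi
    echo "$file is $size bytes (ok)"
}
```

For example, check_autoboot_size /boot/firmware/autoboot.txt (assuming that is where your first partition is mounted) returns non-zero if the file is too large.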
Prior to the Pi 4, stage 2 was the bootcode.bin executable. From the Pi 4 onwards, this is part of the boot EEPROM. I’m not fully clear on all the things stage 2 does, but I do know this stage handles bringing up some more bits of hardware to load and execute the (much bigger) third stage. For instance, when network booting, it’s this stage that brings the ethernet port up, performs DHCP, and starts the TFTP process for the other boot assets.
The second stage also reads at least some of the config.txt configuration file, as this can be used to customize which binary is loaded for the third stage.
Stage 3 is the real “meat” of the bootloader. On the Pi 3 and earlier it was named start.elf; on the Pi 4 this became start4.elf; on the Pi 5 this is another part of the boot EEPROM [1]. There were 4 historical variants of this stage:
- start.elf / start4.elf
- The default third stage
- start_cd.elf / start4cd.elf
- A “cut down” version of the bootloader with minimal facilities [2]
- start_db.elf / start4db.elf
- The “debug” build of the bootloader
- start_x.elf / start4x.elf
- The bootloader incorporating the legacy camera firmware [3]
Each of these binaries has a corresponding fixup*.dat (fixup.dat, fixup4.dat, etc.) file which contains the relocation information [5] for each binary.
This stage is responsible for loading [4] the device-tree and customizing it for the hardware detected, loading the binaries containing the Linux kernel, and (optionally) the initramfs. It parses all of config.txt which has many options that can be used to customize the booted state. It also reads cmdline.txt which contains the Linux kernel’s command line. Importantly, this also defines the eventual rootfs location.
Finally, this stage is responsible for bringing the ARM cores online and starting the kernel (passing it the addresses of the fixed up device-tree, the optional initramfs, and the kernel command line).
Summarizing, the following table shows the location of the various stages on the various generations of Pi:
Generation | SoC | EEPROM | SD / USB / NVMe |
---|---|---|---|
Pi 1, 2, 3, and Zero | 1 | | 2, 3 |
Pi 4, 400 | 1 | 2 | 3 |
Pi 5, 500 | 1 | 2, 3 | |
One other important thing to note is that the various stages can only read the FAT file-system [6], so all the following assets must be placed on FAT partition(s) somewhere:
- Any bootloader assets that don’t live in EEPROM for the generation of Pis you need to support
- The base device-tree(s) of all models you need to support
- All the device-tree overlays wanted (optional)
- The Linux kernel
- The initramfs (optional)
Die with your boots on
What does Ubuntu’s current boot set up look like, particularly with regard to safety? Erm…
On the plus side we always keep two sets of boot assets around on the boot partition. But I’m afraid that’s it for the good news:
- There is no fallback facility. If your boot configuration is corrupted, if we release a bad kernel update, etc. your boot will fail and it’s up to you to pick up the pieces
- The boot assets are spread across many files (the aforementioned bootloader files, the kernel, the initramfs, the base device trees, and a few hundred overlays). But the backup files aren’t in easy-to-swap directories: they’re simply next to their original files with a “.bak” suffix.
- If you know enough shell scripting you might figure out that for f in $(find /boot/firmware -type f -name "*.bak"); do cp "$f" "${f%.bak}"; done is what you want [7].
- Even if you figure this out, the old boot files probably aren’t there. If flash-kernel has run more than once since your last boot (this is extremely likely given it runs in response to initramfs rebuilds, kernel updates, flash-kernel upgrades, and so on) the “.bak” files will simply be copies of the new boot assets!
This is Bad with a capital B. There is no automatic fallback, and the manual fallback is extremely painful and not even likely to work.
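For what it’s worth, a slightly more defensive take on the “.bak” restore loop might look like the sketch below. restore_bak is a hypothetical helper name, and this version is equally untested against a real boot partition. It also does nothing about the deeper problem that the “.bak” files may already be copies of the new assets:

```shell
#!/bin/sh
# Sketch: copy every "*.bak" file under a directory over its original.
# restore_bak is a hypothetical helper; try it on a copy of the boot
# partition first, as it has not been tested on a real one.
restore_bak() {
    dir=$1
    # -exec ... {} + hands the file names over as arguments, so names
    # containing spaces survive (the $(find ...) form would split them)
    find "$dir" -type f -name '*.bak' -exec sh -c '
        for f in "$@"; do
            cp -- "$f" "${f%.bak}" || exit 1
        done
    ' sh {} +
}
```

It would be called as restore_bak /boot/firmware, with the same caveats as before.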
Puttin’ the boot in
A little-known [8] facility of the Pi’s bootloader is the “tryboot” facility. This is a rather neat system designed to provide a robust means to implement A/B booting. Classically, A/B booting involves having two separate copies of all your critical boot assets, and having some mechanism to switch between them.
Let’s say you booted from set “A”. A new kernel is released, and gets installed into “B”, and the bootloader is set to boot from “B” instead. The bootloader then needs some means of noticing if the “B” boot fails, falling back to “A”. In some cases this is done by having the bootloader record state itself in some persistent location, but the Pi’s mechanism for this is quite ingenious: it uses ephemeral state [9] to track that it should be trying “B” so that any failure results in it falling back to “A”.
How does this work in practice? A typical boot-flow reads the following files:
- Stage 1 reads boot_partition from autoboot.txt (if it exists)
- Reads stage 2 (bootcode.bin) from the aforementioned boot_partition (the first by default, but from EEPROM on the Pi 4 and 5)
- Reads stage 3 (start*.elf) from the aforementioned boot_partition (from EEPROM on the Pi 5)
- Reads config.txt to determine the rest of the configuration, which specifies the locations of all remaining assets
- Reads the kernel, initramfs, device-trees, overlays, etc. from the boot_partition
However, if you reboot the Pi in “tryboot” mode this changes to:
- Stage 1 reads boot_partition from autoboot.txt (if it exists), but optionally reads a [tryboot] section that may override boot_partition (potentially directing it to a different partition)
- Reads stage 2 (bootcode.bin) from the (potentially overridden) boot_partition (from EEPROM on the Pi 4 and 5)
- Reads stage 3 (start*.elf) from the (potentially overridden) boot_partition (from EEPROM on the Pi 5)
- Reads tryboot.txt (instead of config.txt) which specifies the locations of all remaining assets
- Reads the kernel, initramfs, device-trees, overlays, etc. from the (potentially overridden) boot_partition
The “tryboot” mode is initiated using the following command to reboot the Pi:
$ sudo reboot '0 tryboot'
This leads to a couple of typical designs for a tryboot-enabled boot implementation on the Pi…
Full ABs
A “full” A/B boot implementation consists of the following partition layout:
- Partition 1 (FAT)
Just contains autoboot.txt with the following content [10]:
[all]
boot_partition=2
[tryboot]
tryboot_a_b=1
boot_partition=3
- Partition 2 (FAT)
Contains the “A” set of boot assets. This includes the stage 2 and 3 binaries (if support of older Pi models is required), the kernel, initramfs, device-trees, overlays, etc. No directories required; everything’s in the root.
Boot configuration is stored in config.txt. The kernel command line in cmdline.txt points to partition 4 as the rootfs.
- Partition 3 (FAT)
- Contains the “B” set of boot assets. Identical layout to partition 2, but cmdline.txt points to partition 5 as the rootfs.
- Partition 4
- The “A” rootfs
- Partition 5
- The “B” rootfs
Note
Note the tryboot_a_b=1 line in autoboot.txt. This causes the third stage to read config.txt instead of tryboot.txt because the assumption is that, in this layout, the boot configuration will be precisely duplicated (both boot partitions have config.txt).
To switch between the A and B sets, autoboot.txt on the first partition is updated (preferably atomically [11]) with swapped values of the boot_partition= lines.
Typically the first partition is some smaller variant of FAT (FAT-12/16) while partitions two and three are FAT-32. Partitions 4 and 5 are whatever you use as a rootfs (commonly ext4, but can be anything your kernel and initramfs combination can support).
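Switching between the A and B sets by swapping the boot_partition= values could be sketched roughly as follows. This assumes an autoboot.txt with exactly the shape shown above; swap_boot_partitions is a hypothetical helper name, and the write-to-temporary-then-rename approach is the one described in footnote [11]:

```shell
#!/bin/sh
# Sketch: swap the boot_partition= values in an autoboot.txt, replacing the
# file via a rename so readers see either the old or the new content.
# swap_boot_partitions is a hypothetical helper, invented for illustration.
swap_boot_partitions() {
    file=$1
    # The temporary file must live on the same file-system as the target,
    # otherwise mv degrades to a (non-atomic) copy
    tmpfile=$(mktemp "$(dirname "$file")/autoboot.XXXXXX") || return 1
    # First pass over the file collects the values in order; the second
    # pass re-emits every line with the values in reverse order
    awk -F= '
        NR == FNR { if ($1 == "boot_partition") vals[++n] = $2; next }
        $1 == "boot_partition" { i++; print "boot_partition=" vals[n - i + 1]; next }
        { print }
    ' "$file" "$file" > "$tmpfile" || { rm -f "$tmpfile"; return 1; }
    mv -f "$tmpfile" "$file"
}
```

With the example content above, this leaves [all] pointing at partition 3 and [tryboot] at partition 2, with every other line untouched.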
The benefits of this layout are pretty obvious:
- It switches everything including all the bootloader assets (at least on the Pi models without any EEPROM).
- It’s also capable of switching the entire rootfs, although this is optional (the “B” cmdline.txt could equally point at partition 4 to have a single rootfs).
The drawbacks are:
- You pretty much have to design this “up front” into your image; it’s great for fresh installations, but almost impossible to safely migrate existing installations with a more basic layout to it.
- This design is also more complex for users who wish to edit the boot configuration as they now need to figure out which partition is mounted and edit the boot files on the other partition [12]. That said, I suspect most uses of this layout try and ensure that users have no ability to mess with the boot configuration at all.
The beer belly version
A much simpler configuration uses the traditional partition layout but with some minor tweaks to the files on the boot partition:
- Partition 1 (FAT-32)
No autoboot.txt (as per usual)
Bootloader assets (bootcode.bin, start*.elf) are placed in the root of this partition.
The a/ directory contains the remainder of the boot assets: device-trees, Linux kernel, initramfs, overlays, cmdline.txt (which points at partition 2 as the rootfs).
Another directory, b/ contains a second set of these boot assets.
config.txt contains something like the following:
[all]
os_prefix=a/
kernel=vmlinuz
initramfs initrd.img followkernel
...
tryboot.txt contains something like:
[all]
os_prefix=b/
kernel=vmlinuz
initramfs initrd.img followkernel
...
- Partition 2
- The rootfs
To switch between A and B, either config.txt and tryboot.txt are exchanged, or the a/ and b/ directories are switched (preferably atomically [13]).
The main advantage of this layout is simplicity:
- It’s fairly easy to migrate an existing classical layout to this, and do so in a “safe” manner that guarantees a “known good” boot configuration is present at every step.
- It’s also easier for users who know all their boot configuration will be present on a single partition (they don’t need to go potentially mounting things).
The disadvantages are:
- The bootloader assets (bootcode.bin, start*.elf) cannot be switched by this method as they cannot be read from sub-directories. However, bear in mind that these are only used by older models of Pi and thus are unlikely to change much in the coming years anyway.
- Changing the boot configuration becomes a little more complex. Let’s say you want to try a new overlay. You add the dtoverlay= line to tryboot.txt and initiate “tryboot”. Your OS boots successfully, and some service switches tryboot.txt and config.txt. However, now you need to remember to make the same edit to the new tryboot.txt (your former config.txt) or the next switch will lose the change.
- If you want to edit the kernel command line, you still have to query config.txt to figure out which cmdline.txt to fiddle with (a/cmdline.txt or b/cmdline.txt).
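Querying config.txt to find the cmdline.txt in use could be as simple as the sketch below. It assumes a plain config.txt like the example above, with a single unconditional os_prefix= line; active_cmdline is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the cmdline.txt path selected by the first os_prefix= line.
# Assumes no conditional filters alter os_prefix; active_cmdline is a
# hypothetical helper, invented for illustration.
active_cmdline() {
    config=$1
    # Keep only the value of the first os_prefix= line (empty if none)
    prefix=$(sed -n 's/^os_prefix=//p' "$config" | head -n 1)
    echo "${prefix}cmdline.txt"
}
```

Run against the config.txt above, this prints a/cmdline.txt, relative to the root of the boot partition.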
Boot camp
As you can probably tell from the above, I don’t think either of the two designs presented is quite what we want for questing.
The “full” A/B design is far too complex to migrate to, and I don’t think there’s any way of doing it safely. This is especially the case given that we may have upgraders who have inherited a minimal 256MB boot partition from earlier releases of Ubuntu.
The “lite” version is a good basis, but the whole “a” and “b” thing strikes me as a little obscure. I also don’t particularly like the duplication of config.txt into tryboot.txt. There are an absolute ton of scripts out there that assume they can blindly append configuration to config.txt (which is why it’s usually important to make sure your configuration ends with an [all] section). While this isn’t “good practice”, the swapping of tryboot.txt and config.txt would break people who rely on these scripts.
After experimenting with a few designs over the last few weeks, here’s the design I’ve come up with, which should be landing in questing as I post this:
As in the “lite” A/B boot design, we still have our regular old FAT boot partition, and one rootfs partition. The bootloader assets (bootcode.bin, start*.elf, and so on) still live in the root of the boot partition (because they have to). All other assets move into three directories:
- current
- This directory always exists, and always contains the current boot assets of the booted system [14]. Thus, by definition, it will always contain “known good” boot assets.
- old
- If this directory exists, it contains the formerly “known good” boot assets. We keep this around when possible just in case users need it; consider a boot configuration change that isn’t fatal (the boot still succeeds), but which results in some undesirable side effect later at runtime. The new service should make it trivial to switch “old” to “current” when desired.
- new
- If this directory exists, it either contains new untested boot assets, or it contains boot assets that were tested, but failed.
Why keep failed boot assets around? It may be useful to see the configuration that failed for debugging purposes, but more importantly we can only assume that we have space for two sets of boot assets on the boot partition. Therefore, before “new” is created, “old” is always deleted first. This is safe because “current” is always “known good”.
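The “delete old before creating new” rule can be illustrated with a short sketch. To be clear, this is not the flash-kernel implementation (which may well differ); stage_new_assets and its two parameters are hypothetical names chosen for the example:

```shell
#!/bin/sh
# Sketch: stage a fresh set of boot assets into new/, deleting old/ first so
# that at most two sets ever exist on the boot partition.
# stage_new_assets is a hypothetical helper; the real logic lives in
# flash-kernel and may differ.
stage_new_assets() {
    boot=$1     # mount point of the boot partition
    src=$2      # directory holding the freshly built assets
    [ -d "$boot/current" ] || { echo "no current/ in $boot" >&2; return 1; }
    rm -rf "$boot/old"      # safe: current/ is still "known good"
    rm -rf "$boot/new"      # discard any earlier untested or failed set
    mkdir "$boot/new" || return 1
    # Copy the contents of $src (not the directory itself) into new/
    cp -R "$src"/. "$boot/new"/
}
```

Note that current/ is never touched at any point, so a failure part way through still leaves a bootable set in place.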
So, we have the following “states” in our new boot layout:
- stable
- “current” exists; “old” may exist (if it does, contains “known good” assets); “new” does not exist
- untested
- “current” exists; “old” does not exist; “new” exists and contains untested boot assets
- trying
- “current” exists; “old” does not exist; “new” exists and contains boot assets we’re about to try (this state is entered immediately before rebooting into the “tryboot” mode)
- failed
- “current” exists; “old” does not exist; “new” exists and contains boot assets marked as having failed
The state transition diagram is as follows:
[Figure: state transition diagram covering the stable, untested, trying, and failed states]
The “loop” on “untested” exists because (as mentioned earlier) it’s perfectly valid to have a long running system where flash-kernel has been called multiple times (a kernel update, an initramfs rebuild, etc.), overwriting the new boot assets each time. However, in this scenario the current boot assets are never touched by this.
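To make the four states a little more concrete, here is a sketch that derives the state from the directories present. How the real implementation marks “trying” and “failed” is not described here, so the new/state marker file below is purely hypothetical, as is the boot_state helper name:

```shell
#!/bin/sh
# Sketch: classify a boot partition into one of the four states above.
# ASSUMPTION: a marker file new/state holding "trying" or "failed" is a
# made-up convention for this example; the real marker may be different.
boot_state() {
    boot=$1
    [ -d "$boot/current" ] || { echo "invalid"; return 1; }
    if [ ! -d "$boot/new" ]; then
        echo "stable"               # old/ may or may not exist
    elif [ -f "$boot/new/state" ]; then
        state=$(cat "$boot/new/state")
        case $state in
            trying|failed) echo "$state" ;;
            *) echo "untested" ;;   # unknown marker: treat as untested
        esac
    else
        echo "untested"
    fi
}
```

A service run at boot could then branch on the output, for instance only considering a reboot into “tryboot” mode when the result is untested.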
The boot configuration is stored in config.txt and looks something like this:
[all]
os_prefix=current/
[tryboot]
os_prefix=new/
[all]
kernel=vmlinuz
initramfs initrd.img followkernel
# The rest of config.txt
# ...
Finally, autoboot.txt also exists, and just contains:
[all]
tryboot_a_b=1
This ensures the bootloader always reads config.txt, and we simply use a [tryboot] filter within that file to redirect our boot to the “new” directory. The advantages of this layout are:
- We get all the advantages of the “lite” A/B boot setup: a reliable fallback, and a configuration that we can migrate to safely.
- The boot configuration lives in a single file (config.txt) that doesn’t get arbitrarily over-written; existing scripts that append configuration to config.txt continue to work without configuration lines silently disappearing.
- There’s no confusion over which boot assets are current, which are old, and which are new: the directory names tell you everything.
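The effect of the [tryboot] filter in config.txt can be mimicked with a few lines of awk. This is a deliberately simplified model that only understands [all] and [tryboot] (settings under any other filter are ignored), so it is no substitute for the bootloader’s real parser; os_prefix_for is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: work out which os_prefix applies in a "normal" or "tryboot" boot.
# Simplified model: only the [all] and [tryboot] filters are understood.
# os_prefix_for is a hypothetical helper, invented for illustration.
os_prefix_for() {
    mode=$1     # "normal" or "tryboot"
    config=$2
    awk -v mode="$mode" '
        # A line like [name] switches the active filter
        /^\[.*\]$/ { section = substr($0, 2, length($0) - 2); next }
        /^os_prefix=/ {
            # Later applicable settings override earlier ones
            if (section == "" || section == "all" ||
                (section == "tryboot" && mode == "tryboot"))
                prefix = substr($0, 11)
        }
        END { print prefix }
    ' "$config"
}
```

With the config.txt shown above, os_prefix_for normal config.txt prints current/ while os_prefix_for tryboot config.txt prints new/.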
Booty! Yarrr!
This all sounds a bit too good to be true! What are the drawbacks?
The most obvious one is the fact that the “tryboot” mode can never be entered from a cold boot. It requires a reboot. This means some service needs to notice (during a normal boot) that new untested boot assets are present, interrupt the boot and restart in “tryboot” mode. This will mean that, each time flash-kernel is run for whatever reason (new kernel, initramfs rebuild, etc.) the next boot will be a double boot. This will probably be a bit jarring to people at first (“why did I see the rainbow screen twice?!”), and also means that boot will take roughly twice as long (obviously).
However, I don’t see a way around this and, to be frank, it’s a small price to pay for the reliability that this mechanism should bring.
Tough as old boots
I don’t mean all this to sound like a fait accompli [15]. Some people do weird things with their Pi. Some people may be relying on all their boot assets being in the root of their boot partition. Some may have weird hardware that breaks horribly if reboots occur too close together (or which doesn’t reset quickly enough for the second boot).
In short, there must be a fallback mechanism. To that end, I’m introducing these changes as a new “method” in flash-kernel, but keeping the old one in place for those that really need it.
All boards that use flash-kernel to write their boot assets (including the Raspberry Pi under Ubuntu) have entries in the flash-kernel database (/usr/share/flash-kernel/db/all.db) which look something like this:
Machine: Raspberry Pi 3 Model B+
Machine: Raspberry Pi 3 Model B Plus
Machine: Raspberry Pi 3 Model B Plus Rev 1.3
Machine: Raspberry Pi 3 Model B Plus Rev *
Kernel-Flavors: raspi
Method: pi-try
DTB-Id: bcm2710-rpi-3-b-plus.dtb
The important setting for our purposes is the Method: line. Prior to questing, this reads Method: pi on all the (relevant) Raspberry Pi entries. From questing onwards this will read Method: pi-try. When flash-kernel encounters the pi-try method, and finds the old boot configuration on the boot partition, it will attempt to migrate it to the new pi-try layout, and re-write the config.txt configuration accordingly.
However, the older pi method (no A/B booting, write everything to the root of the boot partition) is still present in flash-kernel, and if you absolutely need to, you can switch back to it.
Warning
Using the old “pi” mechanism leaves you in the situation described earlier: your fallback “.bak” files may not really be a fallback at all, and even if they are, it’s a major pain to revert to them. Still, if you’re absolutely determined…
- Firstly, override the flash-kernel database entry by copying the relevant entry to /etc/flash-kernel/db and adjusting the Method: line.
- Then, remove the old/ and new/ directories (if present) from your boot partition.
- Run flash-kernel to re-copy the boot assets to the root of your boot partition.
- Remove the os_prefix lines from your boot configuration.
- Finally remove the current/ directory and autoboot.txt.
Done in this specific order, this procedure should be reasonably safe and leave you with a bootable system at all steps.
By way of an example, in the case of the Pi 3 entry above, the following should be sufficient to revert to the old boot method (pay close attention to the highlighted lines):
$ sudo -i
Password:
# cat << EOF >> /etc/flash-kernel/db
Machine: Raspberry Pi 3 Model B+
Machine: Raspberry Pi 3 Model B Plus
Machine: Raspberry Pi 3 Model B Plus Rev 1.3
Machine: Raspberry Pi 3 Model B Plus Rev *
Kernel-Flavors: raspi
Method: pi
DTB-Id: bcm2710-rpi-3-b-plus.dtb
EOF
# rm -rf /boot/firmware/old
# rm -rf /boot/firmware/new
# flash-kernel
# sync
# sed -i -e '/^os_prefix/d' /boot/firmware/config.txt
# rm -rf /boot/firmware/current
# rm /boot/firmware/autoboot.txt
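If you’re not sure which database entry to copy for your board, something like the following sketch can pull it out of all.db. It relies on entries being separated by blank lines and only does exact matches on Machine: lines (wildcard entries such as “Rev *” are not expanded); db_entry is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the flash-kernel database entry containing an exact
# "Machine: <model>" line. Entries are assumed to be blank-line separated.
# db_entry is a hypothetical helper, invented for illustration.
db_entry() {
    model=$1
    db=$2
    awk -v model="$model" '
        # Paragraph mode: each blank-line separated entry is one record,
        # and each line within it is one field
        BEGIN { RS = ""; FS = "\n" }
        {
            for (i = 1; i <= NF; i++)
                if ($i == "Machine: " model) { print; exit }
        }
    ' "$db"
}
```

For example, db_entry "Raspberry Pi 3 Model B Plus" /usr/share/flash-kernel/db/all.db should print the entry shown earlier. My understanding is that the model string for the running board can be read from /proc/device-tree/model (stripping the trailing NUL), but treat that as an assumption and check the output by eye before copying anything.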
Anyway, that’s enough “boot” puns from me, for now. If you do anything even slightly weird with your boot configuration, I would strongly encourage you to try out the questing dailies on a spare SD card.
To test the new implementation from a questing daily, add ppa:waveform/flash-kernel and install the flash-kernel-piboot package:
$ sudo add-apt-repository ppa:waveform/flash-kernel
$ sudo apt install flash-kernel-piboot
This will also upgrade flash-kernel and migrate your boot partition to the new layout. In theory the migration is entirely safe, i.e. it can fail at any point and you should be left with a bootable system (I’ve tested this a couple of times by yanking the power-cord in the middle of it!). However, if you can find a way to break it (from a configuration we’d actually support), then please let me know!
If you find any issues, please file a bug against the flash-kernel package in Launchpad, and tag it raspi-image to bring it to my attention.
If you do find a genuine need to fall back to the old boot configuration, again please let me know! I’d be really curious to find out what these circumstances are; I’m sure they exist, but they’ll be things I haven’t thought of (yet), and if there’s a way I can tweak the new design to accommodate them, I’d prefer to do that rather than have people forcing themselves back into the old (fundamentally unsafe) configuration.
Boots of Spanish Leather
The alternative to falling back to the old mechanism is to keep the A/B facility, but manage the tryboot mode somewhat manually…
The piboot-try command is the new script that manages all the stuff mentioned above (which will be provided by the new flash-kernel-piboot package, hopefully in Ubuntu questing by next week). When called with the --test flag this will exit with status code 0 (“true” in shell parlance) in the event there are new, untested boot assets. Otherwise, it will exit with status code 1 (“false” or “error” in shells).
When called with --reboot, if new untested boot assets are present, it will immediately re-write their status to “trying” and reboot into the “tryboot” mode.
Hence, if you know flash-kernel has run, and new boot assets are present, and you’re happy to reboot immediately you can simply run sudo piboot-try --reboot to reboot and immediately try the new boot assets (no double boot necessary). If you’re thinking of scripting this and want to query the state of the boot assets, piboot-try --test (no sudo necessary) will tell you whether sudo piboot-try --reboot would reboot (if the exit code is 0).
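Scripting this might look something like the sketch below, which uses only the --test and --reboot flags described above. The try_new_assets function and the PIBOOT_TRY override variable are hypothetical additions for the example (the latter just makes it possible to dry-run the logic with a stand-in command), and the function is assumed to run as root since --reboot needs it:

```shell
#!/bin/sh
# Sketch: reboot into "tryboot" mode only if untested boot assets exist.
# try_new_assets and the PIBOOT_TRY override are hypothetical; only the
# --test and --reboot flags come from piboot-try itself. Run as root.
try_new_assets() {
    tool=${PIBOOT_TRY:-piboot-try}
    # Exit status 0 from --test means new, untested assets are present
    if "$tool" --test; then
        echo "untested boot assets present; rebooting into tryboot mode"
        "$tool" --reboot
    else
        echo "no untested boot assets; nothing to do"
    fi
}
```

Calling try_new_assets after a batch of package upgrades would then either reboot straight into the new assets or do nothing at all.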
If you want to learn more about the command, I’ll be including a full man-page for it, man piboot-try.
[1] | On the Pi 5, none of the bootloader resides on the FAT partition anymore. In certain configurations, you can even get away with no configuration file, so just shoving a Linux kernel on the FAT partition is enough! |
[2] | Intended to maximize the RAM available at runtime (the default bootloader typically reserved 64MB of RAM for facilities it provided at runtime; the cut down variant only needed 16MB). |
[3] | One particularly odd aspect of the Pi’s boot process is that it runs entirely on the GPU (well … VPU?), rather than the ARM CPU. The legacy camera firmware basically ran its own RTOS over on the GPU, hence incorporating it into the bootloader made perfect sense. |
[4] | Actually, I’m not entirely clear if stage 3 loads the base device tree. That might be stage 2, but stage 3 handles some of the customization of the device-tree, loading overlays, and so forth. |
[5] | I’m not entirely clear on what fixup.dat really does either. I know each fixup file corresponds to each start.elf file and that it has something to do with the CPU/GPU memory split. The start.elf binary will operate without fixup.dat being present, but the wrong amount of RAM is reported by the mailbox interface in this case. I’ve also heard fixup has to do with “relocation”, and I do know that the CPU/GPU memory split has the CPU (ARM portion) at the “low” end and the GPU firmware at the “high” end so my vaguely educated guess is that start.elf gets loaded at a too-low, but definitely safe location, then fixup.dat is used to relocate it to the highest possible point it can safely sit in RAM to maximize the amount available to the ARM cores. But that’s just my guess. |
[6] | The stages all understand FAT-12, FAT-16, FAT-32, and the VFAT long filename extensions. |
[7] | Maybe… This is just off the top of my head; I haven’t tested it and give no warranty that this will (or won’t!) unbrick your boot! |
[8] | Maybe, given some interactions I’ve had online? |
[9] | The state is stored in the PM_RSTS register on the PMIC. PMIC registers (generally) survive reset (obviously not power off though), but this particular one is also reset-on-read so it is guaranteed subsequent boots will fall back if the tryboot one fails for any reason. |
[10] | The autoboot.txt file has a size limit of 512 bytes (presumably because stage 1 has to be incredibly basic, and can’t read FAT chains, so the file is limited to one sector). In other words don’t include extraneous comments or line breaks! |
[11] | Bear in mind that FAT is not a journalling file-system. Updates to any boot configuration should generally be done atomically by writing the new content to a temporary file on the target FAT partition, then renaming the temporary file over the original. The rename operation is atomic so anything reading the file-system should either see the original content or the new content, but no partially written file. |
[12] | Under the assumption that you shouldn’t ever be fiddling with your “known good” boot configuration. All experiments should be performed in the “alternate” boot assets which are then tested with tryboot before being made the new “known good” set. |
[13] | With GNU coreutils 9.5 and above, simply: mv --exchange config.txt tryboot.txt! |
[14] | This is a slight lie. There are some brief transitional states where current exists, but contains former assets. These should always be rapidly corrected, as we’ll see. |
[15] | Though it kinda is… I do control the boot process on your Pi after all! Mwuhahahaha! |