It's Pi all the way down...

Dear Reader,

Please be forewarned that this is a rambling mess of a post that will dart off on tangents a-plenty. This was originally planned to be a post before noble’s .2 release, but too many work things got in the way and it lay in the drafts folder until I noticed it again while writing something else!

Anyway, this may at least be an entertaining distraction into the effort that goes into dealing with hardware changes in Ubuntu.

Buckle up!

Background

First of all… what’s a stepping? Put simply, it’s a specific version of a particular chip. A new stepping might be released to fix a bug, or improve yields, or make other improvements to a design, but it isn’t a fundamentally new design. Think of it like a bump in a minor version number.

There have been new steppings of Raspberry Pi SoCs in the past, but you may not have noticed. For example, the Pi 4 uses the Broadcom 2711 as its SoC. At launch, this was the B0 stepping. However, when the Compute Module 4 launched later, it moved to the C0 stepping, which corrected an issue in the boot loader’s verification mechanism. I think later versions of the Pi 4 moved to this stepping as well, but looking at my range of Pi 4 boards here, they’re all B0s.

What about the Pi 5? The 4GB and 8GB models that launched first shipped with the C1 stepping of the Broadcom 2712 SoC. However, Broadcom’s initial design of the 2712 included several features which were never used in the Pi 5, increasing the cost and the power draw of the chip without any benefit. So much so that the planned 2GB variant of the Pi 5 was initially uneconomic to produce.

The new 2712 D0 stepping stripped out this “dark silicon”, reportedly as much as 30% of the die, making the 2GB model viable to produce. It was also used in the subsequent launches of the Compute Module 5, and later the 16GB Pi 5 variant launched in January this year. The assumption is that at some point the D0 will become the standard silicon on the 4GB and 8GB models too.

So far, so good. But also, so what? Why does the OS care about the stepping?

The D0 stepping introduced another change: one of the data structures associated with the shaders grew some new fields in the middle of the structure, shuffling the locations of other fields. Stripping out some details of the structures to make the diff a bit easier to read:

-<struct name="GL Shader State Record" min_ver="71">
+<struct name="GL Shader State Record Draw Index" min_ver="71">
     <field name="Point size in shaded vertex data" size="1" type="bool"/>
     <field name="Enable clipping" size="1" type="bool"/>

     <field name="Vertex ID read by coordinate shader" size="1" type="bool"/>
     <field name="Instance ID read by coordinate shader" size="1" type="bool"/>
     <field name="Base Instance ID read by coordinate shader" size="1" type="bool"/>
+    <field name="cs_basevertex" size="1" type="bool"/>
+    <field name="cs_drawindex" size="1" type="bool"/>
+
     <field name="Vertex ID read by vertex shader" size="1" type="bool"/>
     <field name="Instance ID read by vertex shader" size="1" type="bool"/>
     <field name="Base Instance ID read by vertex shader" size="1" type="bool"/>
+    <field name="vs_basevertex" size="1" type="bool"/>
+    <field name="vs_drawindex" size="1" type="bool"/>

     <field name="Fragment shader does Z writes" size="1" type="bool"/>
     <field name="Turn off early-z test" size="1" type="bool"/>

This is a breaking change, requiring changes in mesa (the library that provides various graphical APIs including OpenGL and Vulkan).
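To see why inserting fields in the middle of a structure breaks things, here is a minimal sketch that packs the one-bit fields from the trimmed diff above in order and compares where each one lands. The offsets are relative to the first field shown, not real hardware bit positions (the diff has had other details stripped out), and `bit_offsets` is a hypothetical helper, not anything from mesa.

```python
# Illustrative only: offsets are relative to the first field in the trimmed
# diff above, not the real bit positions in the hardware record.

def bit_offsets(fields):
    """Pack (name, size_in_bits) fields in order; return name -> start bit."""
    offsets, position = {}, 0
    for name, size in fields:
        offsets[name] = position
        position += size
    return offsets


HEAD = [
    ("Point size in shaded vertex data", 1),
    ("Enable clipping", 1),
    ("Vertex ID read by coordinate shader", 1),
    ("Instance ID read by coordinate shader", 1),
    ("Base Instance ID read by coordinate shader", 1),
]
VERTEX_SHADER = [
    ("Vertex ID read by vertex shader", 1),
    ("Instance ID read by vertex shader", 1),
    ("Base Instance ID read by vertex shader", 1),
]
TAIL = [
    ("Fragment shader does Z writes", 1),
    ("Turn off early-z test", 1),
]

# Layout before the change, and after the four new fields were inserted.
old_layout = bit_offsets(HEAD + VERTEX_SHADER + TAIL)
new_layout = bit_offsets(
    HEAD
    + [("cs_basevertex", 1), ("cs_drawindex", 1)]
    + VERTEX_SHADER
    + [("vs_basevertex", 1), ("vs_drawindex", 1)]
    + TAIL
)

# Every pre-existing field whose position changed: name -> (old, new).
moved = {
    name: (old_layout[name], new_layout[name])
    for name in old_layout
    if old_layout[name] != new_layout[name]
}

for name, (old, new) in moved.items():
    print(f"{name}: bit {old} -> bit {new}")
```

Everything after the first insertion point moves, so code that writes a flag at its old position ends up setting a different field entirely on the new layout.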

So, how can you tell if you’ve got a C1 or a D0 stepping of a 2712? You could take the cooler off your Pi 5 and look at the lid of the SoC. The stepping is just near the end of the long code.

Alternatively, you could try the Ubuntu 24.04.1 (noble’s .1) Desktop image. If you’ve got a C1, it works! If you’ve got a D0 … erm …

RaspiOS

How do we fix this? First let’s look at RaspiOS. Obviously mesa needs updating, with the requisite changes. RaspiOS added these in version 23.2.1-1~bpo12+rpt3 with the following slightly unceremonious changelog:

mesa (23.2.1-1~bpo12+rpt3) bookworm; urgency=medium

  * Update patches

To be fair, this may have been a matter of trying to keep the existence of the new stepping relatively quiet (though anyone constructing a debdiff would see the 2712D0 mentioned quite prominently). The release date on that package is January 2024, and the Pi 5 2GB model featuring the D0 didn’t hit the market until August 2024.

Still, this is basically all that happens in RaspiOS [1]. New mesa, job done.

Ubuntu

Over in Ubuntu land, things are a bit more complex. As it turned out, the fix was already in the development version (sync’d from upstream), and even oracular (24.10, which is about to go EOL as I write this) had a sufficiently up-to-date mesa to incorporate the fix.

However, the interim versions are not what most Ubuntu users are interested in. The vast majority of Ubuntu users (something like 99%) stick with the LTS releases, the latest of which at the time of writing is noble (24.04). In order to fix things in an existing stable release, we needed an SRU.

SRU

A quick tangent on Stable Release Updates, or SRUs as we refer to them (because there’s nothing developers love more than another TLA). Actually, first another tangent on stability of Linux distros.

Linux distributions can be roughly divided into two categories: stable distros and rolling distros. Stable distros grew out of the frustrations of server operators who wanted a guarantee of stability above all else: once the system was installed and working the only updates they wished to see were those that fixed bugs. They had precisely zero interest in new features and even minor bug fixes were probably more risk than they were worth. The only time risk was introduced into the equation was when the OS as a whole was updated (a “dist-upgrade” in Debian terms), and because this was a known risk, it would be planned for, tested, and implemented in a controlled manner.

Classic “stable” distros include Red Hat, Debian, and of course Ubuntu.

The rolling distros, by contrast, have no concept of an OS “release”. They simply package the latest versions of all things at all times. Much of the time, this does work reasonably well, but it does mean that at certain times there are fairly major changes that happen in a more or less “ad hoc” fashion. Historically, consider the move to systemd, or transitions between Qt versions.

Classic “rolling” distros include Gentoo and Arch.

Some distros incorporate both models. Alpine and Debian are examples of these. While Debian has a “stable” release (currently “bookworm”, soon to be “trixie”), it also has the “unstable” branch called “sid” which is basically a rolling distro in itself. Alpine likewise makes six monthly releases, but has a rolling “edge” branch.

While the “stable” distro model is widely accepted (and very reasonable, in this writer’s opinion), it does cause some tension, particularly with the rise of desktop Linux usage. Desktop users often do want the fancy new features in their applications, don’t have the same level of risk-aversion as the typical hard-boiled sysadmin [2], and thus aren’t willing to wait 2 or 3 years for the next release of their stable distro to get them. This is part of the driver behind the rise of “alternate” packaging systems on various distros (Flatpaks, Snaps, AppImages, etc).

On Ubuntu, updates to stable releases (outside of the realm of snaps) have to follow the Stable Release Update process which incorporates numerous checks and balances to try and avoid regressions. Test plans must be written, reviewed, and agreed before anything is even uploaded. Updates must be minimally invasive (no full version backports — those have to follow a separate process which doesn’t automatically update existing installs). After upload, verification is carried out, and only then is the fix allowed into the release.

But not right away…

Phasing

Even after an update to a stable release has made it through review, verification, and upload, such updates go through a “phasing” procedure. This is where the update is slowly trickled out to larger and larger proportions of the userbase. The idea is that, if an update does have a catastrophic bug in it, the damage can be limited to the portion of the userbase that (randomly or by choice) gets the update early. In the event of a nasty bug appearing in a stable update, phasing can be halted and the issue investigated before either rolling back the update or discovering a false-positive and resuming the phasing.
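As a rough illustration of how a trickle like this can be made repeatable, here is a minimal sketch in which each machine hashes a stable identifier together with the package and version, and takes the update only if the result falls below the current phased percentage. This is an assumption-laden toy, not APT’s actual algorithm: `in_phase`, the use of SHA-256, and the machine identifiers are all hypothetical.

```python
import hashlib


def in_phase(machine_id: str, package: str, version: str, phased_percent: int) -> bool:
    """Return True if this machine should take the update at this percentage.

    Hypothetical scheme: hash the machine, package and version into a stable
    bucket from 0 to 99, then compare the bucket with the phased percentage.
    """
    seed = f"{machine_id}:{package}:{version}".encode()
    bucket = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % 100
    return bucket < phased_percent


# Made-up machine identifiers, purely for illustration.
machines = [f"machine-{i}" for i in range(1000)]
for percent in (10, 50, 100):
    taking = sum(in_phase(m, "mesa", "24.0.9-0ubuntu0.2", percent) for m in machines)
    print(f"{percent}% phase: {taking} of {len(machines)} machines take the update")
```

Because the bucket depends only on the machine, package, and version, a machine that gets the update at 10% still gets it at 50%, and raising the percentage only ever adds machines.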

Incidentally, the stable release updates report and the phasing report are both public if you want to peek at what’s in the pipeline of your stable install! You can read more about phasing in the server documentation.

Noble

Okay, back to the story. Mesa was up to date in oracular, but we needed an SRU to noble to avoid new Pi 5 owners getting a full-blown crash the second anything graphical like a login screen dared show its face on the monitor.

Time for a noble SRU. The SRU was written (LP: #2082072), the update sponsored, verification performed, phasing completed, and by the middle of October everything was fixed.

mesa (24.0.9-0ubuntu0.2) noble; urgency=medium

  * Add support for Pi 2712D0 stepping (LP: #2082072)

I said, everything was fixed!

Right?

Oh, ****!

Snaps

One thing I’d forgotten about in this whole process was that there’s no longer just one version of mesa on most Ubuntu desktop systems. Nowadays, a fair number of applications, including some vaguely important ones like Firefox, are shipped as snaps, at least partly in order to alleviate the aforementioned tension between stable distros and desktop users demanding the latest and greatest versions.

There are many ways to handle distributing the “latest” versions of things on a stable distro, but one of the things that must be dealt with is the conflict between the versions of libraries on the stable distro, and the libraries that the “latest” thing expects.

One method is to simply “bundle everything”; include all the libraries that your app expects within its package. Obviously this results in a rather bloated package which isn’t going to share any libraries with anything else on the system, and thus takes more time to load and more memory to run. It also means dealing with all the path nonsense to ensure your application only loads the versions of libraries from its own location. This doesn’t solve all the bundling issues, but it does solve most and it is at least simple.

Another method is “statically compile everything”. This is a close cousin to bundling everything, and still results in a fairly huge download but a little slimmer than bundling all the libraries. This is becoming more common with the rise of things like Go and Rust, which compile this way by default.

Snaps take a rather interesting (and different) approach here.

All snaps are built with a “base” snap which represents a stable release of Ubuntu (core22 represents a jammy base, core24 a noble base, and so on). The “base” snap provides the “common” libraries that the snap would expect to find on such a stable release of Ubuntu.

The snap can still statically compile, or bundle, whatever it likes, but this means at least some of its dependencies can be shared (from the base snap). Moreover, the method by which snaps are mounted means that multiple versions of these core snaps can co-exist so it’s fine to be running a snap based on core22, and another based on core24 (obviously they won’t be sharing their core libraries in this case, but that’s the trade-off).

The libraries the core snaps include are fairly limited and, notably, don’t include mesa. However, there are also “content” snaps that provide commonly used sets of libraries above and beyond the core snaps. One of these is the “gnome” content snap which provides (surprise surprise), Gnome, GTK, and, you guessed it, mesa. The gnome content snap itself comes in several versions to accommodate different “base” snaps:

  • gnome-42-2204 is the gnome content snap using core22 (jammy) as a base
  • gnome-46-2404 is the gnome content snap using core24 (noble) as a base
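The relationship between content snaps, base snaps, and Ubuntu releases can be written down as a pair of lookups. This is a small sketch using only the names mentioned above; `ubuntu_release_for` is a hypothetical helper, and the tables are just this post’s examples rather than anything queried from snapd.

```python
# Base snap -> the Ubuntu release it represents (from the text above).
BASE_TO_RELEASE = {
    "core22": "jammy",
    "core24": "noble",
}

# gnome content snap -> the base snap it is built on (from the list above).
CONTENT_SNAP_BASE = {
    "gnome-42-2204": "core22",
    "gnome-46-2404": "core24",
}


def ubuntu_release_for(content_snap: str) -> str:
    """Return the Ubuntu release that sits underneath a gnome content snap."""
    base = CONTENT_SNAP_BASE[content_snap]
    return BASE_TO_RELEASE[base]


for snap in CONTENT_SNAP_BASE:
    print(f"{snap} -> {CONTENT_SNAP_BASE[snap]} -> {ubuntu_release_for(snap)}")
```

The point of the indirection is that an application never names an Ubuntu release directly; which release’s libraries it ends up with follows from the content snap it uses.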

Now… guess which gnome content snap Firefox is (still, at the time of writing) using?

Jammy

Shortly after the middle of October, and the noble SRU landing, I noticed we had another problem. The noble daily would happily boot to the login prompt on the new D0 Pi 5, and I could log in and run many things. But not Firefox. Starting that, or anything else on the core22 snap, froze the entire desktop. Not good.

So, back to the SRU queue. The original ticket (LP: #2082072) was re-targeted for jammy, the upload was made, and a bunch of other packages regressed. Damn! Various retries later and the package was ready for verification.

But… now we had a problem. This was only an update to the mesa deb package, in Ubuntu jammy (22.04). This wasn’t an update to the gnome content snap; that would have to be done after verification of the SRU. But the Pi 5 was fundamentally not supported in jammy.

Rather than trying to get a jammy image into a state it could boot (and run Gnome) on a Pi 5, I opted to rebuild the gnome content snap locally to verify things instead. This took a few days to work out as it’s an almost entirely undocumented process and I first had to figure out that the gnome-sdk snap needed rebuilding first (i.e. the chain goes: mesa deb -> gnome-sdk snap -> gnome snap).
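That chain is a small dependency graph, and the rebuild order falls out of a topological sort. Here is a minimal sketch using Python’s standard `graphlib` (3.9 or later); the node labels are informal names taken from the chain above, and `rebuild_order` and `downstream_of` are hypothetical helpers, not part of any snap tooling.

```python
from graphlib import TopologicalSorter

# Each key is built from the items in its set (labels are informal).
BUILT_FROM = {
    "gnome-sdk snap": {"mesa deb"},
    "gnome content snap": {"gnome-sdk snap"},
}


def rebuild_order(built_from):
    """Return every item in an order where inputs come before their outputs."""
    return list(TopologicalSorter(built_from).static_order())


def downstream_of(changed, built_from):
    """Return the items needing a rebuild after `changed`, in build order."""
    affected = {changed}
    order = rebuild_order(built_from)
    for item in order:
        # An item is affected if any of its inputs is affected.
        if built_from.get(item, set()) & affected:
            affected.add(item)
    return [item for item in order if item in affected and item != changed]


print(rebuild_order(BUILT_FROM))
print(downstream_of("mesa deb", BUILT_FROM))
```

With a chain this short the answer is obvious once you know the links exist; the hard part in practice was discovering that the middle link was there at all.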

To cut a long story short (or at least… less long), eventually the verification was done, and the jammy mesa update was released.

mesa (23.2.1-1ubuntu3.1~22.04.3) jammy; urgency=medium

  [ Timo Aaltonen ]
  * Add support for Pi 2712D0 stepping (LP: #2082072)

  [ Alessandro Astone ]
  * patches: Backport patch for green artifacting and GPU crash on
    radeonsi with kernel >= 6.10 (LP: #2083538)

Phasing was completed, then the gnome-sdk snap and the gnome content snap were rebuilt, and finally Firefox was working on the D0 stepping.

By the time all this was done, it was the 31st of Jan. To give some context, the .2 release of noble was scheduled for the 13th of Feb, and I was adamant that we had to get the mesa fix in for that otherwise we’d have new Pi 5 owners unable to run Firefox “out of the box” on our current LTS. You could argue: “users can still log in, they can always get the release from updates?”. However, the app-store is also a core22 snap, and hence also crashed the desktop. While there was still the sudo snap refresh route, I’m loath to tell desktop users they have to use the command line.

Suffice to say, it was butt-clenching time in the run up to noble’s .2 release.

A heavy chain (of dependencies)

What’s the moral of the story?

I’m not entirely sure. Half the problem is that everything above has good reason for being the way it is. The SRU process is long and complicated because it’s born of the pain of unexpected regressions. It is the way it is, because the alternative is actually more painful.

The snap eco-system is the way it is to avoid the more painful issues with bundling absolutely everything.

Still, what we have here is a dependency chain with many links, and I can’t help but think some of those links may be unnecessary (or at least warrant some consideration).

  • Should a content snap rely on another snap (the gnome-sdk snap in the middle of the chain, here)?
  • Should the gnome content snap rely on the jammy debs at all, or should it be able to pull in patches independently of them?
  • Should we mandate that seeded snaps always have (useful) contact information and well-documented procedures for building?

That said, there are trade-offs in every direction here. At the moment, the gnome content snap can reasonably rely on the stability (and quality) guarantees of the underlying deb. Do we really want to move away from that? Should we be pushing seeded applications to use later base snaps, so that SRUs don’t have to go back as far in time?

I don’t know the answers to all this yet. But I do know taking 3+ months to get a fix to the place it needs to be is no fun!


[1] This is categorically not true. New device-trees were also added and probably a whole host of other changes I haven’t noticed, but this post is going to be waffley enough without me going off track with every detail!
[2] I originally wrote “paranoia of the sys-admin”, but frankly it’s not paranoia. This level of risk-aversion is usually the result of hard-learned lessons!