etc break number 2
# Intro

This presents an analysis of what went wrong for Nate's system on 2026-03-21, resulting in hard-to-recover breakage of /etc. The analysis was performed on journal records from the affected system. I'll try to paint as verbose a picture as possible. (All times are CET because I am lazy.)

# Boot -11

The first thing of note here is that a vacuum run is done during the first update attempt.

```
Mar 21 18:27:46 engine systemd-sysupdate[37546]: Selected update '202603211640' for install.
Mar 21 18:27:46 engine systemd-sysupdate[37546]: Making room for 1 updates…
Mar 21 18:27:46 engine systemd-sysupdate[37546]: ~ Removing old '/system/kde-linux_202603190254.erofs.caibx' (regular-file).
Mar 21 18:27:46 engine systemd-sysupdate[37546]: ~ Removing old '/system/kde-linux_202603190254.erofs' (regular-file).
Mar 21 18:27:46 engine systemd-sysupdate[37546]: ~ Removing old '/boot/EFI/Linux/kde-linux_202603180254.efi' (regular-file).
```

The strange thing about this is that the erofs it removed and the efi it removed are not part of the same version! The update attempt then hit a strange page allocation failure

```
Mar 21 18:27:53 engine kernel: kde-linux-sysup: page allocation failure: order:0, mode:0xc0d60(GFP_NOFS|__GFP_HIGH|__GFP_ZERO|__GFP_COMP|__GFP_NOMEMALLOC), nodemask=(null),cpuset=kde-linux-sysupdated.service,mems_al>
...
```

and eventually failed because of a timeout

```
Mar 21 18:29:18 engine systemd-pull[37569]: Transfer failed: Timeout was reached
Mar 21 18:29:18 engine systemd-pull[37569]: Failed to retrieve image file.
Mar 21 18:29:18 engine systemd-pull[37569]: Exiting.
Mar 21 18:29:18 engine systemd-sysupdate[37546]: (sd-pull-raw) failed with exit status 1: Input/output error
```

When it then tried to get some data from the backend, it failed an assertion about `inst`:

```
Mar 21 18:29:18 engine systemd-sysupdated[37532]: Started job 2 with worker PID 38007
Mar 21 18:29:18 engine audit[38007]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 pid=38007 comm="systemd-sysupda" exe="/usr/lib/systemd/systemd-sysupdate" sig=6 res=1
Mar 21 18:29:18 engine kernel: Memory cgroup min protection 0kB -- low protection 0kB
Mar 21 18:29:18 engine systemd-sysupdate[38007]: Discovering installed instances…
Mar 21 18:29:18 engine kernel: audit: type=1701 audit(1774114158.822:734): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=38007 comm="systemd-sysupda" exe="/usr/lib/systemd/systemd-sysupdate" sig=6 res=1
Mar 21 18:29:18 engine systemd-sysupdate[38007]: Determining installed update sets…
Mar 21 18:29:18 engine systemd-sysupdate[38007]: Selected update '202603211640' for install.
Mar 21 18:29:18 engine systemd-sysupdate[38007]: Assertion 'inst' failed at src/sysupdate/sysupdate.c:1174, function context_process_partial_and_pending(). Aborting.
```

https://github.com/systemd/systemd/blob/494c65236b19e160ade48315edfa0f089f3d4154/src/sysupdate/sysupdate.c#L1174

This seems to assert that each transfer bundle (i.e. erofs, uki, caibx) exists either locally or remotely? A failed assertion suggests that one is missing. The current theory is that it is in fact a local file that is unexpectedly missing (and it wouldn't exist remotely anymore because it was rotated out of the SHA256SUM). This further suggests that the vacuum run removed three unrelated assets.

On a retry of the download this magically seems to not be a problem anymore. But what appears to happen is that the updater thinks 202603211640's erofs is now already pulled. Remember: we timed out earlier!
```
Mar 21 18:30:35 engine systemd-sysupdate[38126]: Determining installed update sets…
Mar 21 18:30:35 engine systemd-sysupdate[38126]: Determining available update sets…
Mar 21 18:30:35 engine systemd-sysupdate[38126]: Selected update '202603211640' is already acquired and pending installation.
Mar 21 18:30:35 engine systemd-sysupdate[38126]: Selected update '202603211640' for install.
Mar 21 18:30:35 engine systemd-sysupdated[38112]: No output from child job, ignoring
Mar 21 18:30:35 engine systemd-sysupdated[38112]: Error during execution of job callback for job 1: Job exited successfully with no work to do, assume already acquired
```

Later this happens

```
Mar 21 18:30:53 engine systemd-sysupdate[38218]: Determining installed update sets…
Mar 21 18:30:53 engine systemd-sysupdate[38218]: Determining available update sets…
Mar 21 18:30:53 engine systemd-sysupdate[38230]: Discovering installed instances…
Mar 21 18:30:53 engine systemd-sysupdate[38230]: Determining installed update sets…
Mar 21 18:30:53 engine systemd-sysupdated[38112]: Invalid JSON response from 'systemd-sysupdate list': Missing key 'current'
Mar 21 18:31:04 engine systemd-sysupdate[38250]: Discovering installed instances…
Mar 21 18:31:04 engine systemd-sysupdate[38250]: Determining installed update sets…
Mar 21 18:31:04 engine systemd-sysupdated[38112]: Invalid JSON response from 'systemd-sysupdate list': Missing key 'current'
```

which further suggests that the current image had been vacuumed away.

I am guessing the auto-shutdown-after-update logic then ran, because the boot shuts down...

```
Mar 21 18:31:10 engine systemd[1660]: Created slice Slice /app/dbus-:1.1-org.kde.LogoutPrompt.
Mar 21 18:31:10 engine systemd[1660]: Started dbus-:1.1-org.kde.LogoutPrompt@0.service.
Mar 21 18:31:12 engine systemd[1660]: Created slice Slice /app/dbus-:1.1-org.kde.Shutdown.
Mar 21 18:31:12 engine systemd[1660]: Started dbus-:1.1-org.kde.Shutdown@0.service.
```

This is followed by peculiar erofs problems during shutdown:

```
Mar 21 18:31:13 engine kernel: erofs (device erofs): read error -5 @ 1127 of nid 16539878
...
Mar 21 18:31:13 engine kernel: erofs (device erofs): read error -5 @ 12 of nid 16558677
```

# Boot -10

Boot -10 has no changes caused by etc-factory. It still has the strange 'current' problem in sysupdate tech:

```
Mar 21 18:31:50 engine systemd-sysupdate[3070]: Discovering installed instances…
Mar 21 18:31:50 engine systemd-sysupdate[3070]: Determining installed update sets…
Mar 21 18:31:50 engine systemd-sysupdated[3058]: Invalid JSON response from 'systemd-sysupdate list': Missing key 'current'
```

This is peculiar because I'd expect a newly booted system to definitely have a current version.

During preparation for another update we get "No output from child job, ignoring" again.

```
Mar 21 18:33:49 engine systemd-sysupdate[4189]: Determining installed update sets…
Mar 21 18:33:49 engine systemd-sysupdate[4189]: Determining available update sets…
Mar 21 18:33:49 engine systemd-sysupdate[4189]: Selected update '202603211640' is already acquired and pending installation.
Mar 21 18:33:49 engine systemd-sysupdate[4189]: Selected update '202603211640' for install.
Mar 21 18:33:49 engine systemd-sysupdated[4172]: No output from child job, ignoring
Mar 21 18:33:49 engine systemd-sysupdated[4172]: Error during execution of job callback for job 1: Job exited successfully with no work to do, assume already acquired
Mar 21 18:34:19 engine systemd[1]: systemd-sysupdated.service: Deactivated successfully.
```

The update itself starts by downloading the efi rather than the erofs, probably because it thinks the erofs is already properly downloaded.
```
Mar 21 18:35:12 engine systemd-sysupdate[4378]: Determining installed update sets…
Mar 21 18:35:12 engine systemd-sysupdate[4378]: Determining available update sets…
Mar 21 18:35:12 engine systemd-sysupdate[4378]: Selected update '202603211640' is already installed, but incomplete. Repairing.
Mar 21 18:35:12 engine systemd-sysupdate[4378]: Selected update '202603211640' for install.
Mar 21 18:35:12 engine systemd-sysupdate[4378]: Making room for 1 updates…
Mar 21 18:35:12 engine systemd-sysupdate[4378]: \ Acquiring https://files.kde.org/kde-linux/sysupdate/v2/kde-linux_202603211640.efi → /boot/EFI/Linux/kde-linux_202603211640+1-0.efi...
```

The EFI downloads without problem and eventually the caibx download starts, but it gets skipped because the instance is already in the target without being marked pending:

```
Mar 21 18:38:01 engine systemd-sysupdate[4378]: Successfully acquired 'https://files.kde.org/kde-linux/sysupdate/v2/kde-linux_202603211640.efi'.
Mar 21 18:38:02 engine systemd-sysupdated[4366]: Started job 2 with worker PID 5791
Mar 21 18:38:02 engine systemd-sysupdate[5791]: Discovering installed instances…
Mar 21 18:38:02 engine systemd-sysupdate[5791]: Determining installed update sets…
Mar 21 18:38:02 engine systemd-sysupdate[5791]: Selected update '202603211640' for install.
Mar 21 18:38:02 engine systemd-sysupdate[5791]: Failed to acquire '/system/kde-linux_202603211640.erofs.caibx', instance is already in the target but is not pending.
Mar 21 18:38:02 engine systemd-sysupdated[4366]: Job 2 failed with bus error, ignoring callback: Invalid argument
```

This gets retried a bunch of times but always yields the same result, suggesting that the update aborts because of this unexpected situation.

The boot shuts down without problems.

# Boot -9

No etc-factory changes. There are no particular pieces of information here.
It appears this boot simply went to plasma-login-manager and then got rebooted (possibly triggering https://invent.kde.org/kde-linux/kde-linux/-/issues/471 and marking this boot as bad).

# Boot -8

This boot executed an etc-factory run:

```
Mar 21 18:45:57 kde-linux etc-factory[505]: Creating snapshot of /sysroot/etc/ at /sysroot/.etc.new/
Mar 21 18:45:57 kde-linux etc-factory[505]: Skipping sensitive file: /sysroot/usr/share/factory/etc/avahi/hosts
Mar 21 18:45:57 kde-linux etc-factory[505]: Skipping sensitive file: /sysroot/usr/share/factory/etc/fstab
Mar 21 18:45:57 kde-linux etc-factory[505]: Skipping sensitive file: /sysroot/usr/share/factory/etc/hosts
Mar 21 18:45:57 kde-linux etc-factory[505]: Skipping sensitive file: /sysroot/usr/share/factory/etc/mkinitcpio-systemd-tool/config/crypttab
Mar 21 18:45:57 kde-linux etc-factory[505]: Skipping sensitive file: /sysroot/usr/share/factory/etc/mkinitcpio-systemd-tool/config/fstab
Mar 21 18:45:57 kde-linux etc-factory[505]: Removing dangling link: /sysroot/.etc.new/systemd/system/autovt@.service
Mar 21 18:45:57 kde-linux etc-factory[505]: Last update was at 1774062165, new update is at 1774115157
```

This one did not touch /etc/shells though, so it is not the problematic one.

The boot continued with an update attempt

```
Mar 21 18:47:58 engine systemd-sysupdate[3404]: Successfully installed 'https://files.kde.org/kde-linux/sysupdate/v2/kde-linux_202603211640.efi' (url-file) as '/boot/EFI/Linux/kde-linux_202603211640+1-0.efi' (regula>
Mar 21 18:47:58 engine systemd-sysupdate[3404]: * Successfully installed update '202603211640'.
```

It only downloaded the efi, not the caibx or erofs.

Otherwise this boot is unremarkable.

# Boot -7

This boot too had a factory run.
```
Mar 21 18:49:11 kde-linux etc-factory[506]: Creating snapshot of /sysroot/etc/ at /sysroot/.etc.new/
Mar 21 18:49:11 kde-linux etc-factory[506]: Skipping sensitive file: /sysroot/usr/share/factory/etc/avahi/hosts
Mar 21 18:49:11 kde-linux etc-factory[506]: Skipping sensitive file: /sysroot/usr/share/factory/etc/fstab
Mar 21 18:49:11 kde-linux etc-factory[506]: Skipping sensitive file: /sysroot/usr/share/factory/etc/hosts
Mar 21 18:49:11 kde-linux etc-factory[506]: Skipping sensitive file: /sysroot/usr/share/factory/etc/mkinitcpio-systemd-tool/config/crypttab
Mar 21 18:49:11 kde-linux etc-factory[506]: Skipping sensitive file: /sysroot/usr/share/factory/etc/mkinitcpio-systemd-tool/config/fstab
Mar 21 18:49:11 kde-linux etc-factory[506]: Last update was at 1774115312, new update is at 0
Mar 21 18:49:11 kde-linux systemd[1]: etc-factory.service: Deactivated successfully.
```

**NOTE** that the last update time is different: it is 1774115312, while per boot -8 it should have been 1774115157. The reported timestamp equates to `Saturday, March 21, 2026 at 6:48:32 PM`, a time for which we actually have no boot record:

```
-8 512735d5feb64a238ae88b697e02f1cb Sat 2026-03-21 18:45:55 CET Sat 2026-03-21 18:48:23 CET
-7 c3895a6280ae4af48e9d50e8c3f01482 Sat 2026-03-21 18:49:10 CET Sat 2026-03-21 18:49:14 CET
```

The obvious conclusion is that there was a "broken" boot between -8 and -7 that mangled /etc.

# Conclusion

Putting all things we know together: we probably had one or more UKIs without an associated erofs. We then booted into that UKI for whatever reason and performed a broken etc factorization against an empty /usr mount (because the erofs was missing). This then broke etc and all further boots because they share the etc.

https://invent.kde.org/kde-linux/etc-factory/-/issues/2 would probably prevent this. But we should also put safety measures in place to not run factorization when /usr is missing.
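A minimal sketch of such a safety measure, in Python purely for illustration. This is not the actual etc-factory code: the helper names `has_entries` and `usr_is_usable` are hypothetical, and the default path `/sysroot/usr` is taken from the journal excerpts. The idea is to refuse factorization unless the path is a real mount point and contains at least one entry.

```python
import os


def has_entries(path: str) -> bool:
    """Return True if path is a readable directory with at least one entry."""
    try:
        with os.scandir(path) as entries:
            return next(entries, None) is not None
    except OSError:
        # Missing or unreadable directories count as "no entries".
        return False


def usr_is_usable(usr_path: str = "/sysroot/usr") -> bool:
    """Return True only if usr_path is a mount point and is not empty.

    Without the erofs, the path is just an empty directory on the root
    file system, so both checks fail and factorization should be skipped.
    """
    return os.path.ismount(usr_path) and has_entries(usr_path)
```

A caller would run this before touching /etc and bail out with an error when `usr_is_usable()` returns `False`. Checking both conditions also covers the case where something is mounted but turns out to be empty.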
It's entirely possible that a similar chain of events also led to the earlier breakage observed by Nate.

# Reproduction

This issue seems easy to reproduce by removing the backing erofs of any version and trying to boot into that version. Doing so with an unlocked root account drops to an emergency shell. The journal shows that we indeed factorize from the empty usr and consequently clean up more or less the entire /etc.

# Mitigation

- [x] to ease debugging we should log parts or maybe even the entire os-release file from the UKI. This would have made it possible to know which versions we are dealing with in each boot [https://invent.kde.org/kde-linux/kde-linux/-/merge_requests/454]
- [x] prevent etc-factory from running when /sysroot/usr failed to mount, or is empty [https://invent.kde.org/kde-linux/etc-factory/-/merge_requests/3]
- [ ] implement https://invent.kde.org/kde-linux/kde-linux/-/work_items/505
- [x] further research is required on what exactly happened in the systemd-sysupdate vacuum run of boot -11
- [x] further research is required on why systemd-sysupdate thought the erofs exists when it doesn't https://github.com/systemd/systemd/issues/41288
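The state suspected in the conclusion, a UKI whose backing erofs is missing, can be spotted by comparing the versions found in the two directories. The following is a rough sketch under assumptions: the file name patterns are inferred from the journal excerpts (including the `+1-0` boot counting suffix on the UKI), the function names are hypothetical, and the sample listing is made up for illustration rather than taken from the affected system.

```python
import re

# Patterns inferred from the journal excerpts; the optional "+1-0" part is
# the boot counting suffix seen on the freshly installed UKI.
EROFS_RE = re.compile(r"^kde-linux_(\d+)\.erofs$")
UKI_RE = re.compile(r"^kde-linux_(\d+)(?:\+\d+(?:-\d+)?)?\.efi$")


def versions(names, pattern):
    """Collect the version strings of all file names matching pattern."""
    return {m.group(1) for name in names if (m := pattern.match(name))}


def orphaned_ukis(system_files, boot_files):
    """Return the sorted versions that have a UKI but no backing erofs."""
    return sorted(versions(boot_files, UKI_RE) - versions(system_files, EROFS_RE))


# Hypothetical listing, NOT the real state of the affected machine.
system_files = ["kde-linux_202603180254.erofs", "kde-linux_202603180254.erofs.caibx"]
boot_files = ["kde-linux_202603190254.efi", "kde-linux_202603211640+1-0.efi"]
print(orphaned_ukis(system_files, boot_files))  # ['202603190254', '202603211640']
```

On a real system the two lists would come from `os.listdir("/system")` and `os.listdir("/boot/EFI/Linux")`. Any version reported here would boot into an empty /usr, which is exactly the situation the reproduction steps create on purpose.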