Fix Boot Failure after System Update
After a full system update, my Manjaro Linux system refused to boot. If I couldn't fix it, I would have to reinstall from scratch, which would be extremely time-consuming: I would need to reconfigure everything to my preferences, and some important data would inevitably be lost along the way.
Luckily, I discovered that selecting the older kernel version at the boot menu let the system start normally. Relying on the old kernel wasn't a long-term solution, though, since it would leave me stuck on an outdated version, so I used it to boot up and investigate the problem. My first step was to examine the kernel log from the failed boot, which I could now access since the system was up. The following command surfaced several kernel errors from the boot sequence:
journalctl -k -b -1
Here is the first error:
May 11 12:45:35 Moment kernel: ACPI BIOS Error (bug): Failure creating named object [\_TZ.ETMD], AE_ALREADY_EXISTS (20240827/dswload2-326)
May 11 12:45:35 Moment kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20240827/psobject-220)
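As an aside, scanning for errors like these is faster if the journal is filtered by priority; `-p err` is a standard journalctl option that restricts output to error level and worse. A small sketch, with the live command shown as a comment since it only makes sense on the affected machine:

```shell
# Kernel log of the previous boot, errors and worse only (run on the machine):
#   journalctl -k -b -1 -p err
# journalctl priorities follow syslog levels; "err" corresponds to level 3:
declare -A prio=([emerg]=0 [alert]=1 [crit]=2 [err]=3 [warning]=4)
echo "${prio[err]}"   # -> 3
```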
According to a blog post I read, updating the BIOS might resolve the issue, so I switched to the Windows side of my dual-boot setup and updated the BIOS. Unfortunately, although the update itself succeeded, the dual-boot configuration stopped working and I could no longer boot back into Manjaro Linux. I eventually fixed that by following these steps:
# 1. Enter live Manjaro using my Manjaro Linux installation USB drive
# 2. Find out the Manjaro Linux partition and EFI partition
sudo fdisk -l
# 3. Mount the two partitions
sudo mount /dev/nvme0n1p2 /mnt
sudo mount /dev/nvme0n1p1 /mnt/boot/efi
# 4. chroot
sudo manjaro-chroot -a
# 5. reinstall grub
sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=Manjaro
sudo update-grub
# 6. exit and reboot
exit
sudo reboot
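As a sanity check after reinstalling GRUB, the firmware's boot entries can be listed with `efibootmgr` to confirm that a Manjaro entry exists. A minimal sketch of that check; the sample line below is an illustrative stand-in for real `efibootmgr` output, not from my machine:

```shell
# On the real machine: efibootmgr | grep Manjaro
# Illustrative sample of an efibootmgr output line:
sample='Boot0001* Manjaro	HD(1,GPT,...)/File(\EFI\Manjaro\grubx64.efi)'
if printf '%s\n' "$sample" | grep -q 'Manjaro'; then
    echo "Manjaro boot entry present"
fi
```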
I also tried using both the Windows system and a Kali Linux installation USB drive to repair the dual-boot setup, but neither method worked.
At this point, despite the BIOS update, both the boot failure and the first error remained. I took a moment to calm down and asked myself whether this error was really the underlying cause. To test that hypothesis, I booted with the old kernel again and inspected its kernel log. Interestingly, the same error appeared there too, even though that boot succeeded, which meant it could not be the root cause. So I moved on to the sole remaining error message:
vfio-pci 0000:02:00.0: probe with driver vfio-pci failed with error -22
Upon further investigation, it appeared that the new kernel had failed to bind the device 0000:02:00.0 to vfio-pci. Before proceeding, I needed to identify what device 0000:02:00.0 actually was:
lspci -nnk -s 0000:02:00.0
The output of the above command revealed that the device was an NVIDIA GPU, and that it was currently claimed by the regular GPU driver rather than vfio-pci. That jogged my memory: I had once configured GPU passthrough on this machine and later abandoned it, but evidently parts of the configuration had never been reverted. Since I no longer needed GPU passthrough, the obvious fix was to delete the leftover settings:
# 1. delete the configuration file
sudo rm /etc/modprobe.d/vfio.conf
# 2. modify the kernel modules if necessary
sudo vim /etc/mkinitcpio.conf
# 3. update initramfs if necessary
sudo mkinitcpio -P
# 4. remove the NVIDIA driver (nonfree driver) from the blacklist if necessary
sudo vim /etc/modprobe.d/blacklist.conf
# 5. reinstall the NVIDIA driver (nonfree driver)
sudo pacman -S nvidia nvidia-utils nvidia-settings
# 6. remove params related to vfio
sudo vim /etc/default/grub
# 7. update GRUB
sudo update-grub
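After removing the vfio configuration and rebuilding the initramfs, the driver actually bound to the GPU can be confirmed through sysfs, where each PCI device exposes a `driver` symlink. A minimal sketch; the symlink path below is an illustrative stand-in for the real target:

```shell
# On the real machine the bound driver is a symlink in sysfs:
#   basename "$(readlink /sys/bus/pci/devices/0000:02:00.0/driver)"
# Illustrative stand-in for the symlink target after the cleanup:
driver_link="/sys/bus/pci/drivers/nvidia"
basename "$driver_link"   # prints the driver name, here "nvidia"
```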
After performing a reboot test, it became evident that cleaning up the GPU passthrough configuration had also failed to rectify the boot failure issue. Interestingly, there were no additional kernel errors detected during the booting process. This led me to ponder whether the root cause could be unrelated to the kernel. To gain further insight, I needed to collect more comprehensive logs of the boot sequence:
journalctl -b -1
Scrutinizing the boot log with the command above made it clear that Xorg had failed to start properly:
May 11 16:07:09 Moment systemd-coredump[1370]: Process 1364 (Xorg) of user 0 terminated abnormally with signal 6/ABRT, processing...
Additionally, I reviewed the Xorg log from the previous boot:
vim /var/log/Xorg.0.log.old
Two errors stood out in the log. The first:
Failed to load module "glxservernvidia" (module does not exist, 0)
Here are the steps to resolve the first error:
# 1. delete the Xorg configuration
sudo rm /etc/X11/xorg.conf
# 2. regenerate the Xorg configuration
sudo nvidia-xconfig
# 3. reboot the system
reboot
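The regenerated `/etc/X11/xorg.conf` should contain a Device section naming the nvidia driver, which is a quick way to verify that `nvidia-xconfig` did its job. A sketch of that check against sample content (not my actual file):

```shell
# Minimal Device section as nvidia-xconfig typically writes it (illustrative):
cat > /tmp/xorg.conf.sample <<'EOF'
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
EndSection
EOF
# On the real system: grep -n 'Driver' /etc/X11/xorg.conf
grep -o '"nvidia"' /tmp/xorg.conf.sample
```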
However, fixing this error alone did not resolve the boot failure. Let's turn to the second error in the Xorg log:
[ 13.127] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
[ 13.136] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[ 13.136] compiled for 1.6.99.901, module version = 1.0.0
[ 13.136] Module class: X.Org Server Extension
[ 13.136] (II) NVIDIA GLX Module 565.57.01 Thu Oct 10 12:09:28 UTC 2024
[ 13.136] (EE)
[ 13.136] (EE) Backtrace:
[ 13.136] (EE) unw_get_proc_name failed: no unwind info found [-10]
[ 13.136] (EE) 0: /usr/lib/Xorg (?+0x0) [0x62623916ffbc]
[ 13.136] (EE) unw_get_proc_name failed: no unwind info found [-10]
[ 13.136] (EE) 1: /usr/lib/libc.so.6 (?+0x0) [0x75ef682e3cd0]
[ 13.136] (EE)
[ 13.136] (EE) Segmentation fault at address 0x0
[ 13.136] (EE)
Fatal server error:
[ 13.136] (EE) Caught signal 11 (Segmentation fault). Server aborting
[ 13.136] (EE)
I must admit that it was a stroke of luck that led me to notice the discrepancy between the version of the NVIDIA GLX Module and the version of the NVIDIA driver I had installed. Without further ado, here is the final solution that resolved the boot failure issue:
sudo rm /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
sudo rm /usr/lib/xorg/modules/drivers/nvidia_drv.so
sudo mkinitcpio -P
sudo depmod -a
# for the old kernel (the NVIDIA driver was not working)
sudo pacman -S extra/linux54-nvidia
# for the new kernel
sudo pacman -S extra/linux612-nvidia
# check the GPU status
nvidia-smi
# reboot the system (using the new kernel by default)
reboot
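The mismatch that caused all this can be checked for directly: `modinfo -F version nvidia` reports the kernel module's version, and `nvidia-smi --query-gpu=driver_version --format=csv,noheader` reports the running driver's. A sketch of the comparison; the version strings below are illustrative placeholders (the second one is hypothetical), since the live commands only work on the affected machine:

```shell
# On the real machine:
#   module_ver=$(modinfo -F version nvidia)
#   driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
# Illustrative placeholders reproducing the kind of mismatch I hit:
module_ver="565.57.01"   # version the stale GLX module reported in the log
driver_ver="570.86.16"   # hypothetical newly installed driver version
if [ "$module_ver" != "$driver_ver" ]; then
    echo "version mismatch: $module_ver vs $driver_ver"
fi
```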
In conclusion, the boot failure stemmed from Xorg failing to start because of mismatched NVIDIA driver versions. Although I eventually resolved the issue, the process was time-consuming, largely because at first I was too impatient to examine all of the errors before acting. A calmer, more methodical approach to diagnosis would have saved time and effort. For instance, instead of rushing to update the BIOS, I should have examined the other errors and started from the full boot log, which would have spared me the trouble of repairing the dual-boot setup.