While working on Linux, you might encounter the "Failed to initialize nvml: driver/library version mismatch" error, typically when you run a GPU workload or the nvidia-smi command. In this short article, we will cover why the error occurs and discuss several ways to solve it.
Failed to initialize nvml: driver/library version mismatch – Possible Solutions
You will typically see this error when you run a GPU workload or the nvidia-smi command in your terminal. Nvidia drivers are software installed on the Linux operating system that lets the system access and operate Nvidia graphics cards.
The error says that the version of the loaded Nvidia kernel module does not match the version of the user-space driver library (NVML). This usually happens when the driver packages have been updated but the old kernel module is still loaded in memory.
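To make the mismatch concrete, the following sketch prints the two versions side by side. It assumes the loaded module reports its version in /proc/driver/nvidia/version and that ldconfig can locate libnvidia-ml.so.1; the helper names parse_nvrm_version and lib_version_from_path are hypothetical.

```shell
#!/bin/sh
# Sketch: show the loaded kernel module version next to the NVML library version.

# Extract the driver version from the first line of /proc/driver/nvidia/version.
# Reads stdin so the parsing can be tried on sample text.
parse_nvrm_version() {
    sed -n '1p' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Extract the version suffix from a library file name such as
# libnvidia-ml.so.470.57.02.
lib_version_from_path() {
    echo "${1##*.so.}"
}

if [ -r /proc/driver/nvidia/version ]; then
    echo "kernel module: $(parse_nvrm_version < /proc/driver/nvidia/version)"
fi

# Assumption: the dynamic linker cache lists libnvidia-ml.so.1 on this system.
lib=$(ldconfig -p 2>/dev/null | awk '/libnvidia-ml\.so\.1 / { print $NF; exit }')
if [ -n "$lib" ]; then
    # Resolve the symlink to the real file, whose name carries the version.
    echo "library:       $(lib_version_from_path "$(readlink -f "$lib")")"
fi
```

If the two printed versions differ, the library was most likely updated while an older module stayed loaded.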
Let us now see how we can solve the error using various possible methods:
Solution-1: Remove the module and load a new one
As discussed, the error is caused by a version difference, so we can remove the existing Nvidia module and load a new one. Follow the steps below to do this on your system.
Step-1: Check the version
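The first thing is to check the version of the Nvidia driver on your system. You can use either of the following commands (nvidia-smi may itself fail with this error while the mismatch persists, whereas modinfo nvidia still reports the version of the installed module):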
nvidia-smi
modinfo nvidia
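As a rough sketch of how to interpret these versions, the script below compares the module currently in memory with the module file installed on disk. It assumes the loaded module exposes its version at /sys/module/nvidia/version; the function name module_state is hypothetical.

```shell
#!/bin/sh
# Sketch: is the module in memory the same version as the one installed on disk?

module_state() {
    mem=$1    # version of the module currently loaded
    disk=$2   # version of the module file installed for the running kernel
    if [ -z "$mem" ]; then
        echo "not-loaded"
    elif [ "$mem" = "$disk" ]; then
        echo "in-sync"
    else
        echo "stale"
    fi
}

loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
ondisk=$(modinfo -F version nvidia 2>/dev/null)
echo "loaded=${loaded:-none} on-disk=${ondisk:-none} state=$(module_state "$loaded" "$ondisk")"
```

A "stale" result suggests that a newer driver has been installed and that reloading the module or rebooting may be enough to resolve the error.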

Step-2: Remove the Nvidia Driver
Now you can remove the Nvidia driver from the system using the following command:
sudo apt purge 'nvidia*'

This will remove the existing Nvidia driver.
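To confirm that the purge removed everything, you can list any Nvidia packages that are still installed. This is a minimal sketch that assumes a Debian or Ubuntu system with dpkg; the helper name list_nvidia_packages is hypothetical.

```shell
#!/bin/sh
# Sketch: list NVIDIA packages that dpkg still reports as installed.

# Reads `dpkg -l` output on stdin and prints "name version" for every
# installed (status "ii") package whose name contains "nvidia".
list_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $2, $3 }'
}

if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | list_nvidia_packages
fi
```

An empty result means no installed package name contains "nvidia".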
Step-3: Reinstall the Correct Driver
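Now reinstall the correct version of the Nvidia driver so that the kernel module and the library versions agree and the failed to initialize nvml: driver/library version mismatch error goes away. The command below installs version 470; substitute the driver version appropriate for your GPU.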
sudo apt install nvidia-driver-470 nvidia-settings nvidia-prime

Once the system has been rebooted so that the new kernel module is loaded, you will have the correct version of the Nvidia driver and should no longer see the error.
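If you are unsure which driver version to install, Ubuntu's ubuntu-drivers tool can suggest one. The sketch below assumes the tool is installed and that its "devices" output contains lines of the form "driver : nvidia-driver-470 - distro non-free recommended"; the helper name recommended_driver is hypothetical.

```shell
#!/bin/sh
# Sketch: print the driver package that `ubuntu-drivers devices` marks as recommended.

# Reads `ubuntu-drivers devices` output on stdin.
# Assumption: driver lines look like
#   driver   : nvidia-driver-470 - distro non-free recommended
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

if command -v ubuntu-drivers >/dev/null 2>&1; then
    ubuntu-drivers devices 2>/dev/null | recommended_driver
fi
```

The printed package name can then be used in place of nvidia-driver-470 in the install command.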
Solution-2: Drain and reboot the worker
The simplest solution to the problem is to reboot the node. A reboot ensures that the drivers are properly initialized after an upgrade.
If the drivers on a GPU worker node need to be updated, we advise draining the node first, updating the drivers, and then rebooting the node before deploying new workloads. If you are using container-based drivers, follow the upgrade guide for that setup, which details the recommended approach.
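The drain and restore steps could be sketched as follows. This assumes kubectl is configured for the cluster and a kubectl version that supports --delete-emptydir-data; the node name gpu-worker-1, the helper names run and gpu_node, and the DRY_RUN switch are hypothetical. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: drain a GPU worker before a driver update, uncordon it afterwards.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# gpu_node drain NODE     evict workloads and mark the node unschedulable
# gpu_node uncordon NODE  allow scheduling again after the update and reboot
gpu_node() {
    action=$1
    node=$2
    case "$action" in
        drain)    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data ;;
        uncordon) run kubectl uncordon "$node" ;;
        *)        echo "usage: gpu_node drain|uncordon NODE" >&2; return 2 ;;
    esac
}
```

A typical sequence would be gpu_node drain gpu-worker-1, then updating the drivers and rebooting the node, and finally gpu_node uncordon gpu-worker-1 once the node is back.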
Solution-3: Reload NVIDIA kernel modules
This solution is more complex and should only be used if it is not possible to drain and restart the problematic GPU worker. Note that any GPU workloads active on the node must still be stopped, so this approach offers little benefit if your goal is to avoid interrupting running GPU workloads. It is only helpful if the worker node has non-GPU workloads that cannot be removed or if the worker node cannot be restarted for some other reason.
Stop the NVIDIA device driver pods by removing the GPU provider label from the node
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider-
Restart kubelet
sudo systemctl restart kubelet
Check if there are any processes still using NVIDIA drivers
sudo lsof /dev/nvidia*
Check which NVIDIA kernel modules are loaded
lsmod | grep ^nvidia
Unload NVIDIA kernel modules
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
Verify that the modules are unloaded
lsmod | grep ^nvidia
Relaunch the NVIDIA addon pods by restoring the label
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider=NVIDIA
Once the addon pods are running again, the error should no longer appear.
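Because the modules depend on one another, the unload order in the steps above matters. The following sketch prints, without running them, the rmmod commands for whichever NVIDIA modules are currently loaded, in that same order. The helper name nvidia_unload_order is hypothetical.

```shell
#!/bin/sh
# Sketch: print (not run) the rmmod commands for the loaded NVIDIA modules.

# Reads `lsmod` output on stdin and prints the loaded NVIDIA modules in an
# order that unloads the dependent modules before the core "nvidia" module.
nvidia_unload_order() {
    loaded=$(awk '$1 ~ /^nvidia/ { print $1 }')
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # grep -x matches whole lines, so "nvidia" does not match "nvidia_uvm".
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "$mod"
        fi
    done
}

if command -v lsmod >/dev/null 2>&1; then
    lsmod | nvidia_unload_order | while read -r mod; do
        echo "sudo rmmod $mod"
    done
fi
```

Review the printed commands before running them, and only run them after lsof shows that no process is still using the devices.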
Summary
The Failed to initialize nvml: driver/library version mismatch error occurs when the version of the loaded Nvidia kernel module differs from the version of the Nvidia driver library. This short article discussed several ways to solve the error. Choose the method that best fits your system.