Failed to initialize nvml: driver/library version mismatch

While working with NVIDIA GPUs on Linux, you might encounter the error "Failed to initialize nvml: driver/library version mismatch". It typically appears when you run a GPU workload or the nvidia-smi command. This short article explains why the error occurs and walks through several ways to fix it.

Failed to initialize nvml: driver/library version mismatch – Possible Solutions

You will typically see this error when you run a GPU workload or the nvidia-smi command in your terminal. NVIDIA drivers are software installed on the Linux system that lets the operating system access and operate NVIDIA graphics cards.

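The error message says that the version of the NVIDIA kernel module that is currently loaded does not match the version of the user-space driver library (NVML) used by tools such as nvidia-smi. This usually happens when the driver packages have been updated but the old kernel module is still loaded.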
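To make the mismatch concrete, here is a minimal sketch that compares the version of the loaded kernel module with the version embedded in the NVML library's file name. It assumes that the module exposes its version at /sys/module/nvidia/version and that the library lives in /usr/lib/x86_64-linux-gnu (typical for Debian and Ubuntu; adjust for your distribution). The function names are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module version with the NVML library version.
# Assumptions (not from the article): the sysfs path and the default library
# directory below; adjust both for your distribution.

# Extract "470.82.01" from a path such as /some/dir/libnvidia-ml.so.470.82.01
lib_version_from_path() {
    printf '%s\n' "${1##*.so.}"
}

# Report whether two version strings are identical.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, library $2"
    fi
}

check_nvml_mismatch() {
    lib_dir=${1:-/usr/lib/x86_64-linux-gnu}
    if [ ! -r /sys/module/nvidia/version ]; then
        echo "nvidia kernel module is not loaded"
        return 0
    fi
    module_version=$(cat /sys/module/nvidia/version)
    # The fully versioned file has a dotted version after ".so."
    # (a name like libnvidia-ml.so.1 does not match this pattern).
    for lib in "$lib_dir"/libnvidia-ml.so.*.*; do
        if [ -e "$lib" ]; then
            compare_versions "$module_version" "$(lib_version_from_path "$lib")"
            return 0
        fi
    done
    echo "no versioned libnvidia-ml.so found in $lib_dir"
}

check_nvml_mismatch
```

If your library is stored elsewhere, pass the directory as the first argument, for example check_nvml_mismatch /usr/lib64.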

Let us now see how we can solve the error using various possible methods:

Solution-1: Remove the module and load a new one

As discussed, the error is caused by a version mismatch. One fix is to remove the installed NVIDIA driver and install a fresh one, so that the kernel module and the library come from the same version. Follow the steps below.

Step-1: Check the version

First, check the version of the NVIDIA driver and kernel module on your system. You can use either of the following commands:

nvidia-smi

modinfo nvidia

While the mismatch is present, nvidia-smi prints the error message itself, whereas modinfo nvidia reports the version of the kernel module that is installed on disk.
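After a package upgrade, the module on disk can be newer than the module that is still loaded. The following is a minimal sketch of that check. It assumes /sys/module/nvidia/version reports the loaded module's version, and needs_reload is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: detect a stale loaded module after a driver upgrade.
# Assumption (not from the article): /sys/module/nvidia/version exposes the
# version of the module that is currently loaded.

# Succeeds (exit status 0) when both versions are known and they differ.
needs_reload() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# "|| true" keeps the script going when the module or modinfo is absent.
loaded=$(cat /sys/module/nvidia/version 2>/dev/null || true)
installed=$(modinfo -F version nvidia 2>/dev/null || true)

echo "loaded module:    ${loaded:-not loaded}"
echo "installed module: ${installed:-not found}"

if needs_reload "$loaded" "$installed"; then
    echo "versions differ: reload the module or reboot"
fi
```

When the two lines show different versions, the loaded module predates the installed driver, which is the situation the solutions in this article address.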

Step-2: Remove the Nvidia Driver

Now remove the NVIDIA driver from the system using the following command. The quotes keep the shell from expanding the pattern against file names in the current directory:

sudo apt purge 'nvidia*'

This removes the existing NVIDIA driver packages.

Step-3: Reinstall the Correct Driver

Now reinstall the correct version of the NVIDIA driver so that the kernel module and library versions agree and the failed to initialize nvml: driver/library version mismatch error goes away. The command below installs the 470 driver branch; substitute the branch that suits your GPU.

sudo apt install nvidia-driver-470 nvidia-settings nvidia-prime

You now have a consistent version of the NVIDIA driver on your system. Once the new kernel module is loaded (for example, after a reboot), the error should no longer appear.
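If you would rather stay on the driver branch you already had, you can note the version from Step-1 before purging and derive the package name from it. The sketch below assumes the nvidia-driver-<branch> naming used in the command above; driver_package_for is a hypothetical helper, and 470.82.01 is only an example version.

```shell
#!/bin/sh
# Sketch: derive an Ubuntu-style driver package name from a version string.
# Assumption: packages follow the nvidia-driver-<branch> naming shown above.

driver_package_for() {
    # Keep everything before the first dot: "470.82.01" -> "470"
    printf 'nvidia-driver-%s\n' "${1%%.*}"
}

# Example version only; use the one reported on your system in Step-1.
driver_package_for 470.82.01
```

The result can then be passed to the installer, for example sudo apt install "$(driver_package_for 470.82.01)".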

Solution-2: Drain and reboot the worker

The simplest solution is to reboot the node. After a reboot, the upgraded drivers are initialized properly, because the new kernel module is loaded in place of the old one.

If a GPU worker node needs drivers to be updated, we advise draining the node first, updating the drivers, and rebooting the node before deploying new workloads. This is a guide that details the recommended approach for upgrading if you are utilizing container-based drivers.
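The drain, update, and reboot sequence can be sketched as follows. This is an illustration under assumptions: worker-1 is a placeholder node name, the helper names are hypothetical, and the script only prints the kubectl commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: drain a GPU worker before a driver update and uncordon it afterwards.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Evict workloads from the node; DaemonSet pods are left in place.
drain_node() {
    run kubectl drain "$1" --ignore-daemonsets
}

# Allow new workloads to be scheduled on the node again.
uncordon_node() {
    run kubectl uncordon "$1"
}

# "worker-1" is a placeholder node name.
drain_node worker-1
# ... update the drivers on the node and reboot it (for example: sudo reboot) ...
uncordon_node worker-1
```

Depending on the workloads on the node, kubectl drain may need additional flags (for instance, to evict pods that use emptyDir volumes), so review the dry-run output before running it for real.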

Solution-3: Reload NVIDIA kernel modules

This solution is more complex and should only be used if it is not possible to drain and restart the problematic GPU worker. Note that any GPU workloads that are active on the node must still be stopped first. The approach therefore offers little benefit if your goal is to avoid draining and rebooting because GPU workloads are currently running. It is only helpful if the worker node runs non-GPU workloads that cannot be removed or if the node cannot be restarted for some other reason.

Stop NVIDIA device driver

kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider-

Restart kubelet

sudo systemctl restart kubelet 

Check if there are any processes still using NVIDIA drivers

sudo lsof /dev/nvidia*

Check which NVIDIA kernel modules are loaded

lsmod | grep ^nvidia 

Unload NVIDIA kernel modules

sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

Verify that the modules are unloaded

lsmod | grep ^nvidia 

Relaunch the NVIDIA addon pods

kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider=NVIDIA

After these steps, the error should be gone. You can confirm this by running nvidia-smi again.
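The unload step above depends on removing the modules in the right order, with the core nvidia module last. As a minimal sketch, the following script reads the lsmod output and prints the rmmod commands for the NVIDIA modules that are actually loaded, without executing them. The function name unload_order is hypothetical.

```shell
#!/bin/sh
# Sketch: print the rmmod commands for the loaded NVIDIA modules,
# dependents first and the core "nvidia" module last.

# Reads lsmod-style output on stdin and prints the loaded NVIDIA modules
# in the order in which they should be unloaded.
unload_order() {
    loaded=$(awk '{ print $1 }')
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # -x matches whole lines only, so "nvidia" does not match "nvidia_uvm".
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "$mod"
        fi
    done
}

# Only print the commands; run them by hand once no process uses the GPU.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | unload_order | while read -r mod; do
        echo "sudo rmmod $mod"
    done
fi
```

Printing rather than running the commands is a deliberate choice here: rmmod fails while a module is in use, so it is worth checking the lsof output first.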

Summary

The Failed to initialize nvml: driver/library version mismatch error occurs when the version of the loaded NVIDIA kernel module differs from the version of the NVIDIA driver library. This short article discussed several ways to resolve it: reinstalling the driver, rebooting the node, or reloading the kernel modules. Choose the method that best fits your system.
