There are lots of tutorials on the internet on how to install a GPU driver on Linux. However, I end up searching again every time I want to install a GPU driver, and I get frustrated when there are multiple ways of doing the same thing and I don't know which one is best.

Also, there are some things I've learned along the way that I'd like to share.

I've tested this on an RTX 3090 GPU and Ubuntu 22.04. Your setup may vary a little, but the procedure is the same.

Before Installation

First, install build-essential if you don't have it:

sudo apt install build-essential

(Optional) Sometimes you also have to update GCC to a newer version (Ubuntu 22.04 defaults to GCC 11). Otherwise, you may get errors like this one:

unrecognized command-line option -ftrivial-auto-var-init=zero

To update the GCC compiler, you just need to run these two commands (Reference):

sudo apt install --reinstall gcc-12
sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc

You can check the installation using gcc --version:

gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
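
As an aside, instead of overwriting the /usr/bin/gcc symlink by hand, you can let update-alternatives manage the default compiler. This is only a sketch of that alternative approach; the priority value (100) is arbitrary:

# Register gcc-12 as an alternative for gcc
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
# Interactively choose which gcc version is the default
sudo update-alternatives --config gcc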

You should also install these two packages in case you don’t have them:

sudo apt install pkg-config libglvnd-dev

If you don't install them, you will get this warning when installing the driver (Reference):

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

I would also install the Vulkan loader package to avoid this warning:

 WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

The command to install the Vulkan loader is (Reference):

sudo apt install libvulkan1

Downloading and Installing the Driver

You should now download the NVIDIA driver from here. In my experience, the latest and greatest driver version can normally be found there.

There are also ways to install NVIDIA drivers using package managers like apt, but personally I find the .run installer more reliable. Also, newer driver versions often appear on the official NVIDIA site before they reach apt.

For example, at the time of writing, NVIDIA driver version 550 is not available via apt, but it is easily downloadable from the official NVIDIA website.
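
If you want to compare before deciding, you can check which driver packages your apt repositories offer. These are standard Ubuntu commands, although ubuntu-drivers may need to be installed separately (it ships in the ubuntu-drivers-common package):

# List the NVIDIA driver packages available via apt
apt-cache search --names-only '^nvidia-driver-'
# Show detected GPUs and the recommended packaged driver
ubuntu-drivers devices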

To see your GPU information, use (Reference):

sudo lshw -C display

Then, you will get an output like this:

  *-display
       description: VGA compatible controller
       product: GA102 [GeForce RTX 3090]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       version: a1
       width: 64 bits
       clock: 33MHz

or

lspci | grep -i nvidia

It shows that I am using an RTX 3090. On the download page, I would then choose Linux 64-bit as the operating system and Production Branch as the download type. It would show something like this page.

You will then download a .run file. Make it executable and run it as root:

chmod +x NVIDIA-Linux-x86_64-550.54.14.run
sudo ./NVIDIA-Linux-x86_64-550.54.14.run

Note that if you are working inside a VM, you can start the download in your browser, copy the download link, and use wget inside the VM to fetch the file there, as shown below.
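
For example (the URL below is only illustrative; paste the exact link you copied from your browser):

# Example only - substitute the download link from your browser
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run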

This should normally work and do the job for you. I would answer yes to everything it asks afterwards. In particular, make sure to answer yes on the prompt whose default is no; otherwise, your GPU driver won't be ready to use after restarting your Linux machine.

If the installation succeeded, running nvidia-smi should list your GPU along with the installed driver version.
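
A quick way to sanity-check the installation from the terminal (these are standard nvidia-smi and kernel commands, nothing specific to this setup):

# Print the GPU name and the driver version that was just installed
nvidia-smi --query-gpu=name,driver_version --format=csv
# Confirm the nvidia kernel modules are loaded
lsmod | grep nvidia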

Blacklist Nouveau (when using VMware ESXi)

Sometimes, you may be working on a virtual machine hosted on VMware ESXi. In that case, you may get an error like this when running nvidia-smi:

No devices were found

To solve it, you should also do the following (Reference):

sudo mkdir -p /etc/modprobe.d/
echo 'blacklist nouveau' | sudo tee /etc/modprobe.d/blacklist-nvidia-nouveau.conf
echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/blacklist-nvidia-nouveau.conf
echo 'options nvidia NVreg_OpenRmEnableUnsupportedGpus=1' | sudo tee /etc/modprobe.d/nvidia.conf

Then rebuild the initramfs using:

sudo update-initramfs -u

and reboot the server:

sudo reboot
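
After the reboot, you can confirm that Nouveau is no longer loaded; the following command should print nothing:

lsmod | grep nouveau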

After that, you should install the driver using:

sudo ./NVIDIA-Linux-x86_64-550.54.14.run -m=kernel-open

Note that -m=kernel-open is important: it tells the installer to use the open-source kernel modules, which is what the NVreg_OpenRmEnableUnsupportedGpus option above applies to. Without it, the setup wouldn't work.

Reboot the server afterwards and you are good to go.

Debugging

Debugging an NVIDIA driver installation can be a hard task. Here's how to make it easier.

If you have problems during installation, a couple of commands help:

less /var/log/nvidia-installer.log

The NVIDIA installer won't show much detail when there is an error; use the log file to see what went wrong.

sudo nvidia-bug-report.sh
less nvidia-bug-report.log.gz

Also, dmesg helps (make sure to run it right after nvidia-smi to see more information on why it is having problems):

sudo dmesg
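
To cut through the noise, you can filter the kernel log for driver-related messages (this is just grep, nothing NVIDIA-specific):

sudo dmesg | grep -i -E 'nvidia|nvrm|nouveau'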

Installing CUDA Toolkit

After installing the GPU driver, you might need to install CUDA on the server. Make sure to read this tutorial and follow along:

NVIDIA CUDA Installation Guide for Linux

However, I personally wouldn't install it, since I use Docker. I normally use Docker base images that already contain CUDA. But if you want to run code on bare metal, you can install CUDA.
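
If you do install the toolkit on bare metal, a quick sanity check (assuming the default install location under /usr/local/cuda) is:

# nvcc ships with the CUDA Toolkit, not with the driver
/usr/local/cuda/bin/nvcc --version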

NVIDIA says you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed (GitHub - NVIDIA/nvidia-container-toolkit: Build and run containers leveraging NVIDIA GPUs).

Add GPU Support For Docker

First, install Docker using this guide (in case you don't have it already).

Then, I would use this tutorial to give the Docker daemon access to the GPU.

In short, first add the NVIDIA repository to apt:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Then install nvidia-container-toolkit:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

After that, configure the Docker runtime (which updates /etc/docker/daemon.json) and restart Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
cat /etc/docker/daemon.json

Output:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

You should now be able to run:

docker run --gpus all nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 nvidia-smi

The output should be the same as running nvidia-smi on bare metal (outside Docker).
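
You can also confirm that Docker registered the NVIDIA runtime; the Runtimes line in docker info should now include nvidia:

docker info | grep -i runtimes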

Note that you don't need to do this on Azure VMs. There are pre-built images with the NVIDIA driver already installed; you can also use either the HPC or DSVM images from the marketplace.

References