Multiple GPUs for graphics and deep learning

For a long time I have been using a good old nvidia GeForce GTX 1050 for my display and deep learning needs. I reported a few times how to get Tensorflow running on Debian/Sid, see here and here. Later on I switched to an AMD GPU in the hope that an open source approach to both the GPU driver and deep learning (ROCm) would improve the general experience. Unfortunately it turned out that AMD GPUs are generally not ready for deep learning usage.

The problems with AMD and ROCm are far and wide. First of all, it seems that for anything more complicated than simple stuff, AMD’s flagship RX 5700(XT) and all GFX10 (Navi) based cards are not(!!!) supported in ROCm. Yes, you read that correctly … AMD does not support 5700(XT) cards in the ROCm stack. Some simple stuff works, but nothing for real computations.

Then, even if they were supported, ROCm as distributed is currently a huge pain in the butt. The source code is a huge mess, and building usable packages from it is probably possible, but quite painful (I am a member of the ROCm packaging team in Debian, and have spent many hours trying). And the packages provided by AMD are not installable on Debian/sid due to library incompatibilities.

So that left me with a bit of a problem: for work I need to train quite a few neural networks, do model selection, etc. Doing this on a CPU is a bit of a burden. So in the end I decided to put the nVidia card back into the computer (well, after moving it to a bigger case, but that is a different story to tell). Here are the steps I took to get both cards working for their respective targets: the AMD GPU for driving the console and X (and games!), and the nVidia card for the deep learning stuff (tensorflow using the GPU).

Starting point

The starting point was a working AMD GPU installation. The AMD GPU is also the first GPU card (top slot) and thus the one that is used by the BIOS and the Linux console. If you want the video output on the second card you need to resort to tricks, and probably won’t have console output, etc etc. So not a solution for me.
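To double-check which card the firmware treats as the primary one, the PCI entries in sysfs can be inspected. The following is only a minimal sketch, assuming the usual Linux sysfs layout under /sys/bus/pci/devices; the function name list_gpus and the small vendor table are made up for illustration.

```python
import os

# PCI vendor IDs of the two GPU makers discussed in this post.
VENDORS = {"0x10de": "NVIDIA", "0x1002": "AMD"}


def _read(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def list_gpus(root="/sys/bus/pci/devices"):
    """List display-class PCI devices found under the sysfs directory `root`."""
    gpus = []
    if not os.path.isdir(root):
        return gpus
    for addr in sorted(os.listdir(root)):
        dev = os.path.join(root, addr)
        pci_class = _read(os.path.join(dev, "class")) or ""
        # PCI base class 0x03 means "display controller".
        if not pci_class.startswith("0x03"):
            continue
        vendor = _read(os.path.join(dev, "vendor"))
        driver_link = os.path.join(dev, "driver")
        # The "driver" symlink points at the kernel driver bound to the device.
        driver = (os.path.basename(os.path.realpath(driver_link))
                  if os.path.exists(driver_link) else None)
        gpus.append({
            "address": addr,
            "vendor": VENDORS.get(vendor, vendor),
            "driver": driver,
            # boot_vga contains "1" for the card used as the boot display.
            "boot_vga": _read(os.path.join(dev, "boot_vga")) == "1",
        })
    return gpus


if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu)
```

On a setup like the one described here, the AMD card would be expected to show up with boot_vga set, and the nvidia card without it.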

Installing libcuda1 and the nvidia kernel drivers

Next step was installing the libcuda1 package:

apt install libcuda1

This installs a lot of stuff, including the nvidia drivers, GLX libraries, alternatives setup, and the update-glx tool and package.

The kernel module should be built and installed automatically for your kernel.
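Whether the module really got loaded can be checked by looking at /proc/modules (the same data lsmod shows). Here is a small sketch; the helper names are hypothetical, and matching every module whose name starts with "nvidia" is an assumption on my side, since the exact module names can differ between driver packages.

```python
def nvidia_modules(proc_modules_text):
    """Return the sorted names of loaded modules that start with 'nvidia'.

    `proc_modules_text` is text in the format of /proc/modules, where the
    first whitespace-separated field of every line is the module name.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.add(fields[0])
    return sorted(names)


def nvidia_modules_loaded(path="/proc/modules"):
    """Read `path` and return the matching module names (empty if unreadable)."""
    try:
        with open(path) as fh:
            return nvidia_modules(fh.read())
    except OSError:
        return []


if __name__ == "__main__":
    found = nvidia_modules_loaded()
    print("nvidia modules loaded:", ", ".join(found) if found else "none")
```

An empty result after a reboot would suggest that the module was not built for the running kernel or was not loaded.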

Installing CUDA

Follow more or less the instructions here and do

wget -O- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo tee /etc/apt/trusted.gpg.d/nvidia-cuda.asc
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
sudo apt-get update
sudo apt-get install cuda-libraries-10-1

Warning! At the moment Tensorflow packages require CUDA 10.1, so don’t install the 10.0 version. This might change in the future!

This will install lots of libs into /usr/local/cuda-10.1 and add the respective directory to the ld.so path by creating a file /etc/ld.so.conf.d/cuda-10-1.conf.
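A quick way to see whether the dynamic loader actually finds the new libraries is to try opening them via ctypes. This is just a sketch: the two sonames listed (libcuda.so.1 from the driver and libcudart.so.10.1 from the CUDA 10.1 runtime) are what I would expect on this setup, so adjust the list if your versions differ.

```python
import ctypes

# Assumed sonames for this setup: the driver library and the CUDA 10.1 runtime.
SONAMES = ["libcuda.so.1", "libcudart.so.10.1"]


def can_load(soname):
    """Return True if the dynamic loader can find and open `soname`."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


def report(sonames=SONAMES):
    """Map every soname to whether it could be loaded."""
    return {name: can_load(name) for name in sonames}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

If a library is reported as not found although the package is installed, running ldconfig as root and checking the file in /etc/ld.so.conf.d is probably the first thing to try.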

Install CUDA CuDNN

One dependency that is difficult to satisfy is the cuDNN library. In our case we need the version 7 library for CUDA 10.1. To download these files one needs an NVIDIA developer account; signing up is quick and painless. After that go to the cuDNN page, select Archived releases, then Download cuDNN v7.N.N (xxxx NN, YYYY), for CUDA 10.1, and then cuDNN Runtime Library for Ubuntu18.04 (Deb).

As of today this downloads a file libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb, which needs to be installed with dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb.
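Since it is easy to grab the build for the wrong CUDA version from the archive page, a small check of the downloaded file name can help before installing. The sketch below assumes the naming scheme seen in the file above (package_version-revision+cudaX.Y_arch.deb); the function names are hypothetical.

```python
import re

# Pattern modelled on the file name shown above; other releases may differ.
_CUDNN_DEB = re.compile(
    r"^(?P<package>libcudnn\d+)"
    r"_(?P<cudnn>\d+(?:\.\d+)*)-\d+"
    r"\+cuda(?P<cuda>\d+\.\d+)"
    r"_(?P<arch>\w+)\.deb$"
)


def parse_cudnn_deb(filename):
    """Split a cuDNN .deb file name into its parts, or return None."""
    match = _CUDNN_DEB.match(filename)
    return match.groupdict() if match else None


def built_for_cuda(filename, wanted="10.1"):
    """Return True if the file name says the package targets CUDA `wanted`."""
    info = parse_cudnn_deb(filename)
    return info is not None and info["cuda"] == wanted


if __name__ == "__main__":
    name = "libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb"
    print(parse_cudnn_deb(name))
    print("matches CUDA 10.1:", built_for_cuda(name))
```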

Updating the GLX setting

Here now comes the very interesting part: one needs to set up the GLX libraries. Start by reading the output of update-glx --help and then the output of update-glx --list glx:

$ update-glx --help
update-glx is a wrapper around update-alternatives supporting only configuration
of the 'glx' and 'nvidia' alternatives. After updating the alternatives, it
takes care to trigger any follow-up actions that may be required to complete
the switch.
 
It can be used to switch between the main NVIDIA driver version and the legacy
drivers (eg: the 304 series, the 340 series, etc).
 
For users with Optimus-type laptops it can be used to enable running the discrete
GPU via bumblebee.
 
Usage: update-glx <command>
 
Commands:
  --auto <name>            switch the master link <name> to automatic mode.
  --display <name>         display information about the <name> group.
  --query <name>           machine parseable version of --display <name>.
  --list <name>            display all targets of the <name> group.
  --config <name>          show alternatives for the <name> group and ask the
                           user to select which one to use.
  --set <name> <path>      set <path> as alternative for <name>.
 
<name> is the master name for this link group.
  Only 'nvidia' and 'glx' are supported.
<path> is the location of one of the alternative target files.
  (e.g. /usr/lib/nvidia)
 
$ update-glx --list glx
/usr/lib/mesa-diverted
/usr/lib/nvidia

I was tempted into using

update-glx --set glx /usr/lib/mesa-diverted

because in the end the Mesa GLX libraries should be used to drive the display via the AMD GPU.

Unfortunately, with this setting the nvidia kernel module was not loaded, and nvidia-persistenced couldn’t run because the library libnvidia-cfg1 wasn’t found (not sure it was needed at all…), so there was also no way to run tensorflow on the GPU.

So what I tried instead was

update-glx --auto glx

(which selects the same alternative as update-glx --set glx /usr/lib/nvidia), rebooted, and decided to check afterwards what was broken.

To my big surprise, the AMD GPU still worked out of the box, including direct rendering, and the games I tried (Overload, Supraland via Wine) all worked without a hitch.

Not that I really understand why everything works the same although the GLX libraries that are seemingly now in use are from nvidia (if anyone has an explanation, that would be great!), but since I haven’t had any problems so far, I am content.

Checking GPU usage in tensorflow

Make sure that you remove tensorflow-rocm and reinstall tensorflow with GPU support:

pip3 uninstall tensorflow-rocm
pip3 install --upgrade tensorflow-gpu

After that a simple

$ python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
....(lots of output)
2020-09-02 11:57:04.673096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3581 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
tf.Tensor(1093.4915, shape=(), dtype=float32)
$

should indicate that the GPU is used by tensorflow!
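Instead of scanning the log output, one can also ask tensorflow directly which GPUs it sees. This is a minimal sketch, assuming a tensorflow 2.x version that provides tf.config.list_physical_devices; the helper name visible_gpus is made up.

```python
def visible_gpus():
    """Return the names of the GPUs tensorflow can see.

    Returns None if tensorflow is not installed, and an empty list if it
    is installed but does not find any usable GPU.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [dev.name for dev in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not installed")
    elif not gpus:
        print("tensorflow found no GPU, check the driver, CUDA and cuDNN setup")
    else:
        print("tensorflow sees:", ", ".join(gpus))
```

An empty list usually means that one of the libraries from the previous steps could not be loaded; the tensorflow log output names the missing one.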

The R Keras package should also work out of the box and pick up the system-wide tensorflow, which in turn picks up the GPU; see this post for example code to run for tests.

Conclusion

All in all it was easier than expected, despite the dances one has to do for nvidia to get the correct libraries. What still puzzles me is the selection option in update-glx, which might need better support for secondary nvidia GPU cards.
