Deep Learning with Multiple GPUs in 2024: A Comprehensive Guide

Yo, fellow AI enthusiasts! Let’s be real, large language models (LLMs) are all the rage these days. They’re like, the rockstars of the AI world, right? But here’s the catch: these LLMs are memory hogs. They need a ton of GPU memory (VRAM) to run smoothly, especially if you’re developing or using them locally. It’s like trying to fit an elephant in a Mini Cooper – not gonna happen!

Now, you might think, “No problemo, I’ll just slap in a couple more GPUs and call it a day.” Well, hold your horses, cowboy! Simply having multiple GPUs is like having a fleet of sports cars but no idea how to drive them. You need the right setup – drivers, software, the whole shebang – to unleash their true potential and make them work in perfect harmony.

That’s where this totally awesome guide comes in. We’re gonna walk you through the entire process of setting up a multi-GPU Linux machine for deep learning, specifically using those badass Nvidia GPUs. Think of this as your pit crew, getting your system race-ready for the world of AI.

Who is this Guide For?

This guide is for anyone and everyone who wants to dive headfirst into the exciting world of deep learning with a multi-GPU Linux system. Whether you’re a seasoned AI pro or a curious newbie just starting out, we’ve got you covered.

And hey, don’t sweat it if you only have a single GPU right now. The steps outlined here work just as well for single-GPU setups. It’s like learning to drive a stick shift – once you’ve mastered it, driving an automatic is a piece of cake!

What are we Gonna Do?

Our mission, should you choose to accept it (and trust me, you do!), is to transform your Linux machine into a lean, mean, deep-learning machine. We’re talking about installing all the essential software, like CUDA Toolkit, PyTorch, and Miniconda. These are the tools that’ll let you play around with cool frameworks like exllamav2 and torchtune. Get ready to unleash the power of parallel processing and make those GPUs sing!

Before We Begin…

Now, before we embark on this epic journey, there are a couple of things you need to have in your arsenal:

  • A Linux machine, because, duh, we’re doing this the cool kid way. And of course, it needs to have one or more Nvidia GPUs ready to rock and roll.
  • A little bit of familiarity with basic Linux commands. Don’t worry, we’re not talking about hacking into the Matrix here, just some simple stuff. If you can open a terminal and navigate directories, you’re golden!

Let’s Get this Show on the Road: Setting Everything Up

Verifying GPU Installation

First things first, we gotta make sure those shiny GPUs are properly installed and recognized by your system. Think of it as a headcount before the big game.

Open up your terminal. That’s where the magic happens! Now, type in ‘nvidia-smi’ and hit enter. This command is like your GPU roll call. It’ll show you a list of all the installed GPUs, along with some juicy details about them.

If the command runs without a hitch and you see all your GPUs listed there, congrats! You’re one step closer to deep learning nirvana. But, if the command throws a tantrum or the list looks a little funky, you might need to install or reinstall your Nvidia drivers. Don’t worry, it happens to the best of us. Just head over to the Nvidia website and grab the latest drivers for your specific Linux distribution. It’s like giving your GPUs a fresh pair of running shoes.
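If you’d rather script that roll call than eyeball it, here’s a small Python sketch that wraps ‘nvidia-smi’ and degrades gracefully when the driver isn’t installed (the ‘check_gpus’ name is just illustrative, not a standard tool):

```python
import shutil
import subprocess

def check_gpus():
    """Return the list of GPU names reported by nvidia-smi, or [] if unavailable."""
    # If the nvidia-smi binary isn't on PATH, the driver likely isn't installed.
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

gpus = check_gpus()
print(f"Detected {len(gpus)} GPU(s): {gpus}")
```

The ‘--query-gpu=name --format=csv,noheader’ flags ask nvidia-smi for machine-readable output, one GPU name per line, which is much easier to parse than the default table.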

Installing CUDA Toolkit: The Secret Sauce

Now, let’s talk CUDA. This bad boy is a parallel computing platform and programming model developed by our friends at Nvidia. It’s the secret sauce that lets you harness the full power of your GPUs for deep learning tasks. Think of it as the interpreter between your GPU and the deep learning frameworks you’ll be using.

Checking for Existing CUDA Installation:

Before we go on a wild goose chase, let’s see if you already have CUDA installed. Open up your terminal and run ‘ls /usr/local/’ (or browse to that directory in your file manager, if that’s more your speed). Look for a folder named ‘cuda-xx’, where ‘xx’ represents the CUDA version. If you find it, congrats, you’re one step ahead! But hold your horses, we need to make sure it’s the right version.

To verify the installed CUDA version, open up your terminal again and run the command ‘nvcc --version’. This will show you the exact CUDA version you have. (If the command isn’t found even though the ‘cuda-xx’ folder exists, the CUDA ‘bin’ directory probably isn’t on your PATH yet; the “Adding CUDA to PATH” section below sorts that out.) If it matches the version required by your desired PyTorch version (we’ll get to that in a bit), you can skip ahead to the next section. If not, or if you didn’t find the CUDA folder at all, then let’s get you set up!
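If you’re scripting your setup, you can pull the version number straight out of that banner with a regex; this little helper (‘parse_nvcc_version’ is just an illustrative name) shows the idea:

```python
import re
import shutil
import subprocess

def parse_nvcc_version(banner):
    """Extract the 'release X.Y' number from `nvcc --version` output."""
    m = re.search(r"release (\d+\.\d+)", banner)
    return m.group(1) if m else None

# A typical line from the `nvcc --version` banner looks like this:
sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(parse_nvcc_version(sample))  # 12.1

# On a machine with nvcc installed, feed it the real output instead:
if shutil.which("nvcc"):
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    print(parse_nvcc_version(out.stdout))
```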

Identifying the Right CUDA Version:

Not all CUDA versions are created equal, my friend. Different deep learning frameworks, like PyTorch, have specific CUDA version requirements. It’s like trying to fit a square peg in a round hole – it just won’t work! So, before you download anything, head over to the PyTorch installation guide (https://pytorch.org/) and check which CUDA version plays nicely with your desired PyTorch version. Once you have that info, you’re ready to grab the right CUDA Toolkit.

Downloading CUDA Toolkit:

Alright, time to download some goodies! Head over to the NVIDIA Developer website (https://developer.nvidia.com/cuda-downloads) and find the CUDA Toolkit download section. Now, this is where it gets a little tricky. You need to select the right installer based on your operating system (OS) and the CUDA version you just noted. For example, if you’re running Ubuntu, you’d choose something like “deb (local)”. Don’t worry, they have clear instructions on the website, so just follow along carefully.

Installing CUDA Toolkit:

Alright, you’ve got the CUDA Toolkit installer file ready to go. Now, open up your trusty terminal and navigate to the directory where you saved that installer file. Once you’re in the right directory, you’ll need to execute the installation commands provided on the Nvidia website. These commands will vary depending on your chosen installer type, so pay close attention to those instructions. It’s like following a recipe – you don’t want to accidentally add salt instead of sugar!
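Just to give you a feel for it, the Ubuntu “deb (local)” flow usually looks roughly like the sketch below. The ‘<...>’ placeholders stand in for exact file names that NVIDIA’s instructions fill in for you, so copy the real commands from the download page rather than these:

```
sudo dpkg -i cuda-repo-<distro>-<version>-local_<version>_amd64.deb
sudo cp /var/cuda-repo-<distro>-<version>-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit
```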

Important Note: While the CUDA Toolkit is working its magic, you might get some prompts about updating your kernel. Resist the urge! Decline those kernel update prompts like you’d decline a telemarketing call. Why? Because updating your kernel during the CUDA installation can sometimes lead to driver conflicts, and nobody wants that headache. It’s like trying to change a tire while driving down the road – not a good idea!

Adding CUDA to PATH:

We’re almost there, I promise! Now we need to tell your system where to find the newly installed CUDA Toolkit. Think of it as giving your system a treasure map to locate the CUDA goldmine. To do this, we need to modify a file called ‘.bashrc’ in your home directory.

Open up the ‘.bashrc’ file using a text editor. You can use ‘nano’, ‘vim’, or any other editor you prefer. Now, add the following lines at the very end of the file, replacing ‘cuda-xx’ with the actual CUDA version you installed (e.g., cuda-12.1):


export PATH="/usr/local/cuda-xx/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-xx/lib64:$LD_LIBRARY_PATH"

Save the file and close your text editor. Now, close and reopen your terminal for the changes to take effect. This is like hitting the refresh button on your system.
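If you’d rather generate those two lines than type them by hand (handy if you script your machine setup), here’s a tiny illustrative helper; ‘cuda_exports’ is a made-up name for this guide, not a standard tool:

```python
def cuda_exports(version):
    """Build the two .bashrc export lines for a CUDA version string like '12.1'."""
    root = f"/usr/local/cuda-{version}"
    return [
        f'export PATH="{root}/bin:$PATH"',
        f'export LD_LIBRARY_PATH="{root}/lib64:$LD_LIBRARY_PATH"',
    ]

# Print the lines to append to ~/.bashrc for CUDA 12.1:
for line in cuda_exports("12.1"):
    print(line)
```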

Verifying CUDA Installation:

Alright, final check! To make sure CUDA is properly installed and ready to roll, type in ‘nvcc --version’ in your terminal and hit enter. If you see the correct CUDA version you installed, then give yourself a pat on the back – you’ve successfully installed CUDA Toolkit! If not, double-check the previous steps and make sure you didn’t miss anything. Sometimes it’s the little things that trip us up.

Installing Miniconda: The Python Powerhouse

Alright, now that we’ve got CUDA up and running, let’s bring in the Python powerhouse – Miniconda! Miniconda is like a streamlined version of Anaconda, a popular Python distribution specifically designed for data science and machine learning. It’s like having a Swiss Army knife for your Python needs!

Downloading Miniconda Installer:

First things first, let’s grab the Miniconda installer. Open up your terminal and execute the following commands one by one. Don’t worry, these commands won’t bite! They’ll create a directory for Miniconda and download the installer file. Think of it as setting the stage for the main act.


mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh

Installing Miniconda:

Now, let’s get this party started! With the installer file downloaded, it’s time to install Miniconda. Run the following command in your terminal. It’ll install Miniconda in the directory you created earlier. It’s like rolling out the red carpet for Miniconda to shine.


bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

Initializing Conda:

Almost there! Now we need to initialize Conda so that it plays nicely with your shell. Run the following commands to initialize Conda for both ‘bash’ and ‘zsh’ shells. Think of it as introducing Conda to your system’s inner circle.


~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Verifying Conda Installation:

Alright, final check! Close and reopen your terminal to let those initialization changes take effect. Now, type in ‘conda --version’ and hit enter. If you see the Conda version number, then congrats, you’ve successfully installed Miniconda! If not, double-check the previous steps and make sure everything went according to plan. Sometimes it’s the little things that trip us up.

Installing PyTorch: The Deep Learning Champion

With CUDA and Miniconda all set up, it’s time to bring in the big guns – PyTorch! This bad boy is a deep learning framework developed by Meta, and it’s like the LeBron James of deep learning frameworks. It’s super popular, incredibly versatile, and makes building and training deep learning models a breeze. Get ready to unleash your inner AI architect!

(Optional) Creating a Conda Environment:

Now, before we install PyTorch, let’s talk about Conda environments. These are like separate playgrounds for your different Python projects, each with its own set of packages and dependencies. It’s like having separate refrigerators for your leftovers and your fresh groceries – it just keeps things organized and prevents any nasty conflicts.

Creating a Conda environment for PyTorch is totally optional, but highly recommended, especially if you plan on working with multiple deep learning projects or different versions of PyTorch in the future. It’s like having a dedicated workspace for your deep learning shenanigans.

To create a Conda environment, open up your terminal and run the following command, replacing ‘<environment-name>’ with your desired environment name. Get creative! Maybe something like ‘pytorch-party’ or ‘deep-learning-den’.


conda create -n <environment-name> python=3.11

This command creates a new Conda environment with Python 3.11. Feel free to use a different Python version if you prefer. Once the environment is created, activate it using the following command:


conda activate <environment-name>

You should see your environment name in parentheses at the beginning of your terminal prompt, like this: ‘(pytorch-party) your-username@your-hostname:~$’. This means you’re now inside your shiny new Conda environment, ready to install PyTorch without messing up your other Python projects. It’s like stepping into a phone booth and transforming into a deep learning superhero!
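If you ever lose track of which environment you’re in, this quick snippet shows which Python you’re actually running (‘pytorch-party’ in the comment is just our example name from above):

```python
import os
import sys

# Inside an activated Conda environment, CONDA_DEFAULT_ENV holds the env name
# (e.g. pytorch-party) and sys.executable points at that environment's own
# Python binary rather than the system one.
print("env name:", os.environ.get("CONDA_DEFAULT_ENV"))
print("interpreter:", sys.executable)
```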

Installing PyTorch:

Alright, time to install the main attraction! The exact command for installing PyTorch depends on your specific CUDA version. Remember that compatibility chart we talked about earlier? This is where it comes in handy. Head over to the PyTorch installation guide (https://pytorch.org/) and look for the installation instructions for your CUDA version.

For example, if you have CUDA 12.1 installed, the command the site generates might look like this (depending on your setup, it may also tack on an extra ‘--index-url’ flag pointing at a CUDA-specific wheel index, so trust the selector on pytorch.org over this example):


pip3 install torch torchvision torchaudio

This command will install PyTorch, along with the ‘torchvision’ and ‘torchaudio’ packages, which are super useful for working with image and audio data, respectively. Think of them as PyTorch’s trusty sidekicks.

Verifying PyTorch Installation:

Once the installation is complete, let’s make sure PyTorch is properly installed and can see those awesome GPUs. Open up a Python interpreter by typing ‘python’ in your terminal and hitting enter.

Now, run the following commands one by one:


import torch
print(torch.cuda.device_count())

The first command imports the PyTorch library, and the second command checks how many GPUs PyTorch can detect. If the output matches the number of GPUs you saw earlier with ‘nvidia-smi’, then congrats, you’ve successfully installed PyTorch and it can see all your GPUs! Time to celebrate with a virtual high five! If not, double-check the previous steps and make sure everything is configured correctly. Sometimes it’s the little things that trip us up.
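Once the basic count looks right, you can go one step further and list each device by name and memory. This sketch runs safely even when no GPU is visible – the loop simply does nothing:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
count = torch.cuda.device_count()
print("GPU count:", count)
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    # total_memory is reported in bytes; convert to GiB for readability
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```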

Unleashing the Multi-GPU Beast: Time to Play!

Alright, you’ve made it! You’ve successfully set up your multi-GPU Linux machine for deep learning. You’ve got CUDA, Miniconda, and PyTorch all working in perfect harmony. Now it’s time to unleash the beast and dive into the exciting world of deep learning! Here are a few ideas to get you started:

Hugging Face Models: Your AI Playground

If you’re looking for pre-trained deep learning models to play with, Hugging Face (https://huggingface.co/) is your new best friend. It’s like the GitHub of AI models, with thousands of pre-trained models for everything from natural language processing to computer vision. You can easily explore these models, fine-tune them on your own data, and even contribute your own creations! It’s like having an AI playground at your fingertips.

exllamav2 (Inference): Making LLMs Run Like a Dream

Remember those memory-hogging LLMs we talked about earlier? Well, exllamav2 is here to save the day! This awesome framework is specifically designed for efficient LLM inference, meaning it can run those massive LLMs on your own hardware, even if you don’t have a server farm at your disposal. And the best part? It can leverage all those shiny GPUs you just set up to accelerate the inference process. It’s like giving your LLMs a performance boost!

torchtune (Fine-tuning): Training Models at Warp Speed

Fine-tuning pre-trained models on your own data is like teaching an old dog new tricks – it can be incredibly powerful! And with torchtune, you can accelerate this process by leveraging the power of multiple GPUs. torchtune is built on top of PyTorch and provides a bunch of tools and techniques for distributed training, making it super easy to train your models faster and more efficiently. It’s like having a personal trainer for your deep learning models!

Wrapping it Up: You’re a Deep Learning Rockstar!

Congratulations, my friend! You’ve reached the end of this epic journey. You’ve learned how to set up a multi-GPU Linux machine for deep learning, installed all the essential software, and even explored some cool frameworks for working with LLMs. You’re well on your way to becoming a deep learning rockstar!

Remember, this guide is just the beginning. The world of deep learning is constantly evolving, with new tools, techniques, and models emerging all the time. So keep experimenting, keep learning, and most importantly, have fun! After all, deep learning is all about pushing the boundaries of what’s possible with AI. Now go out there and build something amazing!