High Performance Computing on GPUs (GCP vs AWS)

Mia Tadic, Darija Strmecki


In this third part of the HPC blog series, we train the same machine learning model as in the previous two blog posts (click HERE for part 1 and HERE for part 2), but this time on GPUs instead of CPUs.

We expect a speedup, since the highly parallel structure of GPUs makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel. In this blog post, we first describe how to set up a cluster with GPUs on both GCP and AWS. Then we describe how to set up an appropriate environment for GPUs and what adjustments the model from the previous blog posts needs so that it runs on GPUs rather than on CPUs. We then show an example sbatch script and what the output should look like, and, finally, we show and compare the results on both cloud platforms. Here is the overview:

  • Cluster Setup on GCP
    • GPU Quota Over Regions
    • GPU Quota Over Machine Types
    • Setting Up a Slurm Cluster with GPUs on GCP
  • Cluster Setup on AWS
  • Conda Environment
    • Install Anaconda
    • Create Environment
  • Machine Learning Model
  • Start the Training
  • Output
  • Results: Comparison of Training Distributed Model on Different Numbers of Nodes
    • GCP
    • AWS
  • Conclusion

Cluster Setup on GCP

You must check how many GPUs are supported in your region and on different machine types. We cover both in the following two sections.

GPU Quota Over Regions

Compute Engine enforces quotas on resource usage for a variety of reasons. Not all regions and not all projects have the same quotas. Check the GPUs on Compute Engine page to decide which region and GPU types suit your situation. Additionally, check the quotas page to ensure that you have enough GPUs available in your project, and request a quota increase if you do not. (For more details: Resource quotas)

If you haven’t set a proper number of GPUs in your YAML file, your deployment will still succeed. But a problem arises when you run a job – your job will jump between pending and configuring states:

$ sinfo 
debug* up infinite 10 CF g1-compute-0-[0-9] 
$ sinfo 
debug* up infinite 10 PD BeginTime

Pending state with the reason BeginTime usually suggests that there are no available resources (GPU, CPU, …).

Another sign that your number of GPUs is not properly set is if this Google notification appears:

Start VM instance “slurm1-compute-0-image”

Just now

Quota ‘NVIDIA_V100_GPUS’ exceeded. Limit: 1.0 in region europe-west4
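
A quick arithmetic check before deploying can catch this situation. The sketch below is only an illustration: the function names are hypothetical, and the limit has to be read from your own quotas page. It compares the number of GPUs a fully scaled-out partition would request with the regional limit.

```python
def gpus_requested(max_node_count: int, gpu_count: int) -> int:
    """Total GPUs the partition asks for when every compute node is up."""
    return max_node_count * gpu_count


def fits_quota(max_node_count: int, gpu_count: int, quota_limit: float) -> bool:
    """True if a fully scaled-out partition stays within the regional GPU quota."""
    return gpus_requested(max_node_count, gpu_count) <= quota_limit


# Values taken from the messages above: the sinfo output lists 10 compute
# nodes and the notification reports a limit of 1.0 for NVIDIA_V100_GPUS.
# One GPU per node is an assumption made for this example.
print(fits_quota(max_node_count=10, gpu_count=1, quota_limit=1.0))
```

With these numbers the check fails, which matches the pending jobs and the quota notification described above.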

GPU Quota Over Machine Types

You can attach GPUs only to general-purpose N1 machine types. GPUs are not supported by other machine types. (For more details: GPUs and machine types).

Setting Up a Slurm Cluster with GPUs on GCP

Cluster setup is described in the first part of the blog series about HPC. The only difference is in the GPU configuration fields in the slurm-cluster.yaml file. Here is the YAML file that we used:

# [START cluster_yaml]
imports:
- path: slurm.jinja

resources:
- name: slurm-cluster
  type: slurm.jinja
  properties:
    cluster_name               : slurm8
    zone                       : europe-west4-a
    controller_machine_type    : n1-standard-2
    login_machine_type         : n1-standard-2
    compute_image_machine_type : n1-standard-2
    default_users              :
    partitions :
      - name           : debug
        machine_type   : n1-standard-2
        max_node_count : 8
        zone           : europe-west4-a

        # Optional GPU configuration fields
        gpu_type       : nvidia-tesla-v100
        gpu_count      : 1

# [END cluster_yaml]

We specified the NVIDIA Tesla V100 GPU type and 1 GPU per compute node. Be aware that GPUs are attached only to the compute nodes; the login and controller instances have none.

Cluster Setup on AWS

In our experience, AWS did not impose a GPU quota the way GCP does. Another difference is that you cannot attach a GPU to an arbitrary machine: AWS predefines which instance types include GPUs and which do not, so you just choose an instance type with the GPU type and count you want. The available instances are described here.

Cluster setup is the same as in the second part of the blog series about HPC; the only difference is the machine type of the compute instances. We chose p3.2xlarge, which has 1 NVIDIA Tesla V100.

Conda Environment

Now that our clusters are ready and steady, we can set up the environment for the Keras machine learning model.

Install Anaconda

Before downloading the Anaconda installer script, visit the Anaconda Downloads page and check if there is a new version of Anaconda for Python 3 available for download.

1. SSH to the login instance and perform all of the following steps from there.

2. Create a tmp directory, navigate to it, and download the Anaconda installation script using the link that you copied from the Downloads page:

$ mkdir tmp
$ cd tmp/
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh

3. Verify the Data Integrity of the Script:

$ sha256sum Anaconda3-5.3.1-Linux-x86_64.sh
d4c4256a8f46173b675dd6a62d12f566ed3487f932bab6bb7058f06c124bcc27 Anaconda3-5.3.1-Linux-x86_64.sh

Make sure the hash printed by the command above matches the one listed on the Anaconda with Python 3 on 64-bit Linux page for your Anaconda version.
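
The same comparison can be scripted instead of being checked by eye. This is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and the published hash still has to be copied from the Anaconda page by hand.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large installer is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case and whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

For example, calling matches_published_hash("Anaconda3-5.3.1-Linux-x86_64.sh", "<hash from the Anaconda page>") returns True only if the download is intact.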

4. Run the Anaconda Installation Script:

$ bash Anaconda3-5.3.1-Linux-x86_64.sh

You will enter the Anaconda installation process with several requests for approval and continuation. It’s okay to go along with all of the default settings. If the installer asks for the installation of Visual Studio Code, choose ‘no’ since it is not needed here.

5. Activate the Anaconda installation by loading the new PATH environment variable which was added by the Anaconda installer into the current shell session:

$ source ~/.bashrc

If the environment variable is not added, execute the following command:

source /home/<your_username>/anaconda3/etc/profile.d/conda.sh

To check, try to execute the following command:

conda --version

Create Environment

To run Tensorflow on GPU, CUDA and cuDNN are required.

· CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications.

· The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.

A fast and easy way to install them is to create a conda environment and install the tensorflow-gpu package, which automatically installs all of its dependencies, including CUDA and cuDNN. The disadvantage of this approach is that you cannot choose the versions of CUDA and cuDNN; you have to go with the versions that Anaconda provides. These versions were good enough for us.

Since Tensorflow 2.x.x is quite new in the market, it is essential that all of its dependencies’ versions are compatible. To know which version of tensorflow-gpu to install, find compatible versions of Tensorflow, CUDA, cuDNN, and Python: https://www.tensorflow.org/install/source#linux.

We got our model to work with Tensorflow 2.0.0 and its dependencies are:

Version            Python version   cuDNN   CUDA
tensorflow-2.0.0   2.7, 3.3-3.7     7.4     10.0

Now that we know the required versions, we can create the conda environment and automatically install tensorflow-gpu version 2.0.0 and Python version 3.7.4, since our code works specifically with these versions (at the time of writing). We also install the latest version of tensorflow-datasets (version 3.0.0 at the time of writing):

$ conda create -n tf-env tensorflow-gpu=2.0.0 tensorflow-datasets python=3.7.4

Check whether the installed versions of CUDA and cuDNN match the required ones, using the conda list command:

Name          Version
cudatoolkit   10.0.130
cudnn         7.6.5

The cuDNN version does not match exactly, which causes TensorFlow to fail to detect the GPU. There is a workaround for this problem: the first code snippet in the Machine Learning Model section below.
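
The version comparison can also be automated. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the parser only assumes that each package line of the conda list output starts with the name followed by the version, and the expected versions are the ones from the compatibility table above.

```python
def installed_versions(conda_list_text: str) -> dict:
    """Map package name -> version from the text printed by `conda list`."""
    versions = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        versions[fields[0]] = fields[1]
    return versions


def mismatches(installed: dict, expected: dict) -> dict:
    """Return packages whose installed version is outside the expected major.minor."""
    return {
        name: installed.get(name)
        for name, prefix in expected.items()
        if not (installed.get(name) or "").startswith(prefix + ".")
        and installed.get(name) != prefix
    }


sample = """\
# Name                    Version
cudatoolkit               10.0.130
cudnn                     7.6.5
"""
expected = {"cudatoolkit": "10.0", "cudnn": "7.4"}
print(mismatches(installed_versions(sample), expected))  # {'cudnn': '7.6.5'}
```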

To activate the environment, use:

$ conda activate tf-env

Machine Learning Model

As already mentioned, the model is the same as the one we ran on CPUs in the first blog post: we run distributed training with TensorFlow, using MultiWorkerMirroredStrategy with a Keras model that classifies the MNIST dataset, which comprises 60,000 training images of the handwritten digits 0–9, formatted as 28×28-pixel monochrome images. The code is pretty much the same, with two new additions, which are marked with “New for GPUs!”.

We introduce the most relevant parts of the code.

  • New for GPUs! Tensorflow 2.0.0 is slightly incompatible with Anaconda’s CUDA and cuDNN versions. We have to force Tensorflow to detect GPU:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
  • Set up the TF_CONFIG environment variable:
import json
import os

# Get nodelist and current nodename from Slurm
nodelist = os.getenv("SLURM_JOB_NODELIST")
nodename = os.getenv("SLURMD_NODENAME")
port_number = 22222

# ... extract the information from the environment variables above and format it
# for TF_CONFIG as `cluster_dict` and `task_index` (omitted here) ...

# Set environment variable
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster_dict,
    'task': {'type': 'worker', 'index': task_index}})

# E.g. cluster_dict = {'worker': ["compute-0-0:22222", "compute-0-1:22222", "compute-0-2:22222"]}
# E.g. task_index = 0, if $SLURMD_NODENAME is the first instance in the above list
  • Prepare the dataset, define the strategy, and build the model:
# Preparing dataset (60,000 examples)
def mnist_dataset(batch_size):
  (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
  # The `x` arrays are in uint8 and have values in the range [0, 255].
  # We need to convert them to float32 with values in the range [0, 1]
  x_train = x_train / np.float32(255)
  y_train = y_train.astype(np.int64)
  train_dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
  return train_dataset

# Choose the right strategy
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

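# Build the Keras model (build and compile a simple convolutional neural network to train with our MNIST dataset).
# Assumption: the Flatten layer, the output layer, and compile() follow the TensorFlow multi-worker Keras tutorial.
def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), metrics=['accuracy'])
    return model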
  • New for GPUs! Print all visible devices; a GPU must be listed:
from tensorflow.python.client import device_lib

def get_available_devices():
   local_device_protos = device_lib.list_local_devices()
   return [x.name for x in local_device_protos]

print(get_available_devices())
# Output: ['/device:CPU:0', '/device:XLA_CPU:0', '/device:GPU:0', '/device:XLA_GPU:0']
  • Train the model with MultiWorkerMirroredStrategy:
num_workers = int(os.getenv("SLURM_JOB_NUM_NODES"))

# `per_worker_batch_size` must be defined earlier in the script (its value is not shown here).
# The batch size scales up with the number of workers, since `tf.data.Dataset.batch` expects the global batch size.
# Every node gets one batch of size `per_worker_batch_size`.
global_batch_size = per_worker_batch_size * num_workers
multi_worker_dataset = mnist_dataset(global_batch_size)

with strategy.scope():
  # Model building/compiling need to be within `strategy.scope()`.
  multi_worker_model = build_and_compile_cnn_model()

# Keras' `model.fit()` trains the model with specified number of epochs and number of steps per epoch.
steps = 60000 // global_batch_size # there are 60,000 examples in the dataset
history = multi_worker_model.fit(multi_worker_dataset, epochs=8, steps_per_epoch=steps)
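
The extraction step that is omitted from the TF_CONFIG snippet above could look roughly like the following. This is a sketch, not the code we ran: the function names are hypothetical, and it assumes that scontrol show hostnames is available on the compute nodes to expand the Slurm nodelist.

```python
import json
import subprocess


def expand_nodelist(nodelist: str) -> list:
    """Expand a Slurm nodelist such as 'compute-0-[0-2]' into hostnames.

    Relies on `scontrol show hostnames`, so it only works on a Slurm node.
    """
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout.split()


def build_tf_config(hostnames: list, nodename: str, port: int) -> str:
    """Return the TF_CONFIG JSON string for the worker running on `nodename`."""
    cluster_dict = {"worker": ["{}:{}".format(host, port) for host in hostnames]}
    # The worker index is the position of this node in the expanded list.
    task_index = hostnames.index(nodename)
    return json.dumps({
        "cluster": cluster_dict,
        "task": {"type": "worker", "index": task_index},
    })


# Example with a hand-written hostname list, so it also runs outside Slurm.
print(build_tf_config(["compute-0-0", "compute-0-1", "compute-0-2"], "compute-0-1", 22222))
```

On a compute node, the result of build_tf_config(expand_nodelist(nodelist), nodename, port_number) would be assigned to os.environ['TF_CONFIG'] before the strategy is created.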

Start the Training

To run an HPC job (which is distributed training in our case), run the sbatch script using sbatch (SCRIPT NAME). The script we used is as follows:

#!/bin/bash

#SBATCH --partition=debug # "debug" on GCP, "compute" on AWS
#SBATCH --job-name=kerasGPU-job
#SBATCH --gres=gpu:1
#SBATCH --nodes=8
#SBATCH --output=out-kerasGPU.txt # send stdout to out-kerasGPU.txt

echo ". $HOME/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc
conda activate tf-env
srun --gres=gpu:1 --nodes=8 python keras.py

The echo and source commands are needed because Slurm has difficulty activating the conda environment on compute instances, even though conda is installed in the /home directory. The conda activate command then activates the environment, and srun starts the parallel HPC job.

An important part of the sbatch script, which distinguishes a GPU job from a CPU job, is the parameter gres=gpu:1. It requests 1 GPU per node.
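
Since the results below repeat the same job with different node and GPU counts, it can be convenient to generate the script instead of editing it by hand. The helper below is a hypothetical sketch; it only uses the sbatch options and commands that appear in the script above.

```python
def render_sbatch_script(partition: str, nodes: int, gpus_per_node: int = 1) -> str:
    """Return the text of an sbatch script for the given node and GPU counts."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --partition={}".format(partition),
        "#SBATCH --job-name=kerasGPU-job",
        "#SBATCH --gres=gpu:{}".format(gpus_per_node),
        "#SBATCH --nodes={}".format(nodes),
        "#SBATCH --output=out-kerasGPU.txt",
        "",
        'echo ". $HOME/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc',
        "source ~/.bashrc",
        "conda activate tf-env",
        "srun --gres=gpu:{} --nodes={} python keras.py".format(gpus_per_node, nodes),
    ]
    return "\n".join(lines) + "\n"


# Example: a 4-node job with 1 GPU per node on the GCP partition.
print(render_sbatch_script("debug", nodes=4))
```

The returned text can be written to a file and submitted with sbatch as described above.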


Output

When TensorFlow runs code on a GPU, your output file (out-kerasGPU.txt, as set by the --output parameter in the sbatch script) will contain a number of messages similar to these:

2020-05-11 17:11:39.628185: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-11 17:11:40.662700: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-11 17:11:40.663344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:04.0
2020-05-11 17:11:42.295762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-05-11 17:11:42.402748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 15052 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0)
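
To confirm from the output file that a GPU was actually used, you can search it for the "Created TensorFlow device" lines. The following sketch is one way to do that; the regular expression is written against the log format shown above and may need adjusting for other TensorFlow versions.

```python
import re

# Matches the "Created TensorFlow device (...)" messages shown above.
DEVICE_LINE = re.compile(
    r"Created TensorFlow device \((?P<device>\S+) with (?P<memory_mb>\d+) MB memory\)"
)


def gpu_devices_in_log(log_text: str) -> list:
    """Return (device, memory_mb) pairs for every GPU device TensorFlow created."""
    return [
        (match.group("device"), int(match.group("memory_mb")))
        for match in DEVICE_LINE.finditer(log_text)
        if "device:GPU" in match.group("device")
    ]


sample = (
    "2020-05-11 17:11:42.402748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] "
    "Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 15052 MB memory) "
    "-> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, "
    "compute capability: 7.0)"
)
print(gpu_devices_in_log(sample))
```

An empty list for a worker means that TensorFlow did not create a GPU device on that node.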

Results: Comparison of Training Distributed Model on Different Numbers of Nodes


GCP

Note that we had only 8 GPUs available on GCP because of the limits described in the GPU Quota Over Regions and GPU Quota Over Machine Types sections.

Number of nodes (1 GPU per node)   Code execution time   Loss     Accuracy
1                                  23.978 seconds        0.0047   0.9988
2                                  159.896 seconds       0.0093   0.9977
3                                  119.068 seconds       0.0254   0.9926
4                                  94.293 seconds        0.0376   0.9886
6                                  83.217 seconds        0.0530   0.9838
8                                  75.389 seconds        0.0596   0.9815

We were also curious about what happens when we increase the number of GPUs per node:

Number of nodes (2 GPUs per node)   Code execution time   Loss     Accuracy
1                                   51.921 seconds        0.0035   0.9992
2                                   206.433 seconds       0.0152   0.9959
3                                   165.023 seconds       0.0221   0.9938
4                                   147.129 seconds       0.0312   0.9906

Number of nodes (4 GPUs per node)   Code execution time   Loss     Accuracy
1                                   128.877 seconds       0.0028   0.9994
2                                   278.956 seconds       0.0145   0.9960

Note: nodes take 5 minutes (on average) to allocate.

In the tables, we show three factors that we use for comparison: time, loss, and accuracy. Loss and accuracy are the model’s metrics. Loss is a number indicating the model’s error in prediction. Accuracy is the ratio of correctly classified images to the whole set of images. Ideally, loss is as close to zero as possible and accuracy is as close to one as possible.

Looking at the execution time, fewer GPUs per node is the much better option for us: the more GPUs per node, the longer the code takes to finish.

Looking at the loss and accuracy metrics, however, the situation is mostly reversed: in most rows, more GPUs per node give better metrics, with lower loss and higher accuracy.

We can also notice that, once training is distributed over two or more nodes, adding nodes saves time (in the first two tables), while loss and accuracy get worse as nodes are added (in all three tables). Note that the single-node run is still the fastest in every table.

The more we distribute the training (across machines, GPUs, …), the more time we save, but the more we lose on model precision. Whether time or precision is more important is a decision that depends on your situation.
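
A possible contributing factor to the loss of precision, not verified here, follows from the training code: the global batch size grows with the number of workers while the number of epochs stays at 8, so each run performs fewer weight updates. The sketch below reproduces that arithmetic. The per-worker batch size of 64 is an assumption, since its value is not stated in this post.

```python
DATASET_SIZE = 60000          # MNIST training examples
EPOCHS = 8                    # as passed to model.fit()
PER_WORKER_BATCH_SIZE = 64    # assumption: the value is not given in this post


def weight_updates(num_workers: int) -> int:
    """Total optimizer steps in one run, following the training snippet."""
    global_batch_size = PER_WORKER_BATCH_SIZE * num_workers
    steps_per_epoch = DATASET_SIZE // global_batch_size
    return EPOCHS * steps_per_epoch


for nodes in (1, 2, 3, 4, 6, 8):
    print(nodes, weight_updates(nodes))
```

Under this assumption, a single node performs 7496 weight updates and eight nodes perform 936, so the runs on more nodes have fewer chances to reduce the loss within the same number of epochs.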


AWS

Number of nodes (1 GPU per node)   Code execution time   Loss     Accuracy
1                                  27.273 seconds        0.0024   0.9995
2                                  131.852 seconds       0.0127   0.9966
3                                  112.568 seconds       0.0127   0.9966
4                                  94.664 seconds        0.0392   0.9881
6                                  66.271 seconds        0.0520   0.9844
8                                  68.576 seconds        0.0640   0.9807
10                                 64.285 seconds        0.0732   0.9782
20                                 60.900 seconds        0.1089   0.9676

Note: nodes take approx. 4 minutes to allocate.

Since we had more GPUs on AWS than on GCP, we could see that the execution time starts to converge at some point and cannot be reduced much further. This is probably due to the model’s simplicity, so the speedup peaks quite early.
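
The convergence can be made concrete by computing how much time each additional step in node count saves. The sketch below uses the multi-node execution times from the AWS table above; the single-node run is left out because it follows a different pattern. The function name is hypothetical.

```python
# Execution times in seconds from the AWS table (1 GPU per node, 2 or more nodes).
aws_times = {2: 131.852, 3: 112.568, 4: 94.664, 6: 66.271,
             8: 68.576, 10: 64.285, 20: 60.900}


def relative_gains(times: dict) -> list:
    """Return (from_nodes, to_nodes, fraction_of_time_saved) for consecutive rows."""
    nodes = sorted(times)
    return [
        (a, b, (times[a] - times[b]) / times[a])
        for a, b in zip(nodes, nodes[1:])
    ]


for a, b, gain in relative_gains(aws_times):
    print("{} -> {} nodes: {:+.1%}".format(a, b, gain))
```

The early steps each save a double-digit percentage of the time, while doubling from 10 to 20 nodes saves only about 5%.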

We see that as the number of nodes increases, the execution time decreases, but the loss and accuracy move further from their optimal values.

When comparing AWS’s 1 GPU per node to GCP’s 1 GPU per node, AWS’s execution time is generally better. GCP is faster with 1 node, but AWS wins in most runs with a greater number of nodes. Additionally, loss and accuracy are slightly worse on AWS.

Here we can again notice that the more nodes we use, the more time we save, but the more we lose on precision.


Conclusion

The clear pattern in our results is a trade-off between training time and the model’s precision (loss, accuracy): if you want to save some time, you slightly lose on the model’s precision, and vice versa.

Some situations benefit greatly from the time gain and can tolerate a slight loss of precision, but in other situations the model’s precision in prediction could be extremely important.

Regarding the comparison between GCP and AWS, AWS was slightly better in the time aspect, but GCP was slightly better in the model’s precision (loss, accuracy).

Click here for part 1: High Performance Computing with Slurm on GCP

Click here for part 2: High Performance Computing with Slurm on AWS