High Performance Computing with Slurm on AWS

Code / 12.01.2021

In this second part of our blog series about HPC, we test the performance of a Slurm cluster deployed on the AWS cloud platform and compare the setup and results with the GCP deployment from the previous post in the series.

In this blog post, we will describe the cluster setup on AWS, show the results of training the same machine learning model as in the previous blog post, and describe the cluster cleanup. To see the environment setup and a detailed description of the model, please check out the previous blog post. In short, the model classifies images of handwritten digits 0-9.

Setting Up a Slurm Cluster on AWS

We will use AWS ParallelCluster to set up an HPC cluster. We will do so by using the AWS CLI from a local Windows machine, i.e. from the Command Prompt. As AWS says, AWS ParallelCluster is an AWS-supported open-source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS. It provides all the resources needed for your HPC applications and supports a variety of job schedulers such as AWS Batch, SGE, Torque, and Slurm.


AWS provides a fairly automated setup, since all you have to do is install ParallelCluster locally as a Python package and then start the interactive cluster configuration, which will ask you a series of questions and set it all up for you. So, let's start:

1. Create a virtual environment and activate it:

> virtualenv hpc-venv
> hpc-venv\Scripts\activate.bat

2. Install AWS CLI:

(hpc-venv) > pip install --upgrade awscli
(hpc-venv) > aws configure

This step connects your local machine to your AWS account. For that purpose, you will have to set your access key, secret access key, region, and output format.
The access key and secret access key can be created in the AWS console → Services → Security, Identity, & Compliance → IAM → Access management → Users → choose your user → Security credentials → Create access key. Save the secret key locally because you will not be able to retrieve it again.
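
The configure command simply asks for those four values interactively. It looks roughly like this (the key values below are placeholders):

(hpc-venv) > aws configure
AWS Access Key ID [None]: YOUR-ACCESS-KEY-ID
AWS Secret Access Key [None]: YOUR-SECRET-ACCESS-KEY
Default region name [None]: eu-central-1
Default output format [None]: json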

3. Install ParallelCluster:

(hpc-venv) > pip install aws-parallelcluster

4. Start configuring the cluster to suit your needs with pcluster configure. It will walk you through the mentioned series of questions with possible options, and you just have to choose from them. It even offers you different HPC job managers/schedulers, which will then be installed and prepared for you without any effort on your part.
Here is some of the information that we filled in:

AWS Region ID: eu-central-1
Scheduler: slurm
Operating System: centos7
Minimum cluster size (instances): 0
Maximum cluster size (instances): 10
Master instance type: (we tested both m4.large and c4.2xlarge)
Compute instance type: (we tested both m4.large and c4.2xlarge)
Network Configuration: Master in a public subnet and compute fleet in a private subnet

It will also ask you for an EC2 key pair since it is needed to SSH into your master instance and run Slurm jobs. It can be created in the AWS console → Services → Compute → EC2 → Network & Security → Key Pairs → Create key pair. Save the key locally because you will not be able to retrieve it later.

Pay attention to the minimum cluster size. It determines how many compute instances stay alive at all times. This matters if you do not plan to keep the cluster alive for a longer period. For example, if you terminate your master and compute instances overnight and still see them running in the morning, a wrongly adjusted minimum cluster size parameter is the reason: ParallelCluster will keep re-allocating and then terminating them (roughly every 13 minutes). We kept our cluster alive only for a few days, so we set the minimum cluster size to 0. We also set the maintain_initial_size parameter to false in the config file.

Finally, the pcluster configure command created a config file in C:\Users\YOUR-USER-NAME\.parallelcluster\, and you can further edit it if you need to before the next step.
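
For reference, pcluster configure produces a plain INI-style config. A minimal sketch of roughly how ours looked is shown below; the key pair name and the VPC/subnet IDs are placeholders, and parameter names may differ in newer ParallelCluster versions:

[global]
cluster_template = default
sanity_check = true

[aws]
aws_region_name = eu-central-1

[cluster default]
key_name = YOUR-EC2-KEY-PAIR
scheduler = slurm
base_os = centos7
master_instance_type = m4.large
compute_instance_type = m4.large
initial_queue_size = 0
max_queue_size = 10
maintain_initial_size = false
vpc_settings = default

[vpc default]
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx
compute_subnet_id = subnet-yyyyyyyy
use_public_ips = false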

5. Create the cluster:

(hpc-venv) > pcluster create -c C:\Users\YOUR-USER-NAME\.parallelcluster\config slurmCPU-parallelcluster
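
Creating the cluster takes a while. If you want to follow the progress from the same prompt, pcluster also offers a status command (shown here with the same config file):

(hpc-venv) > pcluster status -c C:\Users\YOUR-USER-NAME\.parallelcluster\config slurmCPU-parallelcluster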

6. SSH into the master instance from the local CMD:

(hpc-venv) > ssh -i YOUR-EC2-KEY.pem centos@MASTER-INSTANCE-PUBLIC-IP

Note: You cannot SSH to compute nodes from your local machine if you have put them in a private subnet during pcluster configure. You can SSH to compute nodes only from your master instance.
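
Once you are connected to the master instance, Slurm is already set up, so the usual Slurm commands work. For illustration, train.sh below is a hypothetical batch script that launches the training code:

$ sinfo              # list partitions and the state of the compute nodes
$ sbatch train.sh    # submit the training job as a batch job
$ squeue             # check the job queue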

The steps above will create the following resources:

  • CloudFormation Stacks (for deploying the whole cluster with its dependent AWS resources)

  • EC2 Instances (master and compute instances)

  • EC2 Auto Scaling Group (for quick scaling of the nodes)

  • EC2 Launch Template (for specifying EC2 instance configuration, e.g. AMI, instance type)

  • EBS Volume (attached storage to EC2 instances)

  • DynamoDB table (for storing some metadata)

Tip: you can find the cluster's events/logs in EC2 → Auto Scaling Groups → Activity → Activity history.

Training a Distributed ML Model with Slurm

We are training our model on CPUs. As mentioned, the model and the environment are described in detail in the previous blog post. In short, it is a Keras machine learning model that classifies 60,000 MNIST images of handwritten digits, using a simple convolutional neural network. We are using TensorFlow's MultiWorkerMirroredStrategy for synchronous distributed training with data parallelism across multiple machines.
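
Below is a minimal sketch of what such a training script can look like, assuming a recent TensorFlow 2 version; the architecture and hyperparameters are illustrative, not the exact ones from our experiments, and each Slurm node is expected to export a suitable TF_CONFIG environment variable (worker addresses plus its own index) before the script starts:

import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG
# environment variable set on each node, e.g.:
# {"cluster": {"worker": ["node1:12345", "node2:12345"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Load and normalize the 60,000 MNIST training images.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., tf.newaxis] / 255.0

# Variables must be created inside the strategy scope so they are
# mirrored and kept in sync across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Every worker runs the same fit() call; the strategy splits each batch
# across the workers (data parallelism).
model.fit(x_train, y_train, epochs=10, batch_size=64)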

Two machine types that we tested are:

  • m4.large (general purpose) → because it is the most similar to GCP's n1-standard-2, which we used for testing on GCP in the previous blog post

  • c4.2xlarge (compute optimized) → recommended for HPC

Instance   | vCPU | Mem (GiB) | Dedicated EBS Bandwidth (Mbps) | Network Performance | Linux/UNIX Usage | Processor
m4.large   | 2    | 8         | 450                            | Moderate            | $0.12 per Hour   | 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell)
c4.2xlarge | 8    | 15        | 1,000                          | High                | $0.454 per Hour  | High frequency (2.9 GHz) Intel Xeon E5-2666 v3 (Haswell)

Results: Comparison of Training the Distributed Model on Different Numbers of Nodes

  • m4.large

Number of nodes | Code execution time | Loss   | Accuracy
1               | 381.350 seconds     | 0.0035 | 0.9993
2               | 750.046 seconds     | 0.0156 | 0.9958
3               | 666.984 seconds     | 0.0282 | 0.9914
4               | 563.375 seconds     | 0.0369 | 0.9890
6               | 418.380 seconds     | 0.0541 | 0.9833
8               | 335.298 seconds     | 0.0599 | 0.9823
10              | 272.900 seconds     | 0.0712 | 0.9789

  • c4.2xlarge

Number of nodes | Code execution time | Loss   | Accuracy
1               | 112.338 seconds     | 0.0029 | 0.9994
2               | 158.687 seconds     | 0.0139 | 0.9963
3               | 140.527 seconds     | 0.0291 | 0.9911
4               | 110.211 seconds     | 0.0383 | 0.9883
6               | 84.259 seconds      | 0.0545 | 0.9838
8               | 65.279 seconds      | 0.0761 | 0.9774
10              | 55.647 seconds      | 0.0710 | 0.9781

In the tables, we show the three factors that we use for comparison: time, loss, and accuracy. Loss and accuracy are the model's metrics. Loss is a number indicating the model's prediction error, and accuracy is the ratio of correctly classified images to the total number of images. Loss should be as close to zero as possible, and accuracy as close to one as possible.

When comparing GCP's and AWS' results on similar machines (n1-standard-2 on GCP, from the previous blog post, and m4.large on AWS), we notice that GCP has significantly better execution times. AWS starts off better with 1 node, but for all other node counts the time is almost three times better on GCP. The peak in execution time when training on two nodes remains on AWS as it was on GCP (only bigger). Regarding accuracy and loss, those are slightly better on AWS' m4.large.

When comparing AWS' m4.large and c4.2xlarge, c4.2xlarge is obviously a much better solution for high performance computing regarding the execution time, while loss and accuracy are almost the same.

Cluster Cleanup

Delete the cluster, along with the dependent resources created with it, from the virtual environment on your local machine (created at the beginning of this blog post):

(hpc-venv) > pcluster delete YOUR-CLUSTER-NAME

If you do not remember the cluster name, you can find it with:

(hpc-venv) > pcluster list

If you chose to create a VPC during pcluster configure, you should additionally delete the created VPC and its corresponding CloudFormation Stack.

Click here for part 1 in the series: High Performance Computing with Slurm on GCP
