Nvidia DGX-1

The Stanford Computer Vision Lab has added the Nvidia DGX-1 machine to their computer cluster. Currently, it is only accessible within the Stanford network. Please file a help request at http://support.cs.stanford.edu if you have any questions regarding the use of the machine.


Specification

Hostname     visionlab-dgx1.stanford.edu
CPU          2x 20-core Intel Xeon E5-2698 v4 @ 2.2 GHz
RAM          512 GB
GPU          8x Tesla P100
Networking   10GbE
Storage      4x 2TB SSD in RAID 0, plus NFS-shared storage

Getting started

  • Request access to visionlab-dgx1.stanford.edu by filing a support request at https://support.cs.stanford.edu. Please state which sponsoring faculty member you are working with.
  • SSH into visionlab-dgx1.stanford.edu from the campus network, or from off campus via the Stanford VPN service (a full, non-split tunnel is required). See the example below.
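
Once access is granted, connecting is a standard SSH login. A minimal example, assuming you log in with your SUNet ID (the username placeholder is illustrative):

#Connect from the campus network or over the (full-tunnel) Stanford VPN
ssh <sunetid>@visionlab-dgx1.stanford.edu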

Nvidia-Docker

Nvidia suggests using Nvidia-Docker and their provided containers for optimized performance and convenience. The following table lists the containers that are officially supported and available on the DGX as of this writing.

REPOSITORY                  TAG
nvidia/cuda                 latest
nvcr.io/nvidia/digits       17.04
nvcr.io/nvidia/caffe        17.04
nvcr.io/nvidia/tensorflow   17.04
nvcr.io/nvidia/pytorch      17.04
nvcr.io/nvidia/caffe2       17.04
nvcr.io/nvidia/theano       17.04
nvcr.io/nvidia/mxnet        17.04
nvcr.io/nvidia/cntk         17.04
nvcr.io/nvidia/torch        17.04

#Check the currently loaded images. The NVIDIA containers should already be present. Note the TAG column; you'll need it when running docker commands
docker images

REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
nvidia/cuda                 latest              569f547756e0        8 days ago          1.671 GB
nvcr.io/nvidia/digits       17.04               3736f3fe071f        4 weeks ago         4.171 GB
nvcr.io/nvidia/caffe        17.04               87c288427f2d        4 weeks ago         2.794 GB
nvcr.io/nvidia/tensorflow   17.04               121558cb5849        6 weeks ago         3.028 GB
nvcr.io/nvidia/pytorch      17.04               2f0834174e65        6 weeks ago         3.793 GB
nvcr.io/nvidia/caffe2       17.04               e5b67a4f6726        6 weeks ago         2.633 GB
nvcr.io/nvidia/theano       17.04               24943feafc9b        6 weeks ago         2.386 GB
nvcr.io/nvidia/mxnet        17.04               24afec0cd359        7 weeks ago         2.338 GB
nvcr.io/nvidia/cntk         17.04               61e61de9fa43        7 weeks ago         5.741 GB
nvcr.io/nvidia/torch        17.04               a337ffb42c8e        7 weeks ago         2.9 GB
nvidia/cuda                 7.5                 cf43500d0050        5 months ago        1.232 GB

#If needed, you can load your own container image from a tar archive
docker load --input /raid/scratch/u/<framework>.tar
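
If you built the image on another machine, docker save produces the tar archive that docker load expects. A sketch; the image and tag names are placeholders:

#On the machine where the image was built
docker save --output <framework>.tar <your-image>:<your-tag>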

#Test GPU visibility by running nvidia-smi inside a container
nvidia-docker run --rm nvidia/cuda nvidia-smi

jimmyw@visionlab-dgx1:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
Wed May 24 23:59:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0000:06:00.0     Off |                    0 |
| N/A   35C    P0    31W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0000:07:00.0     Off |                    0 |
| N/A   37C    P0    34W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 0000:0A:00.0     Off |                    0 |
| N/A   36C    P0    35W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 0000:0B:00.0     Off |                    0 |
| N/A   37C    P0    33W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 0000:85:00.0     Off |                    0 |
| N/A   38C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 0000:86:00.0     Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 0000:89:00.0     Off |                    0 |
| N/A   37C    P0    33W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 0000:8A:00.0     Off |                    0 |
| N/A   38C    P0    32W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
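
All eight GPUs are visible to a container by default. On a shared machine you will usually want to restrict a job to specific GPUs; nvidia-docker supports this via the NV_GPU environment variable (the indices below are just an example):

#Expose only GPUs 0 and 1 to the container
NV_GPU=0,1 nvidia-docker run --rm nvidia/cuda nvidia-smi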

#Launch a framework container in interactive mode. Note the tag: the image reference is always REPOSITORY:TAG when the tag isn't "latest"
nvidia-docker run --rm -ti nvcr.io/nvidia/torch:17.04

jimmyw@visionlab-dgx1:~$ nvidia-docker run --rm -ti nvcr.io/nvidia/torch:17.04
  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

NVIDIA Release 17.04 (build 17724)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2016, Soumith Chintala, Ronan Collobert, Koray Kavukcuoglu, Clement Farabet
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

root@0a24ff58cde5:/workspace# th

  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |  Type ? for help
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

th>
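
Containers started with --rm are discarded on exit, so mount a host directory if your data or results need to persist. A sketch, assuming your scratch space lives under /raid/scratch/u/ as in the docker load example above (the directory name is a placeholder):

#Mount host scratch space into the container at /workspace/data
nvidia-docker run --rm -ti -v /raid/scratch/u/<yourdir>:/workspace/data nvcr.io/nvidia/torch:17.04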

#If the container needs direct access to the host's network, just add --net=host
nvidia-docker run --rm --net=host -ti nvcr.io/nvidia/tensorflow:17.04

jimmyw@visionlab-dgx1:~$ nvidia-docker run --rm --net=host -ti nvcr.io/nvidia/tensorflow:17.04

================
== TensorFlow ==
================

NVIDIA Release 17.04 (build 21630)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

root@visionlab-dgx1:/workspace# apt-get update
Get:1 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:2 http://ppa.launchpad.net/openjdk-r/ppa/ubuntu xenial InRelease [17.5 kB]
Get:3 http://ppa.launchpad.net/openjdk-r/ppa/ubuntu xenial/main amd64 Packages [7096 B]
Get:4 http://archive.ubuntu.com/ubuntu xenial-updates InRelease [102 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-security InRelease [102 kB]
Get:6 http://archive.ubuntu.com/ubuntu xenial/main Sources [1103 kB]
Get:7 http://archive.ubuntu.com/ubuntu xenial/restricted Sources [5179 B]
Get:8 http://archive.ubuntu.com/ubuntu xenial/universe Sources [9802 kB]
Get:9 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages [1558 kB]
Get:10 http://archive.ubuntu.com/ubuntu xenial/restricted amd64 Packages [14.1 kB]
Get:11 http://archive.ubuntu.com/ubuntu xenial/universe amd64 Packages [9827 kB]
Get:12 http://archive.ubuntu.com/ubuntu xenial-updates/main Sources [315 kB]
Get:13 http://archive.ubuntu.com/ubuntu xenial-updates/restricted Sources [3202 B]
Get:14 http://archive.ubuntu.com/ubuntu xenial-updates/universe Sources [193 kB]
Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [693 kB]
Get:16 http://archive.ubuntu.com/ubuntu xenial-updates/restricted amd64 Packages [13.2 kB]
Get:17 http://archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [593 kB]
Get:18 http://archive.ubuntu.com/ubuntu xenial-security/main Sources [86.8 kB]
Get:19 http://archive.ubuntu.com/ubuntu xenial-security/restricted Sources [2779 B]
Get:20 http://archive.ubuntu.com/ubuntu xenial-security/universe Sources [31.7 kB]
Get:21 http://archive.ubuntu.com/ubuntu xenial-security/main amd64 Packages [334 kB]
Get:22 http://archive.ubuntu.com/ubuntu xenial-security/restricted amd64 Packages [12.8 kB]
Get:23 http://archive.ubuntu.com/ubuntu xenial-security/universe amd64 Packages [142 kB]
Fetched 25.2 MB in 3s (6808 kB/s)
Reading package lists... Done
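
Note the SHMEM warning in the TensorFlow banner above: the default 64MB shared-memory allocation may be insufficient. For real training runs, relaunch with NVIDIA's recommended flags:

#Launch TensorFlow with a larger shared-memory segment and raised memory limits
nvidia-docker run --rm -ti --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:17.04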