You should be aware of a number of global defaults:
Memory per job = 4 GB; users can request more via --mem
Cores per job = 2; users can request more via --cpus-per-task
Walltime = varies by partition; check https://sc.stanford.edu/ (Partition). Most are 7 days; users can request up to 21 days via --time
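As a sketch, overriding all three defaults on a single submission might look like the following (the partition name and script are placeholders; substitute your own):

```shell
# Request 32 GB of memory, 8 cores, and a 14-day walltime.
# "mypartition" and "myjob.sh" are placeholders for your
# group's partition and your own batch script.
sbatch --mem=32G --cpus-per-task=8 --time=14-00:00:00 -p mypartition myjob.sh
```

The same flags work with srun for interactive jobs.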
Many users wrap their interactive jobs in screen or tmux so they can detach and re-attach later. While this works, be aware that if there is any network interruption between the headnode (sc) and the compute nodes (and these do happen occasionally), Slurm will automatically cancel such jobs. Jobs submitted via sbatch, on the other hand, can survive these kinds of interruptions.
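For long-running work, a minimal sbatch script is more robust than a wrapped interactive session. A sketch (partition name, GPU request, and the final command are all placeholders for your own setup):

```shell
#!/bin/bash
#SBATCH -p mypartition           # placeholder: your group's partition
#SBATCH --gres=gpu:1             # omit if you do not need a GPU
#SBATCH --time=2-00:00:00        # 2-day walltime
#SBATCH --output=myjob-%j.log    # %j expands to the Slurm job ID

# Slurm runs this on the compute node; it keeps running even if
# your connection to the headnode drops.
python train.py
```

Submit it with `sbatch myjob.sh` and check progress in the log file.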
Virtual Environment for Python:
Almost all users rely on some kind of Python virtual environment (virtualenv, Anaconda, Miniconda, etc.). We install a small number of default Python packages to get things going, but you are responsible for creating and maintaining your own environment.
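If you use the standard-library venv module, setup might look like this (the environment path is illustrative, not a cluster convention):

```shell
# Create a personal virtual environment under $HOME
# (the path "envs/myproject" is just an example).
python3 -m venv "$HOME/envs/myproject"

# Activate it for the current shell session.
source "$HOME/envs/myproject/bin/activate"

# pip installs now go into the environment, e.g.:
#   pip install numpy scipy
```

Remember to activate the environment inside your batch scripts as well, since sbatch jobs start in a fresh shell.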
At the moment, CUDA 9.0 is the default across the cluster, but each group (partition) can have its own default. Contact us if you think your group is ready for a newer version of CUDA; multiple CUDA versions can co-exist. Note that this often requires a GPU driver update, which in turn requires rebooting all of the nodes.
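To see what a given compute node currently provides, you can check from an interactive session on that node (assuming the CUDA toolkit is on your PATH):

```shell
# Toolkit version, as seen by the CUDA compiler.
nvcc --version

# Driver version and the GPUs visible to your job.
nvidia-smi
```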
Do not run IPython/Jupyter notebooks on the headnode (they can cause memory spikes). Instead, run them on a compute node:
srun -p mypartition --pty bash (add --gres=gpu:1 if you need GPU)
export XDG_RUNTIME_DIR="" (important)
jupyter-notebook --no-browser --port=8880 --ip='0.0.0.0'
Follow the resulting URL in your browser to open your notebook.
Extra credit: if you do this often, you can easily convert the steps above into a script and use sbatch to run it in batch mode.
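Such a wrapper script might look like the sketch below (the partition name, port, and log filename are placeholders; the commands themselves are the ones shown above):

```shell
#!/bin/bash
#SBATCH -p mypartition             # placeholder: your group's partition
#SBATCH --gres=gpu:1               # omit if you do not need a GPU
#SBATCH --output=jupyter-%j.log    # the notebook URL/token lands here

export XDG_RUNTIME_DIR=""          # important, as noted above
jupyter-notebook --no-browser --port=8880 --ip='0.0.0.0'
```

Submit with `sbatch jupyter.sh`, then look in the log file for the URL to open in your browser.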