AI Hypercomputer images

AI Hypercomputer provides a software stack that contains common tools and libraries that are pre-configured to support your artificial intelligence (AI), machine learning (ML), and high performance computing (HPC) workloads. The software stack is arranged as follows:

Figure 1. AI Hypercomputer software stack

As outlined in the preceding diagram, the software stack has two main components:

  • One of the following types of cluster images:

    • Deep Learning Software Layer (DLSL) container images: these images package NVIDIA CUDA, NCCL, and ML frameworks like PyTorch, providing a ready-to-use environment for deep learning workloads. These prebuilt DLSL container images are tested and verified to work seamlessly on Google Kubernetes Engine (GKE) clusters.

    • Slurm custom images: these images are created using Cluster Toolkit blueprints and extend the Ubuntu LTS Accelerator OS images to install software optimized for Slurm clusters.

  • Guest operating system (OS) accelerator images: these images are Ubuntu LTS or Rocky Linux images that are pre-configured with NVIDIA MOFED drivers, NVIDIA GPU drivers, and the RDMA core libraries. These images are suitable for deploying workloads directly on Compute Engine instances.

Cluster images

DLSL images are optimized for GKE clusters, while Slurm custom images are optimized for Slurm clusters.

DLSL container images

When working with Google Kubernetes Engine environments, DLSL container images provide the following benefits for your workloads:

  • Simpler configuration: by replicating the setup used for internal reproducibility and regression testing, DLSL containers simplify the configuration of the environments that you use to run machine learning workloads.
  • Version management of the pre-configured Docker images.
  • Sample recipes that demonstrate how to start your workloads using the pre-configured Docker images.

You can access the DLSL container images from the DLSL artifact registry or in the sample recipe guides.

NeMo + PyTorch + NCCL gIB Plugin

These Docker images are based on the NVIDIA NeMo NGC image. They contain Google's NCCL gIB plugin and bundle all of the NCCL binaries that are required to run workloads on each supported accelerator machine series. They also include Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine. A minimal example of referencing one of these images from a GKE manifest follows the list of images.

  • nemo25.02-gib1.0.5-A4
    • Dependencies: NeMo NGC 25.02; NCCL gIB plugin 1.0.5
    • Machine series: A4
    • Release date: March 14, 2025
    • End of support date: March 14, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4
  • nemo24.07-gib1.0.2-A3U
    • Dependencies: NeMo NGC 24.07; NCCL gIB plugin 1.0.2
    • Machine series: A3 Ultra
    • Release date: February 2, 2025
    • End of support date: February 2, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.2-A3U
  • nemo24.07-gib1.0.3-A3U
    • Dependencies: NeMo NGC 24.07; NCCL gIB plugin 1.0.3
    • Machine series: A3 Ultra
    • Release date: February 2, 2025
    • End of support date: February 2, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.3-A3U
  • nemo24.12-gib1.0.3-A3U
    • Dependencies: NeMo NGC 24.12; NCCL gIB plugin 1.0.3
    • Machine series: A3 Ultra
    • Release date: February 7, 2025
    • End of support date: February 7, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.12-gib1.0.3-A3U
  • nemo24.07-tcpx1.0.5-A3Mega
    • Dependencies: NeMo NGC 24.07; GPUDirect-TCPX 1.0.5
    • Machine series: A3 Mega
    • Release date: March 12, 2025
    • End of support date: March 12, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-tcpx1.0.5-A3Mega
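
The following is a minimal, illustrative sketch of how one of these images could be referenced from a GKE workload. The Job name, command, and GPU count are hypothetical placeholders; production settings such as node selectors, NCCL environment variables, and host volumes for the gIB plugin are covered in the sample recipes rather than shown here.

    # Minimal, illustrative GKE Job that runs the NeMo DLSL image.
    # The Job name, command, and GPU count are placeholders; see the
    # sample recipe guides for complete, tuned manifests.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nemo-smoke-test            # hypothetical name
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: nemo
            image: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4
            command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
            resources:
              limits:
                nvidia.com/gpu: 8      # one node's GPUs; adjust for your cluster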

NeMo + PyTorch

This Docker image is based on the NVIDIA NeMo NGC image and includes Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine.

  • nemo24.07-A3U
    • Dependencies: NeMo NGC 24.07
    • Machine series: A3 Ultra
    • Release date: December 19, 2024
    • End of support date: December 19, 2025
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo:nemo24.07-A3U

MaxText + JAX toolbox

This Docker image is based on the NVIDIA JAX Toolbox image and includes Google Cloud tools, such as gcsfuse and the gcloud CLI, for deploying workloads to Google Kubernetes Engine.

  • toolbox-maxtext-2025-01-10-A3U
    • Dependencies: JAX Toolbox maxtext-2025-01-10
    • Machine series: A3 Ultra
    • Release date: March 11, 2025
    • End of support date: March 11, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:toolbox-maxtext-2025-01-10-A3U

MaxText + JAX stable stack

This Docker image is based on the JAX stable stack and MaxText. It also includes dependencies, such as dnsutils, for running workloads on Google Kubernetes Engine.

  • jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317
    • Dependencies: JAX stable stack jax0.5.1-cuda_dl25.02-rev1; MaxText commit 54e98c9e62caa426cf5902be068533ddb4fb79f5
    • Machine series: A4
    • Release date: March 17, 2025
    • End of support date: March 17, 2026
    • Image name: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317

Slurm custom images

Slurm custom images are built from Cluster Toolkit blueprints. Blueprints are YAML files that define a configuration that you want to deploy using Cluster Toolkit. Cluster Toolkit provides default blueprints that are optimized for running AI Hypercomputer workloads on Slurm clusters. However, you can modify the default blueprints before you deploy them to customize some of the software downloaded on your images.

Cluster Toolkit blueprints extend the Ubuntu LTS Accelerator OS images.

A3 Ultra

The A3 Ultra blueprint installs the following software by default:

  • Ubuntu 22.04 LTS
  • Slurm: version 24.11.2
  • The following Slurm dependencies:
    • munge
    • mariadb
    • libjwt
    • lmod
  • Open MPI: the latest release of 4.1.x
  • PMIx: version 4.2.9
  • NFS client and server
  • NVIDIA 550 series drivers
  • NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
  • NVIDIA pyxis
  • The following NVIDIA tools:
    • Data Center GPU Manager (dcgmi)
    • libnvidia-cfg1-550-server
    • libnvidia-nscq-550
    • nvidia-compute-utils-550-server
    • nsight-compute
    • nsight-systems
  • CUDA Toolkit: version 12.4
  • InfiniBand support, including ibverbs-utils
  • Ops Agent
  • Cloud Storage FUSE

A4

The A4 blueprint installs the following software by default:

  • Ubuntu 22.04 LTS
  • Slurm: version 24.11.2
  • The following Slurm dependencies:
    • munge
    • mariadb
    • libjwt
    • lmod
  • Open MPI: the latest release of 4.1.x
  • PMIx: version 4.2.9
  • NFS client and server
  • NVIDIA 570 series drivers
  • NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
  • NVIDIA pyxis
  • The following NVIDIA tools:
    • Data Center GPU Manager (dcgmi)
    • nvidia-utils-570
    • nvidia-container-toolkit
    • libnvidia-nscq-570
  • CUDA Toolkit: version 12.8
  • InfiniBand support, including ibverbs-utils
  • Ops Agent
  • Cloud Storage FUSE

To learn how to modify Cluster Toolkit blueprints, see Design a cluster blueprint.
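
For orientation, the following fragment is a hedged sketch of the overall YAML structure that a blueprint uses. It is not the content of the default A3 Ultra or A4 blueprint; the blueprint name, variable values, and module list are placeholders.

    # Illustrative blueprint skeleton, not the default A3 Ultra or A4 blueprint.
    # project_id, region, zone, and the module list are placeholders.
    blueprint_name: example-slurm-cluster

    vars:
      project_id: my-project          # placeholder project
      deployment_name: example-cluster
      region: us-central1
      zone: us-central1-a

    deployment_groups:
    - group: primary
      modules:
      - id: network
        source: modules/network/vpc
      - id: homefs
        source: modules/file-system/filestore
        use: [network]
        settings:
          local_mount: /home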

Guest OS accelerator images

The following OS images are optimized for running artificial intelligence (AI) and machine learning (ML) workloads on Google Cloud. For more detailed information about each OS, see the Operating system details page in the Compute Engine documentation.

Rocky Linux Accelerator

The following Rocky Linux Accelerator OS images are available:

  • Rocky Linux 9 Accelerator
    • Image family: rocky-linux-9-optimized-gcp-nvidia-latest
    • Image project: rocky-linux-accelerator-cloud
  • Rocky Linux 8 Accelerator
    • Image family: rocky-linux-8-optimized-gcp-nvidia-latest
    • Image project: rocky-linux-accelerator-cloud

Ubuntu LTS Accelerator

The following Ubuntu LTS Accelerator OS images are available:

  • Ubuntu 24.04 LTS Accelerator
    • Image family: ubuntu-accelerator-2404-amd64-with-nvidia-550
    • Image project: ubuntu-os-accelerator-images
  • Ubuntu 22.04 LTS Accelerator
    • Image family: ubuntu-accelerator-2204-amd64-with-nvidia-550
    • Image project: ubuntu-os-accelerator-images
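
If you provision Compute Engine instances through a Cluster Toolkit blueprint, one way to select one of these images is the instance_image setting of a compute module. The fragment below is a hedged sketch that assumes the vm-instance module; the module ID, machine type, and network reference are placeholders.

    # Illustrative module fragment (part of a blueprint's modules list) that
    # selects the Ubuntu 24.04 LTS Accelerator image by family. The module ID,
    # machine type, and network reference are placeholders.
    - id: gpu-vm
      source: modules/compute/vm-instance
      use: [network]
      settings:
        machine_type: a3-ultragpu-8g   # placeholder; choose your machine series
        instance_image:
          family: ubuntu-accelerator-2404-amd64-with-nvidia-550
          project: ubuntu-os-accelerator-images

You can also reference the same image family and image project directly when you create Compute Engine instances, for example with the corresponding flags of the gcloud CLI.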

What's next