AI Hypercomputer provides a software stack of common tools and libraries that are pre-configured to support your artificial intelligence (AI), machine learning (ML), and high performance computing (HPC) workloads.

The software stack has two main components:
One of the following types of cluster images:
Deep Learning Software Layer (DLSL) container images: these images package NVIDIA CUDA, NCCL, and ML frameworks like PyTorch, providing a ready-to-use environment for deep learning workloads. These prebuilt DLSL container images are tested and verified to work seamlessly on Google Kubernetes Engine (GKE) clusters.
Slurm custom images: these images are created using Cluster Toolkit blueprints and extend the Ubuntu LTS Accelerator OS images to install software optimized for Slurm clusters.
Guest operating system (OS) accelerator images: these images are either an Ubuntu LTS or Rocky Linux image that is pre-configured with NVIDIA MOFED drivers, NVIDIA GPU drivers, and the RDMA core libraries. These images are suitable for deploying workloads directly on Compute Engine instances.
Cluster images
DLSL images are optimized for GKE clusters, while Slurm custom images are optimized for Slurm clusters.
DLSL container images
When working with Google Kubernetes Engine environments, DLSL container images provide the following benefits for your workloads:
- Simpler configuration: by replicating the setup used for internal reproducibility and regression testing, DLSL containers simplify the configuration of environments for running machine learning workloads
- Version management of the pre-configured Docker images
- Sample recipes that demonstrate how to start your workloads using the pre-configured Docker images
You can access the DLSL container images from the DLSL artifact registry or in the sample recipe guides.
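For example, after authenticating Docker against Artifact Registry, you can pull one of the listed images locally. This is a minimal sketch assuming you have the gcloud CLI and Docker installed; the image tag comes from the tables in the following sections:

```shell
# Configure Docker to authenticate with Artifact Registry (one-time setup).
gcloud auth configure-docker us-central1-docker.pkg.dev

# Pull a DLSL container image; this tag is one of the listed
# NeMo + PyTorch + NCCL gIB images for A4 machines.
docker pull us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4
```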
NeMo + PyTorch + NCCL gIB Plugin
These Docker images are based on the NVIDIA NeMo NGC image. They contain Google's NCCL gIB plugin and bundle all of the NCCL binaries that are required for running workloads on each supported accelerator machine. They also include Google Cloud tools such as gcsfuse and the gcloud CLI for deploying workloads to Google Kubernetes Engine.
Model version | Dependencies version | Machine series | Release date | End of support date | Image name | Sample recipes
---|---|---|---|---|---|---
nemo25.02-gib1.0.5-A4 | | A4 | March 14, 2025 | March 14, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4 | 
nemo24.07-gib1.0.2-A3U | | A3 Ultra | February 2, 2025 | February 2, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.2-A3U | 
nemo24.07-gib1.0.3-A3U | | A3 Ultra | February 2, 2025 | February 2, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-gib1.0.3-A3U | 
nemo24.12-gib1.0.3-A3U | | A3 Ultra | February 7, 2025 | February 7, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.12-gib1.0.3-A3U | 
nemo24.07-tcpx1.0.5-A3Mega | | A3 Mega | March 12, 2025 | March 12, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.07-tcpx1.0.5-A3Mega | 
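To illustrate how one of these images might be referenced from a GKE workload, the following sketch applies a hypothetical Pod manifest that uses one of the A3 Ultra image names from the preceding table. The Pod name, command, and GPU count are illustrative; a real workload would typically also configure networking and storage for the cluster:

```shell
# Hypothetical smoke-test Pod (names are illustrative) that runs the
# A3 Ultra NeMo image on a GKE node pool with NVIDIA GPUs attached.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nemo-smoke-test   # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: nemo
    image: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo24.12-gib1.0.3-A3U
    # Check that PyTorch inside the image can see the GPUs.
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
        nvidia.com/gpu: 8
EOF
```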
NeMo + PyTorch
This Docker image is based on the NVIDIA NeMo NGC image and includes Google Cloud tools such as gcsfuse and the gcloud CLI for deploying workloads to Google Kubernetes Engine.
Model version | Dependencies version | Machine series | Release date | End of support date | Image name
---|---|---|---|---|---
nemo24.07-A3U | NeMo NGC: 24.07 | A3 Ultra | December 19, 2024 | December 19, 2025 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo:nemo24.07-A3U
MaxText + JAX toolbox
This Docker image is based on the NVIDIA JAX Toolbox image and includes Google Cloud tools such as gcsfuse and the gcloud CLI for deploying workloads to Google Kubernetes Engine.
Model version | Dependencies version | Machine series | Release date | End of support date | Image name
---|---|---|---|---|---
toolbox-maxtext-2025-01-10-A3U | JAX Toolbox: maxtext-2025-01-10 | A3 Ultra | March 11, 2025 | March 11, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:toolbox-maxtext-2025-01-10-A3U
MaxText + JAX stable stack
This Docker image is based on the JAX stable stack and MaxText. It also includes dependencies such as dnsutils for running workloads on Google Kubernetes Engine.
Model version | Dependencies version | Machine series | Release date | End of support date | Image name
---|---|---|---|---|---
jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317 | | A4 | March 17, 2025 | March 17, 2026 | us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317
Slurm custom images
Slurm custom images are built from Cluster Toolkit blueprints. Blueprints are YAML files that define a configuration that you want to deploy using Cluster Toolkit. Cluster Toolkit provides default blueprints that are optimized for running AI Hypercomputer workloads on Slurm clusters. However, you can modify the default blueprints before you deploy them to customize some of the software installed on your images.
Cluster Toolkit blueprints extend the Ubuntu LTS Accelerator OS images.
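As a rough sketch of the deployment workflow, you build the Cluster Toolkit `gcluster` binary and point it at a blueprint. The blueprint path and deployment variables below are illustrative, not actual file names from the toolkit:

```shell
# Build the Cluster Toolkit gcluster binary from source.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit && make

# Deploy a Slurm blueprint; the blueprint path and variable values
# here are illustrative placeholders for your own configuration.
./gcluster deploy examples/a3u-slurm-blueprint.yaml \
  --vars project_id=my-project,deployment_name=my-a3u-cluster
```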
A3 Ultra
The A3 Ultra blueprint installs the following software by default:
- Ubuntu 22.04 LTS
- Slurm: version 24.11.2
- The following Slurm dependencies: munge, mariadb, libjwt, lmod
- Open MPI: the latest release of 4.1.x
- PMIx: version 4.2.9
- NFS client and server
- NVIDIA 550 series drivers
- NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
- NVIDIA pyxis
- The following NVIDIA tools: Data Center GPU Manager (dcgmi), libnvidia-cfg1-550-server, libnvidia-nscq-550, nvidia-compute-utils-550-server, nsight-compute, nsight-systems
- CUDA Toolkit: version 12.4
- InfiniBand support, including ibverbs-utils
- Ops Agent
- Cloud Storage FUSE
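After a cluster built from this blueprint is deployed, you can confirm that the expected versions landed on a node. This is an illustrative check, assuming you have an SSH session on a cluster node:

```shell
# Confirm the software versions the A3 Ultra blueprint installs.
sinfo --version                                               # Slurm (expect 24.11.x)
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # 550-series driver
nvcc --version                                                # CUDA Toolkit 12.4
dcgmi discovery -l                                            # DCGM sees the attached GPUs
```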
A4
The A4 blueprint installs the following software by default:
- Ubuntu 22.04 LTS
- Slurm: version 24.11.2
- The following Slurm dependencies: munge, mariadb, libjwt, lmod
- Open MPI: the latest release of 4.1.x
- PMIx: version 4.2.9
- NFS client and server
- NVIDIA 570 series drivers
- NVIDIA enroot container runtime: version 3.5.0 with post-release bugfix
- NVIDIA pyxis
- The following NVIDIA tools: Data Center GPU Manager (dcgmi), nvidia-utils-570, nvidia-container-toolkit, libnvidia-nscq-570
- CUDA Toolkit: version 12.8
- InfiniBand support, including ibverbs-utils
- Ops Agent
- Cloud Storage FUSE
To learn how to modify Cluster Toolkit blueprints, see Design a cluster blueprint.
Guest OS accelerator images
The following OS images are optimized for running artificial intelligence (AI) and machine learning (ML) workloads on Google Cloud. For more detailed information about each OS, see the Operating system details page in the Compute Engine documentation.
Rocky Linux Accelerator
The following Rocky Linux Accelerator OS images are available:
- Rocky Linux 9 Accelerator
  - Image family: rocky-linux-9-optimized-gcp-nvidia-latest
  - Image project: rocky-linux-accelerator-cloud
- Rocky Linux 8 Accelerator
  - Image family: rocky-linux-8-optimized-gcp-nvidia-latest
  - Image project: rocky-linux-accelerator-cloud
Ubuntu LTS Accelerator
The following Ubuntu LTS Accelerator OS images are available:
- Ubuntu 24.04 LTS Accelerator
  - Image family: ubuntu-accelerator-2404-amd64-with-nvidia-550
  - Image project: ubuntu-os-accelerator-images
- Ubuntu 22.04 LTS Accelerator
  - Image family: ubuntu-accelerator-2204-amd64-with-nvidia-550
  - Image project: ubuntu-os-accelerator-images
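To use one of these images, pass its image family and image project when creating a Compute Engine instance. The instance name, zone, and machine type below are illustrative; choose an accelerator-capable machine type for your workload:

```shell
# Create a VM from the Ubuntu 24.04 LTS Accelerator image family.
# Name, zone, and machine type are illustrative placeholders.
gcloud compute instances create my-gpu-vm \
  --zone=us-central1-a \
  --machine-type=a3-highgpu-8g \
  --image-family=ubuntu-accelerator-2404-amd64-with-nvidia-550 \
  --image-project=ubuntu-os-accelerator-images
```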
What's next?
- Review consumption options.
- To get started with creating VMs and clusters, see Create VMs and clusters overview.