Deploy TPU Multislices in GKE


This page introduces the TPU Multislice configuration in Google Kubernetes Engine (GKE). Before you configure Multislice in GKE, you should be familiar with the following concepts:

  1. Introduction to Cloud TPU
  2. Cloud TPU system architecture
  3. About TPUs in GKE

What is TPU Multislice

TPU Multislice is the architectural organization of TPU VMs in which two or more Cloud TPU slices communicate over the Data Center Network (DCN). Multislice enables full-stack, cost-effective, large-scale training with near-linear scaling up to tens of thousands of TPU chips. In a Multislice configuration, GKE deploys a Multislice workload on multiple TPU slices. The communication between chips within a slice happens over the inter-chip interconnect (ICI). The communication between slices happens over the DCN.

We recommend that you use Multislice if your Job is too big to fit on a single TPU slice.

Multislice availability in GKE

  • Standard supports Multislice in version 1.27.4-gke.900 and later.
  • Autopilot supports Multislice in version 1.29.2-gke.1521000 and later.
  • Multislice supports JAX and PyTorch frameworks. The minimum supported JAX version is 2.1.
  • Multislice only supports multi-host TPU slice node pools. For example, you cannot use Multislice with a ct4p-hightpu-4t with a 2x2x1 topology or a ct5lp-hightpu-4t with a 2x2 topology, because these are single-host TPU slice node pools.
  • Multislice only supports synchronous multicontroller training.
  • Multislice workloads can only run across TPU slices that share the same TPU type, size, and topology.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
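
If you use the gcloud CLI, you can enable the Google Kubernetes Engine API from the command line. This is a minimal sketch; it acts on your currently configured project:

# Enable the GKE API in the active project.
gcloud services enable container.googleapis.com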

Run a workload on a Multislice

This section shows you how to run a workload on a Multislice. If you use GKE Autopilot mode, skip to the Run a Multislice workload section. Autopilot clusters that run version 1.29.2-gke.1521000 or later enable TPUs by default.

Prepare a Standard mode node pool

This section covers the following steps:

  1. Create three multi-host TPU slice node pools
  2. Verify the node pool status

Create the TPU node pool

You can create more than one multi-host TPU node pool. For the purpose of this guide, create three multi-host TPU node pools to run a Multislice workload. You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONE \
    --machine-type=MACHINE_TYPE \
    --tpu-topology=TPU_TOPOLOGY \
    --num-nodes=NUM_NODES \
    [--spot \]
    [--enable-autoscaling \
      --max-nodes MAX_NODES] \
    [--reservation-affinity=specific \
    --reservation=RESERVATION_NAME]

Replace the following:

  • POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • TPU v5e machine types beginning with ct5l- are never multi-host.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p machine types beginning with ct5p-, use us-east1-d, us-east5-a, or us-east5-c.

    To learn more, see TPU availability in GKE.

  • CLUSTER_NAME: The name of the cluster.

  • NODE_ZONE: The comma-separated list of one or more zones where GKE creates the node pool.

  • MACHINE_TYPE: The type of machine to use for nodes. To learn more about the available machine types, see Mapping of TPU configuration.

  • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version as follows:

    • TPU v4 or v5p: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.
    • TPU v5e: Define the topology in 2-tuples ({A}x{B}), for example 2x2.

    To learn more, see Topology.

  • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.

Optionally, you can also use the following flags:

  • RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPU node pools. To learn more about TPU reservations, see TPU reservation.
  • --spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
  • --enable-autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
    • MAX_NODES: The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
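
For example, the following command creates one of the three multi-host TPU slice node pools that this guide uses, assuming a TPU v4 slice with a 4x4x4 topology (64 chips divided by 4 chips per VM gives 16 nodes). The cluster, pool, and zone values are illustrative only; repeat the command with a different pool name for each remaining node pool:

# Illustrative example: one of three identical TPU v4 node pools (4x4x4 topology, 16 nodes).
gcloud container node-pools create multislice-pool-1 \
    --location=us-central2-b \
    --cluster=tpu-cluster \
    --node-locations=us-central2-b \
    --machine-type=ct4p-hightpu-4t \
    --tpu-topology=4x4x4 \
    --num-nodes=16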

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:

    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider           = google
      project            = PROJECT_ID
      cluster            = CLUSTER_NAME
      name               = POOL_NAME
      location           = CLUSTER_LOCATION
      node_locations     = [NODE_ZONES]
      initial_node_count = NUM_NODES
    
      autoscaling {
        max_node_count = MAX_NODES
        location_policy      = "ANY"
      }
      node_config {
        machine_type = MACHINE_TYPE
        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key = "compute.googleapis.com/reservation-name"
          values = [RESERVATION_LABEL_VALUES]
        }
        spot = true
      }
    
      placement_policy {
        type = "COMPACT"
        tpu_topology = TPU_TOPOLOGY
      }
    }
    

    Replace the following:

    • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
    • PROJECT_ID: Your project ID.
    • CLUSTER_NAME: The name of the existing cluster to add the node pool to.
    • POOL_NAME: The name of the node pool to create.
    • CLUSTER_LOCATION: Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology.
    • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
    • NUM_NODES: The number of nodes in the node pool. It must be zero or the number of TPU chips divided by four, because in multi-host TPU slices each TPU node has 4 chips. For example, if TPU_TOPOLOGY is 4x8, then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Mapping of TPU configuration.
    • TPU_TOPOLOGY: This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using:
      • For TPU v4: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.
      • For TPU v5e: Define the topology in 2-tuples ({A}x{B}), for example 2x2.

    Optionally, you can also use the following variables:

    • RESERVATION_NAME: If you use a TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
    • autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
      • MAX_NODES: It is the maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
    • spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
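
    After you add the block, a typical workflow is to initialize the provider and apply the configuration. The following is a minimal sketch, assuming the configuration file is in your current working directory; repeat the resource block (or use count or for_each) to create the three node pools that this guide uses:

    # Initialize the working directory and download the google provider (4.84.0 or later).
    terraform init
    # Preview, then create, the multi-host TPU slice node pool.
    terraform plan
    terraform apply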

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • TPU v5e machine types beginning with ct5l- are never multi-host.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p machine types beginning with ct5p-, use us-east1-d, us-east5-a, or us-east5-c.
  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT4P: For TPU v4.
    • CT5LP: For TPU v5e.
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Mapping of TPU configuration table to learn how to define the machine type and TPU topology that create a multi-host TPU node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Verify the node pool status

  1. Get credentials, so that you can use kubectl to access the cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --project=PROJECT_ID
    

    Replace the following:

    • CLUSTER_NAME: The name of the cluster.
    • PROJECT_ID: Your project ID.
  2. In Cloud Shell, use kubectl to see your TPU nodes:

    kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=TPU_ACCELERATOR \
       -l cloud.google.com/gke-tpu-topology=TPU_TOPOLOGY
    

    Replace the following:

    • TPU_ACCELERATOR: The type of TPU accelerator you used when you created the node pools. For example, tpu-v4-podslice, tpu-v5-lite-device, or tpu-v5-lite-podslice.
    • TPU_TOPOLOGY: The physical topology for the TPU slice.

    The output is similar to the following:

     NAME                                    STATUS   ROLES    AGE    VERSION
     gke-tpu-20ee2cce-5tv6                   Ready    <none>   34h     v1.28.1-gke.1066000
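
    For example, to count the nodes that GKE created for TPU v4 node pools with a 4x4x4 topology, you can combine the same label selectors with wc. The label values are illustrative; with the three node pools from this guide, the command would report 48 nodes (3 x 16):

    # Count the TPU nodes that match the accelerator and topology labels.
    kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v4-podslice \
       -l cloud.google.com/gke-tpu-topology=4x4x4 --no-headers | wc -l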
    

Run a Multislice workload

In this section, you run a JAX workload that shows the global number of TPU chips in the Multislice and then exits.

To run a JAX workload, do the following:

  1. Create the following tpu-multislice.yaml manifest:

    Autopilot

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: multislice-job
      annotations:
        alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
    spec:
      failurePolicy:
        maxRestarts: 4
      replicatedJobs:
        - name: slice
          replicas: NUM_SLICES
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              backoffLimit: 0
              template:
                spec:
                  nodeSelector:
                    cloud.google.com/gke-tpu-accelerator: ACCELERATOR_TYPE
                    cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
                  containers:
                  - name: jax-tpu
                    image: python:3.8
                    ports:
                    - containerPort: 8471
                    - containerPort: 8080
                    - containerPort: 8431
                    command:
                    - bash
                    - -c
                    - |
                      pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                      python -c 'import jax; print("Global device count:", jax.device_count())'
                      sleep 60
                    resources:
                      limits:
                        google.com/tpu: NUM_CHIPS
    

    Standard

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: multislice-job
      annotations:
        alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
    spec:
      failurePolicy:
        maxRestarts: 4
      replicatedJobs:
        - name: slice
          replicas: NUM_SLICES
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              backoffLimit: 0
              template:
                spec:
                  hostNetwork: true
                  dnsPolicy: ClusterFirstWithHostNet
                  nodeSelector:
                    cloud.google.com/gke-tpu-accelerator: ACCELERATOR_TYPE
                    cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
                  containers:
                  - name: jax-tpu
                    image: python:3.8
                    ports:
                    - containerPort: 8471
                    - containerPort: 8080
                    - containerPort: 8431
                    securityContext:
                      privileged: true
                    command:
                    - bash
                    - -c
                    - |
                      pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                      python -c 'import jax; print("Global device count:", jax.device_count())'
                      sleep 60
                    resources:
                      limits:
                        google.com/tpu: NUM_CHIPS
    

    Replace the following:

    • NUM_SLICES: The number of TPU slice node pools. In this case, NUM_SLICES equals 3.
    • ACCELERATOR_TYPE: The type of TPU accelerator you used when you created the node pools. For example, tpu-v4-podslice, tpu-v5-lite-device, or tpu-v5-lite-podslice.
    • TPU_TOPOLOGY: The physical topology for the TPU slice. For example 4x4x4 or 2x2 depending on the TPU version.
    • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4, the number of chips in each VM is four. For multi-host TPU v5e, the number of chips in each VM is one, four, or eight. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.
    • NUM_CHIPS: For multi-host TPU v4, the number of chips in each VM is four. For multi-host TPU v5e, the number of chips in each VM is one, four, or eight. To learn more, see TPU chips on the TPU VM.

    In this manifest:

    • The JobSet creates a headless Service with the same name as the JobSet, in this case multislice-job.
    • The maxRestarts: 4 field indicates the maximum number of times that GKE restarts the JobSet when a child Job fails. If the number of JobSet restarts reaches the defined maximum, the JobSet is marked as failed.
    • The parallelism and completions fields equal the number of nodes in each node pool.
    • The backoffLimit is 0 because Multislice only supports synchronous multi-controller training. It must be set to 0 so that the Job fails when any Pod fails.
    • The alpha.jobset.sigs.k8s.io/exclusive-topology annotation ensures that each child Job has exclusive use of a TPU slice node pool, so only one Multislice workload runs across the group of node pools.
    • The containerPort: 8080 is the port for the MXLA coordinator.
    • The containerPort: 8431 is the port used to export TPU usage metrics.
    • The securityContext: privileged: true indicates that nodes have privileged mode enabled to access TPUs. Nodes in GKE version 1.28 or later don't need to have privileged mode enabled to access TPUs. To learn more, see Run containers without privileged mode.
  2. Apply the manifest:

    kubectl apply -f tpu-multislice.yaml
    
  3. Confirm that the workload is admitted:

    kubectl get jobsets
    

    The output is similar to the following:

    NAME            RESTARTS   COMPLETED   AGE
    multislice-job                         3s
    
  4. Monitor the status of the provisioned Pods:

    kubectl get pods
    

    The output is similar to the following:

     NAME                                READY   STATUS      RESTARTS   AGE
     multislice-job-slice-0-0-wzq9t      0/1     Completed   0          2m31s
     multislice-job-slice-0-1-zf4dp      0/1     Completed   0          2m30s
     multislice-job-slice-1-0-hbfn5      0/1     Completed   0          2m31s
     multislice-job-slice-1-1-45fgl      0/1     Completed   0          2m30s
     multislice-job-slice-2-0-wjbp4      0/1     Completed   0          2m30s
     multislice-job-slice-2-1-lwnvs      0/1     Completed   0          2m30s
    

The multislice-job JobSet schedules, creates, and then runs the Pods to completion. The Pod names are in the format <jobsetName>-<jobName>-<jobReplicaIndex>-<randomSuffix>. The jobsetName prefix determines the JobSet that the Pod belongs to.
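
For example, to check the JAX output from one of the workers, read that Pod's logs. The Pod name below comes from the sample output above; yours will differ:

# Print the container logs for one worker Pod. For three TPU v4 slices with a
# 4x4x4 topology each, the output includes "Global device count: 192".
kubectl logs multislice-job-slice-0-0-wzq9t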

Additional configurations

The following sections describe the additional configurations you can apply to your Multislice.

Enable hostNetwork on your GKE Standard Pods

To improve network performance between TPU slices, we recommend that you turn on hostNetworking. Use hostNetwork: true in your Pod spec to skip the Kubernetes networking stack and let your Kubernetes Pods use the host network directly for VM-to-VM communication.

To turn off hostNetworking, remove the following two lines from your Pod spec:

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

To keep using the podHostnames for worker node discovery with hostNetwork, set dnsPolicy: ClusterFirstWithHostNet. This is important when you are running auto-resuming training Jobs and you need to have the same names for reloading the same checkpoints.
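
To confirm whether a running Pod uses the host network, you can inspect its spec. The Pod name in this sketch is illustrative:

# Prints "true" when the Pod shares the node's network namespace.
kubectl get pod multislice-job-slice-0-0-wzq9t -o jsonpath='{.spec.hostNetwork}'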

Logging

Logs emitted by containers running on GKE nodes, including TPU VMs, are visible in the Logs Explorer, if you have GKE system logging enabled in your cluster.

To view the container logs for your workload, use the following filter in the Logs Explorer:

resource.type="k8s_container"
resource.labels.cluster_name=CLUSTER_NAME
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"=JOBSET_NAME

Use the following filter to view logs for a specific TPU slice and worker:

resource.type="k8s_container"
resource.labels.cluster_name=CLUSTER_NAME
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"=JOBSET_NAME
resource.labels.pod_name:<jobSetName>-<replicateJobName>-<job-index>-<worker-index>
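
You can also run the same filters from the command line with gcloud logging read. The cluster and JobSet names in this sketch are placeholders:

# Fetch recent container log entries for the JobSet from Cloud Logging.
gcloud logging read \
    'resource.type="k8s_container" AND resource.labels.cluster_name="tpu-cluster" AND labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"' \
    --limit=20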

Observability and metrics

In addition to the general TPU metrics, there are four additional Multislice-specific TPU runtime metrics. These metrics are available in GKE version 1.29.1-gke.1016000 or later. To use these metrics, the TPU workload must use JAX version 0.4.24.

The following are the available multislice metrics:

  • DCN (Data Center Network) transfer latencies: Distribution of network transfer latencies for Multislice traffic.
  • Collective latencies: Distribution of end-to-end collective latency for Multislice traffic.
  • Host-to-Device transfer latencies: Distribution of host-to-device transfer latency for each chunk of data for Multislice traffic.
  • Device-to-Host transfer latencies: Distribution of device-to-host transfer latency for each chunk of data for Multislice traffic.

These metrics are located in the Kubernetes container (k8s_container) schema:

  • kubernetes.io/container/multislice/network/dcn_transfer_latencies
  • kubernetes.io/container/multislice/network/collective_end_to_end_latencies
  • kubernetes.io/container/multislice/accelerator/host_to_device_transfer_latencies
  • kubernetes.io/container/multislice/accelerator/device_to_host_transfer_latencies

TPU slice versus Multislice

The following table differentiates the architectural organization of a TPU slice and a Multislice:

|  | TPU slice | Multislice |
| --- | --- | --- |
| Interconnectivity | The workload runs on a single TPU slice. All TPU chips in a slice are connected with ICI. | The workload runs on multiple TPU slices. Communication within a slice happens over ICI. Communication between slices occurs over DCN. |
| Supported node pools | Single-host TPU slice and multi-host TPU slice | Groups of multi-host TPU slices |
| Recommended workload type | IndexedJob or JobSet | JobSet |

What's next