Create a partial cluster

To mitigate the effects of stockouts (the unavailability of user-specified VMs in a specific region at a specific time), Dataproc allows you to request the creation of a partial cluster: you specify the minimum number of primary workers that is acceptable for cluster creation to succeed.

Standard cluster:
  • If one or more primary workers cannot be created and initialized, cluster creation fails. Workers that are created continue to run and incur charges until deleted by the user.
  • Cluster creation time is optimized.
  • Single node clusters are available for creation.

Partial cluster:
  • If the specified minimum number of workers can be created, the cluster is created. Failed (uninitialized) workers are deleted and do not incur charges.
  • If the specified minimum number of workers cannot be created and initialized, the cluster is not created. Workers that are created are not deleted, to allow for debugging.
  • Cluster creation can take longer, since all nodes must report their provisioning status.
  • Single node clusters are not available for creation.

Autoscaling

Use autoscaling with partial cluster creation to help ensure that the target (full) number of primary workers is eventually created. If the workload requires them, autoscaling attempts to acquire the workers that failed to provision in the background.

The following sample autoscaling policy retries until the total number of primary worker instances reaches the target size of 10. The policy's minInstances and maxInstances match the minimum and total numbers of primary workers specified at cluster creation time (see How to create a partial cluster). Setting scaleDownFactor to 0 prevents the cluster from scaling back down from 10 to 8, which helps keep the number of workers at the 10-worker maximum.

workerConfig:
  minInstances: 8     # the minimum acceptable number of primary workers (MIN_NUM_WORKERS)
  maxInstances: 10    # the target total number of primary workers (NUM_WORKERS)
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1                 # scale up aggressively toward maxInstances
    scaleDownFactor: 0               # never remove workers once they are acquired
    gracefulDecommissionTimeout: 1h
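
For example, assuming the policy above is saved locally as policy.yaml (a placeholder filename), you could import it and attach it when creating the partial cluster along the following lines; the policy ID partial-cluster-policy is also a placeholder:

# Import the autoscaling policy definition.
gcloud dataproc autoscaling-policies import partial-cluster-policy \
    --source=policy.yaml \
    --region=REGION

# Attach the policy at cluster creation time.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --num-workers=10 \
    --min-num-workers=8 \
    --autoscaling-policy=partial-cluster-policy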

How to create a partial cluster

You can use the Google Cloud CLI or the Dataproc API to create a Dataproc partial cluster.

gcloud

To create a Dataproc partial cluster on the command line, run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT \
    --region=REGION \
    --num-workers=NUM_WORKERS \
    --min-num-workers=MIN_NUM_WORKERS \
    other args ...
  • CLUSTER_NAME: The cluster name must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
  • PROJECT: Specify the project associated with the job cluster.
  • REGION: Specify the Compute Engine region where the job cluster will be located.
  • NUM_WORKERS: The total number of primary workers to create in the cluster, if available.
  • MIN_NUM_WORKERS: The minimum number of primary workers to create if the specified total number of workers (NUM_WORKERS) cannot be created. Cluster creation fails if this minimum number of primary workers cannot be created (workers that are created are not deleted to allow for debugging). If this flag is omitted, standard cluster creation with the total number of primary workers (NUM_WORKERS) is attempted.
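
For illustration, the following invocation requests 10 primary workers and accepts a minimum of 8, matching the sample autoscaling policy above (the cluster, project, and region values are placeholders):

gcloud dataproc clusters create my-partial-cluster \
    --project=my-project \
    --region=us-central1 \
    --num-workers=10 \
    --min-num-workers=8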

REST

To create a Dataproc partial cluster, specify the minimum number of primary workers in the workerConfig.minNumInstances field as part of a clusters.create request.
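
For example, a minimal request body might look like the following sketch, which requests 10 primary workers with a minimum of 8 (other required cluster settings are omitted for brevity):

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "workerConfig": {
      "numInstances": 10,
      "minNumInstances": 8
    }
  }
}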

Display the number of provisioned workers

After creating a cluster, you can run the following gcloud CLI command to list the number of workers, including any secondary workers, provisioned in your cluster.

gcloud dataproc clusters list \
    --project=PROJECT \
    --region=REGION \
    --filter=clusterName=CLUSTER_NAME
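
To read the provisioned primary worker count directly, you can also describe the cluster and extract the field with a format expression (a sketch; the field path follows the Cluster resource returned by the API):

gcloud dataproc clusters describe CLUSTER_NAME \
    --project=PROJECT \
    --region=REGION \
    --format="value(config.workerConfig.numInstances)"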