Resources and quotas

The Dataflow service fully manages resources in Google Cloud on a per-job basis. This includes spinning up and shutting down Compute Engine instances (occasionally referred to as workers or VMs) and accessing your project's Cloud Storage buckets for both I/O and temporary file staging. However, if your pipeline interacts with Google Cloud data storage technologies like BigQuery and Pub/Sub, you must manage the resources and quota for those services.

Dataflow stages files in a user-provided Cloud Storage location. This location is under your control, and you should ensure that it remains available for as long as any job is reading from it. You can reuse the same staging location for multiple job runs, because the SDK's built-in caching can speed up the start time for your jobs.
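
For example, the following is a minimal Python SDK sketch that sets an explicit staging location; the project ID, region, and bucket names are hypothetical placeholders rather than values taken from this page:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Reusing the same staging_location across runs lets previously uploaded
# artifacts be reused. All resource names below are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--staging_location=gs://my-bucket/staging',
    '--temp_location=gs://my-bucket/temp',
])

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)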

Jobs

You may run up to 25 concurrent Dataflow jobs per Google Cloud project; however, this limit can be increased by contacting Google Cloud Support. For more information, see Quotas.

The Dataflow service is currently limited to processing JSON job requests that are 20 MB in size or smaller. The size of the job request is specifically tied to the JSON representation of your pipeline; a larger pipeline means a larger request.

To estimate the size of your pipeline's JSON request, run your pipeline with the following option:

Java

--dataflowJobFile=<path to output file>

Python

--dataflow_job_file=<path to output file>

Go

Estimating the size of a job's JSON payload with a flag is not currently supported in Go.

Running your pipeline with this option writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request; the actual request is slightly larger because it includes some additional information.
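
As a rough sketch, if the pipeline was launched with --dataflow_job_file=/tmp/job.json (a hypothetical path), you can check the size of the resulting file, for example in Python:

import os

# Hypothetical output path passed via --dataflow_job_file (Java: --dataflowJobFile).
size_bytes = os.path.getsize('/tmp/job.json')
print('Approximate job request size: %.2f MB' % (size_bytes / (1024 * 1024)))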

For more information, see the troubleshooting page for "413 Request Entity Too Large" / "The size of serialized JSON representation of the pipeline exceeds the allowable limit".

In addition, your job's graph size must not exceed 10 MB. For more information, see the troubleshooting page for "The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.".

Workers

The Dataflow service currently allows a maximum of 1000 Compute Engine instances per job. For batch jobs, the default machine type is n1-standard-1. For streaming jobs, the default machine type for Streaming Engine-enabled jobs is n1-standard-2 and the default machine type for non-Streaming Engine jobs is n1-standard-4. When using the default machine types, the Dataflow service can therefore allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.

You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.

You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared-core machine types, such as the f1 and g1 series, are not supported under the Dataflow Service Level Agreement.

To allocate additional memory per worker thread, use a custom machine type with extended memory. For example, custom-2-15360-ext is an n1 machine type with 2 CPUs and 15 GB of memory. Dataflow considers the number of CPUs in a machine to determine the number of worker threads per worker VM. If your pipeline processes memory-intensive work, a custom machine type with extended memory can give more memory per worker thread. For more information, see Creating a custom VM instance.

Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. You can specify a machine type for your pipeline by setting the appropriate execution parameter at pipeline creation time.

Java

To change the machine type, set the --workerMachineType option.

Python

To change the machine type, set the --worker_machine_type option.

Go

To change the machine type, set the --worker_machine_type option.
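
For example, the following is a minimal Python SDK sketch (with hypothetical project, region, and bucket values) that requests the extended-memory custom machine type described earlier:

from apache_beam.options.pipeline_options import PipelineOptions

# custom-2-15360-ext: an n1-based custom machine type with 2 vCPUs and
# 15 GB of memory, using extended memory. All resource names are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--worker_machine_type=custom-2-15360-ext',
])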

Resource quota

The Dataflow service checks to ensure that your Google Cloud project has the Compute Engine resource quota required to run your job, both to start the job and scale to the maximum number of worker instances. Your job will fail to start if there is not enough resource quota available.

If your Dataflow job deploys Compute Engine virtual machines as a Managed Instance Group, you'll need to ensure your project satisfies some additional quota requirements. Specifically, your project will need one of the following types of quota for each concurrent Dataflow job that you want to run:

  • One Instance Group per job
  • One Managed Instance Group per job
  • One Instance Template per job

Caution: Manually changing your Dataflow job's Instance Template or Managed Instance Group is not recommended or supported. Use Dataflow's pipeline configuration options instead.

Dataflow's Horizontal Autoscaling feature is limited by your project's available Compute Engine quota. If your job has sufficient quota when it starts, but another job uses the remainder of your project's available quota, the first job will run but not be able to fully scale.

The Dataflow service does not manage quota increases for jobs that exceed the resource quotas in your project. You are responsible for requesting any additional resource quota that you need, which you can do in the Google Cloud console.

IP addresses

By default, Dataflow assigns both public and private IP addresses to worker VMs. A public IP address satisfies one of the criteria for internet access, but it also counts against your quota of external IP addresses.

If your worker VMs don't need access to the public internet, consider using only internal IP addresses, which don't count against your external IP address quota. For more information, see the Dataflow documentation on configuring IP addresses.
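
As a sketch, the Python SDK exposes a --no_use_public_ips option for requesting internal-only worker IP addresses; the project, bucket, and subnetwork names below are hypothetical, and the subnetwork is assumed to allow workers to reach Google APIs and services (for example, through Private Google Access):

from apache_beam.options.pipeline_options import PipelineOptions

# Internal-only worker IPs; the subnetwork is a placeholder and is assumed
# to allow access to Google APIs (for example, via Private Google Access).
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--no_use_public_ips',
    '--subnetwork=regions/us-central1/subnetworks/my-subnet',
])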

Inactive workers

If the workers for a given job don't exhibit sufficient activity over a one-hour period, the job fails. Worker inactivity can result from dependency management problems. For example, if a worker encounters an issue while installing dependencies for a custom container image, the worker might fail to start or fail to make progress. The lack of progress can then cause the job to fail. To learn more, see the related Dataflow troubleshooting documentation.

Persistent disk resources

The Dataflow service is limited to 15 persistent disks per worker instance when running a streaming job. Each persistent disk is local to an individual Compute Engine virtual machine. Your job may not have more workers than persistent disks; a 1:1 ratio between workers and disks is the minimum resource allotment.

Jobs using Streaming Engine use 30 GB boot disks. Jobs using Dataflow Shuffle use 25 GB boot disks. For jobs that are not using these offerings, the default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode.

Locations

By default, the Dataflow service deploys Compute Engine resources in the us-central1-f zone of the us-central1 region. You can override this setting by specifying the --region parameter. If you need to use a specific zone for your resources, use the --zone parameter when you create your pipeline. However, we recommend that you only specify the region, and leave the zone unspecified. This allows the Dataflow service to automatically select the best zone within the region based on the available zone capacity at the time of the job creation request. For more information, see the Dataflow regions documentation.
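
For example, the following is a minimal Python SDK sketch (hypothetical project and bucket) that specifies only the region and leaves the zone unset so that the Dataflow service chooses it; the commented-out worker zone flag shows how a specific zone could be pinned if required:

from apache_beam.options.pipeline_options import PipelineOptions

# Recommended: specify only the region and let Dataflow pick the zone.
# All resource names below are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=europe-west1',
    '--temp_location=gs://my-bucket/temp',
    # '--worker_zone=europe-west1-b',  # only if you must pin a specific zone
])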