Google Adds Flex and Priority Tiers to Gemini API for Developers
Google has introduced two new service tiers to the Gemini API, giving developers more precise control over the tradeoff between cost and performance when running AI inference workloads. The additions, Flex and Priority, went live on April 1, 2026, and sit alongside the existing Standard and Batch tiers in Google’s inference offering.
The launch addresses a longstanding challenge for developers building on large language models: the one-size-fits-all pricing of standard API access makes it expensive to run background or non-time-sensitive tasks, while providing no guaranteed performance headroom for critical applications that need fast, reliable responses.

How Each Tier Works
Flex Inference is designed for workloads that can tolerate variable latency in exchange for substantially lower costs. Google is offering Flex at 50 percent below standard rates. The tradeoff is that response times are not guaranteed, with a target window of one to 15 minutes. This makes Flex well suited for tasks like background CRM updates, large-scale research simulations, and agentic workflows where the model is working in the background without a user waiting for an immediate result.
Priority Inference sits at the opposite end of the spectrum. It delivers ultra-low latency, measured in milliseconds to seconds, at a price premium of 75 to 100 percent above standard rates. The tier is restricted to developers on Tier 2 and Tier 3 billing accounts, which require cumulative Google Cloud spending thresholds of $100 and $1,000 respectively. Target use cases include live customer chatbots, real-time fraud detection systems, and business-critical AI copilots.
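The routing decision the two tiers imply can be sketched in a few lines. This is an illustrative assumption, not a published Google mechanism: the tier names follow the article, but the thresholds and rules below are made up for the sake of the example.

```python
# Hypothetical tier selector: tier names come from the article, but the
# routing thresholds here are illustrative assumptions, not any documented
# Gemini API behavior.

def choose_tier(latency_tolerance_s: float, business_critical: bool) -> str:
    """Pick a service tier from a workload's latency tolerance.

    Flex targets a 1-to-15-minute response window at half the standard
    rate; Priority targets millisecond-to-second latency at a 75-100
    percent premium.
    """
    if business_critical and latency_tolerance_s < 5:
        return "priority"
    if latency_tolerance_s >= 60:   # the caller can wait a minute or more
        return "flex"
    return "standard"

# Background CRM update: no user is waiting, so route it to Flex.
print(choose_tier(latency_tolerance_s=900, business_critical=False))  # flex
# Live chatbot turn: a user is waiting, so route it to Priority.
print(choose_tier(latency_tolerance_s=2, business_critical=True))     # priority
```

In practice the dividing line is simply whether a human is blocked on the response; anything that is not user-facing is a candidate for the cheaper tier.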

Where This Fits in the Competitive Landscape
The tiered pricing model reflects a broader industry shift toward treating AI inference as a layered infrastructure service rather than a single commodity offering. Competitors including Anthropic and OpenAI have also been experimenting with different pricing structures for API access, and Google’s move adds another dimension to that competition.
For developers running large agentic pipelines, the Flex tier could represent meaningful cost savings. A workflow that involves dozens of background model calls per user session, currently priced at standard rates, could see its inference costs cut in half by switching eligible calls to Flex. For enterprise applications where downtime or slow responses carry business consequences, Priority provides a contractual commitment to performance that the Standard tier does not offer.
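The savings claim is easy to check with back-of-the-envelope arithmetic. The per-call cost and call volume below are invented for illustration; only the 50 percent Flex discount comes from the article.

```python
# Cost estimate for moving eligible background calls to Flex at the
# article's 50 percent discount. The per-call cost and call count are
# assumed illustrative numbers, not Google pricing.

STANDARD_COST_PER_CALL = 0.002   # dollars per call, assumed
FLEX_DISCOUNT = 0.50             # Flex is 50 percent below standard rates

def monthly_cost(calls: int, flex_eligible_fraction: float) -> float:
    flex_calls = calls * flex_eligible_fraction
    standard_calls = calls - flex_calls
    return (standard_calls * STANDARD_COST_PER_CALL
            + flex_calls * STANDARD_COST_PER_CALL * (1 - FLEX_DISCOUNT))

# All calls at standard rates vs. all eligible calls moved to Flex.
print(monthly_cost(1_000_000, flex_eligible_fraction=0.0))
print(monthly_cost(1_000_000, flex_eligible_fraction=1.0))
```

A pipeline whose background calls are all Flex-eligible halves its inference bill; one where only a fraction qualifies saves proportionally less.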

Availability and Access
Flex Inference is available to all paid Gemini API accounts and works with both GenerateContent and Interactions API requests. Priority Inference requires qualifying billing thresholds, positioning it as a tier for established enterprise customers rather than new or small-scale developers. Google has published detailed documentation for both tiers through its AI for Developers portal, including guidance on which workloads are appropriate for each service level.
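At the request level, a tiered call might look like the sketch below. The `service_tier` field name and the exact request shape are assumptions for illustration only; Google's AI for Developers documentation defines the actual mechanism for selecting a tier on a GenerateContent request.

```python
# Sketch of tagging a GenerateContent-style request with a service tier.
# The "service_tier" field is an assumed placeholder; the real Gemini API
# request shape may differ, so consult the official docs before use.
import json

def build_generate_content_request(model: str, prompt: str, tier: str) -> str:
    payload = {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        # Assumed field: the real tier-selection mechanism may differ.
        "service_tier": tier,
    }
    return json.dumps(payload)

body = build_generate_content_request(
    "gemini-pro", "Summarize this CRM note.", "flex"
)
print(body)
```

The point of the sketch is that tier selection would be a per-request decision, letting one application mix Flex background calls and Priority user-facing calls against the same billing account.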
The expansion of Gemini’s API options comes alongside recent announcements from Google including the Veo 3.1 Lite video generation model and updates to Google Vids, as the company continues to build out its developer and enterprise AI product portfolio.