Author: Olaf Kopp

Prompt caching in generative response engines



This patent, assigned to OpenAI OpCo, LLC, describes a system and method for caching prompts used with generative response engines (such as large language models) within a cloud computing environment. The core idea is that when a user sends a prompt to a generative AI service via an API, the system can hash a portion of that prompt and use it to route the request to the same server that previously processed a similar prompt. By reusing previously activated tokens (intermediate computational results) from cached prompts, the system avoids redundant computations, reduces latency, lowers power consumption, and enables cost savings for the user through discounted input token pricing. The patent also covers caching of multimodal inputs such as images and audio, and describes an automatic billing mechanism that credits users for reused cached tokens.

General Information

  • Patent ID: US12596764
  • Assignee: OpenAI OpCo, LLC (San Francisco, CA)
  • Countries: United States
  • Last Publishing Date: June 17, 2025
  • Date of Patent: April 7, 2026
  • Inventors:
    • Anushree Agrawal (San Francisco, CA)
    • Jeffrey Harris (San Francisco, CA)
    • Daryl Neubieser (San Francisco, CA)
    • Felipe Petroski Such (San Francisco, CA)
  • Primary Examiner: Clayton R. Williams
  • Status: Granted

Background

Generative response engines such as large language models represent a major milestone in artificial intelligence, enabling capabilities like text generation, translation, summarization, and code generation through advanced deep learning techniques. These engines operate on cloud computing infrastructure that requires specialized hardware (such as GPUs and multiply-accumulate units) for parallel floating point and vector computations.

Cloud computing services provide API access so that customers can use these engines without building their own infrastructure. However, a key challenge arises when customers reuse similar prompts repeatedly — for example, to generate standardized summaries of documents, audio recordings, or images.

Without caching, the system redundantly processes the same input tokens each time, wasting compute resources, increasing latency, and driving up costs. Additionally, because generative response engines run across many hardware nodes in geographically distributed data centers, and API requests are randomly assigned to different nodes, there is no guarantee that a repeated prompt will land on the same server that previously processed it. This makes it difficult to reuse previously computed token activations. The patent addresses these challenges by introducing a deterministic prompt-routing and caching mechanism.

Methodology / Process Step by Step

This patent describes a system for caching prompts submitted to generative response engines (like LLMs) across distributed cloud infrastructure, enabling reuse of previously computed tokens to reduce latency, cost, and computational load.

Step 1: Client Sends API Request

A client sends an HTTP request to the cloud computing service’s front end. The request contains a header (with an API key for authentication and billing) and a body (containing the prompt as an array of data—text, images, audio, or other binary data).
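As a rough illustration of the request shape described above, the following Python sketch builds (but does not send) such a request. The endpoint URL, the `prompt` field name, and the `type`/`data` keys are hypothetical placeholders chosen for this example; they are not taken from the patent or from any real API.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
API_URL = "https://api.example.com/v1/responses"


def build_request(api_key: str, prompt_parts: list[dict]) -> urllib.request.Request:
    """Build (but do not send) an HTTP request with an auth header and a prompt body."""
    body = json.dumps({"prompt": prompt_parts}).encode("utf-8")
    headers = {
        # The API key ties the call to an account for authentication and billing.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


# The prompt is an array of parts: text plus (placeholder) binary data.
request = build_request(
    "sk-demo",
    [
        {"type": "text", "data": "Summarize the attached report."},
        {"type": "image", "data": "<base64-encoded bytes>"},
    ],
)
```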

Step 2: Hash Generation from Prompt Prefix

The front end (or a distributor component) generates a deterministic hash based on a prefix of the prompt (e.g., the first 500 characters) combined with distinct user-identifying information (user ID, API key, or user-generated secret). This hash is always the same for the same input combination, ensuring consistent routing.
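A minimal sketch of this step, assuming SHA-256 as the hash function and a plain user ID as the identifying information (the patent summary above does not name a specific algorithm, and the function name `routing_hash` is made up for this example):

```python
import hashlib

PREFIX_CHARS = 500  # prefix length from the example in the text


def routing_hash(prompt: str, user_id: str) -> str:
    """Deterministic hash of the user identity plus the first PREFIX_CHARS characters."""
    prefix = prompt[:PREFIX_CHARS]
    # A separator byte keeps the user ID and the prefix from running together.
    material = f"{user_id}\x00{prefix}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()  # SHA-256 is an assumption


# A long, stable instruction block (more than 500 characters) followed by a variable part.
system_prompt = "You summarize legal documents in five bullet points. " * 12

a = routing_hash(system_prompt + "Document A ...", "user-123")
b = routing_hash(system_prompt + "Document B ...", "user-123")
c = routing_hash(system_prompt + "Document A ...", "user-456")
```

Here `a` and `b` are identical because both prompts share the same first 500 characters and the same user, while `c` differs because the user ID differs.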

Step 3: Generative Response Engine Selection via Hash

The hash is used to map the request to a specific generative response engine (i.e., a specific server/transformer engine) across geographically distributed data centers. This deterministic routing ensures that prompts with the same prefix are directed to the same hardware node, where previous token activations may still be cached.
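One simple way to picture the mapping is a modulo over a list of engines. This is only a sketch under that assumption: the engine identifiers are invented, and a production system might well use a different scheme (for example consistent hashing) that the summary above does not specify.

```python
import hashlib

# Hypothetical engine identifiers spread over two regions.
ENGINES = [
    "us-east/engine-0",
    "us-east/engine-1",
    "eu-west/engine-0",
    "eu-west/engine-1",
]


def select_engine(hash_hex: str, engines: list[str]) -> str:
    """Map a hex digest to one engine; the same digest always yields the same engine."""
    return engines[int(hash_hex, 16) % len(engines)]


digest = hashlib.sha256(b"user-123\x00stable prompt prefix").hexdigest()
chosen = select_engine(digest, ENGINES)
```

Because the digest is deterministic, repeated requests with the same prefix and user land on `chosen` every time, which is what makes a later cache hit possible.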

Step 4: Cache Lookup

Before full processing, the system checks whether the prompt (or its prefix) matches a previously cached prompt:

  • Per-user prompt cache (front end): A key-value store maps prompts to generative response engine identifiers per user account.
  • Generative response engine prompt cache: The engine itself checks whether activated tokens from a prior prompt can be reloaded using the hash.
  • Optional centralized cache: An in-memory database (e.g., Redis) within a data center can share cached tokens across multiple engines, though this requires explicit flagging and adds latency.

A longest prefix match algorithm is used: the engine iteratively compares increasing portions of the prompt prefix against cached keys until a match is found.
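The iterative comparison can be sketched as follows. The step size of 128 characters is borrowed from the figure mentioned later in this article; the function name and the use of full cached prompt strings as keys are assumptions made to keep the example small.

```python
def longest_prefix_match(prompt: str, cached_prompts: list[str], step: int = 128):
    """Return (best_cached_prompt, matched_length), or (None, 0) if nothing matches."""
    candidates = list(cached_prompts)
    best, matched = None, 0
    length = step
    # Grow the compared prefix step by step, narrowing the candidate set each time.
    while candidates and length <= len(prompt):
        head = prompt[:length]
        candidates = [c for c in candidates if c.startswith(head)]
        if candidates:
            best, matched = candidates[0], length
        length += step
    return best, matched


shared = "S" * 300  # stands in for a long, stable instruction block
cached = [shared + "old question", "X" * 300]

hit = longest_prefix_match(shared + "new question", cached)
miss = longest_prefix_match("Z" * 200, cached)
```

In this toy run, `hit` matches the first cached prompt over 256 characters (two full steps), while `miss` finds no candidate at all.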

Step 5: Token Activation & Inference

  • Cache hit: The engine retrieves cached activated tokens (intermediate key-value attention pairs from the transformer model) and resumes inference from the point where the new prompt diverges from the cached prefix. Only the uncached tokens (the new/changed suffix) need to be tokenized and processed through the transformer layers.
  • Cache miss: The entire prompt is tokenized from scratch, processed through all transformer layers, and the resulting activated tokens are stored in the cache for future reuse.
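The hit and miss paths can be illustrated with a toy engine. This is a sketch, not a transformer: the "activations" are placeholder strings rather than key-value attention tensors, and `ToyEngine` and `common_prefix_len` are names invented for this example. It only shows the accounting idea, namely that a repeated prefix is not processed again.

```python
from dataclasses import dataclass, field


def common_prefix_len(a, b) -> int:
    """Number of leading items that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToyEngine:
    # Maps a previously seen token sequence to its placeholder activations.
    cache: dict = field(default_factory=dict)
    tokens_processed: int = 0  # counts the expensive per-token work

    def _activate(self, token: str) -> str:
        self.tokens_processed += 1
        return f"act({token})"

    def run(self, tokens: list[str]) -> list[str]:
        # Find the cached prompt that shares the longest prefix with this one.
        best_key, hit_len = None, 0
        for key in self.cache:
            n = common_prefix_len(key, tokens)
            if n > hit_len:
                best_key, hit_len = key, n
        # Cache hit: reuse activations for the shared prefix. Cache miss: start empty.
        activations = list(self.cache[best_key][:hit_len]) if best_key is not None else []
        # Only the uncached suffix is processed.
        activations += [self._activate(t) for t in tokens[hit_len:]]
        self.cache[tuple(tokens)] = activations
        return activations


engine = ToyEngine()
first = engine.run("sys ctx1 ctx2 ctx3 question-one".split())   # miss: 5 tokens processed
second = engine.run("sys ctx1 ctx2 ctx3 question-two".split())  # hit: 1 token processed
```

After both calls `engine.tokens_processed` is 6 rather than 10, because the four shared prefix tokens were reused on the second request.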

Step 6: Multimodal Encoding Cache

For prompts containing binary data (images, audio, PDFs), an encoder converts the binary into tokens. These encoder tokens can also be cached independently of the text tokens. On subsequent requests with the same binary data, a checksum or hash of the binary is used to recall the cached encoder tokens, avoiding re-encoding.

Step 7: Response Generation & Token Accounting

The generative response engine produces output tokens and returns:

  • The generated response
  • Token accounting metadata: number of input tokens, output tokens, cached input tokens, and the type of each token (text, image, audio)

Step 8: Billing / Credit Calculation

The front end calculates the cost for the API call:

  • Cached input tokens are discounted (e.g., 50% off—2,048 cached tokens are billed as 1,024)
  • Different token types (text vs. image vs. audio) may have different pricing
  • A minimum of 1,024 input tokens may be required to activate caching discounts
  • Credits are applied to the user account associated with the API key
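The discount arithmetic from the list above can be written out as a short function. The 50% discount and the 1,024-token minimum come from the examples in the text; the function name and the decision to ignore per-type pricing are simplifications for this sketch.

```python
def billable_input_tokens(
    input_tokens: int,
    cached_tokens: int,
    discount: float = 0.5,   # example discount from the text
    min_tokens: int = 1024,  # example minimum from the text
) -> float:
    """Number of input tokens billed after applying the cached-token discount."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if input_tokens < min_tokens:
        cached_tokens = 0  # below the threshold, no caching discount applies
    uncached = input_tokens - cached_tokens
    return uncached + cached_tokens * (1.0 - discount)


fully_cached = billable_input_tokens(2048, 2048)  # 2,048 cached tokens billed as 1,024
partly_cached = billable_input_tokens(3000, 2048)  # 952 full-price + 1,024 discounted
too_short = billable_input_tokens(800, 512)        # under the minimum: no discount
```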

Step 9: API Response to Client

The front end returns the API response to the client, including:

  • The generated content
  • Metadata: number of input tokens, output tokens, cached input tokens, and token types
  • The client can use this metadata for its own analytics and cost tracking
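On the client side, this metadata is enough to track how much of the input was served from the cache. The sketch below assumes a simple `Usage` record with hypothetical field names; the actual response schema is not given in the summary above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    # Field names are illustrative, not taken from a real API response.
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int


def cache_hit_ratio(usages: list[Usage]) -> float:
    """Share of all input tokens that were reported as cached."""
    total = sum(u.input_tokens for u in usages)
    cached = sum(u.cached_input_tokens for u in usages)
    return cached / total if total else 0.0


calls = [
    Usage(input_tokens=2000, output_tokens=100, cached_input_tokens=0),     # first call: miss
    Usage(input_tokens=2000, output_tokens=120, cached_input_tokens=1500),  # repeat: mostly cached
]
ratio = cache_hit_ratio(calls)  # 1500 / 4000
```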

Step 10: Caching Window

Cached tokens have a time-limited availability window:

  • Minimum caching duration: ~300 seconds (5 minutes)
  • After this minimum, there is an increasing probability that the cache is evicted
  • Cached tokens typically become unavailable after about one hour
  • Within this window, subsequent prompts sharing the same prefix benefit from the cache
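The window can be modeled as an eviction probability that depends on the age of the cache entry. The 300-second minimum and the roughly one-hour limit come from the list above; the linear ramp between them is purely an assumption for illustration, since the text only says the probability increases.

```python
import random

MIN_TTL_S = 300    # guaranteed minimum caching duration from the text
MAX_TTL_S = 3600   # "about one hour" from the text


def eviction_probability(age_s: float) -> float:
    """Probability that an entry of this age has been evicted (linear ramp is assumed)."""
    if age_s <= MIN_TTL_S:
        return 0.0
    if age_s >= MAX_TTL_S:
        return 1.0
    return (age_s - MIN_TTL_S) / (MAX_TTL_S - MIN_TTL_S)


def is_still_cached(age_s: float, rng: random.Random) -> bool:
    """Simulate whether an entry of the given age survives."""
    return rng.random() >= eviction_probability(age_s)


rng = random.Random(0)
fresh = is_still_cached(120, rng)    # inside the minimum window: always cached
stale = is_still_cached(4000, rng)   # past the upper bound: always evicted
```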

Step 11: Deferred vs. Streaming Routing

The system can also route based on response urgency:

  • Deferred responses (e.g., within 24 hours) may be queued and routed to data centers based on operational conditions (load, temperature)
  • Streaming responses (via WebSocket) are routed to the geographically closest data center to minimize latency
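A compact sketch of this routing decision, assuming each data center is described by its distance to the client and a load figure. The `DataCenter` fields, the mode names, and the use of load as the only operational condition are simplifications made for this example.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    distance_km: float  # distance to the client
    load: float         # 0.0 (idle) to 1.0 (saturated)


def route(mode: str, centers: list[DataCenter]) -> DataCenter:
    """Pick a data center: closest for streaming, least loaded for deferred work."""
    if mode == "streaming":
        return min(centers, key=lambda dc: dc.distance_km)
    if mode == "deferred":
        return min(centers, key=lambda dc: dc.load)
    raise ValueError(f"unknown mode: {mode}")


centers = [
    DataCenter("frankfurt", distance_km=300, load=0.9),
    DataCenter("virginia", distance_km=6500, load=0.2),
]
streaming_target = route("streaming", centers)  # nearest, despite higher load
deferred_target = route("deferred", centers)    # least loaded, despite the distance
```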

SEO & Generative Engine Optimization (GEO) Implications

While this is an infrastructure/efficiency patent rather than a ranking patent, it reveals important architectural behaviors of generative response engines that have direct implications for how content is processed and surfaced. Here are actionable insights anchored in the patent’s disclosed mechanisms:

1. Prompt Prefix Stability Drives Caching — Implications for System Prompts and Structured Queries

The patent reveals that the system uses the first ~500 characters of a prompt prefix combined with user identity to hash and route to the same engine. Cached token activations are reused based on longest prefix match.

GEO Implication: When generative AI systems (like ChatGPT, AI Mode) process queries about your brand or topic, the system prompt and initial context framing remain stable across many user queries. This means the LLM’s “understanding frame” is set early and reused. Content that aligns well with common system prompt framings (e.g., “You are a helpful assistant that provides accurate information about [topic]”) will be more efficiently processed when it appears in the variable suffix portion of the prompt.

Actionable Example: Structure your content so that the most critical brand and topical context appears in formats that match how LLMs typically frame queries. Use clear topic declarations at the top of pages (e.g., “This page covers [Brand X]’s approach to [specific attribute]”) so that when an LLM retrieves and inserts your content into its prompt, the relevant information is immediately processable within the cached context window.

2. Token Efficiency Matters — Write for Minimal Token Overhead

The patent extensively details how token counts (input and output) directly affect cost and processing. The system differentiates between text, image, and audio tokens, with different computational costs. Cached input tokens avoid redundant computation and may be billed at a discount (50% in the patent's example).

GEO Implication: LLM operators have strong economic incentives to favor content that can be processed in fewer tokens while delivering high information density. Content that is verbose, redundant, or poorly structured consumes more tokens and costs more to process.

Actionable Example:

  • Write concise, information-dense paragraphs rather than padding content with filler
  • Use structured data and clear labeling (headers, lists) that tokenize efficiently
  • Avoid excessive repetition of keywords — the LLM tokenizer treats repeated content as additional tokens without adding semantic value
  • Brand identity blocks should be compact but semantically complete: “Kopp Online Marketing is a Germany-based Generative Engine Optimization (GEO) consultancy specializing in LLM readability and brand context optimization for AI visibility” — this is a single, efficient chunk

3. Multimodal Content Gets Cached Separately — Optimize Binary Assets

The patent discloses that images, audio, and other binary data are encoded into tokens by a separate encoder, and these encoder tokens can be cached independently via checksums/hashes. Unchanged binary assets are not re-processed.

GEO Implication: When generative AI systems process your page content that includes images, the image encoding is a distinct computational step. Images that are stable, well-optimized, and consistently referenced across your site create opportunities for encoder token caching, making your multimodal content cheaper and faster to process.

Actionable Example:

  • Use consistent, high-quality hero images on key pages rather than constantly rotating visuals
  • Ensure images have proper alt text and contextual captions that reinforce the text content — this creates alignment between the text tokens and the image encoder tokens

4. The Caching Window Reveals Temporal Query Patterns

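The patent specifies a minimum 5-minute caching window with degradation up to about an hour. This means that when similar prompts sharing the same prefix arrive within a short timeframe, the system can reuse the same activated tokens. Because the routing hash also includes user-identifying information (see Step 2), this reuse applies to requests made under the same account or API key.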

GEO Implication: During trending topics or high-query-volume periods, the cached prompt context becomes especially influential. The content that was retrieved and tokenized for the first query in a cluster effectively shapes responses for subsequent similar queries within the caching window.

Actionable Example:

  • For time-sensitive topics (product launches, news events), publish optimized content early so your content is more likely to be retrieved and cached in the initial prompt processing
  • Ensure your content is crawlable and indexed before demand spikes — being in the first cached context window means your content’s tokenized representation persists for subsequent similar queries
  • This reinforces the importance of freshness signals and rapid publication for GEO

5. Chunk Relevance Meets Token Caching — Front-Load Your Key Information

The system caches based on prefix matching. The front of the prompt (which includes system instructions + retrieved context) is what gets cached. The variable part (user’s specific question) comes after.
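The effect of prompt ordering on the cacheable prefix can be checked with a few lines of Python. This is a sketch with made-up prompt text; `stable_first` and `variable_first` are hypothetical helper names, and `os.path.commonprefix` is used here simply because it compares strings character by character.

```python
import os

SYSTEM = "You are a helpful assistant that answers questions about sustainable packaging.\n"
CONTEXT = "Context: [Brand X] manufactures biodegradable packaging from plant-based materials.\n"


def stable_first(question: str) -> str:
    """Stable instructions and context first, variable question last."""
    return SYSTEM + CONTEXT + "Question: " + question


def variable_first(question: str) -> str:
    """Variable question first, stable parts afterwards."""
    return "Question: " + question + "\n" + SYSTEM + CONTEXT


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the character prefix two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


q1, q2 = "Is it compostable?", "Where is it sold?"
good = shared_prefix_len(stable_first(q1), stable_first(q2))
poor = shared_prefix_len(variable_first(q1), variable_first(q2))
```

With the stable parts first, the two prompts share the entire system and context block; with the question first, they share only the literal `Question: ` label, so almost nothing could be reused from a prefix cache.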

GEO Implication: When an LLM retrieves content from your page to construct its prompt, the beginning of your content chunk is more likely to fall within the cached prefix of repeated similar queries. This means the opening sentences of your content sections are disproportionately important for establishing context.

Actionable Example:

  • Front-load each content section with the most semantically rich, entity-defining statement
  • Instead of: “In recent years, there has been growing interest in sustainable packaging solutions…”
  • Write: “[Brand X] manufactures 100% biodegradable packaging from plant-based materials, serving the food industry across Europe.”
  • The second version provides entity, attribute, product, material, market, and geography in the first sentence — maximizing the semantic value within a potentially cached prefix

6. Longest Prefix Match for Cache Retrieval — Standardize Your Content Templates

The patent describes using longest prefix match to find cached prompts: the system iteratively increases match length from 128 characters until a unique match is found.

GEO Implication: This mechanism rewards standardized, template-based content structures. When your content follows consistent structural patterns, the tokenized representations of those patterns are more likely to match cached prefixes from prior queries.

Actionable Example:

  • Use consistent content templates for product pages, FAQ entries, and service descriptions
  • For example, always structure product pages as: [Product Name] → [One-sentence definition] → [Key attributes list] → [Use cases] → [Differentiators]
  • This structural consistency means the LLM’s tokenized representation of your content type becomes familiar and cacheable, potentially improving processing efficiency and thus likelihood of inclusion in generated responses
