Prompt caching in generative response engines
This patent, assigned to OpenAI OpCo, LLC, describes a system and method for caching prompts used with generative response engines (such as large language models) within a cloud computing environment. The core idea is that when a user sends a prompt to a generative AI service via an API, the system hashes a portion of that prompt and uses the hash to route the request to the same server that previously processed a prompt with the same prefix. By reusing previously activated tokens (intermediate computational results) from cached prompts, the system avoids redundant computation, reduces latency, lowers power consumption, and passes savings on to the user through discounted pricing for cached input tokens. The patent also covers caching of multimodal inputs such as images and audio, and describes an automatic billing mechanism that credits users for reused cached tokens.
General Information
- Patent ID: US12596764
- Assignee: OpenAI OpCo, LLC (San Francisco, CA)
- Countries: United States
- Last Publishing Date: June 17, 2025
- Date of Patent: April 7, 2026
- Inventors:
  - Anushree Agrawal (San Francisco, CA)
  - Jeffrey Harris (San Francisco, CA)
  - Daryl Neubieser (San Francisco, CA)
  - Felipe Petroski Such (San Francisco, CA)
- Primary Examiner: Clayton R. Williams
- Status: Granted
Background
Generative response engines such as large language models represent a major milestone in artificial intelligence, enabling capabilities like text generation, translation, summarization, and code generation through advanced deep learning techniques. These engines operate on cloud computing infrastructure that requires specialized hardware (such as GPUs and multiply-accumulate units) for parallel floating point and vector computations.
Cloud computing services provide API access so that customers can use these engines without building their own infrastructure. However, a key challenge arises when customers reuse similar prompts repeatedly, for example to generate standardized summaries of documents, audio recordings, or images.
Without caching, the system redundantly processes the same input tokens each time, wasting compute resources, increasing latency, and driving up costs. Additionally, because generative response engines run across many hardware nodes in geographically distributed data centers, and API requests are randomly assigned to different nodes, there is no guarantee that a repeated prompt will land on the same server that previously processed it. This makes it difficult to reuse previously computed token activations. The patent addresses these challenges by introducing a deterministic prompt-routing and caching mechanism.
Methodology / Process Step by Step
This patent describes a system for caching prompts submitted to generative response engines (like LLMs) across distributed cloud infrastructure, enabling reuse of previously computed tokens to reduce latency, cost, and computational load.
Step 1: Client Sends API Request
A client sends an HTTP request to the cloud computing service’s front end. The request contains a header (with an API key for authentication and billing) and a body (containing the prompt as an array of data: text, images, audio, or other binary data).
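As a rough illustration of the request shape described above, the sketch below builds a header and a JSON body without sending anything. The field names (`prompt`, `type`, `text`, `data`), the `Bearer` scheme, and the base64 encoding of binary parts are assumptions made for this example; the summary does not specify the wire format.

```python
import base64
import json


def image_part(raw: bytes) -> dict:
    """Wrap binary data as a JSON-safe prompt part (hypothetical field names)."""
    return {"type": "image", "data": base64.b64encode(raw).decode("ascii")}


def build_request(api_key: str, prompt_parts: list) -> tuple:
    """Return (headers, body) for a hypothetical generation endpoint."""
    headers = {
        # The API key identifies the account for authentication and billing.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # The prompt is an ordered array of parts: text, images, audio, and so on.
    body = json.dumps({"prompt": prompt_parts})
    return headers, body


headers, body = build_request(
    "sk-example",
    [{"type": "text", "text": "Summarize this image."}, image_part(b"\x00\x01")],
)
```

Keeping the stable instructions at the front of the array matters for the later steps, because routing and cache matching operate on the prompt prefix.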
Step 2: Hash Generation from Prompt Prefix
The front end (or a distributor component) generates a deterministic hash based on a prefix of the prompt (e.g., the first 500 characters) combined with distinct user-identifying information (user ID, API key, or user-generated secret). This hash is always the same for the same input combination, ensuring consistent routing.
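A minimal sketch of this step, assuming SHA-256 as the hash function and a 500-character prefix (the summary gives 500 characters only as an example and does not name a hash algorithm). The function name `routing_hash` and the length-prefixed encoding of the user secret are illustrative choices, not details from the patent.

```python
import hashlib

PREFIX_CHARS = 500  # example prefix length from the summary; tunable


def routing_hash(prompt: str, user_secret: str, prefix_chars: int = PREFIX_CHARS) -> str:
    """Deterministic hex digest over (user secret, prompt prefix)."""
    prefix = prompt[:prefix_chars]
    secret_bytes = user_secret.encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the secret so that shifting characters between the
    # secret and the prompt cannot produce the same byte stream.
    h.update(len(secret_bytes).to_bytes(4, "big"))
    h.update(secret_bytes)
    h.update(prefix.encode("utf-8"))
    return h.hexdigest()


shared = "You are a summarizer. " * 40  # well over 500 characters
same_a = routing_hash(shared + "Document A", "user-1")
same_b = routing_hash(shared + "Document B", "user-1")
other_user = routing_hash(shared + "Document A", "user-2")
```

Here `same_a` and `same_b` are equal because the two prompts differ only after the hashed prefix, while `other_user` differs because the user-identifying input is part of the hash.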
Step 3: Generative Response Engine Selection via Hash
The hash is used to map the request to a specific generative response engine (i.e., a specific server/transformer engine) across geographically distributed data centers. This deterministic routing ensures that prompts with the same prefix are directed to the same hardware node, where previous token activations may still be cached.
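One simple way to turn a hash into an engine choice is modulo arithmetic over a fixed engine list, sketched below. This is an assumption for illustration: the summary says only that the hash maps to an engine, not how. The engine names are hypothetical.

```python
import hashlib

# Hypothetical engine identifiers spread across data centers.
ENGINES = ["us-east/engine-0", "us-east/engine-1", "eu-west/engine-0", "ap-south/engine-0"]


def select_engine(routing_hash_hex: str, engines: list) -> str:
    """Map a hex digest to one engine; the same digest always picks the same engine."""
    if not engines:
        raise ValueError("no engines available")
    index = int(routing_hash_hex, 16) % len(engines)
    return engines[index]


digest = hashlib.sha256(b"user-1|shared prompt prefix").hexdigest()
chosen = select_engine(digest, ENGINES)
```

A plain modulo remaps most hashes whenever the engine list changes size; a production system would more likely use consistent hashing or a lookup table to limit that churn, but the deterministic property is the same.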
Step 4: Cache Lookup
Before full processing, the system checks whether the prompt (or its prefix) matches a previously cached prompt:
- Per-user prompt cache (front end): A key-value store maps prompts to generative response engine identifiers per user account.
- Generative response engine prompt cache: The engine itself checks whether activated tokens from a prior prompt can be reloaded using the hash.
- Optional centralized cache: An in-memory database (e.g., Redis) within a data center can share cached tokens across multiple engines, though this requires explicit flagging and adds latency.
The engine uses a longest-prefix-match algorithm: it iteratively compares increasing portions of the prompt prefix against cached keys to find the longest cached match.
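The lookup can be sketched as a walk over growing prefixes. The version below assumes that prefixes are cached at fixed block boundaries (a hypothetical `BLOCK` of 4 tokens, chosen small for readability) and that every shorter block-aligned prefix of a cached prompt is also cached, so the walk can stop at the first miss. None of these specifics come from the patent.

```python
import hashlib

BLOCK = 4  # hypothetical block size in tokens


def prefix_key(tokens: list) -> str:
    """Hash a token prefix into a cache key."""
    return hashlib.sha256(" ".join(map(str, tokens)).encode("utf-8")).hexdigest()


def store_prefixes(cache: set, tokens: list, block: int = BLOCK) -> None:
    """Record a key for every block-aligned prefix of a processed prompt."""
    for end in range(block, len(tokens) + 1, block):
        cache.add(prefix_key(tokens[:end]))


def longest_cached_prefix(cache: set, tokens: list, block: int = BLOCK) -> int:
    """Return how many leading tokens of `tokens` are covered by the cache."""
    matched = 0
    for end in range(block, len(tokens) + 1, block):
        if prefix_key(tokens[:end]) not in cache:
            break  # longer prefixes cannot match once a shorter one misses
        matched = end
    return matched


cache = set()
store_prefixes(cache, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
hit_len = longest_cached_prefix(cache, [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
```

In this run `hit_len` is 8: the first two blocks match the earlier prompt and the third block diverges.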
Step 5: Token Activation & Inference
- Cache hit: The engine retrieves cached activated tokens (intermediate key-value attention pairs from the transformer model) and resumes inference from the point where the new prompt diverges from the cached prefix. Only the uncached tokens (the new/changed suffix) need to be tokenized and processed through the transformer layers.
- Cache miss: The entire prompt is tokenized from scratch, processed through all transformer layers, and the resulting activated tokens are stored in the cache for future reuse.
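The hit and miss paths can be illustrated with a toy engine that counts how many tokens it actually processes. The strings returned by `_activate` stand in for the attention key-value tensors; in a real transformer each token's activations depend on the preceding context, which this sketch deliberately ignores. `ToyEngine` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    cache: dict = field(default_factory=dict)  # token tuple -> activations
    processed: int = 0                         # tokens run through the "model"

    def _activate(self, token: int) -> str:
        self.processed += 1
        return f"kv({token})"  # placeholder for key/value attention tensors

    def run(self, tokens: list) -> tuple:
        """Return (activations, number of tokens served from cache)."""
        # Find the longest cached prompt that is a prefix of this one.
        best = ()
        for key in self.cache:
            if len(key) > len(best) and tuple(tokens[: len(key)]) == key:
                best = key
        activations = list(self.cache.get(best, []))
        # Only the uncached suffix is processed.
        for token in tokens[len(best):]:
            activations.append(self._activate(token))
        # Store the full result so later prompts can reuse it.
        self.cache[tuple(tokens)] = activations
        return activations, len(best)


engine = ToyEngine()
_, cached_first = engine.run([1, 2, 3, 4])         # miss: 4 tokens processed
_, cached_second = engine.run([1, 2, 3, 4, 5, 6])  # hit: only 2 new tokens processed
```

After both calls `engine.processed` is 6 rather than 10, which is the saving the cache is meant to deliver.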
Step 6: Multimodal Encoding Cache
For prompts containing binary data (images, audio, PDFs), an encoder converts the binary into tokens. These encoder tokens can also be cached independently of the text tokens. On subsequent requests with the same binary data, a checksum or hash of the binary is used to recall the cached encoder tokens, avoiding re-encoding.
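A content-addressed encoder cache along these lines might look like the sketch below. SHA-256 is an assumed choice of checksum, and `toy_encoder` is a stand-in that simply returns the first bytes as integers; a real encoder would be a vision or audio model.

```python
import hashlib


class EncoderCache:
    """Cache encoder output keyed by a hash of the raw binary input."""

    def __init__(self, encoder):
        self._encoder = encoder
        self._store = {}
        self.encode_calls = 0  # how many times the encoder actually ran

    def tokens_for(self, blob: bytes) -> list:
        key = hashlib.sha256(blob).hexdigest()
        if key not in self._store:
            self.encode_calls += 1
            self._store[key] = self._encoder(blob)
        return self._store[key]


def toy_encoder(blob: bytes) -> list:
    # Placeholder: a real encoder would produce learned token embeddings.
    return list(blob[:8])


cache = EncoderCache(toy_encoder)
first = cache.tokens_for(b"\x89PNG fake image bytes")
second = cache.tokens_for(b"\x89PNG fake image bytes")  # served from cache
```

Because the key depends only on the bytes, the same image attached to differently worded text prompts still hits the encoder cache.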
Step 7: Response Generation & Token Accounting
The generative response engine produces output tokens and returns:
- The generated response
- Token accounting metadata: number of input tokens, output tokens, cached input tokens, and the type of each token (text, image, audio)
Step 8: Billing / Credit Calculation
The front end calculates the cost for the API call:
- Cached input tokens are discounted (e.g., at 50% off, 2,048 cached tokens are billed as 1,024)
- Different token types (text vs. image vs. audio) may have different pricing
- A minimum of 1,024 input tokens may be required to activate caching discounts
- Credits are applied to the user account associated with the API key
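The discount arithmetic can be written out as a small function. The 50% rate and the 1,024-token threshold are the example values from the list above; treating the threshold as a cutoff on total input tokens, and ignoring per-type pricing, are simplifying assumptions of this sketch.

```python
MIN_TOKENS_FOR_CACHING = 1024  # example threshold from the summary
CACHED_DISCOUNT = 0.5          # example discount from the summary


def billable_input_tokens(input_tokens: int, cached_tokens: int) -> float:
    """Return the number of input tokens to bill after the cache discount."""
    if cached_tokens < 0 or cached_tokens > input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if input_tokens < MIN_TOKENS_FOR_CACHING:
        cached_tokens = 0  # prompt too short to qualify for the discount
    uncached = input_tokens - cached_tokens
    return uncached + cached_tokens * (1 - CACHED_DISCOUNT)


fully_cached = billable_input_tokens(2048, 2048)  # 1024.0
partly_cached = billable_input_tokens(3000, 2048)  # 952 + 1024 = 1976.0
```

Multiplying the result by a per-token price, possibly a different one for each token type, gives the charge applied to the account behind the API key.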
Step 9: API Response to Client
The front end returns the API response to the client, including:
- The generated content
- Metadata: number of input tokens, output tokens, cached input tokens, and token types
The client can use this metadata for its own analytics and cost tracking.
Step 10: Caching Window
Cached tokens have a time-limited availability window:
- Minimum caching duration: ~300 seconds (5 minutes)
- After this minimum, the probability that the cache has been evicted increases over time
- Cached tokens typically become unavailable after about one hour
- Within this window, subsequent prompts sharing the same prefix benefit from the cache
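To make the window concrete, the sketch below models eviction probability as a function of cache age. The 300-second floor and one-hour ceiling come from the list above; the linear ramp between them is purely an assumption for illustration, since the summary says only that the probability increases.

```python
MIN_TTL_S = 300    # cache is kept at least this long (from the summary)
MAX_TTL_S = 3600   # cache is typically gone after about this long (from the summary)


def eviction_probability(age_s: float) -> float:
    """Assumed linear ramp from 0 at MIN_TTL_S to 1 at MAX_TTL_S."""
    if age_s <= MIN_TTL_S:
        return 0.0
    if age_s >= MAX_TTL_S:
        return 1.0
    return (age_s - MIN_TTL_S) / (MAX_TTL_S - MIN_TTL_S)


halfway = eviction_probability(1950)  # midpoint of the ramp: 0.5
```

For a client, the practical reading is that requests sharing a prefix should be sent within a few minutes of each other to be confident of a cache hit.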
Step 11: Deferred vs. Streaming Routing
The system can also route based on response urgency:
- Deferred responses (e.g., within 24 hours) may be queued and routed to data centers based on operational conditions (load, temperature)
- Streaming responses (via WebSocket) are routed to the geographically closest data center to minimize latency
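A minimal sketch of this routing decision, assuming each data center reports a distance to the client and a load figure. The `DataCenter` fields and the "lowest load wins" rule for deferred work are hypothetical; the summary also mentions temperature, which is left out here for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str
    distance_km: float  # distance from the client
    load: float         # current utilization, 0.0 to 1.0


def route(mode: str, centers: list) -> DataCenter:
    """Pick a data center based on how urgently the response is needed."""
    if not centers:
        raise ValueError("no data centers available")
    if mode == "streaming":
        # Latency-sensitive: choose the geographically closest site.
        return min(centers, key=lambda dc: dc.distance_km)
    if mode == "deferred":
        # Not latency-sensitive: choose the least loaded site.
        return min(centers, key=lambda dc: dc.load)
    raise ValueError(f"unknown mode: {mode}")


centers = [
    DataCenter("us-west", distance_km=50, load=0.9),
    DataCenter("eu-north", distance_km=8000, load=0.2),
]
streaming_target = route("streaming", centers)  # us-west
deferred_target = route("deferred", centers)    # eu-north
```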



