Predicted Outputs
Reduce latency for model responses where much of the response is known ahead of time.
Predicted Outputs enable you to speed up API responses from Chat Completions when many of the output tokens are known ahead of time. This is most common when you are regenerating a text or code file with only minor modifications. Predicted Outputs are available with the GPT-4o and GPT-4o-mini series of models. Read on to learn how to use Predicted Outputs to reduce latency in your applications.
Code Refactoring Example
Predicted Outputs are particularly useful for regenerating text documents and code files with small modifications. For instance, suppose you want the model to refactor a piece of TypeScript code by converting the username property of the User class to an email property. Most of the file will remain unchanged except for the affected line.
Consider the original code:
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}

export default User;
In this situation, if you use the current text of the code file as your prediction, you can regenerate the entire file with lower latency. Such time savings are more significant for larger files.
Below is an example that uses the prediction parameter in our SDKs to inform the API that the final output will be very similar to the provided code file, which is used as the prediction text.
Refactor a TypeScript Class with a Predicted Output
Using JavaScript (with the OpenAI SDK):
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}
export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
`;

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  }
});

// Inspect returned data
console.log(completion);
console.log(completion.choices[0].message.content);
Using Python:
from openai import OpenAI

code = """
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}
export default User;
"""

refactor_prompt = """
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
"""

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": refactor_prompt},
        {"role": "user", "content": code},
    ],
    prediction={
        "type": "content",
        "content": code,
    },
)

# Inspect returned data
print(completion)
print(completion.choices[0].message.content)
Using curl (with your API key in the OPENAI_API_KEY environment variable):
curl "https://api.openai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": "Replace the username property with an email property. Respond only with code, and with no markdown formatting."
      },
      {
        "role": "user",
        "content": "$CODE_CONTENT_HERE"
      }
    ],
    "prediction": {
      "type": "content",
      "content": "$CODE_CONTENT_HERE"
    }
  }'
Understanding the Response
In addition to the refactored code, the model response contains detailed usage information. A sample response might look like the following:
{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1730918466,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* ...actual text response here... */ ],
  usage: {
    prompt_tokens: 81,
    completion_tokens: 39,
    total_tokens: 120,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 18,
      rejected_prediction_tokens: 10
    }
  },
  system_fingerprint: 'fp_159d8341cc'
}
Note the accepted_prediction_tokens and rejected_prediction_tokens in the usage object. In this example, 18 tokens from the prediction were used to speed up the response, while 10 tokens were rejected. Keep in mind that any rejected tokens are still billed like other completion tokens generated by the API, so Predicted Outputs can increase costs if many tokens are rejected.
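To keep an eye on this trade-off in your own application, you can read these fields from the usage object after each call. Below is a minimal Python sketch, assuming the completion object returned by the earlier Python example and a recent version of the official SDK that exposes completion_tokens_details; the report_prediction_usage helper is illustrative, not part of the SDK.
# Minimal sketch: summarize how much of the prediction was actually used.
# Assumes `completion` is the object returned by client.chat.completions.create(...)
# in the earlier Python example.
def report_prediction_usage(completion):
    details = completion.usage.completion_tokens_details
    accepted = details.accepted_prediction_tokens
    rejected = details.rejected_prediction_tokens
    total = completion.usage.completion_tokens
    print(f"accepted prediction tokens: {accepted}")
    print(f"rejected prediction tokens: {rejected} (still billed as completion tokens)")
    if total:
        print(f"share of output covered by the prediction: {accepted / total:.0%}")

report_prediction_usage(completion)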
Streaming Example
The latency gains of Predicted Outputs are even greater when you use streaming for API responses. Below is an example using streaming for the same code refactoring use case.
Predicted Outputs with Streaming
Using JavaScript (with streaming):
import OpenAI from "openai";

const code = `
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}
export default User;
`.trim();

const openai = new OpenAI();

const refactorPrompt = `
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
`;

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: refactorPrompt
    },
    {
      role: "user",
      content: code
    }
  ],
  prediction: {
    type: "content",
    content: code
  },
  stream: true
});

// Inspect returned data as a stream
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Using Python (with streaming):
from openai import OpenAI

code = """
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}
export default User;
"""

refactor_prompt = """
Replace the "username" property with an "email" property. Respond only
with code, and with no markdown formatting.
"""

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": refactor_prompt},
        {"role": "user", "content": code},
    ],
    prediction={
        "type": "content",
        "content": code,
    },
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
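To see how much a prediction helps for your own streaming workload, you can time the request both ways. The following is a minimal Python sketch that measures time to first token and total time, reusing the client, code, and refactor_prompt variables from the streaming example above; the time_stream helper is illustrative, and actual numbers will vary with model, prompt size, and load.
import time

# Illustrative timing harness; reuses `client`, `code`, and `refactor_prompt`
# from the streaming example above.
def time_stream(prediction=None):
    kwargs = dict(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": refactor_prompt},
            {"role": "user", "content": code},
        ],
        stream=True,
    )
    if prediction is not None:
        kwargs["prediction"] = {"type": "content", "content": prediction}
    start = time.perf_counter()
    first_token = None
    for chunk in client.chat.completions.create(**kwargs):
        if first_token is None and chunk.choices and chunk.choices[0].delta.content:
            first_token = time.perf_counter() - start
    return first_token, time.perf_counter() - start

print("without prediction (first token, total):", time_stream())
print("with prediction    (first token, total):", time_stream(prediction=code))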
Position of Predicted Text in Response
Your prediction text can appear anywhere within the generated response and still reduce latency. For example, suppose your predicted text is the complete code for a simple Hono server application. You can prompt the model to regenerate the file with a minor modification, such as adding a new route, while the bulk of the file remains as provided in your prediction.
For instance, consider the predicted text for a server:
import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});
You might prompt the model as follows:
Add a get route to this application that responds with the text "hello world".
Generate the entire application file again with this route added, and with no markdown formatting.
The resulting response might look like this:
import { serveStatic } from "@hono/node-server/serve-static";
import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();

app.get("/api", (c) => {
  return c.text("Hello Hono!");
});

app.get("/hello", (c) => {
  return c.text("hello world");
});

// You will need to build the client code first `pnpm run ui:build`
app.use(
  "/*",
  serveStatic({
    rewriteRequestPath: (path) => `./dist${path}`,
  })
);

const port = 3000;
console.log(`Server is running on port ${port}`);

serve({
  fetch: app.fetch,
  port,
});
In this response, you still see accepted prediction tokens even though the predicted text appears both before and after the new code added to the final response. For example:
{
  id: 'chatcmpl-xxx',
  object: 'chat.completion',
  created: 1731014771,
  model: 'gpt-4o-2024-08-06',
  choices: [ /* completion here... */ ],
  usage: {
    prompt_tokens: 203,
    completion_tokens: 159,
    total_tokens: 362,
    prompt_tokens_details: { cached_tokens: 0, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 0,
      audio_tokens: 0,
      accepted_prediction_tokens: 60,
      rejected_prediction_tokens: 0
    }
  },
  system_fingerprint: 'fp_9ee9e968ea'
}
Here, all predicted tokens were accepted because the entire content of the prediction was used in the final output.
Limitations
When using Predicted Outputs, bear in mind the following constraints:
- Predicted Outputs are only supported with specific models within the API (for example, GPT-4o and GPT-4o-mini series).
- When providing a prediction, any tokens that are not part of the final completion are still billed as completion tokens. Refer to rejected_prediction_tokens in the usage object to see how many prediction tokens were not used.
- The following API parameters are not supported when using Predicted Outputs (a defensive pre-flight check is sketched after this list):
  - Requesting multiple completion choices (n > 1)
  - Log probability requests (logprobs)
  - Presence penalties greater than 0 (presence_penalty > 0)
  - Frequency penalties greater than 0 (frequency_penalty > 0)
  - Audio inputs and outputs
  - Modalities other than text
  - Explicit maximum completion tokens (max_completion_tokens)
  - Function calling (the tools parameter)
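If you build Predicted Outputs into a shared request helper, it can be worth failing fast when a caller combines a prediction with one of the parameters above. The check below is a hypothetical Python sketch based on this list; it is not part of the OpenAI SDK, and the exact error behavior of the API may differ.
# Hypothetical pre-flight check for Chat Completions requests that include a
# prediction; based on the limitation list above, not part of the OpenAI SDK.
def check_prediction_request(**kwargs):
    if "prediction" not in kwargs:
        return kwargs
    problems = []
    if (kwargs.get("n") or 1) > 1:
        problems.append("n > 1")
    if kwargs.get("logprobs"):
        problems.append("logprobs")
    if (kwargs.get("presence_penalty") or 0) > 0:
        problems.append("presence_penalty > 0")
    if (kwargs.get("frequency_penalty") or 0) > 0:
        problems.append("frequency_penalty > 0")
    if kwargs.get("modalities") not in (None, ["text"]):
        problems.append("non-text modalities")
    for param in ("audio", "max_completion_tokens", "tools"):
        if kwargs.get(param) is not None:
            problems.append(param)
    if problems:
        raise ValueError("Not supported with Predicted Outputs: " + ", ".join(problems))
    return kwargs

# Example usage: raises before the API call instead of after a rejected request.
# client.chat.completions.create(**check_prediction_request(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": refactor_prompt},
#               {"role": "user", "content": code}],
#     prediction={"type": "content", "content": code},
# ))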