# openai-images-and-vision

Images and vision
=================

Learn how to use vision capabilities to understand images.

**Vision** is the ability to use images as input prompts to a model, and generate responses based on the data inside those images. Find out which models are capable of vision [on the models page](/docs/models). To generate images as _output_, see our [specialized model for image generation](/docs/guides/image-generation).

You can provide images as input to generation requests either by providing a fully qualified URL to an image file, or providing an image as a Base64-encoded data URL.

**Passing a URL**

Analyze the content of an image:

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: [{
    role: "user",
    content: [
      { type: "input_text", text: "what's in this image?" },
      {
        type: "input_image",
        image_url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
      },
    ],
  }],
});

console.log(response.output_text);
```

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
        ],
    }],
)

print(response.output_text)
```

```bash
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1-mini",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        ]
      }
    ]
  }'
```

**Passing a Base64 encoded image**

Analyze the content of an image:

```javascript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

const imagePath = "path_to_your_image.jpg";
const base64Image = fs.readFileSync(imagePath, "base64");

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: [
    {
      role: "user",
      content: [
        { type: "input_text", text: "what's in this image?" },
        {
          type: "input_image",
          image_url: `data:image/jpeg;base64,${base64Image}`,
        },
      ],
    },
  ],
});

console.log(response.output_text);
```

```python
import base64
from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "what's in this image?" },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        }
    ],
)

print(response.output_text)
```

Image input requirements
------------------------

Input images must meet the following requirements to be used in the API.

| Supported file types | Size limits | Other requirements |
| --- | --- | --- |
| PNG (.png)<br>JPEG (.jpeg and .jpg)<br>WEBP (.webp)<br>Non-animated GIF (.gif) | Up to 20MB per image<br>Up to 500 individual images per request<br>Up to 50 MB image bytes per request<br>Low-resolution: 512px x 512px<br>High-resolution: 768px (short side) x 2000px (long side) | No watermarks or logos<br>No text<br>No NSFW content<br>Clear enough for a human to understand |
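If you are passing Base64-encoded images, it can be useful to check a file against these limits before building the request. Below is a minimal sketch based on the per-image limits in the table above; the helper name and the specific checks are illustrative assumptions, not part of the SDK:

```python
import base64
import mimetypes
import os

# Limits taken from the requirements table above.
SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # up to 20 MB per image

def to_image_data_url(image_path: str) -> str:
    """Validate a local image against the input limits and return a Base64 data URL."""
    ext = os.path.splitext(image_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")

    if os.path.getsize(image_path) > MAX_IMAGE_BYTES:
        raise ValueError("Image exceeds the 20 MB per-image limit")

    mime_type = mimetypes.guess_type(image_path)[0] or "application/octet-stream"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `image_url` of an `input_image` content part, as in the Base64 example above.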
Specify image input detail level
--------------------------------

The `detail` parameter tells the model what level of detail to use when processing and understanding the image (`low`, `high`, or `auto` to let the model decide). If you skip the parameter, the model will use `auto`. Put it right after your `image_url`, like this:

```plain
{
    "type": "input_image",
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
    "detail": "high"
}
```

You can save tokens and speed up responses by using `"detail": "low"`. This lets the model process the image with a budget of 85 tokens. The model receives a low-resolution 512px x 512px version of the image. This is fine if your use case doesn't require the model to see with high-resolution detail (for example, if you're asking about the dominant shape or color in the image).

To give the model more detail to work with, use `"detail": "high"`. The model then sees the low-resolution image (using 85 tokens) and creates detailed crops, using 170 tokens for each 512px x 512px tile.

Note that the above token budgets for image processing do not currently apply to the GPT-4o mini model, but the image processing cost is comparable to GPT-4o. For the most precise and up-to-date estimates for image processing, please use the [image pricing calculator](https://openai.com/api/pricing/).
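For example, here is a minimal Python sketch of a full request with the `detail` field set to `low`; the request shape mirrors the earlier examples, and the prompt text is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What is the dominant color in this image?"},
            {
                "type": "input_image",
                "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                # "low" sends a 512px x 512px version of the image and uses a fixed 85-token budget
                "detail": "low",
            },
        ],
    }],
)

print(response.output_text)
```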
Provide multiple image inputs
-----------------------------

The [Responses API](https://platform.openai.com/docs/api-reference/responses) can take in and process multiple image inputs. The model processes each image and uses information from all images to answer the question.

**Multiple image inputs**

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "What are in these images? Is there any difference between them?",
        },
        {
          type: "input_image",
          image_url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        },
        {
          type: "input_image",
          image_url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        },
      ],
    },
  ],
});

console.log(response.output_text);
```

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "What are in these images? Is there any difference between them?",
                },
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
                {
                    "type": "input_image",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            ],
        }
    ],
)

print(response.output_text)
```

```bash
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1-mini",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "What are in these images? Is there any difference between them?"
          },
          {
            "type": "input_image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          },
          {
            "type": "input_image",
            "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        ]
      }
    ]
  }'
```

Here, the model is shown two copies of the same image. It can answer questions about both images or each image independently.

Limitations
-----------

While models with vision capabilities are powerful and can be used in many situations, it's important to understand the limitations of these models. Here are some known limitations:

* **Medical images**: The model is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.
* **Non-English**: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
* **Small text**: Enlarge text within the image to improve readability, but avoid cropping important details.
* **Rotation**: The model may misinterpret rotated or upside-down text and images.
* **Visual elements**: The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.
* **Spatial reasoning**: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
* **Accuracy**: The model may generate incorrect descriptions or captions in certain scenarios.
* **Image shape**: The model struggles with panoramic and fisheye images.
* **Metadata and resizing**: The model doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.
* **Counting**: The model may give approximate counts for objects in images.
* **CAPTCHAS**: For safety reasons, our system blocks the submission of CAPTCHAs.

Calculating costs
-----------------

Image inputs are metered and charged in tokens, just as text inputs are. How images are converted to text token inputs varies based on the model.

### GPT-4.1

Image inputs are metered and charged in tokens based on their dimensions. The token cost of an image is determined as follows (a short code sketch of this calculation appears after the examples below):

* Calculate the number of 32px x 32px patches that are needed to fully cover the image.
* If the number of patches exceeds 1536, scale down the image so that it can be covered by no more than 1536 patches.
* The token cost is the number of patches, capped at a maximum of 1536 tokens.
* For `gpt-4.1-mini`, image tokens are multiplied by 1.62, and for `gpt-4.1-nano`, image tokens are multiplied by 2.46, to get the total tokens, which are then billed at normal text token rates.

#### Cost calculation examples

* A 1024 x 1024 image is **1024 tokens**
    * Width is 1024, resulting in `(1024 + 32 - 1) // 32 = 32` patches
    * Height is 1024, resulting in `(1024 + 32 - 1) // 32 = 32` patches
    * Tokens calculated as `32 * 32 = 1024`, below the cap of 1536
* A 1800 x 2400 image is **1452 tokens**
    * Width is 1800, resulting in `(1800 + 32 - 1) // 32 = 57` patches
    * Height is 2400, resulting in `(2400 + 32 - 1) // 32 = 75` patches
    * We need `57 * 75 = 4275` patches to cover the full image. Since that exceeds 1536, we need to scale down the image while preserving the aspect ratio.
    * We can calculate the shrink factor as `sqrt(token_budget × patch_size^2 / (width * height))`. In our example, the shrink factor is `sqrt(1536 * 32^2 / (1800 * 2400)) = 0.603`.
    * Width is now 1086, resulting in `1086 / 32 = 33.94` patches
    * Height is now 1448, resulting in `1448 / 32 = 45.25` patches
    * We want to make sure the image fits in a whole number of patches. In this case we scale again by `33 / 33.94 = 0.97` to fit the width in 33 patches.
    * The final width is then `1086 * (33 / 33.94) = 1056` and the final height is `1448 * (33 / 33.94) = 1408`
    * The image now requires `1056 / 32 = 33` patches to cover the width and `1408 / 32 = 44` patches to cover the height
    * The total number of tokens is `33 * 44 = 1452`, below the cap of 1536
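The patch arithmetic above can be written out as a short function. This is an illustrative sketch of the documented steps, not an official utility, and it returns base image tokens only; for `gpt-4.1-mini` and `gpt-4.1-nano` you would multiply the result by 1.62 or 2.46 respectively:

```python
import math

def gpt41_image_tokens(width: int, height: int, patch_size: int = 32, max_patches: int = 1536) -> int:
    """Estimate base image tokens for GPT-4.1 using the 32px-patch rule described above."""
    # Step 1: how many 32px x 32px patches cover the image at full size?
    patches_w = math.ceil(width / patch_size)
    patches_h = math.ceil(height / patch_size)
    if patches_w * patches_h <= max_patches:
        return patches_w * patches_h

    # Step 2: shrink the image (preserving aspect ratio) so it fits the patch budget.
    shrink = math.sqrt(max_patches * patch_size**2 / (width * height))
    w, h = width * shrink, height * shrink

    # Step 3: scale once more so the width spans a whole number of patches.
    whole_patches_w = int(w / patch_size)
    rescale = whole_patches_w / (w / patch_size)
    w, h = w * rescale, h * rescale

    # Step 4: tokens = patches needed to cover the rescaled image, capped at the budget.
    patches_h = math.ceil(h / patch_size - 1e-9)  # small guard against float noise
    return min(whole_patches_w * patches_h, max_patches)

# The worked examples above:
print(gpt41_image_tokens(1024, 1024))  # 1024
print(gpt41_image_tokens(1800, 2400))  # 1452
```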
### GPT 4o and o-series

The token cost of an image is determined by two factors: size and detail.

Any image with `"detail": "low"` costs 85 tokens. To calculate the cost of an image with `"detail": "high"`, we do the following (a code sketch of this calculation appears at the end of this section):

* Scale to fit in a 2048px x 2048px square, maintaining original aspect ratio
* Scale so that the image's shortest side is 768px long
* Count the number of 512px squares in the image—each square costs **170 tokens**
* Add **85 tokens** to the total

#### Cost calculation examples

* A 1024 x 1024 square image in `"detail": "high"` mode costs 765 tokens
    * 1024 is less than 2048, so there is no initial resize.
    * The shortest side is 1024, so we scale the image down to 768 x 768.
    * 4 512px square tiles are needed to represent the image, so the final token cost is `170 * 4 + 85 = 765`.
* A 2048 x 4096 image in `"detail": "high"` mode costs 1105 tokens
    * We scale down the image to 1024 x 2048 to fit within the 2048 square.
    * The shortest side is 1024, so we further scale down to 768 x 1536.
    * 6 512px tiles are needed, so the final token cost is `170 * 6 + 85 = 1105`.
* A 4096 x 8192 image in `"detail": "low"` mode costs 85 tokens
    * Regardless of input size, low detail images are a fixed cost.

We process images at the token level, so each image we process counts towards your tokens per minute (TPM) limit.
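As with GPT-4.1, the tile arithmetic above can be sketched in a few lines. This is an illustrative estimate based on the steps described in this section, not an official utility:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens for GPT-4o-class models using the tile rule described above."""
    if detail == "low":
        return 85  # low detail is a fixed cost, regardless of image size

    # Step 1: scale to fit within a 2048px x 2048px square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = width * scale, height * scale

    # Step 2: scale down so the shortest side is 768px.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = width * scale, height * scale

    # Step 3: count 512px x 512px tiles (170 tokens each), then add the 85-token base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

# The worked examples above:
print(gpt4o_image_tokens(1024, 1024))                  # 765
print(gpt4o_image_tokens(2048, 4096))                  # 1105
print(gpt4o_image_tokens(4096, 8192, detail="low"))    # 85
```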