How to use Azure OpenAI GPT-4 Turbo with Vision to describe images

Let’s explore the Azure OpenAI GPT-4 Turbo with Vision model and how it can be used to describe image contents. The GPT-4 Turbo with Vision model is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It incorporates both natural language processing and visual understanding, and it answers general questions about what is present in the images. The easiest way to get started is to simply ask it to describe the image contents. The latest Azure OpenAI API version, 2023-12-01-preview, brought in support for GPT-4 Turbo with Vision, Tools, DALL-E 3.0 and enhanced content filtering. If you are using an earlier API version, make sure you switch to 2023-12-01-preview before April 2nd, 2024, because on that date the older API versions will be retired.

  1. Vision-preview model 
  2. Azure AI Studio 
  3. How to use REST API to get the same result? 
  4. Content filtering results 
  5. Conclusion 
  6. Extra: how about my handwriting
  7. Join me live at CTTT24 in Tallinn!

Vision-preview model 

First, you need to go to AI Studio and create a deployment of the GPT-4 model that is set to version vision-preview.

It is also possible to change the existing deployment and set it to vision-preview.  

If you don’t see vision-preview in your list, you need to create a new Azure OpenAI service in a region that supports it. At the time of writing, the supported regions are:

  • Switzerland North 
  • West US 

You can see the up-to-date list of model versions and the regions where they are available here.

Make sure you take note of where you can get the model deployment endpoint (URL) and API key, as we need these later when calling the API. You can find these when you open the model deployment.

Azure AI Studio 

Now that we have the model sorted out, you can take the easy route and use the AI Studio Playground to test it. You can upload images or video and use AI to “talk” with your content. It is easy here to ask it to describe or summarize an image, but you can do much more than that. For the simplicity of this article, we stick to summarizing/describing to show how it is done.

There are limits on the size of the content you can upload. For example, a video can be a maximum of 3 minutes long in the Playground. It is good to note that this limit doesn’t apply when using the model via the API.

Let’s try out my Cloud Technology Townhall Tallinn speaker promo image. 

And it does provide a great answer, I think.  

How to use REST API to get the same result? 

Of course, I don’t want to use a website for image analysis. This is something that needs to be automated and integrated into business processes to save people’s time. And this is where we jump to the (low)code to get the description.

In my example, something (a person, process, automation, …) uploads a file to a certain SharePoint library and the upload is captured by a Power Automate trigger. The trigger or image source could be anything accessible by Power Automate (which is just about anything). I am just using a SharePoint library as an example, since it is likely a very commonly used source and it is easy to demo with.

First, get the image content in base64 encoding. 

Then you need to initialize variables for API URL and Key. You can get these values from the model deployment.  

Then we need to create a call to the model using the REST API. It follows the usual structure of GPT models when you call the chat completions API.

The notable part here is the user message with its content array. In the content you specify the image_url, and in the text you put the user prompt. In this automated description flow, I used a short summarize & describe prompt.

The image_url can contain a public URL to the image. That is in fact the easy way, if the image happens to be on a public website (or in Azure Blob Storage with public anonymous access). For this example, I explicitly wanted the image to be in our system, in a SharePoint library. So, we need to give more information about the file: its content type and content. The image content needs to be base64 encoded.

"url": "data:image/jpeg;base64,[base64 encoded image content]"

The easiest way to get the image content type and content is the Get file content action. And we have that already in our flow, so we can just reference it there.

Note: make sure you set max_tokens in the JSON body, or the result text will be cut off.

I have placed the REST API call body in a variable named APIBody. After that, it is just an HTTP call to the model.
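If you want to see the same request outside Power Automate, here is a minimal sketch in Python. It only illustrates the request structure described above; the endpoint, deployment name, API key, file name and prompt are placeholder assumptions you would replace with your own values.

    import base64
    import requests

    # Placeholders: use your own resource endpoint, deployment name and API key.
    ENDPOINT = "https://YOUR-RESOURCE.openai.azure.com"
    DEPLOYMENT = "YOUR-GPT4-VISION-DEPLOYMENT"  # deployment set to version vision-preview
    API_KEY = "YOUR-API-KEY"
    API_VERSION = "2023-12-01-preview"

    # Get the image content in base64 encoding (Power Automate's Get file content
    # action already returns the content base64 encoded).
    with open("promo-image.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # The chat completions body: a user message whose content array holds the
    # text prompt and the image as a base64 data URL.
    body = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Summarize and describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        # Without max_tokens the returned description gets cut short.
        "max_tokens": 800,
    }

    url = (
        f"{ENDPOINT}/openai/deployments/{DEPLOYMENT}"
        f"/chat/completions?api-version={API_VERSION}"
    )
    response = requests.post(url, headers={"api-key": API_KEY}, json=body)
    result = response.json()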

After the call, you have a result body that contains the description… somewhere inside its JSON.

To comprehend and use it better, it is a good idea to decipher the return body with Parse JSON. 

The schema is a long one. The best way to get it is to run the flow once, so you have the returned body content, and use that to generate the schema (Use sample payload to generate schema in the Parse JSON action).

The information can be found inside the choices section.  

The description is in content under message. 

Then you can just reference the content inside the body when you take the image description and move it forward in the process. In this example, the target is a Teams channel.  

You can then use the content under message to get the image description.

The Parse JSON action is not mandatory; you could just reference the right location in the body to get the message content.

No matter how you do it, by referencing via Parse JSON or directly, also check the finish reason (finish_details / type). If it is “stop”, everything is good. If the image contains something that is not OK with content filtering, you will get a content filtering result instead of the image description.
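Continuing the Python sketch from earlier (the JSON structure is the same one Parse JSON works with), reading the description and the finish reason could look roughly like this:

    choice = result["choices"][0]

    # The vision-preview model reports the finish reason under finish_details.
    finish_type = choice.get("finish_details", {}).get("type")

    if finish_type == "stop":
        # The description is in content under message.
        print(choice["message"]["content"])
    else:
        # Most likely a content filtering result instead of a description.
        print(f"No description returned, finish type: {finish_type}")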

After that, you just push the description to your processes! In the demo I will post it to a Microsoft Teams channel, indicating that a new image has been uploaded and describing what it is.

For example, when I uploaded my CTTT promo picture, this is the result that went out to Teams:

Note that the text is slightly different from the one generated in Azure AI Studio. Depending on the creativity settings and the prompt, you get some variance in the results.

Content filtering results 

If the image contains something that is not OK, finish_details includes the type “content_filter” as its value. You can also get information about the various content levels from content_filter_results.
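In the same Python sketch, inspecting those results could look roughly like this (the exact categories and fields depend on your content filtering configuration, so treat the names below as assumptions):

    # content_filter_results typically lists categories such as hate, sexual,
    # self_harm and violence, each with a severity and a filtered flag.
    for category, details in choice.get("content_filter_results", {}).items():
        print(category, "severity:", details.get("severity"),
              "filtered:", details.get("filtered"))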

And if the picture (or prompt) goes beyond what is acceptable, the call will fail and does not return content filtering information.

Conclusion 

Describing or summarizing images with the vision model barely scratches the surface of what it can do. Understanding what is in an image can be used to automate various processes. Perhaps it is an image from a surveillance camera, or a monitoring camera that regularly takes a photo of something that needs to be checked: a received shipment or product, or safety concerns (is a gate closed or open, is there a spill on the floor, and so on). Reading labels, understanding a picture of a form (form processing is better for standard forms, but read on), and the list goes on. This is a very good model, even guessing what is missing from a picture or text. How about reading handwriting, which has usually confused a lot of earlier AIs…

Extra: how about my handwriting

And with that, I of course tested it with my (bad) handwriting, using the Azure AI Studio Playground.

And the result? Quite astonishing in my opinion. 

Thinking about the possibilities this model brings to the table, it is rather exciting! 

However, it is good to keep in mind that you sometimes get different results. Using the same handwriting picture with this automated process, it described it with one error (Gambling instead of gardening). But even with that, the AI understood the picture better than some people did.

  

Description: The image shows a handwritten note, likely a shopping list, with the following items written down:  

  • Milk x 2 
  • Bred (presumably a misspelling of “Bread”) 
  • Butter 1 KG 
  • Gambling equipment (with the word “Gambling” scribbled out) 

The words are written in distinct colors; “Milk x 2” is in yellow, “Bred” in orange, “Butter 1 KG” in yellow, and “Gambling equipment” in purple, with the purple line crossing out “Gambling.” 

Of course, the prompt was also different, using the summarize/describe one from this blog post. And in further testing, understanding handwritten Finnish was not as good as English. So, there is still room for improvement.

If you want to get started easily, one simple use case would be generating tags & metadata for pictures automatically.  

Join me live at CTTT24 in Tallinn!

If you want to see GPT-4 Turbo with Vision in action and talk about these possibilities live, join me at my session at Cloud Technology Townhall Tallinn 2024 on the 1st of February. And no, this is not the only AI-supercharged Teams example or demo I will show there live. In fact, I will also demo a second use case for image processing if time permits.
