Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on high-quality, reasoning-dense data for both text and vision. It belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model went through a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Phi-3-vision is strongest on tasks that require processing images and text together. It is particularly well suited to optical character recognition (OCR): it can reason about and answer questions on the extracted text, and it can also interpret charts, diagrams, and tables. This makes it broadly applicable to multimedia content analysis.
NVIDIA NIM API
NVIDIA NIM is a set of inference microservices, exposed through an API, for deploying AI models. It aims to reduce the complexity of model deployment so that developers can focus on their applications. The hosted endpoints can be called from any machine with network access, including edge devices. For example, this tutorial was written on an NVIDIA Jetson NX.
- Please visit the NVIDIA NIM Phi-3-vision page.
- Click "Login" in the upper right corner of the page to log in.
- Click on the "Python" tab and click on the "Get API Key" button.
- Click on "Generate Key" and copy and save your API Key.
Today we will build a small OCR project. I have a picture of the nutrition facts table from a food package.
I want the Phi-3-vision model to read this image and convert the table in it into a Markdown table.
import base64 # Used for encoding pictures
import json
import os
import requests
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.getenv("TOKEN")  # Retrieve the API key from environment variables
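If the variable is missing, `os.getenv` returns `None` and the request would go out with the header `Bearer None`. The following is a minimal sketch of failing early instead. The helper name `require_token` is hypothetical, and it assumes the key was saved in a `.env` file as a line of the form `TOKEN=<your key>`.

```python
import os


def require_token(name: str = "TOKEN") -> str:
    """Return the API key stored in the given environment variable, or raise."""
    value = os.getenv(name, "").strip()
    if not value:
        # Hypothetical error message; adjust to your own setup
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file")
    return value


# Possible usage, after load_dotenv() has run:
# TOKEN = require_token()
```

With this in place, a missing key surfaces as a clear error at startup rather than as an authorization failure from the API.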
Copy the Python code from the NIM page and wrap it into a function
def invoke(prompt: str, image_b64: str, stream=True):
    invoke_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/phi-3-vision-128k-instruct"
    assert (
        len(image_b64) < 180_000
    ), "To upload larger images, use the assets API (see docs)"
    headers = {
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "text/event-stream" if stream else "application/json",
    }
    payload = {
        "messages": [
            {
                "role": "user",
                # The image is embedded inline as a base64 data URI
                "content": f'{prompt} <img src="data:image/png;base64,{image_b64}" />',
            }
        ],
        "max_tokens": 512,
        "temperature": 1.00,
        "top_p": 0.70,
        "stream": stream,
    }
    response = requests.post(invoke_url, headers=headers, json=payload, stream=stream)
    response.raise_for_status()  # Stop here on HTTP errors such as an invalid key
    if stream:
        for line in response.iter_lines():
            text = line.decode("utf-8")
            # Events arrive as "data: {...}"; OpenAI-style streams end with "data: [DONE]"
            if not text.startswith("data: ") or text[6:] == "[DONE]":
                continue
            data = json.loads(text[6:])
            # Some chunks carry no text, so fall back to an empty string
            print(data["choices"][0]["delta"].get("content", ""), end="")
        print()
    else:
        print(response.json()["choices"][0]["message"]["content"])
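The `invoke` function prints the response as it arrives. If the Markdown is needed as a string, for example to save it to a file, the streamed chunks can be collected instead. The sketch below is one way to do that. The name `collect_stream` is hypothetical, and it assumes the endpoint sends OpenAI-style server-sent events (`data: {...}` lines, optionally ending with `data: [DONE]`).

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the text fragments from a sequence of server-sent event lines."""
    parts = []
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if not text.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not an event
        body = text[len("data:"):].strip()
        if body == "[DONE]":  # assumed end-of-stream marker
            break
        delta = json.loads(body)["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")  # some chunks carry no text
    return "".join(parts)
```

Inside `invoke`, this could be called as `collect_stream(response.iter_lines())` in place of the printing loop.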
We need to read the image into Python and convert it to base64 encoding
with open("nutrition_facts.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
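Base64 turns every 3 bytes into 4 characters, so the 180,000-character limit asserted in `invoke` corresponds to an image file of roughly 135 KB. The sketch below works out that arithmetic so the file size can be checked before encoding. The helper names `b64_length` and `fits_inline` are hypothetical.

```python
def b64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


def fits_inline(n_bytes: int, limit: int = 180_000) -> bool:
    """True if a file of n_bytes bytes stays under the inline limit used by invoke()."""
    return b64_length(n_bytes) < limit


# Possible usage:
# import os
# fits_inline(os.path.getsize("nutrition_facts.jpg"))
```

If the check fails, the image has to be shrunk or uploaded through the assets API mentioned in the assertion message.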
Call the invoke function
invoke("Help me organize the table in the picture into md format", image_b64, stream=False)
Running this produces the following result:
| Nutrition Facts | |
|--------------------------------|---------------------|
| 8 servings per container | |
| Serving size | 2 rolls (20g) |
| **Amount Per Serving** | |
| Calories | 120 |
| **% Daily Value*** | |
| Total Fat | 9g |
| | 12% |
| Saturated Fat | 4g |
| | 20% |
| Trans Fat | 0g |
| Cholesterol | 0mg |
| | 0% |
| Sodium | 55mg |
| | 2% |
| Total Carbohydrate | 10g |
| | 4% |
| Dietary Fiber | 0g |
| | 0% |
| Total Sugars | 1g |
| Includes 1g Added Sugars | |
| | 2% |
| Protein | less than 1g |
| Vit. D | 0mcg 0% |
| Calcium | 0mg 0% |
| Iron | 0mg 0% |
| Potas. | 0mg 0% |
You can see that all the text on the label in the image has been recognized.
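Once the table is available as Markdown text, it can be turned into rows for further processing. The following is a minimal sketch, not part of the original project: `parse_markdown_table` is a hypothetical helper that splits each line on `|` and skips the separator row. It does not handle escaped pipes inside cells.

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Split a Markdown table into rows of stripped cell strings."""
    rows = []
    for line in md.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # The separator row consists only of dashes, colons and spaces
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

Each row comes back as a list of cell strings, so a line such as `| Calories | 120 |` becomes `["Calories", "120"]`. Bold markers like `**` are left in place.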
More application scenarios
NVIDIA NIM and Microsoft Phi-3-vision can be applied to OCR in a wide range of scenarios, where they can greatly improve the efficiency of automated processing and data extraction.
Here are some specific application scenarios:
- Automated document processing: In the office environment, a large number of paper documents need to be converted to digital format for storage, search, and sharing. Using NVIDIA NIM and Phi-3-vision for OCR, paper documents such as contracts, invoices, and reports can be quickly converted to editable text or Markdown, improving office efficiency.
- Product label recognition: In the retail and logistics sectors, product labels typically contain important product information, such as name, production date, shelf life, and barcode. Using OCR technology, this label information can be quickly extracted for inventory management, logistics tracking, and product tracing.
- Financial data processing: When processing financial documents such as financial statements, invoices, and receipts, OCR technology can automatically identify and extract key information such as amounts, dates, and customer names. This helps speed up data processing, reduce human error, and improve financial work efficiency.
- Digitization of library materials: Libraries and archives store a large number of paper books and materials, which need to be digitized for online access and preservation. Using OCR technology can automatically identify and convert the text content in books and materials, enabling the rapid digitization of library materials.
- Form processing: In enterprises or government departments, it is often necessary to process various forms such as application forms and questionnaires. Using OCR technology can automatically identify the text and image content in the form and convert it into editable data formats, which facilitates subsequent data analysis and processing.
- Automated security inspection: At security checkpoints, such as those in airports and train stations, passengers' identification documents and tickets have to be checked. OCR technology can automatically recognize and extract this information, improving the speed and accuracy of security checks.
- Accessible reading assistance: For people with visual impairments, text extracted by OCR from paper books, magazines, and other materials can be converted into speech, helping them access information more easily. OCR technology can also be used to create accessible versions of e-books and web pages, improving the usability of websites and applications.
These are just some typical scenarios for OCR with NVIDIA NIM and Microsoft Phi-3-vision. As the technology develops and its applications expand, OCR will play an important role in more fields.