Users with no training in photography are often unable to capture attractive photographs. This project uses AI to automatically analyze frame elements and generate instructions that guide users toward better photographs. Specifically, this project is an application of a vision-language model built on RAG and Agent technology. It answers users' photography questions by querying a library of professional reference photographs and provides professional photography suggestions based on those references, covering composition, posing, ISO settings, and more.
Pipeline

As illustrated in the pipeline diagram above, the whole process can be split into the following steps:
1. Pass the query image to the embedding model to semantically represent it as an embedded query vector.
2. Pass the embedded query vector to the professional photography reference DB.
3. Retrieve the top-k relevant photos, measured by the distance between the query embedding and each photo embedding in the database.
4. Pass the query question, the query image, and the retrieved photo to our VLM.
5. The VLM determines which agent should be used: the composition agent, the posing agent, or the ISO setting agent.
6. The VLM generates a response with that agent, using the retrieved reference as context.
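A minimal sketch of this flow is shown below. The helper names (`embed_image`, `retrieve_top_k`, `route`) are illustrative assumptions rather than the project's actual API; the `*_advice` calls correspond to the agent APIs described later in this document.

```python
# Minimal sketch of the six pipeline steps above. The helper names
# (embed_image, retrieve_top_k, route) are illustrative placeholders,
# not the project's actual API.
def answer_query(assistant, query_text, query_image_path):
    # Steps 1-2: embed the query image and send the vector to the reference DB
    query_vec = assistant.embed_image(query_image_path)
    # Step 3: retrieve the top-k most similar professional photos
    reference_paths = assistant.retrieve_top_k(query_vec, k=3)
    # Steps 4-5: the VLM decides which agent fits the question
    agent = assistant.route(query_text)  # "composition", "pose", or "iso"
    # Step 6: generate a response grounded in the retrieved reference photo
    if agent == "composition":
        return assistant.composition_advice(query_text, query_image_path, reference_paths[0])
    if agent == "pose":
        return assistant.pose_advice(query_text, query_image_path, reference_paths[0])
    return assistant.ISO_advice(query_text, query_image_path, reference_paths[0])
```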
RAG

The core of this project is the use of RAG, which allows the model to dynamically retrieve references from the professional photography dataset Unsplash-Lite (https://unsplash.com/data).
In our system, we use a CLIP model to encode both user queries and reference photos. When a question is asked, it is embedded and a nearest-neighbor search is performed. This step measures the cosine similarity between the query embedding and the reference photo embeddings to identify the most relevant reference photo.
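A minimal sketch of this retrieval step, assuming the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32` and illustrative image paths (the project may use a different CLIP variant or index):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    # Encode reference photos into L2-normalized CLIP embeddings
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(query):
    # Encode a text query into the same embedding space
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Cosine similarity reduces to a dot product on normalized vectors;
# the highest-scoring reference photo is the nearest neighbor.
reference_embs = embed_images(["ref1.jpg", "ref2.jpg", "ref3.jpg"])  # illustrative paths
query_emb = embed_text("portrait of a person by a window")
scores = query_emb @ reference_embs.T
best_idx = scores.argmax().item()
```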
Grounded in reference information from professional photographs, the suggestions from our AI-assisted photography app are more professional and reliable. RAG has the following advantages:
1. Reducing hallucination:
LLMs are prone to hallucination: coherent but inaccurate or fabricated information. RAG reduces the risk of misleading suggestions by grounding the model's answers in authoritative photographers' work.
2. Increasing transparency and trust:
Generative AI models like LLMs often lack transparency, making it difficult for people to trust their output. RAG increases user trust by allowing users to preview the retrieved reference photos.
3. High flexibility:
The reference library can be updated flexibly without retraining the model. In the future, this project could iteratively incorporate more professional photography datasets, such as additional portrait and landscape photos, as sketched below.
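For example, new reference photos only need to be embedded and appended to the retrieval index; no model retraining is involved. A minimal sketch, assuming a FAISS inner-product index over L2-normalized CLIP embeddings (the project's actual index structure may differ):

```python
import faiss
import numpy as np

dim = 512  # CLIP ViT-B/32 embedding size
index = faiss.IndexFlatIP(dim)  # inner product == cosine similarity on normalized vectors

# Existing Unsplash-Lite embeddings (random placeholders here for illustration)
existing = np.random.randn(1000, dim).astype(np.float32)
existing /= np.linalg.norm(existing, axis=1, keepdims=True)
index.add(existing)

# Adding a new portrait/landscape dataset later requires no retraining:
new_photos = np.random.randn(200, dim).astype(np.float32)
new_photos /= np.linalg.norm(new_photos, axis=1, keepdims=True)
index.add(new_photos)

# Nearest-neighbor search over the extended library
query = existing[:1]  # any normalized query embedding
scores, ids = index.search(query, k=5)
```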
This project implements three agents that give suggestions on composition, pose, and camera settings (such as ISO).
- Composition agent
```python
# Encode the retrieved reference image and ask the VLM to describe its composition
pixel_values = load_image(reference_image_path, max_num=6).to(torch.bfloat16).cuda()
agent_prompt = ('<image>\nYou are a professional photography assistant. Based on the '
                'reference image, describe the composition techniques used.')
response, his = self.model.chat(agent_prompt, pixel_values, [])
```
This agent analyzes the composition techniques used in the reference image to give advice.
- Pose agent
```python
# Encode the reference image and ask the VLM to describe the subject's pose in detail
pixel_values = load_image(reference_image_path, max_num=6).to(torch.bfloat16).cuda()
agent_prompt = ('<image>\nYou are a professional photography assistant. Based on the '
                'reference image, describe the pose of the people in the image. '
                'You should include the position of the person, body orientation, '
                'head and neck pose, hands and arms pose, legs and feet pose, '
                'and overall composition.')
response, his = self.model.chat(agent_prompt, pixel_values, [])
```
This agent analyzes the pose in the reference image to give posing advice, covering the body, head, neck, arms, and legs.
- ISO setting agent
```python
if len(exif_info) != 0:
    # EXIF is available: explain the reference camera settings from the metadata alone
    agent_prompt = ('You are a professional photography assistant. Based on the reference '
                    'EXIF information, explain how these camera settings work.')
    response, his = self.model.chat(agent_prompt + exif_info, None, [])
else:
    # No EXIF: fall back to the reference image and the model's own knowledge
    pixel_values = load_image(reference_image_path, max_num=6).to(torch.bfloat16).cuda()
    agent_prompt = ('<image>\nYou are a professional photography assistant. Based on the '
                    'reference image, give advice on ISO, exposure time, f-number, and '
                    'other camera settings.')
    response, his = self.model.chat(agent_prompt, pixel_values, [])
```
This agent uses the EXIF information from the reference image to give camera-setting advice, including ISO, exposure time, and f-number. When the reference EXIF is missing, the model gives advice based on the reference image and its own knowledge.
Demo

- Simple text-only conversation for pose
In this example, the model calls the pose_advice agent API to give advice. It first uses the query keyword to retrieve a reference photo from the Unsplash dataset, then analyzes the pose in the reference photo, and finally gives advice on how to reproduce this pose.
- Simple text conversation for camera settings
In this example, the model calls the ISO_advice agent API to give advice. It first uses the query keyword to retrieve a reference photo from the Unsplash dataset, then parses the EXIF from the reference photo, and finally gives advice along with an explanation of the settings.
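A minimal sketch of how the EXIF data could be read from a reference photo with Pillow (the path and tag formatting here are illustrative; the project's actual parsing code may differ):

```python
from PIL import Image
from PIL.ExifTags import TAGS

def read_exif(path):
    # Collect EXIF tags from the base IFD and the Exif sub-IFD (0x8769),
    # where camera settings such as ISO, ExposureTime and FNumber usually live.
    exif = Image.open(path).getexif()
    tags = dict(exif)
    tags.update(exif.get_ifd(0x8769))
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in tags.items()}

info = read_exif("reference.jpg")  # hypothetical reference photo path
exif_info = ", ".join(
    f"{name}: {info[name]}"
    for name in ("ISOSpeedRatings", "ExposureTime", "FNumber")
    if name in info
)
```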
- Image-Text conversation for composition
In this example, the model calls the composition_advice agent API to give advice. It first uses the query image to retrieve a reference photo from the Unsplash dataset, then analyzes the composition techniques used in the reference photo, and finally gives advice on how to reproduce this composition.