The on-device generative AI landscape has been shaped predominantly by LLMs, which excel at text generation. With advances in hardware such as NPUs, however, image generation models are poised to enter this arena. Text-to-image models, once accessible only through cloud-based platforms, are now within reach of on-device deployment. This project explores refining Stable Diffusion, a leading text-to-image model, for on-device applications. By combining Stable Diffusion's generative capabilities with a CNN for object detection and rectification, we aim to create a system that produces high-quality images while correcting common shortcomings in the generated output.
Stable Diffusion
Contributions to Stable Diffusion are growing rapidly, along with improvements in image quality, accessibility, and adaptability. Using the pipeline architecture provided by Hugging Face, with support for the AMD Ryzen AI architecture, the model can be tested on your device with a few lines of code:
import torch
from diffusers import StableDiffusionPipeline
# Load the Stable Diffusion model
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Generate an image
prompt = "a beautiful cat sitting on a couch"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_image.png")
This requires AMD's Ryzen AI dependencies to be preinstalled, along with the PyTorch package and an environment set up to use the NPU. The pipeline architecture loads everything it needs into the running kernel, which is often a lengthy process, since the size of a Stable Diffusion model grows with improvements in performance. The model above loads multiple files totaling around 10 GB into the current kernel to perform the generation task.
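One common way to reduce this footprint (a minimal sketch, not part of the project setup, and assuming the runtime supports half precision) is to load the checkpoint in float16, which roughly halves the in-memory size of the weights:
import torch
from diffusers import StableDiffusionPipeline
# Loading in half precision cuts the memory taken by the weights roughly in half
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)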
CNN
The Convolutional Neural Network model is used to identify the components of the image: the model applies image recognition over a segment of the image to identify a component and separate it from the rest of the image. After identification, the pixel location is marked along with the size of the object in the picture to approximate its pixel length and breadth. A simple example of CNN detection:
import torch
import cv2
import numpy as np

# Load the pre-trained CNN model (a full model object saved with torch.save)
model = torch.load("object_detection_model.pth")
model.eval()

# Load the generated image (OpenCV reads images in BGR channel order)
image = cv2.imread("generated_image.png")

# Preprocess the image for the CNN: BGR -> RGB, scale to [0, 1],
# reorder to (C, H, W), and add a batch dimension
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
tensor = torch.from_numpy(rgb).float().permute(2, 0, 1) / 255.0
tensor = tensor.unsqueeze(0)

# Perform object detection
with torch.no_grad():
    detections = model(tensor)
# ... (post-processing to get bounding boxes: confidence filtering,
#      non-maximum suppression, depending on the model head)

# Rectification (simplified example)
for box in detections:
    x1, y1, x2, y2 = [int(v) for v in box]  # bounding box coordinates in pixels
    crop = image[y1:y2, x1:x2]              # crop the detected object
    # ... (apply rectification to the crop, then paste it back into the image)
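To make the measurement step described above concrete, here is a small sketch of deriving an object's pixel length and breadth from its bounding box (the helper name object_extent is illustrative, not from the project code):
def object_extent(box):
    # A bounding box (x1, y1, x2, y2) directly gives the object's extent in pixels
    x1, y1, x2, y2 = box
    length = x2 - x1   # horizontal extent in pixels
    breadth = y2 - y1  # vertical extent in pixels
    return length, breadth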