The Vision:
Our project is designed for parents who often run out of stories when their kids ask for more. Let AI take over the creative storytelling process! GenAI can craft unique stories for your child, complete with illustrations. Plus, you can add personalized voice narration and create a video to make the experience even more engaging!

Project Description:
Our project leverages multimodal large language models (MLLMs) to create an innovative children's comic book generator. Built on advanced GenAI technologies, including Stable Diffusion, Llama, and Qwen, it lets users generate unique stories featuring our main character, George the Monkey, from just an input image and a prompt. Additionally, we offer an option to convert these stories into image-based movies with voice-over, allowing users to either clone their own voice or choose from a variety of pre-recorded options.
How It Works

Stage 1: Visual Tokenization & De-tokenization
- Pre-train an SD-XL-based de-tokenizer to reconstruct images by taking the features of a pre-trained Vision Transformer (ViT) as inputs.
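In code, stage 1 boils down to a frozen ViT that produces patch features and a learnable projection that maps them into the conditioning space of the SD-XL decoder. A minimal sketch, with illustrative module names and feature dimensions (not the project's actual code):

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Frozen pre-trained ViT that turns an image into patch features."""
    def __init__(self, vit: nn.Module):
        super().__init__()
        self.vit = vit.eval()
        for p in self.vit.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.vit(images)          # (B, num_patches, vit_dim)

class DeTokenizerAdapter(nn.Module):
    """Learnable projection from ViT features into the cross-attention
    conditioning space of the SD-XL-based de-tokenizer, so the diffusion
    model can reconstruct the original image from them."""
    def __init__(self, vit_dim: int = 1024, cond_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, vit_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vit_feats)      # (B, num_patches, cond_dim)
```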
Stage 2: Multimodal Sequence Training
- Sample an interleaved image-text sequence of a random length.
- Train the MLLM by performing next-word prediction and image feature regression between the output hidden states of the learnable queries and ViT features of the target image.
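In other words, the stage-2 objective sums next-token cross-entropy on the text with an L2 regression between the learnable queries' output hidden states and the target image's ViT features. A minimal sketch, with illustrative shapes and loss weighting:

```python
import torch
import torch.nn.functional as F

def stage2_loss(text_logits, text_targets, query_hidden, target_vit_feats,
                regression_weight: float = 1.0):
    """Combined stage-2 objective.

    text_logits:      (B, T, vocab)  MLLM logits over the text tokens
    text_targets:     (B, T)         next-token targets (-100 = ignore)
    query_hidden:     (B, Q, D)      hidden states of the learnable queries
    target_vit_feats: (B, Q, D)      ViT features of the target image
    """
    # Next-word prediction on the text portion of the sequence.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,
    )
    # Image feature regression against the target image's ViT features.
    reg_loss = F.mse_loss(query_hidden, target_vit_feats)
    return lm_loss + regression_weight * reg_loss
```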
Stage 3: De-tokenizer Adaptation
- The regressed image features from the MLLM are fed into the de-tokenizer for tuning SD-XL, enhancing the consistency of the characters and styles in the generated images.
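Conceptually, stage 3 reuses the standard diffusion denoising loss, except the conditioning now comes from the MLLM's regressed features rather than from the ViT directly. A simplified sketch of one tuning step, where `denoiser` stands in for the SD-XL UNet and the noising schedule is deliberately simplified (a real pipeline would use a proper scheduler, e.g. from diffusers):

```python
import torch
import torch.nn.functional as F

def adaptation_step(denoiser, latents, regressed_feats, optimizer,
                    num_train_timesteps: int = 1000):
    """One tuning step: predict the noise added to image latents,
    conditioned on the MLLM-regressed image features."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_train_timesteps, (latents.size(0),),
                      device=latents.device)
    # A real pipeline would let the noise scheduler do this; the linear
    # interpolation here is only to keep the sketch self-contained.
    alpha = (1 - t / num_train_timesteps).view(-1, 1, 1, 1)
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

    pred = denoiser(noisy, t, encoder_hidden_states=regressed_feats)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```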
Given the same initial image but different opening texts, SEED-Story can generate different multimodal stories. For instance, starting with text referencing “the man in the yellow hat” will lead to images that include the character, while omitting this reference will result in a different narrative direction.
Setting Up the Environment:
- Install ROCm v5.5
Story Generation:
The user uploads an initial image along with a prompt describing the theme and expectations of the adventure story. The prompt is then passed to the open-source model LLaMA to generate a coherent and engaging story! The visuals are generated with the help of the Stable Diffusion model, which employs a pre-trained Vision Transformer (ViT) as the visual tokenizer and a pre-trained diffusion model as the visual de-tokenizer to decode images. The illustrations feature your favourite pre-trained character based on the seed in the prompt, in this case George! This ensures a consistent art style for the character.
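As a rough illustration of the text side, the story could be generated with a Llama checkpoint through the Hugging Face transformers pipeline; the model id and prompt below are illustrative, not the project's exact configuration:

```python
from transformers import pipeline

# Load an open-source Llama checkpoint for story generation
# (the model id is an example, not necessarily the one used here).
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
)

prompt = (
    "Write a short children's adventure story about George the Monkey "
    "exploring a jungle full of friendly animals."
)
story = generator(prompt, max_new_tokens=400, do_sample=True,
                  temperature=0.8)[0]["generated_text"]
print(story)
```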
Data Source:
The data used to perform inference comes from the StoryStream dataset, a collection of stories extracted from cartoon series. The dataset contains images and subtitles with image descriptions. Custom datasets can be created by using Qwen for image captioning!
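For example, a custom dataset could be captioned with Qwen-VL-Chat through its published chat interface (the model id and file paths below are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-VL-Chat ships its own modeling code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat",
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat",
                                             device_map="auto",
                                             trust_remote_code=True).eval()

# Caption one story frame; the image path is a placeholder.
query = tokenizer.from_list_format([
    {"image": "frames/episode1_frame001.png"},
    {"text": "Describe this cartoon frame for a children's story."},
])
caption, _history = model.chat(tokenizer, query=query, history=None)
print(caption)
```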
- Narration Options: Choose to either clone your voice or use a pre-recorded voice to narrate the story, bringing the adventure to life with personalized narration. Voice cloning is implemented with the text-to-speech library TTS, while edge-TTS supplies the pre-recorded voices (see the narration sketch after this list).
- Creating an Image-based Movie: Integrate the generated images and voice-over into a smooth video.
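A minimal sketch of the two narration options, assuming Coqui's TTS package (with its XTTS v2 model) for voice cloning and the edge-tts package for pre-recorded voices; the file paths and voice name are illustrative:

```python
import asyncio
from TTS.api import TTS   # Coqui TTS, used for voice cloning
import edge_tts           # pre-recorded Microsoft Edge voices

page_text = "George the Monkey peeked out of his treehouse at sunrise."

# Option 1: clone the user's voice from a short reference recording.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text=page_text, speaker_wav="my_voice_sample.wav",
                language="en", file_path="page1_cloned.wav")

# Option 2: use a pre-recorded voice via edge-tts (async API).
async def narrate() -> None:
    communicate = edge_tts.Communicate(page_text, "en-US-AriaNeural")
    await communicate.save("page1_prerecorded.mp3")

asyncio.run(narrate())
```

For the picture movie, one way to pair each illustration with its narration clip is moviepy (the project doesn't name its video library, so this is an assumption; the snippet uses the moviepy 1.x API):

```python
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

# Pair each generated illustration with its narration clip
# and concatenate the pages into one MP4 (paths are placeholders).
pages = [("page1.png", "page1_cloned.wav"),
         ("page2.png", "page2_cloned.wav")]

clips = []
for image_path, audio_path in pages:
    audio = AudioFileClip(audio_path)
    clip = ImageClip(image_path).set_duration(audio.duration).set_audio(audio)
    clips.append(clip)

movie = concatenate_videoclips(clips, method="compose")
movie.write_videofile("story_movie.mp4", fps=24)
```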
Gradio is used to create the application’s user interface, making interaction with the model simple and user-friendly. Ngrok is employed to host the application, providing a secure tunnel to the local server and enabling access from external networks.

We also give users the option to save and export the results, whether as a PDF of the comic or an MP4 of the picture movie, in case the kids wish to revisit the story!
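A minimal sketch of the UI and hosting setup, assuming pyngrok as the Python wrapper for Ngrok; `generate_story` is a placeholder for the project's actual pipeline:

```python
import gradio as gr
from pyngrok import ngrok

# Placeholder for the real image + prompt -> story pipeline.
def generate_story(image, prompt):
    return f"Story for prompt: {prompt}"

demo = gr.Interface(
    fn=generate_story,
    inputs=[gr.Image(type="filepath", label="Initial image"),
            gr.Textbox(label="Story prompt")],
    outputs=gr.Textbox(label="Generated story"),
    title="George the Monkey Story Generator",
)

# Open a secure tunnel so the local app is reachable externally.
public_url = ngrok.connect(7860)
print(f"App available at: {public_url}")
demo.launch(server_port=7860)
```

Check out the full project demo below: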
Future Works:
1. Fix the inconsistency issues in story generation
2. Fine-tune on more characters