Kalana Ranahnsa Dissanayake
Created July 31, 2024

AI-Powered Speech-to-Image App for Kids' Storytelling

The project aims to develop an AI-powered application for children aged 3 to 10 that facilitates storytelling.

Intermediate • Work in progress

Things used in this project

Hardware components

Minisforum Venus UM790 Pro with AMD Ryzen™ 9
×1
JBL Tune 510BT
×1
Asus monitor 24 inch
×1

Software apps and online services

Microsoft VS Code
Miniconda
ComfyUI
LM Studio
GitHub
Adobe Express
Hugging Face
OpenAI models

Story


Code

Narnia Adventures

Python
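import os
import uuid

import requests
import torch
import whisper
from flask import Flask, render_template, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads/'
# Make sure the upload directory exists before any file is saved into it
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)

# Load the Whisper model (English-only "medium" checkpoint)
whisper_model = whisper.load_model("medium.en")

# Load the Llama model and tokenizer.
# Llama 3 does not use the SentencePiece-based LlamaTokenizer, so the Auto* classes
# are used here. The Hugging Face repo id is "meta-llama/Meta-Llama-3-8B-Instruct";
# the repo is gated, so the license must be accepted on Hugging Face first.
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float16)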

@app.route('/generate', methods=['POST'])
def generate():
    # Receive the audio file from the UI
    audio_file = request.files['audio']
    filename = secure_filename(audio_file.filename)
    file_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4().hex}_{filename}")
    audio_file.save(file_path)

    # Transcribe the audio using Whisper.
    # transcribe() loads the file, splits it into 30-second windows, and decodes it,
    # returning a dict whose "text" entry holds the full transcript.
    result = whisper_model.transcribe(file_path)
    transcription = result["text"].strip()

    # Generate a story based on the transcription using Llama
    prompt = f"Generate a short kids story based on the following prompt: {transcription}"
    input_ids = llama_tokenizer.encode(prompt, return_tensors="pt").to(llama_model.device)
    output_ids = llama_model.generate(input_ids, max_new_tokens=256, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
    # Decode only the newly generated tokens so the prompt is not echoed back in the story
    story = llama_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

    # Generate an image based on the story using Stable Diffusion
    payload = {
        "prompt": story,
        "negative_prompt": "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality",
        "width": 512,
        "height": 512,
        "num_inference_steps": 50,
        "guidance_scale": 7.5,
    }

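    # NOTE: the endpoint URL is blank in the original listing; set it to your
    # image-generation API before running, otherwise requests raises MissingSchema.
    response = requests.post("", json=payload)
    # Fail loudly on HTTP errors instead of saving an error page as a PNG
    response.raise_for_status()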

    # Save the generated image
    image_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4().hex}.png")
    with open(image_path, "wb") as f:
        f.write(response.content)

    # Remove the uploaded audio file
    os.remove(file_path)

    # Return the generated story and image path
    return jsonify({'story': story, 'image': image_path})

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
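
The listing above sends the full generated story (up to 256 new tokens) as the image prompt. Many Stable Diffusion pipelines use a CLIP text encoder with a 77-token context, so a long prompt may be silently truncated. The sketch below is one possible way to shorten the story before building the payload. The helper name `condense_for_image_prompt` and the 50-word budget are assumptions for illustration, not part of the original project, and a word count is only a rough stand-in for a token count.

```python
import re


def condense_for_image_prompt(story: str, max_words: int = 50) -> str:
    """Keep whole leading sentences of `story` until the word budget is used up.

    Hypothetical helper: the name and the default budget are illustrative.
    """
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", story.strip())

    kept, used = [], 0
    for sentence in sentences:
        words = sentence.split()
        # Always keep the first sentence; stop once the next one would exceed the budget
        if kept and used + len(words) > max_words:
            break
        kept.append(sentence)
        used += len(words)

    # If the first sentence alone is over budget, hard-truncate it by words
    return " ".join(" ".join(kept).split()[:max_words])


if __name__ == "__main__":
    demo = "A fox found a red kite. It flew over the hills! Then the fox went home to sleep."
    print(condense_for_image_prompt(demo, max_words=11))
```

In the route, the payload could then use `"prompt": condense_for_image_prompt(story)` while the full `story` is still returned to the UI.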

Credits

Kalana Ranahnsa Dissanayake
I'm Kalana Dissanayake, a passionate individual with a strong background in software engineering, AI, and blockchain technologies.
