This AI-powered speech-to-image app is crucial because it enhances children's creativity and imagination by transforming spoken words into captivating images in real-time. It offers an interactive and multi-sensory experience, catering to diverse learning styles and making abstract concepts concrete. The app supports accessibility, enabling children with different needs to communicate and express themselves effectively. By integrating storytelling with visual elements, it fosters a collaborative environment where parents and educators can participate, ensuring that children are engaged, empowered, and prepared for the digital world.
Why this type of AI solution is needed
This type of AI solution is needed because it addresses several key challenges in children's learning and creative expression. By transforming spoken words into visual content in real-time, it makes storytelling more engaging and interactive, enhancing children's ability to express their ideas and emotions. It supports diverse learning styles by combining auditory, visual, and interactive elements, which helps make abstract concepts more tangible. Additionally, the app promotes inclusivity by providing a platform for children with different needs to communicate and learn effectively. In a world increasingly dominated by digital technology, such solutions prepare children for future interactions with AI and digital tools, fostering creativity, language development, and confidence.
With the support of the advanced AMD AI PC, we are poised to develop the Speech-to-Image App for Kids' Storytelling and Learning with exceptional efficiency and precision. The AMD AI PC’s cutting-edge processing power and specialized AI capabilities will facilitate real-time speech recognition and dynamic image generation, ensuring optimal performance and seamless user experience. This state-of-the-art hardware will enable us to create a highly interactive and engaging application that transforms children's spoken narratives into vivid visual representations, thereby significantly enhancing creativity, language development, and educational engagement.
The development of the AI-powered Speech-to-Image App is organized into three distinct yet interconnected stages, each pivotal to the project's success.
Planning begins with a comprehensive exploration of AI technology to identify the most suitable models for our needs. This is followed by an in-depth study of children’s talking abilities and communication styles to ensure that the app can accurately interpret and respond to their speech. During this stage, staying abreast of the latest advancements in AI and child development is crucial to integrate the most effective and current solutions.
Designing involves crafting the solution’s blueprint. This stage starts with the selection of appropriate AI models, informed by the research conducted during the planning phase. The chosen models are then rigorously tested on the AMD AI PC to ensure optimal performance and compatibility. Concurrently, the app's structure is meticulously planned and designed, focusing on creating a seamless user experience and effective integration of the chosen technologies.
Developing brings the design to life. It begins with prototyping the app’s user interface to refine user interactions and overall usability. Following this, the app is fully developed, incorporating the AI models and building both the front-end and back-end components to bring the design to fruition. The final step in this stage is project submission, where the completed app is prepared for review or deployment, accompanied by all necessary documentation.
This methodical approach ensures a thorough and effective development process, culminating in the creation of an engaging and innovative AI-powered app for children's storytelling and learning.
Research : AI Technology
In researching AI technology, it's crucial to understand how AI systems use algorithms and data processing to mimic human intelligence. Tools like TensorFlow and PyTorch are essential for developing and training machine learning models. Key trends include generative models for creating new content, explainable AI for transparency, and edge AI for real-time processing on devices. AMD’s Ryzen AI Engine enhances these capabilities by efficiently handling multiple AI workloads. Additionally, ONNX (Open Neural Network Exchange) facilitates model interoperability across different frameworks, while Hugging Face provides a library of state-of-the-art models and tools for natural language processing. Resources like the AMD Pervasive AI Developer Contest PC AI Study Guide offer valuable guidance for leveraging these technologies effectively.
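To illustrate the interoperability that ONNX provides, the short sketch below exports a small PyTorch model to the ONNX format so it could later be served by an ONNX-based runtime (for example, one targeting the Ryzen AI engine). The tiny network and file name are placeholders for illustration only, not part of the actual app.
import torch
import torch.nn as nn

# A tiny placeholder model standing in for a real network
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc(x)

model = TinyNet().eval()
dummy_input = torch.randn(1, 16)

# Export to ONNX so the same model can be consumed by other frameworks and runtimes
torch.onnx.export(model, dummy_input, "tiny_net.onnx",
                  input_names=["input"], output_names=["output"])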
Research : Kids' talking ability
Researching children's talking abilities involves a thorough understanding of language development stages, cultural influences, and the impact of bilingualism or multilingualism.
Language Development Stages: Children's language acquisition progresses through defined stages. Infants start with babbling and develop single words and simple sentences. By ages 3 to 4, they form more complex sentences and understand basic grammar. As they grow, their vocabulary and control over pronunciation and sentence structure improve.
Speech Patterns and Comprehension: Early language development features articulation challenges and simplified language use. Younger children depend on context and visual aids for comprehension. As they mature, their speech becomes clearer, and they better understand and use abstract language.
Cultural and Linguistic Influences: Language development is significantly influenced by cultural and linguistic contexts. In Sri Lanka, children may be exposed to multiple languages, including Sinhala, Tamil, and English. English-speaking children in Sri Lanka, who are often bilingual or multilingual, experience unique language acquisition dynamics that impact their pronunciation, vocabulary, and language proficiency.
Interaction with Technology: Designing applications for children requires consideration of their developmental stages and cultural backgrounds. For younger children, simple language, visual aids, and interactive elements are crucial. For children learning English as a second language, incorporating multilingual support and culturally relevant content enhances engagement and educational outcomes.
Variability in Communication Skills: Language development varies among children due to factors such as age, exposure, and individual differences. Recognizing these variations is vital for creating inclusive applications that cater to diverse needs and backgrounds.
Value of the Solution: Understanding these aspects is essential for developing educational tools that effectively support children's language learning and development. The proposed Speech-to-Image App provides significant value by integrating multilingual and culturally relevant content, tailored to various developmental stages and linguistic backgrounds. This approach not only enhances engagement and learning outcomes but also ensures inclusivity and accessibility for children with diverse language skills. By leveraging advanced AI technology, the app will deliver an interactive and immersive experience that supports both creative expression and educational growth, making it a valuable resource for children, parents, and educators.
Speech and Language Milestones
Age-Appropriate Speech and Language Milestones | Children's Hospital of Philadelphia (chop.edu)
Updating knowledge
Updating knowledge is crucial for developing innovative AI solutions, and this involves engaging with the latest advancements through various channels. Staying informed about new AI models and techniques can be achieved by taking specialized courses on platforms like Coursera, edX, and Udacity, which offer structured learning paths in machine learning and AI. Additionally, following trends and best practices is supported by industry reports, research papers, and YouTube channels such as Sentdex and AI Adventures, which provide practical tutorials and insights. Adapting to user needs and leveraging new research are facilitated by resources like Google Scholar and the AMD Pervasive AI Developer Contest PC AI Study Guide. Continuous professional development through these courses, videos, and industry engagement ensures that the AI-powered Speech-to-Image App incorporates the most current and effective technologies.
Stage 2 : Designing
Selecting AI models
Selecting AI models is a critical step in developing an AI-powered solution, as it determines the effectiveness and efficiency of the application. Here’s a structured approach to model selection:
Voice to Text Model
- Mozilla DeepSpeech: An open-source speech-to-text engine based on Baidu's Deep Speech research. It can be run locally and offers good performance for various languages.
- Vosk: An open-source toolkit that provides lightweight models capable of running on local PCs. Vosk supports multiple languages and offers real-time processing.
- Kaldi: A powerful speech recognition toolkit that can be configured for various tasks. It requires more setup and tuning but is highly customizable.
- Whisper by OpenAI: Available in models that can be run locally for transcription tasks. It's known for its robust performance in diverse conditions. (Selected)
Whisper by OpenAI is a great choice for local voice-to-text transcription due to its robust performance across various conditions.
openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision (github.com)
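As a minimal sketch of the selected voice-to-text step (assuming the openai-whisper package is installed and a recording such as "story.wav" exists; the file name is only an example), transcription with the medium.en checkpoint looks like this:
import whisper

# Load the English-only medium checkpoint selected for the app
model = whisper.load_model("medium.en")

# Transcribe a recorded narration; "story.wav" is a placeholder file name
result = model.transcribe("story.wav")
print(result["text"])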
LLM Model
- GPT-Neo and GPT-J: Open-source models developed by EleutherAI, designed to be more accessible for local deployment. GPT-Neo and GPT-J offer powerful language generation capabilities and can be run on local hardware with sufficient resources.
- BLOOM: Developed by the BigScience project, BLOOM is an open-access LLM that offers robust performance for various language tasks and can be run locally with appropriate hardware.
- DistilBERT: A smaller, distilled version of BERT, optimized for efficiency and speed. While not as large as GPT-3 or LLaMA, it provides good performance for many NLP tasks and is more feasible for local deployment on less powerful hardware.
- LLaMA by Meta: The LLaMA (Large Language Model Meta AI) family provides open-weight models that are designed for research and experimentation, making them suitable for local deployment. (Selected)
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-8B-Instruct · Hugging Face
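A minimal sketch of generating a children's story with the selected model via Hugging Face transformers is shown below. It assumes access to the gated repository has been granted, that transformers and accelerate are installed, and that enough memory is available; the system and user prompts are examples only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Chat-style prompt asking for a short, child-friendly story
messages = [
    {"role": "system", "content": "You are a friendly storyteller for young children."},
    {"role": "user", "content": "Tell a short story about a brave little turtle."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))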
Text to Image Model
Stable Diffusion
- An open-source model that excels in generating high-quality images from text prompts. It offers flexibility and produces high-resolution output, making it a strong choice for various applications. (Selected)
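A minimal sketch of the text-to-image step using the diffusers library is shown below (assuming diffusers is installed and the checkpoint has been downloaded; the prompt and output file name are examples, and the pipeline can be moved to a GPU or other accelerator if one is available):
import torch
from diffusers import StableDiffusionPipeline

# Load the selected text-to-image checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)

# Example prompt describing a scene from a child's story
prompt = "a friendly dragon reading a book to forest animals, colorful storybook illustration"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("story_scene.png")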
Text to Speech
OpenAI's TTS models convert text into natural-sounding spoken audio. Two model variants are available; tts-1 is optimized for real-time text-to-speech use cases. (Selected)
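Because tts-1 is accessed through the OpenAI API rather than run locally, this step needs an API key and an internet connection. A minimal narration sketch, assuming the openai Python package (v1+) is installed and OPENAI_API_KEY is set, might look like this; the voice and output file name are examples:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Convert a story sentence into spoken audio with the real-time optimized tts-1 model
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Once upon a time, a brave little turtle set out on a big adventure.",
)
response.stream_to_file("narration.mp3")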
Selected Models
- Voice to text : openai/whisper-medium.en
- LLM : meta-llama/Meta-Llama-3-8B-Instruct
- Text to image : runwayml/stable-diffusion-v1-5
- Text To Speech : tts-1 Via API
Test AI models on AI PC
- meta-llama/Meta-Llama-3-8B-Instruct
Testing the Meta-LLaMA 3 8B Instruct model on the AI PC involves checking how well it can create stories and generate prompts for images. The model is used to produce interesting and coherent stories from simple prompts, and to suggest detailed prompts for creating images. The AI PC’s powerful hardware helps by running these tasks smoothly and quickly, allowing us to see how well the model performs in generating both text and image prompts.
- runwayml/stable-diffusion-v1-5
Testing the RunwayML Stable Diffusion v1-5 model on the AI PC involves evaluating its capability to generate images using a user-friendly interface (ComfyUI_windows_portable). The Stable Diffusion v1-5 model is run locally through ComfyUI to assess image quality and generation speed, confirming that it can produce story illustrations responsively on the AI PC's hardware.
App structure planning and designing
The project design involves creating an engaging storytelling application for children using advanced AI technologies. The app will leverage the OpenAI Whisper-medium.en model for converting spoken words into text, allowing kids to narrate their stories effortlessly. The Meta-LLaMA 3 8B Instruct model will generate captivating stories from the text inputs. For visual elements, RunwayML’s Stable Diffusion v1-5 will create vibrant images based on the story descriptions. Lastly, the TTS-1 API will provide natural-sounding voice narration to bring the stories to life. This combination of models ensures a seamless and interactive storytelling experience for children.
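To make the intended flow concrete, here is a high-level sketch of how the four stages chain together. The helper functions named here are hypothetical placeholders for the model integrations described above, not the actual implementation.
# High-level pipeline sketch; each helper is a hypothetical placeholder for the
# model integrations described above (Whisper, Llama 3, Stable Diffusion, tts-1).

def transcribe_audio(audio_path: str) -> str:
    raise NotImplementedError("Wrap openai/whisper-medium.en here")

def generate_story(prompt_text: str) -> str:
    raise NotImplementedError("Wrap Meta-Llama-3-8B-Instruct here")

def generate_image(story: str) -> str:
    raise NotImplementedError("Wrap Stable Diffusion v1-5 here; return the image path")

def narrate_story(story: str) -> str:
    raise NotImplementedError("Wrap tts-1 via the API here; return the audio path")

def tell_story(audio_path: str) -> dict:
    """End-to-end flow: speech -> story text -> illustration -> narration."""
    prompt_text = transcribe_audio(audio_path)
    story = generate_story(prompt_text)
    image_path = generate_image(story)
    narration_path = narrate_story(story)
    return {"story": story, "image": image_path, "narration": narration_path}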
Solution Branding
"Narnia Adventures" was chosen as the name for our AI-powered speech-to-image app because it encapsulates the timeless magic, wonder, and adventure that defines C.S. Lewis’s beloved world of Narnia. Just as Narnia is a gateway to extraordinary realms of imagination, our app opens the door to endless creative possibilities for children. By simply speaking, kids can bring their stories to life with vibrant, dynamic images, making storytelling an interactive and visually engaging experience.
The name "Narnia Adventures" also reflects our commitment to fostering creativity, learning, and exploration. Through this app, we aim to inspire children to embark on their own adventures, explore their imaginations, and develop their storytelling skills in a fun, educational environment. The branding is designed to appeal to both children and parents, combining the nostalgic allure of Narnia with a modern, playful design that invites users to create, imagine, and dream.
Prototyping UI & App dry run
Prototyping the user interface (UI) and conducting a dry run of the app are crucial steps in the development process. This allows you to visualize the user experience, identify potential issues, and make necessary adjustments before full-scale development. Below is an outline of the prototyping process and a basic dry run for the AI-powered storytelling app.
App Development
1. Install AMD Ryzen™ AI Software
Installation Instructions — Ryzen AI Software 1.2 documentation (amd.com)
2. Clone AI models
1. openai/whisper-medium.en
pip install git+https://github.com/openai/whisper.git
pip install torch --extra-index-url https://download.pytorch.org/whl/cu118
2. meta-llama/Meta-Llama-3-8B-Instruct
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
Future work
As we continue developing our AI-powered speech-to-image app for kids' storytelling and learning, significant progress has been made, but the full solution is not yet complete. Below is an overview of the current status, including detailed instructions for setting up the local environment using Flask, integrating AI models, and the next steps until the full UI integration is achieved.
1. Current Progress and Implementation
The solution is being developed as a fully local application, leveraging the Flask web framework to manage interactions between the user and the AI models. The core components, including voice-to-text, text generation, text-to-image, and text-to-voice, have been successfully integrated into individual modules. Each module has been tested separately, ensuring that the AI models perform as expected when running locally.
2. Flask Setup and Local Integration
To create a fully functional localhost solution, the Flask framework has been selected for its simplicity and flexibility in handling web-based interfaces. Below is a basic setup guide for Flask, followed by code samples to demonstrate the integration of AI models.
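As a starting point, here is a minimal Flask application (installed with pip install flask) that serves a single page; the template name matches the index route used later in the full integration and is otherwise just an example.
# pip install flask
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    # Serves the storytelling UI page (templates/index.html)
    return render_template('index.html')

if __name__ == '__main__':
    # Run the development server on http://127.0.0.1:5000
    app.run(debug=True)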
While each model has been successfully integrated and tested in isolation, the final stages involve combining these modules into a cohesive application with a user-friendly interface. The next steps include:
- UI Integration: Developing a front-end interface that allows users to interact with the AI models seamlessly.
- Backend Optimization: Ensuring the Flask server can efficiently handle multiple AI tasks concurrently without compromising performance.
- Full Solution Testing: Conducting comprehensive testing to identify and fix any potential issues before the final deployment.
At this stage, the project is not yet ready for full deployment, but substantial progress has been made. The focus is on ensuring that the solution is robust, user-friendly, and capable of delivering high-quality results.
I will be adding code examples and implementation details to this documentation to provide a clear and practical guide for setting up and running the AI-powered storytelling app locally. These code snippets will cover the integration of various AI models, such as voice-to-text, text generation, and image creation, within the Flask framework. By including this detailed technical information, the documentation will serve as a comprehensive resource for understanding the inner workings of the solution and its development process.
import os
import uuid

import requests
import torch
import whisper
from flask import Flask, render_template, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads/'
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)

# Load the Whisper model (English-only medium checkpoint)
whisper_model = whisper.load_model("medium.en")

# Load the Llama 3 model and tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float16
)


@app.route('/generate', methods=['POST'])
def generate():
    # Receive the audio file from the UI
    audio_file = request.files['audio']
    filename = secure_filename(audio_file.filename)
    file_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4().hex}_{filename}")
    audio_file.save(file_path)

    # Transcribe the audio using Whisper
    audio = whisper.load_audio(file_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(whisper_model.device)
    result = whisper.decode(whisper_model, mel, whisper.DecodingOptions(fp16=False))
    transcription = result.text

    # Generate a story based on the transcription using Llama
    prompt = f"Generate a short kids story based on the following prompt: {transcription}"
    input_ids = llama_tokenizer.encode(prompt, return_tensors="pt").to(llama_model.device)
    output_ids = llama_model.generate(
        input_ids, max_new_tokens=256, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1
    )
    story = llama_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Generate an image based on the story using Stable Diffusion
    payload = {
        "prompt": story,
        "negative_prompt": "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality",
        "width": 512,
        "height": 512,
        "num_inference_steps": 50,
        "guidance_scale": 7.5,
    }
    # URL of the local Stable Diffusion inference service (left unspecified in the original)
    response = requests.post("", json=payload)

    # Save the generated image
    image_path = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4().hex}.png")
    with open(image_path, "wb") as f:
        f.write(response.content)

    # Remove the uploaded audio file
    os.remove(file_path)

    # Return the generated story and image path
    return jsonify({'story': story, 'image': image_path})


@app.route('/')
def index():
    return render_template('index.html')


if __name__ == '__main__':
    app.run(debug=True)
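To exercise the /generate endpoint during a dry run, a small Python test client can post a recorded narration to the local server. The file name is a placeholder, and the port assumes Flask's default development server address.
import requests

# Send a sample narration to the local Flask server and print the result
with open("sample_story.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:5000/generate",
        files={"audio": ("sample_story.wav", f, "audio/wav")},
    )

data = resp.json()
print("Story:", data["story"])
print("Image saved at:", data["image"])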
This AI-powered solution is designed to be a robust multi-model system, seamlessly integrating three advanced AI models that operate concurrently. This requires substantial computing power to ensure smooth performance and responsiveness. The current development environment leverages the AMD Ryzen™ 9 7940HS processor, a high-performance chip with 8 CPU cores and 16 threads. With a maximum boost clock of 5.2 GHz and a performance capacity of up to 10 TOPS, this processor provides the necessary computational resources to handle complex AI tasks efficiently.
In addition to our current setup, we are also evaluating the AMD Ryzen™ AI 9 HX processor, which represents a significant upgrade in terms of performance. This processor offers 12 CPU cores and 24 threads, with a boost clock of 5.1 GHz. What sets it apart is its enhanced processing power, with a total performance capability of up to 85 TOPS. Of particular interest is its Neural Processing Unit (NPU), which alone can deliver up to 55 TOPS, enabling faster and more efficient AI model execution. This hardware upgrade is under consideration to optimize the performance of our AI solution, ensuring it meets the highest standards of efficiency and reliability.
To further enhance the scalability and availability of the application, we are also exploring development on AMD Cloud. This would not only provide the necessary infrastructure for handling larger workloads but also allow for more flexible deployment options. By utilizing cloud-based resources, we can ensure that our solution remains highly performant and accessible, regardless of the scale at which it is deployed.
We have identified a strong market demand for this type of application, particularly in the fields of children's storytelling and educational tools. This growing interest underscores the potential impact of our solution, and we are committed to developing it into a commercial product that can be launched in the near future. By refining the technology and leveraging cutting-edge hardware from AMD, we aim to create a product that not only meets but exceeds user expectations in terms of functionality and performance.
Acknowledgements
I would like to extend my heartfelt thanks for providing the opportunity to participate in the Pervasive AI Developer Contest with AMD by offering a free AI PC. Your support and generosity have greatly empowered me to pursue my project with the cutting-edge tools and resources needed to innovate and create. This experience has been invaluable, and I am excited to see the impact of the work made possible through this incredible platform.
Rest assured, the solution will be completed as soon as possible, with new features that will further enhance its capabilities. I look forward to sharing the final product with you.
Thank you once again for your commitment to fostering creativity and innovation in the AI community.