Team Arti:

jetsonmaster101

Created March 25, 2024

Arti the Art-Bot

An AI-powered art bot which generates images based on the conversation at hand, in real time.

Things used in this project

Hardware components

NVIDIA Jetson AGX Orin Developer Kit

Webcam, Logitech® HD Pro

TV Screen

Any monitor or TV screen serves this purpose.

Samsung 990 Pro SSD 1TB

Any SSD with at least 1TB should work.

Software apps and online services

PyCharm 2023.3.4

Story

Image generated by Arti

Arti - an overview

Art comes in many beautiful forms, but you can only fit so many paintings on your wall. What looks interesting one day might fail to catch your attention the next. What if you could have infinitely diverse artwork generated every few minutes? What about artwork based on your conversations? What if it was completely powered by AI? Welcome to this tutorial, where we will show you how to build Arti, your personal art-bot.
Arti captures your conversations and creates images in real time. It’s powered by several forms of generative AI, which help it create accurate yet distinct images related to the conversation at hand. It’s also completely open-source, meaning that you can build your own art-bot with just a Jetson.
We created Arti to constantly generate inspirational, meaningful, and relevant works of art, and we hope you enjoy using it as much as we did. Let’s get started!

How Arti works:
Arti turns your conversations into images in three steps:
First, it starts a new recording of the conversation every 60 seconds. This means the image will change every 60 seconds - you can adjust this according to your preferences. It saves the recording as an audio file and sends it to Whisper, an open-source automatic speech recognition model released by OpenAI, which transcribes the audio into text.
Next, LLaMa 2, an open-source large language model similar to ChatGPT, rewrites this raw text into a prompt that is easier for Stable Diffusion (our image generation model, also open-source) to interpret. This step allows Stable Diffusion to generate images that represent the topic of the conversation more accurately.
Finally, Stable Diffusion receives its prompt from LLaMa 2 and generates the image. In this way, your art-bot generates images based on your conversation!
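The three steps above can be sketched as a small pipeline. This is a minimal sketch, not the project's actual code: `build_llm_prompt`, `run_once`, and the three callables passed in are hypothetical names, the system instructions are an abbreviated paraphrase, and the stand-ins in the usage example replace the real Whisper, LLaMa 2, and Stable Diffusion calls. The wrapper assumes the `[INST] <<SYS>> ... <</SYS>> ... [/INST]` layout used with LLaMa 2 chat models.

```python
from typing import Callable, Tuple

# Abbreviated, illustrative instructions (an assumption, not the project's exact text).
SYSTEM_INSTRUCTIONS = (
    "You will be provided with a conversation. Your task is to generate a "
    "simple prompt for an AI image generator based on the conversation. "
    "Only respond with the prompt."
)


def build_llm_prompt(transcript: str, system: str = SYSTEM_INSTRUCTIONS) -> str:
    """Wrap a transcript in the LLaMa 2 chat layout (assumed format)."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n{transcript} [/INST]"


def run_once(
    audio_path: str,
    transcribe: Callable[[str], str],
    rewrite: Callable[[str], str],
    generate: Callable[[str], object],
) -> Tuple[str, str, object]:
    """Run one speech -> text -> image prompt -> image cycle."""
    # Step 1: speech to text.
    transcript = transcribe(audio_path)
    # Step 2: let the language model turn the transcript into an image prompt.
    image_prompt = rewrite(build_llm_prompt(transcript)).strip()
    # Step 3: text to image.
    image = generate(image_prompt)
    return transcript, image_prompt, image


if __name__ == "__main__":
    # Stand-ins so the sketch runs without any models installed.
    result = run_once(
        "recording.wav",
        transcribe=lambda path: "we talked about nuclear power plants",
        rewrite=lambda prompt: " A nuclear power plant at dusk \n",
        generate=lambda prompt: f"<image for: {prompt}>",
    )
    print(result)
```

Passing the three models in as callables keeps each stage swappable, which may help when trying a smaller Whisper or LLaMa variant on more limited hardware.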
We won’t be writing code in this tutorial, but you can access it at the top of this GitHub page. However, please read through everything first - your device may not meet the requirements for this project, and you may not have everything set up yet.
Because we didn’t want our conversations to leave the device, we stored everything in local directories, which also lets Arti run offline.
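A minimal sketch of such a local layout, assuming one folder per output type and a shared file stem per recording. The folder names and file suffixes mirror those in arti.py; `prepare_dirs` and `output_paths` are hypothetical helpers, not part of the project.

```python
from pathlib import Path
from typing import Dict

# Folder names follow arti.py; the helper names below are illustrative.
OUTPUT_DIRS = ("Audio_Files", "Image_Files", "Whisper_Texts", "Llama_Texts")


def prepare_dirs(root: Path) -> Dict[str, Path]:
    """Create every output folder under root; safe to call repeatedly."""
    dirs = {}
    for name in OUTPUT_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        dirs[name] = path
    return dirs


def output_paths(dirs: Dict[str, Path], stem: str) -> Dict[str, Path]:
    """Return the image and text paths that belong to one recording."""
    return {
        "image": dirs["Image_Files"] / f"{stem}image.jpg",
        "whisper": dirs["Whisper_Texts"] / f"{stem}whisper_text.txt",
        "llama": dirs["Llama_Texts"] / f"{stem}llama_text.txt",
    }


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        dirs = prepare_dirs(Path(tmp))
        paths = output_paths(dirs, "12345output")
        # The context manager closes the file even if the write fails.
        with open(paths["whisper"], "w", encoding="utf-8") as handle:
            handle.write("transcript goes here")
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Because every output shares the recording's stem, the transcript, the generated prompt, and the image for one conversation can be matched up later without any index file.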

Examples
Following are a few of our favorite examples of the images Arti generated, along with the conversations taking place at those times.

EXAMPLE 1:

In this example, two people were discussing a diagram explaining nuclear fission.
Conversation: and this lighter isotope is less tightly bound. Wait, what? Compared to its... And this light... Wait, wait, what happened here? ...lighter isotope is less tightly bound. That's an isotope. Compared to its... And this, but roughly one in every 140, lacks three neutrons, and this lighter isotope is less tightly bound. It's not on the... Compared to its more abundant cousin, a strike by a neutron easily splits the U-235 nuclei into lighter radioactive elements called fission products, in addition to two to three neutrons, gamma rays, and a few neutrinos. During fission, some nuclear mass transforms into energy. A fraction of the newfound energy powers the fast moving neutrons. And if some of them strike uranium nuclei, fission results in a second larger generation of neutrons. So that's a little bit very complicated. It's very complicated. It's not . So there's three types of radioactivity that happen with uranium, alpha, beta, and gamma. And so the gamma first starts and then does these other two. It's more complicated than this. This video will not really explain it very well. But it starts a nuclear reaction. The gamma starts, you know, things that happen. Yeah, but maybe I can more actually show what it looks like. If you want actual uranium? Like this. Well, you want to know the nuclear... No, like this. You want to know the uranium really actually, how it works. Uranium decay, okay? That's what it is. That's if you really want to... Don't take each of your autonomous. This is how uranium becomes the chain reaction. It's a physics one, but it might be complicated. I don't know, just go back for one second. Yeah, this one. This is fission. There's something called fission and fusion in nuclear reactions. Do you want to know? Let's try this one, and if it's too hard. Yeah, but there's another one. Actually, how it works. Do you want to know? Let's try this one if it's too hard. Yeah, but there's another one. Actually how it works. 
Not like, hmm, but the thing is. Like actually how it works. But do you want to know the reaction of the radio? No, it just goes back to normal. So this new Wi-Fi camera is taking the US by storm. This brand new subscription. Electrobel has seven nuclear power plants, four in Doule and three in Tiange, covering half of the electricity consumption in Belgium without producing CO2. But how exactly does a nuclear power plant work? A nuclear power plant works to a large extent exactly does a nuclear power plant work? A nuclear power plant works to a large extent like a conventional thermal power plant. Water is converted into steam which drives a turbine connected to a generator. This generator converts the mechanical energy into electrical energy. The only difference is that the heat which converts water into steam is produced by nuclear fission and not by burning coal, natural gas or biomass. The nuclear power plants of Doubs and Thiages use fissile uranium.
LLaMa's Interpretation: A nuclear power plant with a glowing core and steam rising from its cooling towers, surrounded by a futuristic cityscape with sleek skyscrapers and neon lights.
Image generated:

EXAMPLE 2:

In this example, Whisper didn't interpret all the audio correctly. For instance, it transcribed an alarm going off on a phone as "San Diego". Errors like this were one of the reasons why we chose to use LLaMa 2.
Conversation: This is what San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego San Diego I'm sorry. Where? I want to. Dad? Yeah? Mom needs you for pancakes. In the middle of something, man. Do I have to? It's fine. They don't want pancakes. That's like, I don't want to. Mom. Mom, I'm in the middle of something with Maya. Look it's raining outside. I don't want to do pancakes right now. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. Mom. I don't want to do pancakes right now. Thank you. I'm sorry. . . . . . . . . . . . . . . . . . Remember, because he's doing the virtual environment.
LLaMa's Interpretation: A young boy, surrounded by rain and fog, reluctantly helps his mom make pancakes while using a virtual reality headset.
Image generated:

EXAMPLE 3:

In this example, a video about training autonomous cars was playing.
Conversation: not reversing the vehicle, and not requiring online calculations. Additionally, where all previous methods have used low constant speeds, our method uses variable speeds up to 6 m per second and ensures the vehicle remains within the friction limit. We presented a method of safe learning that reformulates reinforcement learning to incorporate the supervisor. The safe learning method was evaluated in the F-10-1-10 simulator at speeds of up to 6 meters per second. The results showed that safe learning presents a 5x or 5 times improvement in sample efficiency, requiring only 10,000 steps. The supervisor and the learning formulation effectively train the agent to not require supervision. The safe learning agents select lower speed profiles than the conventional learning agents. This results in the safe learning agents achieving slower lap times and higher success rates. A major advantage of our methods is that the vehicles never crash during training. Future work should use this method to train agents on board physical vehicles. The ability to train agents for high performance robotic control while ensuring safety during the training process means that these methods can be used to train deep reinforcement learning agents on real world robots, thus bypassing the sim to real problem. Future work should evaluate how well safe learning uses the supervisor performs using the supervisor performs on real-world high-performance platforms. Bypassing a simple gap will mean that there is no difference between the training and testing behavior since both will be on the same physical device. The improvement in sample efficiency means that it is easier to use deep reinforcement learning since training time is reduced. Training more conservative policies leads to safer solutions which are essential for
LLaMa's Interpretation: Generate an image of a high-performance robotic vehicle navigating a challenging track while ensuring safety during training.
Image generated:

Video

And now, a short video to demonstrate Arti in action!

Conclusion
We hope you enjoy using Arti! Remember that it’s not perfectly accurate, but you may find the results entertaining.

Thank you and happy generating!

arti.py

# Importing libraries for audio recordings
import glob
import sounddevice as sd
from scipy.io.wavfile import write
import random
import os

# Importing libraries for Whisper
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Importing libraries for LLaMa 2
from llama_cpp import Llama
import torch

# Importing libraries for Stable Diffusion
from diffusers import AutoPipelineForText2Image
import cv2
import numpy as np


class NoWatermark:
    def apply_watermark(self, img):
        return img


# Set necessary variables
fs = 44100
recording_seconds = 600

# Create and define directories
audio_dir = "./Audio_Files/"
image_dir = "./Image_Files/"
whisper_text_dir = "./Whisper_Texts/"
llama_text_dir = "./Llama_Texts/"
for directory in (audio_dir, image_dir, whisper_text_dir, llama_text_dir):
    os.makedirs(directory, exist_ok=True)  # No error if the directory already exists

opencv_window_name = 'Arti'
cv2.namedWindow(opencv_window_name, cv2.WINDOW_NORMAL)
cv2.setWindowProperty(opencv_window_name, cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)

# Start a recording while the model is loading for the first time
print("Initial recording started")
my_recording = sd.rec(int(recording_seconds * fs), samplerate=fs, channels=2)

# Display first splash screen
splash_screen_1 = cv2.imread("./Splash_Screens/splash_screen_1.png")
cv2.imshow(opencv_window_name, splash_screen_1)
cv2.waitKey(1)


device = "cuda:0" if torch.cuda.is_available() else "cpu"  # Use the GPU when available, otherwise fall back to the CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Whisper specifications:
whisper_model_id = "openai/whisper-large-v3"  # Specifying what Whisper model to use
whisper_prompt = "hacker in a room, muted colors, detailed, 8k"

# Stable Diffusion specifications:
stable_diffusion_prompt_pre = ""
stable_diffusion_negative_prompt = "Cartoon, Comic, Nudity, People, Humans"  # Setting negative prompt


# Below, we are going to load the models

# Load Whisper:
whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
        "./Models/Whisper_Model/", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, local_files_only=True
    )
whisper_model.to(device)
whisper_processor = AutoProcessor.from_pretrained(whisper_model_id)
whisper_pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=whisper_processor.tokenizer,
    feature_extractor=whisper_processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Load LLama:
llm = Llama(
    model_path="./Models/Llama_Model/llama-2-13b-chat.Q6_K.gguf",
    n_gpu_layers=-1,  # Using GPU acceleration
    n_ctx=2048  # Increasing the context window
)

# Load Stable Diffusion:
pipeline_text2image = AutoPipelineForText2Image.from_pretrained("./Models/Stable_Diffusion_Model/",
                                                                torch_dtype=torch.float16, variant="fp16",
                                                                use_safetensors=True, local_files_only=True).to("cuda")
pipeline_text2image.watermark = NoWatermark()

first_run = True

# Create recordings
while True:
    # Start a recording
    print("Starting recording loop")
    sd.stop()
    for x in range(1, recording_seconds):
        if my_recording[x * fs][0] == 0:
            my_recording = my_recording[0:x * fs]
            break
    # Note: scipy writes WAV data here; the file only carries an .mp3 extension
    write(audio_dir + str(random.randint(0, 100000000)) + 'output.mp3', fs, my_recording)
    print("Recording finished.")

    # Display second splash screen
    if first_run:
        splash_screen_2 = cv2.imread("./Splash_Screens/splash_screen_2.png")
        cv2.imshow(opencv_window_name, splash_screen_2)
        first_run = False
        cv2.waitKey(1)

    # Start another recording while the image is being generated
    print("Recording started.")
    my_recording[:] = 0
    my_recording = sd.rec(int(recording_seconds * fs), samplerate=fs, channels=2)

    # Find the most recently saved recording and transcribe it
    file_type = r'*.mp3'
    files = glob.glob(audio_dir + file_type)
    max_file = max(files, key=os.path.getctime)
    print("Last recording taken: ", max_file)

    audio_results = whisper_pipe(max_file, generate_kwargs={"language": "english"})
    whisper_prompt = audio_results["text"]
    whisper_prompt = stable_diffusion_prompt_pre + whisper_prompt

    # Give the LLM its prompt
    llm_prompt = ("[INST] <<SYS>>\nYou will be provided with a conversation. Your task is to generate a simple prompt for an AI image generator based on the information in the conversation. You should only provide a single prompt. Do not include any other information as a response. Only respond with the prompt so it can be used as input for an AI generative model. Do not include anything related to nudity in your prompt.\n<</SYS>>\n{" + whisper_prompt + "}[/INST]")
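    stable_diffusion_prompt = ""

    # Create LLM response, retrying until the model returns a non-empty prompt
    while stable_diffusion_prompt == "":
        llm_response = llm(llm_prompt, max_tokens=512)
        print("LLM_response:", llm_response)
        # Keep the last non-blank line of the completion as Stable Diffusion's next prompt
        stable_diffusion_prompt = llm_response['choices'][0]["text"].strip().split("\n")[-1]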

    # Print outputs (for user's convenience)
    print("Whisper's output: ", whisper_prompt)
    print("Stable diffusion's prompt: ", stable_diffusion_prompt)

    # Stable Diffusion creating the image:
    image = pipeline_text2image(prompt=stable_diffusion_prompt, negative_prompt=stable_diffusion_negative_prompt, height=1024, width=int(int(1024 * 16 / 9 / 8) * 8)).images[0]

    opencv_converted_image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
    cv2.imshow(opencv_window_name, opencv_converted_image)  # Show the image

    # Save audio, text, and image files
    print("Saving ...")
    cv2.imwrite(image_dir + max_file[:-4].split("/")[-1] + "image.jpg", opencv_converted_image)
    # Context managers close the files so the text is flushed to disk
    with open(whisper_text_dir + max_file[:-4].split("/")[-1] + "whisper_text.txt", "w") as whisper_file:
        whisper_file.write(whisper_prompt)
    with open(llama_text_dir + max_file[:-4].split("/")[-1] + "llama_text.txt", "w") as llama_file:
        llama_file.write(stable_diffusion_prompt)

    # If Q was pressed to quit fullscreen mode:
    print("Checking if Q pressed")
    k = cv2.waitKey(1)
    if k == ord('q'):
        cv2.destroyAllWindows()
        break

print("End")
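
The image width in the listing above is computed inline as int(int(1024 * 16 / 9 / 8) * 8): a 16:9 width for a 1024-pixel height, rounded down to a multiple of 8. A small helper can make that arithmetic easier to adjust for other screens. This is a sketch only: `widescreen_width` is a hypothetical name, and the multiple-of-8 rounding is taken from the listing rather than checked against a particular Stable Diffusion release.

```python
def widescreen_width(height: int, ratio_w: int = 16, ratio_h: int = 9,
                     multiple: int = 8) -> int:
    """Width for the given height and aspect ratio, rounded down to a multiple."""
    if height <= 0 or ratio_w <= 0 or ratio_h <= 0 or multiple <= 0:
        raise ValueError("all arguments must be positive")
    # Integer division keeps the result exact and always rounds down.
    exact_width = height * ratio_w // ratio_h
    return (exact_width // multiple) * multiple


if __name__ == "__main__":
    print(widescreen_width(1024))  # 1816, the same width the listing computes
```

With such a helper, the call in the listing could pass width=widescreen_width(1024) in place of the inline expression.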

Credits

Rain Romman
