A sandbox can be an explorative playground. By molding sand into different shapes, a person can explore their imagination and create amazing stories. Infinite Sands takes this a step further, bringing these stories to life through the use of generative AI.
Samples
I kept the wake word simple, "hello", as Whisper processes it without issue, while some other words I tried were hit and miss with the model, making them less reliable.
After you've triggered the logic that generates a prompt, you can say the following commands:
art - triggers the art LoRA from Kerin Lovett
normal - disables the art LoRA
render - re-renders the scene; useful if you accidentally get your hands in the generated photo and it continues to detect hands from the generated image
exit - quits the script
* - any other non-command words are set as the prompt, triggering a new render
This project utilizes a computer, some peripherals (a projector and a USB camera), and the sandbox hardware described below.
Computer Hardware
For our computer hardware we used several components we had on hand, in addition to some purchases for the project.
The final machine we utilized had the following specifications: an i5-13500 2.5 GHz processor, 64 GB of RAM, 1 TB of storage space, and the W7900 with 48 GB of VRAM.
We've assembled this PCPartPicker build that closely mimics the machine we used.
CPU: Intel Core i5-13500 2.5 GHz 14-Core Processor
CPU Cooler: Thermalright Peerless Assassin 120 SE WHITE ARGB 66.17 CFM CPU Cooler
Motherboard: MSI PRO Z790-P WIFI ATX LGA1700 Motherboard
Memory: TEAMGROUP T-Create Expert 64 GB (2 x 32 GB) DDR5-6000 CL34 Memory
Storage: Western Digital Black SN770 1 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
Video Card: AMD 100-300000074 Radeon PRO W7900 48 GB Video Card
Case: Enermax GraceMESH ATX Mid Tower Case
Power Supply: Thermaltake Toughpower GF3 TT Premium 1000 W 80+ Gold Certified Fully Modular ATX Power Supply
It is important when working with larger generative AI models to have plenty of VRAM and RAM. Without enough of either you won't be able to keep the larger models in memory, which can lead to issues. It is also advisable to have a decent power supply, as the W7900 draws quite a bit of power at 295 W.
Installation of the AMD Radeon PRO W7900 on the hardware side is fairly simple. The card is inserted into a PCIe slot on your motherboard as you would any graphics card, followed by connecting the power cables. You'll need two separate 8-pin power cables routed from your power supply; they should not share the same line but rather be two distinct cables.
Sandbox Hardware
We wanted a few things with the sandbox:
- A raised bed for interfacing in a comfortable setting
- Wheeled support allowing the sandbox to be moved at will
- Room for a personal computer under the sandbox for the generative AI
- Steady arm to hold the projector and camera unit
To accomplish this we utilized three primary store-bought components that worked well together for these goals: a wheeled welding table, a drawer, and an overhead camera mount.
The pieces come with their own instructions and for the most part you can follow them directly. The welding table was fairly easy to put together using some bolts in only a few steps.
The drawer that goes on the welding table only requires one deviation from its instructions: when mostly complete, leave out the center divider piece that comes with the drawer.
The monitor stand was also easy to assemble, just requiring screwing the different pieces together and tightening them with a set screw. It can be assembled to a size that fits well while keeping the projector raised high enough to project across the entire sandbox. Unfortunately we didn't take photos of this process, but its instructions are simple.
We created some 3D printed parts to act as spacers under the drawer when combining the final components. We printed six of these, which give decent support, and used VHB tape to hold them in place. They are needed because the monitor stand keeps the drawer from fitting perfectly, so these act as a base for the drawer. There are also two 3D printed squares used to allow the stand to tighten down onto the table.
With the assembled hardware we attached the projector, the USB camera, and then prepared the device for sand. We did this by adding a liner which we secured with a stapler and hammer. After the liner was in place we filled it with sand.
Kerin followed the guide she found by HollowStrawberry on Civitai. It made the process really straightforward, as she could compile a list of her art, label it automatically, and then train on it for multiple epochs until it gave decent results as a style LoRA.
The process is fairly straightforward, especially for those with an existing dataset like Kerin. She visited the dataset training colab, named her project "kerin-lovett", and ran the initial setup. This created a folder in Google Drive where she could drop all of her images.
She then mostly ran through the dataset colab as-is, only needing to modify the label generation logic since the default was geared toward anime labels.
After the dataset was readied she began work with the training colab provided by the same author.
At this point Kerin selected her project name from the earlier dataset step, changed the training model to Stable Diffusion v1.5, and began processing.
With the training done Kerin spent a little while comparing the LoRAs for checkpoints 6-10. In the end she settled on checkpoint 10 which she determined was the closest to her work.
If you're interested in trying this LoRA you can grab it from the project attachments.
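If you want to try it quickly in the Automatic1111 web UI, the usual LoRA prompt syntax applies. Below is a small illustrative snippet, assuming the downloaded file is saved as models/Lora/kerin-lovett.safetensors (the filename is our assumption):
# Illustrative only: toggling a style LoRA via the Automatic1111 prompt syntax.
# Assumes the downloaded file is saved as models/Lora/kerin-lovett.safetensors.
base_prompt = "a castle in the desert, detailed sand sculpture"  # placeholder prompt
art_prompt = base_prompt + ", <lora:kerin-lovett:1>"  # style LoRA enabled ("art" mode)
normal_prompt = base_prompt                           # style LoRA disabled ("normal" mode)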
Python Code
System Setup
We went with Ubuntu 22.04 LTS for our installation, utilizing Balena Etcher to create the USB installation drive. The process isn't too bad, just requiring a free USB drive of sufficient space, downloading the 22.04 LTS disk image, and then using the Etcher tool to write it to that medium. You'll need to enter your BIOS boot menu (it varies by motherboard), but with the USB drive plugged in you can move forward with the installation.
To make the setup process easier we installed the openssh-server package and configured it for SSH authentication.
sudo apt install openssh-server
For security purposes we disabled password authentication, uploaded our public key, and used that for login. Once set up, it is very easy to get started with Visual Studio Code's Remote - SSH sessions. If you're new to the process, DigitalOcean has an article that goes into more depth for Ubuntu 22.04.
With the basics out of the way you can install the ROCm software. Things move fairly fast in this area, so it is advisable to reference the following articles during installation (I won't repeat the steps here as they would get out of date quickly).
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/prerequisites.html
The primary thing to take away from the prerequisite above is to add yourself to the render and video groups. Failure to do so will lead to errors further on in the install process.
sudo usermod -a -G render,video $LOGNAME
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/ubuntu.html
With the ROCm software in place we can prepare our environment for Automatic1111's web UI.
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
This pip install fragment should be run to get the latest nightly ROCm 6.1 builds of torch, torchvision, and torchaudio. At times during the installation of other packages we found we needed to rerun it to get back on the correct version, so it's worth keeping at hand.
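As a quick sanity check after installing the nightly wheels, a short snippet like this can confirm that PyTorch sees the W7900 (the ROCm build exposes the GPU through the usual torch.cuda API):
# Quick sanity check that the ROCm build of PyTorch can see the GPU.
import torch

print("PyTorch version:", torch.__version__)        # should mention a rocm build
print("GPU available:", torch.cuda.is_available())  # True if ROCm and the driver are set up
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))  # e.g. the Radeon PRO W7900
    x = torch.rand(1024, 1024, device="cuda")
    print("Sample matmul OK:", (x @ x).shape)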
For installation we mostly followed the ROCm guide for Automatic1111's UI. We used the running natively option but you could use what you're comfortable with. For our install we took the following steps:
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui && cd stable-diffusion-webui
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip wheel
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
You will need a few more libraries if utilizing our script. We'll include their installation fragments here.
pip3 install openai-whisper sounddevice numpy
sudo apt install libportaudio2
sudo apt-get install libgtk2.0-dev
pip install screeninfo
sudo apt install screen
I believe everything else was already available on our machine. We had installed ComfyUI earlier during initial testing, so apologies if we've missed a requirement, but this should cover the main elements.
Inside the Automatic1111 web UI tooling there are two extensions we utilized: sd-webui-api-payload-display and the ControlNet extension. The former makes it easier to copy a workflow for use with the API (more on this later) and the latter enables the use of ControlNet.
With that in place you should have enough set up to run Automatic1111's web UI and are ready for the custom script that turns a normal sandbox into Infinite Sands.
First we'll need to clone the repository with the infinite sands logic.
git clone https://github.com/Timo614/infinite-sands.git
You'll notice there are several files in the repository.
Code
infinite-sands.sh - We placed this file in our home directory to start the Stable Diffusion web UI as a background task in our terminal. Running chmod +x infinite-sands.sh prepares the file for execution. This is optional; you could just start the Stable Diffusion web UI server manually so long as you remember to append the --api option.
infinite-sands.py, infinite-sands-api.json, and cv_fullscreen.py should be placed alongside each other.
Test Files
The test-files directory contains several test-related pieces of code used during development. These can be ignored for a deployment but may help to debug issues if you're experiencing problems with your setup.
board-test.py - A file used to create a CharucoBoard utilized for viewport mapping. This logic is used in the infinite-sands.py script as part of the calibration, but the sands-improved.ipynb file provides a Jupyter notebook to step through the calibration process, so it's useful to leave this board displayed, run a capture from the notebook, and walk through the calibration. (A minimal sketch of generating such a board follows this list.)
blank-test.py - Once the calibration was completed we needed a way to make the sand well lit so we could take a photo for use with Depth-Anything-V2 via the ControlNet extension. The aforementioned Jupyter notebook used this to capture the state of the board, which we warped using the earlier calibration and tested with the web UI to verify our setup.
sands.ipynb - An initial Jupyter notebook that tested using colors to determine the outline of the sandbox, experimenting with photos taken directly from the mounted webcam. Time of day and general lighting conditions affected this approach heavily.
sands-improved.ipynb - A Jupyter notebook used for additional experimentation. The color-based approach was dropped here in favor of projecting a CharucoBoard layout and mapping back to those values.
About the Code
The code has comments so I won't dig too much into it but will discuss some of the features of the main script that runs.
In the script you can configure an initial prompt, whether to use the artist LoRA from Kerin Lovett, the wake word (I suggest something simple), and some configuration values for tuning the silence detection with Whisper.
The logic begins by starting the calibration flow: displaying the CharucoBoard pattern, taking a webcam frame, and mapping the values between the two. As sand can distort the markers, the logic relies on matching as many as it can and discards any that were not found. Even with some discarded, I found the resulting warp to be fairly accurate.
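As a rough illustration of that mapping step (not the repository's exact code), the warp can be expressed as a homography between the detected Charuco corners in the webcam frame and their known positions in the projected image. Detection itself is omitted here since the aruco API differs between OpenCV versions; assume the two corner sets are matched N x 2 float32 arrays:
# Sketch of the warp step: estimate a homography from matched Charuco corners and use it
# to map webcam frames into projector space.
import cv2
import numpy as np

def compute_warp(camera_corners: np.ndarray, projector_corners: np.ndarray) -> np.ndarray:
    # RANSAC tolerates corners that sand has distorted or that were matched poorly.
    homography, _mask = cv2.findHomography(camera_corners, projector_corners, cv2.RANSAC, 5.0)
    return homography

def warp_frame(frame: np.ndarray, homography: np.ndarray, size=(1280, 720)) -> np.ndarray:
    # Map the webcam view into the projector's coordinate space.
    return cv2.warpPerspective(frame, homography, size)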
Once the script has calibrated, it begins a thread that continually listens for audio with Whisper. When the wake word is heard, it forwards the prompt to a queue that handles several commands: art to enable the LoRA from Kerin Lovett, normal to disable it, render to re-render the scene, exit to quit, and, if none of those commands match, the words are set as the new prompt and rendered.
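A rough sketch of that listen-and-dispatch flow might look like the following. render_scene() is a hypothetical placeholder for the capture/warp/submit/display logic the repository implements, and the Whisper model size, chunk length, and initial prompt are assumptions:
# Rough sketch of the wake-word listener and command dispatch (not the repo's exact code).
# render_scene() is a hypothetical placeholder for the capture/warp/submit/display logic.
import queue
import threading

import numpy as np
import sounddevice as sd
import whisper

WAKE_WORD = "hello"
commands = queue.Queue()

def record_audio_chunk(seconds=3, samplerate=16000) -> np.ndarray:
    # Record a short mono clip as float32 samples, the format Whisper accepts directly.
    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()

def listen_loop():
    model = whisper.load_model("base")  # model size is an assumption
    while True:
        text = model.transcribe(record_audio_chunk())["text"].lower().strip()
        if WAKE_WORD in text:
            # Everything after the wake word becomes the command or new prompt.
            commands.put(text.split(WAKE_WORD, 1)[1].strip(" .,!"))

def dispatch_loop():
    use_lora = False
    prompt = "a sandy landscape"  # placeholder initial prompt
    while True:
        spoken = commands.get()
        if spoken == "art":
            use_lora = True
        elif spoken == "normal":
            use_lora = False
        elif spoken == "exit":
            break
        elif spoken == "render":
            render_scene(prompt, use_lora)
        else:
            prompt = spoken  # any other phrase becomes the new prompt
            render_scene(prompt, use_lora)

threading.Thread(target=listen_loop, daemon=True).start()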
While it's doing this processing, it also takes a frame from the webcam every loop, examining the contents for hands using Google's MediaPipe hand identification logic. It watches for the frame where the hands leave the scene and at that point triggers a render. In this way you can modify the existing sand without it rapidly rendering over your current design.
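A minimal sketch of that hand check using MediaPipe's Hands solution (again, not the repository's exact code) could look like this, with a render triggered on the transition from hands present to hands absent:
# Minimal sketch of detecting whether any hands are present in a webcam frame.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

def frame_has_hands(frame_bgr) -> bool:
    # MediaPipe expects RGB input while OpenCV captures frames as BGR.
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    return results.multi_hand_landmarks is not None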
The render logic switches the projector to a white display, takes a frame from the webcam, and warps it using the calibration data above. Using the base64-encoded version of this image, it prepares the JSON payload for the Stable Diffusion web UI's API, submits it, and gets back the generated image. The generated image is then scaled to three times its size and displayed.
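The two image conversions described above can be sketched as small helpers (a rough approximation, not the repository's exact code):
# Rough helpers: base64-encode the warped frame for the API request, then decode and
# scale the returned render up three times.
import base64

import cv2
import numpy as np

def frame_to_base64(frame) -> str:
    ok, png = cv2.imencode(".png", frame)
    return base64.b64encode(png.tobytes()).decode("utf-8")

def base64_to_scaled_image(b64_image: str, scale: int = 3):
    data = np.frombuffer(base64.b64decode(b64_image), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)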
Running the Script
With the script in place and the Stable Diffusion web UI running, you can start a screen instance, confirm its display is set to your projector, and then run the script.
The script will run through a calibration process that projects the CharucoBoard onto the sandbox and then figures out the correct warp to apply. It's worth noting it's imperative you flatten the sand prior to running this, both so the markers can be detected and so the warp is fairly accurate when applied.
nohup ./infinite-sands.sh &
screen -S display_session
export DISPLAY=:0
python3 infinite-sands.py
Script Setup
Development Rundown (Bonus Section)
It's probably easiest to just open the sands-improved.ipynb file and read the comments to follow along. That said, we've included photos of each step in the process.
This warped image can then be used in the Stable Diffusion web UI through ControlNet as seen below:
Stable Diffusion Web UI API
The API for Automatic1111's web UI can be enabled with the --api flag. Turning on this flag provides a Swagger interface for utilizing the API. The API documentation can be found by appending /docs to a running server with the flag enabled. For example, running locally you'd hit: http://127.0.0.1:7860/docs#/
To aid in the process of generating a web payload, a helpful extension can be installed in the Automatic1111 tooling. This extension (sd-webui-api-payload-display) surfaces an accordion dropdown with prefilled values based on your current session. In our case we wanted to use a width and height one third the size of the projection resolution. We did this to speed up the render time while still generating an output that would work for our sandbox (it is later scaled up in the associated code).
In this way we were able to create a JSON file representing the payload for the depth-based ControlNet use. All we needed to provide at runtime was the prompt and a base64-encoded image for the API.
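As a rough sketch of that step, the submission boils down to a single POST to the txt2img endpoint. The exact key path for the ControlNet input image inside the exported payload is an assumption on our part; check it against your own infinite-sands-api.json:
# Rough sketch of submitting the exported payload to the web UI API. The "prompt" key is
# standard, but the path to the ControlNet input image inside the payload is an assumption.
import json
import requests

def generate(prompt: str, control_image_b64: str, url: str = "http://127.0.0.1:7860") -> str:
    with open("infinite-sands-api.json") as f:
        payload = json.load(f)

    payload["prompt"] = prompt
    payload["alwayson_scripts"]["controlnet"]["args"][0]["image"] = control_image_b64  # assumed key

    response = requests.post(f"{url}/sdapi/v1/txt2img", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["images"][0]  # base64-encoded generated image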
Easter Eggs
Depth Based Body Art: We attempted to avoid rendering while a person was present in the viewfinder to avoid any issues visualizing the screen. If you bypass this, though, by quickly reinserting your arms during the render window, you can take part in the art itself! The "render" command exists to force a new rendering of the scene. This can be helpful as the logic may assume the hand it sees in the viewfinder is yours after the art renders, preventing you from changing the scene in that way (you could change the prompt to get around this as well).
Here are a few samples created in this way:
Viewfinder Correction: Initially we took a simpler approach to calibrating the projector and camera, using the edge of the drawer painted with red acrylic paint. While this did work somewhat, it had little accuracy and was prone to issues around lighting. We abandoned it in favor of the CharucoBoard approach used above, which did a great job without any reliability issues.
Whisper Hallucinations: One issue that came up was the need to deal with hallucinations from Whisper that would occur during periods of silence. We weren't sure if this had to do with the close proximity to the projector fans, but we found others online struggling with it as well. In the end we opted to ignore the words "Thank" and "You" in the prompt text as they were the primary false positives we encountered.
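That filter only takes a few lines; a sketch of the idea (the exact handling in the repository may differ):
# Sketch of the hallucination filter: drop the common false-positive words before the
# transcription is treated as a command or prompt.
HALLUCINATED_WORDS = {"thank", "you"}

def clean_transcription(text: str) -> str:
    words = [w for w in text.lower().split() if w.strip(".,!") not in HALLUCINATED_WORDS]
    return " ".join(words)

# clean_transcription("Thank you.") -> "" so silence no longer fires a spurious prompt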
ComfyUI API: We were going to use ComfyUI as it seems well supported, the tooling allows for customized workflows, and it appears to render quicker than the Automatic1111 UI. After putting together a workflow, we opted to abandon this approach as the API we found was less mature than the options available for Automatic1111's UI and it would have eaten into our project time further.
Card issues: Initially the card we received for the project had issues when utilizing the card's driver. We attempted several suggestions from the AMD team but were unable to get the card working. The AMD team was great and helped us resolve the issue with a replacement card in time to still submit. It did delay our participation a bit and eat into the available time.
Depth camera issues: We purchased an Oak-D Lite for use with the project. While the device does show some interesting visualizations, we were unable to calibrate it to give a decent depth map for the project. In the end we decided to go the route of using Depth Anything V2 and relying on the detection of hands to trigger updates. If you have the budget for it, a nicer depth camera may help further improve this workflow.
Open Source Licenses
- Mediapipe: Apache License
- Whisper: MIT License
- Stable Diffusion 1.5: CreativeML Open RAIL-M License
- DepthAnythingV2 Large: CC-BY-NC-4.0
- ControlNet: Apache License