Voice technology has many applications in the modern world. Consumers have come to expect voice interactions in smartphones, home devices, and services. In addition, projects that use AI to learn the voices of popular singers and recreate their songs have recently attracted attention. Based on these trends, we have started a new project that combines a WIZnet IoT speaker with AI TTS and singer-voice recreation technology.
*The following is the original description, not mine. I used it because it is more descriptive.*
Organize the requirements to achieve the goal and define and schedule detailed tasks based on them.
- Utilize the high-speed characteristics of the W5300 chip.
- Be able to deliver audio data in real time.
- It should be able to output data from the network to the speaker.
- It should be possible to check necessary information such as IP using an information display device such as an LCD.
- Build our own board and share the design data.
- Audio data can be sent from a PC or from another identical board (a rough streaming sketch follows this list).
- The development environment must be multi-platform (Windows/Mac/Linux).
- Control the W5300 by reading the datasheet and building a register-level library by hand.
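To make the real-time requirement above concrete, here is a minimal sketch of what sending audio from a PC could look like. Everything in it is an assumption for illustration, not the project's actual protocol: the board's IP and port, the raw 16-bit PCM framing, and the chunk-based pacing.

```python
# Minimal sketch of real-time audio streaming from a PC to the board.
# Assumptions (illustrative only, not the project's actual protocol):
# the board listens on TCP port 5000 and accepts raw PCM frames.
import socket
import time
import wave

BOARD_ADDR = ("192.168.0.100", 5000)  # hypothetical board IP/port
CHUNK_FRAMES = 1024                   # frames sent per packet

with wave.open("test.wav", "rb") as wav:
    rate = wav.getframerate()
    with socket.create_connection(BOARD_ADDR) as sock:
        while True:
            data = wav.readframes(CHUNK_FRAMES)
            if not data:
                break
            sock.sendall(data)
            # Pace the sender so audio arrives at roughly playback speed.
            time.sleep(CHUNK_FRAMES / rate)
```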
As you progress through the project, feel free to draw out and reflect on things to consider.
- I make my own boards and enjoy the process.
- Create additional I/O on the board beyond the basic functions so that it can be utilized as a development board.
- How do you want to send audio data from the PC?
- If you're creating a PC program, should it be a GUI? (A minimal PySide6 sketch follows this list.)
- Do I need a microphone sensor, and if so, should I mount it on the board?
- Can we send video data as well?
- Could you send motion data to control a remote robot?
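On the GUI question above: a minimal PySide6 sketch of the kind of program this could become. The window title and the send_audio() placeholder are mine for illustration; the project's actual GUI is in the linked GitHub repository.

```python
# Minimal PySide6 sketch: a window with a single Play button that
# would trigger the audio sender. send_audio() is a placeholder.
import sys

from PySide6.QtWidgets import QApplication, QMainWindow, QPushButton


def send_audio():
    print("TODO: stream the selected file to the board here")


app = QApplication(sys.argv)
window = QMainWindow()
window.setWindowTitle("IoT Speaker Sender")  # hypothetical title

button = QPushButton("Play")
button.clicked.connect(send_audio)  # start streaming on click
window.setCentralWidget(button)

window.show()
sys.exit(app.exec())
```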
PCB Design
- Schematic and artwork online with EasyEDA
- SMT directly through JLCPCB
Build Environments
- CMake, GCC, Make
Editing Tools
- Visual Studio Code
Libraries
STM32CubeH7
- MCU library provided by ST
LittleFS
- A file system for embedded systems with fail-safe features
LVGL
- Free GUI library for embedded use
WIZnet ioLibrary_Driver
- dhcp, sntp, socket
MCU code generator
- STM32CubeMX
PySide6
- Python bindings for Qt6
MinGW GCC
- For creating PC programs to update firmware
As the first-place project, the implementation process was documented in YouTube videos and related blog posts, and the hardware, software, and GUI code are organized on GitHub. Here I would like to introduce how our chip was used in the project, and the principles and code behind the implementation.
ToE Contest 1st project by chobaram
AI Solution
- MP3 file of the original song
- CUDA GPU or Colab Pro
- pretrained artist voice
- ChatGPT
- BandLab
- GAUDIO Studio
Reference: the RVC v2 Colab documentation
Making
The first step is to download an MP3 file of the music you want and split it into instrumental and vocal tracks in GAUDIO Studio.
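If you prefer a scriptable alternative to GAUDIO Studio for this separation step, the same split can be done locally; a minimal sketch, assuming the open-source Demucs separator (pip install demucs) is installed:

```python
# Sketch: split a song into vocal and accompaniment stems with Demucs.
# This is an alternative to GAUDIO Studio, not the author's workflow.
import subprocess

subprocess.run(
    ["python", "-m", "demucs", "--two-stems", "vocals", "song.mp3"],
    check=True,
)
# Stems are written under ./separated/<model>/song/ as
# vocals.wav and no_vocals.wav.
```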
If you go to the RVC v2 docs and read through them, there are many different hyperparameter values and many variants of the source code. The authors and others have already run extensive parameter-tuning experiments, so it is best to use the default values.
Find a download link for a trained model of the singer you want, either on a model hub with many AI models, like Hugging Face, or through a community.
You can do this by running the Colab code in the RVC v2 docs straight through.
Code
```python
# Model load: derive a local file name and path for the model zip
# from its download URL.
from urllib.parse import urlparse

url = 'https://huggingface.co/leeloli/rosiesoft/resolve/main/rosiesoft.zip' #@param {type:"string"}
model_zip = urlparse(url).path.split('/')[-2] + '.zip'
model_zip_path = '/content/zips/' + model_zip
```
Find the trained model of the singer you want and put the link to the model's location into `url`.
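In the notebook, the cells that follow this one fetch and unpack the zip. A rough sketch of that step in plain Python (not the notebook's exact code, reusing `url` and `model_zip_path` from the cell above):

```python
# Rough sketch of the follow-up step: download the model zip to
# model_zip_path and extract it for the UI to pick up.
import os
import urllib.request
import zipfile

os.makedirs('/content/zips/', exist_ok=True)
urllib.request.urlretrieve(url, model_zip_path)
with zipfile.ZipFile(model_zip_path) as zf:
    zf.extractall('/content/unzips/')
```

Once the model is in place, the next cell launches the UI: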
```python
import os
import time
import fileinput
from subprocess import getoutput
import sys
from IPython.utils import capture
from IPython.display import display, HTML, clear_output

%cd /content/Retrieval-based-Voice-Conversion-{reeee}UI/

#@markdown Keep this option enabled to use the simplified, easy interface.
#@markdown <br>Otherwise, it will use the advanced one that you see in the YouTube guide.
ezmode = True #@param{type:"boolean"}
#@markdown You can try using cloudflare as a tunnel instead of gradio.live if you get Connection Errors.
tunnel = "gradio" #@param ["cloudflared", "gradio"]

if ezmode:
    if tunnel == "cloudflared":
        for line in fileinput.FileInput(f'Easier{weeee}.py', inplace=True):
            if line.strip() == 'app.queue(concurrency_count=511, max_size=1022).launch(share=True, quiet=True)':
                # replace the line with the edited version
                line = f' app.queue(concurrency_count=511, max_size=1022).launch(quiet=True)\n'
            sys.stdout.write(line)
        !pkill cloudflared
        time.sleep(4)
        !nohup cloudflared tunnel --url http://localhost:7860 > /content/srv.txt 2>&1 &
        time.sleep(4)
        !grep -o 'https[^[:space:]]*\.trycloudflare.com' /content/srv.txt >/content/srvr.txt
        time.sleep(2)
        srv = getoutput('cat /content/srvr.txt')
        display(HTML('<h1>Your <span style="color:orange;">Cloudflare URL</span> is printed below! Click the link once you see "Running on local URL".</span></h1><br><h2><a href="' + srv + '" target="_blank">' + srv + '</a></h2>'))
        !rm /content/srv.txt /content/srvr.txt
    elif tunnel == "gradio":
        for line in fileinput.FileInput(f'Easier{weeee}.py', inplace=True):
            if line.strip() == 'app.queue(concurrency_count=511, max_size=1022).launch(quiet=True)':
                # replace the line with the edited version
                line = f' app.queue(concurrency_count=511, max_size=1022).launch(share=True, quiet=True)\n'
            sys.stdout.write(line)
        !python3 Easier{weeee}.py --colab --pycmd python3
else:
    !python3 infer-web.py --colab --pycmd python3
```
This code launches the Gradio UI window used to apply the loaded model to the desired target data.
As noted above, we leave the parameters at their default settings. Changing the parameters can improve quality, but it takes longer and may be more resource-intensive; the current values are based on experiments by the original authors and on what has been shared in the community.
1. In Choose model, press the refresh button several times if the model with the voice of the singer you want does not load properly.
2. The target data is the vocal file you separated in GAUDIO Studio earlier; the model selected in step 1 is applied to this target data.
3. The optional field is an octave-related pitch parameter: a shift of ±12 changes female -> male or male -> female. I tried different octave values, but it is probably better not to use it unless the song specifically needs the change. After converting, the synthesis proceeds, and the result is a vocal file.
- After the voice conversion is complete, you can download the resulting file. The parameters below are TTS-related, but after experimenting with them, the best performance came from leaving them at their default values.
- After that, the instrumental elements extracted earlier (drums and so on) must be recombined with the converted vocals in a music production program. There are various options such as Ableton, Cubase, Logic, and so on; these are easy to pick up for anyone who has used them for a while, and if you are an Apple user, GarageBand is said to offer very good value for money. You can also combine the tracks through BandLab, which runs on the web (a minimal mixdown sketch follows).
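If you would rather do the final mixdown in code than in a DAW, here is a minimal sketch using pydub (pip install pydub, with ffmpeg available); the file names are placeholders for your own stems:

```python
# Sketch: overlay the AI-converted vocals on the instrumental track
# and export the result. File names are placeholders.
from pydub import AudioSegment

instrumental = AudioSegment.from_file("instrumental.wav")
vocals = AudioSegment.from_file("converted_vocals.wav")

mix = instrumental.overlay(vocals)  # mix the two stems together
mix.export("final_song.mp3", format="mp3")
```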
Result
Connect the IoT speaker built above to Ethernet and load the AI-converted audio into the GUI program.
1. The IoT speaker first prepares for communication by setting its IP.
2. Once that is complete, press the Play button in the GUI program and the music plays.
3. You can drive the speaker in the output format you want with I2S or SAI. The red button controls the SAI volume, and the blue button controls I2S.