For this Ryzen AI Contest, my goal was to create an application that could generate transcriptions and translate them in real time.
I wanted to build this application because it could leverage the NPU's local compute capabilities to gain several significant advantages, namely:
- Low latency relative to cloud-based solutions, which matters for subtitles that are ideally as close to real time as possible.
- Privacy, since all audio stays on-device and never needs to be sent to an external service.
- The ability to run entirely offline, as the model is stored locally.
At a high level, the application is structured as follows:
- pywebview is the renderer and window manager.
- FastAPI (served by uvicorn) is the backend and exposes a set of APIs.
- The Next.js front-end is rendered inside pywebview and is served by the Home API.
- The Next.js front-end can interact with pywebview through a Python-JS API, which gives it finer window-management capabilities.
- The frontend can interact with the Settings API to adjust settings.
- The frontend can connect to the Transcription API to receive a live transcription feed over a websocket until it disconnects (see the sketch below).
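To make that last point concrete, here is a minimal sketch of what such a websocket endpoint could look like. The endpoint path and the stand-in segment generator are assumptions for illustration, not the actual code.

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def fake_transcription_segments():
    """Stand-in for the real Whisper loop: yields one phrase per second."""
    n = 0
    while True:
        await asyncio.sleep(1.0)
        n += 1
        yield f"segment {n}"

@app.websocket("/ws/transcription")            # assumed path, for illustration only
async def transcription_feed(websocket: WebSocket):
    await websocket.accept()
    try:
        async for text in fake_transcription_segments():
            await websocket.send_json({"text": text})
    except (WebSocketDisconnect, RuntimeError):
        pass                                    # client went away; stop streaming
```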
1. Initial setup and development environment.
The initial setup was relatively painless as I followed the instructions at https://ryzenai.docs.amd.com/en/latest/inst.html.
The only minor hiccup was that the AMD IPU device would not show up in Device Manager. I solved this by enabling the NPU inside the UEFI BIOS, thanks to Google and a Reddit post which I have not been able to find since.
I was rather fortunate in that I did not have any issues with Visual Studio 2019, since I had access to a Visual Studio 2019 Enterprise license through the Azure Student Benefits pack.
After that, I treated the Mini PC much like a server and developed on it over a remote VS Code SSH session.
2. Initial Trials
From there, I tried to develop a proof of concept that used a Whisper model to achieve real-time transcription (without the NPU).
I started with an existing code base, https://github.com/davabase/whisper_real_time, and tried to get it to transcribe from a loop-back feed (routing speaker output back in as a microphone input).
This was eventually achieved by patching all library imports from pyaudio to pyaudiowpatch, a pyaudio fork with WASAPI support. Using loop-back audio channels initially caused a rather cryptic error:
```
assert source.stream is not None, "Audio source must be entered before adjusting, see documentation for AudioSource; are you using source outside of a with statement?"
```
Fortunately, after extensive debugging that traced the bug to the pyaudiowpatch library, I discovered someone else had hit this exact issue and a working solution had been posted: convert the stereo speaker audio to mono (Whisper currently only supports mono audio).
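Below is a minimal sketch of that kind of downmix, assuming interleaved 16-bit PCM as delivered by pyaudio/pyaudiowpatch; it is illustrative rather than the exact code used in the app.

```python
import numpy as np

def stereo_to_mono(pcm_bytes: bytes) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging the two channels."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    stereo = samples.reshape(-1, 2)                 # (n_frames, 2): left/right pairs
    mono = stereo.mean(axis=1).astype(np.int16)     # simple average downmix
    return mono.tobytes()
```

A plain channel average is the simplest possible downmix; as noted in the future-work section, a smarter conversion could improve accuracy on loopback audio.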
3. Porting to Ryzen AI
Once I got the proof of concept working, I started work on making the model run on the NPU. I figured that with a model as popular as Whisper, there was likely a person or organization smarter than me who had already quantized it into an ONNX format.
I quickly found that my assumption was correct and narrowed my options down to two models, one developed by Intel (Link) and one developed by AMD (Link). Although it was developed by a competitor, I tried to get the Intel version working first, as the AMD Whisper-Base model was locked behind an early-access wall and the best AMD model I had access to was limited to Whisper-Tiny, while the Intel model was available in all sizes from tiny to large.
I eventually switched to the AMD Whisper-Tiny model, as I could not get the Intel model working in a manageable time-frame.
The AMD model was rather simple to get working: it worked first time on the latest Ryzen AI software (v1.1.0), despite being rather difficult to find, as the only link and mention I could locate was in an early release note (v0.8.0). From there it was relatively simple to replace the whisper.transcribe() calls in the proof of concept with calls into the AMD reference code.
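Conceptually, the swap looked something like the sketch below; `NpuWhisperPipeline` is a hypothetical stand-in for the AMD reference code's entry point, not its actual class name.

```python
import numpy as np

# Hypothetical wrapper around the AMD Whisper reference code; the real
# module and class names in the reference repository differ.
from npu_whisper import NpuWhisperPipeline

pipeline = NpuWhisperPipeline(model_dir="models/whisper-tiny-onnx")

def transcribe_chunk(audio: np.ndarray) -> str:
    # Previously: result = whisper_model.transcribe(audio, fp16=False)["text"]
    # Now the 16 kHz mono float32 chunk goes through the NPU-backed pipeline instead.
    return pipeline.transcribe(audio)
```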
4.a. Final App: Backend API
From the working NPU proof of concept, I started developing and fleshing out the back-end API in Python utilising FastAPI, which went quite smoothly.
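As an example of the kind of endpoint the backend exposes, here is a hedged sketch of a settings API built on pydantic models; the route paths and field names are illustrative guesses, not the app's actual schema.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Settings(BaseModel):
    # Illustrative fields only; the real settings schema may differ.
    input_device: str = "loopback"
    phrase_timeout: float = 3.0
    always_on_top: bool = False

_settings = Settings()

@app.get("/api/settings")
def get_settings() -> Settings:
    return _settings

@app.post("/api/settings")
def update_settings(new: Settings) -> Settings:
    # Kept in memory here; the real app could persist to disk instead.
    global _settings
    _settings = new
    return _settings
```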
4.b. Final App: Frontend
From there I turned my attention to the rather lacking UI. The UI revamp was built with Next.js, specifically using the app router, as I intended to statically export the application to HTML and JS and serve it from the backend API.
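Serving that static export from the backend is straightforward with FastAPI's StaticFiles; the sketch below assumes Next.js's default `out/` export directory and a root mount, which may differ from the actual setup.

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# API routes would be declared before this mount so they are not shadowed.
# A Next.js static export writes its HTML/JS bundle to ./out by default.
app.mount("/", StaticFiles(directory="out", html=True), name="frontend")
```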
I also utilised ShadCN as a component library, chosen because it would let me modify the components directly if that ever became necessary.
4.c. Final App: Integration
To integrate the backend and frontend, I utilised pywebview as a Python wrapper around Windows' Edge WebView2. I chose this approach over a plain browser window because it still let me use up-to-date browser technologies through WebView2 while giving me more control over the window than a native browser window would.
To that end, I used pywebview to create a Python-JS bridge to:
- Show the settings in a separate window, so the user can adjust settings on the fly while the main window stays small and compact during use.
- Let the application pin itself so the transcription always appears on top of other applications.
- Allow the main window to close the entire application with a single button (otherwise, the settings window stays open even after the rest of the application has closed).
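A rough sketch of such a bridge is shown below; the method names, URLs and window wiring are assumptions for illustration, not the project's actual bridge.

```python
import webview

class Bridge:
    """Methods on this object become callable from JS as window.pywebview.api.<name>()."""

    def __init__(self):
        self.main_window = None
        self.settings_window = None

    def open_settings(self):
        # Open the settings UI in its own window; the URL is illustrative.
        self.settings_window = webview.create_window(
            "Settings", "http://127.0.0.1:8000/settings", js_api=self)

    def set_pin(self, pinned: bool):
        # Keep the transcription window above other applications.
        self.main_window.on_top = pinned

    def quit(self):
        # Destroy every open window so the settings window does not linger.
        for window in list(webview.windows):
            window.destroy()
```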
In the end, the final application is run through a Python script which:
1. Spawns the backend server in another thread
2. Instantiates the Python-JS API bridge.
3. Starts the frontend in a Webview window.
4. The frontend then connects to a websocket when transcription is needed
5. The backend receives the request, loads the model and then starts sending transcription messages back to the frontend.
6. When the user stops the transcription, the frontend sends a websocket close, which shuts down the transcription in the backend.
7. Eventually, when the user closes the application, the script also shuts down the backend server thread.
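A condensed sketch of that entry-point script is below, reusing the assumed port and the hypothetical Bridge class from earlier; the module names are illustrative.

```python
import threading

import uvicorn
import webview

from backend.app import app   # hypothetical module exposing the FastAPI app
from bridge import Bridge     # hypothetical module containing the Python-JS bridge

def run_backend():
    # Serve the API and the static frontend on localhost only.
    uvicorn.run(app, host="127.0.0.1", port=8000, log_level="warning")

if __name__ == "__main__":
    # 1. Spawn the backend server in a daemon thread so it exits with the app.
    threading.Thread(target=run_backend, daemon=True).start()

    # 2. Instantiate the Python-JS bridge; 3. start the frontend in a webview window.
    bridge = Bridge()
    bridge.main_window = webview.create_window(
        "Transcription", "http://127.0.0.1:8000/",
        js_api=bridge, width=480, height=240)
    webview.start()   # blocks until all windows are closed

    # Steps 4-6 (websocket transcription) happen inside the backend; when the
    # last window closes, webview.start() returns and the daemon thread ends too.
```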
5. Future Work
Overall, the application is quite useful as is, but there are some bugs that should be addressed, and there is also future work that could be done, as I was unable to achieve everything I proposed.
Known Bugs:
- The application crashes if the user changes settings too quickly (caused by the user being faster than the Python-JS API can load).
- The application is extremely slow for a short period after transcription is stopped (caused by queued transcriptions).
- If the application crashes, there is a high chance the Whisper install will break with the following error, and the model will need to be rebuilt.
```
InternalError: Check failed: (it != type_key2index_.end()) is false: Cannot find type tir.Load. Did you forget to register the node by TVM_REGISTER_NODE_TYPE ?
```
Future Work:
- Upgrade the Whisper model from Tiny: I got access to the early-access Whisper-Base model too late and was unable to get it working in time.
- A major omission relative to the original proposal is the translation feature; I was unable to get it working in time. It should become possible once the model is upgraded (above), as the current model has many limitations, one of them being that English is the only supported language.
- Get the Whisper model to run fully on the NPU: although the model runs on the NPU, the decode still runs on CPU inference. I was not able to get both the encode and decode to run on the NPU at the same time, as it would instantly crash. (Note: the AMD Whisper reference code appears to have this same limitation.)
- Improve the loopback speaker mono conversion: the conversion currently used to work around Whisper's mono limitation is evidently not good enough, as accuracy is slightly degraded when the same audio file is transcribed via a loopback speaker compared to a raw microphone input.
- Package the application into a single executable: the application would be much more convenient as a .exe file instead of a Python script run from the command line.
This project would not have been possible without many contributions and much support. I would like to thank AMD and Hackster for organising this contest, and the various open-source projects on top of which I built the application (whisper_real_time, FastAPI, Next.js, pywebview, pyaudiowpatch and many more).