The CH32, Just Another 10¢ Microcontroller?
The latest project to appear for the 10¢ CH32 RISC-V microcontroller is a tiny wake word engine with impressive accuracy levels.
There has been a lot of discussion in the last year or so around cheap microcontrollers, with people mostly talking about the Puya PY32 series of parts. But just recently we've seen a revival of interest in an entirely different 10¢ microcontroller, the WCH Electronics CH32V003, and now you can use it for speech recognition.
The big difference between the PY32 and the CH32? While both are ultra-cheap 32-bit microcontrollers, the PY32 series is built around Arm Cortex-M0+ cores, while the CH32 series uses RISC-V cores.
After Jay Carlson's blog post at the start of last year, interest in the PY32 series picked up, and a lot of people took a serious look at the Puya ecosystem.
While there are a number of other really cheap microcontrollers on sale, most of these parts are EPROM-based. The PY32 was a cheap flash-based microcontroller that came in a range of package choices. But crucially, it also came with English language documentation, and was well supported by the standard Arm MCU toolchains. That made it a lot easier to build a Makefile- and GCC-based build system around the PY32 than around the CH32.
However, that doesn't mean that support around the CH32 is lacking. There is an excellent open source development environment for the CH32, put together by Charles Lohr, with solid documentation; and in January this year Arduino support arrived for the CH32 series, with the release of an Arduino core for the chip by WCH.
Built on Lohr's development environment, the latest project for the CH32 comes from Brian Smith: a speech-to-text engine which shows that just because it's tiny doesn't mean a microcontroller can't do big things. With a MAX4466 electret microphone amplifier board connected to a CH32V003-based development board, Smith uses a WCH-LinkE adaptor both to program the board and to act as a UART-to-USB converter to read the engine's output.
Designed to tell one spoken word from another, and trained to distinguish between the spoken digits 'zero' to 'nine', the engine makes use of MFCC feature extraction: the code compares buffered tensors of audio samples against pre-recorded spoken words to find a best match, and has "about 90% accuracy identifying spoken digits with the code as it stands" according to Smith. Considering the CH32V003 has only 16KB of storage and 2KB of RAM available, managing all this inside those constraints is an impressive achievement. It's not the first time we've seen machine learning done on a tiny microcontroller, but it may well be the cheapest we've seen it done.
A timer is set up to generate an interrupt a little over 51,000 times a second (8 × ~6,400). On each interrupt the ADC result is read and the next sample conversion started; eight consecutive samples are averaged to produce a ~6,400 samples/sec audio stream. Every 64 samples (10ms), a 128-point FFT is performed over a buffer of the last 128 samples, and 20 mel-scale frequency bins are calculated from the result. The mel bin energies are converted to a log2 scale, and finally an 8-bin cepstrum is calculated via a DCT of the 20 log-mel bins. When the 'energy' of a frame (the sum of all mel levels) is above a threshold, the frame is added to a 'word' buffer; otherwise a count of 'silence' frames is incremented. When enough 'silence' frames have passed to signify the end of a spoken sample, its length is warped to exactly 16 frames and compared against a lookup table of previously stored word samples, and the closest match is reported.
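To make the sampling front end concrete, here's a minimal C sketch of the interrupt-driven decimation and framing described above. It's an illustration of the approach rather than Smith's actual code: timer_isr, adc_read_and_restart, and process_frame are hypothetical stand-ins, and the register-level timer and ADC setup is omitted.

```c
#include <stdint.h>

#define DECIM     8     // average 8 raw ADC readings into one audio sample
#define FRAME_HOP 64    // trigger a new frame every 64 samples (~10 ms)
#define FFT_LEN   128   // each frame looks at the last 128 samples

static int16_t  audio[FFT_LEN];  // ring buffer of decimated samples
static uint32_t wr;              // total samples written so far

extern uint16_t adc_read_and_restart(void);                    // stand-in
extern void process_frame(const int16_t *buf, uint32_t total); // stand-in

// Timer ISR, fired a little over 51,000 times a second
void timer_isr(void)
{
    static uint32_t acc, n;

    acc += adc_read_and_restart();  // grab result, kick off next conversion
    if (++n < DECIM)
        return;

    // One decimated sample: the mean of the last eight raw readings
    audio[wr % FFT_LEN] = (int16_t)(acc / DECIM);
    acc = 0;
    n = 0;

    // Every 64 samples (10 ms), hand the last 128 samples to the FFT stage
    if (++wr % FRAME_HOP == 0)
        process_frame(audio, wr);
}
```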
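The endpointing and matching stage can be sketched in the same spirit. The frame geometry below follows the description (8 cepstral bins per 10ms frame, words warped to exactly 16 frames, ten digit templates), but the function names, buffer sizes, thresholds, and the squared-distance metric are illustrative assumptions, not necessarily what the repository implements.

```c
#include <stdint.h>
#include <string.h>

#define NCEP          8    // cepstral bins per 10 ms frame
#define NWARP         16   // every word is warped to exactly 16 frames
#define NWORDS        10   // templates for the digits 'zero'..'nine'
#define MAX_FRAMES    64   // longest utterance buffered (~640 ms)
#define ENERGY_THRESH 100  // speech/silence threshold (a tuning value)
#define SILENCE_END   20   // ~200 ms of silence marks the end of a word

static int8_t frames[MAX_FRAMES][NCEP];  // cepstra of the current word
static int nframes, silence;

// Pre-recorded word templates, stored in flash
extern const int8_t templates[NWORDS][NWARP][NCEP];

// Nearest-template search: squared L2 distance over all warped frames
static int best_match(const int8_t w[NWARP][NCEP])
{
    int best = 0;
    int32_t best_d = INT32_MAX;

    for (int t = 0; t < NWORDS; t++) {
        int32_t d = 0;
        for (int f = 0; f < NWARP; f++)
            for (int c = 0; c < NCEP; c++) {
                int32_t e = w[f][c] - templates[t][f][c];
                d += e * e;
            }
        if (d < best_d) { best_d = d; best = t; }
    }
    return best;
}

// Feed one 10 ms frame; returns a digit 0-9, or -1 while still listening
int word_engine(const int8_t cep[NCEP], int32_t energy)
{
    if (energy > ENERGY_THRESH) {            // speech: buffer the frame
        if (nframes < MAX_FRAMES)
            memcpy(frames[nframes++], cep, NCEP);
        silence = 0;
        return -1;
    }

    if (nframes == 0 || ++silence < SILENCE_END)
        return -1;                           // no word yet, or still in one

    // Word ended: linearly warp its length to exactly NWARP frames
    int8_t warped[NWARP][NCEP];
    for (int f = 0; f < NWARP; f++)
        memcpy(warped[f], frames[f * nframes / NWARP], NCEP);

    nframes = 0;
    return best_match(warped);
}
```

In this sketch, warping every utterance to a fixed 16 frames is what keeps the matching cheap: the templates can live in flash, and each comparison is a fixed 16 × 8 = 128 subtract-multiply-accumulate loop.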
The project's GitHub repository includes code to allow (re-)training the engine, and, as Smith argues, it could become the basis for a low-cost "always-on" wake word engine.