You Don't Have to Say a Word
These sonar-powered smart glasses use machine learning to recognize silent speech and control devices.
Silent speech interfaces have been drawing increasing attention lately because they are easy to use and applicable to a wide range of scenarios. A silent speech interface is a technology that enables individuals to communicate with electronic devices or computers without speaking aloud. A number of different approaches exist, but the goal is the same: to recognize movements of the face and mouth and translate them into the words that were silently spoken.
One of the primary advantages of silent speech interfaces is that they allow for hands-free operation of electronic devices. This is particularly beneficial in situations where physical movement may be limited or inconvenient, such as when performing medical procedures or when driving a car.
Controlling devices without speaking aloud can also be helpful in situations where noise levels or privacy are a concern. A traditional voice-controlled device cannot reliably interpret instructions amid heavy background noise, like in a crowded public space. Moreover, speaking confidential personal information or passwords aloud is a significant security risk.
One of the most prevalent means of capturing information about silent speech involves a camera pointed at the face. Many such methods have proven highly effective; however, having a camera pointed at one's face all day is impractical and undesirable from a privacy perspective. Moreover, images contain a lot of information, and processing them requires substantial computing power and energy. That is a bad combination for a wearable or mobile device, so processing generally needs to happen in the cloud, which only exacerbates the privacy concerns.
An interesting alternative was recently presented by a group at Cornell University. Instead of cameras, this team used tiny microphones and speakers to bounce inaudible sonar waves off the face. They showed that the reflected echoes can be accurately decoded by a machine learning algorithm. And because a sonar system produces far less data than an image-based system, they were able to build the entire interface into a fairly standard-looking pair of eyeglasses.
The EchoSpeech smart glasses are equipped with speakers and microphones, each smaller than a pencil eraser, pointed downward toward the wearer's face. As the transmitted sound waves interact with the face, they are altered in distinct ways, and the altered echoes that reflect back are recorded by the glasses' microphones.
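To give a sense of how an acoustic sensing pipeline like this might work, here is a minimal Python sketch of transmitting an inaudible frequency sweep and turning the recorded reflections into an echo profile via cross-correlation. The sample rate, frequency band, and pulse length are illustrative assumptions, not the values used by the EchoSpeech team.

```python
import numpy as np
from scipy.signal import chirp, correlate

# Illustrative parameters -- not the actual device's values.
SAMPLE_RATE = 48_000                 # Hz, common for commodity audio hardware
F_LOW, F_HIGH = 18_000, 22_000       # near-inaudible frequency band
PULSE_SECONDS = 0.01                 # duration of each transmitted sweep

def make_pulse() -> np.ndarray:
    """Generate one inaudible frequency sweep to play from the speaker."""
    t = np.linspace(0.0, PULSE_SECONDS,
                    int(SAMPLE_RATE * PULSE_SECONDS), endpoint=False)
    return chirp(t, f0=F_LOW, t1=PULSE_SECONDS, f1=F_HIGH)

def echo_profile(recording: np.ndarray, pulse: np.ndarray) -> np.ndarray:
    """Cross-correlate the microphone recording with the transmitted pulse.

    Peaks in the result correspond to reflections; their positions encode
    path length, and their shapes change as the face moves during speech.
    """
    return correlate(recording, pulse, mode="valid")

# Example: a fake 50 ms recording stands in for real microphone input.
pulse = make_pulse()
recording = np.random.randn(int(SAMPLE_RATE * 0.05))
profile = echo_profile(recording, pulse)
```

A sequence of these profiles, captured as the face deforms over time, is the kind of compact signal a downstream model can classify.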
After experimenting with several types of network architectures, the team selected a deep convolutional neural network to translate the audio data into speech. In this study, the model was designed to classify 31 distinct commands, so it cannot yet translate arbitrary speech. For device control, however, a limited vocabulary should be sufficient for most purposes. After training, the model's translations proved to be about 95% accurate on average.
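For a rough idea of what a command classifier of this kind might look like, here is a minimal PyTorch sketch that maps a window of echo profiles to one of 31 command labels. The layer sizes, input shape, and per-microphone channel count are assumptions for illustration; the paper's actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

NUM_COMMANDS = 31  # size of the command vocabulary in the study

class EchoClassifier(nn.Module):
    """Minimal 1D CNN over echo-profile windows (channels x samples)."""

    def __init__(self, in_channels: int = 2):  # assumed: one channel per mic
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.classifier = nn.Linear(128, NUM_COMMANDS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).squeeze(-1))

# One prediction per echo-profile window: batch x channels x samples.
logits = EchoClassifier()(torch.randn(1, 2, 1024))
predicted_command = logits.argmax(dim=-1)
```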
Before a new user can get started with EchoSpeech, a few minutes of data must be collected to fine-tune the model for that individual. After that, the glasses wirelessly transfer the sonar measurements to a nearby smartphone for processing, since the glasses themselves lack the onboard compute capacity to run the model. Processing does take place locally on the smartphone, however, so no private information needs to be sent to the cloud.
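A per-user calibration step like this is often implemented by freezing most of a pretrained network and updating only its final layer on the newly collected samples. The sketch below, which reuses the hypothetical EchoClassifier from above with stand-in calibration data, illustrates that common pattern; it is an assumption about the approach, not the team's actual training code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a few minutes of labeled recordings from the new user.
profiles = torch.randn(64, 2, 1024)
labels = torch.randint(0, NUM_COMMANDS, (64,))
calibration_loader = DataLoader(TensorDataset(profiles, labels), batch_size=8)

model = EchoClassifier()            # pretrained weights would be loaded here
for p in model.features.parameters():
    p.requires_grad = False         # keep the shared feature extractor fixed

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):              # brief calibration pass, not full training
    for batch_profiles, batch_labels in calibration_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_profiles), batch_labels)
        loss.backward()
        optimizer.step()
```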
A small study was conducted with twelve participants to determine how well EchoSpeech works under real-world conditions. The word error rate in recognizing silent speech was found to be generally low, and accuracy held up when the wearer was walking about or in a noisy environment.
The team is interested in commercializing their technology, but first there are a few things they would like to improve. Some users reported that the glasses, being somewhat larger than a normal pair, would not sit stably on their faces; shrinking them down a bit should alleviate this concern. The team also plans to work on a gesture-based activation system that will conserve battery while the system is not in use.