Published January 17, 2016 © MIT

Listening Santa

Tell Santa what you want for Christmas - Windows 10 IoT Core and Azure based speech recognition automate the family's wish list.

Beginner | Full instructions provided | 1,882

Things used in this project

Hardware components

Raspberry Pi 2 Model B

Microsoft Lifecam Webcam

Christmas Themed Project Enclosure

Software apps and online services

Microsoft Windows 10 IoT Core

Microsoft Azure

Hand tools and fabrication machines

SparkFun Multimeter

Story

Introduction

My daughter is usually pretty good when it comes to her Christmas wish list - she gives it a lot of thought and keeps it short (and surprisingly consistent over the course of the year). But this year, there was an unusually large number of late additions - so many that I had trouble keeping up with them all. It was time to use the power of the cloud.

When somebody pushes the button on the side of the house, Santa pops up through the chimney and asks for your name. Once the person says their name the app processes the audio. If you have been nice this year (i.e. on the approved list), Santa addresses you by name and asks what you would like for Christmas. The audio for the person's wish is streamed to Azure for processing; the result is returned to the app and it confirms the person's request. The app tracks all the requests by user. When I announce myself to the app, Santa pops up and recites the Christmas list to me - each requester and his or her wish list item is played back. The app runs on a Raspberry Pi 2 running Windows 10 IoT Core.

Most of the other project examples I found constrain the recognition to a specified list of words or a grammar. I needed the project to recognize anything that my family might ask for, so I used Microsoft's Project Oxford Speech APIs, which leverage Azure's cloud services to process the spoken audio. Regardless of whether my daughter asks for a pony, a zebra, or a rocking horse, Project Oxford will recognize her request and return the appropriate text to my app. I was really surprised at the performance given that the audio was sent to the cloud, processed, and returned to the app. (To see something really cool, handle the SpeechRecognizer's HypothesisGenerated event and add a text box to the UI to display the results - as you are speaking, whatever partial snippets are recognized will be displayed within fractions of a second.)
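As a rough sketch of the HypothesisGenerated idea mentioned above: the class name HypothesisDisplay and the text box passed to it are my own placeholders, not part of the project's source. It assumes a UWP page with a TextBox already defined in XAML and a SpeechRecognizer that has been created elsewhere.

```csharp
using System;
using Windows.Media.SpeechRecognition;
using Windows.UI.Core;
using Windows.UI.Xaml.Controls;

// Hypothetical helper: shows partial recognition results in a TextBox as they arrive.
public sealed class HypothesisDisplay
{
    // Assumed to be a TextBox declared in the page's XAML.
    private readonly TextBox hypothesisTextBox;

    public HypothesisDisplay(SpeechRecognizer recognizer, TextBox textBox)
    {
        hypothesisTextBox = textBox;
        recognizer.HypothesisGenerated += OnHypothesisGenerated;
    }

    private async void OnHypothesisGenerated(
        SpeechRecognizer sender,
        SpeechRecognitionHypothesisGeneratedEventArgs args)
    {
        // Partial text recognized so far for the current utterance.
        string partialText = args.Hypothesis.Text;

        // The event is raised off the UI thread, so marshal the update back to it.
        await hypothesisTextBox.Dispatcher.RunAsync(
            CoreDispatcherPriority.Normal,
            () => { hypothesisTextBox.Text = partialText; });
    }
}
```

Hypotheses are only raised while the recognizer is actively listening, so the text box stays empty until a recognition session starts.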

Using Azure to process the audio stream to recognize almost anything that was requested by the family worked great. It also recognized my wife's name - Michele - without issue. However, it never recognized my name - Prasantha - and was quite inaccurate with my daughter's name - Lotus; not exactly common words. I solved this by creating a second recognition object for people's names and constraining the vocabulary for that object to a specified list of names. That worked perfectly.

I would consider this a beginner project because the wiring is very basic and there are some great samples to copy and paste from. I learned a lot from Krishnaraj Varma's Windows 10 IoT Core Speech Recognition project (and I reused his text to speech class) and from Anurag Vasanwala's Windows 10 IoT Core: Speech Controlled Robot project. Microsoft's Project Oxford documentation was also very helpful and provided step-by-step instructions to create a Windows 10 app along with a great sample app.

This combination of Raspberry Pi and Azure provides a lot of room for experimentation and enhancement - e.g. limiting the number of wishes a person can make, ranking items based on how often they are wished for, etc. These are a few at the top of my list:

Project Oxford provides the capability to do speaker recognition so that I don't have to prompt the person for their name. It requires a larger audio sample - about 20 seconds - but I can make that work in this project.
Alternatively, I may try Project Oxford's Face APIs for identifying the person making the wish via facial recognition. I already have a webcam hooked up to the project for the microphone so it will be easy to utilize its camera as well.
Project Oxford also provides APIs for text to speech. I am interested to see how that compares to the speech synthesizer used in this project.
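The simpler enhancements mentioned above (limiting the number of wishes per person and ranking items by how often they are wished for) need no cloud services at all. Here is a minimal sketch of one way to do it. The WishList class and its members are hypothetical and are not taken from the project's source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory wish list with a per-person limit and a popularity ranking.
public class WishList
{
    private readonly Dictionary<string, List<string>> wishesByPerson =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly int maxWishesPerPerson;

    public WishList(int maxWishesPerPerson)
    {
        this.maxWishesPerPerson = maxWishesPerPerson;
    }

    // Returns false (and stores nothing) once the person has reached the limit.
    public bool TryAddWish(string person, string item)
    {
        List<string> wishes;
        if (!wishesByPerson.TryGetValue(person, out wishes))
        {
            wishes = new List<string>();
            wishesByPerson[person] = wishes;
        }

        if (wishes.Count >= maxWishesPerPerson)
        {
            return false;
        }

        wishes.Add(item);
        return true;
    }

    // Items across all people, most frequently requested first.
    public List<KeyValuePair<string, int>> RankItems()
    {
        return wishesByPerson.Values
            .SelectMany(wishes => wishes)
            .GroupBy(item => item, StringComparer.OrdinalIgnoreCase)
            .OrderByDescending(group => group.Count())
            .Select(group => new KeyValuePair<string, int>(group.Key, group.Count()))
            .ToList();
    }
}
```

With a limit of 2, a third call to TryAddWish for the same person returns false, which is the point where Santa could politely decline the request.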

If you make some of your own enhancements, please use the comments below to tell me about them. I'd love to hear what you make.

What You'll Need

Parts:

Raspberry Pi 2 with standard accessories: 8GB class 10 micro SD card, 5V 2A power supply, case, and network cable or Wi-Fi dongle.
compatible microphone - I used a Microsoft LifeCam Cinema
powered speaker with 3.5mm plug
jumper wires
Any kind of Christmas themed display or stuffed animal.

Tools:

multimeter
soldering iron

Instructions

Step 1: Integrate into the Christmas themed project enclosure

Time: 30 minutes

Parts: jumper wires

Tools: Multimeter and soldering iron

I got really lucky; I found the perfect item for this project at the thrift store for $2. I didn't expect any of the electrical parts to work and figured I would have to spend a little time and money fixing it; but for $2, it was worth it. When I got home and plugged it in, I was extremely surprised to find that everything worked. When you push the button on the side of the house, the Santa Claus pops up through the chimney and dances while the lights around the house light up. It also plays some music while Santa is dancing but I disconnected the speaker so that the microphone could pick up the person's speech.

To integrate it with the Raspberry Pi, I spliced a jumper wire into one of the two wires running from the button to the circuit board. One wire carries the voltage to the button and the second wire has zero voltage until the button is pressed to close the circuit. Use your multimeter to determine which is which (set the multimeter to volts DC and touch one probe to the circuit board's ground and the other probe to one of the wire-to-button solder points). Connect a jumper from the second wire (I soldered it to the connector) to a GPIO pin on the Raspberry Pi - I chose GPIO 5 (pin 29) for the input pin. Then, run another jumper wire from any of the Raspberry Pi ground pins - I chose pin 25 - to the ground on the circuit board (again, I soldered the jumper to the ground pad).

Most of the time spent on this step was to disassemble the Santa house enough to get to the wiring and to find my way around the circuit board.

Step 2: Set up the Raspberry Pi and Deploy the App

Time: 20 minutes

Parts: Raspberry Pi and accessories; Microsoft LifeCam

Tools: N/A

First, connect the webcam by plugging it into any of the Raspberry Pi's USB ports. I just rested the webcam on one of the house's gables. Then, plug your speaker into the Raspberry Pi's 3.5mm audio port - I just used a USB-powered sound bar I had left over from some other project. I powered my speaker from a wall outlet rather than using one of the Raspberry Pi's USB ports - I didn't check the power draw of the speaker and maybe it would have been ok plugged into the Raspberry Pi but it wasn't worth the trouble to test it since I had a power strip right there.

Next, insert the SD card with the Windows 10 IoT Core image and power up the Raspberry Pi. You'll need an Internet connection to use Azure to process the audio so either set up your Wi-Fi connection or connect a network cable between the Raspberry Pi and your router (or development PC if you have Internet connection sharing set up). Refer to the Get Started With Windows IoT website if you have any issues. Download the source code from the GitHub repository and open it in Visual Studio.

Don't forget to edit the list of names; any name not in the list will not be recognized. The first name in the list is the person to whom the recorded wish list items will be played back.


// Define constraint list against which audio will be compared to recognize name of person making request
IEnumerable<string> listOfNames = new List<string>() { "Prasantha", "Michele", "Lotus" };

In Visual Studio, set your Raspberry Pi as the target (select Remote Device and enter your Pi's host name or IP address) and deploy the code.

About The Code

Rather than setting up a timer and checking the status of the GPIO pin connected to the button on each tick, I used the GpioPin.ValueChanged event so that an event is triggered whenever the value of the pin changes. Now, whenever the button is pushed, the ValueChanged event handler will start the speech recognition process. When the speech recognizer is started, the microphone will be turned on and the audio will be captured. The microphone will be turned off when the speech recognition is complete.
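The event-driven approach described above can be sketched roughly as follows. This is not the project's actual code: the ButtonWatcher class, its ButtonPressed event, and the 50 ms debounce value are my own assumptions. It does use GPIO 5 as wired in Step 1, and it treats a rising edge as a press because that wire sits at zero volts until the button closes the circuit.

```csharp
using System;
using Windows.Devices.Gpio;

// Hypothetical wrapper that raises ButtonPressed when the Santa house button is pushed.
public sealed class ButtonWatcher
{
    private const int ButtonPinNumber = 5; // GPIO 5 (header pin 29), as wired in Step 1
    private GpioPin buttonPin;

    // Hypothetical event; subscribe to it to start the speech recognition process.
    public event EventHandler ButtonPressed;

    public void Start()
    {
        GpioController controller = GpioController.GetDefault();
        if (controller == null)
        {
            throw new InvalidOperationException("No GPIO controller found on this device.");
        }

        buttonPin = controller.OpenPin(ButtonPinNumber);
        buttonPin.SetDriveMode(GpioPinDriveMode.Input);

        // Assumed value: ignore contact bounce shorter than 50 ms.
        buttonPin.DebounceTimeout = TimeSpan.FromMilliseconds(50);
        buttonPin.ValueChanged += OnValueChanged;
    }

    private void OnValueChanged(GpioPin sender, GpioPinValueChangedEventArgs args)
    {
        // The signal wire reads low until the button is pressed, so a rising edge is a press.
        if (args.Edge == GpioPinEdge.RisingEdge)
        {
            ButtonPressed?.Invoke(this, EventArgs.Empty);
        }
    }
}
```

ValueChanged is not raised on the UI thread, so any handler that touches UI elements needs to dispatch back to the UI thread first.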

For the name recognizer, I constrained the recognition to a defined list of names.

// Initialize name recognizer
nameRecognizer = new SpeechRecognizer();

// Create list constraint
SpeechRecognitionListConstraint listConstraint = new SpeechRecognitionListConstraint(listOfNames);

// Add list constraint and compile
nameRecognizer.Constraints.Add(listConstraint);
SpeechRecognitionCompilationResult nameResult = await nameRecognizer.CompileConstraintsAsync();
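Once the constraints are compiled, a single recognition pass might look like the sketch below. NameListener and ListenForNameAsync are hypothetical names, and returning null for an unrecognized name is my own choice rather than the project's behavior. The app also needs the Microphone capability declared in its manifest for recognition to start.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

// Hypothetical helper that listens once and returns a name from the list, or null.
public static class NameListener
{
    public static async Task<string> ListenForNameAsync(IEnumerable<string> listOfNames)
    {
        using (var nameRecognizer = new SpeechRecognizer())
        {
            nameRecognizer.Constraints.Add(new SpeechRecognitionListConstraint(listOfNames));

            SpeechRecognitionCompilationResult compileResult =
                await nameRecognizer.CompileConstraintsAsync();
            if (compileResult.Status != SpeechRecognitionResultStatus.Success)
            {
                return null; // constraints did not compile, so nothing can be recognized
            }

            // Turns the microphone on, listens for one utterance, then turns it off.
            SpeechRecognitionResult result = await nameRecognizer.RecognizeAsync();

            bool recognized =
                result.Status == SpeechRecognitionResultStatus.Success &&
                result.Confidence != SpeechRecognitionConfidence.Rejected;

            return recognized ? result.Text : null;
        }
    }
}
```

Checking the confidence as well as the status filters out utterances that the recognizer heard but could not match to any name in the list.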

For the item recognizer, since I wanted people to ask for anything, I didn't use a list or a grammar constraint. I did however use a topic constraint - a pre-defined grammar for the Azure based recognition.

// Initialize item recognizer
itemRecognizer = new SpeechRecognizer();

// Create topic constraint
SpeechRecognitionTopicConstraint topicConstraint = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.WebSearch, "Short Form");

// Add topic constraint and compile
itemRecognizer.Constraints.Add(topicConstraint);
SpeechRecognitionCompilationResult itemresult = await itemRecognizer.CompileConstraintsAsync();

There are three Speech Recognition Scenarios that can be specified for the Topic Constraint - Dictation, WebSearch, and FormFilling. I initially tried Dictation but couldn't get it to work - the speech recognizer would complete successfully immediately after it started but there would be no results. I switched to WebSearch and everything worked fine. I am not sure what I was doing wrong - I'll post an update once I figure it out.

There are other properties you can set for the speech recognizer object - e.g. timeouts to specify how much of the introductory silence or babble (the "um's" and the "ah's") to ignore. But I didn't need to set any of them as the defaults worked just fine.
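If you do want to experiment with the timeouts, they are exposed through the recognizer's Timeouts property. The sketch below is only illustrative: the RecognizerTuning class is hypothetical and the durations are arbitrary values I picked for the example, not recommendations and not the platform defaults.

```csharp
using System;
using Windows.Media.SpeechRecognition;

// Hypothetical helper showing where the timeout settings live.
public static class RecognizerTuning
{
    public static void ApplyTimeouts(SpeechRecognizer recognizer)
    {
        // How long to wait in silence before giving up on hearing anything (example value).
        recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(6);

        // How long to tolerate unrecognizable sounds before giving up (example value).
        recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);

        // How much silence after recognized speech ends the session (example value).
        recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(1.2);
    }
}
```

Set these before starting recognition; a longer initial silence timeout may help with children who take a moment to decide what to ask for.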

The only other difficulty I had with this project was finding a microphone that worked with Windows 10 IoT Core. I tried a few different approaches - a Logitech USB microphone (in hindsight, I am not surprised that didn't work because it required a special driver) and a USB sound card with both mic and headphone jacks - with no success. Jeffrey Dai from Microsoft responded to one of my forum posts with the suggestion to use the Microsoft LifeCam Cinema (that's what he had on hand and got to work). I subsequently read in a StackOverflow post that someone got the Microsoft LifeCam HD 3000 working as well. Krishnaraj and Anurag didn't seem to have the same issues; I am wondering whether something broke in the most recent release (10586). I will submit a request to add a microphone section to the Hardware Compatibility List and see if that will spur some investigation into this issue.