Piano music has been around for centuries and is as enduring as any form of communication or expression. Different cultures around the world play the piano, and there are beginners in every generation. Like most abilities, becoming "good" takes time and plenty of practice. How can we enable technology to make this task easier?
Building a Music Teacher Chatbot
I learned piano when I was younger, and no matter how good your instructor is, advancing past the beginning levels is all about repetition. It can be frustrating and time-consuming. My parents would drive me to lessons once or twice a week, and then I would need to practice at home in between. That arrangement had some limitations.
- Convenience factor - busy schedules make it challenging to squeeze into the week, especially with parents that have full-time jobs.
- Financial limitations - lessons weren't cheap and were a trade-off against other things in the family budget.
- Productivity at home - practicing on my own was nowhere near as useful as having an instructor walk me through the steps. Much of this came down to the engagement level of sitting alone staring at sheet music.
That was a long time ago, but thinking about the same challenge for the next generation, I believe we can improve on this with some of the newer voice technologies like Alexa.
Now there are limits to what a chatbot can do, and I would argue that once you get beyond the basics, there's nothing like having a professional coach who can help you along as the music gets more complex.
If you would like to try out the skill, it's in the Alexa Skill Store, and it's called "Music Teacher". Here's a demo with my next generation.
Step 1 - Learning the SSML Language
Voice chatbots can go beyond the spoken word. With devices such as the Alexa, we can incorporate a digital version of the piano itself. To do this, you first need to learn how to manipulate the outputSpeech attribute in the skill's response. Normally the type is "PlainText", but by changing it to "SSML" we can invoke a broader range of actions. Here's what it should look like.
"outputSpeech": {
"type": "SSML",
"ssml": "<speak>This output speech uses SSML.</speak>"
}
Now if we simply want to mirror the "PlainText" format, it ends up looking like this.
<speak>Just normal language here</speak>
If we want to get a little more creative and mix in a "pause", we can do this.
<speak>I'm going silent for five seconds<break time="5s"/>now I'm back</speak>
The limitation is that a pause can only last up to ten seconds, but this begins to show how we can manipulate the output to leave gaps where the user might need to do something. To integrate audio into the spoken text, the markup is similar - syntax below.
<speak>which note is this?<audio src="https://s3.amazonaws.com/musicmakerskill/d3.mp3" />it's a D!</speak>
Amazon has a great write-up on their use of SSML here.
So when building this chatbot, we are going to create a script that mixes the spoken word with MP3 files recorded from a piano, with pauses for when the user should be playing notes on the piano.
There are some other limitations/restrictions within Alexa that should be understood. First, the overall length of an output response can't exceed 8,000 characters. That won't be an issue for this skill, since that length works out to roughly 7-8 minutes of spoken word and we don't have nearly that much information to deliver. The constraint that does matter for this skill is that you can only include five audio references in a single response (more on how that plays out in step 3).
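To make those two constraints concrete, here's a minimal sketch of a helper that checks an assembled SSML string before it goes into the response. This isn't code from the skill itself - the function name and messages are my own - but it reflects the 8,000-character and five-audio-clip limits described above.

// Hypothetical helper - validates an SSML string against the two Alexa
// limits described above: 8,000 characters total and at most five <audio> tags.
function validateSsml(ssml) {
    var issues = [];
    if (ssml.length > 8000) {
        issues.push("Response is " + ssml.length + " characters; the limit is 8000.");
    }
    var audioTags = (ssml.match(/<audio /g) || []).length;
    if (audioTags > 5) {
        issues.push("Response contains " + audioTags + " audio clips; the limit is 5.");
    }
    return issues; // an empty array means the response should be safe to send
}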
Step 2 - Creating Media from a Piano
The chatbot is going to need content it can pull from a source available on the public internet (via HTTPS), so we need to build a library of musical notes and songs. The Alexa platform can play back MP3 files that are referenced appropriately in SSML, and there's a benefit to hosting them in AWS S3, since it's a trusted SSL source for the HTTPS calls (thus no need for special certs). There are a few other requirements to consider when working with Alexa.
- The audio file cannot be longer than ninety (90) seconds.
- The bit rate must be 48 kbps.
- The sample rate must be 16000 Hz.
So when authoring the MP3 content, we need to be careful to use the right encoding settings; the software I used is an open-source product called Audacity. Here's a basic view of how this works.
When recording a song on the piano, we will break it into four to five segments. That will let us mix together the voice prompts and piano playing that we get into in the next section. Each segment is also in the range of ten seconds, far below the 90-second limit imposed by Alexa.
The architecture view above also calls out some graphics that will be displayed as cards in the Alexa app to assist the audio experience. Here's an example of the card that comes up for the note drill section. These aren't required, but they do augment the experience in case the student wants to view them.
These can be done in any graphics software and will also be hosted in S3. To include an image, switch to the "Standard" card output type and include the direct path for each image. The pixel dimensions need to be 720x480 for the small card and 1200x800 for the large one.
card: {
    type: "Standard",
    title: title,
    text: cardInfo,
    image: {
        smallImageUrl: smallImagePath,
        largeImageUrl: largeImagePath
    }
}
Please note: for Alexa to be able to access either the MP3 or PNG/JPG files, you'll also need to make them public content in S3, so that the URL is available for unauthenticated users to download to their device.
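One way to do this - an assumption on my part, since the post doesn't show how the bucket was configured - is an S3 bucket policy that grants public read access to the objects. The bucket name below is a placeholder.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::your-media-bucket/*"
        }
    ]
}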
Step 3 - Designing the Voice Interface
We are going to build a curriculum for a beginner, and the assumption is that there is a set of fundamentals that can be taught. Here are the goals of what is considered basic.
- Understanding Notes (i.e. D, G, A, F Sharp, etc.).
- Learning the Musical Scale.
- Following along to play parts of a song when given the notes (ability to play 8-10 notes in a sequence).
- Memorizing an entire basic song in the C Major scale (i.e. all "white keys" - no flats or sharps - the "black keys").
- Adding complexity with intermediate level songs by pulling in other scales (adding the "black keys").
Here is the outline of what the initial skill looks like.
Learning the songs is a much longer session, and it needs to account for the limitation that we can only include five audio clips in a single output response back to Alexa. Here is an outline of how the output gets built.
Each pass through the loop plays an MP3 file in the third step, so the loop can only run a maximum of five times. So to play a song with 40 notes, each section needs to group eight notes. Given that the student has to retain those notes in memory, this restricts which songs can be chosen. For example, "Twinkle Twinkle Little Star" has 35 notes, so the song gets broken into five pieces and the student memorizes seven notes for each section. The song "Joy to the World" has 57 notes, so the memorization per section ends up being larger, driving up the complexity of comprehension for the student (the arithmetic is sketched below).
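As an illustration of that arithmetic (not code from the skill itself), here's a tiny sketch that computes how many notes a student must memorize per section when a song is split across at most five audio clips.

// Illustrative only: notes the student must memorize per section,
// given the five-audio-clip limit per Alexa response.
var MAX_SECTIONS = 5;

function notesPerSection(totalNotes) {
    return Math.ceil(totalNotes / MAX_SECTIONS);
}

console.log(notesPerSection(35)); // Twinkle Twinkle Little Star -> 7 notes per section
console.log(notesPerSection(57)); // Joy to the World -> 12 notes per section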
Once the student has learned these different sections, there is a "duet mode" where the Alexa skill plays just the music, with the expectation that the student plays the notes at the same time. This assumes a certain mastery of the song, and it helps tune note recognition as well as the tempo of keeping up with the notes.
Now that we have the design, we can get to coding!
Step 4 - Building the Alexa Skill
These instructions assume basic familiarity with how to build an Alexa Skill; if you're not familiar with this part, please look here.
Some of the key logic is around how to build out the song that is played. Here is the section of the skill that contains the code that builds the song based on the content.
var cardTitle = "Teach me " + songTitle;
var audioOutput = "<speak>";
audioOutput = audioOutput + "Okay, let's get started on learning how to play " + songTitle + ". ";
// construct the response based on what is in the song array
for (var i = 0; i < notes.length; i++) {
    // read out the notes for this section, then play the matching piano clip
    audioOutput = audioOutput + extraComments[i] + notes[i] + ".";
    audioOutput = audioOutput + "<audio src=\"https://" + musicLocation + soundClips[i] + "\" />";
    // pause so the student can play the same notes back on the piano
    audioOutput = audioOutput + "<break time=\"1s\"/>Your turn, and remember it goes " + notes[i] + ".";
    audioOutput = audioOutput + "<break time=\"" + userPause[i] + "\"/>";
}
audioOutput = audioOutput + "Well done - take a bow! You have played " + songTitle + ". ";
audioOutput = audioOutput + "When you are ready for another song, just ask for it. ";
audioOutput = audioOutput + "</speak>";
var cardOutput = "How to Play " + songTitle;
var repromptText = "Ready for another song? Just ask for it and we will begin another lesson.";
callback(sessionAttributes,
    buildAudioResponse(cardTitle, audioOutput, cardOutput, repromptText, shouldEndSession));
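The buildAudioResponse helper isn't shown in the post, but based on the SSML and Standard-card snippets earlier, a minimal version might look something like the sketch below. This is an assumption rather than the skill's actual code; the callback then wraps this object into the full Alexa response.

// Minimal sketch of a response builder (assumed, not from the original skill).
// It wraps the SSML and card fields described earlier into the classic
// Alexa Skills Kit response shape.
function buildAudioResponse(cardTitle, audioOutput, cardOutput, repromptText, shouldEndSession) {
    return {
        outputSpeech: {
            type: "SSML",
            ssml: audioOutput
        },
        card: {
            type: "Standard",
            title: cardTitle,
            text: cardOutput
            // the image block from the card snippet above could be added here as well
        },
        reprompt: {
            outputSpeech: {
                type: "PlainText",
                text: repromptText
            }
        },
        shouldEndSession: shouldEndSession
    };
}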
Once the assembly code is written, we just need to model the individual songs. Each song is a JSON object, and here are the variables that are passed in for "Twinkle Twinkle Little Star".
{
"songID":0,
"notes":[
"C, C, G, G, A, A, G",
"F, F, E, E, D, D, C",
"G, G, F, F, E, E, D",
"C, C, G, G, A, A, G",
"F, F, E, E, D, D, C"
],
"soundClips":[
"twinkle_pt1.mp3",
"twinkle_pt2.mp3",
"twinkle_pt3.mp3",
"twinkle_pt1.mp3",
"twinkle_pt2.mp3"
],
"userPause":["8s","8s","10s","8s","8s"],
"extraComments":[
"The first part starts with the following series of notes ",
"Now lets move on to the next part. The notes are ",
"Now lets move on to the next part. For this section, you will need to repeat it twice. The notes are ",
"You're doing great! For the next two sections we are just replaying the opening. The notes are ",
"Almost done, lets finish by playing "
]
}
This makes it easy to create other content, expanding the capabilities of the skill simply by decomposing the music and making more MP3 recordings, as sketched below.
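For example, an entry for "Mary Had a Little Lamb" might look like this. This is purely an illustration - it isn't part of the published skill, and the filenames, pause timings, and comments are placeholders - but the notes follow the standard C major version of the tune.

{
    "songID":1,
    "notes":[
        "E, D, C, D, E, E, E",
        "D, D, D, E, G, G",
        "E, D, C, D, E, E, E",
        "E, D, D, E, D, C"
    ],
    "soundClips":[
        "mary_pt1.mp3",
        "mary_pt2.mp3",
        "mary_pt3.mp3",
        "mary_pt4.mp3"
    ],
    "userPause":["8s","8s","8s","8s"],
    "extraComments":[
        "The first part starts with the following series of notes ",
        "Now lets move on to the next part. The notes are ",
        "This section repeats the opening. The notes are ",
        "Almost done, lets finish by playing "
    ]
}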
Step 5 - Practice, Practice, Practice
There's no substitute for allocating time to practice, but going back to the problem statement, here's an assessment of how valuable this would be.
- Convenience factor - you can do this anytime at home. No need to drive to a piano teacher for lessons, which removes an excuse since it's available 24x7.
- Financial limitations - the skill is free, and an Alexa device costs less than one month of lessons. It probably takes a few months to reach a beginner level, so this is a good way to get started.
- Productivity at home - a voice-activated assistant is more engaging than a piece of sheet music, and while there are some good YouTube videos to watch, they're not as interactive as a chatbot.
Go ahead and give it a try - it's called "Music Teacher".