The Pianist is a skill to assist musicians with their everyday tasks. It can give you a pitch when you need to tune your instrument. For singers, it can provide a vocal warmup that goes as low as C3 (130.81 Hz) and as high as G6 (1568.0 Hz).
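The endpoints of that range follow from twelve-tone equal temperament relative to A4 = 440 Hz; a quick sketch (the class and method names here are illustrative, not from the skill's source):

```java
// Compute equal-temperament pitch frequencies relative to A4 = 440 Hz.
public class Pitch {

    /** @param midiNote MIDI note number (C3 = 48, A4 = 69, G6 = 91) */
    public static double frequency(final int midiNote) {
        // Each semitone up multiplies the frequency by the twelfth root of two.
        return 440.0 * Math.pow(2.0, (midiNote - 69) / 12.0);
    }

    public static void main(final String[] args) {
        System.out.printf("C3 = %.2f Hz%n", frequency(48)); // ≈ 130.81
        System.out.printf("G6 = %.2f Hz%n", frequency(91)); // ≈ 1568.0
    }
}
```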
Lean Approach

We took a lean approach to creating this skill -- using the build, measure, learn, repeat cycle. We delivered the skill in three iterations. The first release of the skill could only play an A and do a simple ascending warmup. An A is sufficient to tune most instruments, and the simple warmup was comprehensive enough for many singers.
To measure the results of each release, we gathered quantitative usage data using AWS CloudWatch metrics for the λ. We also collected qualitative feedback on the skill through the reviews in the Alexa app.
After the initial release, we observed that people were willing to use the skill and that it was worthy of further development. Subsequent iterations introduced the ability to play the other 11 notes in a 12-tone chromatic scale and the ability to continue warming up by going higher, lower, back up, back down, and repeating.
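The "higher, lower, back up, back down, repeat" navigation amounts to a small piece of session state. A minimal sketch of that idea, assuming the warmup is stored as an ordered list of audio chunks and clamping at the ends (the real skill's classes and behaviour at the boundaries may differ):

```java
// Sketch of navigating between warmup chunks in response to
// "higher", "lower", and "repeat" commands. Names are illustrative.
public class WarmupSession {

    private final int chunkCount;
    private int current; // index of the chunk to play next

    public WarmupSession(final int chunkCount) {
        this.chunkCount = chunkCount;
        this.current = 0; // start at the lowest chunk
    }

    /** Move to the next-higher chunk, staying within range. */
    public int higher() {
        current = Math.min(current + 1, chunkCount - 1);
        return current;
    }

    /** Move to the next-lower chunk, staying within range. */
    public int lower() {
        current = Math.max(current - 1, 0);
        return current;
    }

    /** Play the same chunk again. */
    public int repeat() {
        return current;
    }
}
```

Clamping at the boundaries is just one possible design choice here; a skill could instead end the session or prompt the singer when the range is exhausted.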
Record Sound Clip

The first step is to record a sound clip using the piano and sound recorder. The total playback time of the audio, plus any synthesized speech, cannot exceed 90 seconds. It was for this reason that the warmup had to be split into multiple chunks. This technical limitation ended up making the skill more versatile -- allowing singers to customize their warmup routine.
Format Sound Clip

The audio file needs to be in MPEG version 2 (mp3) format. It may not be obvious, but this implies the sample rate must be either 22050 Hz, 24000 Hz, or 16000 Hz. In addition, the bit rate must be 48 kbps. More details are available in the Alexa Skills Kit documentation. I used the following FFmpeg command to convert an audio clip:
ffmpeg -y -i c4_-_c5.m4a -ar 16000 -ab 48k -codec:a libmp3lame -ac 1 c4_-_c5.mp3
The audio file needs to be hosted on a publicly-accessible host via HTTPS using an Amazon-trusted SSL certificate. Again, more details are available in the documentation.
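Once hosted, the clip is played by embedding its URL in an SSML `audio` tag in the skill's response. A minimal sketch of building that SSML by hand (the URL below is a placeholder, and the class and method names are ours, not the skill's source):

```java
// Build the SSML that tells Alexa to play a hosted audio clip and then
// speak a follow-up prompt. The clip URL must be served over HTTPS with
// an Amazon-trusted certificate.
public class AudioSsml {

    public static String audioResponse(final String clipUrl, final String followUpText) {
        return "<speak>"
                + "<audio src=\"" + clipUrl + "\"/>"
                + followUpText
                + "</speak>";
    }

    public static void main(final String[] args) {
        // Placeholder host; substitute wherever the clips are actually served.
        System.out.println(audioResponse(
                "https://example.com/audio/c4_-_c5.mp3",
                "Say higher, lower, or repeat."));
    }
}
```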
Challenges

We encountered several unexpected challenges while developing the skill.
Pronunciation

First, while handling the boundary cases for the warmups, we noticed that the Echo had trouble pronouncing foreign names like Popoli di Tessaglia and Die Entführung aus dem Serail. Fortunately, the Alexa Skills Kit supports a subset of the International Phonetic Alphabet (IPA). Although it does not support the full set of phones necessary to correctly pronounce the original Italian and German, we managed to synthesize an acceptable American pronunciation of the names using:
po.po.li di tɛ.ˈsɑljə
ɛnt.ˈfuɹʊŋ aus dem sɛˈɹaɪ

Unexpected Slot Values
Next, by observing invocation errors in the CloudWatch metrics and digging into the CloudWatch logs, we were surprised to find that sometimes the slot values provided to the λ do not exactly match any of the custom slot values we defined. For example, we have a custom slot type with 21 possible values that represent the 12 notes along with their alternative names (e.g. we have both "D Sharp" and "E Flat"). However, sometimes the slot value included punctuation (e.g. "f."). Sometimes it had invalid note names that did not correspond to any of the defined slot values (e.g. "scale"). Finally, sometimes it had extraneous articles (e.g. "a c" and "a c sharp"). To address these issues, we added unit tests to our suite that replicated these problems and then modified the code accordingly.
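A sketch of the kind of slot-value clean-up this implies: strip punctuation, drop a leading article, and canonicalise alternative names before matching. The class and method names are illustrative, and mapping flats onto their enharmonic sharps is our design choice for the sketch, not necessarily the skill's:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative normalisation of raw note-name slot values.
public class NoteSlot {

    private static final Set<String> NOTES = new HashSet<>(Arrays.asList(
            "c", "c sharp", "d", "d sharp", "e", "f", "f sharp",
            "g", "g sharp", "a", "a sharp", "b"));

    /** @return the canonical note name, or null if unrecognised (e.g. "scale") */
    public static String normalize(final String raw) {
        if (raw == null) {
            return null;
        }
        // Strip punctuation and normalise case, e.g. "f." -> "f".
        String value = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z ]", "").trim();
        value = mapFlats(value);
        if (NOTES.contains(value)) {
            return value;
        }
        // Try dropping a leading article, e.g. "a c sharp" -> "c sharp".
        // (Checked second so a bare "a sharp" is not mangled.)
        if (value.startsWith("a ")) {
            final String stripped = mapFlats(value.substring(2));
            if (NOTES.contains(stripped)) {
                return stripped;
            }
        }
        return null;
    }

    // Canonicalise flats as their enharmonic sharps, e.g. "e flat" -> "d sharp".
    private static String mapFlats(final String value) {
        return value.replace("d flat", "c sharp")
                .replace("e flat", "d sharp")
                .replace("g flat", "f sharp")
                .replace("a flat", "g sharp")
                .replace("b flat", "a sharp");
    }
}
```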
Cold Starts

One thing we noticed while testing the skill was that sometimes the Echo would take a long time to respond. Subsequent invocations would be much faster. This was corroborated by the CloudWatch metrics. Although most of the λ invocations completed in under 1s, there were many outliers that took on the order of 5.5s and one that took up to 7s. This is an unacceptable user experience.
Since we had decided to build the λ in Java, we thought that the culprit might be a combination of copying the archive, decompressing it, and loading classes into memory. To address class loading, we removed unnecessary uses of library classes such as the string utilities and validators from Apache commons-lang. To improve the time to handle the archive, we used ProGuard to shrink the size of the jar by removing unnecessary classes. It took many iterations to get the right classes excluded without breaking essential functionality. In the end, this is the configuration we used:
-dontobfuscate -dontoptimize -dontwarn ch.qos.logback.**,org.joda.** -keep class com.macasaet.** { *; } -keep class com.amazon.** { *; } -keep class com.amazonaws.** { *; } -keepclassmembers enum * { *; } -keepattributes InnerClasses,EnclosingMethod,Signature,*Annotation*
We managed to shrink the archive from 3.1 MB to 1.8 MB. Unfortunately, this had absolutely no effect on the cold start problem. When testing the λ in the AWS console after uploading it, we still observed 5+ second invocations.
Finally, we increased the amount of memory available to the λ, although the maximum memory it ever uses is 35 MB. We raised the available memory from 128 MB to 512 MB. By increasing the memory, we increase the share of the underlying hardware's compute resources allocated to the λ.
This had a tangible effect. In the AWS Console, the initial test invocation of the λ dropped to 2.5 seconds. In addition, the impact was obvious from the CloudWatch metrics as pictured below. Prior to increasing the RAM, there were outlier invocation times from 5.5 seconds to over 7 seconds. Similarly, the 6-hour moving average duration peaked at 2.6 seconds. After increasing the RAM, the outlier invocation times dropped to at most 1.7 seconds. The 6-hour moving average duration dropped to a peak of 674 milliseconds. This makes for a much more seamless experience for the musician.
We hope you enjoy The Pianist! Let us know what you think.