Motivation
As an indie video game creator and composer, I often write tunes that work fine on their own, but the transition from one area to the next is abrupt and kills the mood. A classic showcase of the problem is Chrono Trigger: each track is beautiful on its own, but the transitions are rather clumsy, which is a real shame. Therefore, I will use deep learning on an AMD graphics chip to create transitions between two tracks.
Method
Using PyTorch with ROCm on the Radeon Pro W7900, we train a transformer to predict MIDI-like note tokens. These are snippets from preprocessed MIDI tracks containing themes from popular media, scraped from MIDI sites such as midiworld.com; the dataset is around 50 MB uncompressed. To predict a transition rather than train in the typical causal setting, we mask out a random number of notes after the token that is to be predicted. Masking out a region means setting the corresponding attention scores of the first attention head to -float("inf") before the softmax, so that the transformer cannot use information from those tokens.
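As a minimal sketch of this masking (the function name and tensor shapes are my own for illustration, not taken from the project's code), the attention scores for the hidden span are set to -inf so that the softmax assigns those tokens zero weight:

```python
import torch
import torch.nn.functional as F

def attention_with_gap(q, k, gap_start, gap_len):
    """Scaled dot-product attention weights for one head, with a
    contiguous span of tokens hidden from every query.

    q, k: (seq_len, d_k) query and key matrices.
    gap_start: index of the first hidden token (the token right after
               the one being predicted).
    gap_len: number of tokens to hide (drawn at random during training).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)

    # No query may attend to keys inside the gap: exp(-inf) == 0 after softmax.
    scores[:, gap_start:gap_start + gap_len] = float("-inf")
    return F.softmax(scores, dim=-1)

# Example: a 16-token sequence where tokens 6..9 are hidden.
q, k = torch.randn(16, 64), torch.randn(16, 64)
attn = attention_with_gap(q, k, gap_start=6, gap_len=4)
assert attn[:, 6:10].sum().item() == 0.0  # the gap receives zero attention
```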
To achieve this, the project uses Kevin-Yang's (@jason9693) MIDI tokenizer and the Music Transformer architecture as implemented by Damon Gwinn, Ben Myrick, and Ryan Marshall.
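For anyone reproducing the preprocessing, it amounts to something like the snippet below. The encode_midi/decode_midi names follow the processor module in jason9693's midi-neural-processor repository, but the exact signatures and the file paths here are assumptions on my part:

```python
# Assumed interface of jason9693's midi-neural-processor (processor.py);
# verify the signatures against the repository before relying on them.
from processor import encode_midi, decode_midi

# Encode a MIDI file into a flat list of integer event tokens
# (note-on/note-off, time-shift, and velocity events).
tokens = encode_midi("data/some_theme.mid")  # hypothetical path
print(len(tokens), tokens[:8])

# After generation, a token stream can be written back out as a .mid file.
decode_midi(tokens, file_path="out/transition.mid")
```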
Results
The loss of this version is, as expected, lower than that of the original causal model, since the masked setup gives the model access to context on both sides of the gap. In practice, though, it has mostly learned to repeat previous sequences of notes and is not yet able to produce good transitions, so the project is sadly still work in progress. Nonetheless, I am personally happy with my learning journey, and the model occasionally produces non-trivial note sequences that are pleasant to listen to. On the AMD Radeon Pro W7900 with ROCm drivers, training for 100 epochs took around 10 hours, which is less than I expected for a transformer trained from scratch. Originally, I thought I would have to use a GAN, since GANs are easier to train at smaller scales.