This project was created as part of the lecture "Applied Artificial Intelligence" at the University of Applied Sciences Esslingen. The aim of this project was to provide a solution to a common problem using the concepts of artificial intelligence.
Contact
University of Applied Sciences Esslingen, 2021
The project idea
Due to the corona pandemic, a lot of free-time activities were cancelled. Hence, many students were bored and looking for variety. Most people like listening to music, so we decided to develop a feature that allows experimenting with music.
Goal of this Project
The goal of this idea is to change the genre of existing music. We want to implement a framework that can transfer the style and genre of a style audio file onto a content audio file while preserving the main features of the content file.
Due to the complexity of the topic, the small amount of available information on it and the limited time of one semester, we decided to start with some basic research to validate whether our goal is achievable.
Basic Concept
We found a lot of existing implementations of style transfer with images. So our approach was to first transform the audio files into image files, transfer the style between the two images, and then transform the resulting image back into an audio file.
The concept of transforming an audio file into an image file is very simple. It basically consists of an FFT, whose result can be interpreted as a grayscale image. An example of this grayscale image is shown below. The transformation back to an audio file is not as straightforward, since there is no simple mathematical solution for it, but we found code which does exactly this.
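A minimal sketch of this round trip, assuming librosa and soundfile are available; the file names and STFT parameters are illustrative and not the exact settings used in our notebooks. The inverse step uses Griffin-Lim phase estimation, which is one common way to go from a magnitude spectrogram back to audio.

```python
import numpy as np
import librosa
import soundfile as sf

# Forward direction: audio -> magnitude spectrogram ("grayscale image")
y, sr = librosa.load("content.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Scale to 0..1 so the spectrogram can be stored or displayed as a grayscale image
S_db = librosa.amplitude_to_db(S, ref=np.max)
img = (S_db - S_db.min()) / (S_db.max() - S_db.min())

# Backward direction: there is no exact inverse of a magnitude-only spectrogram,
# so the phase is estimated iteratively (Griffin-Lim) before resynthesis.
y_rec = librosa.griffinlim(S, hop_length=512)
sf.write("reconstructed.wav", y_rec, sr)
```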
The challenging part consists of converting this grayscale image into an RGB image, because the pre-trained VGG16 image transformation network, which we chose because a lot of documentation is available for it, expects a colour image. Since there is not enough information in the grayscale image to turn it into a fully coloured image, there are multiple possibilities. For example, we tried to treat every colour channel the same, so the resulting image still looks like a grayscale image but is in fact a colour image. Another approach was to map different colours over the whole value range of the grayscale image. However, none of these methods worked as expected: the output did not sound anything like the input.
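The two colourisation variants described above could look roughly like this; `img` refers to the normalised spectrogram from the previous sketch, and the colormap is only an example, not necessarily the one we used.

```python
import numpy as np
import matplotlib.pyplot as plt

# Variant 1: repeat the single channel three times.
# The result still looks gray but has the (H, W, 3) shape VGG16 expects.
rgb_stacked = np.stack([img, img, img], axis=-1)

# Variant 2: spread a colormap over the whole value range of the grayscale image,
# producing a genuinely coloured image.
cmap = plt.get_cmap("viridis")
rgb_mapped = cmap(img)[..., :3]   # drop the alpha channel
```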
During our research we found a similar approach that claims to transfer the voice between two podcast speakers. So we also tried this implementation with two music files, but the result was also poor and consisted mostly of noise.
So we thought it might be necessary to preprocess the music files before applying the style transfer. This led to the following concept.
Implementation
Since there is an implementation for voices, we decided to split the music files into multiple audio files, each containing one "voice" of the music, such as vocals, guitar, keyboard and drums. Then we apply the style transfer between the matching parts and combine the results back into one audio file. As stated above, the results of the style transfer are very noisy, so before combining the transferred audio files back into one, we also planned to implement some sort of denoising. The whole process is shown in the following image, and a code sketch of the pipeline can be found below it.
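The per-voice pipeline could be sketched as follows. Spleeter is one library that performs this kind of source separation; the write-up does not prescribe a specific tool, and style_transfer() and denoise() are stand-ins for the corresponding notebooks, implemented here as identity functions only so the sketch runs.

```python
import numpy as np
import librosa
import soundfile as sf
from spleeter.separator import Separator

def style_transfer(content, style, sr):
    # Placeholder for the spectrogram-based style-transfer notebook.
    return content

def denoise(audio, sr):
    # Placeholder for the denoising step.
    return audio

# Split both files into four "voices": vocals, drums, bass and other.
separator = Separator("spleeter:4stems")
separator.separate_to_file("style.wav", "stems/style")
separator.separate_to_file("content.wav", "stems/content")

mixed, sr = None, None
for stem in ["vocals", "drums", "bass", "other"]:
    style, sr = librosa.load(f"stems/style/style/{stem}.wav", sr=None)
    content, _ = librosa.load(f"stems/content/content/{stem}.wav", sr=None)
    transferred = denoise(style_transfer(content, style, sr), sr)
    if mixed is None:
        mixed = transferred
    else:
        n = min(len(mixed), len(transferred))
        mixed = mixed[:n] + transferred[:n]   # recombine the transferred voices

sf.write("result.wav", mixed, sr)
```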
We implemented each part of the process chain in its own Python notebook. This was done to break down the complexity and to make it easier to validate the separate steps. To share information and data between the different notebooks, we relied on Google Drive, which can be easily accessed from Google Colab. With Google Colab it is also possible to enable hardware acceleration, which reduces the processing time significantly.
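Mounting the shared drive in each Colab notebook takes only a few lines; the folder name below is an example, not our actual layout.

```python
from google.colab import drive

# Make the shared Google Drive folder visible to this notebook.
drive.mount("/content/drive")
data_dir = "/content/drive/MyDrive/audio_style_transfer"  # example folder name
```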
Due to the limited time, we unfortunately were not able to implement the whole process, but we got most of the steps working.
To give a first impression, we made a demo notebook which contains a simpler version of the process, but which can be run as it is and does not require Google Drive or multiple notebooks.
Getting started
To get a first impression of the current state of this project, you can open the notebook "getting_started.ipynb", which is located in the "python" folder of the GitHub repo, in Google Colab.
To speed up the transfer you should enable hardware acceleration. This can be done via the "Runtime" tab and then "Change runtime type". There you have to switch the "Hardware accelerator" option from "None" to "GPU". Then you can run the whole notebook via the menu or each cell step by step.
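To verify that the GPU runtime is actually active, a quick check like the following can be run in the first cell (assuming a TensorFlow runtime; the equivalent PyTorch check would be torch.cuda.is_available()).

```python
import tensorflow as tf

# Should list at least one GPU device if the accelerator was switched correctly.
print(tf.config.list_physical_devices("GPU"))
```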
First, the script asks you to upload the style audio file via the upload prompt. In the next step, you have to upload a content audio file. Make sure to only upload WAV files! After that, the files are split into a vocal part and a playback part. If you run the whole notebook as it is, the style transfer is applied only to the playback part. Depending on the chosen audio file length, this can take a long time. When it is done, you get a prompt where you can listen to the resulting audio file. If you follow the included guide and repeat half of the notebook as described, both parts get transferred.
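The upload and playback prompts in the notebook correspond to Colab's standard widgets; the variable and file names here are only illustrative.

```python
from google.colab import files
from IPython.display import Audio

style_upload = files.upload()     # prompts for the style .wav file
content_upload = files.upload()   # prompts for the content .wav file

# After the transfer, the result can be played back directly in the notebook.
Audio("result.wav")
```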
Conclusion
While working on this project, we realised that adapting an image style transfer to work with audio is not as straightforward as we thought at the beginning. We encountered many challenges which would need more time to solve than we had as part of the lecture.
Unfortunately we could not fully reach our goal, but this project could serve as a basis for further work on the topic.