Written by Alex Korte
Published on Jun 1, 2023

About

I set out to tackle a very subjective problem: generating album art based only on the audio of a song. This is obviously hard to do, and while the resulting model is by no means perfect, it is interesting to see what it has learned to generate conditioned on music.

Model Architecture

The model is a latent diffusion model with the conditioning head replaced by the MERT-v1-95M music representation model. The MERT embedding of a song is passed into the conditioning pipeline of the diffusion model in place of the usual text embedding.
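To make the conditioning swap concrete, here is a minimal sketch of how MERT frame embeddings could be mapped into the tensor shape Stable Diffusion's UNet expects for its cross-attention context. The `AudioConditioner` module, the fixed 77-token context length, and the pooling strategy are illustrative assumptions, not the exact pipeline used here; conveniently, MERT-v1-95M's hidden size and SD v1.4's cross-attention dimension are both 768.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MERT_DIM = 768        # MERT-v1-95M hidden size
CROSS_ATTN_DIM = 768  # SD v1.4 cross-attention dim (CLIP ViT-L/14)

class AudioConditioner(nn.Module):
    """Hypothetical adapter: maps a sequence of MERT frame embeddings
    into the (batch, 77, 768) shape the UNet's cross-attention layers
    expect as encoder_hidden_states."""
    def __init__(self, mert_dim=MERT_DIM, cond_dim=CROSS_ATTN_DIM, n_tokens=77):
        super().__init__()
        self.proj = nn.Linear(mert_dim, cond_dim)
        self.n_tokens = n_tokens  # match the text encoder's 77-token context

    def forward(self, mert_hidden):                 # (B, T, mert_dim)
        x = self.proj(mert_hidden)                  # (B, T, cond_dim)
        # Resample the variable-length time axis to a fixed token count
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), self.n_tokens)
        return x.transpose(1, 2)                    # (B, 77, cond_dim)

# Dummy tensor standing in for real MERT output (~10 s of audio frames)
mert_out = torch.randn(2, 500, MERT_DIM)
cond = AudioConditioner()(mert_out)
print(cond.shape)  # torch.Size([2, 77, 768])
```

The resulting tensor can then be passed wherever the UNet would normally receive the CLIP text embedding, which is what lets the rest of the diffusion pipeline stay unchanged.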

Generations

While music is obviously very subjective, there are hints that the model has learned to match the general vibe of a song. Qualitatively, generated images tend to match the color palette and mood of the original album art. The model also picks up on cues in the audio: for example, urban-sounding music produces images with an urban feel, including patterns resembling brick walls and automobiles.

Interesting Notes

We fine-tuned the UNet of Stable Diffusion v1.4 and used the corresponding autoencoder. Even though the conditioning head is completely different, this surprisingly produced less abstract results than training the UNet from scratch.
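The fine-tuning loop itself follows the standard latent-diffusion noise-prediction objective: noise the VAE-encoded album art, ask the UNet to predict that noise given the audio conditioning, and minimize MSE. The sketch below shows one such step under stated assumptions: `TinyUNet` is a stand-in for the real SD v1.4 UNet so the example runs end to end, the dimensions mirror SD's latent and conditioning shapes, and the linear beta schedule matches the DDPM defaults Stable Diffusion uses.

```python
import torch
import torch.nn.functional as F

class TinyUNet(torch.nn.Module):
    """Stand-in for the pretrained epsilon-predicting SD v1.4 UNet,
    kept trivially small so this sketch is self-contained."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(4, 4, 3, padding=1)
        self.cond_proj = torch.nn.Linear(768, 4)

    def forward(self, latents, t, cond):
        # Inject conditioning as a per-channel bias (real UNet: cross-attention)
        bias = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.net(latents) + bias

unet = TinyUNet()
opt = torch.optim.AdamW(unet.parameters(), lr=1e-5)

latents = torch.randn(2, 4, 64, 64)  # VAE-encoded 512x512 album art
cond = torch.randn(2, 77, 768)       # audio conditioning tokens (MERT-derived)
t = torch.randint(0, 1000, (2,))     # random diffusion timesteps

# Linear beta schedule over 1000 steps, as in DDPM / Stable Diffusion
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
a = alphas_cumprod[t][:, None, None, None]

# Forward process: noise the latents at timestep t
noise = torch.randn_like(latents)
noisy = a.sqrt() * latents + (1 - a).sqrt() * noise

# Epsilon-prediction objective: UNet predicts the added noise
pred = unet(noisy, t, cond)
loss = F.mse_loss(pred, noise)
loss.backward()
opt.step()
```

Because only the UNet (and adapter) receive gradients while the pretrained autoencoder stays frozen, the model keeps the image priors learned during Stable Diffusion's original training, which is plausibly why fine-tuning beat training from scratch here.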