Multi-style music generation model design based on variational autoencoders and generative adversarial networks
Abstract
Progress in deep learning has made the synthesis of music in different styles an important frontier in AI research. However, existing methods still struggle to produce coherent multi-level structure, control style precisely, and maintain stable training. To address these problems, we propose a novel model combining a variational autoencoder and a generative adversarial network (VAE-GAN) for high-quality, controllable multi-style music generation. The model comprises a multi-scale temporal feature fusion transformer that captures both local and global musical structure, a variational autoencoder-based style decoupling module that separates content and style representations, and an adaptive adversarial training scheme with multiple discriminators that improves generation quality and stabilizes training. Experiments on the MAESTRO and Lakh MIDI datasets show that the proposed model outperforms state-of-the-art baselines: on MAESTRO it achieves a 22.0% reduction in MSE and a 26.9% reduction in FAD relative to the strongest existing method, and on Lakh MIDI it reduces MSE by 21.7% and FAD by 26.9%. These results demonstrate that the model can generate expressive, structurally coherent music in varied styles, making it a strong solution for multi-style music generation.
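To make the abstract's architecture concrete, the following is a minimal, hypothetical sketch of how a VAE with content/style latent splitting and an adversarial generator-side loss could be wired together in PyTorch. All module names, dimensions, the single-scale transformer encoder, and the pooled reconstruction target are illustrative assumptions for exposition only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleDisentanglingVAE(nn.Module):
    """Encodes a note-feature sequence, then splits the latent into content and style parts."""

    def __init__(self, d_model=256, z_content=64, z_style=32, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_mu = nn.Linear(d_model, z_content + z_style)
        self.to_logvar = nn.Linear(d_model, z_content + z_style)
        # Simplified decoder: reconstructs a pooled sequence representation, not the full score.
        self.decoder = nn.Sequential(
            nn.Linear(z_content + z_style, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.z_content = z_content

    def forward(self, x):
        h = self.encoder(x).mean(dim=1)                          # pool over time
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        content, style = z[:, :self.z_content], z[:, self.z_content:]
        recon = self.decoder(torch.cat([content, style], dim=-1))
        return recon, mu, logvar


def generator_losses(recon, target, mu, logvar, d_fake_logits):
    """Combines reconstruction, KL, and a non-saturating adversarial term (weights omitted)."""
    recon_loss = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))            # fool the discriminator(s)
    return recon_loss + kl + adv


# Toy usage: a batch of 2 sequences, 16 timesteps, 256-dim note features.
model = StyleDisentanglingVAE()
x = torch.randn(2, 16, 256)
recon, mu, logvar = model(x)
```

In this sketch the multi-discriminator aspect is reduced to a single logit tensor passed into `generator_losses`; a multi-scale variant would pool the encoder output at several temporal resolutions and score each with its own discriminator, but those details are not specified by the abstract.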