Cross-modal deep collaborative networks for intelligent music composition and interactive performance
Abstract
With the growing demand for intelligent music generation and human–computer interaction, music composition driven by natural language descriptions has become a significant research direction in this field. However, existing approaches still exhibit limitations in the semantic alignment between text and music, in the audio quality of the generated output, and in controllability, making it difficult to balance high fidelity with semantic consistency. To address these issues, this study proposes TMG-Net, a framework that integrates a text Transformer, an Audio Transformer, contrastive learning, and diffusion-based generation for text-driven music generation. The framework employs the text Transformer to capture the semantic features of the text and the Audio Transformer to capture the time–frequency information of the audio, and it aligns the two modalities through contrastive learning. On this basis, a conditional diffusion model is introduced to generate Mel-spectrograms, which are subsequently reconstructed into high-fidelity music by a vocoder, thereby enhancing both the naturalness and the semantic consistency of the generated music. Experiments on the MusicCaps and Song Describer Dataset public benchmarks demonstrate that TMG-Net significantly outperforms representative methods such as MuseGAN, Restyle-MusicVAE, and Mustango on three key metrics (FAD, CLAP Score, and R@10), while approaching the performance of MusicLLM. These results indicate that TMG-Net can effectively align with textual semantics while ensuring audio quality, offering a new technological pathway and application potential for intelligent music creation and interactive performance.
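To make the pipeline described above concrete, the following is a minimal sketch, in PyTorch, of the two training signals the abstract mentions: a contrastive loss that aligns text and Mel-spectrogram embeddings, and a text-conditioned denoising loss standing in for the conditional diffusion stage. All module names, layer sizes, and the simplified noising step are illustrative assumptions for exposition, not the authors' implementation.

```python
# Sketch of cross-modal contrastive alignment + text-conditioned denoising.
# Assumed components (not from the paper): TextEncoder, AudioEncoder,
# ConditionalDenoiser, and a single-step noising scheme in place of a full
# DDPM noise schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """Transformer over token embeddings, mean-pooled to one text vector."""
    def __init__(self, vocab_size=10000, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                      # tokens: (B, L)
        h = self.encoder(self.embed(tokens))        # (B, L, dim)
        return F.normalize(h.mean(dim=1), dim=-1)   # unit-norm (B, dim)


class AudioEncoder(nn.Module):
    """Transformer over Mel-spectrogram frames (time-frequency features)."""
    def __init__(self, n_mels=80, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mel):                         # mel: (B, T, n_mels)
        h = self.encoder(self.proj(mel))
        return F.normalize(h.mean(dim=1), dim=-1)


def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over matched text-audio pairs within a batch."""
    logits = text_emb @ audio_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a Mel-spectrogram,
    conditioned on the text embedding via an additive shift."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.cond = nn.Linear(dim, n_mels)
        self.net = nn.Sequential(nn.Linear(n_mels, dim), nn.GELU(),
                                 nn.Linear(dim, n_mels))

    def forward(self, noisy_mel, text_emb):          # (B, T, n_mels), (B, dim)
        shift = self.cond(text_emb).unsqueeze(1)     # broadcast over time
        return self.net(noisy_mel + shift)           # predicted noise


if __name__ == "__main__":
    B, L, T, n_mels = 4, 16, 128, 80
    tokens = torch.randint(0, 10000, (B, L))
    mel = torch.randn(B, T, n_mels)                  # placeholder spectrograms

    text_enc, audio_enc, denoiser = TextEncoder(), AudioEncoder(), ConditionalDenoiser()

    t_emb, a_emb = text_enc(tokens), audio_enc(mel)
    align_loss = contrastive_loss(t_emb, a_emb)      # cross-modal alignment

    noise = torch.randn_like(mel)                    # simplified single-step
    diff_loss = F.mse_loss(denoiser(mel + noise, t_emb), noise)

    (align_loss + diff_loss).backward()
    print(float(align_loss), float(diff_loss))
```

In a full system, the denoiser would be trained under a proper diffusion noise schedule and its sampled Mel-spectrograms passed to a neural vocoder for waveform reconstruction, as the abstract describes.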