Few-shot LoRA tuning for genre-specific music generation with semantic prompt matching
Abstract
This paper presents an efficient method for genre-specific music generation by applying Low-Rank Adaptation (LoRA) to the text encoder of MusicGen, a large-scale text-to-music generation model. Full fine-tuning of such models is computationally expensive and resource-intensive, making it impractical for lightweight applications or small research groups. To address this, we fine-tune only a small number of parameters with LoRA, substantially reducing training cost while preserving the base model's capabilities. We further propose a mechanism that automatically selects the most suitable genre-specific LoRA adapter based on cosine similarity between the user's prompt and predefined genre labels in the text embedding space, enabling effective genre-adapted generation even when the prompt does not explicitly mention a genre. Experiments on the jazz and hip-hop genres of the FMA dataset show that the proposed method improves alignment between prompts and generated audio, measured by CLAP-based text-to-audio similarity: on average, our method raised the similarity from 0.35 to 0.38 for jazz prompts and from 0.31 to 0.34 for hip-hop prompts, a consistent gain over the baseline model. These results validate the effectiveness of LoRA for genre adaptation and of the proposed adapter selection strategy, and demonstrate that genre-adapted LoRA tuning yields more semantically aligned and stylistically appropriate music. Our approach enables flexible and efficient customization of music generation models with minimal resources across diverse genres and applications.
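The adapter selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding function, dimensionality, and genre vectors here are toy placeholders standing in for the text encoder's actual prompt and genre-label embeddings.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_adapter(prompt_emb: np.ndarray, genre_embs: dict) -> str:
    """Pick the genre label whose embedding is most similar to the prompt;
    the corresponding LoRA adapter would then be loaded for generation."""
    return max(genre_embs, key=lambda g: cosine_similarity(prompt_emb, genre_embs[g]))


# Toy 4-dimensional embeddings standing in for real text-encoder outputs.
genre_embs = {
    "jazz": np.array([0.9, 0.1, 0.0, 0.2]),
    "hip-hop": np.array([0.1, 0.9, 0.3, 0.0]),
}
# Hypothetical embedding of a prompt with no explicit genre mention,
# e.g. "smooth saxophone over brushed drums".
prompt_emb = np.array([0.8, 0.2, 0.1, 0.1])

print(select_adapter(prompt_emb, genre_embs))  # "jazz" for this toy input
```

Because selection happens purely in the text embedding space, it adds only a handful of dot products per generation request and requires no changes to the base model.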