Cross-modal deep collaborative networks for intelligent music composition and interactive performance


Abstract

With the growing demand for intelligent music generation and natural human–computer interaction, music composition driven by natural-language descriptions has become a significant research direction in this field. However, existing approaches still fall short in the semantic alignment between text and music, the audio quality of the generated output, and controllability, making it difficult to balance high fidelity with semantic consistency. To address these issues, this study proposes TMG-Net, a framework for text-driven music generation that integrates a Transformer text encoder, an Audio Transformer, contrastive learning, and diffusion-based generation. The framework employs the Transformer to capture the semantic features of text and the Audio Transformer to capture the time–frequency information of audio, and achieves cross-modal alignment through contrastive learning. On this basis, a conditional diffusion model is introduced to generate Mel-spectrograms, which are subsequently reconstructed into high-fidelity music by a vocoder, enhancing both the naturalness and the semantic consistency of the generated music. Experiments on the MusicCaps and Song Describer Dataset public benchmarks demonstrate that TMG-Net significantly outperforms representative methods such as MuseGAN, Restyle-MusicVAE, and Mustango on three key metrics (FAD, CLAP Score, and R@10), while approaching the performance of MusicLLM. These results indicate that TMG-Net can effectively align generated music with textual semantics while preserving audio quality, offering a novel technological pathway and application potential for intelligent music creation and interactive performance.
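
To make the cross-modal alignment stage concrete, the sketch below implements a symmetric InfoNCE (CLAP-style) contrastive loss between paired text and audio embeddings, the standard formulation for this kind of alignment. The 512-dimensional shared space, the temperature of 0.07, and the random example batch are illustrative assumptions; the abstract does not specify TMG-Net's exact projection heads or loss.

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               audio_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired text/audio embeddings.

    text_emb, audio_emb: (batch, dim) projections from the text encoder and
    the Audio Transformer, assumed already mapped into a shared space.
    """
    # L2-normalize so dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = text_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the text-to-audio and audio-to-text cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: a batch of 8 paired embeddings in a 512-dim space.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))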
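
The generation stage can be sketched in the same spirit. The code below shows ancestral DDPM sampling of a Mel-spectrogram conditioned on a text embedding, the general pattern behind conditional diffusion models; the linear noise schedule, the step count, the (batch, mel_bins, frames) shape, and the denoiser interface are assumptions for illustration, not the authors' implementation. The sampled spectrogram would then be handed to a vocoder, as the abstract describes.

import torch

T = 50                                    # diffusion steps (assumed, kept small)
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_mel(denoiser, text_cond: torch.Tensor,
               shape=(1, 80, 256)) -> torch.Tensor:
    """Ancestral DDPM sampling of a Mel-spectrogram conditioned on text.

    denoiser(x_t, t, text_cond) is assumed to predict the added noise
    (epsilon parameterization); shape is (batch, mel_bins, frames).
    """
    x = torch.randn(shape)                # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.full((shape[0],), t), text_cond)
        alpha, alpha_bar = alphas[t], alpha_bars[t]
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - (1.0 - alpha) / (1.0 - alpha_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:                         # add noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

# Hypothetical usage with a dummy denoiser standing in for the trained network:
dummy_denoiser = lambda x_t, t, cond: torch.zeros_like(x_t)
mel = sample_mel(dummy_denoiser, text_cond=torch.randn(1, 512))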

Notes for potential reviewers

  • Volunteering is not a guarantee that you will be asked to review. There are many reasons for this: reviewers must be qualified, there must be no conflicts of interest, the required minimum of two reviewers may already have accepted invitations, etc.
  • This is NOT open peer review. The review is single-blind, and all recommendations are sent privately to the Academic Editor handling the manuscript.
  • What happens after volunteering? It may be a few days before you receive an invitation to review, with further instructions. You will need to accept the invitation to become an official referee for the manuscript. If you do not receive an invitation, it is for one of the many possible reasons noted above.

  • PeerJ Computer Science does not judge submissions on subjective measures such as novelty, impact, or degree of advance. Effectively, reviewers are asked to comment on whether or not the submission is scientifically and technically sound and therefore deserves to join the scientific literature. Our peer review criteria can be found on the "Editorial Criteria" page; reviewers are specifically asked to comment on three broad areas: "Basic Reporting", "Experimental Design", and "Validity of the Findings".
  • Reviewers are expected to comment in a timely, professional, and constructive manner.
  • Until the article is published, reviewers must regard all information relating to the submission as strictly confidential.
  • When submitting a review, reviewers are given the option to "sign" their review (i.e. to associate their name with their comments). Otherwise, all review comments remain anonymous.
  • All reviews of published articles are published. This includes manuscript files, peer review comments, author rebuttals and revised materials.
  • Each time a decision is made by the Academic Editor, each reviewer will receive a copy of the Decision Letter (which will include the comments of all reviewers).

If you have any questions about submitting your review, please email us at [email protected].