All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Dear authors, thank you for your revision. Based on the comments of the reviewers in the second round, I am pleased to accept your manuscript. Thank you for your contribution.
[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]
Comments are addressed.
Comments are addressed.
Comments are addressed.
The authors have addressed my comments; therefore, the paper can be accepted for publication in the present format.
no comment
no comment
no comment
no comment
Thank you for submitting your manuscript. We understand that the proposed framework shows promise, but the manuscript requires major revisions to meet our publication standards. Please clarify the specific limitations of existing models and your model's unique contributions with references and examples. Provide detailed technical explanations, including mathematical formulations for key components like the genetic algorithm and AdaIN mechanism. Strengthen the evaluation by specifying dataset details, adding user-centric metrics, and comparing against state-of-the-art baselines like DALL·E 2.
Please carefully address the comments of the reviewers as well, and resubmit a detailed response along with the updated paper.
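For reference, the standard AdaIN operation (Huang & Belongie) that such a formulation would be expected to include is given below, together with a purely hypothetical fitness combination illustrating the level of detail requested; the actual terms and weights must come from the authors.

```latex
% Standard AdaIN: content features x re-normalized with style statistics from y
\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)

% Hypothetical GA fitness of the requested form (terms and weights are placeholders)
F(I \mid c, s) = \lambda_1\,\mathrm{CLIPScore}(I, c)
               + \lambda_2\,\mathrm{StyleSim}(I, s)
               - \lambda_3\,\mathrm{LPIPS}(I, I_{\mathrm{ref}})
```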
The manuscript’s language is overly complex, making it difficult to follow. Key terms like “style migration module” and “preference vectors” should be more clearly defined. Figures are not self-explanatory and lack detailed captions. Please simplify the writing, define technical terms on first use, and add missing experimental details.
Critical information on data preprocessing, model hyperparameters, and training/validation splits is missing or vague. There is no discussion of how user preference vectors are encoded or how random seed and evaluation bias are controlled. Please clarify these aspects and provide complete implementation details.
The validity of the findings is limited by a lack of detail on how user studies were conducted, including participant selection and evaluation protocols. Claims of superior performance rely mainly on standard benchmarks without in-depth error analysis or statistical testing. Conclusions often overreach what the reported evidence supports.
The paper lacks clarity on how multimodal alignment is achieved. Is there a shared embedding space for text and image features? If so, which encoder (e.g., CLIP, BERT-ViT) is used?
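To make the question concrete, here is a minimal sketch of how such a shared embedding space is typically probed, assuming an off-the-shelf CLIP encoder (illustrative only, not the authors' implementation):

```python
# Illustrative sketch (not the authors' code): measuring text-image alignment
# in a shared embedding space with an off-the-shelf CLIP encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_scene.png")          # hypothetical output image
prompt = "a tense, rain-soaked alley at night"     # hypothetical scene prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Both modalities are projected into the same space; cosine similarity
# between the two embeddings is one concrete definition of "alignment".
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
alignment = (img_emb @ txt_emb.T).item()
print(f"text-image cosine similarity: {alignment:.3f}")
```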
The fusion of semantic and visual information requires a formal definition. Please provide mathematical notations and flow diagrams illustrating the stages of feature combination.
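Purely for illustration, two standard forms such a definition could take; which (if either) the paper uses is what needs to be stated:

```latex
% Concatenation of visual features v and text features t, followed by a learned projection
f = W\,[\,v \,;\, t\,] + b, \qquad v \in \mathbb{R}^{d_v},\; t \in \mathbb{R}^{d_t}

% Or cross-attention from visual queries to textual keys/values
f = \mathrm{softmax}\!\left(\frac{Q_v K_t^{\top}}{\sqrt{d_k}}\right) V_t
```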
The process for generating "interactive" scenes is not clearly described. How are interactivity and animation dynamics modeled—are they rule-based or learned?
No attention mechanism is mentioned. Could visual attention be integrated to better correlate scene regions with text prompts?
The ablation experiments do not isolate the contributions of visual versus textual modalities. Such tests are essential to evaluate the synergy in multimodal fusion.
Please elaborate on the dataset construction. Are the text descriptions manually labeled or scraped? What annotation quality control mechanisms are in place?
How is semantic granularity (e.g., action verbs, mood descriptors, spatial relations) extracted and encoded from the text?
Table 1 presents the dataset composition but omits average text length, vocabulary size, and distribution across scene types. Please include these details for reproducibility.
How are abstract concepts (e.g., "tension", "fantasy", "urgency") in scene prompts translated into visual features?
If synthetic datasets or augmentation methods were used, clarify how realism was preserved and how human evaluation was ensured.
Figures 1 and 2 provide high-level overviews of the Stable Diffusion and EAGAN frameworks, yet they lack annotations or legends for key components. For instance:
• The "Style Migration Module" in Figure 2 is not labeled with AdaIN or its sub-components.
• Variables such as z, ε, and c appear in equations but are not defined graphically or contextually.
Equations (1) to (12) are referenced but most are not explicitly defined. Key derivations—such as the noise injection formula, denoising process, or GA fitness function—are either absent or only abstractly described.
Explicitly write out the full mathematical formulations.
Clearly define all variables (x_t, t, c, etc.) and ensure the numbering aligns with the equations referred to in the text.
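As a point of reference, the standard conditional diffusion formulation that the noise-injection and denoising equations would be expected to instantiate (the paper's own notation may differ) is:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \mathbb{E}_{x_0,\,\varepsilon,\,t}
\left\| \varepsilon - \varepsilon_\theta(x_t, t, c) \right\|_2^2
```

Here x_t is the noisy latent at timestep t, \bar{\alpha}_t the cumulative noise schedule, c the text conditioning, and \varepsilon_\theta the denoising network.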
The StyleSim component of the GA fitness function is not clearly described:
• Is it calculated using Gram matrices, CLIP embedding cosine similarity, or handcrafted features?
• How is the user style preference vector obtained—via reference images, textual cues, or vector embeddings?
Add a subsection or paragraph describing how StyleSim is computed, the embedding space used, and whether it is trainable, fixed, or learned from user interaction data.
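For concreteness, one plausible Gram-matrix instantiation of StyleSim is sketched below (an assumption for illustration only; CLIP cosine similarity would be equally acceptable if that is what the paper uses):

```python
# One plausible definition of StyleSim (Gram-matrix similarity on VGG features);
# whether the paper uses this, CLIP cosine similarity, or something else
# is exactly what the authors need to state.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

features = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()

def gram(feat):                       # feat: (B, C, H, W)
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

@torch.no_grad()
def style_sim(img_a, img_b, layer=21):
    # img_a, img_b: preprocessed (1, 3, H, W) tensors; layer index is arbitrary here
    feats = {}
    for name in ("a", "b"):
        x = img_a if name == "a" else img_b
        for i, m in enumerate(features):
            x = m(x)
            if i == layer:
                feats[name] = gram(x)
                break
    return F.cosine_similarity(feats["a"].flatten(1), feats["b"].flatten(1)).item()
```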
Although the paper shows GA performance (Figures 5–8), key details are missing:
• No information is provided on population size, mutation/crossover rate, or selection strategy.
• There is no convergence curve to justify why 20 iterations is optimal.
Include a table listing GA parameters, and present a convergence curve showing fitness or metric improvement per generation. Discuss trade-offs in computational cost vs. performance gain.
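To illustrate the level of detail requested, a minimal GA skeleton is sketched below; every value (population size, rates, genome length, fitness) is a placeholder rather than the paper's actual setting, and the per-generation best fitness it records is exactly the data a convergence curve would plot.

```python
# Minimal GA skeleton showing the parameters the review asks to be reported
# (population size, mutation/crossover rates, selection strategy).
import random

POP_SIZE, GENERATIONS = 30, 20
CROSSOVER_RATE, MUTATION_RATE = 0.8, 0.1
GENOME_LEN = 16                          # e.g., a latent/style preference vector

def fitness(genome):                     # placeholder; the paper's F would go here
    return -sum((g - 0.5) ** 2 for g in genome)

def tournament(pop, k=3):                # tournament selection
    return max(random.sample(pop, k), key=fitness)

pop = [[random.random() for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
history = []
for gen in range(GENERATIONS):
    nxt = []
    while len(nxt) < POP_SIZE:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < CROSSOVER_RATE:        # one-point crossover
            cut = random.randrange(1, GENOME_LEN)
            child = p1[:cut] + p2[cut:]
        else:
            child = p1[:]
        child = [g + random.gauss(0, 0.05) if random.random() < MUTATION_RATE else g
                 for g in child]                    # Gaussian mutation
        nxt.append(child)
    pop = nxt
    history.append(max(fitness(g) for g in pop))    # data for a convergence curve
print(history)
```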
While models such as Stable Diffusion, VQGAN+CLIP, and StyleGAN-NADA are used for comparison, the comparison focuses mainly on image quality and text-image consistency. There is no structured comparison of:
• Controllability over stylistic attributes,
• Responsiveness to user preferences.
The handling of the user preference vectors is not clearly explained, in particular:
• How these vectors are collected or encoded,
• Whether they come from CLIP, VGG, user annotations, or image uploads.
Provide a concrete example of user interaction (e.g., “user uploads an Ukiyo-e style image, which is encoded via CLIP to obtain the style vector”). Clarify how these are fed into the AdaIN module and used in the GA’s fitness function.
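A hypothetical sketch of how such a CLIP-derived style vector could be wired into an AdaIN-style module is given below; whether this matches the paper's actual design is what the revision should make explicit.

```python
# Hypothetical sketch: a CLIP-derived style vector driving an AdaIN-style module.
import torch
import torch.nn as nn

class StyleAdaIN(nn.Module):
    def __init__(self, style_dim=512, num_channels=256):
        super().__init__()
        # Map the style vector (e.g., a CLIP image embedding of an Ukiyo-e
        # reference image) to per-channel scale and shift parameters.
        self.to_scale_shift = nn.Linear(style_dim, num_channels * 2)

    def forward(self, content_feat, style_vec):
        # Instance-normalize the content features (B, C, H, W) ...
        mu = content_feat.mean(dim=(2, 3), keepdim=True)
        sigma = content_feat.std(dim=(2, 3), keepdim=True) + 1e-6
        normalized = (content_feat - mu) / sigma
        # ... then modulate them with style-conditioned statistics.
        scale, shift = self.to_scale_shift(style_vec).chunk(2, dim=1)
        return normalized * (1 + scale[..., None, None]) + shift[..., None, None]

# Usage: out = StyleAdaIN()(generator_features, clip_style_embedding)
```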
Although the paper shows improvements in FID, CLIPScore, LPIPS, etc., no statistical tests are conducted: Are the improvements significant across random seeds or multiple runs?
What is the standard deviation or confidence interval?
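A minimal sketch of the kind of reporting requested, using placeholder numbers (not results from the paper): mean ± standard deviation over seeds plus a paired t-test against a baseline.

```python
# Placeholder FID values over 5 seeds; real values would come from the experiments.
import numpy as np
from scipy import stats

ours_fid     = np.array([14.2, 14.8, 13.9, 14.5, 14.1])   # hypothetical
baseline_fid = np.array([16.1, 15.7, 16.4, 15.9, 16.2])   # hypothetical

print(f"ours:     {ours_fid.mean():.2f} ± {ours_fid.std(ddof=1):.2f}")
print(f"baseline: {baseline_fid.mean():.2f} ± {baseline_fid.std(ddof=1):.2f}")

t, p = stats.ttest_rel(ours_fid, baseline_fid)   # paired across seeds
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```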
I recommend that you expand your Introduction/Related Work section by including some recent contributions that are closely aligned with your study. These works highlight how generative and deep learning approaches have been applied in cultural, creative, and content recognition domains, which resonates well with the objectives of your paper.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.