All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The paper may be accepted.
[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]
The manuscript is generally well-written and clear. The authors have substantially improved the reporting compared to the earlier version: the methodology is now more clearly described (e.g., the use of uniformly sampled frames for the 3D-CNN), the ethical considerations are explicitly discussed, and limitations regarding dialectal and gender bias in the dataset are acknowledged. Figures are of adequate quality, and references that were previously misformatted have been corrected.
The introduction and literature review position the work within the broader field of sentiment analysis, highlighting the existing gap in resources for Urdu. However, the claim of methodological novelty remains overstated. The paper primarily reuses established techniques (CNN + FastText, MFCC/prosodic features, 3D-CNN, fusion strategies) rather than introducing fundamentally new models. The main novelty lies in the dataset contribution rather than the modeling framework. This distinction should be made clearer.
The authors’ creation of the MMSA-Urdu dataset is a significant contribution, as it addresses the scarcity of multi-modal resources for Urdu.
The methodology is described with enough detail to be reproducible, and the revisions clarify ambiguities from the earlier version (e.g., video vs. frames). Evaluation has been strengthened with 5-fold cross-validation and additional performance metrics (precision, recall, F1), which were missing before. Ethical considerations regarding dataset collection are now explicitly addressed.
However, for baseline comparisons, it is not always clear whether competing models were retrained under the same conditions or whether results are drawn directly from prior publications. The authors restrict themselves to MFCC, pitch, loudness, and intensity. While justified as standard practice, the lack of exploration of alternative audio features is a limitation. Finally, although dataset biases are acknowledged, no steps are taken to mitigate them, either in modeling or evaluation.
The inclusion of multiple metrics (precision, recall, F1) alongside accuracy is a welcome improvement, and the reported results (91.18% with tri-modal fusion) are strong. The use of cross-validation further supports the robustness of the findings. The authors also attempt to demonstrate generalizability by retraining their model on the CMU-MOSI dataset, achieving competitive results. This is a good step toward validating the framework across languages.
However, the claim of generalizability is still limited to one benchmark dataset; no tests are conducted across Urdu dialects or out-of-domain data. While comparisons with prior work show performance gains, the lack of clarity on whether baselines were re-implemented under the same conditions leaves some doubt about the validity of superiority claims.
Overall, this revision represents a significant improvement over the earlier version. The reporting is clearer, the evaluation more comprehensive, and the error analysis more insightful. The MMSA-Urdu dataset is a valuable contribution to the community and provides a foundation for future research in low-resource, multi-modal sentiment analysis.
Only my point of view: I remain concerned about the limited methodological novelty and the uncertainty surrounding the baseline comparisons. The authors should present the work more as a dataset plus a baseline framework paper rather than as a substantial methodological advance. Framed this way, the paper can still make a meaningful contribution to the field by providing an open-access Urdu multi-modal dataset along with a strong baseline evaluation.
Incorporate the suggestions of the reviewers.
The paper is drafted well and all aspects are covered including 'Error Analysis' and 'Evaluation on Standard Benchmark Dataset'.
All the details are provided.
-No comment-
The paper is well drafted and all the aspects are covered.
Minor typographical errors (e.g., "classiocation" for "classification", "onnal" for "final") recur throughout. A professional proofread is recommended.
Acronyms like MFCC, CNN, 3D-CNN, and ReLU are used without initial expansion or explanation in some parts. Definitions should be clearly introduced at first mention.
The review could be improved by deeper critical discussion of limitations in prior Urdu-only sentiment work, such as absence of large multimodal corpora or weak cross-lingual transferability.
A supplementary table showing class-wise performance for each modality would clarify where the model struggles (e.g., with "neutral" predictions).
Figure quality is good but could benefit from consistent formatting across all visuals.
Although ethical considerations are discussed, an IRB or formal ethics committee mention (even if not applicable) would enhance the credibility.
A GitHub link or supplementary material containing the training code/scripts would further aid replication.
Although claims of generalizability are supported, inclusion of confidence intervals or significance testing (e.g., a paired t-test between fusion methods over cross-validation folds, as sketched after this list) would strengthen the analysis.
A short error analysis on inter-rater disagreements could enrich understanding of ambiguity in labels.
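For reference, a minimal sketch of the suggested test, assuming hypothetical fold-wise accuracies for two fusion strategies (the values below are illustrative, not the paper's results):

```python
# Paired t-test over cross-validation folds (illustrative values, not the paper's results).
from scipy.stats import ttest_rel

# Hypothetical fold-wise accuracies for two fusion strategies (5-fold CV).
acc_bimodal  = [0.86, 0.88, 0.87, 0.85, 0.89]
acc_trimodal = [0.90, 0.92, 0.91, 0.89, 0.93]

t_stat, p_value = ttest_rel(acc_trimodal, acc_bimodal)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 would support a significant fusion gain
```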
This manuscript makes a substantial and timely contribution to multilingual NLP by advancing multimodal sentiment analysis in a low-resource context. The integration of audio, video, and text is well-justified and executed with care. Results are compelling (91.18% accuracy with fusion), and the validation on CMU-MOSI adds credibility.
Include clear subheadings in the Methods section for each modality's processing pipeline.
Specify exact file formats and sizes for the shared dataset.
Add transition sentences between major sections for better flow.
Include version numbers for key libraries (Librosa, FastText).
Specify the hardware configuration used for timing benchmarks.
Clarify the rationale for choosing 10 uniformly sampled frames in visual feature extraction. Why not dynamic sampling? (Uniform sampling is sketched after this list for reference.)
Include error bars/confidence intervals in Table 4 to highlight statistical significance of fusion gains.
Address dataset bias (dialect/gender imbalance) more explicitly in limitations.
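A minimal sketch of uniform frame sampling with OpenCV, assuming a hypothetical clip path; the count of 10 frames follows the paper, but the helper itself is illustrative rather than the authors' implementation:

```python
# Uniformly sample a fixed number of frames from a video (illustrative sketch).
import cv2
import numpy as np

def sample_uniform_frames(video_path, num_frames=10):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # list of BGR frames, e.g., as input to a 3D-CNN

frames = sample_uniform_frames("example_clip.mp4")  # hypothetical file
```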
The reviewers consider that the paper has potential but needs important improvements in order to be suitable for publication.
Special focus is required on the analysis, since the proposed model itself is not novel.
**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff
1. The authors state that they used a benchmark dataset (CMU-MOSI) for evaluation, but provide no details on how it was used. It is unclear whether this dataset was preprocessed in the same way as their Urdu dataset, whether it was used for training, or whether it was simply used for comparison. If the CMU-MOSI dataset was originally designed for a multi-class sentiment analysis task, was it reduced to two classes for consistency with this study’s classification approach? If so, was the same methodology applied to all models compared? (One common binarization scheme is sketched after this list for reference.)
2. The methodology is unclear about whether the model processes entire videos or extracts images from them. At various points (e.g., Figure 1, Figure 3), the paper refers to "videos," but then also discusses "images" in the context of CNN processing. It is never explicitly stated whether frames were extracted from videos at a fixed interval, whether still images were used separately, or whether the entire video sequences were processed.
3. References 18 and 19 should be corrected. The figures are blurry because they appear to have been enlarged.
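For context, CMU-MOSI provides continuous sentiment scores in roughly [-3, +3]; one common way to reduce them to two classes is to threshold at zero. The sketch below illustrates that scheme only; it is not confirmed to be the authors' procedure:

```python
# Binarize continuous CMU-MOSI sentiment scores (range roughly [-3, +3]) at zero.
def binarize_mosi(score, drop_neutral=True):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    # Exactly-zero (neutral) segments are often discarded in binary setups.
    return None if drop_neutral else "negative"

print(binarize_mosi(1.8))   # positive
print(binarize_mosi(-0.6))  # negative
print(binarize_mosi(0.0))   # None (dropped)
```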
1. No mention of ethical considerations regarding the dataset. Did the authors obtain consent to use these videos? Many platforms, including YouTube and Facebook, have strict terms of service regarding dataset collection for machine learning. Did the dataset contain personally identifiable content? If so, were privacy safeguards implemented?
2. There is no mention of bias mitigation. If most videos came from Pakistani sources, would the model generalize to Urdu speakers from India or other regions?
3. The feature selection process for audio is not justified enough. Why limit audio features to MFCCs, intensity, pitch, and loudness? Were other audio features tested? (A feature-extraction sketch follows this list for reference.)
4. The paper claims that the proposed model outperforms previous approaches but fails to address key methodological questions: Were all competing models trained under the same conditions? Did the authors re-train previous models on their dataset, or are they reporting results from prior papers? Did all models use the same two-class sentiment classification approach, or were some evaluated using multi-class sentiment analysis?
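As a point of reference, a minimal Librosa sketch of the features the reviewer names; the file path and parameter values are assumptions, and RMS energy stands in for loudness/intensity:

```python
# Extract MFCCs, pitch, and an energy-based loudness proxy with Librosa (illustrative).
import librosa
import numpy as np

y, sr = librosa.load("example_utterance.wav", sr=None)  # hypothetical file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
rms = librosa.feature.rms(y=y)                      # frame-wise energy (loudness proxy)
f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)       # frame-wise pitch estimate (Hz)

# A simple utterance-level feature vector: means over time.
features = np.concatenate([mfcc.mean(axis=1), [rms.mean()], [np.nanmean(f0)]])
print(features.shape)  # (15,)
```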
1. The authors claim their model is generalizable, yet they provide no cross-validation results, independent test set evaluations, or robustness testing. Generalization requires evaluation across different conditions, such as different Urdu dialects, different sources of video/audio, noisy vs. high-quality recordings, and different demographics (age, gender, education level).
2. The error analysis section appears generic: the authors list broad error categories (e.g., "misclassification between positive and negative sentiments"), but they do not provide evidence that these specific issues occurred. How did they determine that these were the exact reasons for misclassifications? The discussion is entirely theoretical rather than based on actual error case studies. Were misclassifications manually analyzed, or were they inferred from model confidence scores?
3. Accuracy is the only metric used, which is problematic for imbalanced datasets (even if the authors stated that their datasets were balanced). What about precision, recall, and F1-score? These are standard in sentiment analysis.
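A minimal sketch of reporting these metrics with scikit-learn, including a per-class breakdown; the labels and predictions below are placeholders, not the paper's data:

```python
# Precision, recall, and F1 alongside accuracy, with a per-class breakdown (placeholder data).
from sklearn.metrics import accuracy_score, classification_report

y_true = ["positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1
```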
1- The most important issue in literature references is the sequence of the reference citations in the text. Reference numbers should be consecutive, according to their order of appearance in text. In line 37, for example, the reference number should be (1) instead of (21). Also, in line 214, the reference was not numbered.
2- There is a spelling mistake in line 10: "Studeis" instead of "studies."
3- In line 119, it would be more appropriate to use the current name of the social media platform "Twitter," which is "X."
4- For Table 2, it would be better to place it on the same page where it is mentioned (the following page, line 162).
5- The last paragraph in the results section (lines 369 and 370) is only a repetition of the section's introduction. It would be more informative to summarize the main findings of the experiment.
-
The study gives a clear picture of the added value of multi-modal sentiment analysis for videos in the Urdu language. The only shortcoming is the lack of examples of real-world scenarios where it could be applied (line 385).
In the "Error Analysis" section (line 313), the limits of audio feature extraction should also be analysed since the cultural background might influence the audio signals' interpretation.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.