Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on April 24th, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on August 12th, 2025.
  • The first revision was submitted on September 9th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on October 21st, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • The article was Accepted by the Academic Editor on December 5th, 2025.

Version 0.3 (accepted)

Academic Editor

Accept

Thanks to the authors for their efforts to improve the work. This version addresses the reviewers' concerns and can now be accepted. Congratulations!

[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]

Reviewer 1

Basic reporting

The basic reporting of the paper has improved from the last version. I can see that the author has made modifications, correcting mistakes and improving the writing of the paper. Thank you.

Experimental design

The comments about the experimental design have been addressed by the author. Clarifications about the type of inputs have been included. The last paragraphs of Section 4.3 have been rewritten to improve understanding. Thank you very much.

Validity of the findings

There are no further comments about the validity of the findings, as they had already been addressed by the author in previous revisions. Thank you.

Version 0.2

Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** When preparing your next revision, please ensure that your manuscript is reviewed either by a colleague who is proficient in English and familiar with the subject matter, or by a professional editing service. PeerJ offers language editing services; if you are interested, you may contact us at [email protected] for pricing details. Kindly include your manuscript number and title in your inquiry. – PeerJ Staff

Reviewer 1

Basic reporting

The quality of the paper has significantly improved thanks to the author's work and modifications, and several interesting aspects of the work are now properly highlighted.

Regarding basic reporting, I would suggest that the author revise the complete manuscript to check for grammar and spelling errors before publication (e.g., the sentence “we conducted a comparative analysis assess the efficacy” should be changed to “we conducted a comparative analysis **to** assess the efficacy…”).

Experimental design

I would recommend explicitly mentioning which type of inputs are used with each model. I am assuming that the complete text is used directly as input for BERT’s AutoTokenizer, and TF-IDF features, which are described in Section 3.4, are used as input for the machine learning models, but this should be clarified.
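To make the reviewer's assumption concrete, the split the comment describes could be sketched as follows. This is a minimal illustration, not the author's pipeline: the `tfidf_features` helper is a toy re-implementation of TF-IDF weighting (in practice one would use a library vectorizer), and the raw strings stand in for the inputs that BERT's AutoTokenizer would receive unchanged.

```python
import math
from collections import Counter

def tfidf_features(docs):
    """Toy TF-IDF: map each document to a dict of term -> tf * idf.

    Classical models (SVM, MLP, CNN over features) would consume vectors
    like these, while BERT's AutoTokenizer would receive the raw text.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    features = []
    for toks in tokenized:
        tf = Counter(toks)
        features.append({
            # Smoothed idf so terms present in every document score zero.
            term: (count / len(toks)) * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        })
    return features

docs = ["explicit comment text", "ordinary comment text"]
vecs = tfidf_features(docs)
```

Here “explicit” occurs in only one document and so receives a positive weight, while the shared terms score zero; making this input routing explicit in Section 3.4 would resolve the ambiguity.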

I would also advise revising the last two paragraphs of Section 4.3. When the author mentions “raising the model's accuracy from 93% to 93.5% resulted in [...]”, it is unclear what the 93 and 93.5% values refer to. Is the author referring to the results from Table 7 or to an improvement of the BERT model trained for Amharic?

Validity of the findings

The comments regarding the validity of the findings have been addressed by the author. Thank you.

Version 0.1 (original submission)

Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1

Basic reporting

The paper is clearly written and organized. The dataset and code are attached to the submission.

The author presents a proposal for sexually explicit content detection in the Amharic language. This proposal is based on using a BERT model that has been previously trained on Amharic hate speech detection. Since Amharic is an under-resourced language, the number of samples available for the detection of sexually explicit content is not sufficient to train a BERT model from scratch, so the author fine-tunes the Amharic BERT model. LIME is used as an explainable AI framework. The proposed fine-tuned BERT model is compared with CNN, SVM, and MLP.

The author also presents a dataset in Amharic, with 34,710 comments and posts labeled as explicit or non-explicit.

I would suggest trying to improve the following aspects regarding the basic reporting of the paper:

1. The readability of Figure 2 and Figure 3 could be improved; black letters on a dark background are difficult to read. If the tool allows it, it would be advisable to set a color palette that improves readability. It is very interesting to include these figures to see which words are contributing to the prediction.
Also, I would recommend not changing the proportions of the image and maintaining the same ones as in the original.
In general, I would suggest improving the quality of the images when possible.

2. Line 110 and line 139: It would be interesting for the reader to know which language was used in each work, since it is something relevant in this study.

3. Line 119: SVM should be defined as Support Vector Machine (SVM) before its first appearance. The same for CNN in line 120 (CNN is defined in line 127; it should be defined in line 120, when it first appears).

Also, in line 283, the acronyms corresponding to these terms (Convolutional Neural Networks, Support Vector Machines, and Multi-Layer Perceptron) have been used before. It would be advisable to define them the first time they are used and refer to them using the acronyms after that.

Experimental design

The author adequately describes the difficulties found when working with the Amharic language: lack of sufficient data (line 53), Amharic characters sharing the same sound and meaning but without clear rules for their usage (line 199), and Amharic being an agglutinative language (line 206). The author explains the preprocessing techniques that were used to deal with these issues.

The author highlights the difficulties of working with low-resource languages; therefore, the advances presented in this work are relevant to improving sexually explicit content detection in Amharic.

The preprocessing operations and methods are adequately described for other researchers to reproduce the work. In addition, the dataset and code have been submitted by the author.

I would suggest taking a look at the following minor issues:

1. It would be useful to have more information about the CNN, SVM, and MLP models used in the experimentation. For how many epochs were they trained? How many layers does the MLP have? Perhaps some information similar to the one in Table 2 would be helpful for reproducibility purposes. The models are described in Section 3.5.2, but no layer configuration or hyperparameters are mentioned. It is not clear whether these models were created by the author or if code made available by other researchers was used.

2. Line 208: The author mentions using a rule-based word segmentation. Perhaps some more detail could be included. Does the author use a word dictionary for this purpose?

Validity of the findings

The findings of LIME indicate how words related to body parts or desire show a high probability of belonging to a sexually explicit comment. The proposed solution obtains promising results, outperforming CNN, SVM, and MLP models.

I would like to make the following suggestions:

1. In Section 4.3: In addition to the comparison with other models (CNN, SVM, and MLP), I would find it interesting to see how BERT trained with Amharic performs in comparison to BERT trained only with English texts. Has the author considered this option?

2. Line 423: If the author has looked at the incorrect predictions, do they have some characteristics that make them more difficult for the classifier? I would find it interesting to see if all samples in each class are homogeneous or if there are some variations that pose a significant challenge for the classifier.

3. Line 392: Has the author trained for more than 4 epochs, and does validation performance decrease after the fourth epoch?

Additional comments

I would suggest taking a look at the following minor issues before acceptance:

1. Line 136: There are parentheses without text inside. I understand the author meant to write KNN, but the text is missing.

2. The writing in lines 355 and 356 could be improved. Perhaps there is a missing comma (or a stop): “Subjective assessments are insufficient. Instead, quantitative evaluation metrics must be used to objectively measure [...]”

Reviewer 2

Basic reporting

- The manuscript presents a study on Amharic explicit content detection using a fine-tuned, pre-trained BERT model. The authors should clarify that the model is re-trained for the task and provide a reference to transfer learning to give proper context, particularly in Contribution (2), lines 88–90. Phrases such as “development of a deep…” should be revised to reflect the use of an existing architecture.

- Furthermore, the paper cites three of the authors' own publications, with two ([11] and [14]) having limited citation records. These references are used to support critical claims regarding the lack of large language models (LLMs), the lack of data, normalization methods, and linguistic challenges. It is recommended that the authors incorporate broader and more widely accepted literature to strengthen their argument.

- The paper currently lacks reference(s) to the linguistic diversity and scale of Amharic, notably the fact that it is spoken by over 40 million people.

- The paper would benefit if code-switching were addressed, as it is particularly common among younger generations. This phenomenon introduces real-world complexity to online content classification. The example “Tenaystilin betam busy neber eza week” (roughly, “Hello, I was very busy that week”, mixing Amharic and English) highlights this challenge effectively.

Experimental design

The claim that integrating deep learning with XAI can enhance detection accuracy (lines 155–156) is misleading. XAI tools such as LIME are typically applied after model training for interpretability—not performance enhancement—unless a feedback mechanism is introduced, which is not evident in this study.

Similarly, in Figure 1: typically, XAI methods are employed to evaluate and interpret models using test and real-world data; however, from the flowchart, it can be inferred that XAI is used prior to testing and appears to enhance the model detection accuracy.
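The reviewer's point about ordering can be illustrated with a rough, perturbation-based attribution in the spirit of LIME (not the LIME library itself, and not the author's setup): the classifier here is a hypothetical keyword scorer standing in for a trained model, and the explainer only queries it, never modifies it.

```python
def toy_classifier(text):
    """Stand-in for an already-trained model: returns P(explicit).
    (Hypothetical keyword scorer, used only for illustration.)"""
    keywords = {"desire": 0.4, "body": 0.3}
    score = sum(w for t, w in keywords.items() if t in text.lower().split())
    return min(0.5 + score, 1.0)

def word_importance(text, predict):
    """LIME-style post-hoc attribution: perturb the input by dropping
    one word at a time and measure the change in the model's output.
    The model itself is never updated -- explanation follows training."""
    words = text.split()
    base = predict(text)
    return {
        w: base - predict(" ".join(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    }

imp = word_importance("a comment about desire", toy_classifier)
```

Because the explainer is read-only with respect to the model, it cannot by itself raise detection accuracy, which is exactly why the flowchart's placement of XAI before testing needs clarification.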

The manuscript would benefit from clarifying the scope of the study. Although societal challenges, such as the harassment of women and children (lines 42–44), and issues related to the high volume of online content are mentioned, these are not directly addressed in the paper and should not be implied as part of the research.

Additionally, the choice of LIME over other tools like SHAP should be justified, especially considering trade-offs in explanatory granularity and robustness.

The normalization and tokenization process must be elaborated with clear examples, particularly for non-Amharic speakers (authors and reviewers). Amharic’s complex morphological structure, cultural variation, and orthographic patterns demand more detailed explanation.

Additionally, the study does not address whether any regulatory or ethical considerations were taken into account when scraping data from social media, particularly regarding privacy and consent. A discussion of procedures or compliance with data protection standards is essential.

The choice of hyperparameters is not well supported—were these manually tuned or selected using automated methods? Please address this.

Validity of the findings

While the model achieves 93–94% test accuracy, the training accuracy of 98% suggests the possibility of overfitting. The authors should explicitly discuss this gap and provide metrics like validation loss curves or regularisation techniques used.

There is limited insight into misclassified instances, particularly the 198 non-explicit and 226 explicit comments indicated by the confusion matrix. Understanding these misclassified data would strengthen the validity of the findings.

Further analysis of features (e.g., keywords within the dataset) could enhance model robustness. The manuscript should also explore how clear-cut vs. borderline classifications behave and whether XAI can reveal patterns in misclassification. One or two examples could be included through visualisation for clarity.

Finally, the model's performance should not be generalized as "superior" based solely on one or two evaluation metrics. Metrics like precision, recall, and F1-score should be discussed across classes. While the proposed BERT-based model outperforms a CNN baseline on some metrics, it does not consistently do so across the board.

For context, a 0.5–1% accuracy increase from 93% on a test set of 6,942 samples translates to roughly 35–69 fewer misclassifications for an 80-20 split (fewer still for a 90-10 split), which, while statistically modest, may have ethical or societal relevance in sensitive applications such as explicit content detection. However, this should be tested and explained for unseen data, especially for data samples with code-switching.
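The arithmetic behind this back-of-the-envelope estimate can be checked directly (the dataset size is taken from the review above; the split percentages are the reviewer's assumption):

```python
# How many misclassifications does a small accuracy gain avoid?
total = 34_710                      # dataset size reported in the paper
test_80_20 = round(total * 0.20)    # 6,942 test samples for an 80-20 split
test_90_10 = round(total * 0.10)    # 3,471 test samples for a 90-10 split

def fewer_errors(test_size, delta):
    """Misclassifications avoided by an accuracy gain of `delta`."""
    return round(test_size * delta)

gains_80_20 = [fewer_errors(test_80_20, d) for d in (0.005, 0.01)]
gains_90_10 = [fewer_errors(test_90_10, d) for d in (0.005, 0.01)]
```

For the 80-20 split this gives about 35–69 avoided errors per 0.5–1% gain, and roughly half that for a 90-10 split, consistent with the reviewer's point that the effect is small in absolute terms.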

- Explicit content detection is typically a highly imbalanced classification problem; however, the proposed model was trained on balanced class data. This may impact the model’s robustness and generalizability when applied to real-world data.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.