Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

View examples of open peer review.

Summary

  • The initial submission of this article was received on June 13th, 2025 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on July 29th, 2025.
  • The first revision was submitted on August 26th, 2025 and was reviewed by 3 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on October 9th, 2025.

Version 0.2 (accepted)

Academic Editor

Accept

Dear authors, we are pleased to confirm that you have addressed the reviewers' valuable feedback and improved your research.

Thank you for considering PeerJ Computer Science and submitting your work.

Kind regards
PCoelho

[# PeerJ Staff Note - this decision was reviewed and approved by Maurice ter Beek, a PeerJ Section Editor covering this Section #]

Reviewer 1 ·

Basic reporting

-

Experimental design

-

Validity of the findings

-

Additional comments

All my concerns are addressed.

Reviewer 2 ·

Basic reporting

-

Experimental design

-

Validity of the findings

-

Additional comments

The present work is generally well described and well written. I am satisfied with the authors' responses.

Reviewer 3 ·

Basic reporting

The comments of the reviewers have been addressed adequately.

Experimental design

-

Validity of the findings

-

Version 0.1 (original submission)

Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** PeerJ staff have identified that the English language needs to be improved. When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff

Reviewer 1 ·

Basic reporting

-

Experimental design

If I understand correctly, all benchmarks in the paper use a fixed action space of N = 4, so we do not get direct experimental data on how resource usage scales with increasing N. Based on the architecture, however, the Q-matrix size grows linearly with the number of actions, and resource and power consumption can be expected to scale accordingly. While the design includes optimizations, BRAM usage will likely become a bottleneck as N increases, especially in more complex environments with larger action spaces. To strengthen the evaluation, it would be helpful to report resource and power results for cases where N > 4.
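The linear-scaling concern above can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the paper under review: the 18-kilobit BRAM block size and the per-entry word widths are illustrative assumptions, and real packing depends on the device's BRAM aspect ratios.

```python
# Illustrative sketch (assumed numbers, not from the reviewed paper):
# estimating Q-table storage as a function of the action-space size N,
# assuming one fixed-point word per (state, action) entry and an
# 18-kilobit BRAM block.

def q_table_bits(num_states: int, num_actions: int, word_bits: int) -> int:
    """Total bits needed to store the full Q-matrix."""
    return num_states * num_actions * word_bits

def brams_needed(num_states: int, num_actions: int, word_bits: int,
                 bram_bits: int = 18 * 1024) -> int:
    """Lower-bound BRAM count, ignoring aspect-ratio packing constraints."""
    total = q_table_bits(num_states, num_actions, word_bits)
    return -(-total // bram_bits)  # ceiling division

# Storage grows linearly in N: doubling the action space doubles the bits.
print(brams_needed(1024, 4, 32))   # baseline, N = 4
print(brams_needed(1024, 8, 32))   # N = 8 roughly doubles BRAM demand
```

Under these assumptions, moving from N = 4 to N = 8 at 1024 states and 32-bit entries roughly doubles the BRAM footprint, which is the bottleneck the comment anticipates.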

Validity of the findings

-

Additional comments

Although the journal does not make decisions on novelty, the novelty of the paper is somewhat limited. First, implementing Q-learning on FPGAs is not new; see [15] and [21]. Moreover, vanilla Q-learning is rarely used in real-world applications. There is a growing trend of FPGA research targeting more advanced RL algorithms such as DDPG, PPO, and DQN. It would help to position this work within that trend and to explain the motivation for focusing on tabular Q-learning.

In addition, it seems to me that the major innovation lies in reducing BRAM write operations by comparing new Q-values against temporary memory. While this gives efficiency benefits, the idea is a practical design optimization rather than a new architectural or algorithmic improvement.

Also, the “Comparison of the State of the Art” section is quite limited. The main comparison is with a single paper, [21], and although some other works are briefly mentioned, they are not discussed in depth. The authors focus mostly on hardware metrics like LUTs, FFs, BRAMs, and throughput, but there’s little explanation of why their design performs better, or what trade-offs are involved. Also, since the compared work [21] uses a different FPGA (Xilinx Ultra96 V2 FPGA, if I understand correctly), it’s unclear how fair the comparisons are. The section would benefit from broader benchmarks, normalized comparisons, and a more detailed discussion of design choices and limitations.

Finally, the proposed design is tied to a single FPGA model, the Kintex-7 XC7K325T. Although the paper includes some synthesis results on a Virtex UltraScale+ FPGA, it does not cover whether the design would work well on other FPGA types, families, or vendors. It is not clear how much the performance and resource savings rely on device-specific features such as BRAM layout or DSP configuration.

Reviewer 2 ·

Basic reporting

This paper reduces the BRAM resources required to compute updated Q values in Q learning. Q values in the current state are stored in temporary memory to improve throughput. The article is professionally structured, and the circuit architecture is effectively shown in block diagrams. The results of the implementation are also presented in detail and are appropriate for an academic paper.

Experimental design

In this paper, the FPGA design environment and FPGA board specifications are clearly described.
The information required for the experiments is presented and is reproducible.

Implementation results are presented for two fixed-point parameters, each with 8 patterns of number of states, which strongly support the conclusions of this paper.

In Table 3, the proposed circuit resources differ slightly from the results in Tables 1 and 2. However, this is unavoidable due to the different FPGA boards used.

Validity of the findings

This paper also compares throughput with state-of-the-art technologies. Comparisons have been made using a common metric, MSPS. Sufficient data is provided to show that the proposal is effective in speeding up the process.

If the MSPS value is higher than the clock frequency, it means that the Q-value is being updated multiple times per clock cycle. This is presumably due to parallel computation, but it would be better to specify how many updates are performed in parallel.
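The implied degree of parallelism follows directly from the two reported numbers. The figures below are illustrative placeholders, not values from the reviewed paper:

```python
# Back-of-the-envelope check (illustrative numbers, not from the paper):
# when throughput in MSPS exceeds the clock frequency in MHz, their
# ratio gives the number of Q-value updates completed per clock cycle.

def updates_per_cycle(throughput_msps: float, clock_mhz: float) -> float:
    """Parallelism factor implied by throughput vs. clock frequency."""
    return throughput_msps / clock_mhz

print(updates_per_cycle(800.0, 200.0))  # 800 MSPS at 200 MHz -> 4-way parallel
```

Reporting this factor explicitly (and the clock frequencies, as suggested below for Tables 3 and 4) would let readers reconstruct the throughput numbers.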

In addition, it would be more helpful if the clock frequencies were also shown in Tables 3 and 4.

Even in complex environments (e.g., 32 bits and 1024 states), it seems that an increase in operating frequency would further improve throughput.

Additional comments

• I think it would be better to unify the notation of the LUT utilization rates for the 16- and 32-state cases in Table 1, since it differs from the others.

• Throughput is defined in MSPS, but the unit changes to frequency in lines 228 and 233 (407.5 MHz, 19.67 MHz).

• In ref. 23 of Table 3, the throughput values of the proposed method appear to be reversed for 16 bits and 32 bits (based on the results in Tables 1 and 2). Also, in line 238, it would be more helpful to indicate 16 bits and 32 bits.

• Only Fig. 6 is a color graph. Unless there is a special reason, it is recommended to unify its format with Figs. 5 and 7 to avoid misunderstanding.

Reviewer 3 ·

Basic reporting

This paper presents a resource-efficient and low-power Q-learning algorithm implementation on FPGAs by using a temporary memory to optimize updating Q-values during learning.

The paper compares the results with other implementations presented in the literature in terms of area, performance, and energy efficiency.

However, the comparison with other implementations should be made on equal terms: the other implementations are developed on SoC platforms, while the proposed scheme is standalone.

In the performance evaluation of the speedup, the communication overhead with the processors should also be taken into account, not only the computation speed. FPGAs provide great computational performance, but if the communication overhead is ignored, the overall system performance can be worse than that of the original implementation. Also, the energy consumption measurements do not include the energy consumed by the communication bus.

Also, the paper lacks a clear motivation. The proposed scheme and its performance should be compared not only with other FPGA implementations but also with other platforms; for example, the paper should report how much time and energy this algorithm takes when implemented on a CPU or GPU.

Experimental design

As noted before, in the performance evaluation of the speedup, the communication overhead with the processors should also be taken into account, not only the computation speed. FPGAs provide great computational performance, but if the communication overhead is ignored, the overall system performance can be worse than that of the original implementation. Also, the energy consumption measurements do not include the energy consumed by the communication bus.

Validity of the findings

--

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.