All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Thank you for considering the suggested topics and revisions; it appears that all matters were considered and addressed. The revised manuscript reads well and looks ready to move forward. It may be beneficial to develop a Jupyter notebook and place it on your GitHub site alongside your TE processing, to guide those interested in following this process. Congratulations! I will accept the revisions and recommend acceptance of the manuscript.
[# PeerJ Staff Note - this decision was reviewed and approved by Paula Soares, a PeerJ Section Editor covering this Section #]
In general, the reviews were positive and the novelty of the analyses was appreciated. However, some areas appeared to require additional explanation, more comparison, and possibly improved test methodology. I recommend paying close attention to the reviewers' suggestions, as they appear constructive for improving the impact that the manuscript may have. I will return the manuscript as requiring major revisions at this point. I expect that improved explanation and presentation of the justifications will be well received in the revision. We look forward to your updates. Thank you for the submission.
This paper describes ML methods to detect and classify LTR retrotransposons in plant genomes. The authors used several ML models, such as SVC. They used k-mers with k = 1 to 6 and precisely distinguished LTR sequences from non-LTR sequences. Their method is conceptually different from the widely used approaches, which rely on sequence alignment to detect and classify transposons (e.g. RepeatMasker); thus this study is an important advance in this field.
Their analyses are mostly acceptable, with a few serious issues in the analysis design as detailed below.
A)	Line 119. It is unclear why they used only negative instances longer than 6 kb. The authors should describe the rationale. Some LTR transposons, particularly partial LTR-RTs and solo-LTR elements, would be shorter than 6 kb. This difference may influence the classification by ML. The authors should show the distribution of the lengths of the sequences used as training data.
B)	Authors should describe how they calculate k-mer occurrence. In line 230, they show the 10 most important features. These contain inverse (i.e. reverse complement) k-mers (e.g. AAAAAA and TTTTTT). This leads me to think that, to calculate k-mers, they used only the forward or only the reverse strand of the sequences in their dataset. In genomic analyses, it is more common to consider inverse pairs of k-mers together, because transposons can be inserted in both forward and reverse orientations. In such cases, two complementary k-mers should be counted as one k-mer. If they did not consider this property, it may introduce a bias towards one strand in the machine learning process. The authors should further clarify how they calculated k-mer occurrence.
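The strand-independent counting described in point B can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the canonical form is assumed to be the lexicographically smaller of a k-mer and its reverse complement, and windows containing characters other than A, C, G, T are assumed to be skipped.

```python
from collections import Counter

# Complement table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(seq: str, k: int) -> Counter:
    """Count k-mers so that a k-mer and its reverse complement share one key.

    Assumption for this sketch: windows with non-ACGT characters are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts
```

With this convention, AAAAAA and TTTTTT contribute to the same feature, and a sequence yields the same counts as its reverse complement, so the insertion orientation of an element does not change its feature vector.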
Throughout the manuscript, they used the term “detection” of LTR-RTs. However, what they are doing is binary classification of sequences into LTR and non-LTR classes. Their current approach cannot determine the genomic regions or nucleotide-level locations of LTR transposons in assembled plant genomes, thus “detection” is not an appropriate term to describe it. In a practical setting, detection would mean annotation of LTRs from assembled genome sequences, as described in lines 252 and 294. In this context, what would justify the term “detection” is annotation of LTRs from long assembled plant genomes, not from a set of short sequence instances (i.e. CDS and TEs). The authors should at least rephrase the term “detection” or implement a pipeline that allows them to detect/annotate LTR-RTs from assembled genomes. Related to this, the authors should also discuss how the k-mer-based ML model could be applied to LTR detection/annotation in assembled plant genomes in the future.
Minor points:
Throughout the manuscript, “k” of k-mer should be italic.
Line 249. “The most frequent repeated sequences in plant genomes are LTR retrotransposons.” This statement requires references.
Lines 260-262.  Add references.
Fig. 7. What are the columns and rows of the confusion matrix? Label them.
Figures and tables lack titles.
no comment
no comment
This study compared the performance of several common ML models for detecting and classifying LTRs in plant genomes. It also conducted feature selection to choose a subset of k-mer-based features, and models trained on this subset showed little difference from models trained on all features. However, I recommend rejecting the current version, since there are some major problems that should be fixed before further evaluation is considered.
Major issue
1. This study compared different ML methods and reported their performance, but it does not compare them with other tools that already address LTR classification well.
2. It looks like no independent testing dataset was set aside. Independent testing is important since it validates the models after hyper-parameter tuning.
3. The negative datasets may miss some important TE families and sequences, such as Helitrons, as well as intergenic sequences that exclude all TEs, CDS, and RNAs.
4. This study uses only the F1-score to compare all ML models. Even though the F1-score may address the data imbalance issue, additional scores such as auPRC and auROC need to be added.
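The two additional scores requested in point 4 can be illustrated with a short, dependency-free sketch. The function names are hypothetical, labels are assumed to be 0/1 integers, auROC is computed from its pairwise definition (the probability that a positive outranks a negative, with ties counted as one half), and auPRC is approximated by step-wise average precision. In practice a library implementation would normally be used instead.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (O(n^2), fine for small sets)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Sums precision weighted by the recall gained at each distinct score threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every example sharing the current score as one threshold step.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        recall = tp / total_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap
```

Unlike the F1-score, both quantities are computed from the ranking of the scores rather than from a single decision threshold, which is why they complement it when classes are imbalanced.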
Minor issue
Line 27 change ‘:’ to ‘;’
Line 116: what is the composition of the 10,000 genomic feature sequences?
Line 119: why keep sequences longer than 6 kb rather than 5 kb? The authors may consider checking the average length of LTRs.
Lines 134 to 137: when applying k-fold cross-validation, have the authors set aside an independent testing dataset to validate the models?
Line 140: the algorithms used need to be clarified.
Line 221: I wonder whether loosening the threshold to 25 or 20 would increase performance compared with the 95% F1-score obtained with 30.
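The independent testing raised for lines 134 to 137 could look like the sketch below: a stratified hold-out set is removed first, and cross-validation (including any hyper-parameter tuning) then runs only on the remaining development indices. The function names, the 20% test fraction, and the seeds are assumptions made for illustration; nothing here reflects the authors' actual pipeline.

```python
import random


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Return (dev_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    dev, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example per class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        dev.extend(indices[n_test:])
    return sorted(dev), sorted(test)


def k_folds(indices, k=5, seed=0):
    """Yield (train, validation) index lists that partition `indices` into k folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation
```

The held-out indices are scored once, after model selection is finished, so the reported figure is not influenced by the tuning loop.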
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.