Python code smells detection using conventional machine learning models

PeerJ Computer Science


Introduction

  • RQ1: How can we construct a labeled Python code smell dataset suitable for supervised learning and validate its quality?

  • RQ2: What is the detection performance of machine learning models on Python code smell datasets?

Python Code Smell Dataset

  • Select code smells at different granularities (class level and method level).

  • Select the most investigated code smells in Java, since the existing literature mainly focuses on Java code smell datasets.

  • Large Class: a class-level code smell that refers to a class that has grown excessively large and contains many lines of code.

  • Long Method: a method-level code smell that refers to a method that is implemented with many lines of code and is therefore hard to understand.
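Both smells are defined in terms of size, so a line count per class and per method is a natural first metric. The following is a minimal sketch, not the study's pipeline: it uses Python's standard `ast` module to count the source lines spanned by each class and function. The names `size_metrics` and `flag_candidates` and both threshold values are hypothetical and chosen only for illustration; the study takes its labels from the PySmell dataset rather than from fixed thresholds.

```python
import ast


def size_metrics(source):
    """Return one record per class and function found in `source`.

    Each record holds the kind ("class" or "function"), the name, and the
    number of source lines the definition spans (requires Python 3.8+ for
    `end_lineno`).
    """
    tree = ast.parse(source)
    rows = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue
        rows.append({
            "kind": kind,
            "name": node.name,
            "lines": node.end_lineno - node.lineno + 1,
        })
    return rows


def flag_candidates(rows, max_class_lines=200, max_function_lines=50):
    """Return the names of definitions longer than the given thresholds.

    The default thresholds are illustrative assumptions, not values
    taken from the article.
    """
    limits = {"class": max_class_lines, "function": max_function_lines}
    return sorted(r["name"] for r in rows if r["lines"] > limits[r["kind"]])


if __name__ == "__main__":
    example = (
        "class A:\n"
        "    def f(self):\n"
        "        x = 1\n"
        "        return x\n"
        "\n"
        "def g():\n"
        "    pass\n"
    )
    for row in size_metrics(example):
        print(row)
```

A line count alone is a coarse signal; a supervised approach such as the one studied here learns from several code metrics together with validated labels.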

Code sources selection

  • The code must be written in the Python programming language.

  • The code must be open source.

  • The code must be labeled (i.e., smelly or non-smelly) in the PySmell dataset.

Features extraction

Dataset labeling

Dataset validation

  • Using a verified tool for feature extraction to ensure the quality of the code metrics. We used the Radon tool, which is an accredited Python tool.

  • Using a validated and published Python code smell dataset to extract the labels. We used the PySmell dataset, which was validated by experts.

Dataset distribution

Empirical Study Design

Goal

Data pre-processing

Feature scaling

Feature selection

Baselines

Experiment setup

Hyperparameter optimization

Model validation

Detection performance measures

Accuracy

Matthews correlation coefficient (MCC)
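As a reminder of how the two measures are computed for a binary smelly/non-smelly classifier, the sketch below derives both from confusion-matrix counts using their standard definitions. The function names and the example counts are illustrative assumptions, not results from the study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, which is the usual
    convention for the undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Made-up counts: a classifier that labels every instance as non-smelly
    # on a set with 5 smelly and 95 non-smelly instances.
    counts = dict(tp=0, tn=95, fp=0, fn=5)
    print("accuracy:", accuracy(**counts))
    print("MCC:", mcc(**counts))
```

On imbalanced data the two measures can diverge: in the made-up example the classifier never detects a smell, yet its accuracy is high while its MCC is zero, which is why reporting MCC alongside accuracy is informative.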

Statistical test

Results and Discussion

Threats to Validity

Internal validity

External validity

Conclusion validity

Conclusion

Supplemental Information

Wilcoxon Statistical Tests

DOI: 10.7717/peerj-cs.1370/supp-2

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Rana Sandouka conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Hamoud Aljamaan conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data is available at Zenodo: Sandouka, Rana, & Aljamaan, Hamoud. (2023). Python Code Smell Datasets [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7512516.

Funding

This work was supported by the King Fahd University of Petroleum and Minerals (KFUPM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
