Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?

PeerJ Computer Science

Introduction

With the advent of digital transformation, machine learning (ML) code has become widely used across a broad range of disciplines. From biology to finance and engineering to art, practitioners are using ML to gain new insights, automate complex processes, and innovate. As both experts and novices engage with this complex area, the absence of standardized practices and the inherent complexity of machine learning often lead to low-quality code, such as undocumented code (Yang et al., 2021) and irreproducible code (Wang et al., 2020a). Furthermore, researchers from different disciplines write machine learning code that violates best practices and is continually copied and cloned (Koenzen, Ernst & Storey, 2020; Chattopadhyay et al., 2020; Yang et al., 2021; Pimentel et al., 2019). In addition, adversarial attacks pose a unique challenge by exploiting vulnerabilities in ML models through precise and deliberate manipulations of input data, potentially causing the models to make incorrect predictions or behave unpredictably (Goodfellow, Shlens & Szegedy, 2015; Ren et al., 2020). Several studies have proposed methods to improve the quality of ML code, such as generating documentation for data wrangling code (Yang et al., 2021), enhancing the reproducibility of Jupyter notebooks (Wang et al., 2020a), assessing the best practices of collaborative notebooks (Quaranta, Calefato & Lanubile, 2022), and building adversarial defenses (Xie et al., 2019). Low-quality ML code can lead to a cascade of issues, including increased maintenance costs, decreased system reliability, hindered innovation, and poor quality of the model’s predictions.

The accuracy of an ML model in production is considered a common quality metric (Nahar et al., 2022), providing a real-world measure of the model’s performance and its ability to make accurate predictions on new and unseen data under operational conditions. At times, an ML model exhibits satisfactory performance on test data but experiences a significant decline in effectiveness when deployed in a production environment. One potential factor contributing to this scenario is the incorporation (leakage) of test data into the training process, causing the model to overfit to the test set. This results in overly optimistic accuracy estimates (Yang et al., 2022) that may not generalize well to new, unseen data. This phenomenon is commonly referred to as data leakage (Burkov, 2020; Kaufman et al., 2012).

While data leakage represents an unintentional issue originating from poor data handling practices, it is useful to contrast this with adversarial attacks, which exploit ML models in a different way. Unlike data leakage, which inflates model performance during training and validation due to unintentional biases, adversarial attacks directly target the model during deployment by intentional manipulation of input data, potentially compromising its robustness and security (Goodfellow, Shlens & Szegedy, 2015). Although adversarial attacks and data leakage affect ML systems differently, both highlight the need for robust practices and detection/defense mechanisms to ensure the quality and reliability of ML models across their lifecycle.

A few studies have proposed methods to detect data leakage in ML code (Kaufman et al., 2012; Yang et al., 2022; Chorev et al., 2022). However, these studies employed manual detection and code analysis methods. Manual approaches are prone to human error and time-consuming. In addition, despite the effectiveness of code analysis approaches, they are laborious: each type of data leakage necessitates a tailored code analysis, requires specialized expertise, and may prove challenging (or impossible) to implement for more complex data leaks. Although ML techniques have considerably improved in recent years (e.g., Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) and the Generative Pre-training Transformer (GPT) (OpenAI, 2023)), to the best of our knowledge, ML has not yet been applied to detect data leakage in ML pipelines.

Nevertheless, training ML models often demands large amounts of annotated data, and acquiring extensive labeled data is resource-intensive or impractical in many domains and applications, making annotation a bottleneck. Recent approaches have emerged to address this challenge. Transfer learning leverages models pre-trained on large datasets for related tasks, allowing the model to transfer its knowledge to the target task with limited labeled data (Zhuang et al., 2020). Active learning optimizes the labeling process by selectively choosing the most informative instances for annotation, maximizing the utility of each labeled sample (Settles, 2009). Lastly, low-shot prompting is a technique that involves providing explicit examples to guide the model’s learning process with minimal labeled data (Wang et al., 2020b). These strategies aim to overcome the limited availability of labeled data, enhancing the efficiency and effectiveness of training ML models.

In this work, we aim to explore these three approaches to build ML models for data leakage detection in ML code using limited annotated datasets. The models are trained on code snippets labeled as either (1) containing a data leakage or (2) not containing a data leakage.

In this article, we make the following contributions:

  • 1. Build and annotate a dataset for leakage detection in Python ML code that consists of 1,904 samples.

  • 2. Introduce an automated approach for code augmentation to address the imbalance issue.

  • 3. Investigate the effectiveness of three different ML approaches for limited annotated datasets to detect data leakage in ML code. The three approaches are transfer learning, active learning, and low-shot prompting.

  • 4. Publicly release the dataset and source code.

Research background

In this section, we provide an overview of the background related to data leakage, followed by a brief introduction to machine learning approaches tailored for limited annotated datasets.

Data leakage

Data leakage occurs when testing data are leaked (directly or indirectly, deliberately or unintentionally) into the training process, leading to unrealistically optimistic performance. It can arise from three common sources: overlap leakage, multi-test leakage, and preprocessing leakage (Burkov, 2020; Subotić, Milikić & Stojić, 2022). In what follows, we briefly describe each type with an illustrative example:

  • Overlap leakage: Occurs when test data is directly used for training or hyperparameter tuning. In some cases, test data is mistakenly used in creating the training data, such as in augmentation methods. For instance, in the example shown in Fig. 1A, SMOTE oversampling (smote.fit_resample) is applied to the entire dataset (X_resampled, y_resampled) before splitting into training and testing sets (line 6). This leads to overlap leakage, as oversampled data in the training set may include information derived from the testing set, compromising the evaluation process.

  • Multi-test leakage: Occurs when test data is used repeatedly for evaluating the model and making decisions such as: algorithm selection, model selection, and hyperparameter tuning. Instead, validation data should be used. Fig. 1B demonstrates this type of leakage, where cross-validation (cv=RepeatedKFold) is performed on the entire dataset (line 2) during hyperparameter tuning with GridSearchCV (line 11). This setup includes the test set in determining the optimal hyperparameters, indirectly introducing information from the test data into the model tuning process and resulting in multi-test leakage.

  • Pre-processing leakage: Occurs when test data is merged with the training data during pre-processing, for example, feature selection, normalization, or projection with principal component analysis (PCA). An example is shown in Fig. 1C, where the MinMaxScaler is applied to the entire dataset (line 6) before the data is split into training and testing sets (line 9). This results in pre-processing leakage, as the scaling operation allows information from the test set to influence the training process, leading to overly optimistic performance estimates. A minimal code sketch contrasting this leaky ordering with a leak-free one follows Fig. 1.


Figure 1: Data leakage examples.

(A) Overlap leakage, (B) multi-test leakage, (C) pre-processing leakage.
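
To make the pre-processing leakage pattern concrete, the following minimal sketch (using scikit-learn on toy data; the variable names are illustrative assumptions, not code from Fig. 1) contrasts a leaky pipeline, in which the scaler is fit on the full dataset, with a leak-free one, in which the scaler is fit on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X, y = rng.random((100, 5)), rng.integers(0, 2, 100)  # toy features and labels

# Leaky ordering: the scaler is fit on the full dataset, so statistics from the
# future test rows influence the scaled training features (pre-processing leakage).
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Leak-free ordering: split first, then fit the scaler on the training split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```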

Machine learning approaches for limited data

As discussed in the introduction, collecting large labeled datasets for machine learning poses significant challenges, impacting various domains and applications. One major obstacle lies in the resource-intensive nature of the process, requiring substantial time, human effort, and financial investment. The need for domain-specific expertise to accurately label data adds another layer of complexity, particularly in specialized fields where domain knowledge is crucial. Additionally, data availability may be limited, restricting the creation of extensive labeled datasets and hindering the development of robust machine learning models. Consequently, addressing these challenges is paramount for advancing the field and fostering the development of more accurate and ethical machine learning applications. In this research, we explore three different approaches, namely, transfer learning, active learning, and low-shot prompting.

Transfer learning

Transfer learning is an ML paradigm that involves training a model on one task and then transferring its learned knowledge to a different, but related, task. With transfer learning, the model is initialized using the pre-trained weights, and these weights are updated based on the new task-specific small dataset with task-specific objective function, usually with a smaller learning rate (Liu et al., 2023).

Transfer learning can be categorized into three main types (Pan & Yang, 2009; Zhuang et al., 2020):

  • Inductive transfer learning: It involves transferring knowledge from one task to another, even when the domains are different. This approach requires labeled data in the target domain to train a predictive model for that specific task. For example, a model trained to classify animals could be fine-tuned with labeled car images to perform vehicle classification.

  • Transductive transfer learning: It applies when the tasks are the same, but the domains differ. In this scenario, a large amount of labeled data is available in the source domain, but none in the target domain. For instance, a spam email classifier trained on English emails can be adapted to classify spam in French emails.

  • Unsupervised transfer learning: It focuses on unsupervised tasks like clustering or dimensionality reduction, where the target task is related to but distinct from the source task. Unlike inductive learning, neither the source nor target domains have labeled data. An example could be using a model pre-trained on a large dataset of generic product reviews (source domain) to cluster customer feedback about a new software product (target domain) into groups even though no labeled data is available in either domain.

Transfer learning has proven to be highly useful and powerful, particularly in deep learning applications, and it has been widely adopted across various domains and tasks (Zhuang et al., 2020). In addition, it has been successfully applied across different data types, such as using BERT for text (Devlin et al., 2019), ResNet for images (He et al., 2016), and Wav2Vec for speech (Schneider et al., 2019). Its ability to leverage pre-trained models reduces the need for large labeled datasets in the target domain, making it especially powerful for resource-constrained tasks. For a comprehensive overview, readers are referred to Zhuang et al. (2020) and Pan & Yang (2009).

Active learning

Active learning is a sub-field of ML that produces models of high performance while reducing manual labeling efforts (Settles, 2009). A key objective of active learning is to select the most informative data for labeling, with the notion that if the model selects its own data, it will perform better with less training (Settles, 2009). It involves an iterative process where the model is trained on a small initial dataset, and then the most informative samples are selected to be labeled. It relies on a query function that calculates scores for each data point that needs to be labeled (Settles, 2009).

Several strategies for selecting the most informative samples have been proposed (Settles, 2009):

  • Uncertainty sampling: The model selects instances for which it has the least confidence in predictions. For example, in a binary classification task, this could mean selecting data points where the predicted probability is closest to 0.5.

  • Query by committee (QBC): A committee of models with diverse hypotheses selects samples based on the level of disagreement among their predictions (Seung, Opper & Sompolinsky, 1992).

  • Diversity-based sampling: Ensures the selected samples represent a wide range of the input space, reducing the risk of redundant or overly similar samples being labeled.

Active learning has been applied successfully across various domains. For instance, in medical imaging, models can select ambiguous X-rays for expert annotation, minimizing the workload for radiologists while maintaining diagnostic accuracy. In natural language processing (NLP), active learning helps text classification by identifying and labeling the most uncertain sentences. Similarly, in autonomous driving, it selects edge cases like unusual objects or weather conditions for manual annotation, ensuring robust performance in diverse scenarios.

For readers seeking to explore active learning in depth, comprehensive surveys are available. Settles’ foundational survey (Settles, 2009) provides an excellent starting point, covering core strategies, theoretical foundations, and practical applications. More recent reviews, such as those by Ren et al. (2021), focus on integrating active learning with deep learning, tackling challenges like scalability and handling high-dimensional data. Domain-specific reviews, such as those in NLP (Olsson, 2009), computer vision (Cohn, Ghahramani & Jordan, 1996), and medical imaging (Smailagic et al., 2018), further highlight its versatility and impact across fields.

Low-shot prompting

Low-shot prompting refers to another ML paradigm in which a model performs a task given only a few labeled examples of each class (Wang et al., 2020b). Instead of fine-tuning on a task-specific dataset, low-shot prompting relies on providing prompts (or examples) during the inference phase. This approach is particularly useful when only a small number of examples are available for each class and the model needs to adapt quickly to new tasks. There are typically three low-shot prompting scenarios, which describe the number of examples provided to the model:

  • Zero-shot: The model is required to perform the task without being provided any task-specific examples. Instead, the model relies on general knowledge learned from pre-training or other tasks to make predictions.

  • One-shot: The model is provided with only one example per class. The goal is to enable the model to generalize from this single example and make predictions for new instances.

  • Few-shot: The model is provided with a small number of examples per class. The number of examples is higher than one but still limited, allowing the model to learn from a small amount of task-specific data.

Literature review

In this section, we present the state-of-the-art studies on data leakage detection and avoidance in machine learning code.

A few studies have proposed manual approaches for data leakage detection and avoidance (Kaufman et al., 2012; Kohavi & Parekh, 2003), such as exploratory data analysis (EDA) (Tukey, 1977). EDA is an approach to data analysis in which data is explored using a variety of statistical and visual techniques, such as histograms and correlation analysis, in order to gain insight into the structure and relationships of the data (Tukey, 1977). One can use EDA to detect surprising cases, such as unexpected behavior of a feature in a fitted model or surprising model performance. However, the implementation and execution of manual approaches present greater challenges compared to their automated counterparts. Automated approaches, such as ML-based systems, can detect data leakage with a high level of efficacy, making them a better alternative for detecting and avoiding data leakage. Furthermore, automated approaches are more cost-effective and easier to maintain than manual approaches.

Code analysis (Yang et al., 2022; Drobnjakovic, Subotic & Urban, 2024; Cousot & Cousot, 1977; Subotić, Milikić & Stojić, 2022; Chorev et al., 2022) is one of the most commonly used approaches for data leakage detection. Yang et al. (2022) developed a static data-flow analysis to detect three types of data leakage in ML code: overlap, multi-test, and preprocessing leakage. Their approach tracks the flow of data and detects common patterns that can result in data leakage. For example, multi-test leakage is reported when only validation data is detected and no testing data is present. Overlap leakage is detected when the model’s testing/validation data overlaps with the training data. Lastly, preprocessing leakage is reported when the training data includes reduced information from the testing/validation data. They found that their analysis detects data leakage with an accuracy of 92.9%. In addition, they found a significant amount of leakage (30%) among over 100,000 public notebooks. Drobnjakovic, Subotic & Urban (2024) proposed a static code analysis based on abstract interpretation (Cousot & Cousot, 1977) that derives an abstract data leakage semantics systematically and rigorously. As an example, when a variable is passed to a function that trains or tests a model, the variable is asserted to be disjoint and untainted. They evaluated their approach in terms of performance and accuracy on 2,088 real-world notebooks. The results show that the approach detects 30 real data leakages with a precision of 94%, while scaling to the performance constraints of interactive notebook environments. Subotić, Milikić & Stojić (2022) introduced a static code analysis framework that is specific to notebooks and based on what-if analysis of notebook actions, such as cell executions, creation, and deletion. For example, the framework will warn the user that if a specific cell A is executed, data leakage can occur once cells B and C are executed. Conversely, the framework can recommend that the user execute cell C and then cell B in order to avoid data leakage. Chorev et al. (2022) employed dynamic code analysis to develop Deepchecks, a Python library for validating different aspects of machine learning models, including overlap data leakage. Nevertheless, their work does not reveal any technical information about their approach that supports scientific research; instead, it describes the tool’s capabilities and usage.

While code analysis methods have proven effective for detecting predefined patterns of data leakage, they often struggle with scalability and adapting to complex or novel scenarios in evolving codebases (Pujar et al., 2024). In contrast, ML models, such as CodeBERT and GPT, can scale efficiently to large codebases and adapt to diverse coding practices that rule-based methods may miss (Pujar et al., 2024; Zhuang et al., 2020). Additionally, techniques like transfer learning and active learning further enhance ML-based methods by reducing training overhead and minimizing data annotation requirements (Pan & Yang, 2009; Settles, 2009), making them a promising alternative for data leakage detection in dynamic software environments.

A few studies have proposed approaches to avoid data leakage. Kaufman et al. (2012) analyzed data-level leakage in terms of the relationship between input (x) and target (y) samples. They introduced a prevention approach, called learn-predict separation, based on analyzing two sources of leakage: features and training samples. The approach consists of two stages: (1) tagging every sample by “is x legitimate for inferring y”, and (2) only including the features that are purely legitimate for predicting y and only including the inputs that are purely legitimate with all targets as training samples. Lyu et al. (2021) studied data leakage challenges in the context of AIOps (Artificial Intelligence for IT Operations). They found that a time-based splitting of the dataset can significantly reduce the possibility of data leakage.

Several studies have explored approaches for assisting data scientists in developing higher-quality ML code, such as frameworks for ML pipelines. A few of these studies can be utilized as a means to detect/avoid data leakage in ML models. For example, Biswas, Wardat & Rajan (2022) performed a comprehensive study to understand the nature of data science pipelines in order to facilitate research and practice on the pipelines. Using data science pipelines containing stages of sourcing, cleaning, splitting, normalization, and training can eliminate the need to perform a normalization step before splitting (a type of data leakage). Namaki et al. (2020) introduced the ML provenance tracking problem, which identifies the columns in a dataset that have been used to train a given ML model. When combined with dependency graphs, data provenance techniques can be used to detect data leakage. The effectiveness of these approaches, however, has not yet been assessed for data leakage detection/avoidance.

Based on the conducted literature review, we identified a gap in applying ML for data leakage detection. In this work, we fill this gap by proposing ML-based approaches that are generic, scalable, and easily extendable to any type of data leakage. Additionally, automated ML approaches do not require the same level of manual labor or expertise, making them more accessible and easier to maintain. Exploring ML-based approaches for automation is becoming increasingly popular, as it has the potential to provide more accurate results efficiently and effectively, especially with the breakthroughs in ML-based approaches. Unfortunately, the surveyed approaches did not offer replication packages, preventing us from thoroughly evaluating them.

Proposed data leakage detection models

In this research, we examine transfer learning (Zhuang et al., 2020), active learning (Settles, 2009), and low-shot prompting (Wang et al., 2020b) paradigms for data leakage detection as solutions for limited annotated datasets. We utilize CodeBERT (Feng et al., 2020) for transfer learning and active learning and GPT (OpenAI, 2023) for prompting. We evaluate the effectiveness of the three approaches in terms of recall, precision, and F2-score.

To address the objectives of our study, we aim to answer the following research questions:

  • RQ1: How effectively can transfer learning identify data leakage in ML code?

  • RQ2: How does active learning affect transfer learning performance when using a smaller number of training examples?

  • RQ3: Can low-shot prompting outperform transfer learning in zero-shot, one-shot, and few-shot learning scenarios?

Data leakage dataset

In this research, we created a dataset that consists of 1,904 labeled samples for data leakage detection. The positive samples (i.e., contain data leakage) constitute 6% of the dataset with 115 samples, whereas the negative samples (i.e., do not contain data leakage) constitute 94% of the dataset with 1,789 samples.

The core of our dataset is Code4ML (Drozdova et al., 2023), which contains 7,944 Python code snippets that are publicly available on Kaggle (http://kaggle.com), a platform that is widely recognized as a prominent host for data science competitions. Python was selected as it is one of the most widely used programming languages in machine learning and data science, making it highly relevant for data leakage use cases. Code4ML is manually annotated with the main phases of ML pipelines based on a Machine Learning Taxonomy Tree (Drozdova et al., 2023). The taxonomy has two levels, where the high level consists of eleven main phases: data export, data extraction, data transformation, debug, environment, exploratory data analysis, hyper-parameters tuning, model evaluation, model interpretation, model training, and visualization. Each high-level phase consists of several low-level phases, resulting in 80 low-level categories.

The dataset was created in four stages as visualized in Fig. 2.


Figure 2: Dataset construction process.

Identify data leakage types

As a first step, we surveyed the literature to identify the common types of data leakage. As a result, we found that data leakage can occur because of three common sources: overlap leakage, multi-test leakage, and preprocessing leakage (Burkov, 2020; Subotić, Milikić & Stojić, 2022).

Mapping

In this step, we mapped each data leakage type with the main high-level phases from the taxonomy in which each data leakage may occur. The mapping process is formalized by associating each data leakage type L with the corresponding ML pipeline phases P where the leakage could occur. This can be expressed mathematically as:

$M = \{(L_i, P_j) \mid P_j \in \text{Phases},\ L_i \in \text{LeakageTypes},\ f(P_j, L_i) = \text{True}\}$, where $f(P_j, L_i)$ is a mapping function that evaluates whether a phase $P_j$ is associated with a leakage type $L_i$ based on domain knowledge or observed practices. The mapping results are presented in Table 1.

Table 1:
Mapping between data leakage types and ML pipeline phases.
ML pipeline phase Leakage types
Overlap Multi-test Pre-processing
Data export
Data extraction
Data transformation X X
Debug
Environment
Exploratory data analysis
Hyper-parameters tuning X X
Model evaluation
Model interpretation X
Model training X X
Visualization
DOI: 10.7717/peerj-cs.2730/table-1

Filtering unrelated code

In this step, we used the mapping table to filter Code4ML to include only the code snippets that are associated with a phase that has the potential to introduce data leakage. This step was performed to mitigate the imbalance issue caused by the high volume of unrelated code. We define “unrelated code” as a code snippet that poses no risk of causing data leakage, such as visualization code. The filtering process can be represented as:

$C_{filtered} = \{c \in C \mid \exists\, L_i \text{ mapped to the phase of } c\}$, where $C_{filtered}$ represents the subset of code snippets relevant to data leakage, $C$ is the original set of code snippets, and $L_i$ refers to the identified leakage types. Only snippets associated with phases mapped to at least one leakage type were retained, significantly reducing the dataset size and mitigating the imbalance caused by unrelated code.
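
For illustration, this filtering amounts to a membership test against the mapping in Table 1; the snippet below is a minimal sketch with hypothetical column names and toy rows, not the exact script used to build the dataset.

```python
import pandas as pd

# Hypothetical slice of Code4ML: one row per code snippet with its annotated high-level phase.
snippets = pd.DataFrame({
    "code":  ["X = MinMaxScaler().fit_transform(X)", "plt.plot(history.history['loss'])"],
    "phase": ["data transformation", "visualization"],
})

# Phases mapped to at least one leakage type (Table 1).
leakage_prone_phases = {"data transformation", "hyper-parameters tuning",
                        "model interpretation", "model training"}

# Keep only the snippets whose phase can introduce data leakage.
filtered = snippets[snippets["phase"].isin(leakage_prone_phases)]
```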

Although the unrelated code snippets have been filtered out, the dataset is still imbalanced, with only 4% of the dataset containing code snippets with data leakage. Handling the imbalance issue is discussed in the “Data Imbalance” section.

Manual annotation

In this step, we manually annotated the resulting dataset for data leakage detection. Each code snippet was labeled with either a positive class (contains data leakage) or a negative class (does not contain data leakage). The labeling process was conducted by one author and subsequently validated by an expert in machine learning with a Ph.D. in Computer Science.

Data imbalance

Data imbalance is characterized by a disproportionate distribution of samples across classes, resulting in biased models that favor the majority class while neglecting the minority class. Training a model on an imbalanced dataset will likely result in suboptimal performance, especially for instances belonging to the minority class. For example, in our study, the minority class comprises only 5% of the dataset (91 samples). When training a model without addressing this imbalance, we observed a high accuracy of 92%, but both precision and recall for the minority class were 0. This indicates that the model predicted all samples as belonging to the majority class, failing to learn patterns from the minority class.

In order to handle data imbalance, oversampling techniques can be used, such as synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002) or undersampling techniques, such as random undersampling. These techniques help to mitigate the imbalance of classes, resulting in more accurate models. However, in this study, undersampling was not considered viable because it would have reduced the majority class to match the minority class, resulting in an extremely small training set with only 182 samples in total. Such a reduction would hinder the model’s ability to generalize effectively.

As a result, we considered oversampling techniques to address the imbalance, by allowing the model to better learn patterns from the minority class and improving its overall performance.

SMOTE oversamples the minority class by generating synthetic examples using the K nearest neighbors in the feature space, as represented by the equation:

$x_{new} = x + \lambda \times (x_{nearest} - x)$, where $x_{nearest}$ is a randomly selected nearest neighbor of $x$ in the feature space, and $\lambda$ is a random value between 0 and 1.

Large language models (LLMs), such as BERT and GPT, are embedding-based models that learn contextual dense vector representations of words, while SMOTE operates in the input feature space and may not be directly applicable to the embeddings learned by LLMs. However, there are alternative techniques to address class imbalance in NLP tasks for embedding-based models, including data augmentation. One of the most common methods for augmenting code snippets is code refactoring (Fowler & Beck, 1997; Tsantalis et al., 2018). Refactoring is the process of restructuring code to improve its readability, maintainability, and performance without affecting its external behavior (Fowler & Beck, 1997). It can involve renaming variables and functions, moving classes and methods, and restructuring classes.

In this research, we utilized GPT model (OpenAI, 2023) to automatically augment the dataset (i.e., generate refactored code snippets from each training sample). We followed OpenAI’s prompt engineering guide and best practices (https://platform.openai.com/docs/guides/prompt-engineering) to design the prompt. A prompt consists of three roles: System, User, and Assistant (optionally). The System role is used to give instructions to the model, such as asking the model to adopt a persona and use a specific output format. A User represents the entity that interacts with GPT and asks questions. The Assistant is the large language model (i.e., GPT) and can be optionally used in the prompt, such as in the case of few-shot prompting.

While GPT-based refactoring techniques are effective for generating augmented datasets, challenges such as generating semantically incorrect or syntactically invalid code can arise, particularly in complex code snippets. To address these issues, we instructed GPT to perform specific refactoring techniques considered simple and less error-prone, such as renaming variables, functions, or methods. These transformations are inherently less likely to introduce errors and are well-suited for automated augmentation. We instructed GPT to randomly use a combination of a selected set of techniques from the catalog of refactoring techniques (Fowler & Beck, 1997). Among the refactoring techniques, we excluded those specific to Object-Oriented Programming (OOP) (e.g., extract classes), as ML code typically emphasizes functional and modular programming over class-based hierarchies. While OOP is integral to ML frameworks like scikit-learn and TensorFlow, pipeline scripts and experimentation often follow a procedural or functional style, focusing on modular, stateless functions. Applying OOP-specific refactoring would risk introducing incorrect refactoring, as these transformations might not align with the functional and modular characteristics of ML code. To verify GPT’s understanding of the refactoring techniques, we asked GPT to define each one and verified that the answers were accurate. Following is a list of the considered refactoring techniques, along with a description of each one:

  • Rename variable: changes the name of a variable and all references to it.

  • Extract variable: declares a new variable and assigns the selected expression to it.

  • Change function declaration: includes renaming the function, adding a parameter, removing a parameter, and changing the signature.

  • Inline function: replaces the usage of a method with its body, as well as removing the original declaration of the method.

  • Introduce special case: adds code to handle special cases, such as Null objects.

Figure 3 shows examples of refactored code snippets generated by GPT. Additionally, Fig. 4 shows an example of the prompt used to generate the refactored code in order to create the augmented dataset.


Figure 3: Refactoring examples using GPT.

(A) Original code, (B) refactoring using: Rename Variable, (C) refactoring using: Change Function Declaration, (D) refactoring using: Inline Function, (E) refactoring using: Introduce Special Case.

Figure 4: An example of the prompt used to generate the refactored code.
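
For illustration, a refactoring-augmentation request of this kind could be issued through the OpenAI chat API roughly as sketched below; the model name, wording of the instructions, and the sample snippet are assumptions for this example, not the exact prompt used in the study (the actual prompt is shown in Fig. 4).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

snippet = "df = pd.read_csv('train.csv')\nX = scaler.fit_transform(df)"  # hypothetical training sample

response = client.chat.completions.create(
    model="gpt-4",  # assumed model; any chat-capable GPT model would work
    temperature=1.0,
    messages=[
        {"role": "system",
         "content": ("You are an expert Python developer. Refactor the user's code using a random "
                     "combination of: rename variable, extract variable, change function declaration, "
                     "inline function, introduce special case. Preserve the external behavior and "
                     "return only the refactored code.")},
        {"role": "user", "content": snippet},
    ],
)

augmented_snippet = response.choices[0].message.content  # added to the augmented training set
```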

While our study focused on effective and reliable techniques for addressing class imbalance, such as code refactoring, we acknowledge the importance of systematically comparing various methods, including different oversampling and augmentation approaches. Conducting a detailed comparison to evaluate the impact of these methods on performance remains a valuable direction for future work.

Transfer learning

As discussed previously, transfer learning is an effective approach that uses knowledge from one task to help learn a model for another related task (Zhuang et al., 2020). Models such as BERT (Devlin et al., 2019) are pre-trained on large language corpora and can be fine-tuned for specific tasks such as sentiment analysis or text summarization. RoBERTa (Robustly optimized BERT approach) (Liu et al., 2019) is a pre-trained model that builds upon the foundation of BERT to address some limitations of the original BERT model.

Code has structure and semantics specific to programming languages, making RoBERTa less suitable for direct application to code-related tasks. As a result, CodeBERT (Feng et al., 2020) was introduced in 2020 as a model pre-trained on programming languages (Python, Java, JavaScript, PHP, Ruby, and Go) to handle code-related tasks. CodeBERT, like its base RoBERTa architecture, consists of multiple layers of transformer units (Vaswani et al., 2017), which are particularly effective for sequence-based tasks. CodeBERT achieved state-of-the-art performance on different code tasks, such as code completion and summarization (Feng et al., 2020).

Transfer learning primarily involves two phases:

  • Pre-training: The primary goal of pre-training is to enable the model to learn generalized representations of the input data. In the case of CodeBERT, this involves exposing the model to a large dataset of source code and training it to understand the context and relationships between different code tokens.

  • Fine-tuning: The fine-tuning phase aims to adapt the pre-trained model to a specific downstream task. In the case of CodeBERT, this could include tasks like code completion, code summarization, or code search.

In this research, we fine-tuned CodeBERT for the data leakage detection task. The input of the fine-tuning process is a code snippet, whereas the output is whether it contains a data leakage or not. To load and fine-tune CodeBERT, we used the Hugging Face library (https://huggingface.co). Specifically, we passed the microsoft/codebert-base model to the RobertaForSequenceClassification API, which adds a linear layer on top of the model output in order to fine-tune it on the downstream task.

To prepare the dataset for CodeBERT, a few pre-processing steps are required (a minimal sketch of these steps follows the list):

  • 1. Split each code block into tokens.

  • 2. Add the special token [CLS] to the beginning of the tokenized input. The [CLS] token is used to represent the entire sequence for the top linear layer.

  • 3. Pad/truncate the input based on the maximum length allowed by the model (512 tokens).

  • 4. Replace the tokens with their IDs according to the CodeBERT token indices.

  • 5. Create the attention mask array to indicate the padding tokens.
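
These steps can be carried out with the Hugging Face tokenizer together with the RobertaForSequenceClassification head; the sketch below illustrates one possible setup, where the toy snippet and the single training step are illustrative assumptions rather than the exact training script.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=2)

snippet = "X = MinMaxScaler().fit_transform(X)\nX_tr, X_te = train_test_split(X)"  # toy code block

# Tokenize, add the sequence-start/end special tokens (RoBERTa's equivalent of [CLS]/[SEP]),
# pad/truncate to 512 tokens, map tokens to IDs, and build the attention mask.
encoding = tokenizer(snippet, truncation=True, padding="max_length",
                     max_length=512, return_tensors="pt")

label = torch.tensor([1])  # 1 = contains data leakage, 0 = no leakage
output = model(**encoding, labels=label)
output.loss.backward()  # an optimizer step (e.g., AdamW) would follow in a full training loop
```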

Active learning

As discussed previously, the active learning algorithm relies on a query strategy to select the most informative samples to be labeled (Settles, 2009). In this research, we employ the least confidence (LC) query strategy, as it is a powerful strategy for binary classification (Settles, 2009). LC is particularly effective for binary classification tasks, as uncertainty is directly captured by the posterior probabilities of the two classes (Settles, 2009). This simplicity makes LC both computationally efficient and well-suited for identifying uncertain samples in binary classification. Alternative strategies, such as margin of confidence and entropy-based sampling, provide more comprehensive measures of uncertainty by incorporating additional information about the label distribution. However, in binary classification, these strategies often give results similar to LC since uncertainty is inherently represented by the two-class probabilities. The LC strategy selects the samples the model is least certain about how to label, as shown in Eq. (1), where $\hat{y}$ is the label with the highest posterior probability for the model $\theta$.

$x^{*}_{LC} = \operatorname{argmax}_{x} \left(1 - P_{\theta}(\hat{y} \mid x)\right)$ (1)

Huang (2021) provided an open-source implementation of active learning for image classification tasks. We employed their implementation and modified its parsing, pre-processing, and training code. Additionally, we implemented the augmentation process and integrated it with the active learning implementation. The modified active learning steps are illustrated in Algorithm 1 (a Python sketch of the loop is given after the algorithm). The active learning algorithm initializes training with N selected samples from the unlabeled pool, updates the labeled training dataset with the selected samples, augments the dataset, and trains the model. The loop runs until there is no improvement in the F2-score. For more generalization and robustness, we added a patience technique, where the loop breaks if the F2-score does not improve for K iterations. We experimented with a number of N and K values and concluded that N=50 and K=5 were the most adequate.

Algorithm 1:
Active learning algorithm.
    dataset = Data()
    net = Net()
    strategy = Strategy(dataset, net)
    while no improvement in F2-score, with patience K do
        query_idxs = strategy.query(N)
        strategy.update(query_idxs)
        strategy.augment()
        strategy.train()
    end while
DOI: 10.7717/peerj-cs.2730/table-101
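
A minimal Python sketch of this loop, using least-confidence sampling on top of a generic scikit-learn-style probabilistic classifier and NumPy arrays, is shown below; the function names and helper structure are assumptions for illustration, not the modified implementation of Huang (2021).

```python
import numpy as np
from sklearn.metrics import fbeta_score

def least_confidence_query(model, X_pool, n):
    """Select the n pool samples with the lowest prediction confidence (Eq. (1))."""
    probs = model.predict_proba(X_pool)      # shape: (num_samples, 2)
    confidence = probs.max(axis=1)           # P_theta(y_hat | x)
    return np.argsort(confidence)[:n]        # least confident first

def active_learning_loop(model, X_train, y_train, X_pool, y_pool,
                         X_test, y_test, augment, n=50, patience=5):
    """Least-confidence active learning with patience, mirroring Algorithm 1."""
    best_f2, stale = 0.0, 0
    model.fit(*augment(X_train, y_train))                # initial model on the seed labeled set
    while stale < patience and len(X_pool) > 0:
        idxs = least_confidence_query(model, X_pool, n)  # query the most informative samples
        X_train = np.concatenate([X_train, X_pool[idxs]])
        y_train = np.concatenate([y_train, y_pool[idxs]])  # labels supplied by the annotator
        X_pool = np.delete(X_pool, idxs, axis=0)
        y_pool = np.delete(y_pool, idxs)
        model.fit(*augment(X_train, y_train))            # augment the labeled data, then retrain
        f2 = fbeta_score(y_test, model.predict(X_test), beta=2)
        best_f2, stale = (f2, 0) if f2 > best_f2 else (best_f2, stale + 1)
    return model
```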

Low-shot prompting

As discussed previously, low-shot prompting covers three prompting scenarios: zero-shot, one-shot, and few-shot. In this research, we explored all three scenarios by leveraging the power of GPT (OpenAI, 2023), following best practices to design the prompt. We selected GPT due to its state-of-the-art performance in language understanding and in-context learning across diverse domains, including software engineering tasks, without requiring task-specific fine-tuning (Shin et al., 2023; Sridhara, Ranjani & Mazumdar, 2023). We instructed GPT to act as an ML expert, provided it with a definition of data leakage in ML code, and asked it to decide whether a specific code block contains data leakage. Figure 5 shows the prompt used to classify the ML code for data leakage detection, and a code-level sketch of a few-shot call follows the figure.


Figure 5: The prompt used for data leakage detection.
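
As an illustration of how the shots are supplied, a one-shot-per-class classification call could look roughly like the sketch below; the model name, example snippets, and exact wording are assumptions for this example and do not reproduce the prompt in Fig. 5.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_msg = ("You are an expert machine learning engineer. Decide whether the given Python code "
              "contains data leakage (test data influencing training, repeated test-set use, or "
              "pre-processing before splitting). Answer with 'leakage' or 'no leakage'.")

# One illustrative shot per class; few-shot prompting would simply add more such pairs.
shots = [
    {"role": "user", "content": "X = MinMaxScaler().fit_transform(X)\n"
                                "X_tr, X_te, y_tr, y_te = train_test_split(X, y)"},
    {"role": "assistant", "content": "leakage"},
    {"role": "user", "content": "X_tr, X_te, y_tr, y_te = train_test_split(X, y)\n"
                                "X_tr = MinMaxScaler().fit_transform(X_tr)"},
    {"role": "assistant", "content": "no leakage"},
]

code_to_classify = "pca = PCA(n_components=2).fit(X)\nX_split = train_test_split(pca.transform(X))"

response = client.chat.completions.create(
    model="gpt-4",  # assumed model
    temperature=0,
    messages=[{"role": "system", "content": system_msg}, *shots,
              {"role": "user", "content": code_to_classify}],
)
print(response.choices[0].message.content)
```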

Experimental evaluation

Our primary objective is to evaluate the performance and behavior of transfer learning, active learning, as well as low-shot prompting. Due to the unavailability of the tools and datasets used in existing approaches (Yang et al., 2022; Drobnjakovic, Subotic & Urban, 2024; Cousot & Cousot, 1977; Subotić, Milikić & Stojić, 2022; Chorev et al., 2022), we encountered a significant limitation in our comparative analysis. Regrettably, this prevented us from conducting a comprehensive assessment of the performance of our proposed approach in relation to these existing methodologies.

Figure 6 illustrates the experiment process, where the arrow colors indicate the paths of the three approaches (blue for transfer learning, red for active learning, and green for low-shot prompting).


Figure 6: Experiments methodology.

Experiments setup

The dataset was split with a ratio of 80:20, resulting in 1,523 training and 381 testing samples. Additionally, we employed randomization and stratification during the splitting to ensure that the distribution of labels in the training and testing sets remains representative of the overall dataset. In order to ensure a fair comparison between the models, we used the same training and testing sets. Furthermore, to increase the reliability of our results, each experiment was repeated five times with different seed values. The results reported in this article represent the average of five runs.
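
A setup along these lines can be expressed with scikit-learn as in the brief sketch below; the variable names, toy data, and seed values are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

snippets = ["code_%d" % i for i in range(100)]    # placeholder code snippets
labels = [1 if i < 6 else 0 for i in range(100)]  # ~6% positive, mirroring the dataset

# 80:20 stratified split; repeating with different seeds gives the repeated runs.
for seed in range(5):  # illustrative seed values
    X_train, X_test, y_train, y_test = train_test_split(
        snippets, labels, test_size=0.2, stratify=labels, random_state=seed)
```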

Although it is important that the proposed model detects data leakage effectively, it is equally important to ensure that the model itself is leak-free. Specific precautions were implemented in response to each of the data leakage types presented previously:

  • Overlap leakage: The data augmentation technique used to handle the imbalanced data issue was applied only to the training data.

  • Multi-test leakage: The test set was not involved in the model’s decision-making process.

  • Pre-processing leakage: No processing of the data was done before splitting the dataset into training and testing sets.

Evaluation metrics

In the case of imbalanced datasets, traditional metrics such as accuracy may be misleading because a model can achieve high accuracy by simply predicting the majority class most of the time. Therefore, it is essential to consider alternative evaluation metrics that provide a more adequate understanding of the model’s performance. Recall, precision, and F-beta score are appropriate metrics to evaluate binary classification problems for imbalanced datasets due to their sensitivity to minority class identification. In what follows, we describe each evaluation metric:

  • Precision. Precision is used to measure how many of the detected data leakage samples are correct, using true positives (TP) and false positives (FP) (as in Eq. (2)).

    $\text{Precision} = \frac{TP}{TP + FP}$ (2)

  • Recall: Recall is used to measure how many of the data leakage samples are detected correctly, using TP and false negatives (FN) (as in Eq. (3)).

    $\text{Recall} = \frac{TP}{TP + FN}$ (3)

  • F-beta score. F-beta combines precision and recall into a single value to balance the trade-off between these two metrics (as in Eq. (4)).

    $F_{\beta}\text{-score} = \frac{(1 + \beta^{2}) \times \text{Precision} \times \text{Recall}}{(\beta^{2} \times \text{Precision}) + \text{Recall}}$ (4), where $\beta$ is a parameter controlling the relative weight of recall compared to precision. For example, when $\beta = 1$, the F-1 score is calculated, giving equal weight to precision and recall. A $\beta$ value higher than 1 gives more weight to recall, making it useful in situations where recall is more important. Conversely, a $\beta$ value smaller than 1 places more emphasis on precision. In this research, we favor the F-2 score (which makes recall twice as important as precision) because recall reflects how many of the actual data leakage samples are detected (a minimal computation sketch follows this list).
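
These metrics can be computed directly with scikit-learn, as in the minimal sketch below; the prediction vectors are toy values, not results from our experiments.

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy ground-truth labels (1 = leakage)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # toy model predictions

print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(recall_score(y_true, y_pred))         # TP / (TP + FN)
print(fbeta_score(y_true, y_pred, beta=2))  # F-2: recall weighted twice as much as precision
```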

Results and discussion

In this section, we answer the research questions, followed by a discussion on the integration and practicality of the proposed approach.

RQ1: How effectively can transfer learning identify data leakage in ML code?

To answer this question, we fine-tuned CodeBERT, a language model pre-trained for code-related tasks that is based on the BERT architecture. Additionally, we compared it to a model fine-tuned on the original BERT, which is pre-trained for general NLP tasks. The results are presented in Table 2, where we show recall (R), precision (P), and the F-1 and F-2 scores of each run, as well as the average and standard deviation. Our results align with Feng et al. (2020), as CodeBERT clearly outperforms BERT across all metrics.

Table 2:
Performance of fine-tuning BERT and CodeBERT on data leakage detection task.
Bold values indicate the best performance.
Model Run P R F-1 F-2
BERT 1 0.72 0.52 0.60 0.55
2 0.80 0.64 0.71 0.67
3 1.00 0.36 0.53 0.41
4 0.82 0.56 0.67 0.60
5 0.74 0.56 0.64 0.59
Avg. 0.82 0.53 0.63 0.56
Std. 0.11 0.10 0.07 0.09
CodeBERT 1 0.89 0.68 0.77 0.71
2 0.94 0.64 0.76 0.68
3 0.83 0.60 0.70 0.64
4 0.76 0.64 0.70 0.66
5 0.73 0.64 0.68 0.66
Avg. 0.83 0.64 0.72 0.67
Std. 0.09 0.03 0.04 0.03
DOI: 10.7717/peerj-cs.2730/table-2

RQ2: How does active learning affect the model performance when using a smaller number of training examples?

To answer this question, we evaluated the active learning model (described in Algorithm 1) on the testing set. The baseline is the best-performing model from RQ1, i.e., the CodeBERT model fine-tuned on the full dataset (passive learning). The performance results are shown in Table 3, along with the number of labeled samples needed and the total number of training samples (labeled samples + augmented samples). For example, the first run of active learning in Table 3 shows that the model achieved an F-2 score of 0.70 with only 710 labeled samples and a total of 1,219 training samples. It is interesting to note that, when comparing the average performance of active learning and passive learning, active learning outperformed a model trained on the full dataset (1,523 samples) with only 698 labeled samples. This shows that active learning can lead to better generalization with fewer labeled examples by effectively selecting diverse and informative examples. Consequently, active learning reduces the annotation cost, making it a more cost-efficient option.

Table 3:
Performance of active learning approach against models trained on full dataset.
Bold values indicate the best performance.
Model Run P R F-1 F-2 # Labeled samples # Training samples
Passive learning 1 0.89 0.68 0.77 0.71 1,523 1,523
2 0.94 0.64 0.76 0.68 1,523 1,523
3 0.83 0.60 0.70 0.64 1,523 1,523
4 0.76 0.64 0.70 0.66 1,523 1,523
5 0.73 0.64 0.68 0.66 1,523 1,523
Avg. 0.83 0.64 0.72 0.67 1,523 1,523
Std. 0.09 0.03 0.04 0.03 0.00 0.00
Active learning 1 0.81 0.68 0.74 0.70 710 1,219
2 0.90 0.72 0.80 0.75 710 1,208
3 0.89 0.68 0.77 0.71 770 1,461
4 0.89 0.68 0.77 0.71 590 866
5 0.78 0.72 0.75 0.73 710 1,230
Avg. 0.86 0.70 0.77 0.72 698 1,197
Std. 0.06 0.02 0.02 0.02 65.73 212.69
DOI: 10.7717/peerj-cs.2730/table-3

RQ3: Can low-shot prompting outperform transfer learning in zero-shot, one-shot, and few-shot learning scenarios?

To answer this question, we performed low-shot prompting on GPT. We experimented with zero-shot, one-shot, three-shot, and five-shot prompting.

The results from low-shot prompting show an increase in performance with additional shots; however, improvement levels off after three-shots (as shown in Table 4). This plateau is consistent with findings from prior studies on few-shot learning, particularly in binary classification tasks. Research suggests that in binary classification, increasing the number of shots often fails to yield further improvements in performance (Wang et al., 2020b; Brown et al., 2020). This behavior can be attributed to two primary factors. First, in binary classification, the model can often extract sufficient information from a small number of examples due to the relatively simple decision boundary, compared to more complex multi-class tasks (Wang et al., 2020b). Additional examples may provide redundant information, limiting further improvements in performance. Second, the ability of LLMs to utilize additional examples is constrained by the context window and prompt structure. As the number of examples increases, the signal-to-noise ratio may decline, limiting the model’s capacity to extract meaningful patterns (Brown et al., 2020). Further improvements might require techniques such as optimizing the selection and diversity of examples or using task-specific fine-tuning to complement prompting approaches.

Table 4:
Performance of zero, one, and few-shot prompting.
Bold values indicate the best performance.
Model Run P R F-1 F-2 Model Run P R F-1 F-2
0-shot 1 0.04 0.24 0.06 0.11 3-shot 1 0.19 0.52 0.27 0.38
2 0.08 0.52 0.14 0.25 2 0.16 0.40 0.23 0.31
3 0.06 0.40 0.10 0.18 3 0.21 0.60 0.31 0.44
4 0.06 0.40 0.11 0.19 4 0.21 0.60 0.32 0.44
5 0.06 0.36 0.10 0.17 5 0.18 0.48 0.26 0.36
Avg. 0.06 0.38 0.10 0.18 Avg. 0.19 0.52 0.28 0.39
Std. 0.02 0.10 0.03 0.05 Std. 0.02 0.08 0.04 0.06
1-shot 1 0.11 0.56 0.19 0.31 5-shot 1 0.11 0.72 0.19 0.33
2 0.12 0.56 0.20 0.33 2 0.10 0.64 0.17 0.30
3 0.09 0.44 0.15 0.25 3 0.09 0.48 0.14 0.25
4 0.12 0.56 0.19 0.32 4 0.11 0.72 0.19 0.34
5 0.10 0.52 0.16 0.28 5 0.11 0.76 0.20 0.36
Avg. 0.11 0.53 0.18 0.30 Avg. 0.10 0.66 0.18 0.31
Std. 0.01 0.05 0.02 0.03 Std. 0.01 0.11 0.02 0.04
DOI: 10.7717/peerj-cs.2730/table-4

Zero-shot showed very low performance, with an F-2 score of 0.18. One-shot improved the performance with an increase of 0.12 in F-2 score over zero-shot, and three-shot increased the F-2 score by 0.09 over one-shot. On the other hand, five-shot did not improve the overall performance compared to three-shot. Figure 7 provides a visual representation of the average performance of the four scenarios. When we increased the shots from three to five, recall increased, but at the expense of precision, the F-1 score, and the F-2 score. This might indicate that, with an increase in the number of shots, the model becomes more biased towards the positive samples (the minority class).


Figure 7: Average performance of low-shot prompting.

Lastly, we provide a summary of the best-performing models in each of the three proposed approaches: transfer learning, active learning, and low-shot prompting. As shown in Table 5, active learning was the best-performing approach with an F-2 score of 0.72. Although prompting requires much less labeled data (i.e., three samples per class), it achieved a low F-2 score of 0.39. Nevertheless, with active learning, we were able to reduce the amount of labeled data needed from 1,523 to 698 samples while improving performance at the same time. In conclusion, active learning proved to be the most effective method for data leakage detection.

Table 5:
Performance comparison of transfer learning, active learning, and low-shot prompting.
Bold values indicate the best performance.
Model Precision Recall F-2 Score
Transfer learning 0.83 0.64 0.67
Active learning 0.86 0.70 0.72
Low-shot prompting 0.19 0.52 0.39
DOI: 10.7717/peerj-cs.2730/table-5

Practical integration and scalability

Efficient integration of data leakage detection into real-world software engineering workflows is crucial for ensuring continuous code quality and adaptability to dynamic development practices.

Transfer learning utilizes pre-trained models, such as CodeBERT or GPT, which can be fine-tuned on domain-specific datasets with accessible hardware like a single GPU (e.g., NVIDIA RTX 3090 or A100) or cloud services such as AWS EC2. Once fine-tuned, these models are lightweight during inference and can run efficiently on low-cost CPUs. This makes them well-suited for integration into continuous integration/continuous deployment (CI/CD) pipelines and code review tools, enabling automated detection of data leakage as part of routine development workflows.

Active learning enhances model performance by focusing on the most informative samples for annotation, reducing the overall labeling effort. This iterative approach allows retraining on incrementally collected data, which can be computationally lightweight and performed on local GPUs or cost-effective cloud infrastructure. By minimizing annotation requirements and computational costs, active learning ensures scalability in rapidly evolving environments. Importantly, even with larger datasets, the number of labeled samples required does not necessarily increase linearly with repository size but is influenced by the complexity and diversity of the codebase as well as the stopping criteria. Future work could explore different sampling methods, such as automated code clustering to group similar code blocks.

To address the iterative nature of code development, where new data leakage issues may arise or previously resolved leaks may reoccur, our approach can be used by incorporating regular model runs into the development cycle. This continuous analysis ensures every code iteration is evaluated for potential leaks. Furthermore, periodic retraining of the model can be incorporated to adapt to significant changes in the codebase, with training efficiency maintained through transfer learning and active learning techniques that minimize computational overhead and data labeling requirements.

Limitations

While this research demonstrates promising results, certain limitations must be acknowledged that influence the scope and generalizability of our approach.

First, this study focuses on demonstrating the potential of ML-based approaches for detecting data leakage but does not include a direct comparison with existing state-of-the-art methods, such as static and dynamic code analysis techniques. A comparison was not conducted due to the lack of publicly available benchmarks that comprehensively evaluate these approaches on the same dataset. Furthermore, the surveyed approaches did not provide replication packages, which prevented a thorough evaluation under consistent conditions. Future work should aim to address this limitation by establishing publicly available benchmarks and replication packages to enable comprehensive evaluations of ML-based approaches alongside traditional techniques in terms of accuracy, scalability, and computational efficiency.

Second, the identified types of data leakage do not cover scenarios arising during the data collection and preparation phase, such as when the target variable in a dataset is a function of a feature. This study focuses exclusively on detecting data leakage within the code itself. Additionally, the dataset used in this research is derived from the Code4ML dataset, which consists of isolated code blocks. As a result, our approaches are limited to detecting leakage within a single code block, and future work is required to address leakage that occurs across multiple code blocks or modules.

Third, the study’s focus on Python code limits the generalizability of the proposed approach to other programming languages. While Python’s prominence in machine learning motivated this choice, further research could explore applying these techniques to additional languages to assess their broader applicability.

Finally, while our transfer learning approach demonstrates the effectiveness of BERT-based models, its generalization to other architectures requires caution. Similarly, the results of low-shot prompting are specific to the GPT model and may not directly translate to other LLMs. Future work could include comparative evaluations across diverse models and architectures to validate these findings further.

Threats to validity

In this section, we discuss the threats to validity of our study and the steps we took to mitigate them.

One potential threat involves the created dataset, where annotation errors may arise due to the intricate nature of the task or potential inaccuracies in the data entry process. The complexity of data leakage scenarios increases the likelihood of such errors, potentially impacting annotation accuracy. To address this, a rigorous validation process was implemented, including thorough scrutiny by an ML expert.

In the transfer learning approach, we fine-tuned two pre-trained BERT-based models. While our results are specific to these models, they may not generalize to alternative architectures or other BERT-based models. To mitigate this, we selected CodeBERT, which is pre-trained explicitly for code-related tasks, alongside a general BERT-based model, ensuring some diversity in our evaluations.

A potential threat related to the active learning approach is sensitivity to the initial N value, as different values may yield varying results. Lower N values can lead to a cold start problem, where the model struggles to learn due to the limited representation of positive class samples. To mitigate this, we conducted a pilot study to experiment with different N values and selected one that balanced learning performance effectively.

In the low-shot prompting approach, the results are sensitive to the chosen model and prompt design. Different models or prompts might yield better (or worse) performance, potentially affecting generalizability. To address this, we selected the GPT model due to its strong performance in programming-related tasks (Chang et al., 2024) and followed OpenAI’s best practices for prompt design.

Conclusions and future work

This article highlighted the issue of data leakage in ML code, emphasizing its impact on model performance in production. While existing studies have proposed manual and code analysis approaches to detect data leakage in ML code, the application of ML techniques remains unexplored. In this work, we proposed different ML-based solutions, using transfer learning, active learning, and low-shot prompting, to build effective models for data leakage detection in ML pipelines. These strategies, which mitigate the challenges associated with limited annotated datasets, showed promise in enhancing the efficiency and reliability of ML models. Our experiments showed that active learning outperformed other methods and reduced the amount of labeled data required by half while improving the performance of the model.

Building upon our current exploration of ML approaches for data leakage detection in ML code, future work can proceed in several directions. One direction is addressing a broader range of data leakage scenarios that may arise in complex ML pipelines. Moreover, it is essential to investigate the impact of code style standardization on model performance: such pre-processing steps could reduce variability in the dataset, potentially improving the model’s ability to detect leakage patterns. Future work could also include a comparative evaluation of other LLMs, such as T5 and BART, which support few-shot learning, as well as models explicitly fine-tuned on software engineering tasks, which may capture domain-specific terms and address more complex patterns. Lastly, establishing publicly available benchmarks and replication packages would facilitate the evaluation and comparison of ML-based approaches with state-of-the-art methods, providing a deeper understanding of the relative strengths and limitations of these techniques.
