Fast binary logistic regression

PeerJ Computer Science

Introduction

Classification has been extensively investigated in statistical learning as a supervised learning task. As pioneering works, the contributions of Berkson (1944) and Cox (1958) to the field of logistic regression are recognized as foundational: Berkson applied the logistic function to bio-assay, while Cox introduced regression analysis of binary sequences. Different forms of logistic regression are available to handle outputs in binary, multinomial, and ordinal formats, making it a versatile machine learning algorithm capable of addressing a wide range of classification tasks in domains such as marketing, finance, healthcare, and fraud detection. With the rapid development of machine learning, highly flexible models such as extreme gradient boosting (XGBoost), support vector machines (SVM), and neural networks have become prominent (Cioffi et al., 2020). However, studies indicating logistic regression's superior performance in some scenarios (de Hond et al., 2022; Jiang, Hu & Jia, 2023; Nusinovici et al., 2020; Zabor et al., 2022) and its continued use (Jaskie, Elkan & Spanias, 2019; Lu, 2024) underscore the statistical machine learning community's ongoing interest in this method.

Several improvements to the original logistic regression method (Berkson, 1944) have been proposed, such as weight regularization to obtain a generalized model or to adapt the method to large data (Jiang, Hu & Jia, 2023; Bertsimas, Pauphilet & Van Parys, 2021; Tibshirani, 1996; Zaidi et al., 2016). Achieving a generalized model is closely tied to the bias-variance tradeoff, where high-bias models risk oversimplification and underfitting, while low-bias models can overfit, exhibiting high variance. The tradeoff involves balancing these extremes to ensure the model generalizes well to unseen data without overfitting or underfitting. The optimal tradeoff is achieved by selecting appropriate model parameters that yield the most effective model complexity for the training and validation sets, using techniques such as cross-validation and bootstrapping (Wahba et al., 1998; Gong, 2006; Mohr & van Rijn, 2023). In short, there must be a match between model complexity and data complexity to ensure that a model trained on training data performs well on unseen test data (Geman, Bienenstock & Doursat, 1992). Identifying optimal model parameters necessitates multiple training iterations with data subsets or random subsamples, a computationally intensive process for various parameter combinations, particularly with large datasets (Mohr & van Rijn, 2023; Emmert-Streib & Dehmer, 2019).

Logistic regression assumes a linear relationship between features and class distributions without demanding a specific statistical distribution. However, it is strictly required that features exhibit no multicollinearity. Regularization emerges as a vital solution to address multicollinearity and effectively manage model complexity. Aside from the intercept term, weight regularization applied to features encourages sparsity in the weight vector, resulting in fewer and more distinct features. This reduces multicollinearity and complexity, improving the model’s generalization abilities (Shi et al., 2010; Zhang et al., 2021; Avalos, Grandvalet & Ambroise, 2003). Additionally, in scenarios with numerous features available but limited training data, the risk of overfitting can arise without an effective regularization technique (Vapnik, 2006). Thus, pursuing a sparse solution is crucial to reduce overfitting and prevent data’s random pattern memorization (Zhang et al., 2021).

L2-norm (also known as Ridge) regularization, first introduced by Tikhonov (1963), penalizes the magnitudes of the weights to enforce smaller values, thereby increasing model robustness. In the literature, other weight regularization techniques (Tikhonov & Arsenin, 1979; Morozov, 2012; Bertero, 1986) also enforce sparsity for better robustness and increased model generalization. Among these techniques, the L0-norm1 provides the sparsest solutions (Wang, Chen & Yang, 2022; Greenwood et al., 2020). Nevertheless, using the L0-norm is challenging since it is not convex. Therefore, the L1-norm, namely the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), was proposed as a convex relaxation of the non-convex L0-norm. Each regularization approach imposes different constraints on model training. For example, the L2-norm imposes a squared penalty (Hoerl & Kennard, 1970) and the L1-norm imposes an absolute penalty on the model's parameters (Ozgur, Nar & Erdem, 2018; Wei et al., 2019; Xie et al., 2023), while the L0-norm imposes a constant penalty for all non-zero weights (Wang, Chen & Yang, 2022; Greenwood et al., 2020). Note that the Bayesian information criterion (BIC) (Schwarz, 1978) and the Akaike information criterion (AIC) (Akaike, 1998), well-known model selection criteria, are special cases of L0-norm regularization. Sparse models using the L0-norm and L1-norm force many coefficients to shrink to exactly zero, while the L2-norm tends to produce model parameters close to, but not precisely, zero (Pereyra et al., 2017). Thus, the L1-norm and L0-norm are employed in feature selection to avoid overfitting. On the other hand, the L2-norm has better computational efficiency than the L1-norm and is even more efficient than the L0-norm. Therefore, several studies prefer to use L1-norm or L2-norm regularized logistic regression for large-scale data (Koh, Kim & Boyd, 2007; Shi et al., 2010; Jovanovich & Lazar, 2012; Su, 2020). Elastic-net combines the L1-norm and L2-norm penalties, so it tends to select more features than LASSO, while its computational efficiency is similar to that of Ridge. Recently, some studies have tackled L0-norm regularization for logistic regression while dealing with the challenges of using the non-convex L0-norm regularization term (Ming & Yang, 2024; Hazimeh, Mazumder & Nonet, 2023; Knauer & Rodner, 2023; Deza & Atamturk, 2022). Although not proposed for logistic regression, the use of a fractional norm as a penalty term was proposed as a flexible and practical approach for enforcing smoothness in image denoising (Ozcan, Sen & Nar, 2016), where it is also better suited for handling the generalized Gaussian distribution of the variables (Bernigaud et al., 2021).

As already stated in the literature, the maximum likelihood estimation of the logistic regression has some shortcomings:

  • It may struggle to handle massive sparse datasets (Holland & Welsch, 1977; Li, Zhu & Wang, 2023) effectively.

  • Logistic regression often struggles with imbalanced data, tending to favor the majority class and resulting in poor performance on the minority class (King & Zeng, 2001).

  • For the training set where the number of samples is smaller than the number of features, directly solving the logistic regression is an ill-posed problem (Liu, Chen & Ye, 2009).

  • It is sensitive to anomalous data and collinearity (Feng et al., 2014; Midi, Sarkar & Rana, 2010).

  • Oversampling data instances may decrease estimation performance and increase computational expenses (Wang, 2020).

One strategy to mitigate these challenges involves utilizing the iteratively reweighted least squares method with an appropriate solver to construct binary logistic regression classifiers, particularly for large-scale datasets (Paciorek, 2007; Rouhani-Kalleh, 2007). Employing appropriate numerical solvers can also address these problems. Hence, numerous numerical solvers have been deployed to tackle these challenges, such as gradient descent and its variants, Newton’s method, and quasi-Newton methods. These solvers play a crucial role in training models efficiently by iteratively adjusting parameters to minimize a specified objective function. Although these approaches attempt to eliminate anomalies or collinearity problems in the data, working with many solvers with different regularizers poses a challenging issue for researchers (Liu, Chen & Ye, 2009). The selection of the solver typically depends on the specific characteristics of the data and the nature of the problem under consideration.

Scikit-learn (Pedregosa et al., 2011), a versatile Python library integrating a wide range of state-of-the-art machine learning algorithms, has become one of the most popular choices among many libraries due to its simplicity, comprehensive documentation, and extensive community support. The LogisticRegression class in scikit-learn already incorporates several optimization methods (see Table 1), each tailored for different regularizers, data sizes, and computational needs. Similarly, other open-source libraries, including cuML and Statsmodels, provide their own implementations of logistic regression, each with a variety of solvers and parameter configurations to enhance flexibility and efficiency. In particular, cuML offers graphics processing unit (GPU)-accelerated solvers for large-scale computations and Statsmodels provides robust statistical modeling options for binary response variables.

Table 1:
Minimizers employed as solvers in scikit-learn’s logistic regression algorithm.
Solver                                          L1   L2   ElasticNet   No regularization
LBFGS¹                                          –    ✓    –            ✓
LIBLINEAR (Fan et al., 2008)                    ✓    ✓    –            –
Newton-CG¹                                      –    ✓    –            ✓
Newton-Cholesky¹                                –    ✓    –            ✓
SAG (Schmidt, Roux & Bach, 2017)                –    ✓    –            ✓
SAGA (Defazio, Bach & Lacoste-Julien, 2014)     ✓    ✓    ✓            ✓
DOI: 10.7717/peerj-cs.2579/table-1

Note:

See Nocedal & Wright (2006) for further information on these common optimization approaches.

With increasing dataset sizes, logistic regression must become more efficient and scalable. In response, advanced versions have emerged, designed to prioritize speed and accuracy. These innovative approaches accelerate learning by leveraging techniques such as dimensionality reduction, parallel processing, and advanced data structures.

Logistic regression training is formulated to solve an optimization problem based on likelihood maximization. However, calculating the gradient with all training data in each iteration for large datasets becomes computationally demanding. Therefore, Song et al. (2021) proposed an adaptive sampling method in which the gradient estimation is divided into several sub-problems according to the data size, and then each sub-problem is solved independently before combining the results of all sub-problems. Shi et al. (2010) suggested a hybrid algorithm that merges two optimization iterations: one that is fast and memory-efficient and another that is slower but yields more precise results.

Various approaches have been proposed to enhance the speed of logistic regression, which can generally be categorized into efficient numerical methods and software-based improvements. Table 1 summarizes the solvers employed in scikit-learn. A notable earlier study by Komarek & Moore (2003) utilizes Cholesky decomposition to accelerate logistic regression. However, since many contemporary methods now incorporate Cholesky decomposition within NumPy or employ more advanced solvers, this study is no longer considered state-of-the-art. Additionally, some approaches leverage field programmable gate arrays (FPGA) and GPUs (Wienbrandt et al., 2019), stochastic gradient descent, and mini-batch (Yang et al., 2019; Liang et al., 2020; Jurafsky & Martin, 2024) methods to improve the speed of logistic regression. These techniques are often complementary and can be integrated with other methods, representing practical efforts to accelerate logistic regression further.

We introduce a novel numerical approach that remarkably improves the training efficiency of binary logistic regression without relying on dimension reduction or parallel processing and without requiring feature independence. This efficiency is obtained by employing a novel Soft-Plus approximation, which enables the reformulation of logistic regression parameter estimation in matrix-vector form. Additionally, unlike the multiple solvers employed in scikit-learn's logistic regression for different regularization schemas, we present a single flexible Lf-norm regularization approach that also provides the flexibility to include or exclude penalization of the intercept term. Our regularization approach supports a range of weight penalties through a unified codebase with a specifically designed numerical minimizer. To demonstrate the computational efficiency of our method, we conducted several quantitative experiments comparing our method against the widely used scikit-learn library. Here, we used benchmark data from OpenML, an open machine learning data repository, and synthetic data designed for controlled experiments. Experiments demonstrate that our method effectively handles collinear features and large data while providing superior efficiency with minimal to no loss in accuracy. Our fast binary logistic regression (FBLR) exclusively utilizes the Python NumPy library. As a result, it benefits from all current capabilities of NumPy and upcoming enhancements while keeping the codebase simple. This streamlined yet powerful approach allows it to surpass the processing speed of scikit-learn's logistic regression (LR) implementation, which relies on well-established and highly optimized libraries. Our experiments demonstrate that FBLR achieves an average speedup of roughly an order of magnitude, with the maximum observed speedup reaching 48.3 times.

Proposed method

Fast binary logistic regression

For logistic regression, the likelihood function in the context of maximum likelihood estimation (MLE) is formulated based on the assumption that each observation is an independent Bernoulli trial. Consider a dataset with $n$ observations $(x_i, y_i)$, where each $x_i \in \mathbb{R}^d$ pairs with a binary outcome $y_i \in \{0, 1\}$. Here, $i$ is the data index, and $d$ is the number of features, including the intercept. The probability that the logistic regression model assigns to the outcome being 1 is then defined as:

$P(y_i = 1 \mid x_i) = \dfrac{e^{x_i w}}{1 + e^{x_i w}}$ where $w \in \mathbb{R}^d$ represents the model parameters, namely the weight vector, and $x_i = [1 \; x_{i,1} \; x_{i,2} \; \cdots \; x_{i,d-1}]$ is the $i$th data vector. For binary classification using logistic regression, the likelihood $l(w)$ of observing the entire dataset given $w$ is the product of the probabilities of each observation, as given in Eq. (2):

$l(w) = \prod_{i:\, y_i = 1} P(y_i = 1 \mid x_i) \prod_{i:\, y_i = 0} P(y_i = 0 \mid x_i) = \prod_{i:\, y_i = 1} P(x_i) \prod_{i:\, y_i = 0} \bigl(1 - P(x_i)\bigr)$

The optimum model parameter is found by maximizing $l(w)$ with respect to the weight vector $w$ for the given input data and corresponding target class labels, known as MLE:

$\arg\max_{w}\; l(w)$

Optimizing $l(w)$ presents challenges due to the presence of product terms. However, instead of maximizing $l(w)$, one can opt to minimize $-\log l(w)$. This approach is feasible because the maximum of $l(w)$ corresponds to the minimum of $-\log l(w)$, and $-\log l(w)$ is significantly easier to optimize. Then, we simplify $-\log l(w)$ as below:

$-\log l(w) = -\Bigl[\sum_{i:\, y_i = 1} \log P(x_i) + \sum_{i:\, y_i = 0} \log\bigl(1 - P(x_i)\bigr)\Bigr] = \sum_{i:\, y_i = 0} \log\bigl(1 + e^{x_i w}\bigr) - \sum_{i:\, y_i = 1} \bigl(x_i w - \log(1 + e^{x_i w})\bigr) = \sum_i \bigl(\log(1 + e^{x_i w}) - y_i x_i w\bigr).$

To develop an efficient minimizer for $-\log l(w)$, we approximate the Soft-Plus function, $\log(1 + e^{x_i w})$, at $\hat{w}$ to obtain a quadratic form, where $\hat{w}$ is a proxy constant for $w$. Details of our novel quadratic Soft-Plus approximation are provided in "Approximation of Soft-Plus". By substituting this Soft-Plus approximation into Eq. (4), we derive the following form of $-\log l(w)$:

$-\log l(w) \approx \sum_i \Bigl(z_i (x_i w)^2 + \bigl(\tfrac{1}{2} - y_i\bigr)(x_i w) + \log 2\Bigr)$ where

$z_i = \begin{cases} \dfrac{1}{8}, & x_i \hat{w} = 0 \\[6pt] \dfrac{\log(1 + e^{x_i \hat{w}}) - \log 2}{(x_i \hat{w})^2} - \dfrac{1}{2 x_i \hat{w}}, & \text{otherwise.} \end{cases}$

Finally, we put Eq. (5) into matrix-vector form as below:

$-\log l(w) \approx (Xw)^{\top} Z (Xw) + \bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + n \log 2 = w^{\top} X^{\top} Z X w + \bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + n \log 2$ where $X \in \mathbb{R}^{n \times d}$ is the input data in matrix form, $y \in \mathbb{R}^{n \times 1}$ is the target data in vector form, $w \in \mathbb{R}^{d \times 1}$ is the weight vector, and $Z \in \mathbb{R}^{n \times n}$ is the diagonal matrix formed from the $z_i$.
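To make this surrogate concrete, the short NumPy sketch below evaluates the right-hand side of the matrix-vector form above; the helper names (`z_coeffs`, `surrogate_neg_log_likelihood`) are ours and are not part of the released FBLR code.

```python
import numpy as np

def z_coeffs(t, eps=1e-12):
    """Coefficients z_i of the quadratic Soft-Plus surrogate, Eq. (6)."""
    t = np.asarray(t, dtype=float)
    z = np.full_like(t, 0.125)                 # limit value 1/8 when x_i w_hat = 0
    nz = np.abs(t) > eps
    sp = np.logaddexp(0.0, t[nz])              # numerically stable log(1 + e^t)
    z[nz] = (sp - np.log(2.0)) / t[nz] ** 2 - 0.5 / t[nz]
    return z

def surrogate_neg_log_likelihood(X, y, w, w_hat):
    """Quadratic surrogate of -log l(w) built around the proxy point w_hat."""
    z = z_coeffs(X @ w_hat)                    # diagonal entries of Z
    Xw = X @ w
    return Xw @ (z * Xw) + (0.5 - y) @ Xw + X.shape[0] * np.log(2.0)
```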

Various forms of weight regularization have been proposed in the literature for logistic regression, each requiring distinct numerical approaches. Alternatively, we use smooth Lf-norm (Ozcan, Sen & Nar, 2016) regularization to achieve a unified framework that only requires a single numerical approach, incorporating Ridge, pseudo ElasticNet, LASSO, L0-norm (notably, a pseudo-norm), and other fractional norm regularizations.

For the quadratic approximation, let $\hat{\rho}$ be a constant proxy for $\rho$ and define $u$ as a constant coefficient, $u = (|\hat{\rho}|^{2-f} + \epsilon)^{-1}$. Then, the quadratically approximated $L_f$-norm is

$L_f(\rho) \approx \rho^2 (|\hat{\rho}|^{2-f} + \epsilon)^{-1} = u\rho^2.$

Refer to "Approximation of Lf-Norm" for details and justification of the quadratic $L_f$-norm approximation.

An iterative minimization approach (solver) is now required, as we utilize quadratic approximations of $\log(1 + e^{x_i w})$ (Soft-Plus) and of the smooth $L_f$-norm. Both approximations are performed on $w$ at each iteration, and they are accurate near the point of approximation, namely $\hat{w}$, which is a proxy constant for $w$. To ensure that the new solution remains close to the approximation point at the current iteration, we introduce a slow-step-regularization (SSR) term with a coefficient $\lambda$. Finally, we normalize the terms by the data count and the weight vector dimension.

$J^{(k)}(w) = -\tfrac{1}{n}\log l(w) + \tfrac{\lambda}{2d}(w - \hat{w})^{\top}(w - \hat{w}) + \tfrac{\gamma}{2d}\, p^{\top} L_f(w)$ where $k$ is the iteration index, $w$ is the weight vector, $\hat{w}$ is a proxy constant for $w$ at the $k$th iteration, $\lambda$ is the SSR coefficient, and $\gamma$ is the $L_f$-norm regularization coefficient. Here, $p$ is the vector used to prevent penalizing the intercept term, defined as $p = [\ldots, p_j, \ldots]$, where $p_j$ is 0 for the intercept (to avoid penalizing it) and 1 for all other coefficients. Then, $p^{\top} L_f(w)$ is approximated as:

$p^{\top} L_f(w) = \sum_{j=1}^{d} p_j L_f(w_j) \approx \sum_{j=1}^{d} h_j w_j^2 = w^{\top} H w$ where $H \in \mathbb{R}^{d \times d}$ is the diagonal matrix formed from the $h_j$ values defined below ($\epsilon$ is a small positive constant with default $\epsilon = 10^{-10}$):

$h_j = p_j\,(|\hat{w}_j|^{2-f} + \epsilon)^{-1}.$

The obtained cost function $J^{(k)}(w)$, comprising only quadratic terms, linear terms, and constants, can be expressed in matrix-vector form.

$J^{(k)}(w) = \tfrac{1}{n} w^{\top} X^{\top} Z X w + \tfrac{1}{n}\bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + \log 2 + \tfrac{\lambda}{2d}(w - \hat{w})^{\top}(w - \hat{w}) + \tfrac{\gamma}{2d}\, w^{\top} H w.$

Afterwards, we can minimize $J^{(k)}(w)$ with respect to $w$ at the $k$th iteration by taking the derivative of $J^{(k)}(w)$ with respect to $w$ and setting it equal to zero:

$\dfrac{\partial J^{(k)}(w)}{\partial w} = \tfrac{2}{n} X^{\top} Z X w + \tfrac{1}{n} X^{\top}\bigl(\tfrac{1}{2} - y\bigr) + \tfrac{\lambda}{d}(w - \hat{w}) + \tfrac{\gamma}{d} H w = 0$, which can be arranged as:

$\Bigl(\tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H\Bigr) w = \tfrac{\lambda}{d}\hat{w} + \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$ where $I \in \mathbb{R}^{d \times d}$ is the identity matrix. Note that Eq. (14) can be represented as a linear system $A w^{(k+1)} = b$. Finally, the solution can be represented as below, where $Z$ and $H$ are updated in each iteration. In our case, $A$ is a symmetric matrix, and $v$ is a constant vector that can be computed before the iterative minimization.

Initialization:
$v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$

Iteration:
$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H$
$b = \tfrac{\lambda}{d} w^{(k)} + v$
Solve $A w^{(k+1)} = b$
DOI: 10.7717/peerj-cs.2579/table-11

The data matrix $X$ can be rank-deficient due to collinear features or insufficient data samples. When the data matrix $X$ is rank-deficient and no regularization is applied ($\lambda = 0$), the matrix $A$ may become a symmetric positive semi-definite matrix. Nonetheless, if $X$ is of full rank or regularization is applied ($\lambda > 0$), the matrix $A$ becomes symmetric positive definite. To mitigate the challenges due to rank-deficiency and the subsequent numerical instabilities, we utilize a low-rank approximation via singular value decomposition (SVD) (Lawson & Hanson, 1995; Hansen, 1990) for approximating the data matrix $X \in \mathbb{R}^{n \times d}$ (Ye, 2005).

$U, S, V = \text{Truncated-SVD}(X)$ where $X = U S V^{\top}$, $U \in \mathbb{R}^{n \times d}$, $S \in \mathbb{R}^{d \times d}$, and $V \in \mathbb{R}^{d \times d}$. Then

$A = \tfrac{2}{n}(U S V^{\top})^{\top} Z (U S V^{\top}) + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} V S U^{\top} Z U S V^{\top} + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} V S (U^{\top} Z U) S V^{\top} + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H.$

We use an $r$-rank approximation of the $U$, $S$, and $V$ matrices such that $X \approx \hat{U}\hat{S}\hat{V}^{\top}$. Note that $\hat{U} \in \mathbb{R}^{n \times r}$, $\hat{S} \in \mathbb{R}^{r \times r}$, and $\hat{V} \in \mathbb{R}^{d \times r}$ denote the low-rank approximations of the $U$, $S$, and $V$ matrices, respectively. Here, $\hat{U}$ and $\hat{V}$ are orthonormal matrices and $\hat{S}$ is a diagonal matrix. In this low-rank approximation, $r$ is the rank such that $r \le d$, and $r$ is chosen such that almost all of the energy (by eigenvalues) is preserved.

We initialize $w^{(0)}$ using least squares to start from a reasonable initial point. For computational efficiency, we also use the SVD decomposition $X = U S V^{\top}$ in the least-squares equation:

$w^{(0)} = (X^{\top} X)^{-1} X^{\top} y = (V S U^{\top} U S V^{\top})^{-1} V S U^{\top} y = V S^{-1} U^{\top} y$

To circumvent numerical challenges and gain further computational efficiency, we employ the low-rank approximation $w^{(0)} = \hat{V}\hat{S}^{-1}\hat{U}^{\top} y = F y$, where $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$. Computing $F$ is computationally efficient since inverting the diagonal matrix $\hat{S}$ is cheap, and computing $w^{(0)} = F y$ is cheap as well. For both the regularized and unregularized cases, a common initialization is defined as follows:

$U, S, V = \text{Truncated-SVD}(X)$ since $X = U S V^{\top}$
Determine the rank $r$ such that $X \approx \hat{U}\hat{S}\hat{V}^{\top}$
$\hat{U} = r\text{-rank}(U)$, $\hat{S} = r\text{-rank}(S)$, and $\hat{V} = r\text{-rank}(V)$
$w^{(0)} = F y$ where $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$
DOI: 10.7717/peerj-cs.2579/table-12
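As an illustration only, this initialization can be sketched in NumPy as below. The energy-based rank cut shown here is a plain cumulative-energy rule, not the log-eigenvalue rule of "Determining the Rank", and the helper name is hypothetical.

```python
import numpy as np

def init_weights(X, y, energy=0.999999):
    """Least-squares initialization w0 = F y using a truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) Vt
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)            # simple cumulative-energy rule
    r = int(np.searchsorted(cum, energy)) + 1           # smallest r reaching the target energy
    U_hat, s_hat, V_hat = U[:, :r], s[:r], Vt[:r, :].T
    F = V_hat @ np.diag(1.0 / s_hat) @ U_hat.T          # F = V_hat S_hat^{-1} U_hat^T
    return F @ y, (U_hat, s_hat, V_hat, F)
```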

There are two paths for the developed iterative minimization: (a) with regularization and (b) without regularization. It should be noted that applying Lf-norm regularization requires setting a positive γ value, which in turn necessitates assigning a positive λ parameter to enable SSR regularization.

a) With regularization ($\lambda > 0$ and $\gamma > 0$)

Recall that we first construct the matrix $A$ and the vector $b$, and then solve the linear system $A w^{(k+1)} = b$. Using $v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$, we have:

$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H \quad\text{and}\quad b = \tfrac{\lambda}{d} w^{(k)} + v.$

With $G = \hat{S}\hat{V}^{\top}$, the low-rank approximation $X \approx \hat{U}\hat{S}\hat{V}^{\top}$ leads to $X \approx \hat{U} G$:

$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H \approx \tfrac{2}{n}(\hat{U}\hat{S}\hat{V}^{\top})^{\top} Z (\hat{U}\hat{S}\hat{V}^{\top}) + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H.$

Next, we compute $v$ (needed for $b$) using its $r$-rank approximation, as below:

$v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr) \approx \tfrac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr).$

Finally, the numerical minimization approach with regularization is as follows:

Initialization:
$G = \hat{S}\hat{V}^{\top}$ where $X \approx \hat{U}\hat{S}\hat{V}^{\top}$, with the low-rank estimation obtained using SVD
$v = \tfrac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr)$

Iteration:
$A = \tfrac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H$
$b = \tfrac{\lambda}{d} w^{(k)} + v$
Solve $A w^{(k+1)} = b$
DOI: 10.7717/peerj-cs.2579/table-13

Thanks to the low-rank approximation applied to the data matrix $X$, in addition to the applied $L_f$-norm and SSR regularization, the proposed cost function becomes strictly convex. Hence, the linear system $A w^{(k+1)} = b$ is well-conditioned, ensuring a unique solution for $w^{(k+1)}$, whereas the original logistic regression cost function remains only convex due to potential rank deficiencies in $X$.

b) Without regularization ($\lambda = 0$ and $\gamma = 0$)

First, $A$ and $b$ are constructed; then the linear system $A w^{(k+1)} = b$ is solved.

$\tfrac{2}{n}(X^{\top} Z X)\, w^{(k+1)} = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr) \;\Rightarrow\; w^{(k+1)} = \tfrac{1}{2}(X^{\top} Z X)^{-1} X^{\top}\bigl(y - \tfrac{1}{2}\bigr).$

Let us use the low-rank approximation $X \approx \hat{U}\hat{S}\hat{V}^{\top}$:

$w^{(k+1)} \approx \tfrac{1}{2}\Bigl((\hat{U}\hat{S}\hat{V}^{\top})^{\top} Z (\hat{U}\hat{S}\hat{V}^{\top})\Bigr)^{-1}(\hat{U}\hat{S}\hat{V}^{\top})^{\top}\bigl(y - \tfrac{1}{2}\bigr) \approx \tfrac{1}{2}\,\hat{V}\hat{S}^{-1}\hat{U}^{\top} Z^{-1}\hat{U}\hat{U}^{\top}\bigl(y - \tfrac{1}{2}\bigr) = \tfrac{1}{2}\, F Z^{-1} y_q \quad\text{where}\quad y_q = \hat{U}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr).$

The final iterative numerical minimization approach without regularization is as follows:

Initialization:
$y_q = \hat{U}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr)$

Iteration:
$w^{(k+1)} \approx \tfrac{1}{2}\, F Z^{-1} y_q$
DOI: 10.7717/peerj-cs.2579/table-14

Note that $Z$ is a diagonal matrix with all positive entries, making its inverse extremely efficient to compute. Additionally, $F Z^{-1} y_q$ can be calculated efficiently, as computing $Z^{-1} y_q$ requires only $n$ operations, and multiplying $F$ by the resulting vector requires only $d \times n$ operations. As a result, each iteration without regularization remains highly efficient. In addition to computational efficiency, it is guaranteed that $F Z^{-1}$ is always well-conditioned, and a unique solution for $w^{(k+1)}$ exists since $Z$, as a diagonal matrix with positive entries, is always invertible. Thus, the employed low-rank approximation regularizes the solution to mitigate possible collinearity in the data matrix $X$.
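A minimal sketch of this regularization-free loop is given below; it reuses the hypothetical `z_coeffs`, `F`, `U_hat`, and `w0` objects from the earlier snippets and is illustrative rather than the released implementation.

```python
import numpy as np
# z_coeffs is the helper defined in the earlier Soft-Plus snippet

def fblr_no_reg(X, y, F, U_hat, w0, max_iter=10, tol=1e-3):
    """Sketch of the regularization-free iteration: w <- (1/2) F Z^{-1} y_q."""
    yq = U_hat @ (U_hat.T @ (y - 0.5))     # fixed across iterations
    w = w0
    for _ in range(max_iter):
        w_prev = w
        z = z_coeffs(X @ w_prev)           # diagonal of Z, Eq. (6)
        w = 0.5 * F @ (yq / z)             # Z^{-1} y_q is O(n); F times a vector is O(nd)
        if np.linalg.norm(w - w_prev) <= tol:
            break
    return w
```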

Implementation

This section outlines the implementation of the proposed method, with the low-rank approximation described in Algorithm 1 and the proposed FBLR method presented in Algorithm 2. In Algorithm 1, $X$ is the input data matrix, $\xi$ is the energy percentile (default 99.9999), and the DIMENSION function returns the number of rows ($n$) and the number of features ($d$). The SVD(∗) function carries out singular value decomposition, choosing the most suitable variant (truncated SVD, randomized SVD (refer to "Randomized SVD"), or SVD with row reduction (SVD-RR, refer to "SVD with Row Reduction")) based on an evaluation of $n$ and $d$. Lastly, the RANK function (see "Determining the Rank") in Algorithm 1 finds the matrix's rank by analyzing the eigenvalues derived from $S$, consequently generating the low-rank matrices.

Algorithm 1 :
Procedure to obtain a low-rank approximation of the data matrix X.
1: procedure LowRankApproximation($X$, $\xi$)
2:    ▷ Returns the dimensions of the data matrix $X$
3:    $n, d \leftarrow$ DIMENSION($X$)    ▷ $n$: #rows, $d$: #features
4:    ▷ Utilizing either truncated SVD or randomized SVD
5:    $U, S, V \leftarrow$ SVD(∗)($X$)    ▷ see "SVD with Row Reduction (SVD-RR)" and "Randomized SVD"
6:    ▷ Computes the rank of matrix $X$ using eigenvalues
7:    $r \leftarrow$ RANK($S$, $\xi$)    ▷ see "Determining the Rank"
8:    ▷ The low-rank components of the matrices are extracted
9:    $\hat{U} \leftarrow U[:, 1\!:\!r]$
10:   $\hat{S} \leftarrow S[1\!:\!r, 1\!:\!r]$
11:   $\hat{V} \leftarrow V[:, 1\!:\!r]$
12:   return $n, d, r, \hat{U}, \hat{S}, \hat{V}$
13: end procedure
DOI: 10.7717/peerj-cs.2579/table-101
Algorithm 2 :
Fast binary logistic regression (FBLR).
1: procedure FBLR($X$, $y$, $f$, $\lambda$, $\gamma$, $\xi$, $K$, $C_{\text{tolerance}}$)
2:    ▷ Initialization
3:    $n, d, r, \hat{U}, \hat{S}, \hat{V} \leftarrow$ LowRankApproximation($X$, $\xi$)
4:    $F \leftarrow \hat{V}\hat{S}^{-1}\hat{U}^{\top}$
5:    $w \leftarrow F y$    ▷ Initializing weights via the pseudo-inverse
6:    if ($\lambda > 0$ or $\gamma > 0$) then
7:       $G \leftarrow \hat{S}\hat{V}^{\top}$
8:       $v \leftarrow \frac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \frac{1}{2})\bigr)$
9:    else
10:      $y_q \leftarrow \hat{U}\bigl(\hat{U}^{\top}(y - \frac{1}{2})\bigr)$
11:   end if
12:   ▷ Iteration
13:   for $k \leftarrow 0$ to $K$ do
14:      $\hat{w} \leftarrow w$
15:      $Z \leftarrow \operatorname{diag}(z)$    ▷ $z$ = vector of $z_i$, Eq. (6)
16:      if ($\lambda > 0$ or $\gamma > 0$) then
17:         $H \leftarrow \operatorname{diag}(h)$    ▷ $h$ = vector of $h_j$, Eq. (11)
18:         $A \leftarrow \frac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \frac{\lambda}{d} I + \frac{\gamma}{d} H$
19:         $b \leftarrow \frac{\lambda}{d} w^{(k)} + v$
20:         $w \leftarrow$ SOLVE($A$, $b$)    ▷ Applying Cholesky decomposition to solve $Aw = b$
21:      else
22:         $w \leftarrow \frac{1}{2} F Z^{-1} y_q$
23:      end if
24:      if ($k \geq 3$ and $\|w - \hat{w}\| \leq C_{\text{tolerance}}$) then
25:         break
26:      end if
27:   end for
28:   ▷ Finalization
29:   return $w$
30: end procedure
DOI: 10.7717/peerj-cs.2579/table-102

The pseudocode for the proposed FBLR method is given in Algorithm 2. In this algorithm, operations that remain constant throughout iterations are performed upfront in the initialization phase. First, low-rank approximation of the matrix X is obtained using Algorithm 1. Depending on whether regularization is applied, various matrices and vectors are precomputed accordingly. Subsequently, at each iteration, a dense linear system is constructed and solved using the Cholesky decomposition, leveraging the fact that A is a symmetric positive definite matrix when regularization is employed (line 20 in Algorithm 2). Even when regularization is not employed, computing w is still well-conditioned (line 22 in Algorithm 2) since Z is a diagonal matrix with all positive entries.

Using Python v3.12.1, NumPy v1.26.4, and scikit-learn v1.3.2, the proposed FBLR method in Algorithm 2 is implemented as the Python class FastLogisticRegressionLowRank, extending scikit-learn's abstract BaseEstimator and ClassifierMixin classes. We use NumPy for all matrix and vector operations, and our NumPy configuration uses OpenBLAS as the backend for basic linear algebra subprograms (BLAS) operations in our environment. For additional efficiency, NumPy also utilizes CPU instructions such as single instruction multiple data (SIMD).
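The wrapper pattern can be illustrated with the simplified skeleton below; the class name echoes the paper's, but the body is a stand-in that omits the actual FBLR solver and only shows how such an estimator plugs into the scikit-learn API.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class FastLogisticRegressionLowRankSketch(BaseEstimator, ClassifierMixin):
    """Simplified stand-in showing the scikit-learn estimator interface."""

    def __init__(self, f=0.0, lam=0.0, gamma=0.0, max_iter=10, tol=1e-3):
        self.f, self.lam, self.gamma = f, lam, gamma
        self.max_iter, self.tol = max_iter, tol

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
        # ... the low-rank initialization and FBLR iterations of Algorithm 2 go here ...
        self.coef_ = np.zeros(Xb.shape[1])               # placeholder for the learned w
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        p = 1.0 / (1.0 + np.exp(-Xb @ self.coef_))
        return np.column_stack([1.0 - p, p])

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)
```

Because the class extends BaseEstimator and ClassifierMixin, it can be dropped directly into scikit-learn utilities such as GridSearchCV and cross_val_score.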

Computational complexity analysis

In Algorithm 2, the time complexity of the initialization phase is $O(nd^2)$, since the time complexity of both the SVD used in the low-rank approximation and of computing $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$ is $O(nd^2)$, while the remaining operations have lower time complexity. Furthermore, the computational complexity per iteration is $O(nd^2)$ when incorporating regularization ($L_f$-norm & SSR) and $O(nd)$ without regularization. Noting that $k$ represents the number of iterations executed until Algorithm 2 converges, the total time complexity of the FBLR method is $O(nd^2k)$ for cases with regularization and $O(nd\max(d,k))$ when no regularization is applied. In the case without regularization, the time complexity $O(nd\max(d,k))$ is obtained since the computational complexity of the initialization is $O(nd^2)$ and the total cost of the iterations is $O(ndk)$. The total computational complexity is $O(nd^2) + O(ndk) = O(nd \cdot d) + O(nd \cdot k)$, which simplifies to $O(nd\max(d,k))$. For the proposed FBLR method, $k$ is typically less than or equal to the maximum iteration count ($K$), whose default value is 10.

Our time complexity analysis reveals that the FBLR method has linear time complexity with respect to the number of rows ( n) and at most quadratic time complexity with respect to the data dimension ( d). Furthermore, for the proposed FBLR method, the maximum number of iterations is only 10, which is considerably low. Thus, the maximum iteration count of the proposed FBLR method is small compared to logistic regression in scikit-learn with various solvers (see Table 2), demonstrating its efficiency.

Table 2:
Maximum number of iterations for the scikit-learn logistic regression solvers.
Solver            Default maximum iteration (K)   Typical iteration count
LIBLINEAR         100                             100 to 1,000
LBFGS             100                             100 to 500 (up to 1,000)
Newton-CG         100                             100 to 500 (up to 1,000)
Newton-Cholesky   100                             100 to 500 (up to 1,000)
SAG               1,000                           1,000 to 5,000
SAGA              1,000                           1,000 to 5,000
DOI: 10.7717/peerj-cs.2579/table-2

The space complexity of the proposed FBLR method is at least the size of the input data matrix $X \in \mathbb{R}^{n \times d}$. We therefore ignore the vectors employed in the FBLR method, as they have dimensions of $n \times 1$ or $d \times 1$, which are much smaller than $n \times d$, the size of the data matrix $X$, and focus only on the matrices employed in the FBLR method, as outlined below:

  • $X, U, F \in \mathbb{R}^{n \times d}$ and $V, G, A \in \mathbb{R}^{d \times d}$ are dense matrices, so their space complexity is $O(nd)$.

  • $Z, H \in \mathbb{R}^{n \times n}$ and $S \in \mathbb{R}^{d \times d}$ are sparse diagonal matrices, so their space complexity is $O(n)$.

Therefore, space complexity of the proposed FBLR method is O(nd).

Results

We conducted experiments on a system running Ubuntu Linux 20.04 with an Intel Core i9-10900KF CPU (10 cores) and 64 GB of RAM. Note that NumPy leverages the SIMD extensions supported by a CPU to a reasonable extent. For the Intel Core i9-10900KF CPU, the SIMD extensions utilized by NumPy are SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, F16C, and FMA3.

For the proposed FBLR method, the $\lambda$ and $\gamma$ parameters are set to zero for all experiments unless stated otherwise. Also, fixed default values are used for the parameters $\epsilon$, $\xi$, $K$, and $C_{\text{tolerance}}$. The parameter $\epsilon$, a small positive constant for the applied $L_f$-norm approximation, is set to $10^{-10}$. Also, $\xi$, representing the energy percentile for low-rank approximation, is 99.9999, excluding zeros or very small eigenvalues. The maximum number of iterations is set to $K = 10$, and the convergence tolerance is set to $C_{\text{tolerance}} = 10^{-3}$. From this point forward, whenever logistic regression (LR) is mentioned in the text, it refers specifically to the logistic regression within scikit-learn.

Datasets

Real-world dataset

We used datasets with binary labels from OpenML in the experiments. We focused on datasets with no missing data and with training times exceeding 3 s when utilizing LR; this strategy selects the datasets that would benefit the most from speed enhancements. Our analysis compared the classification performance and execution efficiency of LR and our FBLR method across different scenarios, such as balanced and imbalanced datasets of medium to large sizes. Additionally, we evaluated LR and FBLR on UCI's HEPMASS dataset, which features rich data from key physics experiments targeting exotic particle discovery and poses a binary classification challenge. Detailed information on the datasets used, along with their attributes, is provided in Table 3. In a binary dataset, the term 'majority' denotes a frequency between 0.5 and 1.0 for the class with more occurrences than the other class. In Table 3, the creditcard dataset is imbalanced, as indicated by its high majority value.

Table 3:
Datasets utilized in the experiments along with their attributes.
Name n (#samples) d (#features) Majority%
Kits 1,000 27,648 52%
Road-safety 111,762 32 50%
Creditcard 284,807 30 99%
Airlines 539,383 7 55%
Colon 5,100,000 62 50%
HEPMASS 10,500,000 28 50%
DOI: 10.7717/peerj-cs.2579/table-3

Synthetic dataset

The make_classification method in the scikit-learn library is a mechanism for creating random synthetic data with a desired number of classes, samples, and features. n_redundant is responsible for generating linear combinations of informative features, weights sets the sample distribution across classes to introduce imbalance, and flip_y alters the class of a specified fraction of samples at random, introducing noise and increasing the complexity of the classification task.
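For example, a synthetic set with redundant (collinear) features, class imbalance, and label noise can be generated as follows; the specific values are illustrative rather than those used in the experiments.

```python
from sklearn.datasets import make_classification

# Illustrative synthetic set: 8 features, 2 of them linear combinations of
# informative ones, mild class imbalance, and 1% label noise.
X, y = make_classification(
    n_samples=100_000, n_features=8, n_informative=5, n_redundant=2,
    weights=[0.7, 0.3], flip_y=0.01, random_state=0,
)
```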

Experiment design

Using real-world data, three experiments were designed to evaluate classification performance and execution time of LR and FBLR: (a) experiment on HEPMASS data with multiple performance metrics (b) a single-run experiment without regularization on OpenML datasets and (c) an experiment incorporating regularization through various parameters on OpenML datasets. Optimizing hyperparameters is crucial to improve the model’s classification performance. We employed the grid search cross-validation (GridSearchCV) method in scikit-learn to find the best models for LR and FBLR, specifically to analyze the impact of regularization on accuracy metrics and execution time.

We also examined the alignment between theoretical computational complexity and practical execution times on synthetic datasets, as these allow for greater control over the data.

Performance metrics

Several performance metrics exist. Accuracy measures the proportion of correct predictions among all samples, providing an overall view of classification performance.

$\text{Accuracy} = \dfrac{TP + TN}{FP + FN + TP + TN}$ where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.

Precision measures the proportion of TP in all positive predictions, recall measures the proportion of actual positives that are correctly identified, and F1-score is the harmonic mean of precision and recall.

$\text{Precision} = \dfrac{TP}{FP + TP} \quad\text{and}\quad \text{Recall} = \dfrac{TP}{FN + TP} \quad\text{and}\quad \text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The area under the curve (AUC) is determined by plotting the true positive rate (TPR) vs. the false positive rate (FPR) at various thresholds and calculating the area beneath this curve.
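For reference, all of these metrics are available in scikit-learn and can be collected as in the short sketch below; the helper function is ours.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def report(y_true, y_pred, y_score):
    """Collect the metrics used in the experiments into a dictionary."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),   # y_score = P(y = 1 | x)
    }
```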

Impact of solver selection on LR performance

The LogisticRegression class in scikit-learn employs several optimization methods (see Table 1), each tailored for different regularizers, data sizes, and computational needs. However, this leads to differences in execution time and accuracy across the solvers. As seen in Table 4, the execution time of logistic regression in scikit-learn varies significantly (from 3.575 to 99.339 s). The Newton-CG and LIBLINEAR solvers demonstrate good accuracy, while the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm, stochastic average gradient (SAG), and SAGA have lower accuracy due to inadequate default iteration limits. It is important to note that in this experiment, we did not use GridSearchCV; instead, each solver was executed individually with its default parameters.

Table 4:
Performance of LR solvers on the road-safety dataset.
Solver Penalty Time (s) Accuracy
Newton-CG L2 6.7949 0.6949
Newton-CG No regularization 5.5673 0.6949
LBFGS L2 59.4638 0.6211
LBFGS No regularization 55.8338 0.6290
LIBLINEAR L1 39.889 0.6949
LIBLINEAR L2 3.5751 0.6935
SAG L2 42.6379 0.5819
SAG No regularization 42.5621 0.5820
SAGA L1 99.0867 0.5821
SAGA L2 85.0862 0.5821
SAGA ElasticNet ( α=0.5) 99.3389 0.5821
SAGA No regularization 85.9067 0.5821
DOI: 10.7717/peerj-cs.2579/table-4

For the LR implementation, each solver given in Table 1 has different computational complexity, some being quite efficient, like LIBLINEAR, but the required number of iterations is much larger than our method (see Table 2). The execution time of the proposed FBLR method remains tightly bounded (from 0.1695 to 0.2721 s, with an average of 0.2482 s) as the penalty (regularization) varies (see Table 5). This demonstrates the efficiency of our minimization scheme, which is relatively unaffected by the choice of regularizer compared to solvers used within the LR (Table 4).

Table 5:
Performance of FBLR with different penalties for road-safety dataset.
λ_SSR   f   γ   Time (s)   Accuracy
0 0 0.001 0.2094 0.6951
0 0.5 0.001 0.2653 0.6950
0 1 0.001 0.2447 0.6948
0 1.5 0.001 0.2380 0.6950
0 2 0.001 0.1695 0.6949
1 0 0.001 0.2558 0.6950
1 0.5 0.001 0.2508 0.6948
1 1 0.001 0.2470 0.6948
1 1.5 0.001 0.2550 0.6948
1 2 0.001 0.2688 0.6949
2 0 0.001 0.2602 0.6947
2 0.5 0.001 0.2690 0.6948
2 1 0.001 0.2721 0.6947
2 1.5 0.001 0.2685 0.6947
2 2 0.001 0.2649 0.6947
DOI: 10.7717/peerj-cs.2579/table-5

We observed that specific solvers within scikit-learn, like LIBLINEAR, lack CPU-level parallelization support, whereas our approach, leveraging NumPy, enables parallelism through scikit-learn’s inherent parallel capabilities. We suggest that scikit-learn developers consider adopting a single NumPy-based solver to simplify maintenance and automatically benefit from NumPy improvements.

Experimental results

For the experiments, we used the real-world data in Table 3 and synthetic data created with make_classification method in the scikit-learn. Real-world data is divided into 70% training data and 30% test data.

Experiments on real-world datasets

Table 6 presents the performance metrics, while Fig. 1 shows the receiver operating characteristic (ROC) curves for LR and FBLR with default parameters on the HEPMASS dataset. In LR, we used the LIBLINEAR solver for faster execution and set the maximum number of iterations to 250 to ensure high accuracy for a fair comparison. We use a single thread, even though FBLR can run with multiple threads, while LR with LIBLINEAR cannot. Experiments were run 10 times to compare LR and FBLR, showing that the loss in every evaluated metric did not exceed 0.005, while a 13.56× speedup was obtained.

Table 6:
Multiple performance metrics for LR and FBLR on the HEPMASS dataset.
Metric Train Test
LR FBLR LR FBLR
Accuracy 0.83686 0.83583 0.83646 0.83553
Recall 0.83588 0.83124 0.83562 0.83104
Precision 0.83759 0.83901 0.83708 0.83863
F1-Score 0.83673 0.83511 0.83635 0.83481
DOI: 10.7717/peerj-cs.2579/table-6
Figure 1: ROC curves and AUC values of LR and FBLR on HEPMASS dataset.

LR is powered by LIBLINEAR, a highly efficient C/C++ library designed for large-scale linear classification, among other optimized solvers (see Table 1). On the other hand, the proposed FBLR method relies solely on the NumPy package (with openblas64 as its backend), a core numerical library of Python. This approach ensures that FBLR takes full advantage of NumPy's efficient handling of array operations and its integration with the basic linear algebra subprograms (BLAS)2.

In Table 7, we compare LR and FBLR on the OpenML datasets in terms of accuracy, time (in seconds), and speedup (LR time/FBLR time). No regularization is used in these experiments. In LR, LIBLINEAR is chosen as the solver and None as the penalty, while default values are used for the other parameters. Table 7 shows the significant execution-time improvements obtained by FBLR compared to LR. Our method achieves training times up to 48.3 times faster and, on average, an order of magnitude faster than LR.

Table 7:
Single-run experiment comparing accuracy and execution time (s) of LR and proposed FBLR method without regularization on OpenML datasets.
Filename LR ACC FBLR ACC LR time (s) FBLR time (s) Speedup
Kits 0.5700 0.5633 4.5378 0.6795 6.6780
Road-safety 0.6935 0.6956 3.7074 0.1517 24.4337
Creditcard 0.9991 0.9990 3.4729 0.2766 12.5530
Airlines2 0.5965 0.5939 5.0756 0.1051 48.3072
Colon 0.9740 0.9680 10.6571 1.0858 9.8149
DOI: 10.7717/peerj-cs.2579/table-7

In Table 8, we compare regularized LR and regularized FBLR using GridSearchCV. We used grid search cross-validation in scikit-learn to find the best models (by searching various parameters) to compare LR and our FBLR method. It is worth mentioning that experimenting with a broader range of parameter combinations could improve both methods' generalization and accuracy. Nonetheless, as our primary objective was to compare the speed of the methods, we maintained an equal number of experiments for consistency. In GridSearchCV, 5 folds are used; it assesses every combination of parameter values to identify the most effective combination, yielding the best classifier. Due to memory limitations in the experimental setup, results for the kits dataset could not be included for the LR algorithm in Table 8, whereas FBLR is able to process the kits dataset.

Table 8:
GridSearchCV experiment comparing accuracy and execution time (s) of LR and the proposed FBLR method with regularization on OpenML datasets.
Filename LR ACC FBLR ACC LR time (s) FBLR time (s) Speedup
Kits – 0.5667 – 18,761.8684 –
Road-safety 0.6949 0.6951 191.0991 5.8685 32.56
Creditcard 0.9992 0.9990 49.9478 11.1041 4.5
Airlines2 0.5965 0.5956 92.7151 5.9504 15.58
Colon 0.9740 0.9740 145.2604 77.7610 1.87
DOI: 10.7717/peerj-cs.2579/table-8

Table 8 presents the best accuracies and the corresponding execution times and speedups for LR and FBLR on five OpenML datasets. For LR, we used the solvers ('lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga') and their penalties ('l1', 'l2', 'elasticnet', None) as GridSearchCV search parameters. As detailed in Table 1, specific solvers do not support certain penalties, so we combine solvers and penalties in GridSearchCV accordingly. To have an equal number of experiments, we limited the FBLR parameters to a specific range and tested 12 combinations. In particular, the parameter settings for FBLR in the experiments were as follows: the norm parameter f of the Lf-norm was varied over the values [0, 0.5, 1, 2] and the SSR coefficient ( λ) over the values [0, 1, 2].
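One way to encode such solver-penalty compatibility in GridSearchCV is a list of sub-grids, as sketched below; the concrete values are illustrative rather than the exact grid used in the experiments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Solver/penalty pairs must be compatible, so the grid is a list of sub-grids.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"]},
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2", None]},
    {"solver": ["saga"], "penalty": ["l1", "l2", None]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5]},
]
search = GridSearchCV(LogisticRegression(max_iter=250), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_estimator_ holds the best classifier
```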

The proposed FBLR method is designed to prioritize computational efficiency while maintaining accuracy. We employ SVD for low-rank approximation, and with a high energy percentile (99.9999), it retains significant information. Our novel second-order Soft-Plus approximation provides a good, though not exact, fit, enhancing numerical minimization efficiency and reducing required iterations ( k), with minimal potential accuracy loss. If desired, the weight vector from FBLR can be fine-tuned with an efficient Logistic Regression method in just a few iterations for a precise result.

Experiments on synthetic dataset

Recall that the computational complexity of the proposed FBLR method is $O(nd^2k)$ when regularization is applied and $O(nd\max(d,k))$ without regularization. Here, $n$ denotes the number of rows, $d$ the number of features, $k$ the number of iterations, and $K$ the maximum number of iterations. When $d \geq K$, the expression $\max(d,k)$ simplifies to $d$, given that $k \leq K$, where $k$ iterates up to a maximum of $K$. Therefore, the computational complexity without regularization becomes $O(nd^2)$. In accordance, Fig. 2 illustrates that the execution time of FBLR scales linearly with $n$ and exhibits near-quadratic growth with respect to $d$. Note that we compared the theoretical computational complexity with practical execution times on synthetic datasets, since we have more control over synthetic data.

Figure 2: Execution time of FBLR with respect to n and d using synthetic data.

A regularization path visualizes how feature coefficients change as the regularization parameter varies; when a feature's coefficient drops to zero, the corresponding feature is removed. Figure 3 shows regularization paths for the coefficients of the proposed FBLR method under $L_0$-norm regularization. It illustrates the effect of $L_0$-norm regularization in promoting sparsity within the model's parameters ($w_1$ to $w_8$). The sharpness of the $L_0$-norm regularization paths and the reduction of weight coefficients to exactly 0 demonstrate the effectiveness of the proposed $L_f$-norm regularizer in promoting sparsity for $f = 0$. Note that the intercept term, $w_0$, remains unpenalized and almost constant across varying regularization levels determined by the $\gamma$ parameter. Applying regularization without penalizing the intercept is one strong feature of the proposed FBLR method that is not directly supported by LR; in scikit-learn, leaving the intercept unpenalized requires non-trivial parameter tweaking and is also very data dependent.


Figure 3: $L_0$-norm paths using synthetic data which contains 10K samples with seven features and an intercept.
Feature 2 is a linear combination of feature 0 and 1, and feature 4 is correlated with feature 3.

Conclusion and discussion

This study presents a novel approach, namely fast binary logistic regression (FBLR), to enhance the training efficiency of binary logistic regression. FBLR reduces training times, achieving speeds that are, on average, an order of magnitude faster than those of scikit-learn’s logistic regression, courtesy of the efficient numerical minimizer we developed. Moreover, unlike scikit-learn’s logistic regression, which utilizes multiple solvers, we propose a single, efficient solver that incorporates a second-order accurate Soft-Plus approximation and flexible Lf-norm regularizer. This approach supports various weight penalties, thereby enhancing the training process for versatile models. Compared to scikit-learn, our approach relies on a single solver, simplifying the code and improving maintainability.

Furthermore, we have devised a low-rank approximation strategy to address collinearity among features. Our low-rank approach harnesses randomized SVD for high-dimensional feature sets and also a novel SVD with row reduction (SVD-RR) approach, tailored specifically for large datasets containing numerous rows. For smaller to moderate-sized datasets, we just employ truncated SVD. Also, we introduce an innovative strategy to determine the optimal transition from truncated SVD to randomized SVD. To mitigate the dominance of overly large eigenvalues, the logarithm of the eigenvalues was employed to establish the rank, r, suitable for low-rank approximation. The achieved computational efficiency is crucial for crafting generalized logistic regression models, enabling iterative parameter tuning to achieve an optimal balance between bias and variance.

In addition to existing computational efficiencies, further acceleration can be achieved by applying stochastic gradient descent or mini-batch methods for faster approximate gradient evaluation. Also, parallel computing frameworks such as CuPy, Numba, PyCUDA, or PyTorch can be utilized. Additionally, deploying a caching strategy for SVD and storing the compact diagonal sparse matrix S and the dense square matrix V in a dictionary can enable efficient reuse. Matrix U can be computed using Eq. (41) once the diagonal matrix S and the square matrix V are retrieved using the hash code of the input data X. This is particularly useful for repeated training sessions when performing best-parameter searches in scikit-learn. Originally conceived for binary labels, our method possesses sufficient flexibility to accommodate multinomial classification using the inherent functions provided by scikit-learn. Nonetheless, our approach offers no advantage over scikit-learn's logistic regression for handling missing values. Also, like scikit-learn's logistic regression, our method is limited to linear decision boundaries. The matrix-vector form of our cost function, resembling Ridge regression, allows our method to incorporate kernel methods such as Kernel Ridge Regression, enabling nonlinear data handling while supporting linear data in its current form. Such a kernel extension can benefit from techniques such as Random Fourier Features (RFF) (Rahimi & Recht, 2007) to further improve execution time.

Appendix

Approximation of soft-plus

The quadratic approximation of $\log(1 + e^{\tau})$ at $\hat{\tau}$ is defined as:

$\log(1 + e^{\tau}) \approx z\tau^2 + q\tau + t$ where $\hat{\tau}$ is a proxy constant for $\tau$ and $z$, $q$, and $t$ are constant coefficients.

First, we determine $t$ by substituting $\tau = 0$, as in Eq. (24):

$\log(1 + e^{0}) \approx z \cdot 0^2 + q \cdot 0 + t$

$t = \log 2.$

Then, for $\hat{\tau} = 0$, we have $z = \frac{1}{8}$ and $q = \frac{1}{2}$ using the Maclaurin series. Next, for $\hat{\tau} \neq 0$, to find the values of $q$ and $z$, we evaluate the approximation at $\hat{\tau}$ and $-\hat{\tau}$:

$\log(1 + e^{\hat{\tau}}) \approx z\hat{\tau}^2 + q\hat{\tau} + \log 2$

$\log(1 + e^{-\hat{\tau}}) \approx z\hat{\tau}^2 - q\hat{\tau} + \log 2.$

By subtracting Eq. (27) from Eq. (26), we obtain

$q = \dfrac{\log(1 + e^{\hat{\tau}}) - \log(1 + e^{-\hat{\tau}})}{2\hat{\tau}}.$

Note that,

$\log(1 + e^{-\hat{\tau}}) = \log(1 + e^{\hat{\tau}}) - \hat{\tau}.$

Then, we obtain $q = \frac{1}{2}$ by plugging Eq. (29) into Eq. (28).

Finally, plugging $q$ and $t$ into Eq. (26), we obtain

$z = \dfrac{\log(1 + e^{\hat{\tau}}) - \log 2}{\hat{\tau}^2} - \dfrac{1}{2\hat{\tau}}$, which leads to the approximation below at point $\hat{\tau}$:

$\log(1 + e^{\tau}) \approx z\tau^2 + \tfrac{1}{2}\tau + \log 2 \quad\text{where}\quad z = \begin{cases} \dfrac{1}{8} & \text{for } \hat{\tau} = 0 \\[6pt] \dfrac{\log(1 + e^{\hat{\tau}}) - \log 2}{\hat{\tau}^2} - \dfrac{1}{2\hat{\tau}} & \text{otherwise.} \end{cases}$

Note that our quadratic Soft-Plus approximation is specifically designed for minimizing $-\log l(w)$ and differs from a second-order Taylor series approximation. Compared to the second-order Taylor approximation, our Soft-Plus approximation more effectively adapts to the shape of the Soft-Plus function, being exact at both the approximation point and the origin. This provides a more accurate quadratic approximation for minimizing the cost function. The second-order Taylor approximation of Eq. (23) at $\hat{\tau}$ is given below, which differs from our approximation:

$\log(1 + e^{\tau}) \approx \left(\dfrac{e^{\hat{\tau}}}{2(1 + e^{\hat{\tau}})^2}\right)\tau^2 + \left(\dfrac{e^{\hat{\tau}}}{1 + e^{\hat{\tau}}} - \dfrac{\hat{\tau} e^{\hat{\tau}}}{(1 + e^{\hat{\tau}})^2}\right)\tau + \left(\log(1 + e^{\hat{\tau}}) - \dfrac{\hat{\tau} e^{\hat{\tau}}}{1 + e^{\hat{\tau}}} + \dfrac{\hat{\tau}^2 e^{\hat{\tau}}}{2(1 + e^{\hat{\tau}})^2}\right).$
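The following short NumPy sketch (with hypothetical helper names) evaluates the proposed surrogate and confirms numerically that it matches the Soft-Plus function exactly at the approximation point and at the origin.

```python
import numpy as np

def softplus(t):
    return np.logaddexp(0.0, t)            # stable log(1 + e^t)

def quad_softplus(t, t_hat):
    """Quadratic Soft-Plus surrogate: exact at t = t_hat and at t = 0."""
    if t_hat == 0.0:
        z = 0.125
    else:
        z = (softplus(t_hat) - np.log(2.0)) / t_hat ** 2 - 0.5 / t_hat
    return z * t ** 2 + 0.5 * t + np.log(2.0)

t_hat = -5.0
for t in (0.0, t_hat, -2.5):
    print(t, softplus(t), quad_softplus(t, t_hat))
# The surrogate equals softplus at t = 0 and t = t_hat, and is close in between.
```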

Empirical quadratic approximation of soft-plus

The employed quadratic approximation of the Soft-Plus function given in Eq. (23) at the points $\hat{\tau} = 0$ and $\hat{\tau} = -5$ is shown in Fig. 4. These approximations are exact at both the approximation point ($\hat{\tau}$) and the origin ($\tau = 0$), with high accuracy between them, but become less accurate outside this range. In contrast, the second-order Taylor approximation only guarantees exactness at the approximation point $\hat{\tau}$.

Figure 4: Quadratic approximation of Soft-Plus at $\hat{\tau} = 0$ and $\hat{\tau} = -5$.

Approximation of Lf-norm

We also approximate the smooth $L_f(\rho) = \rho^2 (|\rho|^{2-f} + \epsilon)^{-1}$ quadratically at a point $\hat{\rho}$, where $u$ is defined as a constant. Here, $\hat{\rho}$ is a proxy constant for $\rho$. The quadratic approximation of the smooth $L_f(\rho)$ function is:

$L_f(\rho) \approx \rho^2 (|\hat{\rho}|^{2-f} + \epsilon)^{-1} = u\rho^2 \quad\text{where}\quad u = (|\hat{\rho}|^{2-f} + \epsilon)^{-1}.$

Optimization problems with $L_0$ terms are combinatorial, requiring discrete decisions to minimize the number of non-zero elements, which makes direct approaches computationally impractical for large-scale problems. Smooth approximations are employed as a solution, substituting the $L_0$ term with a continuous, differentiable function that mimics the counting of non-zero elements. Similarly, our $L_f$-norm approximation smoothly mirrors the $L_1$-norm for $f = 1$ and the $L_0$-norm for $f = 0$, as illustrated in Fig. 5. With our smooth approximation, as $\epsilon$ gets smaller, the approximation gets better (default $\epsilon = 10^{-10}$).
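A small NumPy sketch of these two surrogates is given below (function names are ours); it reproduces the L1-like behaviour for f = 1 and the L0-like counting behaviour for f = 0 away from zero.

```python
import numpy as np

def smooth_lf(rho, f, eps=1e-10):
    """Smooth L_f surrogate: rho^2 / (|rho|^(2-f) + eps)."""
    return rho ** 2 / (np.abs(rho) ** (2.0 - f) + eps)

def quad_lf(rho, rho_hat, f, eps=1e-10):
    """Quadratic approximation u * rho^2 built at the proxy point rho_hat."""
    u = 1.0 / (np.abs(rho_hat) ** (2.0 - f) + eps)
    return u * rho ** 2

rho = np.linspace(-2, 2, 5)
print(smooth_lf(rho, f=1.0))   # close to |rho| (L1) away from 0
print(smooth_lf(rho, f=0.0))   # close to the 0/1 count (L0) away from 0
```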

Figure 5: Smooth approximation of $L_f$-norm for $L_1$-norm and $L_0$-norm.

Despite the $L_1$-norm and $L_0$-norm approximations in Fig. 5 being precise even for relatively large $\epsilon$ values ($\epsilon = 10^{-2}$ and $\epsilon = 10^{-4}$), their direct use is unsuitable within our cost function, leading us to employ a quadratic approximation over this smooth approximation. The quadratic approximations of the $L_f$-norm for the $L_1$-norm ($f = 1$) and the $L_0$-norm ($f = 0$) are depicted in Fig. 6.

Figure 6: Quadratic approximation of $L_f$-norm for $L_1$-norm and $L_0$-norm.

SVD with row reduction

In the case of a large data matrix $X \in \mathbb{R}^{n \times d}$, truncated SVD is computationally demanding due to its complexity $O(nd^2)$. In scenarios where $d^2$ becomes large, the computational burden of applying SVD is mitigated by utilizing randomized SVD (Halko, Martinsson & Tropp, 2010; Halko et al., 2011; Martinsson et al., 2010). However, this method is less effective when $n$ is large. In such instances, a large subset of the matrix $X$ can sufficiently mimic the statistical properties of the full dataset. Hence, we sub-sample the matrix $X$ to create $\tilde{X}$, consisting of $m$ rows. This sub-sampling involves randomly or sequentially selecting $m$ rows from $X$; in our case, we simply take the first $m$ rows. Note that one can employ a sampling mechanism proposed in Menon & Elkan (2011), such as the Drineas, Kannan, and Mahoney method, to obtain better statistical guarantees for the sampled subset. We use a simple sampling mechanism where we set $m$ to a sufficiently large value, $10^5$, ensuring the statistical preservation of $X$ and reducing computational demands for $n > 2m$, thus outperforming direct SVD on the full matrix. However, the $U$ returned by the SVD applied on $\tilde{X}$ is now $m \times d$ instead of $n \times d$, and $S$ also requires normalization. To address these issues, we define the covariance matrices for the full matrix $X$ and the subsampled matrix $\tilde{X}$ as follows:

$C = \dfrac{1}{n-1} X^{\top} X$

$\tilde{C} = \dfrac{1}{m-1} \tilde{X}^{\top} \tilde{X}.$

Assuming $\tilde{X}$, a large subset of $X$, reflects the dataset's statistics, their covariances are nearly equal ($C \approx \tilde{C}$):

$\dfrac{1}{n-1} X^{\top} X \approx \dfrac{1}{m-1} \tilde{X}^{\top} \tilde{X}.$

Applying the definition of SVD to both matrices, represented as $X = U S V^{\top}$ and $\tilde{X} = \tilde{U} \tilde{S} \tilde{V}^{\top}$ respectively, the aforementioned equation can be reformulated as follows:

$\dfrac{1}{n-1} (U S V^{\top})^{\top} (U S V^{\top}) \approx \dfrac{1}{m-1} (\tilde{U} \tilde{S} \tilde{V}^{\top})^{\top} (\tilde{U} \tilde{S} \tilde{V}^{\top}).$

The above equation can be simplified to:

$\dfrac{1}{n-1} (V S S V^{\top}) \approx \dfrac{1}{m-1} (\tilde{V} \tilde{S} \tilde{S} \tilde{V}^{\top})$ where $U$ and $\tilde{U}$ are eliminated since $U^{\top} U = I$ and $\tilde{U}^{\top} \tilde{U} = I$. We assume the same eigenvectors for both data matrices:

$V = \tilde{V} \;\Rightarrow\; \dfrac{1}{n-1} (V S S V^{\top}) \approx \dfrac{1}{m-1} (V \tilde{S} \tilde{S} V^{\top}).$

Multiplying both sides of the above equation by $V^{\top}$ from the left and $V$ from the right and applying simple algebraic simplifications:

$S \approx \sqrt{\dfrac{n-1}{m-1}}\, \tilde{S}.$

Finally, to compute $U$, we use the SVD of $X$ given as $X = U S V^{\top}$ and multiply both sides by $V S^{-1}$, leading to:

$U = X V S^{-1}.$

Thereby, we efficiently compute $\tilde{S}$ and $\tilde{V}$ using truncated SVD on the sub-sampled smaller data matrix $\tilde{X}$, and then also efficiently compute the $U$ matrix using Eq. (41), the $S$ matrix using Eq. (40), and the $V$ matrix using Eq. (39). Thus, the proposed SVD-RR enables computing the SVD of a large data matrix $X$.
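Under the assumptions above, SVD-RR can be sketched in a few NumPy lines; the function below is illustrative, uses plain first-m-rows sub-sampling, and assumes n is substantially larger than m.

```python
import numpy as np

def svd_row_reduction(X, m=100_000):
    """Sketch of SVD-RR: SVD of a row subsample, then rescale S and recover U."""
    n, d = X.shape                               # assumes n > 2m, as in the text
    X_sub = X[:m]                                # simple sub-sampling of the first m rows
    _, s_sub, Vt = np.linalg.svd(X_sub, full_matrices=False)
    s = np.sqrt((n - 1) / (m - 1)) * s_sub       # rescale the singular values of the subsample
    U = X @ Vt.T @ np.diag(1.0 / s)              # U = X V S^{-1}, as in Eq. (41)
    return U, s, Vt.T                            # so that X ≈ U diag(s) V^T
```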

Randomized SVD

For handling a large matrix $X$, where truncated SVD may become computationally slow or unfeasible, the utilization of randomized SVD offers a viable alternative, as described in the works of Martinsson et al. (2010) and Halko, Martinsson & Tropp (2010), Halko et al. (2011). Randomized SVD has many uses in machine learning with datasets that are too large to be stored in memory (Halko et al., 2011). In comparison to classic SVD algorithms, randomized techniques for computing low-rank approximations are frequently faster and, unexpectedly, more robust (Halko, Martinsson & Tropp, 2010). Thus, for large datasets where truncated SVD is slow or impractical, the use of randomized SVD is suggested by Martinsson et al. (2010).

A critical decision is to determine when to use randomized SVD instead of truncated SVD, taking into account the number of rows, $n$, and the number of features, $d$. To address this issue, we propose to leverage the computational complexities of both approaches as a criterion for decision-making. The computational complexity of the truncated SVD is given below:

$O(nd^2).$

Similarly, the computational complexity of the randomized SVD is:

$O(nd r + n r^2 + r^2 d)$ where $r$ is the inherent rank of the matrix $X$. Taking the ratio of the truncated to randomized SVD complexities and comparing it with a threshold, we obtain the decision rule below (default $T$ is 0.90):

$\dfrac{nd^2}{nd\hat{r} + n\hat{r}^2 + \hat{r}^2 d} \geq 2T.$

However, $\hat{r}$ is unknown until the SVD is taken. So, as a practical solution, we estimate $\hat{r}$ using $d$ as shown below:

$\hat{r} = \min\!\left(\dfrac{d_{\text{limit}}}{d},\, d\right)$ where $d_{\text{limit}} = 500$ serves as the soft threshold beyond which truncated SVD may begin to exhibit slower performance as $d$ exceeds this value. If $n$ is larger than $10^5$, then SVD(∗) computes the SVD of $X$ using SVD-RR ("SVD with Row Reduction"). Otherwise, the SVD(∗) method takes $X$ as input and uses Eq. (44) to decide whether to compute the truncated SVD or the randomized SVD of the data matrix $X$.
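A rough sketch of this dispatch logic is given below; the rank estimate and the threshold follow our reading of the reconstructed rule above, and the function name is hypothetical.

```python
def choose_svd(n, d, d_limit=500, T=0.90, n_limit=100_000):
    """Pick an SVD variant from the matrix shape (illustrative dispatch only)."""
    if n > n_limit:
        return "svd-rr"                          # very tall X: SVD with row reduction
    r_hat = min(d_limit / d, d)                  # crude rank estimate used in the cost ratio
    cost_truncated = n * d ** 2                  # O(n d^2)
    cost_randomized = n * d * r_hat + n * r_hat ** 2 + r_hat ** 2 * d
    return "randomized" if cost_truncated / cost_randomized >= 2 * T else "truncated"
```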

Determining the rank

Finally, we developed Algorithm 3, which determines the rank ($r$) using the energy percentile ($\xi$) and the eigenvalues obtained by SVD(∗).

Algorithm 3 :
Determine rank using S.
1: procedure RANK($S$, $\xi$)
2:    $s \leftarrow \log(\mathrm{Diagonal}(S) + 1)$
3:    $c \leftarrow \mathrm{CMF}(s) \,/\, \sum_{i=1}^{d} s_i$    ▷ CMF: Cumulative Mass Function
4:    $r_{\xi} \leftarrow \arg\min_i \,(c_i > \xi \text{ and } s_i > 10^{-10})$
5:    $r \leftarrow \min(r_{\xi}, d)$
6:    return $r$
7: end procedure
DOI: 10.7717/peerj-cs.2579/table-103
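An illustrative NumPy version of this procedure is given below; it assumes the cumulative mass is compared on a 0-1 scale (i.e., against ξ/100), and the function name is ours.

```python
import numpy as np

def rank_from_spectrum(s_diag, xi=99.9999, floor=1e-10):
    """Energy-percentile rank selection on log-scaled eigenvalues (sketch)."""
    s = np.log(np.asarray(s_diag, dtype=float) + 1.0)    # damp overly large eigenvalues
    c = np.cumsum(s) / np.sum(s)                         # cumulative mass function of s
    ok = (c > xi / 100.0) & (s > floor)                  # assumes xi is given as a percentage
    r_xi = int(np.argmax(ok)) + 1 if ok.any() else len(s)
    return min(r_xi, len(s))
```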
1. L0 is not a norm but a pseudo-norm, since it does not satisfy all norm axioms.
2. BLAS is a collection of low-level routines that accelerate linear algebra computations. BLAS underpins the method's ability to perform computationally intensive tasks more efficiently.