Fast binary logistic regression

PeerJ Computer Science

Introduction

Classification has been extensively investigated in statistical learning as a supervised learning task. As pioneering works, the contributions of Berkson (1944) and Cox (1958) to the field of logistic regression are recognized as foundational: Berkson applied the logistic function to bio-assay, while Cox introduced regression analysis of binary sequences. Different forms of logistic regression are available to handle outputs in binary, multinomial, and ordinal formats, making it a versatile machine learning algorithm capable of addressing a wide range of classification tasks in domains such as marketing, finance, healthcare, and fraud detection. With the rapid development of machine learning, highly flexible models such as extreme gradient boosting (XGBoost), support vector machines (SVM), and neural networks have become prominent (Cioffi et al., 2020). However, studies indicating logistic regression's superior performance in some scenarios (de Hond et al., 2022; Jiang, Hu & Jia, 2023; Nusinovici et al., 2020; Zabor et al., 2022) and its continued use (Jaskie, Elkan & Spanias, 2019; Lu, 2024) underscore the statistical machine learning community's ongoing interest in this method.

Several improvements to the original logistic regression method (Berkson, 1944) have been proposed, such as weight regularization to obtain a generalized model or to adapt the method to large data (Jiang, Hu & Jia, 2023; Bertsimas, Pauphilet & Van Parys, 2021; Tibshirani, 1996; Zaidi et al., 2016). Achieving a generalized model is closely tied to the bias-variance tradeoff, where high-bias models risk oversimplification and underfitting, while low-bias models can overfit, exhibiting high variance. The tradeoff involves balancing these extremes to ensure the model generalizes well to unseen data without overfitting or underfitting. The optimal tradeoff is achieved by selecting appropriate model parameters that yield the most effective model complexity for the training and validation sets, using techniques such as cross-validation and bootstrapping (Wahba et al., 1998; Gong, 2006; Mohr & van Rijn, 2023). In short, there must be a match between model complexity and data complexity to ensure that a model trained on training data performs well on unseen test data (Geman, Bienenstock & Doursat, 1992). Identifying optimal model parameters necessitates multiple training iterations with data subsets or random subsamples, a computationally intensive process for various parameter combinations, particularly with large datasets (Mohr & van Rijn, 2023; Emmert-Streib & Dehmer, 2019).

Logistic regression assumes a linear relationship between features and class distributions without demanding a specific statistical distribution. However, it is strictly required that features exhibit no multicollinearity. Regularization emerges as a vital solution to address multicollinearity and effectively manage model complexity. Aside from the intercept term, weight regularization applied to features encourages sparsity in the weight vector, resulting in fewer and more distinct features. This reduces multicollinearity and complexity, improving the model’s generalization abilities (Shi et al., 2010; Zhang et al., 2021; Avalos, Grandvalet & Ambroise, 2003). Additionally, in scenarios with numerous features available but limited training data, the risk of overfitting can arise without an effective regularization technique (Vapnik, 2006). Thus, pursuing a sparse solution is crucial to reduce overfitting and prevent data’s random pattern memorization (Zhang et al., 2021).

L2-norm (also known as Ridge) regularization, first introduced by Tikhonov (1963), penalizes the magnitudes of the weights to enforce smaller values, thereby increasing model robustness. In the literature, other weight regularization techniques (Tikhonov & Arsenin, 1979; Morozov, 2012; Bertero, 1986) also enforce sparsity for better robustness and increased model generalization. Among these techniques, the L0-norm1 provides the sparsest solutions (Wang, Chen & Yang, 2022; Greenwood et al., 2020). Nevertheless, using the L0-norm is challenging since it is not convex. Therefore, the L1-norm, namely the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), was proposed as a convex relaxation of the non-convex L0-norm. Each regularization approach imposes different constraints on model training. For example, the L2-norm imposes a squared penalty (Hoerl & Kennard, 1970) and the L1-norm imposes an absolute penalty on the model's parameters (Ozgur, Nar & Erdem, 2018; Wei et al., 2019; Xie et al., 2023), while the L0-norm imposes a constant penalty for all non-zero weights (Wang, Chen & Yang, 2022; Greenwood et al., 2020). Note that the Bayesian information criterion (BIC) (Schwarz, 1978) and the Akaike information criterion (AIC) (Akaike, 1998), well-known model selection criteria, are special cases of L0-norm regularization. Sparse models using the L0-norm and L1-norm force many coefficients to shrink to exactly zero, while the L2-norm tends to produce model parameters close to, but not precisely, zero (Pereyra et al., 2017). Thus, the L1-norm and L0-norm are employed in feature selection to avoid overfitting. On the other hand, the L2-norm has better computational efficiency than the L1-norm and is even more efficient than the L0-norm. Therefore, several studies prefer to use L1-norm or L2-norm regularized logistic regression for large-scale data (Koh, Kim & Boyd, 2007; Shi et al., 2010; Jovanovich & Lazar, 2012; Su, 2020). Elastic-net combines the L1-norm and L2-norm penalties, so it tends to select more features than LASSO, while its computational efficiency is similar to that of Ridge. Recently, some studies have tackled L0-norm regularization for logistic regression while dealing with the challenges of using the non-convex L0-norm regularization term (Ming & Yang, 2024; Hazimeh, Mazumder & Nonet, 2023; Knauer & Rodner, 2023; Deza & Atamturk, 2022). Although not proposed for logistic regression, the use of a fractional norm as a penalty term was proposed as a flexible and practical approach for enforcing smoothness in image denoising (Ozcan, Sen & Nar, 2016), where it is also better suited for handling the generalized Gaussian distribution of the variables (Bernigaud et al., 2021).

As already stated in the literature, the maximum likelihood estimation of the logistic regression has some shortcomings:

  • It may struggle to handle massive sparse datasets (Holland & Welsch, 1977; Li, Zhu & Wang, 2023) effectively.

  • Logistic regression often struggles with imbalanced data, tending to favor the majority class and resulting in poor performance on the minority class (King & Zeng, 2001).

  • For the training set where the number of samples is smaller than the number of features, directly solving the logistic regression is an ill-posed problem (Liu, Chen & Ye, 2009).

  • It is sensitive to anomalous data and collinearity (Feng et al., 2014; Midi, Sarkar & Rana, 2010).

  • Oversampling data instances may decrease estimation performance and increase computational expenses (Wang, 2020).

One strategy to mitigate these challenges involves utilizing the iteratively reweighted least squares method with an appropriate solver to construct binary logistic regression classifiers, particularly for large-scale datasets (Paciorek, 2007; Rouhani-Kalleh, 2007). Employing appropriate numerical solvers can also address these problems. Hence, numerous numerical solvers have been deployed to tackle these challenges, such as gradient descent and its variants, Newton’s method, and quasi-Newton methods. These solvers play a crucial role in training models efficiently by iteratively adjusting parameters to minimize a specified objective function. Although these approaches attempt to eliminate anomalies or collinearity problems in the data, working with many solvers with different regularizers poses a challenging issue for researchers (Liu, Chen & Ye, 2009). The selection of the solver typically depends on the specific characteristics of the data and the nature of the problem under consideration.

Scikit-learn (Pedregosa et al., 2011), a versatile Python library integrating a wide range of state-of-the-art machine learning algorithms, has become one of the most popular choices among many libraries due to its simplicity, comprehensive documentation, and extensive community support. The LogisticRegression class in scikit-learn already incorporates several optimization methods (see Table 1), each tailored for different regularizers, data sizes, and computational needs. Similarly, other open-source libraries, including cuML and Statsmodels, provide their own implementations of logistic regression, each with a variety of solvers and parameter configurations to enhance flexibility and efficiency. In particular, cuML offers graphics processing unit (GPU)-accelerated solvers for large-scale computations and Statsmodels provides robust statistical modeling options for binary response variables.

Table 1:
Minimizers employed as solvers in scikit-learn’s logistic regression algorithm.
Solver                                          L1   L2   ElasticNet   No regularization
LBFGS¹                                          –    ✓    –            ✓
LIBLINEAR (Fan et al., 2008)                    ✓    ✓    –            –
Newton-CG¹                                      –    ✓    –            ✓
Newton-Cholesky¹                                –    ✓    –            ✓
SAG (Schmidt, Roux & Bach, 2017)                –    ✓    –            ✓
SAGA (Defazio, Bach & Lacoste-Julien, 2014)     ✓    ✓    ✓            ✓
DOI: 10.7717/peerj-cs.2579/table-1

Note:

See Nocedal & Wright (2006) for further information on these common optimization approaches.

With increasing dataset sizes, logistic regression must become more efficient and scalable. In response, advanced versions have emerged, designed to prioritize speed and accuracy. These innovative approaches accelerate learning by leveraging techniques such as dimensionality reduction, parallel processing, and advanced data structures.

Logistic regression training is formulated to solve an optimization problem based on likelihood maximization. However, calculating the gradient with all training data in each iteration for large datasets becomes computationally demanding. Therefore, Song et al. (2021) proposed an adaptive sampling method in which the gradient estimation is divided into several sub-problems according to the data size, and then each sub-problem is solved independently before combining the results of all sub-problems. Shi et al. (2010) suggested a hybrid algorithm that merges two optimization iterations: one that is fast and memory-efficient and another that is slower but yields more precise results.

Various approaches have been proposed to enhance the speed of logistic regression, which can generally be categorized into efficient numerical methods and software-based improvements. Table 1 summarizes the solvers employed in scikit-learn. A notable earlier study by Komarek & Moore (2003) utilizes Cholesky decomposition to accelerate logistic regression. However, since many contemporary methods now incorporate Cholesky decomposition within NumPy or employ more advanced solvers, this study is no longer considered state-of-the-art. Additionally, some approaches leverage field programmable gate arrays (FPGA) and GPUs (Wienbrandt et al., 2019), stochastic gradient descent, and mini-batch (Yang et al., 2019; Liang et al., 2020; Jurafsky & Martin, 2024) methods to improve the speed of logistic regression. These techniques are often complementary and can be integrated with other methods, representing practical efforts to accelerate logistic regression further.

We introduce a novel numerical approach that remarkably improves the training efficiency of binary logistic regression without relying on dimension reduction or parallel processing and without requiring feature independence. This efficiency is obtained by employing a novel Soft-Plus approximation, which enables the reformulation of logistic regression parameter estimation in matrix-vector form. Additionally, unlike the multiple solvers employed in scikit-learn's logistic regression for different regularization schemas, we present a single flexible Lf-norm regularization approach that also provides the flexibility to include or exclude penalization of the intercept term. Our regularization approach supports a range of weight penalties through a unified codebase with a specifically designed numerical minimizer. To demonstrate the computational efficiency of our method, we conducted several quantitative experiments comparing our method against the widely used scikit-learn library. Here, we used benchmark data from OpenML, an open machine learning data repository, and synthetic data designed for controlled experiments. Experiments demonstrate that our method effectively handles collinear features and large data while providing superior efficiency with minimal to no loss in accuracy. Our fast binary logistic regression (FBLR) exclusively utilizes the Python NumPy library. As a result, it benefits from all current capabilities of NumPy and upcoming enhancements while keeping the codebase simple. This streamlined yet powerful approach allows it to surpass the processing speed of scikit-learn's logistic regression (LR) implementation, which relies on well-established and highly optimized libraries. Our experiments demonstrate that FBLR achieves an average speedup of roughly an order of magnitude, with the maximum observed speedup reaching 48.3 times.

Proposed method

Fast binary logistic regression

For logistic regression, the likelihood function in the context of maximum likelihood estimation (MLE) is formulated based on the assumption that each observation is an independent Bernoulli trial. Consider a dataset with $n$ observations $(x_i, y_i)$, where each $x_i \in \mathbb{R}^d$ pairs with a binary outcome $y_i \in \{0, 1\}$. Here, $i$ is the data index, and $d$ is the number of features, including the intercept. The probability that the logistic regression model assigns to the outcome being 1 is then defined as:

$P(y_i = 1 \mid x_i) = \dfrac{e^{x_i w}}{1 + e^{x_i w}}$ where $w \in \mathbb{R}^d$ represents the model parameters, namely the weight vector, and $x_i = [1 \; x_{i,1} \; x_{i,2} \; \cdots \; x_{i,d-1}]$ is the $i$th data vector. For binary classification using logistic regression, the likelihood $l(w)$ of observing the entire dataset given $w$ is the product of the probabilities of each observation, as given in Eq. (2):

$l(w) = \prod_{i:\, y_i = 1} P(y_i = 1 \mid x_i) \prod_{i:\, y_i = 0} P(y_i = 0 \mid x_i) = \prod_{i:\, y_i = 1} P(x_i) \prod_{i:\, y_i = 0} \bigl(1 - P(x_i)\bigr)$

The optimum model parameter is found by maximizing $l(w)$ with respect to the weight vector $w$ for the given input data and corresponding target class labels, known as MLE:

$\arg\max_{w}\; l(w)$

Optimizing $l(w)$ presents challenges due to the presence of product terms. However, instead of maximizing $l(w)$, one can opt to minimize $-\log l(w)$. This approach is feasible because the maximum of $l(w)$ corresponds to the minimum of $-\log l(w)$, and $-\log l(w)$ is significantly easier to optimize. Then, we simplify $-\log l(w)$ as below:

$-\log l(w) = -\Bigl[\sum_{i:\, y_i = 1} \log P(x_i) + \sum_{i:\, y_i = 0} \log\bigl(1 - P(x_i)\bigr)\Bigr] = \sum_{i:\, y_i = 0} \log\bigl(1 + e^{x_i w}\bigr) - \sum_{i:\, y_i = 1} \bigl(x_i w - \log(1 + e^{x_i w})\bigr) = \sum_i \bigl(\log(1 + e^{x_i w}) - y_i x_i w\bigr).$

To develop an efficient minimizer for $-\log l(w)$, we approximate the Soft-Plus function, $\log(1 + e^{x_i w})$, at $\hat{w}$ to obtain a quadratic form, where $\hat{w}$ is a proxy constant for $w$. Details of our novel quadratic Soft-Plus approximation are provided in "Approximation of Soft-Plus". By substituting this Soft-Plus approximation into Eq. (4), we derive the following form of $-\log l(w)$:

$-\log l(w) \approx \sum_i \Bigl(z_i (x_i w)^2 + \bigl(\tfrac{1}{2} - y_i\bigr)(x_i w) + \log 2\Bigr)$ where

$z_i = \begin{cases} \dfrac{1}{8}, & x_i \hat{w} = 0 \\[6pt] \dfrac{\log(1 + e^{x_i \hat{w}}) - \log 2}{(x_i \hat{w})^2} - \dfrac{1}{2 x_i \hat{w}}, & \text{otherwise.} \end{cases}$

Finally, we put Eq. (5) into matrix-vector form as below:

$-\log l(w) \approx (Xw)^{\top} Z (Xw) + \bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + n \log 2 = w^{\top} X^{\top} Z X w + \bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + n \log 2$ where $X \in \mathbb{R}^{n \times d}$ is the input data in matrix form, $y \in \mathbb{R}^{n \times 1}$ is the target data in vector form, $w \in \mathbb{R}^{d \times 1}$ is the weight vector, and $Z \in \mathbb{R}^{n \times n}$ is the diagonal matrix formed from the $z_i$.
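To make this surrogate concrete, the short NumPy sketch below evaluates the right-hand side of the matrix-vector form above; the helper names (`z_coeffs`, `surrogate_neg_log_likelihood`) are ours and are not part of the released FBLR code.

```python
import numpy as np

def z_coeffs(t, eps=1e-12):
    """Coefficients z_i of the quadratic Soft-Plus surrogate, Eq. (6)."""
    t = np.asarray(t, dtype=float)
    z = np.full_like(t, 0.125)                 # limit value 1/8 when x_i w_hat = 0
    nz = np.abs(t) > eps
    sp = np.logaddexp(0.0, t[nz])              # numerically stable log(1 + e^t)
    z[nz] = (sp - np.log(2.0)) / t[nz] ** 2 - 0.5 / t[nz]
    return z

def surrogate_neg_log_likelihood(X, y, w, w_hat):
    """Quadratic surrogate of -log l(w) built around the proxy point w_hat."""
    z = z_coeffs(X @ w_hat)                    # diagonal entries of Z
    Xw = X @ w
    return Xw @ (z * Xw) + (0.5 - y) @ Xw + X.shape[0] * np.log(2.0)
```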

Various forms of weight regularization have been proposed in the literature for logistic regression, each requiring distinct numerical approaches. Alternatively, we use smooth Lf-norm (Ozcan, Sen & Nar, 2016) regularization to achieve a unified framework that only requires a single numerical approach, incorporating Ridge, pseudo ElasticNet, LASSO, L0-norm (notably, a pseudo-norm), and other fractional norm regularizations.

For the quadratic approximation, let $\hat{\rho}$ be a constant proxy for $\rho$ and define $u$ as a constant coefficient, $u = (|\hat{\rho}|^{2-f} + \epsilon)^{-1}$. Then, the quadratically approximated $L_f$-norm is

$L_f(\rho) \approx \rho^2 (|\hat{\rho}|^{2-f} + \epsilon)^{-1} = u\rho^2.$

Refer to "Approximation of Lf-Norm" for details and justification of the quadratic $L_f$-norm approximation.

An iterative minimization approach (solver) is now required, as we utilize quadratic approximations of $\log(1 + e^{x_i w})$ (Soft-Plus) and of the smooth $L_f$-norm. Both approximations are performed on $w$ at each iteration, and they are accurate near the point of approximation, namely $\hat{w}$, which is a proxy constant for $w$. To ensure that the new solution remains close to the approximation point at the current iteration, we introduce a slow-step-regularization (SSR) term with a coefficient $\lambda$. Finally, we normalize the terms by the data count and the weight vector dimension.

$J^{(k)}(w) = -\tfrac{1}{n}\log l(w) + \tfrac{\lambda}{2d}(w - \hat{w})^{\top}(w - \hat{w}) + \tfrac{\gamma}{2d}\, p^{\top} L_f(w)$ where $k$ is the iteration index, $w$ is the weight vector, $\hat{w}$ is a proxy constant for $w$ at the $k$th iteration, $\lambda$ is the SSR coefficient, and $\gamma$ is the $L_f$-norm regularization coefficient. Here, $p$ is the vector used to prevent penalizing the intercept term, defined as $p = [\ldots, p_j, \ldots]$, where $p_j$ is 0 for the intercept (to avoid penalizing it) and 1 for all other coefficients. Then, $p^{\top} L_f(w)$ is approximated as:

$p^{\top} L_f(w) = \sum_{j=1}^{d} p_j L_f(w_j) \approx \sum_{j=1}^{d} h_j w_j^2 = w^{\top} H w$ where $H \in \mathbb{R}^{d \times d}$ is the diagonal matrix formed from the $h_j$ values defined below ($\epsilon$ is a small positive constant with default $\epsilon = 10^{-10}$):

$h_j = p_j\,(|\hat{w}_j|^{2-f} + \epsilon)^{-1}.$

The obtained cost function $J^{(k)}(w)$, comprising only quadratic terms, linear terms, and constants, can be expressed in matrix-vector form.

$J^{(k)}(w) = \tfrac{1}{n} w^{\top} X^{\top} Z X w + \tfrac{1}{n}\bigl(\tfrac{1}{2} - y\bigr)^{\top} X w + \log 2 + \tfrac{\lambda}{2d}(w - \hat{w})^{\top}(w - \hat{w}) + \tfrac{\gamma}{2d}\, w^{\top} H w.$

Afterwards, we can minimize $J^{(k)}(w)$ with respect to $w$ at the $k$th iteration by taking the derivative of $J^{(k)}(w)$ with respect to $w$ and setting it equal to zero:

$\dfrac{\partial J^{(k)}(w)}{\partial w} = \tfrac{2}{n} X^{\top} Z X w + \tfrac{1}{n} X^{\top}\bigl(\tfrac{1}{2} - y\bigr) + \tfrac{\lambda}{d}(w - \hat{w}) + \tfrac{\gamma}{d} H w = 0$, which can be arranged as:

$\Bigl(\tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H\Bigr) w = \tfrac{\lambda}{d}\hat{w} + \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$ where $I \in \mathbb{R}^{d \times d}$ is the identity matrix. Note that Eq. (14) can be represented as a linear system $A w^{(k+1)} = b$. Finally, the solution can be represented as below, where $Z$ and $H$ are updated in each iteration. In our case, $A$ is a symmetric matrix, and $v$ is a constant vector that can be computed before the iterative minimization.

Initialization:
$v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$

Iteration:
$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H$
$b = \tfrac{\lambda}{d} w^{(k)} + v$
Solve $A w^{(k+1)} = b$
DOI: 10.7717/peerj-cs.2579/table-11

The data matrix $X$ can be rank-deficient due to collinear features or insufficient data samples. When the data matrix $X$ is rank-deficient and no regularization is applied ($\lambda = 0$), the matrix $A$ may become a symmetric positive semi-definite matrix. Nonetheless, if $X$ is of full rank or regularization is applied ($\lambda > 0$), the matrix $A$ becomes symmetric positive definite. To mitigate the challenges due to rank-deficiency and the subsequent numerical instabilities, we utilize a low-rank approximation via singular value decomposition (SVD) (Lawson & Hanson, 1995; Hansen, 1990) for approximating the data matrix $X \in \mathbb{R}^{n \times d}$ (Ye, 2005).

$U, S, V = \text{Truncated-SVD}(X)$ where $X = U S V^{\top}$, $U \in \mathbb{R}^{n \times d}$, $S \in \mathbb{R}^{d \times d}$, and $V \in \mathbb{R}^{d \times d}$. Then

$A = \tfrac{2}{n}(U S V^{\top})^{\top} Z (U S V^{\top}) + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} V S U^{\top} Z U S V^{\top} + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} V S (U^{\top} Z U) S V^{\top} + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H.$

We use an $r$-rank approximation of the $U$, $S$, and $V$ matrices such that $X \approx \hat{U}\hat{S}\hat{V}^{\top}$. Note that $\hat{U} \in \mathbb{R}^{n \times r}$, $\hat{S} \in \mathbb{R}^{r \times r}$, and $\hat{V} \in \mathbb{R}^{d \times r}$ denote the low-rank approximations of the $U$, $S$, and $V$ matrices, respectively. Here, $\hat{U}$ and $\hat{V}$ are orthonormal matrices and $\hat{S}$ is a diagonal matrix. In this low-rank approximation, $r$ is the rank such that $r \le d$, and $r$ is chosen such that almost all of the energy (by eigenvalues) is preserved.

We initialize $w^{(0)}$ using least squares to start from a reasonable initial point. For computational efficiency, we also use the SVD decomposition $X = U S V^{\top}$ in the least-squares equation:

$w^{(0)} = (X^{\top} X)^{-1} X^{\top} y = (V S U^{\top} U S V^{\top})^{-1} V S U^{\top} y = V S^{-1} U^{\top} y$

To circumvent numerical challenges and gain further computational efficiency, we employ the low-rank approximation $w^{(0)} = \hat{V}\hat{S}^{-1}\hat{U}^{\top} y = F y$, where $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$. Computing $F$ is computationally efficient since inverting the diagonal matrix $\hat{S}$ is cheap, and computing $w^{(0)} = F y$ is cheap as well. For both the regularized and unregularized cases, a common initialization is defined as follows:

$U, S, V = \text{Truncated-SVD}(X)$ since $X = U S V^{\top}$
Determine the rank $r$ such that $X \approx \hat{U}\hat{S}\hat{V}^{\top}$
$\hat{U} = r\text{-rank}(U)$, $\hat{S} = r\text{-rank}(S)$, and $\hat{V} = r\text{-rank}(V)$
$w^{(0)} = F y$ where $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$
DOI: 10.7717/peerj-cs.2579/table-12
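As an illustration only, this initialization can be sketched in NumPy as below. The energy-based rank cut shown here is a plain cumulative-energy rule, not the log-eigenvalue rule of "Determining the Rank", and the helper name is hypothetical.

```python
import numpy as np

def init_weights(X, y, energy=0.999999):
    """Least-squares initialization w0 = F y using a truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) Vt
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)            # simple cumulative-energy rule
    r = int(np.searchsorted(cum, energy)) + 1           # smallest r reaching the target energy
    U_hat, s_hat, V_hat = U[:, :r], s[:r], Vt[:r, :].T
    F = V_hat @ np.diag(1.0 / s_hat) @ U_hat.T          # F = V_hat S_hat^{-1} U_hat^T
    return F @ y, (U_hat, s_hat, V_hat, F)
```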

There are two paths for the developed iterative minimization: (a) with regularization and (b) without regularization. It should be noted that applying Lf-norm regularization requires setting a positive γ value, which in turn necessitates assigning a positive λ parameter to enable SSR regularization.

a) With regularization ($\lambda > 0$ and $\gamma > 0$)

Recall that we first construct the matrix $A$ and the vector $b$, and then solve the linear system $A w^{(k+1)} = b$. Using $v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr)$, we have:

$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H \quad\text{and}\quad b = \tfrac{\lambda}{d} w^{(k)} + v.$

With $G = \hat{S}\hat{V}^{\top}$, the low-rank approximation $X \approx \hat{U}\hat{S}\hat{V}^{\top}$ leads to $X \approx \hat{U} G$:

$A = \tfrac{2}{n} X^{\top} Z X + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H \approx \tfrac{2}{n}(\hat{U}\hat{S}\hat{V}^{\top})^{\top} Z (\hat{U}\hat{S}\hat{V}^{\top}) + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H = \tfrac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H.$

Next, we compute $v$ (needed for $b$) using its $r$-rank approximation, as below:

$v = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr) \approx \tfrac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr).$

Finally, the numerical minimization approach with regularization is as follows:

Initialization:
$G = \hat{S}\hat{V}^{\top}$ where $X \approx \hat{U}\hat{S}\hat{V}^{\top}$, with the low-rank estimation obtained using SVD
$v = \tfrac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr)$

Iteration:
$A = \tfrac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \tfrac{\lambda}{d} I + \tfrac{\gamma}{d} H$
$b = \tfrac{\lambda}{d} w^{(k)} + v$
Solve $A w^{(k+1)} = b$
DOI: 10.7717/peerj-cs.2579/table-13

Thanks to the low-rank approximation applied to the data matrix $X$, in addition to the applied $L_f$-norm and SSR regularization, the proposed cost function becomes strictly convex. Hence, the linear system $A w^{(k+1)} = b$ is well-conditioned, ensuring a unique solution for $w^{(k+1)}$, whereas the original logistic regression cost function remains only convex due to potential rank deficiencies in $X$.

b) Without regularization ($\lambda = 0$ and $\gamma = 0$)

First, $A$ and $b$ are constructed; then the linear system $A w^{(k+1)} = b$ is solved.

$\tfrac{2}{n}(X^{\top} Z X)\, w^{(k+1)} = \tfrac{1}{n} X^{\top}\bigl(y - \tfrac{1}{2}\bigr) \;\Rightarrow\; w^{(k+1)} = \tfrac{1}{2}(X^{\top} Z X)^{-1} X^{\top}\bigl(y - \tfrac{1}{2}\bigr).$

Let us use the low-rank approximation $X \approx \hat{U}\hat{S}\hat{V}^{\top}$:

$w^{(k+1)} \approx \tfrac{1}{2}\Bigl((\hat{U}\hat{S}\hat{V}^{\top})^{\top} Z (\hat{U}\hat{S}\hat{V}^{\top})\Bigr)^{-1}(\hat{U}\hat{S}\hat{V}^{\top})^{\top}\bigl(y - \tfrac{1}{2}\bigr) \approx \tfrac{1}{2}\,\hat{V}\hat{S}^{-1}\hat{U}^{\top} Z^{-1}\hat{U}\hat{U}^{\top}\bigl(y - \tfrac{1}{2}\bigr) = \tfrac{1}{2}\, F Z^{-1} y_q \quad\text{where}\quad y_q = \hat{U}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr).$

The final iterative numerical minimization approach without regularization is as follows:

Initialization:
$y_q = \hat{U}\bigl(\hat{U}^{\top}(y - \tfrac{1}{2})\bigr)$

Iteration:
$w^{(k+1)} \approx \tfrac{1}{2}\, F Z^{-1} y_q$
DOI: 10.7717/peerj-cs.2579/table-14

Note that $Z$ is a diagonal matrix with all positive entries, making its inverse extremely efficient to compute. Additionally, $F Z^{-1} y_q$ can be calculated efficiently, as computing $Z^{-1} y_q$ requires only $n$ operations, and multiplying $F$ by the resulting vector requires only $d \times n$ operations. As a result, each iteration without regularization remains highly efficient. In addition to computational efficiency, it is guaranteed that $F Z^{-1}$ is always well-conditioned, and a unique solution for $w^{(k+1)}$ exists since $Z$, as a diagonal matrix with positive entries, is always invertible. Thus, the employed low-rank approximation regularizes the solution to mitigate possible collinearity in the data matrix $X$.
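A minimal sketch of this regularization-free loop is given below; it reuses the hypothetical `z_coeffs`, `F`, `U_hat`, and `w0` objects from the earlier snippets and is illustrative rather than the released implementation.

```python
import numpy as np
# z_coeffs is the helper defined in the earlier Soft-Plus snippet

def fblr_no_reg(X, y, F, U_hat, w0, max_iter=10, tol=1e-3):
    """Sketch of the regularization-free iteration: w <- (1/2) F Z^{-1} y_q."""
    yq = U_hat @ (U_hat.T @ (y - 0.5))     # fixed across iterations
    w = w0
    for _ in range(max_iter):
        w_prev = w
        z = z_coeffs(X @ w_prev)           # diagonal of Z, Eq. (6)
        w = 0.5 * F @ (yq / z)             # Z^{-1} y_q is O(n); F times a vector is O(nd)
        if np.linalg.norm(w - w_prev) <= tol:
            break
    return w
```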

Implementation

This section outlines the implementation of the proposed method, with the low-rank approximation described in Algorithm 1 and the proposed FBLR method presented in Algorithm 2. In Algorithm 1, $X$ is the input data matrix, $\xi$ is the energy percentile (default 99.9999), and the DIMENSION function returns the number of rows ($n$) and the number of features ($d$). The SVD(∗) function carries out singular value decomposition, choosing the most suitable variant (truncated SVD, randomized SVD (refer to "Randomized SVD"), or SVD with row reduction (SVD-RR, refer to "SVD with Row Reduction")) based on an evaluation of $n$ and $d$. Lastly, the RANK function (see "Determining the Rank") in Algorithm 1 finds the matrix's rank by analyzing the eigenvalues derived from $S$, consequently generating the low-rank matrices.

Algorithm 1 :
Procedure to obtain a low-rank approximation of the data matrix X.
1: procedure LowRankApproximation($X$, $\xi$)
2:    ▷ Returns the dimensions of the data matrix $X$
3:    $n, d \leftarrow$ DIMENSION($X$)    ▷ $n$: #rows, $d$: #features
4:    ▷ Utilizing either truncated SVD or randomized SVD
5:    $U, S, V \leftarrow$ SVD(∗)($X$)    ▷ see "SVD with Row Reduction (SVD-RR)" and "Randomized SVD"
6:    ▷ Computes the rank of matrix $X$ using eigenvalues
7:    $r \leftarrow$ RANK($S$, $\xi$)    ▷ see "Determining the Rank"
8:    ▷ The low-rank components of the matrices are extracted
9:    $\hat{U} \leftarrow U[:, 1\!:\!r]$
10:   $\hat{S} \leftarrow S[1\!:\!r, 1\!:\!r]$
11:   $\hat{V} \leftarrow V[:, 1\!:\!r]$
12:   return $n, d, r, \hat{U}, \hat{S}, \hat{V}$
13: end procedure
DOI: 10.7717/peerj-cs.2579/table-101
Algorithm 2 :
Fast binary logistic regression (FBLR).
1: procedure FBLR($X$, $y$, $f$, $\lambda$, $\gamma$, $\xi$, $K$, $C_{\text{tolerance}}$)
2:    ▷ Initialization
3:    $n, d, r, \hat{U}, \hat{S}, \hat{V} \leftarrow$ LowRankApproximation($X$, $\xi$)
4:    $F \leftarrow \hat{V}\hat{S}^{-1}\hat{U}^{\top}$
5:    $w \leftarrow F y$    ▷ Initializing weights via the pseudo-inverse
6:    if ($\lambda > 0$ or $\gamma > 0$) then
7:       $G \leftarrow \hat{S}\hat{V}^{\top}$
8:       $v \leftarrow \frac{1}{n} G^{\top}\bigl(\hat{U}^{\top}(y - \frac{1}{2})\bigr)$
9:    else
10:      $y_q \leftarrow \hat{U}\bigl(\hat{U}^{\top}(y - \frac{1}{2})\bigr)$
11:   end if
12:   ▷ Iteration
13:   for $k \leftarrow 0$ to $K$ do
14:      $\hat{w} \leftarrow w$
15:      $Z \leftarrow \operatorname{diag}(z)$    ▷ $z$ = vector of $z_i$, Eq. (6)
16:      if ($\lambda > 0$ or $\gamma > 0$) then
17:         $H \leftarrow \operatorname{diag}(h)$    ▷ $h$ = vector of $h_j$, Eq. (11)
18:         $A \leftarrow \frac{2}{n} G^{\top}(\hat{U}^{\top} Z \hat{U}) G + \frac{\lambda}{d} I + \frac{\gamma}{d} H$
19:         $b \leftarrow \frac{\lambda}{d} w^{(k)} + v$
20:         $w \leftarrow$ SOLVE($A$, $b$)    ▷ Applying Cholesky decomposition to solve $Aw = b$
21:      else
22:         $w \leftarrow \frac{1}{2} F Z^{-1} y_q$
23:      end if
24:      if ($k \geq 3$ and $\|w - \hat{w}\| \leq C_{\text{tolerance}}$) then
25:         break
26:      end if
27:   end for
28:   ▷ Finalization
29:   return $w$
30: end procedure
DOI: 10.7717/peerj-cs.2579/table-102

The pseudocode for the proposed FBLR method is given in Algorithm 2. In this algorithm, operations that remain constant throughout iterations are performed upfront in the initialization phase. First, low-rank approximation of the matrix X is obtained using Algorithm 1. Depending on whether regularization is applied, various matrices and vectors are precomputed accordingly. Subsequently, at each iteration, a dense linear system is constructed and solved using the Cholesky decomposition, leveraging the fact that A is a symmetric positive definite matrix when regularization is employed (line 20 in Algorithm 2). Even when regularization is not employed, computing w is still well-conditioned (line 22 in Algorithm 2) since Z is a diagonal matrix with all positive entries.

Using Python v3.12.1, NumPy v1.26.4, and scikit-learn v1.3.2, the proposed FBLR method in Algorithm 2 is implemented as the Python class FastLogisticRegressionLowRank, extending scikit-learn's abstract BaseEstimator and ClassifierMixin classes. We use NumPy for all matrix and vector operations, and our NumPy configuration uses OpenBLAS as the backend for basic linear algebra subprograms (BLAS) operations in our environment. For additional efficiency, NumPy also utilizes CPU instructions such as single instruction multiple data (SIMD).
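The wrapper pattern can be illustrated with the simplified skeleton below; the class name echoes the paper's, but the body is a stand-in that omits the actual FBLR solver and only shows how such an estimator plugs into the scikit-learn API.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class FastLogisticRegressionLowRankSketch(BaseEstimator, ClassifierMixin):
    """Simplified stand-in showing the scikit-learn estimator interface."""

    def __init__(self, f=0.0, lam=0.0, gamma=0.0, max_iter=10, tol=1e-3):
        self.f, self.lam, self.gamma = f, lam, gamma
        self.max_iter, self.tol = max_iter, tol

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
        # ... the low-rank initialization and FBLR iterations of Algorithm 2 go here ...
        self.coef_ = np.zeros(Xb.shape[1])               # placeholder for the learned w
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        p = 1.0 / (1.0 + np.exp(-Xb @ self.coef_))
        return np.column_stack([1.0 - p, p])

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)
```

Because the class extends BaseEstimator and ClassifierMixin, it can be dropped directly into scikit-learn utilities such as GridSearchCV and cross_val_score.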

Computational complexity analysis

In Algorithm 2, the time complexity of the initialization phase is $O(nd^2)$, since the time complexity of both the SVD used in the low-rank approximation and of computing $F = \hat{V}\hat{S}^{-1}\hat{U}^{\top}$ is $O(nd^2)$, while the remaining operations have lower time complexity. Furthermore, the computational complexity per iteration is $O(nd^2)$ when incorporating regularization ($L_f$-norm & SSR) and $O(nd)$ without regularization. Noting that $k$ represents the number of iterations executed until Algorithm 2 converges, the total time complexity of the FBLR method is $O(nd^2k)$ for cases with regularization and $O(nd\max(d,k))$ when no regularization is applied. In the case without regularization, the time complexity $O(nd\max(d,k))$ is obtained since the computational complexity of the initialization is $O(nd^2)$ and the total cost of the iterations is $O(ndk)$. The total computational complexity is $O(nd^2) + O(ndk) = O(nd \cdot d) + O(nd \cdot k)$, which simplifies to $O(nd\max(d,k))$. For the proposed FBLR method, $k$ is typically less than or equal to the maximum iteration count ($K$), whose default value is 10.

Our time complexity analysis reveals that the FBLR method has linear time complexity with respect to the number of rows ( n) and at most quadratic time complexity with respect to the data dimension ( d). Furthermore, for the proposed FBLR method, the maximum number of iterations is only 10, which is considerably low. Thus, the maximum iteration count of the proposed FBLR method is small compared to logistic regression in scikit-learn with various solvers (see Table 2), demonstrating its efficiency.

Table 2:
Maximum number of iterations for the scikit-learn logistic regression solvers.
Solver            Default maximum iteration (K)   Typical iteration count
LIBLINEAR         100                             100 to 1,000
LBFGS             100                             100 to 500 (up to 1,000)
Newton-CG         100                             100 to 500 (up to 1,000)
Newton-Cholesky   100                             100 to 500 (up to 1,000)
SAG               1,000                           1,000 to 5,000
SAGA              1,000                           1,000 to 5,000
DOI: 10.7717/peerj-cs.2579/table-2

The space complexity of the proposed FBLR method is at least the size of the input data matrix $X \in \mathbb{R}^{n \times d}$. We therefore ignore the vectors employed in the FBLR method, as they have dimensions of $n \times 1$ or $d \times 1$, which are much smaller than $n \times d$, the size of the data matrix $X$, and focus only on the matrices employed in the FBLR method, as outlined below:

  • $X, U, F \in \mathbb{R}^{n \times d}$ and $V, G, A \in \mathbb{R}^{d \times d}$ are dense matrices, so their space complexity is $O(nd)$.

  • $Z, H \in \mathbb{R}^{n \times n}$ and $S \in \mathbb{R}^{d \times d}$ are sparse diagonal matrices, so their space complexity is $O(n)$.

Therefore, space complexity of the proposed FBLR method is O(nd).

Results

We conducted experiments on a system running Ubuntu Linux 20.04 with an Intel Core i9-10900KF CPU (10 cores) and 64 GB of RAM. Note that NumPy leverages the SIMD extensions supported by a CPU to a reasonable extent. For the Intel Core i9-10900KF CPU, the SIMD extensions utilized by NumPy are SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, F16C, and FMA3.

For the proposed FBLR method, the $\lambda$ and $\gamma$ parameters are set to zero for all experiments unless stated otherwise. Also, fixed default values are used for the parameters $\epsilon$, $\xi$, $K$, and $C_{\text{tolerance}}$. The parameter $\epsilon$, a small positive constant for the applied $L_f$-norm approximation, is set to $10^{-10}$. Also, $\xi$, representing the energy percentile for low-rank approximation, is 99.9999, excluding zeros or very small eigenvalues. The maximum number of iterations is set to $K = 10$, and the convergence tolerance is set to $C_{\text{tolerance}} = 10^{-3}$. From this point forward, whenever logistic regression (LR) is mentioned in the text, it refers specifically to the logistic regression within scikit-learn.

Datasets

Real-world dataset

We used datasets with binary labels from OpenML in the experiments. We focused on datasets with no missing data and with training times exceeding 3 s when utilizing LR; this strategy selects the datasets that would benefit the most from speed enhancements. Our analysis compared the classification performance and execution efficiency of LR and our FBLR method across different scenarios, such as balanced and imbalanced datasets of medium to large sizes. Additionally, we evaluated LR and FBLR on UCI's HEPMASS dataset, which features rich data from key physics experiments targeting exotic particle discovery and poses a binary classification challenge. Detailed information on the datasets used, along with their attributes, is provided in Table 3. In a binary dataset, the term 'majority' denotes a frequency between 0.5 and 1.0 for the class with more occurrences than the other class. In Table 3, the creditcard dataset is imbalanced, as indicated by its high majority value.

Table 3:
Datasets utilized in the experiments along with their attributes.
Name n (#samples) d (#features) Majority%
Kits 1,000 27,648 52%
Road-safety 111,762 32 50%
Creditcard 284,807 30 99%
Airlines 539,383 7 55%
Colon 5,100,000 62 50%
HEPMASS 10,500,000 28 50%
DOI: 10.7717/peerj-cs.2579/table-3

Synthetic dataset

The make_classification method in the scikit-learn library is a mechanism for creating random synthetic data with a desired number of classes, samples, and features. n_redundant is responsible for generating linear combinations of informative features, weights sets the sample distribution across classes to introduce imbalance, and flip_y alters the class of a specified fraction of samples at random, introducing noise and increasing the complexity of the classification task.
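For example, a synthetic set with redundant (collinear) features, class imbalance, and label noise can be generated as follows; the specific values are illustrative rather than those used in the experiments.

```python
from sklearn.datasets import make_classification

# Illustrative synthetic set: 8 features, 2 of them linear combinations of
# informative ones, mild class imbalance, and 1% label noise.
X, y = make_classification(
    n_samples=100_000, n_features=8, n_informative=5, n_redundant=2,
    weights=[0.7, 0.3], flip_y=0.01, random_state=0,
)
```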

Experiment design

Using real-world data, three experiments were designed to evaluate classification performance and execution time of LR and FBLR: (a) experiment on HEPMASS data with multiple performance metrics (b) a single-run experiment without regularization on OpenML datasets and (c) an experiment incorporating regularization through various parameters on OpenML datasets. Optimizing hyperparameters is crucial to improve the model’s classification performance. We employed the grid search cross-validation (GridSearchCV) method in scikit-learn to find the best models for LR and FBLR, specifically to analyze the impact of regularization on accuracy metrics and execution time.

We also examined the alignment between theoretical computational complexity and practical execution times on synthetic datasets, as these allow for greater control over the data.

Performance metrics

Several performance metrics exist. Accuracy measures the proportion of correct predictions among all samples, providing an overall view of classification performance.

$\text{Accuracy} = \dfrac{TP + TN}{FP + FN + TP + TN}$ where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.

Precision measures the proportion of TP in all positive predictions, recall measures the proportion of actual positives that are correctly identified, and F1-score is the harmonic mean of precision and recall.

$\text{Precision} = \dfrac{TP}{FP + TP} \quad\text{and}\quad \text{Recall} = \dfrac{TP}{FN + TP} \quad\text{and}\quad \text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The area under the curve (AUC) is determined by plotting the true positive rate (TPR) vs. the false positive rate (FPR) at various thresholds and calculating the area beneath this curve.
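For reference, all of these metrics are available in scikit-learn and can be collected as in the short sketch below; the helper function is ours.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def report(y_true, y_pred, y_score):
    """Collect the metrics used in the experiments into a dictionary."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),   # y_score = P(y = 1 | x)
    }
```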

Impact of solver selection on LR performance

The LogisticRegression class in scikit-learn employs several optimization methods (see Table 1), each tailored for different regularizers, data sizes, and computational needs. However, this leads to differences in execution time and accuracy across the solvers. As seen in Table 4, the execution time of logistic regression in scikit-learn varies significantly (from 3.575 to 99.339 s). The Newton-CG and LIBLINEAR solvers demonstrate good accuracy, while the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm, stochastic average gradient (SAG), and SAGA have lower accuracy due to inadequate default iteration limits. It is important to note that in this experiment, we did not use GridSearchCV; instead, each solver was executed individually with its default parameters.

Table 4:
Performance of LR solvers on the road-safety dataset.
Solver Penalty Time (s) Accuracy
Newton-CG L2 6.7949 0.6949
Newton-CG No regularization 5.5673 0.6949
LBFGS L2 59.4638 0.6211
LBFGS No regularization 55.8338 0.6290
LIBLINEAR L1 39.889 0.6949
LIBLINEAR L2 3.5751 0.6935
SAG L2 42.6379 0.5819
SAG No regularization 42.5621 0.5820
SAGA L1 99.0867 0.5821
SAGA L2 85.0862 0.5821
SAGA ElasticNet ( α=0.5) 99.3389 0.5821
SAGA No regularization 85.9067 0.5821
DOI: 10.7717/peerj-cs.2579/table-4

For the LR implementation, each solver given in Table 1 has different computational complexity, some being quite efficient, like LIBLINEAR, but the required number of iterations is much larger than our method (see Table 2). The execution time of the proposed FBLR method remains tightly bounded (from 0.1695 to 0.2721 s, with an average of 0.2482 s) as the penalty (regularization) varies (see Table 5). This demonstrates the efficiency of our minimization scheme, which is relatively unaffected by the choice of regularizer compared to solvers used within the LR (Table 4).

Table 5:
Performance of FBLR with different penalties for road-safety dataset.
λ_SSR   f   γ   Time (s)   Accuracy
0 0 0.001 0.2094 0.6951
0 0.5 0.001 0.2653 0.6950
0 1 0.001 0.2447 0.6948
0 1.5 0.001 0.2380 0.6950
0 2 0.001 0.1695 0.6949
1 0 0.001 0.2558 0.6950
1 0.5 0.001 0.2508 0.6948
1 1 0.001 0.2470 0.6948
1 1.5 0.001 0.2550 0.6948
1 2 0.001 0.2688 0.6949
2 0 0.001 0.2602 0.6947
2 0.5 0.001 0.2690 0.6948
2 1 0.001 0.2721 0.6947
2 1.5 0.001 0.2685 0.6947
2 2 0.001 0.2649 0.6947
DOI: 10.7717/peerj-cs.2579/table-5

We observed that specific solvers within scikit-learn, like LIBLINEAR, lack CPU-level parallelization support, whereas our approach, leveraging NumPy, enables parallelism through scikit-learn’s inherent parallel capabilities. We suggest that scikit-learn developers consider adopting a single NumPy-based solver to simplify maintenance and automatically benefit from NumPy improvements.

Experimental results

For the experiments, we used the real-world data in Table 3 and synthetic data created with make_classification method in the scikit-learn. Real-world data is divided into 70% training data and 30% test data.

Experiments on real-world datasets

Table 6 presents the performance metrics, while Fig. 1 shows the receiver operating characteristic (ROC) curves for LR and FBLR with default parameters on the HEPMASS dataset. In LR, we used the LIBLINEAR solver for faster execution and set the maximum number of iterations to 250 to ensure high accuracy for a fair comparison. We use a single thread, even though FBLR can run with multiple threads, while LR with LIBLINEAR cannot. Experiments were run 10 times to compare LR and FBLR, showing that the loss in every evaluated metric did not exceed 0.005, while a 13.56× speedup was obtained.

Table 6:
Multiple performance metrics for LR and FBLR on the HEPMASS dataset.
Metric Train Test
LR FBLR LR FBLR
Accuracy 0.83686 0.83583 0.83646 0.83553
Recall 0.83588 0.83124 0.83562 0.83104
Precision 0.83759 0.83901 0.83708 0.83863
F1-Score 0.83673 0.83511 0.83635 0.83481
DOI: 10.7717/peerj-cs.2579/table-6
Figure 1: ROC curves and AUC values of LR and FBLR on HEPMASS dataset.

LR is powered by LIBLINEAR, a highly efficient C/C++ library designed for large-scale linear classification, among other optimized solvers (see Table 1). On the other hand, the proposed FBLR method relies solely on the NumPy package (with openblas64 as its backend), a core numerical library of Python. This approach ensures that FBLR takes full advantage of NumPy's efficient handling of array operations and its integration with the basic linear algebra subprograms (BLAS)2.

In Table 7, we compare LR and FBLR on the OpenML datasets in terms of accuracy, time (in seconds), and speedup (LR time/FBLR time). No regularization is used in these experiments. In LR, LIBLINEAR is chosen as the solver and None as the penalty, while default values are used for the other parameters. Table 7 shows the significant execution-time improvements obtained by FBLR compared to LR. Our method achieves training times up to 48.3 times faster and, on average, an order of magnitude faster than LR.

Table 7:
Single-run experiment comparing accuracy and execution time (s) of LR and proposed FBLR method without regularization on OpenML datasets.
Filename LR ACC FBLR ACC LR time (s) FBLR time (s) Speedup
Kits 0.5700 0.5633 4.5378 0.6795 6.6780
Road-safety 0.6935 0.6956 3.7074 0.1517 24.4337
Creditcard 0.9991 0.9990 3.4729 0.2766 12.5530
Airlines2 0.5965 0.5939 5.0756 0.1051 48.3072
Colon 0.9740 0.9680 10.6571 1.0858 9.8149
DOI: 10.7717/peerj-cs.2579/table-7

In Table 8, we compare regularized LR and regularized FBLR using GridSearchCV. We used grid search cross-validation in scikit-learn to find the best models (by searching various parameters) to compare LR and our FBLR method. It is worth mentioning that experimenting with a broader range of parameter combinations could improve both methods' generalization and accuracy. Nonetheless, as our primary objective was to compare the speed of the methods, we maintained an equal number of experiments for consistency. In GridSearchCV, 5 folds are used; it assesses every combination of parameter values to identify the most effective combination, yielding the best classifier. Due to memory limitations in the experimental setup, results for the kits dataset could not be included for the LR algorithm in Table 8, whereas FBLR is able to process the kits dataset.

Table 8:
GridSearchCV experiment comparing accuracy and execution time (s) of LR and the proposed FBLR method with regularization on OpenML datasets.
Filename LR ACC FBLR ACC LR time (s) FBLR time (s) Speedup
Kits – 0.5667 – 18,761.8684 –
Road-safety 0.6949 0.6951 191.0991 5.8685 32.56
Creditcard 0.9992 0.9990 49.9478 11.1041 4.5
Airlines2 0.5965 0.5956 92.7151 5.9504 15.58
Colon 0.9740 0.9740 145.2604 77.7610 1.87
DOI: 10.7717/peerj-cs.2579/table-8

Table 8 presents the best accuracies and the corresponding execution times and speedups for LR and FBLR on five OpenML datasets. For LR, we used the solvers ('lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga') and their penalties ('l1', 'l2', 'elasticnet', None) as GridSearchCV search parameters. As detailed in Table 1, specific solvers do not support certain penalties, so we combine solvers and penalties in GridSearchCV accordingly. To have an equal number of experiments, we limited the FBLR parameters to a specific range and tested 12 combinations. In particular, the parameter settings for FBLR in the experiments were as follows: the norm parameter f of the Lf-norm was varied over the values [0, 0.5, 1, 2] and the SSR coefficient ( λ) over the values [0, 1, 2].
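One way to encode such solver-penalty compatibility in GridSearchCV is a list of sub-grids, as sketched below; the concrete values are illustrative rather than the exact grid used in the experiments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Solver/penalty pairs must be compatible, so the grid is a list of sub-grids.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"]},
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2", None]},
    {"solver": ["saga"], "penalty": ["l1", "l2", None]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5]},
]
search = GridSearchCV(LogisticRegression(max_iter=250), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_estimator_ holds the best classifier
```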

The proposed FBLR method is designed to prioritize computational efficiency while maintaining accuracy. We employ SVD for low-rank approximation, and with a high energy percentile (99.9999), it retains significant information. Our novel second-order Soft-Plus approximation provides a good, though not exact, fit, enhancing numerical minimization efficiency and reducing required iterations ( k), with minimal potential accuracy loss. If desired, the weight vector from FBLR can be fine-tuned with an efficient Logistic Regression method in just a few iterations for a precise result.

Experiments on synthetic dataset

Recall that the computational complexity of the proposed FBLR method is $O(nd^2k)$ when regularization is applied and $O(nd\max(d,k))$ without regularization. Here, $n$ denotes the number of rows, $d$ the number of features, $k$ the number of iterations, and $K$ the maximum number of iterations. When $d \geq K$, the expression $\max(d,k)$ simplifies to $d$, given that $k \leq K$, where $k$ iterates up to a maximum of $K$. Therefore, the computational complexity without regularization becomes $O(nd^2)$. In accordance, Fig. 2 illustrates that the execution time of FBLR scales linearly with $n$ and exhibits near-quadratic growth with respect to $d$. Note that we compared the theoretical computational complexity with practical execution times on synthetic datasets, since we have more control over synthetic data.

Figure 2: Execution time of FBLR with respect to n and d using synthetic data.

A regularization path visualizes how feature coefficients change as the regularization parameter varies; when a feature's coefficient drops to zero, the corresponding feature is removed. Figure 3 shows regularization paths for the coefficients of the proposed FBLR method under $L_0$-norm regularization. It illustrates the effect of $L_0$-norm regularization in promoting sparsity within the model's parameters ($w_1$ to $w_8$). The sharpness of the $L_0$-norm regularization paths and the reduction of weight coefficients to exactly 0 demonstrate the effectiveness of the proposed $L_f$-norm regularizer in promoting sparsity for $f = 0$. Note that the intercept term, $w_0$, remains unpenalized and almost constant across varying regularization levels determined by the $\gamma$ parameter. Applying regularization without penalizing the intercept is one strong feature of the proposed FBLR method that is not directly supported by LR; in scikit-learn, leaving the intercept unpenalized requires non-trivial parameter tweaking and is also very data dependent.


Figure 3: $L_0$-norm paths using synthetic data which contains 10K samples with seven features and an intercept.
Feature 2 is a linear combination of feature 0 and 1, and feature 4 is correlated with feature 3.

Conclusion and discussion

This study presents a novel approach, namely fast binary logistic regression (FBLR), to enhance the training efficiency of binary logistic regression. FBLR reduces training times, achieving speeds that are, on average, an order of magnitude faster than those of scikit-learn’s logistic regression, courtesy of the efficient numerical minimizer we developed. Moreover, unlike scikit-learn’s logistic regression, which utilizes multiple solvers, we propose a single, efficient solver that incorporates a second-order accurate Soft-Plus approximation and flexible Lf-norm regularizer. This approach supports various weight penalties, thereby enhancing the training process for versatile models. Compared to scikit-learn, our approach relies on a single solver, simplifying the code and improving maintainability.

Furthermore, we have devised a low-rank approximation strategy to address collinearity among features. Our low-rank approach harnesses randomized SVD for high-dimensional feature sets and also a novel SVD with row reduction (SVD-RR) approach, tailored specifically for large datasets containing numerous rows. For smaller to moderate-sized datasets, we just employ truncated SVD. Also, we introduce an innovative strategy to determine the optimal transition from truncated SVD to randomized SVD. To mitigate the dominance of overly large eigenvalues, the logarithm of the eigenvalues was employed to establish the rank, r, suitable for low-rank approximation. The achieved computational efficiency is crucial for crafting generalized logistic regression models, enabling iterative parameter tuning to achieve an optimal balance between bias and variance.

In addition to existing computational efficiencies, further acceleration can be achieved by applying stochastic gradient descent or mini-batch methods for faster approximate gradient evaluation. Also, parallel computing frameworks such as CuPy, Numba, PyCUDA, or PyTorch can be utilized. Additionally, deploying a caching strategy for SVD and storing the compact diagonal sparse matrix S and the dense square matrix V in a dictionary can enable efficient reuse. Matrix U can be computed using Eq. (41) once the diagonal matrix S and the square matrix V are retrieved using the hash code of the input data X. This is particularly useful for repeated training sessions when performing best-parameter searches in scikit-learn. Originally conceived for binary labels, our method possesses sufficient flexibility to accommodate multinomial classification using the inherent functions provided by scikit-learn. Nonetheless, our approach offers no advantage over scikit-learn's logistic regression for handling missing values. Also, like scikit-learn's logistic regression, our method is limited to linear decision boundaries. The matrix-vector form of our cost function, resembling Ridge regression, allows our method to incorporate kernel methods such as Kernel Ridge Regression, enabling nonlinear data handling while supporting linear data in its current form. Such a kernel extension can benefit from techniques such as Random Fourier Features (RFF) (Rahimi & Recht, 2007) to further improve execution time.

Appendix

Approximation of soft-plus

The quadratic approximation of $\log(1 + e^{\tau})$ at $\hat{\tau}$ is defined as:

$\log(1 + e^{\tau}) \approx z\tau^2 + q\tau + t$ where $\hat{\tau}$ is a proxy constant for $\tau$ and $z$, $q$, and $t$ are constant coefficients.

First, we determine $t$ by substituting $\tau = 0$, as in Eq. (24):

$\log(1 + e^{0}) \approx z \cdot 0^2 + q \cdot 0 + t$

$t = \log 2.$

Then, for $\hat{\tau} = 0$, we have $z = \frac{1}{8}$ and $q = \frac{1}{2}$ using the Maclaurin series. Next, for $\hat{\tau} \neq 0$, to find the values of $q$ and $z$, we evaluate the approximation at $\hat{\tau}$ and $-\hat{\tau}$:

$\log(1 + e^{\hat{\tau}}) \approx z\hat{\tau}^2 + q\hat{\tau} + \log 2$

$\log(1 + e^{-\hat{\tau}}) \approx z\hat{\tau}^2 - q\hat{\tau} + \log 2.$

By subtracting Eq. (27) from Eq. (26), we obtain

$q = \dfrac{\log(1 + e^{\hat{\tau}}) - \log(1 + e^{-\hat{\tau}})}{2\hat{\tau}}.$

Note that,

$\log(1 + e^{-\hat{\tau}}) = \log(1 + e^{\hat{\tau}}) - \hat{\tau}.$

Then, we obtain $q = \frac{1}{2}$ by plugging Eq. (29) into Eq. (28).

Finally, plugging $q$ and $t$ into Eq. (26), we obtain

$z = \dfrac{\log(1 + e^{\hat{\tau}}) - \log 2}{\hat{\tau}^2} - \dfrac{1}{2\hat{\tau}}$, which leads to the approximation below at point $\hat{\tau}$:

$\log(1 + e^{\tau}) \approx z\tau^2 + \tfrac{1}{2}\tau + \log 2 \quad\text{where}\quad z = \begin{cases} \dfrac{1}{8} & \text{for } \hat{\tau} = 0 \\[6pt] \dfrac{\log(1 + e^{\hat{\tau}}) - \log 2}{\hat{\tau}^2} - \dfrac{1}{2\hat{\tau}} & \text{otherwise.} \end{cases}$

Note that our quadratic Soft-Plus approximation is specifically designed for minimizing $-\log l(w)$ and differs from a second-order Taylor series approximation. Compared to the second-order Taylor approximation, our Soft-Plus approximation more effectively adapts to the shape of the Soft-Plus function, being exact at both the approximation point and the origin. This provides a more accurate quadratic approximation for minimizing the cost function. The second-order Taylor approximation of Eq. (23) at $\hat{\tau}$ is given below, which differs from our approximation:

$\log(1 + e^{\tau}) \approx \left(\dfrac{e^{\hat{\tau}}}{2(1 + e^{\hat{\tau}})^2}\right)\tau^2 + \left(\dfrac{e^{\hat{\tau}}}{1 + e^{\hat{\tau}}} - \dfrac{\hat{\tau} e^{\hat{\tau}}}{(1 + e^{\hat{\tau}})^2}\right)\tau + \left(\log(1 + e^{\hat{\tau}}) - \dfrac{\hat{\tau} e^{\hat{\tau}}}{1 + e^{\hat{\tau}}} + \dfrac{\hat{\tau}^2 e^{\hat{\tau}}}{2(1 + e^{\hat{\tau}})^2}\right).$
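The following short NumPy sketch (with hypothetical helper names) evaluates the proposed surrogate and confirms numerically that it matches the Soft-Plus function exactly at the approximation point and at the origin.

```python
import numpy as np

def softplus(t):
    return np.logaddexp(0.0, t)            # stable log(1 + e^t)

def quad_softplus(t, t_hat):
    """Quadratic Soft-Plus surrogate: exact at t = t_hat and at t = 0."""
    if t_hat == 0.0:
        z = 0.125
    else:
        z = (softplus(t_hat) - np.log(2.0)) / t_hat ** 2 - 0.5 / t_hat
    return z * t ** 2 + 0.5 * t + np.log(2.0)

t_hat = -5.0
for t in (0.0, t_hat, -2.5):
    print(t, softplus(t), quad_softplus(t, t_hat))
# The surrogate equals softplus at t = 0 and t = t_hat, and is close in between.
```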

Empirical quadratic approximation of soft-plus

The employed quadratic approximation of the Soft-Plus function given in Eq. (23) at the points $\hat{\tau} = 0$ and $\hat{\tau} = -5$ is shown in Fig. 4. These approximations are exact at both the approximation point ($\hat{\tau}$) and the origin ($\tau = 0$), with high accuracy between them, but become less accurate outside this range. In contrast, the second-order Taylor approximation only guarantees exactness at the approximation point $\hat{\tau}$.

Figure 4: Quadratic approximation of Soft-Plus at $\hat{\tau} = 0$ and $\hat{\tau} = -5$.

Approximation of Lf-norm

We also approximate the smooth $L_f(\rho) = \rho^2 (|\rho|^{2-f} + \epsilon)^{-1}$ quadratically at a point $\hat{\rho}$, where $u$ is defined as a constant. Here, $\hat{\rho}$ is a proxy constant for $\rho$. The quadratic approximation of the smooth $L_f(\rho)$ function is:

$L_f(\rho) \approx \rho^2 (|\hat{\rho}|^{2-f} + \epsilon)^{-1} = u\rho^2 \quad\text{where}\quad u = (|\hat{\rho}|^{2-f} + \epsilon)^{-1}.$

Optimization problems with $L_0$ terms are combinatorial, requiring discrete decisions to minimize the number of non-zero elements, which makes direct approaches computationally impractical for large-scale problems. Smooth approximations are employed as a solution, substituting the $L_0$ term with a continuous, differentiable function that mimics the counting of non-zero elements. Similarly, our $L_f$-norm approximation smoothly mirrors the $L_1$-norm for $f = 1$ and the $L_0$-norm for $f = 0$, as illustrated in Fig. 5. With our smooth approximation, as $\epsilon$ gets smaller, the approximation gets better (default $\epsilon = 10^{-10}$).
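A small NumPy sketch of these two surrogates is given below (function names are ours); it reproduces the L1-like behaviour for f = 1 and the L0-like counting behaviour for f = 0 away from zero.

```python
import numpy as np

def smooth_lf(rho, f, eps=1e-10):
    """Smooth L_f surrogate: rho^2 / (|rho|^(2-f) + eps)."""
    return rho ** 2 / (np.abs(rho) ** (2.0 - f) + eps)

def quad_lf(rho, rho_hat, f, eps=1e-10):
    """Quadratic approximation u * rho^2 built at the proxy point rho_hat."""
    u = 1.0 / (np.abs(rho_hat) ** (2.0 - f) + eps)
    return u * rho ** 2

rho = np.linspace(-2, 2, 5)
print(smooth_lf(rho, f=1.0))   # close to |rho| (L1) away from 0
print(smooth_lf(rho, f=0.0))   # close to the 0/1 count (L0) away from 0
```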

Figure 5: Smooth approximation of $L_f$-norm for $L_1$-norm and $L_0$-norm.

Despite the $L_1$-norm and $L_0$-norm approximations in Fig. 5 being precise even for relatively large $\epsilon$ values ($\epsilon = 10^{-2}$ and $\epsilon = 10^{-4}$), their direct use is unsuitable within our cost function, leading us to employ a quadratic approximation over this smooth approximation. The quadratic approximations of the $L_f$-norm for the $L_1$-norm ($f = 1$) and the $L_0$-norm ($f = 0$) are depicted in Fig. 6.

Figure 6: Quadratic approximation of $L_f$-norm for $L_1$-norm and $L_0$-norm.

SVD with row reduction

In the case of a large data matrix $X \in \mathbb{R}^{n \times d}$, truncated SVD is computationally demanding due to its complexity $O(nd^2)$. In scenarios where $d^2$ becomes large, the computational burden of applying SVD is mitigated by utilizing randomized SVD (Halko, Martinsson & Tropp, 2010; Halko et al., 2011; Martinsson et al., 2010). However, this method is less effective when $n$ is large. In such instances, a large subset of the matrix $X$ can sufficiently mimic the statistical properties of the full dataset. Hence, we sub-sample the matrix $X$ to create $\tilde{X}$, consisting of $m$ rows. This sub-sampling involves randomly or sequentially selecting $m$ rows from $X$; in our case, we simply take the first $m$ rows. Note that one can employ a sampling mechanism proposed in Menon & Elkan (2011), such as the Drineas, Kannan, and Mahoney method, to obtain better statistical guarantees for the sampled subset. We use a simple sampling mechanism where we set $m$ to a sufficiently large value, $10^5$, ensuring the statistical preservation of $X$ and reducing computational demands for $n > 2m$, thus outperforming direct SVD on the full matrix. However, the $U$ returned by the SVD applied on $\tilde{X}$ is now $m \times d$ instead of $n \times d$, and $S$ also requires normalization. To address these issues, we define the covariance matrices for the full matrix $X$ and the subsampled matrix $\tilde{X}$ as follows:

$C = \dfrac{1}{n-1} X^{\top} X$

$\tilde{C} = \dfrac{1}{m-1} \tilde{X}^{\top} \tilde{X}.$

Assuming $\tilde{X}$, a large subset of $X$, reflects the dataset's statistics, their covariances are nearly equal ($C \approx \tilde{C}$):

$\dfrac{1}{n-1} X^{\top} X \approx \dfrac{1}{m-1} \tilde{X}^{\top} \tilde{X}.$

Applying the definition of SVD to both matrices, represented as $X = U S V^{\top}$ and $\tilde{X} = \tilde{U} \tilde{S} \tilde{V}^{\top}$ respectively, the aforementioned equation can be reformulated as follows:

$\dfrac{1}{n-1} (U S V^{\top})^{\top} (U S V^{\top}) \approx \dfrac{1}{m-1} (\tilde{U} \tilde{S} \tilde{V}^{\top})^{\top} (\tilde{U} \tilde{S} \tilde{V}^{\top}).$

The above equation can be simplified to:

$\dfrac{1}{n-1} (V S S V^{\top}) \approx \dfrac{1}{m-1} (\tilde{V} \tilde{S} \tilde{S} \tilde{V}^{\top})$ where $U$ and $\tilde{U}$ are eliminated since $U^{\top} U = I$ and $\tilde{U}^{\top} \tilde{U} = I$. We assume the same eigenvectors for both data matrices:

$V = \tilde{V} \;\Rightarrow\; \dfrac{1}{n-1} (V S S V^{\top}) \approx \dfrac{1}{m-1} (V \tilde{S} \tilde{S} V^{\top}).$

Multiplying both sides of the above equation by $V^{\top}$ from the left and $V$ from the right and applying simple algebraic simplifications:

$S \approx \sqrt{\dfrac{n-1}{m-1}}\, \tilde{S}.$

Finally, to compute $U$, we use the SVD of $X$ given as $X = U S V^{\top}$ and multiply both sides by $V S^{-1}$, leading to:

$U = X V S^{-1}.$

Thereby, we efficiently compute $\tilde{S}$ and $\tilde{V}$ using truncated SVD on the sub-sampled smaller data matrix $\tilde{X}$, and then also efficiently compute the $U$ matrix using Eq. (41), the $S$ matrix using Eq. (40), and the $V$ matrix using Eq. (39). Thus, the proposed SVD-RR enables computing the SVD of a large data matrix $X$.
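Under the assumptions above, SVD-RR can be sketched in a few NumPy lines; the function below is illustrative, uses plain first-m-rows sub-sampling, and assumes n is substantially larger than m.

```python
import numpy as np

def svd_row_reduction(X, m=100_000):
    """Sketch of SVD-RR: SVD of a row subsample, then rescale S and recover U."""
    n, d = X.shape                               # assumes n > 2m, as in the text
    X_sub = X[:m]                                # simple sub-sampling of the first m rows
    _, s_sub, Vt = np.linalg.svd(X_sub, full_matrices=False)
    s = np.sqrt((n - 1) / (m - 1)) * s_sub       # rescale the singular values of the subsample
    U = X @ Vt.T @ np.diag(1.0 / s)              # U = X V S^{-1}, as in Eq. (41)
    return U, s, Vt.T                            # so that X ≈ U diag(s) V^T
```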

Randomized SVD

For handling a large matrix $X$, where truncated SVD may become computationally slow or unfeasible, the utilization of randomized SVD offers a viable alternative, as described in the works of Martinsson et al. (2010) and Halko, Martinsson & Tropp (2010), Halko et al. (2011). Randomized SVD has many uses in machine learning with datasets that are too large to be stored in memory (Halko et al., 2011). In comparison to classic SVD algorithms, randomized techniques for computing low-rank approximations are frequently faster and, unexpectedly, more robust (Halko, Martinsson & Tropp, 2010). Thus, for large datasets where truncated SVD is slow or impractical, the use of randomized SVD is suggested by Martinsson et al. (2010).

A critical decision is to determine when to use randomized SVD instead of truncated SVD, taking into account the number of rows, $n$, and the number of features, $d$. To address this issue, we propose to leverage the computational complexities of both approaches as a criterion for decision-making. The computational complexity of the truncated SVD is given below:

$O(nd^2).$

Similarly, the computational complexity of the randomized SVD is:

$O(nd r + n r^2 + r^2 d)$ where $r$ is the inherent rank of the matrix $X$. Taking the ratio of the truncated to randomized SVD complexities and comparing it with a threshold, we obtain the decision rule below (default $T$ is 0.90):

$\dfrac{nd^2}{nd\hat{r} + n\hat{r}^2 + \hat{r}^2 d} \geq 2T.$

However, $\hat{r}$ is unknown until the SVD is taken. So, as a practical solution, we estimate $\hat{r}$ using $d$ as shown below:

$\hat{r} = \min\!\left(\dfrac{d_{\text{limit}}}{d},\, d\right)$ where $d_{\text{limit}} = 500$ serves as the soft threshold beyond which truncated SVD may begin to exhibit slower performance as $d$ exceeds this value. If $n$ is larger than $10^5$, then SVD(∗) computes the SVD of $X$ using SVD-RR ("SVD with Row Reduction"). Otherwise, the SVD(∗) method takes $X$ as input and uses Eq. (44) to decide whether to compute the truncated SVD or the randomized SVD of the data matrix $X$.
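A rough sketch of this dispatch logic is given below; the rank estimate and the threshold follow our reading of the reconstructed rule above, and the function name is hypothetical.

```python
def choose_svd(n, d, d_limit=500, T=0.90, n_limit=100_000):
    """Pick an SVD variant from the matrix shape (illustrative dispatch only)."""
    if n > n_limit:
        return "svd-rr"                          # very tall X: SVD with row reduction
    r_hat = min(d_limit / d, d)                  # crude rank estimate used in the cost ratio
    cost_truncated = n * d ** 2                  # O(n d^2)
    cost_randomized = n * d * r_hat + n * r_hat ** 2 + r_hat ** 2 * d
    return "randomized" if cost_truncated / cost_randomized >= 2 * T else "truncated"
```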

Determining the rank

Finally, we developed Algorithm 3, which determines the rank ($r$) using the energy percentile ($\xi$) and the eigenvalues obtained by SVD(∗).

Algorithm 3 :
Determine rank using S.
1: procedure RANK($S$, $\xi$)
2:    $s \leftarrow \log(\mathrm{Diagonal}(S) + 1)$
3:    $c \leftarrow \mathrm{CMF}(s) \,/\, \sum_{i=1}^{d} s_i$    ▷ CMF: Cumulative Mass Function
4:    $r_{\xi} \leftarrow \arg\min_i \,(c_i > \xi \text{ and } s_i > 10^{-10})$
5:    $r \leftarrow \min(r_{\xi}, d)$
6:    return $r$
7: end procedure
DOI: 10.7717/peerj-cs.2579/table-103
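An illustrative NumPy version of this procedure is given below; it assumes the cumulative mass is compared on a 0-1 scale (i.e., against ξ/100), and the function name is ours.

```python
import numpy as np

def rank_from_spectrum(s_diag, xi=99.9999, floor=1e-10):
    """Energy-percentile rank selection on log-scaled eigenvalues (sketch)."""
    s = np.log(np.asarray(s_diag, dtype=float) + 1.0)    # damp overly large eigenvalues
    c = np.cumsum(s) / np.sum(s)                         # cumulative mass function of s
    ok = (c > xi / 100.0) & (s > floor)                  # assumes xi is given as a percentage
    r_xi = int(np.argmax(ok)) + 1 if ok.any() else len(s)
    return min(r_xi, len(s))
```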
1. L0 is not a norm but a pseudo-norm, since it does not satisfy all norm axioms.
2. BLAS is a collection of low-level routines that accelerate linear algebra computations. BLAS underpins the method's ability to perform computationally intensive tasks more efficiently.