Data Decomposition: From Independent Component Analysis to Sparse Representations

Abstract

Figure 1: Seeking structure in data. Component analysis can be viewed as a family of data representation methods. The challenging task is to find informative directions in data space. These correspond to the column vectors of the observation (transformation, or 'mixing') matrix and form a new coordinate system; their directions are non-orthogonal in general. (Left) Rotational invariance of the distribution of independent Gaussian random variables with equal variance. A scatterplot (point cloud) drawn from two such Gaussian sources illustrates the fact that there is not enough structure in the data to find characteristic directions in data space. Algebraically, we can only estimate the linear map up to an orthogonal transformation. (Center) Point cloud generated from a non-Gaussian distribution. (Right) The data cloud contains more structure in this case, which we want to exploit. In particular, the geometric shape of the point cloud of this figure is an example of a dataset that is sparse with respect to the coordinate axes shown by the two arrows.

Let X be an M × N rectangular data matrix, where each row is a data point and each column is a "feature", and assume without loss of generality that M ≥ N. The singular value decomposition (SVD) is a factorization of the matrix X,

\[ X = U \Sigma V^{\top} , \]

where U is an M × M orthogonal matrix, V is an N × N orthogonal matrix, and Σ is an M × N diagonal matrix whose non-negative diagonal entries, the singular values, are arranged in decreasing order, σ_1 ≥ σ_2 ≥ · · · ≥ σ_N, forming the spectrum of X.
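As a concrete illustration, the following minimal NumPy sketch (the matrix sizes and random data are arbitrary choices for the example) computes the SVD of a data matrix and verifies the ordering of the spectrum and the exactness of the factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 200, 5                                           # M data points, N features (M >= N)
X = rng.normal(size=(M, N)) @ rng.normal(size=(N, N))   # data with correlated features

# Thin SVD: X = U @ diag(sigma) @ Vt, singular values sorted in decreasing order.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

assert np.all(np.diff(sigma) <= 0)                      # sigma_1 >= ... >= sigma_N
assert np.allclose(X, U @ np.diag(sigma) @ Vt)          # exact reconstruction
```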

The singular value decomposition of X can also be written as

\[ X = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top} , \]

where u_i is the i-th eigenvector of XX^⊤ and v_i is the i-th eigenvector of X^⊤X, as above, and r ≤ N is the rank of X. In other words, a matrix X can be written as a linear superposition of its eigenimages, i.e. a sum of the outer products of its left and right singular vectors, u_i v_i^⊤, weighted by the square roots of the eigenvalues of X^⊤X (the singular values σ_i).

ICA is a family of data analysis methods that aims at decomposing datasets into maximally statistically independent components. In the noiseless setting, the observation model for linear ICA is

\[ x = A s , \]

where we have assumed that the observations have been de-meaned (i.e. we have translated the coordinate system to the data centroid). ICA employs the principle of redundancy reduction (Barlow [5]), embodied in the requirement of statistical independence among the components (Nadal and Parga [50]). In statistical language, this means that the joint density factorizes over the latent sources,

\[ P(s) = \prod_{l=1}^{L} P(s_l) , \]

where P(s) is the assumed distribution of the sources, s = (s_1, ..., s_L), regarded as latent random variables. Departures from independence can be quantified by the mutual information of the components, or, equivalently, by the "distance" between the distribution p(s) and the fully factorized one, ∏_l p(s_l), measured in terms of the Kullback–Leibler divergence, KL(p(s) ‖ ∏_l p(s_l)). This is defined as

\[ \mathrm{KL}\big( p(x) \,\|\, q(x) \big) = \mathrm{E}_{p(x)}\!\left[ \log \frac{p(x)}{q(x)} \right] . \]

This requirement of independence is what enables ICA algorithms to separate statistically independent sources, up to possible permutations and scalings of the components [16].

2 The Darmois–Skitovitch theorem reads: Theorem 1 (Darmois–Skitovitch). Let ξ_1, ..., ξ_n be independent random variables and let α_i and β_i, i = 1, ..., n, be nonzero real numbers such that the random variables Σ_{i=1}^{n} α_i ξ_i and Σ_{i=1}^{n} β_i ξ_i are independent. Then the ξ_i's are Gaussian. See, for example, V. Bogachev, 'Gaussian Measures' [9], p. 13.

3 For two stochastic variables X and Y to be independent, it is necessary and sufficient that their mutual information equals zero: I(X; Y) = H(X) + H(Y) − H(X, Y) = 0, where the quantity H(Z) is the 'differential' entropy of the random variable Z.

The mutual information ("redundancy") can be equivalently computed as

\[ I(s_1, \dots, s_L) = \sum_{l=1}^{L} H(s_l) - H(s_1, \dots, s_L) , \]

where the first term on the RHS is the sum of the entropies of the individual sources and the second is the joint entropy of (s_1, ..., s_L). As shown by Bell & Sejnowski (1995), independence can lead to separation because the method exploits higher-order statistics in the data, something that cannot be done with methods such as PCA.
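To make the observation model concrete, the following sketch generates independent heavy-tailed (Laplacian) sources, mixes them with a hypothetical matrix A, and uses the excess kurtosis as a crude non-Gaussianity check; all dimensions, distributions, and numbers are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
L, D, N = 2, 2, 10_000                # sources, sensors, samples (illustrative)

S = rng.laplace(size=(L, N))          # independent non-Gaussian sources
A = np.array([[1.0, 0.6],             # hypothetical mixing matrix
              [0.4, 1.0]])
X = A @ S                             # noiseless linear ICA model: x = A s
X -= X.mean(axis=1, keepdims=True)    # de-mean (centre) the observations

# Excess kurtosis as a crude non-Gaussianity measure: ~3 for Laplacian sources,
# smaller (but still positive) for their mixtures, 0 for a Gaussian.
kurt = lambda Z: np.mean(Z**4, axis=1) / np.mean(Z**2, axis=1) ** 2 - 3
print("source kurtosis :", kurt(S))
print("mixture kurtosis:", kurt(X))
```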

In practice, many ICA algorithms minimize a variety of 'proxy' functionals. Bell and Sejnowski's ICA approach uses the InfoMax principle (Linsker [46]), maximizing information transfer in a network of nonlinear units (Bell & Sejnowski [8]). Based on this, Bell and Sejnowski derive their very successful Infomax-ICA algorithm. The sources are estimated as

\[ \hat{s} = u = W x , \]

where W is the separating (unmixing) matrix, which is iteratively learned by the rule

\[ \Delta W \;\propto\; W^{-\top} + \varphi(u)\, x^{\top} , \]

where the vector-valued map φ(u) = (φ_1(u_1), ..., φ_L(u_L)) is an appropriate elementwise nonlinearity, related to the assumed source densities (see below). The key observation is that optimal information transfer, that is, maximum mutual information between inputs and outputs, or equivalently maximum entropy of the output, is obtained when the steeply sloping parts of the transfer function are aligned with the high-density parts of the probability density function of the inputs.
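A minimal batch-gradient sketch of this update, assuming super-Gaussian sources so that φ(u) = −tanh(u) (the logarithmic derivative of a 1/cosh density); the learning rate, iteration count, and initialization are illustrative assumptions:

```python
import numpy as np

def infomax_ica(X, n_iter=2000, lr=0.01, seed=2):
    """Infomax-ICA via gradient ascent on the log-likelihood.

    X : (D, N) array of centred observations (one column per sample).
    Returns the separating matrix W such that U = W @ X estimates the sources.
    """
    rng = np.random.default_rng(seed)
    D, _N = X.shape
    W = np.eye(D) + rng.normal(scale=0.1, size=(D, D))
    for _ in range(n_iter):
        U = W @ X                                   # current source estimates u = W x
        phi = -np.tanh(U)                           # phi(u) = d/du log p(u), p ~ 1/cosh
        grad = np.linalg.inv(W).T + phi @ X.T / _N  # W^{-T} + E[phi(u) x^T]
        W += lr * grad
    return W

# Usage (with X from the mixing example above): W = infomax_ica(X); U = W @ X
```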

Hyvärinen chooses to focus explicitly on non-Gaussianity and derives a fixed-point algorithm, dubbed FastICA [33]. Non-Gaussianity can be quantified using the negentropy,

\[ J(u) = H(u_{\mathrm{Gauss}}) - H(u) , \]

where u_Gauss is a Gaussian random variable with the same covariance as u. The FastICA algorithm maximizes an approximation of J using the estimate

\[ J(u_l) \;\approx\; \big( \mathrm{E}\{ G(u_l) \} - \mathrm{E}\{ G(u_{\mathrm{Gauss}}) \} \big)^{2} , \]

where G(·) is an appropriate nonlinearity, such as the non-quadratic function G(z) = z⁴, that is implicitly related to the source distributions (see below), u_Gauss is a standardized Gaussian r.v., and u_1, ..., u_l, ..., u_L also have zero mean and unit variance. The unknown sources, {u_l}_{l=1}^{L}, are again estimated using the projections u_l = w_l^⊤ x, where w_l is the l-th separating vector (column of W), found by the fixed-point iteration

\[ w \;\leftarrow\; \mathrm{E}\{ x\, g(w^{\top} x) \} - \mathrm{E}\{ g'(w^{\top} x) \}\, w , \]

where g(·) is the derivative of G(·), g′(·) is the derivative of g(·), and w is renormalized to unit length after each iteration.
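A sketch of the one-unit fixed-point iteration on whitened data, using the common choice g(y) = tanh(y), so that g′(y) = 1 − tanh²(y); the tolerance, iteration count, and initialization are illustrative assumptions:

```python
import numpy as np

def fastica_one_unit(Z, n_iter=200, tol=1e-8, seed=3):
    """One-unit FastICA on whitened data Z of shape (D, N).

    Returns a unit-norm separating vector w; the source estimate is w @ Z.
    """
    rng = np.random.default_rng(seed)
    D, _N = Z.shape
    w = rng.normal(size=D)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z                                          # projections w^T z
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w  # E[z g(w^T z)] - E[g'(w^T z)] w
        w_new /= np.linalg.norm(w_new)                     # renormalize to unit length
        if abs(abs(w_new @ w) - 1.0) < tol:                # converged (up to sign)
            return w_new
        w = w_new
    return w
```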
An important result in the theory of ICA, with practical value, is that the ICA decomposition can be written as the product of an "unfolding" matrix and a rotation matrix. The former is usually implemented by pre-whitening (pre-sphering) the observations, such that E[x̃ x̃^⊤] = I_D, where x̃ now denotes the whitened observations: x̃ = W_sph x.

W_sph can be computed from the eigendecomposition of the data covariance matrix, C_xx = E[x x^⊤] = U Λ U^⊤, where the matrix U is a unitary matrix containing the eigenvectors of C_xx and Λ = diag(λ_1, ..., λ_D) is the diagonal matrix of eigenvalues. Taking the "square root" of this decomposition and inverting, the sphering matrix can be written as

\[ W_{\mathrm{sph}} = \Lambda^{-\frac{1}{2}} U^{\top} . \]

That Λ^{−1/2} U^⊤ spheres the data can be seen by simply carrying out the expectation E[x̃ x̃^⊤], taking into account that U is an orthogonal matrix [29]. The above whitening operation transforms the original data vectors to the coordinate system of the eigenvectors and rescales the axes by the inverse square roots of the eigenvalues. Alternatively, one may use U Λ^{−1/2} U^⊤ for whitening, which maps the whitened data back to the original coordinates; this often makes further processing easier. In any case, since the whitening transformation removes any second-order statistics (correlations) in the data, learning the ICA matrix Ã for the whitened data is equivalent to learning a pure orthogonal rotation matrix:

\[ \tilde{x} = W_{\mathrm{sph}}\, A\, s = \tilde{A}\, s , \qquad \tilde{A} \tilde{A}^{\top} = I . \]
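Both whitening variants can be sketched in a few lines of NumPy; the small eps term is an assumption added for numerical safety:

```python
import numpy as np

def whiten(X, zca=False, eps=1e-10):
    """Whiten centred data X of shape (D, N) so that E[x x^T] = I.

    zca=False: W_sph = Lambda^{-1/2} U^T   (maps into the eigenvector basis).
    zca=True : W_sph = U Lambda^{-1/2} U^T (stays in the original coordinates).
    """
    C = X @ X.T / X.shape[1]                  # sample covariance matrix
    lam, U = np.linalg.eigh(C)                # eigendecomposition C = U diag(lam) U^T
    W_sph = np.diag(1.0 / np.sqrt(lam + eps)) @ U.T
    if zca:
        W_sph = U @ W_sph
    return W_sph @ X, W_sph

# Sanity check: the whitened covariance should be (close to) the identity.
# Z, W_sph = whiten(X); print(np.allclose(Z @ Z.T / Z.shape[1], np.eye(Z.shape[0]), atol=1e-6))
```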

Note that, until now, while we have used probabilistic concepts to define information-theoretic quantities such as the negentropy and the mutual information, we have taken the view that the solution of the blind source separation problem can be obtained essentially algebraically, by inverting a linear latent-variable model,

\[ x = A s + \varepsilon , \]

where an L-dimensional vector of latent variables, s, is linearly related to a D-dimensional vector of observations via the observation operator A, and observation noise, ε, may in general be added to the observations. In other words, the observed data are 'explained' by the unobserved latent variables, while the mismatch between the observations and the model predictions, x − Aŝ, is explained by the additive noise. The fundamental equation of ICA, which we write again below, can thus be seen as a modelling assumption, i.e. a working hypothesis, as a factorization of the observations into a mixing operator and latent sources,

\[ x = A s + \varepsilon , \]

assuming spherical Gaussian noise, ε ∼ N(0, σ² I_L), for example, in order to reconstruct the sources from the inputs at their most probable value.

As shown by MacKay [47] and Pearlmutter and Parra [54], Infomax-ICA can be interpreted as a maximum likelihood model. Assuming square mixing (i.e. as many latent dimensions as observations, L = D) and invertibility of the mixing matrix, the separating matrix is W = A^{−1}. We can then immediately write down the probability of the data as

\[ p(x) = p(s)\, \lvert \det J \rvert , \]

where J is the Jacobian matrix of the transformation, with J_{li} = ∂s_l/∂x_i. Under the linear model, and using the fundamental assumption of ICA of mutual independence of the latent variables, p(s) = ∏_{l=1}^{L} p(s_l), we have

\[ p(x) = \lvert \det W \rvert \prod_{l=1}^{L} p(s_l) . \]

The log-likelihood of an i.i.d. data set, X = {x_n}_{n=1}^{N}, under the model can then be written as

\[ \mathcal{L}(\theta) = \sum_{n=1}^{N} \Big( \log \lvert \det W \rvert + \sum_{l=1}^{L} \log p(u_{l,n}) \Big) , \]

where we have substituted s_{l,n} with u_{l,n} = w_l^⊤ x_n = Σ_{i=1}^{D} w_{l,i} x_{i,n}. The parameter vector, θ, here contains the matrix A, or equivalently the unmixing one, W = A^{−1}, since these are uniquely related in this case.

We can now derive a maximum likelihood algorithm for ICA via gradient ascent on L(θ), in order to learn the separating matrix, W. Taking the derivative of L(θ) with respect to W and using well-known derivative rules, we finally find the learning rule

\[ \Delta W \;\propto\; W^{-\top} + z\, x^{\top} , \]

where we have used the shorthand notation z_l = φ_l(u_l), with the ICA nonlinearity φ_l given by the logarithmic derivative of the assumed source density, φ_l(u_l) = ∂ log p(u_l)/∂u_l. This is exactly the Infomax-ICA rule quoted earlier.
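As a quick check of the rule above, the two standard derivative identities involved can be sketched as follows (per data point, dropping the sample index):

\[
\frac{\partial}{\partial W} \log \lvert \det W \rvert = W^{-\top},
\qquad
\frac{\partial}{\partial w_{li}} \log p(u_l)
  = \frac{\partial \log p(u_l)}{\partial u_l}\,\frac{\partial u_l}{\partial w_{li}}
  = \varphi_l(u_l)\, x_i ,
\]
\[
\text{so that}\qquad
\Delta W \;\propto\; \frac{\partial \mathcal{L}}{\partial W}
  = W^{-\top} + z\, x^{\top},
\qquad z_l = \varphi_l(u_l) .
\]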

The standard FastICA algorithm, too, admits a statistical interpretation: Lappalainen showed that it can be obtained as an approximation to the EM algorithm for the noisy linear model, with the E-step acting as a 'denoising' of the source estimates. Here, the term 'denoising' is interpreted as filtering out irrelevant information. It is worth going through the main steps of the derivation.

The EM Algorithm. The general idea of the EM algorithm is to estimate the latent variables, Y, and model parameters, θ, of a probabilistic model (which in this case are the sources, S, and mixing matrix, A, of the BSS problem, respectively), in two alternating steps. The 'E' (expectation) step computes the expectation of the complete-data log-likelihood with respect to the posterior distribution p(Y | X, θ^(τ)), using the current (τ-th) estimate of the parameters,

\[ Q\big(\theta \,\big|\, \theta^{(\tau)}\big) = \mathrm{E}_{p(Y \mid X,\, \theta^{(\tau)})}\big[ \log p(X, Y \mid \theta) \big] ; \]

this is a function of θ only. (Recall that X is observed and θ^(τ) is temporarily fixed at its current point estimate.) The 'M' (maximization) step then computes the model parameters that maximize the expected log-likelihood,

\[ \theta^{(\tau+1)} = \arg\max_{\theta} \; Q\big(\theta \,\big|\, \theta^{(\tau)}\big) . \]

This scheme is iterated until the algorithm converges. It can be shown that the EM algorithm is guaranteed not to decrease the observed-data likelihood at each iteration [20].

Applying the above generic EM recipe, we can compute the maximum likelihood estimate of the mixing matrix of our ICA model as

\[ \hat{A} = X\, \mathrm{E}[S]^{\top} \big( \mathrm{E}[S S^{\top}] \big)^{-1} , \]

where the expected sufficient statistics⁵ of the sources, E[S] and E[SS^⊤], are computed with respect to their posterior⁶.

5 A sufficient statistic is a function of the data that carries all the information the sample provides about the parameters of a statistical model. Typically, the sufficient statistic is a simple function of the data, e.g. the sum of all the data points, the sum of squares of the data points, etc.
6 These are relationships that will become useful later in the paper as well.
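A minimal sketch of this M-step in NumPy, assuming the posterior moments E[s_n] and E[s_n s_n^⊤] have already been computed in the E-step (the array names and shapes are illustrative assumptions):

```python
import numpy as np

def m_step_mixing_matrix(X, Es, Ess):
    """M-step for the mixing matrix A in the linear model x = A s + noise.

    X   : (D, N) observations.
    Es  : (L, N) posterior means E[s_n | x_n] from the E-step.
    Ess : (N, L, L) posterior second moments E[s_n s_n^T | x_n].
    """
    # A_hat = (sum_n x_n E[s_n]^T) (sum_n E[s_n s_n^T])^{-1}
    A_hat = (X @ Es.T) @ np.linalg.inv(Ess.sum(axis=0))
    return A_hat
```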
In the low sensor-noise (σ² → 0) and square-mixing case of FastICA, Lappalainen approximates the posterior mean of the sources by a small correction to s_0, where s_0 := A^{−1} x and the correction involves the function φ(·), defined as before as the vector of the logarithmic derivatives of p_l(s_l). For prewhitened data this expression simplifies even further, since A is then orthogonal and therefore A^{−1} = A^⊤.

Now Lappalainen makes the crucial observation that, while the EM algorithm has not yet converged to the optimal values, the sources s_0 can be written as a "mixture",

\[ s_0 = \alpha\, s + \beta\, s_G , \]

where the "noise" s_G is mostly due to the other sources not having been perfectly unmixed. When far from the optimal solution, we have β ≈ 1 and α ≈ 0. Using an argument based on the central limit theorem, valid as the number of the other sources becomes large, he then approximates the mixing matrix corresponding to those other sources, â, by replacing those sources with Gaussian-distributed "sources", X_G, with the same covariance as X, as is done in the standard FastICA algorithm; the corresponding projections are s_0G := a^⊤ X_G. The update equation for the mixing matrix, normalized to unity, is then estimated from expectations over the observed data and over the Gaussian surrogate data X_G. Lappalainen interprets the above E-step as filtering Gaussian noise.

The final step that will bring us to the standard FastICA is to note that …

… of Gaussians and perform model order estimation using a variety of techniques. We can now select an appropriate functional form for the individual marginal …

… in separating sources in an image processing context is given in Fig. 4 (from Tonazzini et al. [60]).
… where the symbol W denotes the separating operator from observation space to the latent (source) space.

… selective to structure at different spatial scales (Fig. 7).

Formally, the model of Olshausen and Field is described by the energy function

\[ E(a; \Phi) = \sum_{x} \Big( I(x) - \sum_{i} a_i\, \phi_i(x) \Big)^{2} + \lambda \sum_{i} S(a_i) , \]

minimized with respect to the coefficients a_i (and, on a slower timescale, with respect to the dictionary elements φ_i(x)). Gradient descent on E with respect to the a_i yields the coefficient dynamics

\[ \dot{a}_i \;\propto\; b_i - \sum_{j} C_{ij}\, a_j - \lambda\, S'(a_i) , \qquad b_i = \sum_{x} \phi_i(x)\, I(x) , \quad C_{ij} = \sum_{x} \phi_i(x)\, \phi_j(x) , \]
where the derivative of the sparsity activation function, S(·), induces non-linear self-inhibition, and the multiplier λ ≥ 0 is a regularization parameter. This enforces sparsity, as it drives activities towards zero. The regularization term is balanced against the first, data-fidelity term, which ensures accurate reconstruction. During the 'analysis' ("filtering") phase, a given image, I(x), is decomposed over a dictionary, Φ, and its corresponding coefficients, a_i, are computed. During the 'synthesis' phase, the learned dictionary produces an estimate of the image, Î(x), with residuals r(x) = I(x) − Î(x). The optimal value of each a_i is determined from the corresponding equilibrium solution of the dynamics above.
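A minimal sketch of the analysis phase for a single patch, assuming a given dictionary Φ and using the smooth sparsity penalty S(a) = log(1 + a²) (a common smooth choice); the step size, λ, and iteration count are illustrative assumptions:

```python
import numpy as np

def sparse_coefficients(I, Phi, lam=0.1, lr=0.01, n_iter=500):
    """Infer sparse coefficients a for one vectorized image patch I (analysis phase).

    Gradient descent on E(a) = ||I - Phi a||^2 + lam * sum(S(a_i)),
    with S(a) = log(1 + a^2), so that S'(a) = 2a / (1 + a^2).
    I : (P,) vectorized patch;  Phi : (P, K) dictionary of K basis functions.
    """
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        r = I - Phi @ a                                        # residual (synthesis error)
        grad = -2.0 * Phi.T @ r + lam * 2.0 * a / (1.0 + a**2)  # S'(a): nonlinear self-inhibition
        a -= lr * grad
    return a

# Usage: a = sparse_coefficients(patch, Phi); I_hat = Phi @ a  # synthesis phase
```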

(Figure panels: original image; image estimate, x̂.)

In the two-dimensional case (L = 2), sparse data mapped into the latent space produce a highly peaked, heavy-tailed distribution along both axes (Fig. 8, lower right). This is indeed a result of the sparseness property of the dataset: the two 'arms' of the sparse data cloud are tightly packed around the directions of the unmixing vectors, a_l. Algebraically, this means that for a particular point, n, either the coefficient s_{1,n} (l = 1) or the coefficient s_{2,n} (l = 2) is almost zero, as the particular datum is well described by the a_2 or the a_1 "regressor", respectively. On the contrary, non-sparse data will typically produce a projection that corresponds to a "fat" empirical histogram, as shown in Fig. 8 (upper right).

Figure 8: We seek a linear transformation to a latent space, (u, v), such that it optimizes some suitable criterion (shown in the right part of the figure). Sparse data mapped into the latent space produce heavy-tailed distributions for both latent dimensions (lower right), while for non-sparse data this is not the case (upper right).
With respect to the soft clustering view of component analysis (Miskin [36]), discussed in the Introduction of the paper, if the data vectors are sufficiently sparse, their images on the unit hypersphere S^{D−1}, i.e. the intersections of their position vectors (rays) with the unit hypersphere, obtained through the projection operator P : u ↦ û = u/‖u‖, which maps vectors along their radii, concentrate in well-separated clusters around the directions of the mixing vectors.

Uniqueness. Importantly, Li et al. [45] also show that the above problem has a unique solution. While in general there is an infinite number of solutions of the underdetermined system of equations

\[ x = A s , \]

where the D × L matrix A (the observation operator), with L > D, maps the unknown signal s into the observed signal x, the sparsity constraint makes this particular linear inverse problem well posed. A geometric interpretation of why ℓ_1-type sparsity regularization works well for signal recovery under sparsity constraints is shown in Fig. 10. We want to find the optimal s as the vector of minimum ℓ_1 norm that satisfies the constraint x = As, i.e. the point at which the ℓ_1 ball just touches the constraint hyperplane without the hyperplane intersecting the ball's interior. More generally, the problem can be stated (in the deterministic framework) as

\[ \min_{s} \; \lVert s \rVert_{1} \quad \text{subject to} \quad x = A s . \]

Figure 10: Why ℓ_1 works: a geometric intuition into sparse priors. We seek the sparsest vector x ∈ ℝ^N, under the ℓ_1 norm in this case, that satisfies the linear constraint y = Φx, where Φ is a dictionary. The ℓ_1 penalty corresponds geometrically to a cross-polytope (the 'ℓ_1 ball' in ℝ^N) and the linear constraint to a hyperplane. The shape of the polytope dictates the form of the solution. The optimal vector, x̂, is the one that touches the hyperplane without the latter intersecting the cross-polytope. Mathematically, this is the solution to the problem x̂ = argmin_{y=Φx} ‖x‖_1. As can be seen from the figure, the inclusion of the ℓ_1 norm drives all components of x but one towards zero, leading to sparse solutions.
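A minimal numerical sketch of this ℓ_1 recovery problem (basis pursuit), recast as a linear program and solved with SciPy; the dimensions, sparsity level, and random data are illustrative assumptions, not values from the text:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Underdetermined system: D observations, L > D unknown coefficients, k-sparse signal.
D, L, k = 10, 30, 3
A = rng.normal(size=(D, L))                                    # observation (mixing) matrix
s_true = np.zeros(L)
s_true[rng.choice(L, k, replace=False)] = rng.normal(size=k)   # sparse ground truth
x = A @ s_true                                                 # noiseless observations

# Basis pursuit:  min ||s||_1  subject to  A s = x,
# rewritten as an LP with s = p - q, p >= 0, q >= 0.
c = np.ones(2 * L)                          # objective: sum(p) + sum(q) = ||s||_1
A_eq = np.hstack([A, -A])                   # equality constraint [A, -A] [p; q] = x
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * L), method="highs")
s_hat = res.x[:L] - res.x[L:]

print("recovery error:", np.linalg.norm(s_hat - s_true))
```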
The probability of a single data point is obtained by integrating out the unknown signals, s:

\[ p(x \mid A) = \int p(x \mid A, s)\, p(s)\, ds . \]

In order to derive a tractable algorithm, they make a Laplace approximation to the data likelihood, assuming that the posterior is Gaussian around the posterior mode, ŝ. This involves computing the Hessian of the log-posterior at the mode,

\[ H = \nabla_{s} \nabla_{s} \log p(x \mid A, s)\, p(s) \,\big|_{s = \hat{s}} . \]

To obtain a smooth approximation of the derivative of the log-prior, and a diagonal approximation to the Hessian, they then take p(s_l) ≈ cosh^{−θ/β}(β s_l), which asymptotically approximates the Laplacian prior for β → ∞. Moreover, a low noise level is assumed. The above approximations finally lead to a gradient learning rule of the same functional form as the maximum likelihood (Infomax) rule derived earlier, where, again, z_l = ∂ log p(s_l)/∂s_l.

Girolami [26] proposes a variational method for learning sparse representations. In particular, his method offers a solution to the problem of analytically integrating the data likelihood for a range of heavy-tailed distributions. Starting from the heavy-tailed distribution p(s) ∝ cosh^{−1/β}(βs), he derives a variational approximation to the Laplacian prior by introducing a variational parameter, ξ = (ξ_1, ..., ξ_L), such that the prior p(s) = ∏_{l=1}^{L} exp(−|s_l|) is replaced by a Gaussian bound p(s; ξ), with s | ξ ∼ N(s; 0, Λ) and Λ = diag(|ξ_l|). Then p(s) is recovered as the supremum of p(s; ξ) over the variational parameter ξ.
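For reference, the generic form of the Laplace approximation used in this kind of calculation (a standard result, stated here with the notation H for the Hessian of the log-integrand at the mode ŝ, as above) is:

\[
p(x \mid A)
 = \int p(x \mid A, s)\, p(s)\, ds
 \;\approx\;
 p(x \mid A, \hat{s})\, p(\hat{s})\,
 (2\pi)^{L/2}\, \lvert -H \rvert^{-1/2} ,
\qquad
H = \nabla_{s} \nabla_{s} \log p(x \mid A, s)\, p(s) \,\big|_{s = \hat{s}} .
\]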

The problem of sparsely representing a data matrix described above is …

We also define the conditional probability, which can be thought of as "a probability within a probability",

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} , \qquad P(B) \neq 0 . \]
Then random variables (r.v.'s) are defined as functions from Ω to a range, R, e.g. a subset of ℝ or ℕ, etc. These can, inversely, define events as

\[ [\![\, \varphi \,]\!] := \{ \omega \in \Omega : \varphi(X(\omega)) \ \text{holds} \} , \]

where ⟦·⟧ denotes a "predicate" (e.g. the event 'x > 2', i.e. {ω : X(ω) > 2}), and therefore random variables act as "filters" of certain experimental outcomes.

Probability densities are defined as densities of probability measures:

\[ p_X(x)\, dx = P\big( \{ \omega : x \le X(\omega) \le x + dx \} \big) . \]

Finally, joint densities (e.g. for the case of two random variables X, Y) are defined as

\[ p_{XY}(x, y) = p\big( \{ \omega : X(\omega) = x \,\wedge\, Y(\omega) = y \} \big) . \]
Joint densities of more than two r.v.'s are defined analogously.

A.2 Three Simple Rules
Probability theory is a mathematically elegant theory. The whole construction can be based on the following three simple rules:

1. The product rule,

\[ P(A \cap B) = P(A \mid B)\, P(B) . \]

This can be generalized to N events, giving the chain rule

\[ P(A_1 \cap A_2 \cap \dots \cap A_N) = \prod_{i=1}^{N} P\big( A_i \,\big|\, A_1, \dots, A_{i-1} \big) . \]

This will be valuable for reasoning in Bayesian networks later.

2. Bayes' rule, which is a recipe that tells us how to update our knowledge in the presence of new information, and can be derived directly from the definition of conditional probability and the product rule,

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} , \qquad P(B) \neq 0 . \]
3. Marginalization: given a joint density, p_{XY}(x, y), we obtain the marginal density of X (or Y) by integration (i.e. we 'integrate out' the uncertainty in the other variable):

\[ p_X(x) = \int p_{XY}(x, y)\, dy . \]
In principle, this is everything we need to know in order to perform probabilistic modelling and inference.
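A tiny numerical illustration of the three rules on a made-up 2 × 2 joint table (the numbers are arbitrary):

```python
import numpy as np

# Joint distribution P(X, Y) over X in {0, 1} (rows) and Y in {0, 1} (columns).
P_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])

# Rule 3 (marginalization): sum out the other variable.
P_x = P_xy.sum(axis=1)          # P(X)
P_y = P_xy.sum(axis=0)          # P(Y)

# Rule 1 (product rule): P(X, Y) = P(X | Y) P(Y).
P_x_given_y = P_xy / P_y        # conditional P(X | Y); each column sums to 1
assert np.allclose(P_x_given_y * P_y, P_xy)

# Rule 2 (Bayes' rule): P(Y | X) = P(X | Y) P(Y) / P(X).
P_y_given_x = (P_x_given_y * P_y) / P_x[:, None]
assert np.allclose(P_y_given_x.sum(axis=1), 1.0)
```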