SparsePCA#

class SparsePCA(n_modes: int = 2, alpha: float = 0.001, beta: float = 0.001, robust: bool = False, regularizer: str = 'l1', max_iter: int = 500, tol: float = 1e-06, oversample: int = 10, n_subspace: int = 1, n_blocks: int = 1, center: bool = True, standardize: bool = False, use_coslat: bool = False, sample_name: str = 'sample', feature_name: str = 'feature', check_nans=True, compute: bool = True, random_state: int | None = None, solver: str = 'auto', solver_kwargs: dict = {}, **kwargs)#

Sparse PCA via Variable Projection.

Given a data matrix, Sparse PCA via Variable Projection [1] computes a set of sparse components that can optimally reconstruct the input data. The amount of sparsity is controlled by the L1 penalty coefficient, specified by the parameter alpha. Additionally, ridge shrinkage can be applied to improve conditioning.

This implementation uses randomized methods for linear algebra to accelerate computations. The quality of the approximation can be controlled via the oversampling parameter oversample and n_subspace, which specifies the number of subspace iterations.

Variable Projection for Sparse PCA solves the following optimization problem:

minimize \(\frac{1}{2} \| X - X B A^T \|^2 + \alpha \|B\|_1 + \frac{1}{2} \beta \|B\|^2\)

Parameters:
  • n_modes (int, default=2) – Number of modes to calculate.

  • alpha (float, default=0.001) – Sparsity controlling parameter. Higher values lead to sparser components.

  • beta (float, default=0.001) – Amount of ridge shrinkage to apply in order to improve conditioning.

  • regularizer ({'l0', 'l1'}, default='l1') – Type of sparsity-inducing regularizer. The L1 norm (also known as LASSO) leads to a soft-threshold operator (default). The L0 norm is implemented via a hard-threshold operator.

  • max_iter (int, default=500) – Maximum number of iterations to perform before exiting.

  • tol (float, default=1e-5) – Stopping tolerance for reconstruction error.

  • robust (bool, default=False) – Use a robust algorithm to compute the sparse PCA.

  • oversample (int, default=10) – Controls the oversampling of column space. Increasing this parameter may improve numerical accuracy.

  • n_subspace (int, default=2) – Parameter to control the number of subspace iterations. Increasing this parameter may improve numerical accuracy.

  • n_blocks (int, default=2) – Parameter to control how many blocks of columns the input matrix should be split into. A larger number requires less fast memory, but increases computational time.

  • standardize (bool, default=False) – Whether to standardize the input data, i.e., each feature will have a variance of 1.

  • use_coslat (bool, default=False) – Whether to use cosine of latitude for scaling.

  • sample_name (str, default="sample") – Name of the sample dimension.

  • feature_name (str, default="feature") – Name of the feature dimension.

  • compute (bool, default=True) – Whether to compute elements of the model eagerly, or to defer computation. If True, the following steps will be computed sequentially: 1) the preprocessor scaler, 2) optional NaN checks, 3) SVD decomposition, 4) scores and components.

  • random_state (int, optional) – Seed for the random number generator.

  • solver ({"auto", "full", "randomized"}, default="randomized") – Solver to use for the SVD computation.

  • solver_kwargs (dict, default={}) – Additional keyword arguments to be passed to the SVD solver.

References

Notes

This implementation is adapted from the code provided by the authors of the paper [1], which is part of the Ristretto library (erichson/ristretto).

Examples

>>> model = xe.single.SparsePCA(n_modes=2, alpha=1e-4)
>>> model.fit(data, "time")
>>> components = model.components()
__init__(n_modes: int = 2, alpha: float = 0.001, beta: float = 0.001, robust: bool = False, regularizer: str = 'l1', max_iter: int = 500, tol: float = 1e-06, oversample: int = 10, n_subspace: int = 1, n_blocks: int = 1, center: bool = True, standardize: bool = False, use_coslat: bool = False, sample_name: str = 'sample', feature_name: str = 'feature', check_nans=True, compute: bool = True, random_state: int | None = None, solver: str = 'auto', solver_kwargs: dict = {}, **kwargs)#

Methods

__init__([n_modes, alpha, beta, robust, ...])

components()

Return the sparse components.

compute(**kwargs)

Compute and load delayed model results.

deserialize(dt)

Deserialize the model and its preprocessors from a DataTree.

explained_variance()

Return the explained variance.

explained_variance_ratio()

Return explained variance ratio.

fit(X, dim[, weights])

Fit the model to the input data.

fit_transform(data, dim[, weights])

Fit the model to the input data and project the data onto the components.

get_params()

Get the model parameters.

get_serialization_attrs()

Get the attributes to serialize.

inverse_transform(scores[, normalized])

Reconstruct the original data from transformed data.

load(path[, engine])

Load a saved model.

save(path[, overwrite, save_data, engine])

Save the model.

scores([normalized])

Return the component scores.

serialize()

Serialize a complete model with its preprocessor.

transform(data[, normalized])

Project data onto the components.

components() DataArray | Dataset | list[DataArray | Dataset]#

Return the sparse components.

The components are given by the sparse weight matrix B in the optimization problem.

Returns:

components – Components of the fitted model.

Return type:

DataArray | Dataset | list[DataArray]

compute(**kwargs)#

Compute and load delayed model results.

Parameters:

**kwargs – Additional keyword arguments to pass to dask.compute().

classmethod deserialize(dt: DataTree) Self#

Deserialize the model and its preprocessors from a DataTree.

explained_variance() DataArray#

Return the explained variance.

Returns:

explained_variance – Explained variance.

Return type:

DataArray

explained_variance_ratio() DataArray#

Return explained variance ratio.

The explained variance ratio \(\gamma_i\) is the variance explained by each mode normalized by the total variance. It is defined as

\[\gamma_i = \frac{\lambda_i}{\sum_{j=1}^M \lambda_j}\]

where \(\lambda_i\) is the explained variance of the \(i\)-th mode and \(M\) is the total number of modes.

Returns:

explained_variance_ratio – Explained variance ratio.

Return type:

DataArray

fit(X: DataArray | Dataset | list[DataArray | Dataset], dim: Sequence[Hashable] | Hashable, weights: DataArray | Dataset | list[DataArray | Dataset] | None = None) Self#

Fit the model to the input data.

Parameters:
  • X (DataObject) – Input data.

  • dim (Sequence[Hashable] | Hashable) – Specify the sample dimensions. The remaining dimensions will be treated as feature dimensions.

  • weights (DataObject | None, default=None) – Weighting factors for the input data.

fit_transform(data: DataArray | Dataset | list[DataArray | Dataset], dim: Sequence[Hashable] | Hashable, weights: DataArray | Dataset | list[DataArray | Dataset] | None = None, **kwargs) DataArray#

Fit the model to the input data and project the data onto the components.

Parameters:
  • data (DataObject) – Input data.

  • dim (Sequence[Hashable] | Hashable) – Specify the sample dimensions. The remaining dimensions will be treated as feature dimensions.

  • weights (DataObject | None, default=None) – Weighting factors for the input data.

  • **kwargs – Additional keyword arguments to pass to the transform method.

Returns:

projections – Projections of the data onto the components.

Return type:

DataArray

get_params() dict[str, Any]#

Get the model parameters.

get_serialization_attrs() dict#

Get the attributes to serialize.

inverse_transform(scores: DataArray, normalized: bool = False) DataArray | Dataset | list[DataArray | Dataset]#

Reconstruct the original data from transformed data.

Parameters:
  • scores (DataArray) – Transformed data to be reconstructed. This could be a subset of the scores data of a fitted model, or unseen data. Must have a ‘mode’ dimension.

  • normalized (bool, default=False) – Whether the scores data have been normalized by the L2 norm.

Returns:

data – Reconstructed data.

Return type:

DataObject

classmethod load(path: str, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs) Self#

Load a saved model.

Parameters:
  • path (str) – Path to the saved model.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for reading the saved model.

  • **kwargs – Additional keyword arguments to pass to open_datatree().

Returns:

model – The loaded model.

Return type:

BaseModel

save(path: str, overwrite: bool = False, save_data: bool = False, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs)#

Save the model.

Parameters:
  • path (str) – Path to save the model.

  • overwrite (bool, default=False) – Whether or not to overwrite the existing path if it already exists. Ignored unless engine=”zarr”.

  • save_data (str) – Whether or not to save the full input data along with the fitted components.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for writing the saved model.

  • **kwargs – Additional keyword arguments to pass to DataTree.to_netcdf() or DataTree.to_zarr().

scores(normalized: bool = False) DataArray#

Return the component scores.

The component scores \(U\) are defined as the projection of the fitted data \(X\) onto the sparse weight components \(B\).

\[U = X B\]
Parameters:

normalized (bool, default=False) – Whether to normalize the scores by the L2 norm.

Returns:

components – Scores of the fitted model.

Return type:

DataArray | Dataset | list[DataArray]

serialize() DataTree#

Serialize a complete model with its preprocessor.

transform(data: DataArray | Dataset | list[DataArray | Dataset], normalized=False) DataArray#

Project data onto the components.

Parameters:
  • data (DataObject) – Data to be transformed.

  • normalized (bool, default=False) – Whether to normalize the scores by the L2 norm.

Returns:

projections – Projections of the data onto the components.

Return type:

DataArray