xeofs.models.GWPCA#

class xeofs.models.GWPCA(n_modes: int, bandwidth: float, metric: str = 'haversine', kernel: str = 'bisquare', center: bool = True, standardize: bool = False, use_coslat: bool = False, check_nans: bool = True, sample_name: str = 'sample', feature_name: str = 'feature')#

Bases: _BaseModel

Geographically weighted PCA.

Geographically weighted PCA (GWPCA) [1] uses a geographically weighted approach to perform PCA for each observation in the dataset based on its local neighbors.

The neighbors for each observation are determined based on the provided bandwidth and metric. Each neighbor is weighted based on its distance from the observation using the provided kernel function.
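
To make the weighting scheme concrete, the sketch below shows the standard bisquare kernel as it is commonly defined in the geographically weighted modelling literature. It is an illustration of the idea only; the exact form used internally by xeofs is not documented on this page.

    import numpy as np

    def bisquare_weight(dist: np.ndarray, bandwidth: float) -> np.ndarray:
        """Standard bisquare kernel: w = (1 - (d/b)**2)**2 for d < b, else 0.

        Illustrative only; not necessarily the library's exact implementation.
        """
        scaled = dist / bandwidth
        weights = (1.0 - scaled**2) ** 2
        return np.where(dist < bandwidth, weights, 0.0)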

Parameters:
  • n_modes (int) – Number of modes to calculate.

  • bandwidth (float) – Bandwidth of the kernel function. Must be > 0.

  • metric (str, default="haversine") – Distance metric to use. Great circle distance (haversine) is always expressed in kilometers. All other distance metrics are reported in the unit of the input data. See scipy.spatial.distance.cdist for a list of available metrics.

  • kernel (str, default="bisquare") – Kernel function to use. Must be one of [‘bisquare’, ‘gaussian’, ‘exponential’].

  • center (bool, default=True) – If True, the data is centered by subtracting the mean (feature-wise).

  • standardize (bool, default=False) – If True, the data is divided by the standard deviation (feature-wise).

  • use_coslat (bool, default=False) – If True, the data is weighted by the square root of the cosine of latitude.

  • sample_name (str, default="sample") – Name of the sample dimension.

  • feature_name (str, default="feature") – Name of the feature dimension.
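
A minimal usage sketch follows. The synthetic data and all names ("station", "variable", "lat", "lon") are assumptions made for illustration, not requirements stated on this page.

    import numpy as np
    import xarray as xr
    import xeofs as xe

    rng = np.random.default_rng(42)
    n_stations, n_vars = 200, 10

    # Hypothetical samples: 200 stations with lat/lon coordinates and 10 variables each.
    # Whether specific coordinate names are required for the haversine metric is not
    # stated here; "lat"/"lon" are assumed.
    data = xr.DataArray(
        rng.standard_normal((n_stations, n_vars)),
        dims=("station", "variable"),
        coords={
            "lon": ("station", rng.uniform(-10.0, 40.0, n_stations)),
            "lat": ("station", rng.uniform(35.0, 70.0, n_stations)),
        },
    )

    # Local PCA around each station, weighting neighbours within ~1000 km
    # with a bisquare kernel (haversine distances are in kilometres).
    model = xe.models.GWPCA(
        n_modes=3, bandwidth=1000.0, metric="haversine", kernel="bisquare"
    )
    model.fit(data, dim="station")

    components = model.components()
    expvar_ratio = model.explained_variance_ratio()
    leading = model.largest_locally_weighted_components()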

bandwidth#

Bandwidth of the kernel function.

Type:

float

metric#

Distance metric to use.

Type:

str

kernel#

Kernel function to use.

Type:

str

Methods#

fit(X)#

Fit the model with input data.

explained_variance#

Return the explained variance of the local components.

explained_variance_ratio#

Return the explained variance ratio of the local components.

largest_locally_weighted_components#

Return the largest locally weighted components.

Notes

GWPCA is computationally expensive since it performs a separate PCA for each sample. This implementation leverages numba to speed up the computation on CPUs, but for moderate to large datasets this may still not be sufficient. GPU support is currently not implemented. If your dataset is too large to be processed on a CPU, consider using the R package GWmodel [2], which provides a GPU implementation of GWPCA.

References

__init__(n_modes: int, bandwidth: float, metric: str = 'haversine', kernel: str = 'bisquare', center: bool = True, standardize: bool = False, use_coslat: bool = False, check_nans: bool = True, sample_name: str = 'sample', feature_name: str = 'feature')#

Methods

__init__(n_modes, bandwidth[, metric, ...])

components() – Get the components.

compute([verbose]) – Compute and load delayed model results.

deserialize(dt) – Deserialize the model and its preprocessors from a DataTree.

explained_variance()

explained_variance_ratio()

fit(X, dim[, weights]) – Fit the model to the input data.

fit_transform(data, dim[, weights]) – Fit the model to the input data and project the data onto the components.

get_params() – Get the model parameters.

get_serialization_attrs()

inverse_transform(scores[, normalized]) – Reconstruct the original data from transformed data.

largest_locally_weighted_components()

load(path[, engine]) – Load a saved model.

save(path[, overwrite, save_data, engine]) – Save the model.

scores() – Get the scores.

serialize() – Serialize a complete model with its preprocessor.

transform(data[, normalized]) – Project data onto the components.

components() → DataArray | Dataset | List[DataArray | Dataset]#

Get the components.

compute(verbose: bool = False, **kwargs)#

Compute and load delayed model results.

Parameters:
  • verbose (bool) – Whether or not to provide additional information about the computing progress.

  • **kwargs – Additional keyword arguments to pass to dask.compute().
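
A short usage sketch, assuming the model was fitted on dask-backed inputs so that its results are still lazy:

    # Trigger computation of any delayed results held by the model and
    # report progress; extra keyword arguments are forwarded to dask.compute().
    model.compute(verbose=True)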

classmethod deserialize(dt: DataTree) → Self#

Deserialize the model and its preprocessors from a DataTree.

fit(X: List[DataArray | Dataset] | DataArray | Dataset, dim: Sequence[Hashable] | Hashable, weights: List[DataArray | Dataset] | DataArray | Dataset | None = None) → Self#

Fit the model to the input data.

Parameters:
  • X (DataArray | Dataset | List[DataArray]) – Input data.

  • dim (Sequence[Hashable] | Hashable) – Specify the sample dimensions. The remaining dimensions will be treated as feature dimensions.

  • weights (Optional[DataArray | Dataset | List[DataArray]]) – Weighting factors for the input data.
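
A short sketch of the call pattern, reusing the hypothetical names from the class-level example above:

    # "station" is the sample dimension; all remaining dimensions
    # ("variable" here) are treated as feature dimensions.
    model.fit(data, dim="station")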

fit_transform(data: List[DataArray | Dataset] | DataArray | Dataset, dim: Sequence[Hashable] | Hashable, weights: List[DataArray | Dataset] | DataArray | Dataset | None = None, **kwargs) → DataArray#

Fit the model to the input data and project the data onto the components.

Parameters:
  • data (DataObject) – Input data.

  • dim (Sequence[Hashable] | Hashable) – Specify the sample dimensions. The remaining dimensions will be treated as feature dimensions.

  • weights (Optional[DataObject]) – Weighting factors for the input data.

  • **kwargs – Additional keyword arguments to pass to the transform method.

Returns:

projections – Projections of the data onto the components.

Return type:

DataArray
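
A hedged one-line sketch following the documented signature; whether GWPCA overrides the inherited transform step is not stated on this page:

    # Fit and project in one step; returns the projections as a DataArray.
    scores = model.fit_transform(data, dim="station")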

get_params() → Dict[str, Any]#

Get the model parameters.

inverse_transform(scores: DataArray, normalized: bool = True) → DataArray | Dataset | List[DataArray | Dataset]#

Reconstruct the original data from transformed data.

Parameters:
  • scores (DataArray) – Transformed data to be reconstructed. This could be a subset of the scores data of a fitted model, or unseen data. Must have a ‘mode’ dimension.

  • normalized (bool, default=True) – Whether the scores data have been normalized by the L2 norm.

Returns:

data – Reconstructed data.

Return type:

DataArray | Dataset | List[DataArray]
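
A hedged sketch of reconstructing data from a subset of modes, assuming the scores carry a 'mode' dimension with labels starting at 1:

    # Keep only the first two modes; the retained "mode" dimension is required.
    partial_scores = scores.sel(mode=[1, 2])
    reconstruction = model.inverse_transform(partial_scores, normalized=True)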

classmethod load(path: str, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs) → Self#

Load a saved model.

Parameters:
  • path (str) – Path to the saved model.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for reading the saved model.

  • **kwargs – Additional keyword arguments to pass to open_datatree().

Returns:

model – The loaded model.

Return type:

_BaseModel

save(path: str, overwrite: bool = False, save_data: bool = False, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs)#

Save the model.

Parameters:
  • path (str) – Path to save the model.

  • overwrite (bool, default=False) – Whether or not to overwrite the existing path if it already exists. Ignored unless engine="zarr".

  • save_data (bool, default=False) – Whether or not to save the full input data along with the fitted components.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for writing the saved model.

  • **kwargs – Additional keyword arguments to pass to DataTree.to_netcdf() or DataTree.to_zarr().
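
A round-trip sketch; the file name is hypothetical and zarr is the default engine:

    # Persist the fitted model, then restore it in a later session.
    model.save("gwpca_model.zarr", overwrite=True, engine="zarr")
    restored = xe.models.GWPCA.load("gwpca_model.zarr", engine="zarr")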

scores()#

Get the scores.

Parameters:

normalized (bool, default=True) – Whether to normalize the scores by the L2 norm.

serialize() → DataTree#

Serialize a complete model with its preprocessor.
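
An in-memory counterpart to save/load, sketched under the same assumptions:

    # Serialize to an xarray DataTree and rebuild the model from it.
    dt = model.serialize()
    restored = xe.models.GWPCA.deserialize(dt)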

transform(data: List[DataArray | Dataset] | DataArray | Dataset, normalized=True) → DataArray#

Project data onto the components.

Parameters:
  • data (DataArray | Dataset | List[DataArray]) – Data to be transformed.

  • normalized (bool, default=True) – Whether to normalize the scores by the L2 norm.

Returns:

projections – Projections of the data onto the components.

Return type:

DataArray
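
A hedged sketch of projecting unseen data, assuming new_data is a hypothetical DataArray that shares the feature structure of the training data:

    # Project new observations onto the previously fitted components.
    new_scores = model.transform(new_data, normalized=True)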