xeofs.models.MCA#

class xeofs.models.MCA(n_modes: int = 2, center: bool = True, standardize: bool = False, use_coslat: bool = False, check_nans: bool = True, n_pca_modes: int | None = None, compute: bool = True, sample_name: str = 'sample', feature_name: str = 'feature', solver: str = 'auto', random_state: int | None = None, solver_kwargs: Dict = {}, **kwargs)#

Bases: _BaseCrossModel

Maximum Covariance Analysis.

MCA is a statistical method that finds patterns of maximum covariance between two datasets.

Parameters:
  • n_modes (int, default=2) – Number of modes to calculate.

  • center (bool, default=True) – Whether to center the input data.

  • standardize (bool, default=False) – Whether to standardize the input data.

  • use_coslat (bool, default=False) – Whether to use cosine of latitude for scaling.

  • n_pca_modes (int, default=None) – The number of principal components to retain during the PCA preprocessing step applied to both data sets prior to executing MCA. If set to None, PCA preprocessing will be bypassed, and the MCA will be performed on the original datasets. Specifying an integer value greater than 0 for n_pca_modes will trigger the PCA preprocessing, retaining only the specified number of principal components. This reduction in dimensionality can be especially beneficial when dealing with high-dimensional data, where computing the cross-covariance matrix can become computationally intensive or in scenarios where multicollinearity is a concern.

  • compute (bool, default=True) – Whether to compute elements of the model eagerly, or to defer computation. If True, four pieces of the fit will be computed sequentially: 1) the preprocessor scaler, 2) optional NaN checks, 3) SVD decomposition, 4) scores and components.

  • sample_name (str, default="sample") – Name of the new sample dimension.

  • feature_name (str, default="feature") – Name of the new feature dimension.

  • solver ({"auto", "full", "randomized"}, default="auto") – Solver to use for the SVD computation.

  • random_state (int, default=None) – Seed for the random number generator.

  • solver_kwargs (dict, default={}) – Additional keyword arguments passed to the SVD solver function.
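The dimensionality reduction described for n_pca_modes can be sketched in plain NumPy. This is only an illustration of the idea, not the preprocessing code used by the library; the helper name pca_reduce, the array shapes, and the random data are assumptions made for the example.

```python
import numpy as np

def pca_reduce(data, n_pca_modes):
    """Project centered data onto its leading principal axes (hypothetical helper)."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_pca_modes].T            # (n_features, n_pca_modes)
    return centered @ axes, axes         # PC scores and the axes used

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))       # 50 samples, 200 features
Y = rng.standard_normal((50, 300))       # 50 samples, 300 features

X_pc, X_axes = pca_reduce(X, n_pca_modes=10)
Y_pc, Y_axes = pca_reduce(Y, n_pca_modes=10)

# The cross-covariance matrix in PC space is 10 x 10 instead of 200 x 300.
C_reduced = X_pc.T @ Y_pc / (X.shape[0] - 1)
```

The SVD that follows then operates on the small matrix C_reduced, which is where the computational saving comes from.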

Notes

MCA is similar to Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA), but while PCA finds modes of maximum variance and CCA finds modes of maximum correlation, MCA finds modes of maximum covariance. See [1] [2] for more details.
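As a rough illustration of "modes of maximum covariance", the sketch below runs MCA by hand with NumPy: it takes the SVD of the cross-covariance matrix of two centered fields and checks that the covariance between paired scores equals the singular values. The synthetic data and variable names are assumptions, and the library may scale or normalize its outputs differently.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100
X = rng.standard_normal((n_samples, 6))   # left field: 6 features
Y = rng.standard_normal((n_samples, 4))   # right field: 4 features
Y[:, 0] += 2.0 * X[:, 0]                  # inject a shared signal

# Center both fields along the sample dimension.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Cross-covariance matrix and its singular value decomposition.
C = Xc.T @ Yc / (n_samples - 1)           # shape (6, 4)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Scores: projections of each field onto its own singular vectors.
scores_x = Xc @ U                         # shape (100, 4)
scores_y = Yc @ Vt.T                      # shape (100, 4)

# Covariance between paired scores of each mode equals the singular value.
mode_cov = np.sum(scores_x * scores_y, axis=0) / (n_samples - 1)
```

Here the first singular value dominates because only one coupled signal was injected.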

References

Examples

>>> from xeofs.models import MCA
>>> model = MCA(n_modes=5, standardize=True)
>>> model.fit(data1, data2, dim="time")  # "time" is an assumed sample dimension name
__init__(n_modes: int = 2, center: bool = True, standardize: bool = False, use_coslat: bool = False, check_nans: bool = True, n_pca_modes: int | None = None, compute: bool = True, sample_name: str = 'sample', feature_name: str = 'feature', solver: str = 'auto', random_state: int | None = None, solver_kwargs: Dict = {}, **kwargs)#

Methods

__init__([n_modes, center, standardize, ...])

components()

Return the singular vectors of the left and right field.

compute([verbose])

Compute and load delayed model results.

covariance_fraction()

Get the covariance fraction (CF).

deserialize(dt)

Deserialize the model and its preprocessors from a DataTree.

fit(data1, data2, dim[, weights1, weights2])

Fit the model to the data.

get_params()

Get the model parameters.

get_serialization_attrs()

heterogeneous_patterns([correction, alpha])

Return the heterogeneous patterns of the left and right field.

homogeneous_patterns([correction, alpha])

Return the homogeneous patterns of the left and right field.

inverse_transform(scores1, scores2)

Reconstruct the original data from transformed data.

load(path[, engine])

Load a saved model.

save(path[, overwrite, save_data, engine])

Save the model.

scores()

Return the scores of the left and right field.

serialize()

Serialize a complete model with its preprocessors.

singular_values()

Get the singular values of the cross-covariance matrix.

squared_covariance()

Get the squared covariance.

squared_covariance_fraction()

Calculate the squared covariance fraction (SCF).

total_covariance()

Get the total covariance.

transform([data1, data2])

Get the expansion coefficients of "unseen" data.

components()#

Return the singular vectors of the left and right field.

Returns:

  • components1 (DataArray | Dataset | List[DataArray]) – Left components of the fitted model.

  • components2 (DataArray | Dataset | List[DataArray]) – Right components of the fitted model.

compute(verbose: bool = False, **kwargs)#

Compute and load delayed model results.

Parameters:
  • verbose (bool) – Whether or not to provide additional information about the computing progress.

  • **kwargs – Additional keyword arguments to pass to dask.compute().

covariance_fraction()#

Get the covariance fraction (CF).

Cheng and Dunkerton (1995) define the CF as follows:

\[CF_i = \frac{\sigma_i}{\sum_{j=1}^{m} \sigma_j}\]

where m is the total number of modes and \(\sigma_i\) is the ith singular value of the covariance matrix.

In this implementation the sum of singular values is estimated from the first n modes, therefore one should aim to retain as many modes as possible to get a good estimate of the covariance fraction.

Note

It is important to differentiate the CF from the squared covariance fraction (SCF). While the SCF is an invariant quantity in MCA, the CF is not. Therefore, the SCF is used to assess the relative importance of each mode. Cheng and Dunkerton (1995) introduced the CF in the context of Varimax-rotated MCA to compare the relative importance of each mode before and after rotation. In the special case of both data fields in MCA being identical, the CF is equivalent to the explained variance ratio in EOF analysis.
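A small worked example may help to see why the CF depends on the number of retained modes. The singular values below are made up for illustration, and the helper name covariance_fraction is hypothetical.

```python
def covariance_fraction(singular_values):
    """CF_i = sigma_i / sum_j sigma_j over the modes that are available."""
    total = sum(singular_values)
    return [s / total for s in singular_values]

# Assumed "true" spectrum with four modes.
sigma_all = [4.0, 3.0, 2.0, 1.0]

# Using all four modes: [0.4, 0.3, 0.2, 0.1]
cf_all = covariance_fraction(sigma_all)

# Retaining only the first two modes: the denominator shrinks from 10 to 7,
# so the leading fractions are overestimated (roughly 0.571 and 0.429).
cf_truncated = covariance_fraction(sigma_all[:2])
```

The fewer modes are retained, the more the denominator is underestimated, which is why retaining many modes gives a better estimate.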

classmethod deserialize(dt: DataTree) Self#

Deserialize the model and its preprocessors from a DataTree.

fit(data1: DataArray | Dataset | List[DataArray | Dataset], data2: DataArray | Dataset | List[DataArray | Dataset], dim: Hashable | Sequence[Hashable], weights1: List[DataArray | Dataset] | DataArray | Dataset | None = None, weights2: List[DataArray | Dataset] | DataArray | Dataset | None = None) Self#

Fit the model to the data.

Parameters:
  • data1 (DataArray | Dataset | List[DataArray]) – Left input data.

  • data2 (DataArray | Dataset | List[DataArray]) – Right input data.

  • dim (Hashable | Sequence[Hashable]) – Define the sample dimensions. The remaining dimensions will be treated as feature dimensions.

  • weights1 (Optional[DataObject]) – Weights to be applied to the left input data.

  • weights2 (Optional[DataObject]) – Weights to be applied to the right input data.

get_params() Dict#

Get the model parameters.

heterogeneous_patterns(correction=None, alpha=0.05)#

Return the heterogeneous patterns of the left and right field.

The heterogeneous patterns are the correlation coefficients between the input data and the scores of the other field.

More precisely, the heterogeneous patterns r_{het} are defined as

\[r_{het, x} = corr \left(X, A_y \right)\]
\[r_{het, y} = corr \left(Y, A_x \right)\]

where \(X\) and \(Y\) are the input data, \(A_x\) and \(A_y\) are the scores of the left and right field, respectively.

Parameters:
  • correction (str, default=None) – Method to apply a multiple testing correction. If None, no correction is applied. Available methods are:

      - bonferroni : one-step correction
      - sidak : one-step correction
      - holm-sidak : step down method using Sidak adjustments
      - holm : step-down method using Bonferroni adjustments
      - simes-hochberg : step-up method (independent)
      - hommel : closed method based on Simes tests (non-negative)
      - fdr_bh : Benjamini/Hochberg (non-negative) (default)
      - fdr_by : Benjamini/Yekutieli (negative)
      - fdr_tsbh : two stage fdr correction (non-negative)
      - fdr_tsbky : two stage fdr correction (non-negative)

  • alpha (float, default=0.05) – The desired family-wise error rate. Not used if correction is None.
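The definition of the heterogeneous patterns can be sketched with NumPy as a column-wise Pearson correlation between one field and the scores of the other field. The helper column_correlation, the synthetic data, and the unnormalized scores are assumptions for this illustration; p-values and corrections are left out.

```python
import numpy as np

def column_correlation(data, scores):
    """Pearson correlation of every data column with every score column."""
    d = (data - data.mean(axis=0)) / data.std(axis=0)
    a = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return d.T @ a / data.shape[0]        # (n_features, n_modes)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
Y = rng.standard_normal((200, 3))
Y[:, 0] += X[:, 0]                        # couple the two fields

# Scores from the SVD of the cross-covariance matrix.
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc.T @ Yc / (len(X) - 1), full_matrices=False)
A_x, A_y = Xc @ U, Yc @ Vt.T

r_het_x = column_correlation(X, A_y)      # corr(X, A_y)
r_het_y = column_correlation(Y, A_x)      # corr(Y, A_x)
```

Each column of r_het_x is one mode: it shows how strongly every left-field feature co-varies with the right-field score of that mode.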

homogeneous_patterns(correction=None, alpha=0.05)#

Return the homogeneous patterns of the left and right field.

The homogeneous patterns are the correlation coefficients between the input data and the scores.

More precisely, the homogeneous patterns r_{hom} are defined as

\[r_{hom, x} = corr \left(X, A_x \right)\]
\[r_{hom, y} = corr \left(Y, A_y \right)\]

where \(X\) and \(Y\) are the input data, \(A_x\) and \(A_y\) are the scores of the left and right field, respectively.

Parameters:
  • correction (str, default=None) – Method to apply a multiple testing correction. If None, no correction is applied. Available methods are:

      - bonferroni : one-step correction
      - sidak : one-step correction
      - holm-sidak : step down method using Sidak adjustments
      - holm : step-down method using Bonferroni adjustments
      - simes-hochberg : step-up method (independent)
      - hommel : closed method based on Simes tests (non-negative)
      - fdr_bh : Benjamini/Hochberg (non-negative) (default)
      - fdr_by : Benjamini/Yekutieli (negative)
      - fdr_tsbh : two stage fdr correction (non-negative)
      - fdr_tsbky : two stage fdr correction (non-negative)

  • alpha (float, default=0.05) – The desired family-wise error rate. Not used if correction is None.

Returns:

  • patterns1 (DataArray | Dataset | List[DataArray]) – Left homogeneous patterns.

  • patterns2 (DataArray | Dataset | List[DataArray]) – Right homogeneous patterns.

  • pvals1 (DataArray | Dataset | List[DataArray]) – Left p-values.

  • pvals2 (DataArray | Dataset | List[DataArray]) – Right p-values.
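To give a feel for what a multiple testing correction does to the returned p-values, here is a pure-Python sketch of two of the listed methods, bonferroni and holm. It is not the routine the library calls internally; the function names and the example p-values are assumptions.

```python
def bonferroni(pvals, alpha=0.05):
    """One-step correction: multiply every p-value by the number of tests."""
    m = len(pvals)
    adjusted = [min(1.0, p * m) for p in pvals]
    return [p_adj <= alpha for p_adj in adjusted], adjusted

def holm(pvals, alpha=0.05):
    """Step-down correction using Bonferroni adjustments."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return [p_adj <= alpha for p_adj in adjusted], adjusted

# Assumed p-values for four grid points.
pvals = [0.01, 0.04, 0.02, 0.005]
reject_bonf, adj_bonf = bonferroni(pvals)
reject_holm, adj_holm = holm(pvals)
```

With these values Bonferroni keeps two of the four tests significant at alpha=0.05, while Holm keeps all four, which reflects that the step-down method is never more conservative than the one-step correction.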

inverse_transform(scores1: DataArray, scores2: DataArray) Tuple[DataArray | Dataset | List[DataArray | Dataset], DataArray | Dataset | List[DataArray | Dataset]]#

Reconstruct the original data from transformed data.

Parameters:
  • scores1 (DataObject) – Transformed left field data to be reconstructed. This could be a subset of the scores data of a fitted model, or unseen data. Must have a ‘mode’ dimension.

  • scores2 (DataObject) – Transformed right field data to be reconstructed. This could be a subset of the scores data of a fitted model, or unseen data. Must have a ‘mode’ dimension.

Returns:

  • Xrec1 (DataArray | Dataset | List[DataArray]) – Reconstructed data of left field.

  • Xrec2 (DataArray | Dataset | List[DataArray]) – Reconstructed data of right field.
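The reconstruction step can be illustrated with NumPy: scores are multiplied by the transposed singular vectors and the mean is added back. This is a sketch of the linear algebra only, assuming unnormalized scores and simple centering; the helper reconstruct and the data are hypothetical, and the library additionally undoes its own preprocessing.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((80, 4))          # left field: 4 features
Y = rng.standard_normal((80, 6))          # right field: 6 features
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc.T @ Yc / (len(X) - 1), full_matrices=False)
A_x, A_y = Xc @ U, Yc @ Vt.T              # scores, 4 modes each

def reconstruct(scores, vectors, mean, modes):
    """Rebuild a field from a subset of modes and add the mean back."""
    return scores[:, modes] @ vectors[:, modes].T + mean

# All 4 modes span the full 4-dimensional left field: exact reconstruction.
X_rec_full = reconstruct(A_x, U, X.mean(axis=0), slice(None))
# A single mode gives only a low-rank approximation.
X_rec_one = reconstruct(A_x, U, X.mean(axis=0), slice(0, 1))
# The right field has 6 features but only 4 modes: reconstruction is approximate.
Y_rec_full = reconstruct(A_y, Vt.T, Y.mean(axis=0), slice(None))
```

Selecting a subset along the mode axis before reconstructing is the NumPy analogue of passing a subset of the scores with a 'mode' dimension.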

classmethod load(path: str, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs) Self#

Load a saved model.

Parameters:
  • path (str) – Path to the saved model.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for reading the saved model.

  • **kwargs – Additional keyword arguments to pass to open_datatree().

Returns:

model – The loaded model.

Return type:

_BaseCrossModel

save(path: str, overwrite: bool = False, save_data: bool = False, engine: Literal['zarr', 'netcdf4', 'h5netcdf'] = 'zarr', **kwargs)#

Save the model.

Parameters:
  • path (str) – Path to save the model.

  • overwrite (bool, default=False) – Whether or not to overwrite the existing path if it already exists. Ignored unless engine="zarr".

  • save_data (bool, default=False) – Whether or not to save the full input data along with the fitted components.

  • engine ({"zarr", "netcdf4", "h5netcdf"}, default="zarr") – Xarray backend engine to use for writing the saved model.

  • **kwargs – Additional keyword arguments to pass to DataTree.to_netcdf() or DataTree.to_zarr().

scores()#

Return the scores of the left and right field.

The scores in MCA are the projections of the left and right fields onto the left and right singular vectors of the cross-covariance matrix, respectively.

Returns:

  • scores1 (DataArray) – Left scores.

  • scores2 (DataArray) – Right scores.

serialize() DataTree#

Serialize a complete model with its preprocessors.

singular_values()#

Get the singular values of the cross-covariance matrix.

squared_covariance()#

Get the squared covariance.

The squared covariance corresponds to the explained variance in PCA and is given by the squared singular values of the covariance matrix.

squared_covariance_fraction()#

Calculate the squared covariance fraction (SCF).

The SCF is a measure of the proportion of the total squared covariance that is explained by each mode i. It is computed as follows:

\[SCF_i = \frac{\sigma_i^2}{\sum_{j=1}^{m} \sigma_j^2}\]

where m is the total number of modes and \(\sigma_i\) is the ith singular value of the covariance matrix.
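The formula can be checked by hand on a small, made-up spectrum. The helper name and the singular values below are assumptions for illustration only.

```python
def squared_covariance_fraction(singular_values):
    """SCF_i = sigma_i^2 / sum_j sigma_j^2."""
    squared = [s ** 2 for s in singular_values]
    total = sum(squared)
    return [sq / total for sq in squared]

# Assumed singular values of the cross-covariance matrix.
sigma = [4.0, 3.0, 2.0, 1.0]

# Squared values are 16, 9, 4, 1 with a total of 30,
# so the fractions are 16/30, 9/30, 4/30 and 1/30.
scf = squared_covariance_fraction(sigma)
```

Because the singular values are squared, the leading mode carries a larger share (16/30) than it would under a plain ratio of singular values (4/10).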

total_covariance() DataArray#

Get the total covariance.

This measure follows the definition of Cheng and Dunkerton (1995). Note that this measure is not an invariant in MCA.

transform(data1: List[DataArray | Dataset] | DataArray | Dataset | None = None, data2: List[DataArray | Dataset] | DataArray | Dataset | None = None) Sequence[DataArray]#

Get the expansion coefficients of “unseen” data.

The expansion coefficients are obtained by projecting data onto the singular vectors.

Parameters:
  • data1 (DataArray | Dataset | List[DataArray]) – Left input data. Must be provided if data2 is not provided.

  • data2 (DataArray | Dataset | List[DataArray]) – Right input data. Must be provided if data1 is not provided.

Returns:

  • scores1 (DataArray | Dataset | List[DataArray]) – Left scores.

  • scores2 (DataArray | Dataset | List[DataArray]) – Right scores.
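Projecting "unseen" data can be sketched in NumPy as follows: the singular vectors and the means come from the training data, and new samples are centered with the training means before projection. The data, names, and the lack of score normalization are assumptions of this illustration and may differ from what the library does.

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.standard_normal((120, 5))
Y_train = rng.standard_normal((120, 4))
x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)

# "Fit": singular vectors of the training cross-covariance matrix.
C = (X_train - x_mean).T @ (Y_train - y_mean) / (len(X_train) - 1)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# "Transform": center new samples with the training means, then project.
X_new = rng.standard_normal((10, 5))
Y_new = rng.standard_normal((10, 4))
scores1_new = (X_new - x_mean) @ U        # left scores, shape (10, 4)
scores2_new = (Y_new - y_mean) @ Vt.T     # right scores, shape (10, 4)
```

Each field is projected independently, which is why either data1 or data2 alone is sufficient input.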