Key Features#
There are numerous advantages of using xeofs
, including:
Multi-Dimensional: Designed for
xarray
objects, it applies dimensionality reduction to multi-dimensional data while maintaining data labels.Dask-Integrated: Supports large datasets via
Dask
xarray objectsExtensive Methods: Offers various dimensionality reduction techniques
Adaptable Output: Provides output corresponding to the type of input, whether single or list of
xr.DataArray
orxr.Dataset
Missing Values: Handles
NaN
values within the dataBootstrapping: Comes with a user-friendly interface for model evaluation using bootstrapping
Efficient: Ensures computational efficiency, particularly with large datasets through randomized SVD
Modular: Allows users to implement and incorporate new dimensionality reduction methods
Labeled & Multi-Dimensional#
xeofs
is specifically designed for xarray
objects. This design choice enables you
to apply dimensionality reduction techniques to multi-dimensional data, while still maintaining
data labels. The only requirement for the user is to specify the sample dimensions. By
default, xeofs
assumes any other dimensions present in the data as feature dimensions.
For instance,
model.fit(data, dim="time")
will assume that the data
object has a sample dimension named time
, while all other
dimensions are feature dimensions to be expressed as latent variables.
Note
In this context, sample refers to the dimension representing the number of observations. In contrast, feature dimensions denote the variables. The main goal of dimensionality reduction techniques is to minimize the number of feature dimensions without altering the sample dimensions.
Note
Multiple sample dimensions can be specified as long as at least one feature dimension remains.
Dask Support#
If you handle large datasets that exceed memory capacity, xeofs
is designed to work with dask
-backed
xarray
objects from end-to-end. By default, xeofs
computes models eagerly, which in some
cases can lead to better performance. However, it is also possible to build and fit models “lazily”, meaning
no computation will be carried out until the user calls .compute()
. To enable lazy computation, specify
compute=False
when initializing the model.
Note
Importantly, xeofs
never loads the input dataset(s) into memory.
Lazy Evaluation#
There are a few tricks, and features that need to be explicitly disabled for lazy evaluation to work. First
is the check_nans
option, which skips checking for full or isolated NaNs
in the data. In this case,
the user is responsible for ensuring that the data is free of NaNs
by first applying e.g. .dropna()
or .fillna()
. Second is that lazy mode is incompatible with assessing the fit of a rotator class during
evaluation, becaue the entire dask
task graph must be built up front. Therefore, a lazy rotator model will
run out to the full max_iter
regardless of the specified rtol
. For that reason it is recommended to
reduce the number of iterations.
As an example, the following lazily creates a rotated EOF model for an 80GB dataset in about a second, which can
then be evaluated later using .compute()
.
import dask.array as da
import numpy as np
import xarray as xr
from xeofs.models import EOF, EOFRotator
data = xr.DataArray(
da.random.random((10000, 1440, 720), chunks=(100, 100, 100)),
dims=["time", "lon", "lat"],
coords={
"time": xr.date_range("2000-01-01", periods=10000, freq="D"),
"lon": np.linspace(-180, 180, 1440),
"lat": np.linspace(-90, 90, 720),
},
)
model = EOF(n_modes=5, check_nans=False, compute=False)
model.fit(data, dim="time")
rotator = EOFRotator(compute=False, max_iter=20)
rotator.fit(model)
# Later, you can compute the model
# rotator.compute()
Available Methods#
As a multifaceted toolbox, xeofs
aims to host a variety of dimensionality reduction
and related decomposition techniques.
Refer to the API section to discover the methods currently available.
Note
Please note that xeofs
is in its developmental phase. If there’s a specific method
you’d like to see included, we encourage you to open an issue on GitHub.
Model Serialization#
xeofs
models offer convenient save()
and load()
methods for serializing
fitted models to a portable format.
from xeofs.models import EOF
model = EOF()
model.fit(data, dim="time")
model.save("my_model.zarr")
# Later, you can load the model
loaded_model = EOF.load("my_model.zarr")
Input Data Compatibility#
xeofs
is tailored to function harmoniously with xarray objects. Currently, it supports:
Single instances of
xr.DataArray
orxr.Dataset
Lists comprising
xr.DataArray
orxr.Dataset
An intelligent feature of xeofs
is its ability to deliver the appropriate output based on
the input type. For instance, executing PCA on a singular xr.DataArray
will yield a single
xr.DataArray
for the PC components. Conversely, if a list of xr.DataArray
is inputted,
xeofs
will return a list of xr.DataArray
as PC components.
Warning
A mixed list containing both xr.DataArray
and xr.Dataset
objects is not currently supported.
Note
Some methods like PCA/EOF analysis can also handle complex-valued input data. However, due to a limitation in the
underlying dask
library, complex-valued data is not supported when using dask
.
Handling Missing Values#
Conventional SVD algorithms aren’t typically configured to manage missing values. To address this,
xeofs
will take of missing values (NaN
) within your data. There are two primary types of missing values:
Full-dimensional:
NaNs
spanning all samples for a specific feature or vice versa.Isolated: Occasional or sporadic
NaNs
within the dataset.
Consider a 3D dataset with dimensions (time, lon, lat). A full-dimensional NaN
might represent a
grid point (lon, lat) exhibiting NaNs
across all time steps. Conversely, an isolated
NaN
might indicate a grid point (lon, lat) displaying NaNs
for only certain time steps.
xeofs
is adept at handling full-dimensional NaNs
. However, it cannot manage isolated NaNs
,
which requires the user to make a decision about how to fill or remove features or samples containing isolated
NaNs
. xeofs
does provide an optional runtime check which will raise an error if isolated NaNs
are
detected, which is enabled by default.
Model Evaluation#
xeofs
is dedicated to providing a user-friendly interface for model evaluations using bootstrapping.
Currently, only bootstrapping for PCA/EOF analysis is supported
(for a practical example, see 3 | Validation).
Computationally Efficient#
Regardless of whether you’re dealing with in-memory or out-of-memory data, xeofs
ensures computational efficiency.
This is achieved using randomized SVD which tends to be faster for large matrices than a full SVD. For more details
you can refer to the sklearn documentation on PCA.
A comparative analysis demonstrates the performance of xeofs
against eofs
on a standard laptop using a 3D dataset with time, longitude, and latitude
dimensions. Results indicate that xeofs
computes moderate datasets
(10,000 samples by 100,000 features) in under a minute. While eofs
is
faster for smaller datasets, xeofs
excels with larger datasets, offering
significant speed advantages. The dashed line marks data sets with about 3 MiB;
xeofs
outpaces eofs
above this size, whereas eofs
is preferable for smaller data sets.
Note
You can find the script to run the performance tests here.
Implement Your Own Model#
The xeofs
package has been designed with modularity in mind, allowing you to seamlessly incorporate new methods.
For instance, if you’d like to introduce a new dimensionality reduction technique named MyModel
,
you can achieve this by inheriting the _BaseModel
class and implementing its _fit_algorithm()
method.
Here’s a detailed walkthrough on how to incorporate a new model:
1. Inherit the BaseModel#
Your new model should inherit from the _BaseModel class. This abstract base class enables
the transformation of any ND xarray
object into a 2D xarray.DataArray
with dimensions
(sample, feature) and back. Additionally, it grants access to handy parameters like
n_modes
, standardize
, and use_coslat
.
from xeofs.models._base_model import _BaseModel
from xeofs.models.decomposer import Decomposer
class MyModel(_BaseModel):
def __init__(self, **kwargs):
super().__init__(**kwargs)
2. Define the Fit Algorithm#
Your chosen method’s entire operation should be encapsulated within the
_fit_algorithm()
. This function is triggered by fit()
and handles the model fitting.
By this stage, xeofs
has already processed essential preprocessing steps, ranging from
centering and weighting to stacking and handling NaN
values.
Here’s a basic PCA example to illustrate the process:
def _fit_algorithm(self, data):
# NOTE: The `data` here is a 2D xarray.DataArray with dimensions (sample, feature).
# We'll illustrate with a simplified PCA.
# The goal is to perform an SVD on the `data` matrix.
decomposer = Decomposer(n_modes=self.n_modes)
decomposer.fit(data)
# Extract the necessary components from the decomposer.
scores = decomposer.U_
components = decomposer.V_
singular_values = decomposer.s_
# Store the data for later access using the internal DataContainer class.
self.data.add(name="my_singular_values", data=singular_values)
self.data.add(name="my_components", data=components)
self.data.add(name="my_scores", data=scores)
# (Optional) Attach model parameters as attributes to your data.
self.data.set_attrs(self.attrs)
3. Access the Results#
After fitting your model, results can be retrieved by creating a method for each data
piece. The internal Preprocessor
class can assist with this task,
ensuring that the retrieved data retains the correct format.
Depending on their dimensions, data types are categorized into four groups:
(sample, feature, …)
(sample, …)
(feature, …)
(…)
The Preprocessor class offers methods corresponding to the first three data groups:
inverse_transform_data
for (sample, feature, …)inverse_transform_scores
for (sample, …)inverse_transform_components
for (feature, …)
For group (4), data can be accessed directly since there’s no need for back transformation.
def my_singular_values(self):
return self.data.get("my_singular_values")
def my_components(self):
return self.preprocessor.inverse_transform_components(
self.data.get("my_components")
)
def my_scores(self):
return self.preprocessor.inverse_transform_scores(self.data.get("my_scores"))
4. Optional: Implement Transform and Inverse Transform Methods#
While it’s required to implement the transform
and inverse_transform
methods for a complete model,
we’ll merely indicate their absence for this example.
def _transform_algorithm(self, data):
raise NotImplementedError("This model does not support transform.")
def _inverse_transform_algorithm(self, scores):
raise NotImplementedError("This model does not support inverse transform.")
5. Execute the Model#
With all parts in place, you can now initialize and use the new model:
model = MyModel(n_modes=3)
model.fit(t2m, dim="time")
model.my_components()
Comparison With Other Packages#
xeofs
stands among a suite of Python packages dedicated to dimensionality reduction.
Its development has been influenced by other notable packages, each boasting unique and robust features.
For instance, eofs, crafted by Andrew Dawson, is renowned for its compatibility with Dask
and xarray
,
offering an intuitive EOF analysis interface with a 1D sample dimension. xMCA is another cherished
tool, presenting an interface for Maximum Covariance Analysis in xarray
. In contrast, pyEOF is
tailored for Varimax-rotated EOF analysis but is restricted to 2D (pandas
) input data. While all these
tools are useful in their specific realms, they possess limitations. xeofs
aspires to present a more general
toolkit for dimensionality reduction techniques.
xeofs |
eofs |
pyEOF |
xMCA |
|
---|---|---|---|---|
xarray Interface |
✅ |
✅ |
❌ |
✅ |
Dask Support |
✅ |
✅ |
❌ |
❌ |
Multi-Dimensional |
✅ |
Only 1D sample dim |
2D input only |
Only 1D sample dim |
Missing Values |
✅ |
✅ |
❌ |
✅ |
Support for |
✅ |
❌ |
❌ |
❌ |
Algorithm1 |
Randomized SVD |
Full SVD |
Randomized SVD |
Full SVD |
Extensible Code Structure |
✅ |
❌ |
❌ |
❌ |
Validation |
||||
Bootstrapping |
✅ |
❌ |
❌ |
❌ |
1Note on the algorithm: The computational burden of a full SVD decomposition for an m x n matrix is O(min(mn², m²n)). However, the randomized SVD, which identifies only the initial k singular values, notably curtails this complexity to O(m n log(k)), making the randomized SVD, as utilized by xeofs
, more suitable for expansive datasets. For an in-depth exploration, refer to the `sklearn docs on PCA`_.