Quickstart

We begin with a straightforward example: PCA (EOF analysis) of a 3D xarray.Dataset.

Import the packages

We start by importing the xarray and xeofs packages.

[1]:

import xarray as xr
import xeofs as xe

xr.set_options(display_expand_attrs=False)

[1]:

<xarray.core.options.set_options at 0x7fa282d028d0>

Load the data

Next, we fetch the data from the xarray tutorial repository. The data is a 3D dataset of 6-hourly surface air temperature over North America for the years 2013 and 2014.

[2]:

t2m = xr.tutorial.open_dataset('air_temperature')
t2m

[2]:

<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes: (5)

Fit the model

In order to apply PCA to the data, we first have to create an EOF object.

[3]:

model = xe.models.EOF()

We will now fit the model to the data. If you’ve worked with sklearn before, this process will seem familiar. However, there’s an important difference: while sklearn’s fit method typically assumes 2D input data shaped as (sample x feature), our data has more than two dimensions. For any model, including PCA, we must therefore specify the sample dimension. With this information, xeofs will interpret all other dimensions as feature dimensions.

In climate science, it’s common to maximize variance along the time dimension when applying PCA, but this isn’t the only approach. For instance, Compagnucci & Richman (2007) discuss alternative applications.

xeofs offers flexibility in this aspect. You can designate multiple sample dimensions, provided at least one feature dimension remains. For our purposes, we’ll set time as our sample dimension and then fit the model to our data.

[4]:

model.fit(t2m, dim='time')

[4]:

<xeofs.models.eof.EOF at 0x7fa27b48ed90>
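To illustrate what choosing a sample dimension means, here is a minimal NumPy sketch. It is a conceptual illustration only and does not reproduce xeofs internals: the array `field` and its small shape are made up for the example. The idea is that the sample dimension (here the first axis, standing in for time) becomes the rows of a 2D matrix, while all remaining dimensions are flattened into a single feature axis.

```python
import numpy as np

# Synthetic stand-in for a (time, lat, lon) field; shape chosen for illustration.
rng = np.random.default_rng(0)
n_time, n_lat, n_lon = 8, 3, 4
field = rng.normal(size=(n_time, n_lat, n_lon))

# Treat the first axis ("time") as the sample dimension and flatten
# the remaining axes ("lat", "lon") into one feature axis.
X = field.reshape(n_time, n_lat * n_lon)

# Each row of X is one time step, each column one grid point.
# The flattening is lossless: reshaping back recovers the original field.
restored = X.reshape(n_time, n_lat, n_lon)
```

With this layout, `X` has the familiar (sample x feature) shape that a 2D PCA routine expects.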

Inspect the results

Now that the model has been fitted, we can examine the result. For example, one typically starts by looking at the explained variance (ratio).

[5]:

model.explained_variance_ratio()

[5]:

<xarray.DataArray 'explained_variance_ratio' (mode: 2)> Size: 16B
array([0.7968353, 0.0270206])
Coordinates:
  * mode     (mode) int64 16B 1 2
Attributes: (16)
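For readers who want to see where such ratios come from, the following NumPy sketch computes an explained variance ratio from the singular values of a centered data matrix. It is an illustration under stated assumptions, not the xeofs implementation: the synthetic matrix `X` and the `n_samples - 1` normalization are choices made for this example (the ratio itself does not depend on that normalization).

```python
import numpy as np

# Synthetic (sample x feature) data with unequal variance per feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.5])

# Center each feature, as PCA operates on anomalies.
Xc = X - X.mean(axis=0)

# Singular values of the centered matrix, sorted in descending order.
s = np.linalg.svd(Xc, compute_uv=False)

# Variance carried by each mode and its share of the total variance.
n_samples = Xc.shape[0]
explained_variance = s**2 / (n_samples - 1)
total_variance = Xc.var(axis=0, ddof=1).sum()
explained_variance_ratio = explained_variance / total_variance
```

When every mode is kept, the ratios sum to one. When only the leading modes are retained, as with the two modes shown above, the ratios sum to less than one.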


Next, we examine the spatial patterns. These are the eigenvectors of the covariance matrix and are often referred to as EOFs or principal components.

NOTE: The xeofs library aims to adhere to the convention where the primary patterns obtained from dimensionality reduction (which typically exclude the sample dimension) are termed components (akin to principal components). When data is projected onto these patterns, for instance using the transform method, the outcome is termed scores (similar to principal component scores). However, this terminology is more of a guideline than a strict rule.
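To make the distinction between components and scores concrete, here is a small NumPy sketch on synthetic data. It is not xeofs code: the variable names and the synthetic matrix are assumptions for illustration. It shows that the eigenvectors of the feature covariance matrix coincide (up to sign) with the right singular vectors of the centered data, and that projecting the data onto these patterns yields the scores.

```python
import numpy as np

# Synthetic (sample x feature) data with well-separated variances.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# Route 1: eigen-decomposition of the feature covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort modes by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# "Components": one pattern per mode, defined over the features.
components = Vt                            # shape (mode, feature)

# "Scores": the data projected onto the patterns, one series per mode.
scores = Xc @ components.T                 # shape (sample, mode)
```

In the quickstart's setting, the feature axis corresponds to the flattened lat/lon grid, so each component is a spatial map and each score is a time series.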

[6]:

components = model.components()
components

[6]:

<xarray.Dataset> Size: 22kB
Dimensions:  (lat: 25, lon: 53, mode: 2)
Coordinates:
  * lat      (lat) float32 100B 15.0 17.5 20.0 22.5 25.0 ... 67.5 70.0 72.5 75.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * mode     (mode) int64 16B 1 2
Data variables:
    air      (mode, lat, lon) float64 21kB 0.0022 0.002131 ... 0.02168 0.0221


You’ll observe that the result is an xr.Dataset, mirroring the format of our original input data. To visualize the components, we can use typical methods for xarray objects. Now, let’s inspect the first component.

NOTE: xeofs is designed to match the data type of its input. For instance, if you provide an xr.DataArray as input, the components will also be of type xr.DataArray. Similarly, if the input is an xr.Dataset, the components will mirror that as an xr.Dataset. The same principle applies if the input is a list; the output components will be presented in a list format. This consistent behavior is maintained across all xeofs methods.

[7]:

components["air"].sel(mode=1).plot()

[7]:

<matplotlib.collections.QuadMesh at 0x7fa2732f75d0>

We can also examine the principal component scores, which represent the corresponding time series.

NOTE: When comparing the scores from xeofs to outputs from other PCA implementations like sklearn or eofs, you might spot discrepancies in the absolute values. This arises because xeofs typically returns scores normalized by the L2 norm. If you prefer unnormalized scores, set normalized=False when using the scores or transform method.

[8]:

model.scores().sel(mode=1).plot()

[8]:

[<matplotlib.lines.Line2D at 0x7fa2733f8990>]
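The effect of L2 normalization mentioned in the note above can be sketched with NumPy. This is an illustration, not the xeofs implementation: it assumes that normalizing means dividing each mode's score series by its own L2 norm, and it uses a synthetic data matrix.

```python
import numpy as np

# Synthetic (sample x feature) data, centered along the sample axis.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Unnormalized scores: projection of the data onto the patterns.
unnormalized_scores = Xc @ Vt.T

# L2 norm of each mode's score series (assumed normalization factor).
l2_norms = np.linalg.norm(unnormalized_scores, axis=0)

# Normalized scores: every mode's series has unit L2 norm.
normalized_scores = unnormalized_scores / l2_norms
```

Under this assumption, the normalization factor of each mode equals its singular value, so the two kinds of scores differ only by a per-mode scaling and have the same shape in time.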