Geographically weighted PCA#

Geographically Weighted Principal Component Analysis (GWPCA) is a spatial analysis method that identifies and visualizes local spatial patterns and relationships in multivariate datasets across various geographic areas. It operates by applying PCA within a moving window over a geographical region, which enables the extraction of local principal components that can differ across locations.

TIn this demonstration, we’ll apply GWPCA to a dataset detailing the chemical compositions of soils from countries around the Baltic Sea [1]. This example is inspired by a tutorial originally crafted and published by Chris Brunsdon [2]. The dataset comprises 10 variables (chemical elements) and spans 768 samples. Here, each sample refers to a pair of latitude and longitude coordinates, representing specific sampling stations.

Note

The dataset we’re using is found in the R package mvoutlier. To access it, we’ll employ the Python package rpy2 which facilitates interaction with R packages from within Python.

Note

Presently, there’s no support for xarray.Dataset lacking an explicit feature dimension. As a workaround, xarray.DataArray.to_array can be used to convert the Dataset to an DataArray.

Warning

Bear in mind that GWPCA requires significant computational power. The xeofs implementation is optimized for CPU efficiency and is best suited for smaller to medium data sets. For more extensive datasets where parallel processing becomes essential, it’s advisable to turn to the R package GWmodel. This package harnesses CUDA to enable GPU-accelerated GWPCA for optimized performance.

Let’s import the necessary packages.

# For the analysis
import numpy as np
import xarray as xr
import xeofs as xe

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For accessing R packages
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

Next, we’ll install the R package mvoutlier using the rpy2 package.

xr.set_options(display_expand_data=False)
utils = importr("utils")
utils.chooseCRANmirror(ind=1)
utils.install_packages("mvoutlier")
R[write to console]: also installing the dependencies ‘DEoptimR’, ‘sgeostat’, ‘robustbase’


R[write to console]: trying URL 'https://cloud.r-project.org/src/contrib/DEoptimR_1.1-3.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 19222 bytes (18 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]:

R[write to console]: downloaded 18 KB


R[write to console]: trying URL 'https://cloud.r-project.org/src/contrib/sgeostat_1.0-27.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 35685 bytes (34 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]:

R[write to console]: downloaded 34 KB


R[write to console]: trying URL 'https://cloud.r-project.org/src/contrib/robustbase_0.99-2.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 2292086 bytes (2.2 MB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]:

R[write to console]: downloaded 2.2 MB


R[write to console]: trying URL 'https://cloud.r-project.org/src/contrib/mvoutlier_2.1.1.tar.gz'

R[write to console]: Content type 'application/x-gzip'
R[write to console]:  length 476636 bytes (465 KB)

R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]: =
R[write to console]:

R[write to console]: downloaded 465 KB


R[write to console]:

R[write to console]:
R[write to console]: The downloaded source packages are in
        ‘/tmp/Rtmpr8iXHn/downloaded_packages’
R[write to console]:
R[write to console]:

R[write to console]: Updating HTML index of packages in '.Library'

R[write to console]: Making 'packages.html' ...
R[write to console]:  done


<rpy2.rinterface_lib.sexp.NULLType object at 0x7fa747af2890> [0]

Let’s load the dataset and convert it into a pandas.DataFrame. Alongside, we’ll also load the background data that outlines the borders of countries in the Baltic Sea region. This will help us visually represent the GWPCA results.

ro.r(
    """
    require("mvoutlier")
    data(bsstop)
    Data <- bsstop[,1:14]
    background <- bss.background
    """
)
with (ro.default_converter + pandas2ri.converter).context():
    data_df = ro.conversion.get_conversion().rpy2py(ro.r["Data"])
    background_df = ro.conversion.get_conversion().rpy2py(ro.r["background"])
data_df.head()
R[write to console]: Loading required package: mvoutlier

R[write to console]: Loading required package: sgeostat
ID CNo XCOO YCOO SiO2_T TiO2_T Al2O3_T Fe2O3_T MnO_T MgO_T CaO_T Na2O_T K2O_T P2O5_T
1 5001.0 60.0 -619656.5 6805304.1 43.61 1.290 13.07 12.25 0.167 3.22 2.48 1.14 2.01 0.481
2 5002.0 120.0 214714.1 7745546.6 58.73 0.913 14.78 6.48 0.105 2.47 3.08 2.19 1.78 0.298
3 5003.0 33.0 -368415.5 7065039.2 58.14 0.902 11.89 5.70 0.126 2.44 3.17 2.13 1.16 0.408
4 5004.0 39.0 226609.0 6922431.0 43.98 0.524 10.00 4.08 0.052 1.00 1.37 1.60 1.82 0.395
5 5005.0 103.0 544050.0 7808760.0 60.90 0.702 13.20 6.37 0.079 2.59 3.13 2.97 1.35 0.139


Since xeofs uses xarray, we convert the data into an xarray.DataArray.

data_df = data_df.rename(columns={"ID": "station"}).set_index("station")
data = data_df.to_xarray()
data = data.rename({"XCOO": "x", "YCOO": "y"})
data = data.set_index(station=("x", "y"))
data = data.drop_vars("CNo")
da = data.to_array(dim="element")
da
<xarray.DataArray (element: 10, station: 768)> Size: 61kB
43.61 58.73 58.14 43.98 60.9 54.0 82.72 ... 0.196 0.202 0.207 0.109 0.141 0.185
Coordinates:
  * station  (station) object 6kB MultiIndex
  * x        (station) float64 6kB -6.197e+05 2.147e+05 ... -2.82e+05 -1.273e+05
  * y        (station) float64 6kB 6.805e+06 7.746e+06 ... 5.796e+06 6.523e+06
  * element  (element) object 80B 'SiO2_T' 'TiO2_T' ... 'K2O_T' 'P2O5_T'


Let’s dive into the GWPCA. First, initialize a GWPCA instance and fit it to the data. The station dimension serves as our sample dimension, along which the local PCAs will be applied. Since these PCAs need to gauge distances to adjacent stations, we must specify a distance metric. Our station data includes coordinates in meters, so we’ll choose the euclidean metric. If you have coordinates in degrees (like latitude and longitude), choose the haversine metric instead. We’re also using a bisquare kernel with a bandwidth of 1000 km. Note that the bandwidth unit always follows input data (which is in meters here), except when using the haversine metric, which always gives distances in kilometers. Lastly, we’ll standardize the input to ensure consistent scales for the chemical elements.

gwpca = xe.models.GWPCA(
    n_modes=5,
    standardize=True,
    metric="euclidean",
    kernel="bisquare",
    bandwidth=1000000.0,
)
gwpca.fit(da, "station")
gwpca.components()
<xarray.DataArray 'components' (mode: 5, element: 10, station: 768)> Size: 307kB
0.1813 -0.3584 0.1243 0.2 -0.3812 ... 0.1229 -0.2865 0.4732 0.4197 -0.4249
Coordinates:
  * mode     (mode) int64 40B 1 2 3 4 5
  * element  (element) object 80B 'SiO2_T' 'TiO2_T' ... 'K2O_T' 'P2O5_T'
  * station  (station) object 6kB MultiIndex
  * x        (station) float64 6kB -6.197e+05 2.147e+05 ... -2.82e+05 -1.273e+05
  * y        (station) float64 6kB 6.805e+06 7.746e+06 ... 5.796e+06 6.523e+06
Attributes: (12/16)
    model:          GWPCA
    software:       xeofs
    version:        2.3.2
    date:           2024-03-31 21:13:03
    n_modes:        5
    center:         True
    ...             ...
    feature_name:   feature
    random_state:   None
    verbose:        False
    compute:        True
    solver:         auto
    solver_kwargs:  {}


The components method returns the local principal components for each station. Note that the dimensionality of the returned array is [station, element, mode], so in practice we don’t really have reduced the dimensionality of the data set. However, we can extract the largest locally weighted components for each station which tells us which chemical elements dominate the local PCAs.

llwc = gwpca.largest_locally_weighted_components()
llwc
<xarray.DataArray 'largest_locally_weighted_components' (mode: 5, station: 768)> Size: 31kB
'MgO_T' 'Al2O3_T' 'MgO_T' 'TiO2_T' ... 'K2O_T' 'Fe2O3_T' 'Fe2O3_T' 'CaO_T'
Coordinates:
  * mode     (mode) int64 40B 1 2 3 4 5
  * station  (station) object 6kB MultiIndex
  * x        (station) float64 6kB -6.197e+05 2.147e+05 ... -2.82e+05 -1.273e+05
  * y        (station) float64 6kB 6.805e+06 7.746e+06 ... 5.796e+06 6.523e+06


Let’s visualize the spatial patterns of the chemical elements. As the stations are positioned on a irregular grid, we’ll transform the llwc DataArray into a pandas.DataFrame. After that, we can easily visualize it using the scatter method. For demonstation, we’ll concentrate on the first mode:

llwc1_df = llwc.sel(mode=1).to_dataframe()

elements = da.element.values
n_elements = len(elements)
colors = np.arange(n_elements)
col_dict = {el: col for el, col in zip(elements, colors)}

llwc1_df["colors"] = llwc1_df["largest_locally_weighted_components"].map(col_dict)
cmap = sns.color_palette("tab10", n_colors=n_elements, as_cmap=True)


fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
background_df.plot.scatter(ax=ax, x="V1", y="V2", color=".3", marker=".", s=1)
s = ax.scatter(
    x=llwc1_df["x"],
    y=llwc1_df["y"],
    c=llwc1_df["colors"],
    ec="w",
    s=40,
    cmap=cmap,
    vmin=-0.5,
    vmax=n_elements - 0.5,
)
cbar = fig.colorbar(mappable=s, ax=ax, label="Largest locally weighted component")
cbar.set_ticks(colors)
cbar.set_ticklabels(elements)
ax.set_title("Largest locally weighted element", loc="left", weight=800)
plt.show()
Largest locally weighted element

In the final step, let’s examine the explained variance. Like standard PCA, this gives us insight into the variance explained by each mode. But with a local PCA for every station, the explained variance varies spatially. Notably, the first mode’s explained variance differs across countries, ranging from roughly 40% to 70%.

exp_var_ratio = gwpca.explained_variance_ratio()
evr1_df = exp_var_ratio.sel(mode=1).to_dataframe()

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
background_df.plot.scatter(ax=ax, x="V1", y="V2", color=".3", marker=".", s=1)
evr1_df.plot.scatter(
    ax=ax, x="x", y="y", c="explained_variance_ratio", vmin=0.4, vmax=0.7
)
ax.set_title("Fraction of locally explained variance", loc="left", weight=800)
plt.show()
Fraction of locally explained variance

Total running time of the script: (0 minutes 34.593 seconds)

Gallery generated by Sphinx-Gallery