Overview

This vignette demonstrates a workflow for spatially simulating soil texture separates (Sand, Silt, Clay) using the geocoda package.

Soil texture is a compositional variable; its components must sum to exactly 100%. This sum constraint violates standard statistical assumptions (e.g., variables are treated as independent), making conventional multivariate geostatistics inappropriate. The Isometric Log-Ratio (ILR) transformation is one principled approach designed specifically for compositional data: it removes the sum constraint, allowing standard multivariate geostatistics to be applied to the transformed space. Back-transformation of simulation results guarantees that all realizations sum to 100% and remain non-negative. This workflow uses ILR transformations; however, other transformations (e.g., Centered Log-Ratio or Additive Log-Ratio) offer different properties and may be more suitable depending on the analytical goal. The framework is generic and applicable to any D-part composition (geochemistry, vegetation classes, mineral assemblages, etc.).
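
As a small illustration of the round trip, the sketch below uses the compositions package (also used later in this vignette for the observation data) to map a texture triple to two unconstrained ILR coordinates and back:

```r
comp <- c(SAND = 20, SILT = 65, CLAY = 15)            # sums to 100
z <- compositions::ilr(compositions::acomp(comp))     # 2 unconstrained coordinates
back <- as.numeric(compositions::ilrInv(z)) * 100     # back-transform, rescale to percent
round(back, 6)                                        # recovers 20, 65, 15
sum(back)                                             # 100: the constraint is restored
```

The two ILR coordinates can be modeled with ordinary multivariate geostatistics; the back-transform restores the constraint by construction.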

The workflow follows five key steps:

  1. Constrain - Define valid ranges for soil fine-earth fraction components
  2. Transform - Convert to ILR space to eliminate the sum constraint
  3. Model - Combine the ILR covariance structure with spatial variograms (independent univariate models by default, or a Linear Model of Coregionalization, LMC)
  4. Simulate - Generate spatial realizations in ILR space
  5. Back-transform - Return to original units, preserving the 100% constraint

Data Preparation and Analysis

Setup and Load Packages

library(geocoda)
library(gstat)
library(sf)
library(terra)

# Set seed for reproducibility
set.seed(42)

Define Composition Constraints

Start by defining realistic bounds for soil texture components. These constraints represent the plausible range of values you expect to simulate.

constraints <- list(
  SAND = list(min = 0, max = 40),
  SILT = list(min = 50, max = 80),
  CLAY = list(min = 10, max = 20)
)

This example uses typical silt loam bounds: sand and clay restricted to narrow ranges, silt filling the remainder to sum to 100%.
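
A quick sanity check worth running before expanding the grid: the bounds can only yield valid compositions if the component minima sum to at most 100 and the maxima to at least 100. A minimal sketch:

```r
mins <- sapply(constraints, function(b) b$min)   # 0, 50, 10  -> sum 60
maxs <- sapply(constraints, function(b) b$max)   # 40, 80, 20 -> sum 140
sum(mins) <= 100 && 100 <= sum(maxs)             # TRUE: the bounds are feasible
```

If this check fails, gc_expand_bounds() has no valid combinations to enumerate.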

Generate Valid Composition Grid

Expand the constraints into a grid of all physically valid combinations that sum to exactly 100%. For exploration, coarse grids (0.5-1% resolution) are usually sufficient, though finer grids allow more detailed compositions at higher computational cost.

grid <- gc_expand_bounds(constraints, step = 1.0, target_sum = 100)

With 1% resolution, we get 341 valid compositions (nrow(grid)). Let’s check the first few:

head(grid)
#>   SAND SILT CLAY
#> 1   40   50   10
#> 2   39   51   10
#> 3   38   52   10
#> 4   37   53   10
#> 5   36   54   10
#> 6   35   55   10

All rows sum to 100% within floating-point tolerance:

all(abs(rowSums(grid) - 100) < 1e-6)
#> [1] TRUE

Bootstrap Samples from Valid Compositions

Sample from the valid composition grid to create a representative sample population. By default this uses uniform random sampling; for soil texture-aware sampling that respects texture class boundaries, use method = "soil_texture" (requires the aqp package).

samples <- gc_resample_compositions(
  grid,
  n = 1000,
  method = "uniform"
)

Check the sample statistics:

data.frame(
  Component = c("SAND", "SILT", "CLAY"),
  Mean = colMeans(samples$samples),
  StdDev = apply(samples$samples, 2, sd),
  Min = apply(samples$samples, 2, min),
  Max = apply(samples$samples, 2, max)
)
#>      Component   Mean   StdDev Min Max
#> SAND      SAND 20.124 9.352438   0  40
#> SILT      SILT 65.029 8.949987  50  80
#> CLAY      CLAY 14.847 3.043494  10  20

Estimate ILR Parameters

Transform the sample compositions to ILR space and estimate the mean vector and covariance matrix.

params <- gc_ilr_params(samples$samples)
params$mean
#> [1]  0.9210350 -0.6840842

The covariance matrix:

params$cov
#>           [,1]       [,2]
#> [1,] 0.2690597 0.12365761
#> [2,] 0.1236576 0.09692909

Geostatistical Modeling

Suggest and Set Variogram Parameters

The variogram quantifies spatial correlation by measuring how similarity between values changes with distance:

γ(h) = ½ E[(Z(x) − Z(x+h))²]

where Z(x) represents the value at location x.

Key properties:

  • At distance 0: γ(0) = 0 (perfect similarity with itself)
  • At small distances: γ(h) is small, indicating high correlation
  • At large distances: γ(h) approaches a maximum value (the sill), indicating independence
  • Nugget effect: often a discontinuity at h = 0, representing measurement error or small-scale variation
  • Range: the distance at which the variogram reaches the sill, beyond which observations are uncorrelated
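
These properties can be traced numerically for the exponential model used below. The closed form is a sketch; note that gstat parameterizes the exponential model so that the practical range (where ~95% of the sill is reached) is about three times the range parameter:

```r
# gamma(h) = nugget + psill * (1 - exp(-h / range)) for the exponential model
exp_vgm <- function(h, psill, range, nugget = 0) {
  nugget + psill * (1 - exp(-h / range))
}
h <- c(1, 10, 50, 150)
round(exp_vgm(h, psill = 1, range = 50, nugget = 0.01), 3)
# climbs from near the nugget toward the sill (nugget + psill) as distance grows
```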

Use a helper function to suggest reasonable parameters:

extent <- c(0, 0, 100, 100)
suggestions <- gc_vgm_defaults(params, extent)
suggestions
#> $range
#> [1] 47.14045
#> 
#> $nugget
#> [1] 0.001829944
#> 
#> $mean_sill
#> [1] 0.1829944

Create the variogram model using the suggested range, but reduce it substantially to better capture local features. We also use a small nugget to preserve the conditioning effect:

vgm_model <- gstat::vgm(
  psill = 1,
  model = "Exp",
  range = suggestions$range * 0.35,
  nugget = 0.01
)
vgm_model
#>   model psill    range
#> 1   Nug  0.01  0.00000
#> 2   Exp  1.00 16.49916

Build Multivariate Geostatistical Model

Combine the ILR covariance structure with the spatial variogram to create a joint multivariate representation:

model <- gc_ilr_model(params, variogram_model = vgm_model)
model
#> data:
#> ilr1 : formula = ilr1`~`1 ; dummy data beta = 0.921035
#> ilr2 : formula = ilr2`~`1 ; dummy data beta = -0.6840842
#> variograms:
#>         model        psill    range
#> ilr1[1]   Nug 0.0026905968  0.00000
#> ilr1[2]   Exp 0.2690596802 16.49916
#> ilr2[1]   Nug 0.0009692909  0.00000
#> ilr2[2]   Exp 0.0969290870 16.49916

By default, gc_ilr_model() uses Independent Univariate Kriging (one model per ILR dimension, no cross-covariance terms). This approach is numerically stable and standard practice. For comparison, you can also build a Linear Model of Coregionalization (LMC) with cross-covariance terms by specifying model_type = "lmc":

## Alternative: Build LMC Model with Cross-Covariance Terms
model_lmc <- gc_ilr_model(params, variogram_model = vgm_model, model_type = "lmc")
model_lmc
#> data:
#> ilr1 : formula = ilr1`~`1 ; dummy data beta = 0.921035
#> ilr2 : formula = ilr2`~`1 ; dummy data beta = -0.6840842
#> variograms:
#>              model        psill    range
#> ilr1[1]        Nug 0.0026905968  0.00000
#> ilr1[2]        Exp 0.2690596802 16.49916
#> ilr2[1]        Nug 0.0009692909  0.00000
#> ilr2[2]        Exp 0.0969290870 16.49916
#> ilr1.ilr2[1]   Nug 0.0012365761  0.00000
#> ilr1.ilr2[2]   Exp 0.1236576104 16.49916

The LMC approach includes spatial cross-correlation between ILR dimensions. While theoretically more complete, it is more numerically delicate, and the practical comparison later in this vignette shows it rarely changes the results. For most applications, the univariate approach is preferred.
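
A quick way to judge whether the cross-terms might matter for your data is the correlation implied by the ILR covariance estimated earlier:

```r
cov2cor(params$cov)
# the off-diagonal (about 0.77 for the covariance shown above) is the
# correlation between ilr1 and ilr2; even so, the univariate and LMC
# simulations compared later in this vignette differ little in practice
```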

Spatial Simulation

Define Simulation Grid

Create a regular spatial grid where you want to simulate soil texture:

x_range <- seq(0, 100, by = 4)
y_range <- seq(0, 100, by = 4)
grid_df <- expand.grid(x = x_range, y = y_range)

grid_sf <- sf::st_as_sf(grid_df, coords = c("x", "y"))

This creates a 26×26 grid with 4 unit spacing (676 cells total).

Generate Spatial Realizations

Simulate multiple equally-probable realizations of soil texture across the domain. For uncertainty quantification, use 10-20 realizations; for exploratory analysis, 1-3 realizations are sufficient.

sims <- gc_sim_composition(
  model,
  grid_sf,
  nsim = 3,
  target_names = c("sand", "silt", "clay"),
  crs = "local"
)
sims
#> class       : SpatRaster 
#> size        : 26, 26, 9  (nrow, ncol, nlyr)
#> resolution  : 4, 4  (x, y)
#> extent      : -2, 102, -2, 102  (xmin, xmax, ymin, ymax)
#> coord. ref. : Cartesian (Meter) 
#> source(s)   : memory
#> names       : sand.sim1, silt.sim1, clay.sim1, sand.sim2, silt.sim2, clay.sim2, ... 
#> min values  :  2.188856,  33.24090,  3.570206,  2.666431,  31.87933,  3.689029, ... 
#> max values  : 46.666041,  92.89209, 37.131396, 45.353244,  92.85779, 37.892626, ...

Validation

Verify that constraints are satisfied and output statistics are reasonable:

vals <- as.data.frame(terra::values(sims))

Check the sum constraint for each realization:

for (sim_idx in 1:3) {
  col_indices <- grep(paste0("sim", sim_idx), colnames(vals))
  sums <- rowSums(vals[, col_indices])
  cat(
    "Realization", sim_idx, ": min =", round(min(sums), 4),
    ", max =", round(max(sums), 4),
    ", mean =", round(mean(sums), 4), "\n"
  )
}
#> Realization 1 : min = 100 , max = 100 , mean = 100 
#> Realization 2 : min = 100 , max = 100 , mean = 100 
#> Realization 3 : min = 100 , max = 100 , mean = 100

Check non-negativity:

min(as.matrix(vals))
#> [1] 2.188856
max(as.matrix(vals))
#> [1] 94.33514

Summary Statistics and Visualization

Summary of the first realization:

first_sim <- vals[, grep("sim1", colnames(vals))]
summary(first_sim)
#>    sand.sim1        silt.sim1       clay.sim1    
#>  Min.   : 2.189   Min.   :33.24   Min.   : 3.57  
#>  1st Qu.:10.534   1st Qu.:60.58   1st Qu.:11.58  
#>  Median :15.211   Median :68.58   Median :15.57  
#>  Mean   :16.577   Mean   :67.60   Mean   :15.83  
#>  3rd Qu.:20.797   3rd Qu.:76.39   3rd Qu.:19.95  
#>  Max.   :46.666   Max.   :92.89   Max.   :37.13

Correlation structure:

cor(first_sim)
#>            sand.sim1  silt.sim1  clay.sim1
#> sand.sim1  1.0000000 -0.8887263  0.3789889
#> silt.sim1 -0.8887263  1.0000000 -0.7610566
#> clay.sim1  0.3789889 -0.7610566  1.0000000

Marginal distributions:

apply(first_sim, 2, function(x) {
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))
})
#>      sand.sim1 silt.sim1 clay.sim1
#> Mean 16.577486  67.59519 15.827322
#> SD    7.837749  11.18117  5.539084
#> Min   2.188856  33.24090  3.570206
#> Max  46.666041  92.89209 37.131396

Visualizing Results

plot(sims, main = c(
  "Sand (%)", "Silt (%)", "Clay (%)",
  "Realization 2", "Realization 2", "Realization 2",
  "Realization 3", "Realization 3", "Realization 3"
))

Advanced Techniques

Conditional Simulation with Observed Data

In practice, you often have field observations or laboratory measurements of soil texture at specific locations. Conditional simulation honors these observed values while generating realizations at unobserved locations, reducing uncertainty near samples.

Prepare Conditioning Data

Create spatially-structured conditioning data to simulate a realistic field scenario. We’ll create observation points with higher sand content along a diagonal line:

set.seed(123)

# Create observation locations along a diagonal line
n_obs <- 8
t_vals <- seq(0, 1, length.out = n_obs)
obs_coords <- data.frame(
  x = 20 + 60 * t_vals,
  y = 20 + 60 * t_vals
)

# Assign sandy loam compositions with measurement variation
obs_comps <- list()
for (i in 1:nrow(obs_coords)) {
  base_comp <- c(SAND = 60, SILT = 25, CLAY = 15)
  noise <- rnorm(3, mean = 0, sd = 2)
  obs_comp <- pmax(pmin(base_comp + noise, 100), 0)
  obs_comp <- obs_comp / sum(obs_comp) * 100
  obs_comps[[i]] <- obs_comp
}
obs_samples <- as.data.frame(do.call(rbind, obs_comps))

Transform observations to ILR space:

obs_ilr <- compositions::ilr(compositions::acomp(obs_samples))
colnames(obs_ilr) <- c("ilr1", "ilr2")

conditioning_data <- sf::st_as_sf(
  data.frame(x = obs_coords$x, y = obs_coords$y, obs_ilr),
  coords = c("x", "y")
)
conditioning_data
#> Simple feature collection with 8 features and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 20 ymin: 20 xmax: 80 ymax: 80
#> CRS:           NA
#>         ilr1       ilr2                  geometry
#> 1 -0.6188566 -0.6050353             POINT (20 20)
#> 2 -0.6134338 -0.6115086 POINT (28.57143 28.57143)
#> 3 -0.7052796 -0.8155840 POINT (37.14286 37.14286)
#> 4 -0.5424064 -0.7682648 POINT (45.71429 45.71429)
#> 5 -0.6222000 -0.8463847 POINT (54.28571 54.28571)
#> 6 -0.6323449 -1.0623600 POINT (62.85714 62.85714)
#> 7 -0.6626548 -0.8935953 POINT (71.42857 71.42857)
#> 8 -0.6744539 -0.8200314             POINT (80 80)

Build Model with Conditioning Data

The model switches to kriging mode when conditioning data is provided:

model_conditional <- gc_ilr_model(
  ilr_params = params,
  variogram_model = vgm_model,
  data = conditioning_data
)
model_conditional
#> data:
#> ilr1 : formula = ilr1`~`1 ; data dim = 8 x 2 beta = 0.921035
#> ilr2 : formula = ilr2`~`1 ; data dim = 8 x 2 beta = -0.6840842
#> variograms:
#>         model        psill    range
#> ilr1[1]   Nug 0.0026905968  0.00000
#> ilr1[2]   Exp 0.2690596802 16.49916
#> ilr2[1]   Nug 0.0009692909  0.00000
#> ilr2[2]   Exp 0.0969290870 16.49916

Simulate with Conditioning

Perform the conditional simulation. Since the model was built with conditioning data, it automatically honors those values during prediction:

sims_conditional <- gc_sim_composition(
  model = model_conditional,
  locations = grid_sf,
  nsim = 1,
  target_names = c("sand", "silt", "clay"),
  crs = "local",
  observed_data = conditioning_data
)

For comparison, also generate an unconditional simulation with the same grid and parameters:

sims_unconditional <- gc_sim_composition(
  model = model,
  locations = grid_sf,
  nsim = 1,
  target_names = c("sand", "silt", "clay"),
  crs = "local"
)

Verification

Extract predicted values at the observation locations:

extracted <- terra::extract(sims_conditional, conditioning_data)

comparison <- data.frame(
  Observed = obs_samples$SAND,
  Predicted = extracted$sand.sim1,
  Error = extracted$sand.sim1 - obs_samples$SAND
)

comparison
#>   Observed Predicted         Error
#> 1 57.98828  57.98828  7.105427e-15
#> 2 57.92274  61.88335  3.960612e+00
#> 3 62.79436  63.86238  1.068024e+00
#> 4 57.79304  65.52629  7.733252e+00
#> 5 60.85557  60.97899  1.234216e-01
#> 6 63.17187  65.28832  2.116454e+00
#> 7 62.45096  50.24221 -1.220875e+01
#> 8 62.01084  62.01084 -7.105427e-15

Conditional kriging reproduces the observed sand values exactly at the first and last locations (errors at floating-point precision). The intermediate locations show deviations of up to ~12% because those observation points fall between cell centers of the 4-unit simulation grid, so the extracted values come from nearby cells rather than the exact conditioning locations.

Visualize the simulations side-by-side:

par(mfrow = c(1, 2))
plot(sims_unconditional[["sand.sim1"]], main = "Unconditional: Sand")
plot(sims_conditional[["sand.sim1"]], main = "Conditional: Sand (Diagonal Observations)")
cond_coords <- sf::st_coordinates(conditioning_data)
points(cond_coords[, 1], cond_coords[, 2], pch = 1, cex = 2, col = "black", lwd = 2)

par(mfrow = c(1, 1))

The hollow circles mark the observation locations. The conditional simulation shows a clear influence from these measurements, with sand content concentrated near the diagonal line where the observations were placed.

Understanding Conditional Simulation

Conditional simulation differs fundamentally from unconditional simulation in how it uses data. Unconditional simulation ignores observations and samples from the marginal distribution everywhere, producing uniformly high uncertainty. Conditional simulation incorporates observed values at sample locations, reducing uncertainty in the vicinity of measurements. Away from observations, uncertainty gradually increases with distance.

Choose unconditional simulation to explore regional patterns without anchoring to specific measurements. Use conditional simulation whenever you have field observations or measurements that should inform predictions. Conditioning reduces uncertainty near measurement locations and improves reliability for decision-making based on the realizations.
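
One way to see this uncertainty structure directly is a cell-wise spread map across realizations. The sketch below uses the three unconditional realizations generated earlier; with a conditional run at nsim > 1, the same map should show low spread near the observation points:

```r
sand_layers <- sims[[grep("^sand", names(sims))]]  # sand.sim1 .. sand.sim3
sand_sd <- terra::app(sand_layers, sd)             # cell-wise standard deviation
plot(sand_sd, main = "Sand (%): spread across realizations")
```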

Handling Zeros and Censored Data

Real-world compositional data often contain zeros (non-detected values, trace components) or below-detection-limit (censored) measurements. Zeros are problematic because log-ratios are undefined when any part is zero, making standard covariance analysis in log-ratio space unstable. The gc_handle_zeros() function addresses this using log-ratio-based imputation methods.

Zero Imputation Strategies

The package provides three imputation approaches, accessible via the method parameter:

  • mzero (multiplicative zero replacement): Fast, replaces zeros with a small multiple of the detection limit
  • azero (additive zero replacement): Conservative, adds a small constant to all values
  • lrem (log-ratio EM): Probabilistic, uses expectation-maximization on log-ratio space

For soil texture data with a few isolated zeros, mzero is typically sufficient. For many zeros (>30% of values), lrem is more principled.

Example with zeros in soil texture data:

# Create data with some zeros
soil_with_zeros <- data.frame(
  SAND = c(40, 35, 0, 42, 38),
  SILT = c(35, 40, 45, 36, 40),
  CLAY = c(25, 25, 55, 22, 22)
)

# Impute using multiplicative zero replacement (default)
imputation_result <- gc_handle_zeros(soil_with_zeros, method = "mzero")

print(imputation_result$imputed_data)
print(paste("Imputation rate:", round(imputation_result$imputation_rate, 4)))
print(paste("Rows imputed:", sum(imputation_result$row_status == "imputed")))

The imputed_data component is ready for use in further analysis (transformation, covariance estimation, etc.).

For datasets with censored measurements and known detection limits:

# Define detection limits for each component
detection_limits <- c(SAND = 1, SILT = 1, CLAY = 1)

# Use log-ratio EM for censored data
result_lrem <- gc_handle_zeros(
  soil_with_zeros,
  method = "lrem",
  dl = detection_limits
)

Data Quality Diagnostics

After building a kriging model with conditioning data, assess how well the model honors the observations using gc_validate_conditioning().

## Validation compares predictions at observation locations to actual values
validation <- gc_validate_conditioning(model, conditioning_data)

# Review error metrics
print(validation$error_metrics)
print(validation$overall_metrics)

# Check for problematic observations
high_residuals <- which(abs(validation$residuals$ilr1) > 0.1)
if (length(high_residuals) > 0) {
  print("Observations with large residuals:")
  print(validation$residuals[high_residuals, ])
}

Error metrics interpretation:

  • RMSE: root mean squared error; smaller is better (ideally < 0.01 in ILR space)
  • MAE: mean absolute error; typical values 0.5-2% in composition space
  • Mean_Error: signed bias; should be close to zero
  • Median_Error: like Mean_Error, but robust to outliers

Non-zero residuals can indicate:

  • Numerical precision effects (typically < 1e-10 in ILR space)
  • Model convergence issues (switch to univariate kriging if problematic)
  • Grid resolution effects (predictions on coarse grids may miss exact observation locations)

For compositional back-transformation, residuals in ILR space can be converted to composition space using ilrInv() for interpretation in original units.
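
A sketch of that conversion: back-transform prediction and observation separately, then difference in percent units. The two ILR vectors below are illustrative values, not output from the validation object:

```r
pred_pct <- as.numeric(compositions::ilrInv(c(0.92, -0.68))) * 100
obs_pct  <- as.numeric(compositions::ilrInv(c(0.95, -0.70))) * 100
round(pred_pct - obs_pct, 2)   # residual in percentage points per component
```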

LMC Stability and Admissibility

When using the LMC approach (model_type = "lmc"), the positive-definiteness of the covariance structure is critical. The gc_fit_vgm() function includes stability diagnostics via the correct.diagonal parameter (default 1.01).

The correct.diagonal parameter multiplies each marginal sill by a small factor to improve numerical stability:

# Variogram fitting with LMC diagnostics
# (sample_locations_with_ilr, a data frame of coordinates plus ILR values,
#  is constructed in the diagnostics section below)
fitted_vgm <- gc_fit_vgm(
  params,
  data = sample_locations_with_ilr,
  aggregate = TRUE,
  correct.diagonal = 1.01  # Default: slight diagonal inflation
)

# Check LMC admissibility
lmc_admissible <- attr(fitted_vgm, "lmc_admissibility")
if (!lmc_admissible) {
  warning("LMC sill matrix may not be positive-definite")
}

If the model produces singular matrix errors during simulation, increase correct.diagonal to 1.02-1.05, or switch to univariate kriging (model_type = "univariate", the default) for better numerical stability.
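
That escalation can be automated with a short loop, reusing the gc_fit_vgm() call and the lmc_admissibility attribute from above (a sketch; sample_locations_with_ilr is defined in the diagnostics section below):

```r
# Escalate the diagonal correction until the LMC sill matrix is admissible
for (cd in c(1.01, 1.02, 1.05)) {
  fitted_vgm <- gc_fit_vgm(params, data = sample_locations_with_ilr,
                           aggregate = TRUE, correct.diagonal = cd)
  if (isTRUE(attr(fitted_vgm, "lmc_admissibility"))) break
}
# if none succeed, fall back to the univariate model (the default)
```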

Comparing Univariate vs Multivariate Modeling

With conditional simulation now understood, we can compare the two modeling approaches in practice. To see how univariate and LMC models differ, we run conditional simulations with both and compare results.

Create observation data:

set.seed(456)

# Create observation locations along a diagonal
n_obs <- 10
t_vals <- seq(0, 1, length.out = n_obs)
obs_coords <- data.frame(
  x = 10 + 80 * t_vals,
  y = 10 + 80 * t_vals
)

# Assign sandy loam compositions with measurement variation
obs_comps <- list()
for (i in 1:nrow(obs_coords)) {
  base_comp <- c(SAND = 55, SILT = 30, CLAY = 15)
  noise <- rnorm(3, mean = 0, sd = 2)
  obs_comp <- pmax(pmin(base_comp + noise, 100), 0)
  obs_comp <- obs_comp / sum(obs_comp) * 100
  obs_comps[[i]] <- obs_comp
}
obs_samples <- as.data.frame(do.call(rbind, obs_comps))

# Transform to ILR space
obs_ilr <- compositions::ilr(compositions::acomp(obs_samples))
colnames(obs_ilr) <- c("ilr1", "ilr2")

comparison_data <- sf::st_as_sf(
  data.frame(x = obs_coords$x, y = obs_coords$y, obs_ilr),
  coords = c("x", "y")
)

Build conditional models with both approaches:

# Univariate model with conditioning data
model_univ_cond <- gc_ilr_model(
  params, 
  variogram_model = vgm_model,
  data = comparison_data
)

# LMC model with conditioning data
model_lmc_cond <- gc_ilr_model(
  params, 
  variogram_model = vgm_model,
  data = comparison_data,
  model_type = "lmc"
)

Create a fine grid for simulation:

fine_grid <- expand.grid(
  x = seq(0, 100, by = 5),
  y = seq(0, 100, by = 5)
)
fine_grid_sf <- sf::st_as_sf(fine_grid, coords = c("x", "y"))

Simulate with both models:

sims_univ_comp <- gc_sim_composition(
  model_univ_cond, fine_grid_sf, nsim = 1,
  target_names = c("sand", "silt", "clay")
)

sims_lmc_comp <- gc_sim_composition(
  model_lmc_cond, fine_grid_sf, nsim = 1,
  target_names = c("sand", "silt", "clay")
)

Visual comparison:

par(mfrow = c(1, 2))

plot(sims_univ_comp[["sand.sim1"]], main = "Univariate: Sand", zlim = c(30, 75))
cond_coords <- sf::st_coordinates(comparison_data)
points(cond_coords[, 1], cond_coords[, 2], pch = 1, cex = 2, col = "black", lwd = 2)

plot(sims_lmc_comp[["sand.sim1"]], main = "LMC: Sand", zlim = c(30, 75))
points(cond_coords[, 1], cond_coords[, 2], pch = 1, cex = 2, col = "black", lwd = 2)


par(mfrow = c(1, 1))

Numerical comparison at observation locations:

# Extract predicted values at observation points
univ_extracted <- terra::extract(sims_univ_comp, comparison_data)
lmc_extracted <- terra::extract(sims_lmc_comp, comparison_data)

# Compare to observed values
comparison_table <- data.frame(
  Observed = obs_samples$SAND,
  Univariate = univ_extracted$sand.sim1,
  LMC = lmc_extracted$sand.sim1,
  Univ_Error = abs(univ_extracted$sand.sim1 - obs_samples$SAND),
  LMC_Error = abs(lmc_extracted$sand.sim1 - obs_samples$SAND)
)

comparison_table
#>    Observed Univariate      LMC   Univ_Error    LMC_Error
#> 1  52.23030   52.23030 52.23030 1.421085e-14 0.000000e+00
#> 2  54.88676   50.69617 61.24788 4.190586e+00 6.361118e+00
#> 3  54.26647   55.14599 53.15188 8.795135e-01 1.114598e+00
#> 4  55.07955   57.19096 56.84773 2.111406e+00 1.768171e+00
#> 5  55.64003   44.72407 59.25459 1.091597e+01 3.614557e+00
#> 6  54.45975   52.54057 57.05396 1.919183e+00 2.594207e+00
#> 7  55.82712   54.71037 46.46575 1.116746e+00 9.361366e+00
#> 8  54.78208   58.29805 58.46982 3.515965e+00 3.687736e+00
#> 9  54.23885   50.77341 53.17414 3.465442e+00 1.064710e+00
#> 10 54.27075   54.27075 54.27075 0.000000e+00 7.105427e-15

Both approaches produce similar results at observation locations. The univariate model achieves a mean absolute error of ~2.8% and maximum error of ~10.9%, while the LMC model shows mean absolute error of ~3.0% and maximum error of ~9.4%. These differences are negligible for practical purposes, confirming that the simpler univariate approach is adequate. The univariate method is preferred for its numerical stability, computational efficiency, and ease of interpretation.

Quality Assessment and Remediation

Evaluating Data Quality: Stationarity and Gaussianity

Before proceeding to spatial simulation, assess whether the data meets the assumptions required by geostatistical methods. Two key assessments inform the remediation strategy:

  • Stationarity: Whether the statistical properties vary across the study area
  • Gaussianity: Whether the ILR-transformed data are multivariate normal

Based on the results of these assessments, you can select appropriate remediation techniques.

Prepare sample data with ILR values for diagnostics:

# Create sample data with spatial coordinates
set.seed(123)
n_samples <- 100
sample_locations_with_ilr <- data.frame(
  x = runif(n_samples, 0, 100),
  y = runif(n_samples, 0, 100),
  ilr1 = rnorm(n_samples, mean = 0.5, sd = 0.8),
  ilr2 = rnorm(n_samples, mean = -0.2, sd = 0.6)
)

head(sample_locations_with_ilr)
#>          x        y        ilr1       ilr2
#> 1 28.75775 59.99890 -0.06832525  1.1192862
#> 2 78.83051 33.28235  0.70550697  0.5874478
#> 3 40.89769 48.86130  0.30264650 -0.3590870
#> 4 88.30174 95.44738  0.22196592  0.1259164
#> 5 94.04673 48.29024 -0.26129485 -0.4486040
#> 6  4.55565 89.03502  0.46397782 -0.4857481

Assessing Stationarity

Spatial stationarity means the covariance structure is constant across the domain. Non-stationarity suggests the domain contains distinct compositional provinces that should be modeled separately.

The gc_assess_stationarity() function uses PCA biplot visualization to detect spatial clustering of data points:

stationarity_result <- gc_assess_stationarity(
  sample_locations_with_ilr,
  method = "biplot",
  plot = TRUE
)

The biplot colors points by X-coordinate. If points cluster by spatial location (e.g., samples from the left half of the domain plotting separately from those in the right half), this indicates non-stationarity and suggests domain stratification.

Interpretation:

  • Stationary: Points scattered randomly in the biplot without spatial clustering. Proceed with global LMC modeling.
  • Non-stationary: Points cluster spatially. Either stratify the domain or apply domain-specific models for each stratum.

An alternative quantitative approach divides the domain into windows:

stationarity_local <- gc_assess_stationarity(
  sample_locations_with_ilr,
  method = "local",
  plot = FALSE
)

stationarity_local$summary
#>          Method WindowsAnalyzed NormalizedCovDiff Stationary
#> 1 Local Windows              12            0.6876      FALSE
#>                                                                                                          Interpretation
#> 1 Potential non-stationarity detected: Local covariances vary substantially (0.688). Domain stratification recommended.

The local method compares covariance structures within windows. Large variation between windows indicates non-stationarity and supports domain stratification.
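
The window comparison can be mimicked in a few lines of base R: split the domain into quadrants and compare the ILR covariance matrices between them (a coarse, illustrative version of what the function does with its moving windows):

```r
d <- sample_locations_with_ilr
quadrant <- interaction(d$x > 50, d$y > 50)          # four spatial blocks
covs <- lapply(split(d[, c("ilr1", "ilr2")], quadrant), cov)
sapply(covs, function(m) m["ilr1", "ilr1"])          # ilr1 variance per block
# large differences between blocks point toward non-stationarity
```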

Assessing Gaussianity

Sequential Gaussian Simulation assumes the ILR-transformed data are multivariate normal. The gc_assess_gaussianity() function tests this assumption:

ilr_values <- sample_locations_with_ilr[, c("ilr1", "ilr2")]

gaussianity_result <- gc_assess_gaussianity(
  ilr_values,
  method = "anderson",
  plot = TRUE,
  alpha = 0.05
)


gaussianity_result$recommendation
#> [1] "Non-Gaussianity detected in dimension(s) 2. Recommend anamorphosis (normal-score transform) or domain stratification."

Available Methods:

  • Anderson-Darling (recommended): Tests each ILR dimension separately. Sensitive to tail departures from normality.
  • Shapiro-Wilk: Alternative univariate test, limited to N ≤ 5000 samples.
  • Mardia: Multivariate test for joint skewness and kurtosis, provides holistic assessment.

Interpretation:

  • Gaussian (p-value > 0.05): Proceed with Sequential Gaussian Simulation (SGS).
  • Non-Gaussian (p-value ≤ 0.05): Consider remediation:
    • Apply anamorphosis (normal-score back-transform after simulation)
    • Stratify domain by geological or soil units
    • Document and accept uncertainty in the simulation

Applying Anamorphosis Remediation

When data are non-Gaussian, anamorphosis transforms simulated values from normal space back to the original data distribution while preserving spatial correlation. This works by matching quantiles of the simulated data to quantiles of the reference data.

ilr_values <- sample_locations_with_ilr[, c("ilr1", "ilr2")]

sim_ilr <- matrix(rnorm(100, 0, 1), nrow = 50, ncol = 2)
colnames(sim_ilr) <- c("ilr1", "ilr2")

back_transformed <- gc_apply_anamorphosis(
  sim_ilr,
  ref_ilr_values = ilr_values,
  despike = TRUE
)

ref_range <- apply(ilr_values, 2, range)
# check each column against its own min and max
all(sweep(back_transformed, 2, ref_range[1, ], ">=") &
      sweep(back_transformed, 2, ref_range[2, ], "<="))

The anamorphosis process:

  1. Computes quantiles of the reference (original) ILR values
  2. Computes quantiles of the simulated values
  3. Maps simulated quantiles to reference quantiles (inverse transform)
  4. Interpolates for values between quantiles
  5. Optionally despikes extreme values, capping extrapolations beyond the reference range

This preserves the marginal distribution of the original data while maintaining spatial correlation structure from the simulation.
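
Steps 1-4 amount to rank-based quantile matching, which can be written in a few lines of base R (a bare-bones sketch, without the despiking or tail handling of gc_apply_anamorphosis()):

```r
quantile_match <- function(sim, ref) {
  p <- rank(sim) / (length(sim) + 1)    # plotting-position probabilities
  as.numeric(quantile(ref, probs = p))  # map onto the reference quantiles
}
set.seed(1)
matched <- quantile_match(rnorm(50), ref = rnorm(200, mean = 2, sd = 0.5))
range(matched)   # confined to the reference data's observed range
```

The rank of each simulated value is preserved, so the spatial ordering of highs and lows survives while the marginal distribution is replaced by the reference one.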

Identifying Domain Strata

When non-stationarity is detected, the domain may contain distinct compositional provinces or geological units that should be modeled separately. The gc_identify_strata() function partitions observations into spatially-coherent strata using clustering methods.

# Identify strata using K-means clustering in PCA space
strata_result <- gc_identify_strata(
  sample_locations_with_ilr,
  n_strata = 2,
  method = "kmeans",
  plot = TRUE
)


# Examine the stratification results
strata_result$n_strata
#> [1] 2
table(strata_result$strata)
#> 
#>  1  2 
#> 65 35

The stratification identifies:

  1. Strata in PCA space: Shows how clusters separate observations based on compositional variance. PCA reduces dimensionality while retaining the structure that distinguishes domains.

  2. Strata in spatial layout: Reveals whether clusters are geographically coherent. Tight, localized clusters indicate true domain partitioning, while scattered patterns may suggest mixed lithologies or contamination.

  3. Quality metrics: Silhouette widths measure how well each observation fits its assigned stratum (ranges from -1 to 1, where >0.5 indicates good separation). Average silhouette widths by stratum guide model selection.

Once strata are identified, you can model each stratum independently:

# Subset the data by stratum
stratum_1_data <- sample_locations_with_ilr[strata_result$strata == 1, ]
stratum_2_data <- sample_locations_with_ilr[strata_result$strata == 2, ]

# Each stratum can now be assessed for stationarity and Gaussianity
stat_1 <- gc_assess_stationarity(stratum_1_data, method = "biplot", plot = FALSE)
stat_2 <- gc_assess_stationarity(stratum_2_data, method = "biplot", plot = FALSE)

# Proceed with workflow on each stratum independently

Integrated Diagnostic Workflow

A complete assessment combines stationarity and Gaussianity tests to determine the appropriate remediation strategy. Run both diagnostics on your data:

stat_result <- gc_assess_stationarity(
  sample_locations_with_ilr,
  method = "biplot",
  plot = FALSE
)

ilr_vals <- sample_locations_with_ilr[, c("ilr1", "ilr2")]
gauss_result <- gc_assess_gaussianity(
  ilr_vals,
  method = "anderson",
  plot = FALSE
)

assessment <- data.frame(
  Aspect = c("Stationarity", "Gaussianity"),
  Status = c(
    if (stat_result$stationary) "Yes" else "No",
    if (gauss_result$gaussian) "Yes" else "No"
  )
)
assessment
#>         Aspect Status
#> 1 Stationarity    Yes
#> 2  Gaussianity     No

The results determine your remediation path:

  • Stationary + Gaussian: Proceed with standard workflow (LMC modeling with Sequential Gaussian Simulation)
  • Non-stationary + Gaussian: Stratify domain using gc_identify_strata(), apply workflow to each stratum
  • Stationary + Non-Gaussian: Apply anamorphosis to simulated results with gc_apply_anamorphosis()
  • Non-stationary + Non-Gaussian: Stratify domain AND apply anamorphosis to each stratum’s simulated results

The gc_identify_strata() function can automatically partition the domain based on PCA clustering if non-stationarity is detected, suggesting where to split the data for independent modeling.
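
The decision table above can be encoded directly. A sketch using the assessment objects created in this section (the commented call stands in for post-simulation remediation):

```r
if (!stat_result$stationary) {
  strata <- gc_identify_strata(sample_locations_with_ilr, n_strata = 2)
  # repeat the variogram modeling and simulation within each stratum
}
if (!gauss_result$gaussian) {
  # after simulating, pull the realizations back to the data's distribution;
  # `sim_ilr_values` stands in for the simulated ILR values
  # remediated <- gc_apply_anamorphosis(sim_ilr_values, ref_ilr_values = ilr_vals)
}
```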

References

  • Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall.
  • Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279-300.
  • Chilès, J.P., & Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley.