Exploring the tidyverse, lakes, synthetic data, and tibble-SoilProfileCollections

Andrew G. Brown | 2020/08/16

Categories: R tidyverse aqp SoilProfileCollection mutate_profile permute_profile limnology Tags: demonstrations

In this brief tutorial, we will be using the aqp and tidyverse R packages for wrangling stratified environmental data from lake cores.

The sample lake core data from Dewey Dunnington’s tidyverse core-data-wrangling tutorial are pretty “typical” of raw, stratified environmental data that you might have kicking around on your hard drive. They also pack a lot of information content despite their very small (subset) size – as you will see in his tutorial and in what follows. Dewey (paleolimbot on GitHub) is prolific; he has developed great fundamental tools with R, Python and C++, and you should check out his work.

So, that is why we are using lake cores rather than the typical canned soil-database examples from aqp and soilDB. Lake sediments are cool and important troves of information – even where they do not technically meet our criteria for subaqueous soils! I think pedology benefits from study of (paleo-)limnology and its techniques!

Install the required packages if needed. Be sure to get the latest version of aqp from GitHub. At the time of writing (August 2020), the features that add support for tbl_df and data.table in SoilProfileCollection objects are not yet on CRAN [but will be soon].

packages.to.install <- c("aqp", "tidyverse")

if (!all(packages.to.install %in% installed.packages())) {
  
   # install dependencies and CRAN version of packages
   install.packages(c("remotes", packages.to.install)) 
     
   # install development version of aqp
   remotes::install_github("ncss-tech/aqp", dependencies = FALSE, quiet = TRUE)
}

Load the packages. We will use ggplot2 for most plots and base R graphics for SoilProfileCollection sketches.

library(aqp)
library(tidyverse)
library(ggplot2)
# lake core data from @paleolimbot's tutorial
# https://fishandwhistle.net/post/2017/using-the-tidyverse-to-wrangle-core-data

# use tribble function to read in a comma-separated set of expressions and create a tibble
pocmaj_raw <- tribble(
  ~sample_id, ~Ca, ~Ti, ~V,  
  "poc15-2 0", 1036, 1337, 29,
  "poc15-2 0", 1951, 2427, 31,
  "poc15-2 0", 1879, 2350, 39,
  "poc15-2 1", 1488, 2016, 36,
  "poc15-2 2", 2416, 3270, 79,
  "poc15-2 3", 2253, 3197, 79,
  "poc15-2 4", 2372, 3536, 87,
  "poc15-2 5", 2621, 3850, 86,
  "poc15-2 5", 2785, 3939, 95,
  "poc15-2 5", 2500, 3881, 80,
  "maj15-1 0", 1623, 2104, 73,
  "maj15-1 0", 1624, 2174, 73,
  "maj15-1 0", 2407, 2831, 89,
  "maj15-1 1", 1418, 2409, 70,
  "maj15-1 2", 1550, 2376, 70,
  "maj15-1 3", 1448, 2485, 64,
  "maj15-1 4", 1247, 2414, 57,
  "maj15-1 5", 1463, 1869, 78,
  "maj15-1 5", 1269, 1834, 71,
  "maj15-1 5", 1505, 1989, 94
)

# inspect 
head(pocmaj_raw)
## # A tibble: 6 × 4
##   sample_id    Ca    Ti     V
##   <chr>     <dbl> <dbl> <dbl>
## 1 poc15-2 0  1036  1337    29
## 2 poc15-2 0  1951  2427    31
## 3 poc15-2 0  1879  2350    39
## 4 poc15-2 1  1488  2016    36
## 5 poc15-2 2  2416  3270    79
## 6 poc15-2 3  2253  3197    79

On inspection we find a field sample_id that contains site, year, core and layer-level identifiers. Some of these combinations are repeated with unique values of analytes. In this case, assume the layer identifier is the top depth in meters AND that the cores were collected in constant increments of 1 meter. The elements are measured with a portable X-ray fluorescence unit – which does not truly measure concentration – so I have reported the units as [pXRF] here.

For the sample data, we have repeated measures in depth – because points in lakes were sampled at successively greater depth slices. The core segments along Z collected at an X, Y location in a lake are not independent, though they may intercept layers that are dissimilar from one another. In the sample data, the shallowest and deepest layers ("0" and "5") have replicates. If this were our data, we would try to make use of those replicates across the rest of the cores – but we only have them for a subset of depths in two cores, so we generalize with median().

dplyr wrangling for input to aqp

We take the raw data and process it for use in the aqp package.

We will use the magrittr “pipe” (%>%) operator to pipe or chain-together commands. The %>% facilitates passing variables to functions of the form f(x, y) as x %>% f(y). This is unnecessary with only a single “step,” but is particularly useful whenever you don’t want to create a bunch of intermediate variables or would like to keep the sequence of your workflow easy to read.
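
For example, the two calls below are equivalent:

median(pocmaj_raw$V)       # "standard" function call syntax
pocmaj_raw$V %>% median()  # pipe: left-hand side becomes the first argument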

pocmaj <- pocmaj_raw %>%
  group_by(sample_id) %>%                 # group replicates together
  summarize(Ca = median(Ca), 
            Ti = median(Ti), 
            V  = median(V)) %>%           # calculate sample analyte median
  separate(sample_id, sep = " ",
           into = c("core", "depth")) %>% # split sample_id -> core + depth
  mutate(top = as.numeric(depth) * 100,   # calculate top and bottom depth
         bottom = top + 100)              #  (assuming 100 cm increments)

We group_by() sample_id and then summarize() the median() for each sample_id and analyte. This “flattens” a possible many-to-one relationship for values per depth. After that we separate() sample_id into core (site-level) and depth (layer-level) IDs using " " (a space) as the delimiter. Finally, we scale depth by a factor of 100 to convert from meters to centimeters.

We could have chosen mean() and sd(), which are unbiased estimators, as our layer-level summary functions. The demonstration shown later that uses profile-level standard deviations in Vanadium value could be relatively easily modified to use layer-level variance derived from replicates/bulk/sub samples etc. Other summary statistics could be calculated, such as max() for risk assessment. We use median() because it is good for central tendency and we will be permuting the values around a “centroid” as our final demonstration.
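
As a sketch of what that alternative would look like, only the summary function needs to change (pocmaj_max is a hypothetical variable name):

# max()-based summary of each analyte, e.g. for risk assessment
pocmaj_max <- pocmaj_raw %>%
  group_by(sample_id) %>%
  summarize(across(c(Ca, Ti, V), max))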

Promote tbl_df to SoilProfileCollection via the formula interface

Once you have a site-level ID, top and bottom depth inside a data.frame-like object, you can use the formula interface to “promote” it to a SoilProfileCollection object. This provides user access to site- and layer-level fields via the site() and horizons() S4 methods, respectively.

For the sample data, or any stratified data without a bottom depth, we want to fill in a bottom depth using known sampling intervals and core geometry. This could be a bit of a “leap” if your values are actually “point” estimates and cannot be considered “representative” for the whole layer/block.

In aqp, the top and bottom depths are used as checks for depth-based ordering and topology – which some algorithms are sensitive to. depths<- sorts layer data (in @horizons slot) first on profile ID (core) then on top depth. From sorted layer data, the @site slot data are derived.

# promote pocmaj to SoilProfileCollection 
depths(pocmaj) <- core ~ top + bottom

core, top and bottom are all columns that exist in pocmaj. tbl_df and data.table are subclasses supported by the SoilProfileCollection in addition to data.frame.
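
You can verify the split between the site and layer levels with the accessors mentioned above:

# site-level data: one row per core
site(pocmaj)

# layer-level data: one row per layer, sorted by core and top depth
horizons(pocmaj)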

depths<-

Promoting via the formula interface follows this form:

depths(your.df) <- unique_site_ID ~ topdepth + bottomdepth

your.df is expected to be an object that inherits from data.frame. The unique_site_ID is shared between all layers associated with a single “profile,” but there is no specific guarantee that the unique_site_ID*topdepth combination results in layers that are in order, without overlaps, gaps, etc. You can check profile-level validity for a whole collection with the function aqp::checkHzDepthLogic(). Ensuring that the site and layer data have parity can be achieved with aqp::spc_in_sync() – though this should not be necessary for most external uses of aqp.
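
For instance, a quick sanity check on the collection we just created might look like this:

# returns one row per profile, with logical flags for
# common depth errors (overlaps, gaps, missing depths)
aqp::checkHzDepthLogic(pocmaj)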

Soil profile sketches with plotSPC

Let’s make a profile sketch of core data for a single analyte, Vanadium (V). We will use this analyte through the rest of the tutorial.

From Wikipedia:

Vanadium is a chemical element with the symbol V and atomic number 23. It is a hard, silvery-grey, malleable transition metal. The elemental metal is rarely found in nature … Vanadium occurs naturally in about 65 minerals and in fossil fuel deposits. It is produced in China and Russia from steel smelter slag. Other countries produce it either from magnetite directly, flue dust of heavy oil, or as a byproduct of uranium mining. It is mainly used to produce specialty steel alloys such as high-speed tool steels, and some aluminium alloys.

# plot soil profile sketch of one variable (V = Vanadium)
plot(pocmaj, color = "V", name = "hzID")

Producing site-level summaries from layer-level data with mutate_profile

mutate_profile is a special variety of transform defined by the aqp package. It contains built-in iteration to support multiple non-standard evaluation expressions that are evaluated in the context of each single profile in a collection.

The result must either resolve to unit length for a profile (a site-level field) or be equal in length to the number of layers for that profile (a layer-level field).
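
A minimal sketch of both cases (max_V and V_rank are hypothetical field names):

# max(V) resolves to unit length          -> new site-level field
# rank(V) resolves to one value per layer -> new layer-level field
pocmaj_demo <- mutate_profile(pocmaj, max_V = max(V), V_rank = rank(V))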

Let’s define a function that we can use to summarize SoilProfileCollection objects over arbitrary depth intervals.

#' Calculate site-level summary stats over interval [0, max.depth]
#' @name summarize_core
#' @param spc A SoilProfileCollection 
#' @param max.depth Vertical depth to truncate output to [0, max.depth]; default: 600;
#' @return A SoilProfileCollection
summarize_core <- function(spc, max.depth = 600) {
  
  # max.depth is subject to some basic constraints
  stopifnot(is.numeric(max.depth), length(max.depth) == 1, max.depth > 0)
  
  # truncate cores to a common depth interval [0, max.depth]
  res <- trunc(spc, 0, max.depth)  %>% 
    
    # for the depth-weighted mean, calculate horizon thickness
    #   identical to arithmetic mean if depth interval is constant
    transform(layer_wt = bottom - top) %>%
    
    # mutate_profile performs mutation by profile (site/core)
    mutate_profile(dwt_V = weighted.mean(V, layer_wt),
                   median_V = median(V), 
                   mean_V = mean(V), 
                   sd_V = sd(V))
  return(res)
}
# calculate depth-weighted average V within each core truncated to upper 200cm or 600cm
pocmaj2 <- summarize_core(pocmaj, 200)
pocmaj6 <- summarize_core(pocmaj, 600)

# create unique profile IDs
profile_id(pocmaj2) <- paste(profile_id(pocmaj2), "[0 - 2 m]")
profile_id(pocmaj6) <- paste(profile_id(pocmaj6), "[0 - 6 m]")

The code below demonstrates the effect of total sampling depth on aggregation of stratified layers to “point” or site-level values.

# combine the two truncated collections
pocmaj2and6 <- c(pocmaj2, pocmaj6)

# combined plot
plot(pocmaj2and6, color = "V", name = "hzID")

# inspect
site(pocmaj2and6)[,c(idname(pocmaj2and6), "dwt_V", "median_V", "mean_V", "sd_V")]
## # A tibble: 4 × 5
##   core              dwt_V median_V mean_V  sd_V
##   <chr>             <dbl>    <dbl>  <dbl> <dbl>
## 1 maj15-1 [0 - 2 m]  71.5     71.5   71.5  2.12
## 2 maj15-1 [0 - 6 m]  68.7     70     68.7  7.31
## 3 poc15-2 [0 - 2 m]  33.5     33.5   33.5  3.54
## 4 poc15-2 [0 - 6 m]  66.3     79     66.3 25.7

The summary statistics reflect what we noticed above in the profile plot and will be important later. One might be interested in aggregate properties of groups of cores summarized as above. We only have two unique cores – but that won’t stop us.

We can take this further and aggregate cores together into groups based on proximity, process, etc. This small two-site example dataset appears to have two different water bodies or areas where cores were collected – presumably there are more data from those areas. In this case, we want to compare our 2-meter depth-truncated cores to the full 6-meter depth cores to test the effect of sampling depth on observed aggregate statistics. What are you missing by coring only to 2 meters rather than 6? To do this we will create a synthetic dataset.

Generating synthetic data

There is obviously natural variation in stratigraphy that is variably captured by a constant sampling interval. In addition, there are sampling errors such as an uncertain datum, incomplete recovery and compaction. We will generate some realizations assuming that all layer boundaries have a standard deviation in depth of 10 centimeters and that V concentration varies around the original value with Gaussian noise scaled by a core-level variance parameter.

# assume variation in bottom depth amounts to a standard deviation of about 10% of the 1 m core thickness
# i.e. a normal curve centered at the boundary between 100 cm layers has an SD of 10 cm
horizons(pocmaj2and6)$layer_sd <- 10

To simulate we use the aqp method perturb() (formerly known as permute_profile). We use profileApply to iterate over cores, and aqp::combine to combine the 100 realizations we generate for each of the four input profiles.

# random number seed
set.seed(1234)

# iterate over cores, generate 100 realizations each, combine result
bigpocmaj <- aqp::combine(profileApply(pocmaj2and6, function(p) {
               # use a horizon-level field to produce geometric realizations
               pp <- perturb(p, boundary.attr = "layer_sd")
               
               # naive simulation of varying V values
               #  - SD is the core-specific sd_V rounded to the nearest power of 10
               #  - values cannot be negative [snap to zero]
               pp$V <- pmax(0, pp$V + rnorm(nrow(pp), mean = 0, sd = 10^round(log10(pp$sd_V))))
               
               return(pp)
             }))

Note that this is a relatively naive way of simulating the property values, but it is perhaps reasonable considering the vertical granularity of observations in the Z dimension.

Inspecting the synthetic data

# simulated 400 realizations
#  4 profiles (2 cores, 0-2m and 0-6m case), 100 times each
length(bigpocmaj)
## [1] 400
# extract horizons tbl_df
h <- horizons(bigpocmaj)

# denormalize site-level core variable to horizon
h$core <- denormalize(bigpocmaj, "core")

# create factor levels for combined plots from multiple data sources
littlebig <- rbind(data.frame(pocmaj = "S", h[,horizonNames(pocmaj)]),
                   data.frame(pocmaj = "O",  horizons(pocmaj)))

# set up colors and labels
cbPalette <- c("#009E73", "#E69F00")
cbLabels <- c(sprintf("Median data & density (n = %s, bw = 5)", sum(!is.na(pocmaj$V))), 
              sprintf("Synthetic data (n = %s, bins = 100)", sum(!is.na(bigpocmaj$V))))

# plot
ggplot(littlebig, aes(x = V, after_stat(density),
                      color = factor(pocmaj), fill = factor(pocmaj))) + 
  geom_histogram(data = filter(littlebig, pocmaj == "S"), bins = 100) +
  geom_density(data = filter(littlebig, pocmaj == "O"), aes(x = V, fill = NULL), bw = 5) +
  geom_point(data = filter(littlebig, pocmaj == "O"), aes(x = V, y = 0)) +
  labs(title = "Median v.s. Synthetic bigpocmaj data\nVanadium by pXRF",
       x = "Vanadium (V), pXRF", y = "Density", color = "Legend", fill = "Legend") +
  scale_fill_manual(values = cbPalette, labels = cbLabels) +
  scale_color_manual(values = cbPalette, labels = cbLabels) +
  theme_bw()

We see that the bimodal pattern exhibited in the layer medians is reflected and fleshed out in our simulated data bigpocmaj. Notice the extreme values at the edges of the empirical distribution – which are extrapolations beyond the range observed in the input data. The inclusion of the mixture of 2- and 6-meter deep profiles concentrates density in the range of the distribution found in the upper depth portion of the profiles.

Grouped Profile Plot

This is a groupedProfilePlot() showing the different realizations generated from each of the four “parent” core descriptions. They look pretty noisy because the simulation is unaware of any sort of meaningful pattern of depth covariance.

groupedProfilePlot(bigpocmaj[sample(1:length(bigpocmaj), size = 30),], 
                   groups = "core", color = "V", print.id = FALSE)

Compare percentiles

# raw vs. median (analyte 1:1 with layer) vs. synthetic
rbind(tibble(Source = "Raw",       Q = t(round(quantile(pocmaj_raw$V)))),
      tibble(Source = "Median",    Q = t(round(quantile(pocmaj2and6$V)))),
      tibble(Source = "Synthetic", Q = t(round(quantile(bigpocmaj$V)))))
## # A tibble: 3 × 2
##   Source    Q[,"0%"] [,"25%"] [,"50%"] [,"75%"] [,"100%"]
##   <chr>        <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
## 1 Raw             29       62       73       82        95
## 2 Median          31       52       70       78        87
## 3 Synthetic        6       50       70       78       110

Note that the median()-based summary by layer “tamps” down the more extreme values from the raw data. However, the synthetic distribution captures the original range and then some. This is because we are “permuting” or “perturbing” Vanadium values using core-level variance parameters and assumptions of normality.

A maximum value of 110 pXRF is obtained in the synthetic sample – which is nearly 30 pXRF greater than the highest median, and about 20 pXRF greater than the highest observed raw value. For estimation of extrema rather than central tendency, the above approaches would need to be adjusted, but this is fine for this demonstration. It looks reasonably convincing when you compare it to the density plot – though there really is just very little input data in this demo.

# use aqp::slab to calculate quantiles of V for 10cm "slabs" grouped by core
#  the result is layer-level data in a data.frame
hbpp <- slab(bigpocmaj, .oldID ~ V, slab.structure = 10)
hbpp <- hbpp[complete.cases(hbpp),]

# small jitter to prevent overplotting of medians
hbpp$p.q50 <- jitter(hbpp$p.q50, amount = 1)

# set up colors
cbPalette <- c("#009E73", "#E69F00", "#cc79a7", "#0072b2")

# plot
ggplot(hbpp, aes(x = V, y = top)) + 
       scale_y_reverse() +
       labs(title = "Synthetic `bigpocmaj` 10cm slab-wise median V [pXRF]\n w/ 0.05, 0.95 quantiles", 
            x = "Vanadium (V), pXRF", y = "Depth, cm", color = "Legend") +
       geom_path(data = hbpp, aes(x = p.q50, y = top, color = factor(.oldID)), linetype = 2) +
       geom_path(data = hbpp, aes(x = p.q5, y = top, color = factor(.oldID)), linetype = 3) + 
       geom_path(data = hbpp, aes(x = p.q95, y = top, color = factor(.oldID)), linetype = 3) + 
       scale_color_manual(values = cbPalette) +
       theme_bw()

# calculate site-level summary stats for each simulated core
simulated.groups <- summarize_core(bigpocmaj, 600) %>% 
  
  # group by "core" (original profile ID)
  aqp::groupSPC(core) %>%
  
  # summarize profile stats by original core group
  aqp::summarizeSPC(mean_dwt_V = mean(dwt_V), 
                    mean_median_V = mean(median_V), 
                    mean_mean_V = mean(mean_V), 
                    mean_sd_V = mean(sd_V),
                    sd_dwt_V = sd(dwt_V), 
                    sd_median_V = sd(median_V), 
                    sd_mean_V = sd(mean_V), 
                    sd_sd_V = sd(sd_V))
## converting core to factor
# inspect
head(simulated.groups)
## # A tibble: 6 × 9
##   core       mean_dwt_V mean_median_V mean_mean_V mean_sd_V sd_dwt_V sd_median_V
##   <chr>           <dbl>         <dbl>       <dbl>     <dbl>    <dbl>       <dbl>
## 1 maj15-1 […       71.9          72.1        72.1      1.97       NA          NA
## 2 maj15-1 […       72.0          71.9        71.9      1.76       NA          NA
## 3 maj15-1 […       72.2          72.0        72.0      1.80       NA          NA
## 4 maj15-1 […       72.6          73.0        73.0      2.64       NA          NA
## 5 maj15-1 […       71.1          71.0        71.0      2.93       NA          NA
## 6 maj15-1 […       71.6          71.5        71.5      2.70       NA          NA
## # ℹ 2 more variables: sd_mean_V <dbl>, sd_sd_V <dbl>

So, we may not completely believe the values coming out of this naive simulation, but it shows the workflow for how you would summarize a larger amount of data.

It also may give you some ideas on how to begin comparing observed data to synthetic data. The output generally reflects the pattern we saw from the individual cores – with a little more fuzziness.

Commentary

The analysis shown above can be useful for demonstrating sensitivity to factors that affect scaling and aggregation of depth-dependent properties. In these lake cores, factors could include: the depth range sampled, core recovery/compaction/fall-in, variation in analytes compared to “random noise” around a single centroid, use of individual vs. group vs. global variance parameters, and more.

We assume standard deviations for core depth ranges are constant at 10 cm / 10% of a 1-meter slice. This produces a slight perturbation to the parent geometry without dramatically influencing the overall degree of aggregation. We further assume that the variation in the analyte can be “perturbed” for a layer based on the original value for that layer plus a Gaussian offset related to the whole-core variation in analyte.

With the above assumptions it is tacitly accepted that there is a single “mode” within a core – clearly this is not always the case in nature, especially as depth increases – but the question of whether nature can be approximated by a single mode will perennially be an interesting one. The above analysis is suggestive of two or more distinct modes of V value in layers: those with value ~70 pXRF, and those with about half that – but that is just based on two unrelated cores.

Ideally, simulation parameters would be based on more than one core and would be depth-dependent.

perturb() only deals in Gaussian offsets of boundaries, which means it is limited for generating profiles that exhibit more “natural” depth-skewed distributions or long-range-order types of variation in space. That means you shouldn’t expect it to be able to generate the stuff that you might find in nature – you have to sample that yourself! And then perturb it :)! That said, in the future perturb() will likely be expanded to support additional distributions, parameters, models or covariance structures that can be derived from experimental/observed data.

Next steps

If you enjoyed this tutorial and want to expand on what you worked on here, you could try modifying the code to explore Calcium (Ca) or Titanium (Ti) values instead of Vanadium (V) – Dewey’s tutorial, where this dataset came from, has tips on using facets to plot multiple analytes. You can also adjust the layer boundary standard deviation or the number of realizations (currently 100) per profile. You could also try to load your own datasets into this routine and modify it for your own use. I will caution you about naive assumptions of normality and independence one last time, though.
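
For example, swapping the analyte in the first sketch is a one-argument change:

# same profile sketch as before, colored by Calcium instead of Vanadium
plot(pocmaj, color = "Ca", name = "hzID")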

If discussion about identification of similar layers within depth-stratified data intrigues you, functions in aqp such as generalize.hz, hzTransitionProbabilities, genhzTableToAdjMat and mostLikelyHzSequence may be useful. Identifying “depth modes” in soils data is essential to properly categorizing knowledge and generating accurate interpretations of properties. Dylan Beaudette has developed several routines for visualizing generalized horizon patterns, such as plotSoilRelationGraph and plots that show simultaneous horizon transition probabilities for multiple generalized horizons across depth.

Here is a tutorial on initial steps for developing generalized horizon concepts for the Statistics for Soil Survey course. It needs to be expanded to demonstrate some of the graphical and modeling components.

The stratigraphic diagrams with tidypaleo & ggplot2 tutorial from Dewey has some great paleolimnology-depth-dendrograms that are getting at the same fundamental “depth mode” concepts – this might be worth exploring especially in the context of aqp::slice and aqp::slab. It also would be helpful if you are interested in expanding the plots in this tutorial to show multiple analytes.