4  Fuel data exploration

Our main questions regarding fuel loading are:

  1. Are there differences in fuel loading between treatments before and after pct?
  2. If so, what fuel loading components differ, and in which treatments before and after pct?
  3. What is the magnitude of the difference (for each fuel class) by treatment before and after pct?

We have a range of response variables that include loading in several different classes of surface fuels. Some of these may be correlated with each other, which could affect our interpretation of differences between treatments. For instance, if duff-litter load is negatively correlated with vegetation load, then even if neither variable differs between treatments on its own, one treatment might still have consistently higher duff-litter at any given level of vegetation loading. I will be analyzing fuel loading categories independently; while not ideal, this simplifies the analysis.

In addition to correlation among classes, there may also be overlap. Primarily, this is relevant to the 1-hr and duff/litter classes. We used a cutoff of about 2 mm to decide which redwood sprays were counted as 1-hr fuels. These same 1-hr fuels were also included in the duff/litter depth because they were usually thoroughly mixed with the litter. For this reason, it is not possible to simply add these two classes.

Similarly, fuel beds after pct were covered with substantial slash, often precluding measurement of duff/litter. Furthermore, as a result of an oversight in the sampling protocol, leaves suspended in slash were not addressed: they were captured as neither litter nor 1-hr fuels, yet they would be expected to contribute significantly to fire behavior after curing. It is also unclear how long the slash leaves will remain suspended (we would expect this to vary by species), and their impact on fire behavior would be expected to change dramatically with their spatial arrangement (bulk density), with reduced fire behavior as they settle to the forest floor. For this reason, comparisons of duff/litter and 1-hr fuels before and after pct would be difficult to interpret.

Finally, because herbaceous vegetation played a relatively minor role in our plots, it was lumped together with woody vegetation.

I opt to restrict quantitative comparisons to those between treatments, analyzed separately before and after pct.

In order to assess potential outliers, co-linearity, interactions, and other problems with statistical tests, we’ll first conduct some data exploration as outlined by Zuur et al. (2010).

For reference, here is a list of the fuel loading (response) variables of interest.

  1. duff/litter load
  2. woody vegetation
  3. one hr fuels
  4. ten hr fuels
  5. hundred hr fuels
  6. coarse woody debris

It’s important to note that variables 1 and 2 were measured twice per transect and the rest were measured once per transect. All analyses were conducted at the transect level: variables with two observations per transect were analyzed as their transect average. This was done to simplify the analysis.
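As a rough sketch, that averaging looks like the following (assuming the tidyverse is attached, as in the rest of this document; `stations` and its station-level columns are placeholder names, not objects used elsewhere here):

Code
# hypothetical illustration: average the two station-level observations up to
# the transect level, identified by site/treatment/corner/azi
stations |>
  group_by(site, treatment, corner, azi) |>
  summarize(
    dufflitter = mean(dufflitter, na.rm = TRUE),
    veg = mean(veg, na.rm = TRUE),
    .groups = "drop"
  )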

4.1 Basic Summary

Code
# function to access mostly raw data, but with combined veg and thoushr fuels.
# and select the output variables (`...`)
load2 <- function(shape = "wide", pct = c("pre", "post", "both"), ...) {
  pct <- match.arg(pct)
  pct_filter <- switch(pct,
    pre = "prepct", post = "postpct", both = "prepct|postpct"
  )
  # pct_filter is a regular expression, so match with grepl() rather than `==`,
  # which would never be TRUE for pct = "both"
  tl <- filter(total_load, grepl(pct_filter, phase))
  load_vars <- c("onehr", "tenhr", "hundhr", "dufflitter", "thoushr", "veg", "veg_diff")
  # TODO: if(pct == "post")
  if (...length() > 0) tl <- select(tl, ...)
  if (shape == "long") {
    tl <- pivot_longer(tl,
      -any_of(c("phase", "site", "treatment", "corner", "azi")),
      names_to = "class",
      values_to = "load"
    )
    load_vars <- load_vars[load_vars %in% tl$class]
    tl <- mutate(tl, class = factor(class, levels = load_vars)) |>
      filter(!is.na(load))
  }
  tl
}

These plots seem to indicate that, pre-pct, 1-hr fuel load is slightly higher for HA and HD compared to GS and LD. Ten years after harvest, we would expect 1-hr fuels to be dominated by twigs and branches falling from live crowns, and there is ostensibly greater crown volume in the HA and HD treatments. Duff and litter may be slightly higher in the HD treatment compared to the others. For live fuels, there appears to be a decreasing trend from GS > LD > HA & HD. Most of the live fuel is due to regenerating sprouts. The difference in sprout growth in high- vs. low-light environments was readily apparent in the field.

Ten-hr, 100-hr, and 1000-hr fuels do not reveal an obvious trend. I’m not sure why this is at the moment.

Code
load2("long", "pre") |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  facet_wrap(~class, scales = "free", labeller = fuel_class_labeller) +
  labs(x = "Treatment", y = expression(Load ~ (Mg %.% ha^-1))) +
  scale_x_discrete(labels = toupper)
Figure 4.1: Boxplots of fuels data summarized for six fuel classes across four overstory harvest treatments, prior to pre-commercial thinning, 10 years after harvest. Woody and herbaceous vegetation have been combined, and sound and rotten 1,000-hr fuels have been combined.
Code
load2("long", "post", -dufflitter) |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  facet_wrap(~class, scales = "free", labeller = fuel_class_labeller) +
  labs(x = "Treatment", y = expression(Load ~ (Mg %.% ha^-1))) +
  scale_x_discrete(labels = toupper)
Figure 4.2: Boxplots of transect-level fuels data summarized for six fuel classes across four overstory harvest treatments, following pre-commercial thinning, 10 years after harvest. Woody and herbaceous vegetation have been combined, and sound and rotten 1,000-hr fuels have been combined. Duff and litter data were not collected for this phase of the experiment. One-hour fuels do not include leaves suspended in the slash. Vegetation difference is the difference between pre- and post-pct vegetation fuels and represents vegetation slash recruited to the forest floor.

Here is a table summary of our raw data.

Code
spf <- "%.1f"
load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(
    avg_load = mean(load),
    sd_load = sd(load),
    .groups = "drop_last" # silence the grouped-output message
  ) |>
  mutate(
    load = paste0(sprintf(spf, avg_load), " (", sprintf(spf, sd_load), ")"),
    .keep = "unused"
  ) |>
  pivot_wider(names_from = treatment, values_from = load) |>
  knitr::kable()
Table 4.1: Average (sd) transect-level load (Mg ha-1) for six fuel classes in four different overstory harvest treatments.
| class      | gs          | ld          | ha           | hd          |
|------------|-------------|-------------|--------------|-------------|
| onehr      | 0.6 (0.6)   | 0.7 (0.5)   | 1.2 (0.8)    | 1.0 (0.6)   |
| tenhr      | 3.7 (2.8)   | 3.4 (2.0)   | 3.2 (3.2)    | 2.9 (1.6)   |
| hundhr     | 11.8 (8.6)  | 9.5 (8.5)   | 9.7 (9.0)    | 9.5 (7.6)   |
| dufflitter | 48.4 (31.1) | 44.9 (18.3) | 40.2 (22.0)  | 55.0 (28.9) |
| thoushr    | 44.9 (57.0) | 35.3 (52.4) | 41.6 (105.6) | 39.5 (45.1) |
| veg        | 38.2 (42.3) | 21.2 (16.0) | 12.9 (14.6)  | 17.3 (16.1) |

4.2 Outliers

Code
cleavland_plot <- function(data, title, load_var = load) {
  data |>
    group_by(site, treatment) |> # nolint
    mutate(replicate_mean_load = mean({{ load_var }}, na.rm = TRUE)) |>
    ggplot(
      aes(
        {{ load_var }},
        # sort the y-axis by the mean load within each replicate
        # (site x treatment), matching the figure captions
        fct_reorder(
          interaction(site, treatment, sep = " "),
          replicate_mean_load,
          .na_rm = TRUE
        ),
        color = treatment
      )
    ) +
    geom_jitter(width = 0, height = 0.2) +
    labs(x = expression(Load ~ (Mg %.% ha^-1)), y = "Data order", title = title)
}
(a) One hour fuels
(b) Ten hour fuels
(c) Hundred hour fuels
Figure 4.3: Data distribution of loading for fine woody debris classes. Data are sorted by mean loading within each replicate. Jitter has been added to aid in visual interpretation.
Figure 4.4: Sum of coarse woody (>7.64 cm, sound and rotten wood combined) fuel loading for transects. The y-axis is sorted by mean CWD loading for each replicate.
Figure 4.5: Combined duff and litter loading at each station along transects. The y-axis is sorted as in Figure 4.4.
(a) Woody vegetation
(b) Herbaceous vegetation
Figure 4.6: Vegetation fuel loading for each station along transects, including live and dead fuels attached to live vegetation. Y-axis is sorted as in Figure 4.4.

4.3 Normality

For further testing, I will summarize the data somewhat by combining vegetation loading (woody and herb) and coarse woody loading (sound and rotten) into single metrics (a sketch of this combination follows the list below). Now we have the following response variables:

  1. dufflitter
  2. onehr
  3. tenhr
  4. hundhr
  5. thoushr
  6. veg
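The combination itself is a simple sum; a minimal sketch, assuming raw class-level columns with the placeholder names used below (none of these names are defined elsewhere in this document):

Code
# hypothetical illustration of the combination described above; `raw_load`,
# `woody_veg`, `herb_veg`, `thoushr_sound`, and `thoushr_rotten` are
# placeholder names
raw_load |>
  mutate(
    veg = woody_veg + herb_veg, # woody + herbaceous vegetation
    thoushr = thoushr_sound + thoushr_rotten # sound + rotten coarse woody
  )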

When using MANOVA to test for differences between groups with multiple response variables, it is important that the response variables are multivariate normally distributed. Unfortunately, it would appear that we have a problem with normality. The raw data for each loading variable are clearly not normally distributed (Figure 4.7).

Code
myqqplot <- function(data, var) {
  data |>
    ggplot(aes(sample = {{ var }})) +
    stat_qq() +
    stat_qq_line() +
    facet_grid(class ~ treatment, scales = "free") +
    labs(
      x = "Theoretical quantiles",
      y = "Sample quantiles",
      title = "Normal Q-Q Plot"
    )
}

load2("long", "pre") |>
  myqqplot(load)
Figure 4.7: Naive qq plot of loading variables. This doesn’t take into account the fact that our data are nested. You could say this is based on a simple model where all observations are independent.
Code
# This code creates a dataframe with nested columns that include the original
# data and calculated coordinates of a superimposed normal distribution.
# TODO: why does this include site corner azi?
bins <- 16
hist_dat <- load2("long", "pre", all_of(treatment_load_vars)) |>
  drop_na() |>
  # Facet grid: each column (class) has the same scale; find limits to
  # calculate bin width. Limits are implied either by the constructed normal
  # curve or the raw data.
  group_by(class) |>
  mutate(
    xmin = min(c(mean(load) - 3 * sd(load), load)),
    xmax = max(c(mean(load) + 3 * sd(load), load))
  ) |>
  group_by(treatment, class) |>
  nest(data = load) |>
  # generate data for a normal curve with mean and sd from observed data to
  # cover 3 sd.
  mutate(
    norm_x = map(data, function(d) {
      seq(
        from = mean(d$load) - 3 * sd(d$load),
        to = mean(d$load) + 3 * sd(d$load),
        length.out = 100
      )
    }),
    # scale curve to expected binwidth based on plot layout (same x scale across
    # all fuel classes) multiplied by the number of observations. The histogram
    # and normal curve should represent the same total area. pmap() supplies
    # each row's own norm_x, xmin, and xmax; referring to those columns from
    # inside map(data, ...) would grab the whole columns instead.
    norm_y = pmap(list(data, norm_x, xmin, xmax), function(d, nx, lo, hi) {
      dens <- dnorm(nx, mean = mean(d$load), sd = sd(d$load))
      dens * ((hi - lo) / bins) * nrow(d)
    })
  )

# here is the code to plot the above data. I decided not to use this after all,
# but it was a lot of work, so I'm keeping it for posterity
# nolint start
# ggplot(filter(hist_dat, class %in% load_vars[1:3] )) +
#   geom_histogram(data = \(x) unnest(x, data), aes(x = load), bins = bins) +
#   geom_line(data = \(x) unnest(x, c(norm_x, norm_y)), aes(norm_x, norm_y)) +
#   facet_grid(treatment ~ class, scales = "free") +
#   labs(y = "count", x = "Load Mg/ha")
#
# ggplot(filter(hist_dat, class %in% load_vars[4:6] )) +
#   geom_histogram(data = \(x) unnest(x, data), aes(x = load), bins = bins) +
#   geom_line(data = \(x) unnest(x, c(norm_x, norm_y)), aes(norm_x, norm_y)) +
#   facet_grid(treatment ~ class, scales = "free") +
#   labs(y = "count", x = "Load Mg/ha")
# nolint end
Code
walk(list(c(1:3), c(4:6)), function(x) {
  p <- load2("long", "pre") |>
    filter(class %in% load_vars[x]) |>
    ggplot(aes(load)) +
    geom_histogram(bins = 16) +
    facet_grid(treatment ~ class, scales = "free") +
    labs(y = "count", x = "Load Mg/ha")
  print(p)
})
(a)
(b)
Figure 4.8: Histograms for the fuel loading response variables.

4.3.1 Box-Cox transformation

I would like to look at the effect of this transformation on the response data.
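For reference, the transformation used below is the two-parameter Box-Cox family, where the shift $\lambda_2$ accommodates zeros:

$$
y^{(\lambda_1, \lambda_2)} =
\begin{cases}
\dfrac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\[1ex]
\log(y + \lambda_2), & \lambda_1 = 0
\end{cases}
$$

This is what the `boxcox()` helper below computes, with both parameters estimated per fuel class by `geoR::boxcoxfit()`.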

Code
boxcox <- function(y, lambda1, lambda2) {
  if (lambda1 == 0) {
    log(y + lambda2)
  } else {
    ((y + lambda2)^lambda1 - 1) / lambda1
  }
}

load2("long", "pre", all_of(treatment_load_vars)) |>
  group_by(class) |>
  nest() |>
  rowwise() |>
  mutate(
    lambda = list(suppressMessages(
      geoR::boxcoxfit(data$load, lambda2 = TRUE)$lambda
    )),
    load_bc = list(boxcox(data$load, lambda[1], lambda[2]))
  ) |>
  unnest(c(data, load_bc)) |>
  ungroup() |>
  myqqplot(load_bc)

While this looks somewhat more normal, the zeros end up being a little strange. I applied a separate transformation for each fuel class, but all treatments within a fuel class received the same transformation.

For MANOVA we are concerned with within-group multivariate normality; this assumption does not appear to be met here either (Figure 4.9). The code output below indicates the rows with the greatest deviation from normality.
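For reference, the Mahalanobis distance of a residual vector $\mathbf{r}_i$ from the centroid $\bar{\mathbf{r}}$ of the residuals, given their covariance matrix $S$, is

$$
D_i^2 = (\mathbf{r}_i - \bar{\mathbf{r}})^\top S^{-1} (\mathbf{r}_i - \bar{\mathbf{r}}),
$$

and under multivariate normality $D_i^2$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the number of response variables, which is what the chi-square Q-Q plots below assess.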

Code
tl <- load2("wide", "pre", all_of(treatment_load_vars))
load_mod <- lm(as.matrix(tl[-1]) ~ treatment, data = tl)

m_dist <- heplots::Mahalanobis(resid(load_mod))

load2("wide", "pre", everything())[order(m_dist, decreasing = TRUE)[1:10], ]
# A tibble: 10 × 12
   phase  site    treatment corner   azi dufflitter onehr  tenhr hundhr thoushr
   <chr>  <chr>   <fct>     <chr>  <dbl>      <dbl> <dbl>  <dbl>  <dbl>   <dbl>
 1 prepct waldos  ha        nw        90       23.2 1.27  10.7    21.2    593. 
 2 prepct waldos  gs        ne       270      131.  1.72   6.46   16.1     36.3
 3 prepct whiskey ha        n        225       27.1 2.76  13.3    45.2     32.1
 4 prepct waldos  gs        se       270       54.1 0.504 11.6    34.8    280. 
 5 prepct whiskey gs        n        225       27.1 0.124  2.09    8.56     0  
 6 prepct waldos  ha        se       270      112.  2.32   0.930   2.54     0  
 7 prepct waldos  ha        se         0       15.5 3.34   6.53   13.7     17.7
 8 prepct waldos  gs        se         0       38.7 1.08   7.31   37.3     17.6
 9 prepct waldon  gs        e        225      112.  0.711  8.98   14.7     89.6
10 prepct whiskey ld        s        315       77.3 0.390  5.38   31.9     59.8
# ℹ 2 more variables: veg <dbl>, veg_diff <dbl>
Code
treatments <- c("gs", "ha", "ld", "hd")
par(mfrow = c(2, 2))
invisible(lapply(treatments, \(x) {
  heplots::cqplot(filter(tl, treatment == x)[-1], main = x)
}))
(a)
Figure 4.9: Plot (a) assesses the multivariate normality of residuals from the model in which all loading variables are a function of the treatment group.

A number of different tests of multivariate normality also confirm the lack of evidence for meeting this assumption (Table 4.2).

Code
all_mvn_tests <- function(data) {
  c("mardia", "hz", "royston", "dh", "energy") |>
    map(\(x) MVN::mvn(data = data, subset = "treatment", mvnTest = x)) |>
    map(\(x) x$multivariateNormality) |>
    map(\(x) bind_rows(x, .id = "treatment")) |>
    map_dfr(\(x) {
      filter(x, Test != "MVN") |> # nolint
        mutate(across(where(is.factor), \(f) as.numeric(as.character(f)))) |>
        select(1:2, statistic = 3, `p value`, Result = last_col()) # nolint
    })
}

all_mvn_tests(load2("wide", "pre", all_of(treatment_load_vars))) |>
  knitr::kable(digits = 4)
Table 4.2: Several different tests of multivariate normality indicate a lack of evidence to support this assumption.
| treatment | Test            | statistic | p value | Result |
|-----------|-----------------|----------:|--------:|--------|
| gs        | Mardia Skewness |  156.2386 |  0.0000 | NO     |
| gs        | Mardia Kurtosis |    4.1306 |  0.0000 | NO     |
| ld        | Mardia Skewness |  122.0012 |  0.0000 | NO     |
| ld        | Mardia Kurtosis |    3.2534 |  0.0011 | NO     |
| ha        | Mardia Skewness |  198.9055 |  0.0000 | NO     |
| ha        | Mardia Kurtosis |    5.8903 |  0.0000 | NO     |
| hd        | Mardia Skewness |   85.1671 |  0.0072 | NO     |
| hd        | Mardia Kurtosis |    0.7268 |  0.4673 | YES    |
| gs        | Henze-Zirkler   |    1.1979 |  0.0000 | NO     |
| ld        | Henze-Zirkler   |    1.1036 |  0.0001 | NO     |
| ha        | Henze-Zirkler   |    1.2792 |  0.0000 | NO     |
| hd        | Henze-Zirkler   |    1.0978 |  0.0001 | NO     |
| gs        | Royston         |   79.7755 |  0.0000 | NO     |
| ld        | Royston         |   65.9480 |  0.0000 | NO     |
| ha        | Royston         |  105.5921 |  0.0000 | NO     |
| hd        | Royston         |   51.6106 |  0.0000 | NO     |
| gs        | Doornik-Hansen  |   45.8165 |  0.0000 | NO     |
| ld        | Doornik-Hansen  |   38.4513 |  0.0001 | NO     |
| ha        | Doornik-Hansen  |   57.3566 |  0.0000 | NO     |
| hd        | Doornik-Hansen  |   67.3211 |  0.0000 | NO     |
| gs        | E-statistic     |    1.9038 |  0.0000 | NO     |
| ld        | E-statistic     |    1.7881 |  0.0000 | NO     |
| ha        | E-statistic     |    2.1409 |  0.0000 | NO     |
| hd        | E-statistic     |    1.6312 |  0.0000 | NO     |

4.3.2 Other distributions

If our data are not normally distributed, then what distribution do they follow? I’m going to assume that we are interested in the distribution of data within groups (treatments).

I attempted to model the distribution of our conditional response data (fuel size class by treatment), but it mostly didn’t work.

Our data are non-negative (they contain zeros), continuous (for the most part), and highly variable in terms of skew and kurtosis. The presence of zeros makes using the Gamma distribution more difficult. One possibility is a hurdle Gamma, or zero-adjusted Gamma.

The following plots show where our data lie in terms of kurtosis and skewness compared to other common distributions. It looks somewhat Gamma-ish?

Code
d <- load2("long", "pre", treatment, all_of(load_vars)) |>
  split(~class) |>
  map(~ split(.x, ~treatment))

walk(d, ~ walk(.x, \(x) fitdistrplus::descdist(x$load, boot = 111)))
(a)–(x) One panel per combination of treatment (gs, ha, hd, ld) and fuel class (onehr, tenhr, hundhr, dufflitter, thoushr, veg).
Figure 4.10: Skewness and kurtosis for fuel classes within treatments.

4.3.3 Zero-adjusted Gamma

I’ll try using the gamlss package, which fits models “where all the parameters of the assumed distribution for the response can be modelled as additive functions of the explanatory variables.”

I’m not really sure how it works, but I know it will allow me to fit a model assuming a Gamma distribution while modeling the zeros separately. First I fit a model, then I get the estimated distribution parameters, then I plot the density curve over a histogram (scaled to density). This looks promising to me.
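As I understand it, the zero-adjusted Gamma (ZAGA) distribution places a point mass $\nu$ at zero and a Gamma density on the positive values:

$$
f(y) =
\begin{cases}
\nu, & y = 0 \\
(1 - \nu)\, f_{\mathrm{Gamma}}(y \mid \mu, \sigma), & y > 0
\end{cases}
$$

where, in the gamlss parameterization, the Gamma component has mean $\mu$ and variance $\sigma^2 \mu^2$.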

Code
# Zero adjusted Gamma distribution fit to histograms
plot_zaga <- function(i, d) {
  m1 <- gamlss::gamlssML(d, family = gamlss.dist::ZAGA())
  hist(d, prob = TRUE, main = i, breaks = 20)
  curve(
    gamlss.dist::dZAGA(x, mu = m1$mu, sigma = m1$sigma, nu = m1$nu), # nolint
    from = min(d),
    to = max(d),
    add = TRUE
  )
}

plot_zaga_class <- function(data, class_name) {
  if (missing(class_name)) {
    class_name <- as.character(substitute(data))
    class_name <- class_name[length(class_name)]
  }
  par(mfrow = c(2, 2))
  iwalk(data, ~ plot_zaga(.y, .x))
  mtext(class_name, cex = 1.6, side = 3, line = -2, outer = TRUE)
}

d2 <- d |> map(~ map(.x, "load"))
iwalk(d2, ~ plot_zaga_class(.x, .y))

4.3.4 Poisson fit of count data

All of the woody debris data can be viewed as counts paired with a mean diameter. The mean diameter implies a diameter distribution, which we actually have for coarse woody debris, but not for FWD.

I wonder if I can model these counts as a Poisson process.

Code
cwd_counts <- cwd |>
  group_by(site, treatment, corner, azi) |>
  summarize(count = sum(count), .groups = "drop")

par(mfrow = c(2, 2))
cwd_counts |>
  group_by(treatment) |>
  group_walk(function(data, group) {
    n <- data$count
    # `group` is a one-row tibble of the grouping values; use the treatment
    # label as the panel title
    hist(n, main = group$treatment, prob = TRUE)
    lines(0:max(n), dpois(0:max(n), mean(n)))
  })

That didn’t look so great, so I also tried a zero-inflated Poisson distribution, but the fitted values of the parameter that controls the probability of extra zeros (referred to here as sigma) were small, so the result was effectively the same as the Poisson fit.
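For reference, the zero-inflated Poisson mixes a point mass at zero (with mixing probability $\sigma$, in the gamlss parameterization) with a Poisson($\mu$):

$$
P(Y = y) =
\begin{cases}
\sigma + (1 - \sigma)\, e^{-\mu}, & y = 0 \\[1ex]
(1 - \sigma)\, \dfrac{\mu^{y} e^{-\mu}}{y!}, & y > 0
\end{cases}
$$

so a fitted $\sigma$ near zero collapses the mixture back to a plain Poisson.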

Code
par(mfrow = c(2, 2))
cwd_counts |>
  group_by(treatment) |>
  nest() |>
  transmute(
    sigma = map_dbl(
      data,
      ~ gamlss::gamlssML(.x$count, family = gamlss.dist::ZIP())$sigma
    )
  )
# A tibble: 4 × 2
# Groups:   treatment [4]
  treatment  sigma
  <chr>      <dbl>
1 gs        0.134 
2 ha        0.0775
3 hd        0.267 
4 ld        0.0568

4.4 Homogeneity of variance

There seem to be some pretty big differences in variance between treatments, likely due to outliers. For linear regression, it is recommended that the maximum variance ratio be below 4.

Code
max_var <- load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(var = sd(load)^2, load = max(load), .groups = "drop_last") |>
  summarize(
    max_var_rat = paste("Max. var. ratio: ", round(max(var) / min(var))),
    x = 1.1,
    y = max(load) * 1.1,
    .groups = "drop"
  )

load2("long", "pre") |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  geom_text(data = max_var, aes(x, y, label = max_var_rat, hjust = "inward")) +
  facet_wrap(~class, scales = "free")

4.5 Zeros

We do have zeros, which matters if we want to employ a GLM family like the Gamma, which is only defined for positive values.

Code
load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(zeros = sum(load == 0), percent = zeros / n(), .groups = "drop") |>
  ggplot(aes(treatment, percent)) +
  geom_col() +
  geom_text(
    aes(label = if_else(zeros == 0, NA, zeros), y = percent / 2),
    color = "gray70",
    na.rm = TRUE
  ) +
  facet_wrap(~class) +
  scale_y_continuous(labels = scales::percent)

4.6 Correlation of response variables

I’m not sure if it’s important, but I was curious whether the various fuel loading classes were correlated with each other, either across the board or within a given treatment.

Code
suppressMessages(GGally::ggpairs(load2("wide", "pre", all_of(load_vars))))
Figure 4.11: Correlation among the response variables (fuel classes).

4.7 Independence

Because of how our data were collected, they are not independent. The current data are summarized at the transect level. At that level, we have two transects at each corner; because of spatial autocorrelation, these may be correlated with each other. Corners (and thus transects) are nested within plots, and plots are within treatments. Each plot received a different treatment. What I’m not clear about is: should I include a random effect for plots if I’m including a fixed effect for treatment?
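For what it’s worth, one way to encode this nesting is with nested random intercepts. The following is a sketch only, not a final model; it assumes the lme4 package and fits a single fuel class at a time. The usual split-plot argument is that a random effect for the experimental unit that received the treatment (here, the site-by-treatment plot) should be kept even with treatment as a fixed effect, so that treatment differences are judged against plot-to-plot variation.

Code
# a sketch only, assuming lme4 is installed; fits one fuel class at a time
library(lme4)
m <- lmer(
  load ~ treatment +
    (1 | site) + # replicate
    (1 | site:treatment) + # plot (the unit that received the treatment)
    (1 | site:treatment:corner), # corner; the residual is then the transect
  data = filter(load2("long", "pre"), class == "onehr")
)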

There are also questions about the level at which to summarize/model the data. I’ve already averaged stations within transects for several variables that were collected at the station level (two within each transect). What are the trade-offs of averaging at the corner level or, alternatively, analyzing our raw station data instead of averaging?

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x