4  Fuel data exploration

Our main questions regarding fuel loading are:

  1. Are there differences in fuel loading between treatments before and after pct?
  2. If so, what fuel loading components differ, and in which treatments before and after pct?
  3. What is the magnitude of the difference (for each fuel class) by treatment before and after pct?

We have a range of response variables that include loading in several different classes of surface fuels. Some of these may be correlated with each other, which could affect our interpretation of differences between treatments. For instance, if duff-litter load is negatively correlated with vegetation load, then even if neither variable differs between treatments on its own, one treatment might still have consistently higher duff-litter at any given level of vegetation loading. I will be analyzing fuel loading categories independently; while not ideal, this simplifies the analysis.

In addition to correlation among classes, there may also be overlap. Primarily, this is relevant to the 1-hr and duff/litter classes. We used a cutoff of about 2 mm to decide which redwood sprays were counted as 1-hr fuels. These same 1-hr fuels were also included in the duff/litter depth because they were usually thoroughly mixed with the litter. For this reason, it is not possible to simply add these two classes.

Similarly, fuel beds after pct were covered with substantial slash, often precluding measurement of duff/litter. Furthermore, as a result of an oversight in the sampling protocol, leaves suspended in slash were not addressed: they were captured as neither litter nor 1-hr fuels, yet they would be expected to contribute significantly to fire behavior after curing. It is also unclear how long the slash leaves will remain suspended (we would expect this to vary by species), and their impact on fire behavior would be expected to change dramatically with their spatial arrangement (bulk density), with reduced fire behavior as they settle to the forest floor. For this reason, comparisons of duff/litter and 1-hr fuels before and after pct would be difficult to interpret.

Finally, because herbaceous vegetation played a relatively minor role in our plots, it was lumped together with woody vegetation.

I opt to restrict quantitative comparisons to those between treatments, analyzed separately before and after pct.

In order to assess potential outliers, co-linearity, interactions, and other problems with statistical tests, we’ll first conduct some data exploration as outlined by Zuur et al. (2010).

For reference, here is a list of the fuel loading (response) variables of interest.

  1. duff/litter load
  2. woody vegetation
  3. one hr fuels
  4. ten hr fuels
  5. hundred hr fuels
  6. coarse woody debris

It’s important to note that variables 1 and 2 were measured twice per transect and the rest were measured once per transect. All analyses were conducted at the transect level: variables with two observations per transect were analyzed as their transect average. This was done to simplify the analysis.
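As a rough sketch, that averaging looks like the following (assuming the tidyverse is attached, as in the rest of this document; `stations` and its station-level columns are placeholder names, not objects used elsewhere here):

Code
# hypothetical illustration: average the two station-level observations up to
# the transect level, identified by site/treatment/corner/azi
stations |>
  group_by(site, treatment, corner, azi) |>
  summarize(
    dufflitter = mean(dufflitter, na.rm = TRUE),
    veg = mean(veg, na.rm = TRUE),
    .groups = "drop"
  )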

4.1 Basic Summary

Code
# function to access mostly raw data, but with combined veg and thoushr fuels.
# and select the output variables (`...`)
load2 <- function(shape = "wide", pct = c("pre", "post", "both"), ...) {
  pct <- match.arg(pct)
  pct_filter <- switch(pct,
    pre = "prepct", post = "postpct", both = "prepct|postpct"
  )
  # pct_filter is a regular expression, so match with grepl() rather than `==`,
  # which would never be TRUE for pct = "both"
  tl <- filter(total_load, grepl(pct_filter, phase))
  load_vars <- c("onehr", "tenhr", "hundhr", "dufflitter", "thoushr", "veg", "veg_diff")
  # TODO: if(pct == "post")
  if (...length() > 0) tl <- select(tl, ...)
  if (shape == "long") {
    tl <- pivot_longer(tl,
      -any_of(c("phase", "site", "treatment", "corner", "azi")),
      names_to = "class",
      values_to = "load"
    )
    load_vars <- load_vars[load_vars %in% tl$class]
    tl <- mutate(tl, class = factor(class, levels = load_vars)) |>
      filter(!is.na(load))
  }
  tl
}

These plots seem to indicate that, pre-pct, 1-hr fuel load is slightly higher for HA and HD compared to GS and LD. Ten years after harvest, we would expect 1-hr fuels to be dominated by twigs and branches falling from live crowns, and there is ostensibly greater crown volume in the HA and HD treatments. Duff and litter may be slightly higher in the HD treatment compared to the others. For live fuels, there appears to be a decreasing trend from GS > LD > HA & HD. Most of the live fuel is due to regenerating sprouts. The difference in sprout growth in high- vs. low-light environments was readily apparent in the field.

Ten-hr, 100-hr, and 1000-hr fuels do not reveal an obvious trend. I’m not sure why this is at the moment.

Code
load2("long", "pre") |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  facet_wrap(~class, scales = "free", labeller = fuel_class_labeller) +
  labs(x = "Treatment", y = expression(Load ~ (Mg %.% ha^-1))) +
  scale_x_discrete(labels = toupper)
Figure 4.1: Boxplots of fuels data summarized for six fuel classes across four overstory harvest treatments, prior to pre-commercial thinning, 10 years after harvest. Woody and herbaceous vegetation have been combined, and sound and rotten 1,000-hr fuels have been combined.
Code
load2("long", "post", -dufflitter) |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  facet_wrap(~class, scales = "free", labeller = fuel_class_labeller) +
  labs(x = "Treatment", y = expression(Load ~ (Mg %.% ha^-1))) +
  scale_x_discrete(labels = toupper)
Figure 4.2: Boxplots of transect-level fuels data summarized for six fuel classes across four overstory harvest treatments, following pre-commercial thinning, 10 years after harvest. Woody and herbaceous vegetation have been combined, and sound and rotten 1,000-hr fuels have been combined. Duff and litter data were not collected for this phase of the experiment. One-hour fuels do not include leaves suspended in the slash. Vegetation difference is the difference between pre- and post-pct vegetation fuels and represents vegetation slash recruited to the forest floor.

Here is a table summary of our raw data.

Code
spf <- "%.1f"
load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(
    avg_load = mean(load),
    sd_load = sd(load),
    .groups = "drop_last" # silence the grouped-output message
  ) |>
  mutate(
    load = paste0(sprintf(spf, avg_load), " (", sprintf(spf, sd_load), ")"),
    .keep = "unused"
  ) |>
  pivot_wider(names_from = treatment, values_from = load) |>
  knitr::kable()
Table 4.1: Average (sd) transect-level load (Mg ha-1) for six fuel classes in four different overstory harvest treatments.
| class      | gs          | ld          | ha           | hd          |
|------------|-------------|-------------|--------------|-------------|
| onehr      | 0.6 (0.6)   | 0.7 (0.5)   | 1.2 (0.8)    | 1.0 (0.6)   |
| tenhr      | 3.7 (2.8)   | 3.4 (2.0)   | 3.2 (3.2)    | 2.9 (1.6)   |
| hundhr     | 11.8 (8.6)  | 9.5 (8.5)   | 9.7 (9.0)    | 9.5 (7.6)   |
| dufflitter | 48.4 (31.1) | 44.9 (18.3) | 40.2 (22.0)  | 55.0 (28.9) |
| thoushr    | 44.9 (57.0) | 35.3 (52.4) | 41.6 (105.6) | 39.5 (45.1) |
| veg        | 38.2 (42.3) | 21.2 (16.0) | 12.9 (14.6)  | 17.3 (16.1) |

4.2 Outliers

Code
cleavland_plot <- function(data, title, load_var = load) {
  data |>
    group_by(site, treatment) |> # nolint
    mutate(replicate_mean_load = mean({{ load_var }}, na.rm = TRUE)) |>
    ggplot(
      aes(
        {{ load_var }},
        # sort the y-axis by the mean load within each replicate
        # (site x treatment), matching the figure captions
        fct_reorder(
          interaction(site, treatment, sep = " "),
          replicate_mean_load,
          .na_rm = TRUE
        ),
        color = treatment
      )
    ) +
    geom_jitter(width = 0, height = 0.2) +
    labs(x = expression(Load ~ (Mg %.% ha^-1)), y = "Data order", title = title)
}
(a) One hour fuels
(b) Ten hour fuels
(c) Hundred hour fuels
Figure 4.3: Data distribution of loading for fine woody debris classes. Data are sorted by mean loading within each replicate. Jitter has been added to aid in visual interpretation.
Figure 4.4: Sum of coarse woody (>7.64 cm, sound and rotten wood combined) fuel loading for transects. The y-axis is sorted by mean CWD loading for each replicate.
Figure 4.5: Combined duff and litter loading at each station along transects. The y-axis is sorted as in Figure 4.4.
(a) Woody vegetation
(b) Herbaceous vegetation
Figure 4.6: Vegetation fuel loading for each station along transects, including live and dead fuels attached to live vegetation. Y-axis is sorted as in Figure 4.4.

4.3 Normality

For further testing, I will summarize the data somewhat by combining vegetation loading (woody and herb) and coarse woody loading (sound and rotten) into single metrics (a sketch of this combination follows the list below). Now we have the following response variables:

  1. dufflitter
  2. onehr
  3. tenhr
  4. hundhr
  5. thoushr
  6. veg
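The combination itself is a simple sum; a minimal sketch, assuming raw class-level columns with the placeholder names used below (none of these names are defined elsewhere in this document):

Code
# hypothetical illustration of the combination described above; `raw_load`,
# `woody_veg`, `herb_veg`, `thoushr_sound`, and `thoushr_rotten` are
# placeholder names
raw_load |>
  mutate(
    veg = woody_veg + herb_veg, # woody + herbaceous vegetation
    thoushr = thoushr_sound + thoushr_rotten # sound + rotten coarse woody
  )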

When using MANOVA to test for differences between groups with multiple response variables, it is important that the response variables are multivariate normally distributed. Unfortunately, it would appear that we have a problem with normality. The raw data for each loading variable are clearly not normally distributed (Figure 4.7).

Code
myqqplot <- function(data, var) {
  data |>
    ggplot(aes(sample = {{ var }})) +
    stat_qq() +
    stat_qq_line() +
    facet_grid(class ~ treatment, scales = "free") +
    labs(
      x = "Theoretical quantiles",
      y = "Sample quantiles",
      title = "Normal Q-Q Plot"
    )
}

load2("long", "pre") |>
  myqqplot(load)
Figure 4.7: Naive qq plot of loading variables. This doesn’t take into account the fact that our data are nested. You could say this is based on a simple model where all observations are independent.
Code
# This code creates a dataframe with nested columns that include the original
# data and calculated coordinates of a superimposed normal distribution.
# TODO: why does this include site corner azi?
bins <- 16
hist_dat <- load2("long", "pre", all_of(treatment_load_vars)) |>
  drop_na() |>
  # Facet grid: each column (class) has the same scale; find limits to
  # calculate bin width. Limits are implied either by the constructed normal
  # curve or the raw data.
  group_by(class) |>
  mutate(
    xmin = min(c(mean(load) - 3 * sd(load), load)),
    xmax = max(c(mean(load) + 3 * sd(load), load))
  ) |>
  group_by(treatment, class) |>
  nest(data = load) |>
  # generate data for a normal curve with mean and sd from observed data to
  # cover 3 sd.
  mutate(
    norm_x = map(data, function(d) {
      seq(
        from = mean(d$load) - 3 * sd(d$load),
        to = mean(d$load) + 3 * sd(d$load),
        length.out = 100
      )
    }),
    # scale curve to expected binwidth based on plot layout (same x scale across
    # all fuel classes) multiplied by the number of observations. The histogram
    # and normal curve should represent the same total area. pmap() supplies
    # each row's own norm_x, xmin, and xmax; referring to those columns from
    # inside map(data, ...) would grab the whole columns instead.
    norm_y = pmap(list(data, norm_x, xmin, xmax), function(d, nx, lo, hi) {
      dens <- dnorm(nx, mean = mean(d$load), sd = sd(d$load))
      dens * ((hi - lo) / bins) * nrow(d)
    })
  )

# here is the code to plot the above data. I decided not to use this after all,
# but it was a lot of work, so I'm keeping it for posterity
# nolint start
# ggplot(filter(hist_dat, class %in% load_vars[1:3] )) +
#   geom_histogram(data = \(x) unnest(x, data), aes(x = load), bins = bins) +
#   geom_line(data = \(x) unnest(x, c(norm_x, norm_y)), aes(norm_x, norm_y)) +
#   facet_grid(treatment ~ class, scales = "free") +
#   labs(y = "count", x = "Load Mg/ha")
#
# ggplot(filter(hist_dat, class %in% load_vars[4:6] )) +
#   geom_histogram(data = \(x) unnest(x, data), aes(x = load), bins = bins) +
#   geom_line(data = \(x) unnest(x, c(norm_x, norm_y)), aes(norm_x, norm_y)) +
#   facet_grid(treatment ~ class, scales = "free") +
#   labs(y = "count", x = "Load Mg/ha")
# nolint end
Code
walk(list(c(1:3), c(4:6)), function(x) {
  p <- load2("long", "pre") |>
    filter(class %in% load_vars[x]) |>
    ggplot(aes(load)) +
    geom_histogram(bins = 16) +
    facet_grid(treatment ~ class, scales = "free") +
    labs(y = "count", x = "Load Mg/ha")
  print(p)
})
(a)
(b)
Figure 4.8: Histograms for the fuel loading response variables.

4.3.1 Box-Cox transformation

I would like to look at the effect of this transformation on the response data.
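For reference, the transformation used below is the two-parameter Box-Cox family, where the shift $\lambda_2$ accommodates zeros:

$$
y^{(\lambda_1, \lambda_2)} =
\begin{cases}
\dfrac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\[1ex]
\log(y + \lambda_2), & \lambda_1 = 0
\end{cases}
$$

This is what the `boxcox()` helper below computes, with both parameters estimated per fuel class by `geoR::boxcoxfit()`.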

Code
boxcox <- function(y, lambda1, lambda2) {
  if (lambda1 == 0) {
    log(y + lambda2)
  } else {
    ((y + lambda2)^lambda1 - 1) / lambda1
  }
}

load2("long", "pre", all_of(treatment_load_vars)) |>
  group_by(class) |>
  nest() |>
  rowwise() |>
  mutate(
    lambda = list(suppressMessages(
      geoR::boxcoxfit(data$load, lambda2 = TRUE)$lambda
    )),
    load_bc = list(boxcox(data$load, lambda[1], lambda[2]))
  ) |>
  unnest(c(data, load_bc)) |>
  ungroup() |>
  myqqplot(load_bc)

While this looks somewhat more normal, the zeros end up being a little strange. I applied a separate transformation for each fuel class, but all treatments within a fuel class received the same transformation.

For MANOVA we are concerned with within-group multivariate normality; this assumption does not appear to be met here either (Figure 4.9). The code output below indicates the rows with the greatest deviation from normality.
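For reference, the Mahalanobis distance of a residual vector $\mathbf{r}_i$ from the centroid $\bar{\mathbf{r}}$ of the residuals, given their covariance matrix $S$, is

$$
D_i^2 = (\mathbf{r}_i - \bar{\mathbf{r}})^\top S^{-1} (\mathbf{r}_i - \bar{\mathbf{r}}),
$$

and under multivariate normality $D_i^2$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the number of response variables, which is what the chi-square Q-Q plots below assess.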

Code
tl <- load2("wide", "pre", all_of(treatment_load_vars))
load_mod <- lm(as.matrix(tl[-1]) ~ treatment, data = tl)

m_dist <- heplots::Mahalanobis(resid(load_mod))

load2("wide", "pre", everything())[order(m_dist, decreasing = TRUE)[1:10], ]
# A tibble: 10 × 12
   phase  site    treatment corner   azi dufflitter onehr  tenhr hundhr thoushr
   <chr>  <chr>   <fct>     <chr>  <dbl>      <dbl> <dbl>  <dbl>  <dbl>   <dbl>
 1 prepct waldos  ha        nw        90       23.2 1.27  10.7    21.2    593. 
 2 prepct waldos  gs        ne       270      131.  1.72   6.46   16.1     36.3
 3 prepct whiskey ha        n        225       27.1 2.76  13.3    45.2     32.1
 4 prepct waldos  gs        se       270       54.1 0.504 11.6    34.8    280. 
 5 prepct whiskey gs        n        225       27.1 0.124  2.09    8.56     0  
 6 prepct waldos  ha        se       270      112.  2.32   0.930   2.54     0  
 7 prepct waldos  ha        se         0       15.5 3.34   6.53   13.7     17.7
 8 prepct waldos  gs        se         0       38.7 1.08   7.31   37.3     17.6
 9 prepct waldon  gs        e        225      112.  0.711  8.98   14.7     89.6
10 prepct whiskey ld        s        315       77.3 0.390  5.38   31.9     59.8
# ℹ 2 more variables: veg <dbl>, veg_diff <dbl>
Code
treatments <- c("gs", "ha", "ld", "hd")
par(mfrow = c(2, 2))
invisible(lapply(treatments, \(x) {
  heplots::cqplot(filter(tl, treatment == x)[-1], main = x)
}))
(a)
Figure 4.9: Plot (a) assesses the multivariate normality of residuals from the model in which all loading variables are a function of the treatment group.

A number of different tests of multivariate normality also confirm the lack of evidence for meeting this assumption (Table 4.2).

Code
all_mvn_tests <- function(data) {
  c("mardia", "hz", "royston", "dh", "energy") |>
    map(\(x) MVN::mvn(data = data, subset = "treatment", mvnTest = x)) |>
    map(\(x) x$multivariateNormality) |>
    map(\(x) bind_rows(x, .id = "treatment")) |>
    map_dfr(\(x) {
      filter(x, Test != "MVN") |> # nolint
        mutate(across(where(is.factor), \(f) as.numeric(as.character(f)))) |>
        select(1:2, statistic = 3, `p value`, Result = last_col()) # nolint
    })
}

all_mvn_tests(load2("wide", "pre", all_of(treatment_load_vars))) |>
  knitr::kable(digits = 4)
Table 4.2: Several different tests of multivariate normality indicate a lack of evidence to support this assumption.
| treatment | Test            | statistic | p value | Result |
|-----------|-----------------|----------:|--------:|--------|
| gs        | Mardia Skewness |  156.2386 |  0.0000 | NO     |
| gs        | Mardia Kurtosis |    4.1306 |  0.0000 | NO     |
| ld        | Mardia Skewness |  122.0012 |  0.0000 | NO     |
| ld        | Mardia Kurtosis |    3.2534 |  0.0011 | NO     |
| ha        | Mardia Skewness |  198.9055 |  0.0000 | NO     |
| ha        | Mardia Kurtosis |    5.8903 |  0.0000 | NO     |
| hd        | Mardia Skewness |   85.1671 |  0.0072 | NO     |
| hd        | Mardia Kurtosis |    0.7268 |  0.4673 | YES    |
| gs        | Henze-Zirkler   |    1.1979 |  0.0000 | NO     |
| ld        | Henze-Zirkler   |    1.1036 |  0.0001 | NO     |
| ha        | Henze-Zirkler   |    1.2792 |  0.0000 | NO     |
| hd        | Henze-Zirkler   |    1.0978 |  0.0001 | NO     |
| gs        | Royston         |   79.7755 |  0.0000 | NO     |
| ld        | Royston         |   65.9480 |  0.0000 | NO     |
| ha        | Royston         |  105.5921 |  0.0000 | NO     |
| hd        | Royston         |   51.6106 |  0.0000 | NO     |
| gs        | Doornik-Hansen  |   45.8165 |  0.0000 | NO     |
| ld        | Doornik-Hansen  |   38.4513 |  0.0001 | NO     |
| ha        | Doornik-Hansen  |   57.3566 |  0.0000 | NO     |
| hd        | Doornik-Hansen  |   67.3211 |  0.0000 | NO     |
| gs        | E-statistic     |    1.9038 |  0.0000 | NO     |
| ld        | E-statistic     |    1.7881 |  0.0000 | NO     |
| ha        | E-statistic     |    2.1409 |  0.0000 | NO     |
| hd        | E-statistic     |    1.6312 |  0.0000 | NO     |

4.3.2 Other distributions

If our data are not normally distributed, then what distribution do they follow? I’m going to assume that we are interested in the distribution of data within groups (treatments).

I attempted to model the distribution of our conditional response data (fuel size class by treatment), but it mostly didn’t work.

Our data are non-negative (they contain zeros), continuous (for the most part), and highly variable in terms of skew and kurtosis. The presence of zeros makes using the Gamma distribution more difficult. One possibility is a hurdle Gamma, or zero-adjusted Gamma.

The following plots show where our data lie in terms of kurtosis and skewness compared to other common distributions. It looks somewhat Gamma-ish?

Code
d <- load2("long", "pre", treatment, all_of(load_vars)) |>
  split(~class) |>
  map(~ split(.x, ~treatment))

walk(d, ~ walk(.x, \(x) fitdistrplus::descdist(x$load, boot = 111)))
(a)–(x) One panel per combination of treatment (gs, ha, hd, ld) and fuel class (onehr, tenhr, hundhr, dufflitter, thoushr, veg).
Figure 4.10: Skewness and kurtosis for fuel classes within treatments.

4.3.3 Zero-adjusted Gamma

I’ll try using the gamlss package, which fits models “where all the parameters of the assumed distribution for the response can be modelled as additive functions of the explanatory variables.”

I’m not really sure how it works, but I know it will allow me to fit a model assuming a Gamma distribution while modeling the zeros separately. First I fit a model, then I get the estimated distribution parameters, then I plot the density curve over a histogram (scaled to density). This looks promising to me.
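As I understand it, the zero-adjusted Gamma (ZAGA) distribution places a point mass $\nu$ at zero and a Gamma density on the positive values:

$$
f(y) =
\begin{cases}
\nu, & y = 0 \\
(1 - \nu)\, f_{\mathrm{Gamma}}(y \mid \mu, \sigma), & y > 0
\end{cases}
$$

where, in the gamlss parameterization, the Gamma component has mean $\mu$ and variance $\sigma^2 \mu^2$.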

Code
# Zero adjusted Gamma distribution fit to histograms
plot_zaga <- function(i, d) {
  m1 <- gamlss::gamlssML(d, family = gamlss.dist::ZAGA())
  hist(d, prob = TRUE, main = i, breaks = 20)
  curve(
    gamlss.dist::dZAGA(x, mu = m1$mu, sigma = m1$sigma, nu = m1$nu), # nolint
    from = min(d),
    to = max(d),
    add = TRUE
  )
}

plot_zaga_class <- function(data, class_name) {
  if (missing(class_name)) {
    class_name <- as.character(substitute(data))
    class_name <- class_name[length(class_name)]
  }
  par(mfrow = c(2, 2))
  iwalk(data, ~ plot_zaga(.y, .x))
  mtext(class_name, cex = 1.6, side = 3, line = -2, outer = TRUE)
}

d2 <- d |> map(~ map(.x, "load"))
iwalk(d2, ~ plot_zaga_class(.x, .y))

4.3.4 Poisson fit of count data

All of the woody debris data can be viewed as counts paired with a mean diameter. The mean diameter implies a diameter distribution, which we actually have for coarse woody debris, but not for FWD.

I wonder if I can model these counts as a Poisson process.

Code
cwd_counts <- cwd |>
  group_by(site, treatment, corner, azi) |>
  summarize(count = sum(count), .groups = "drop")

par(mfrow = c(2, 2))
cwd_counts |>
  group_by(treatment) |>
  group_walk(function(data, group) {
    n <- data$count
    # `group` is a one-row tibble of the grouping values; use the treatment
    # label as the panel title
    hist(n, main = group$treatment, prob = TRUE)
    lines(0:max(n), dpois(0:max(n), mean(n)))
  })

That didn’t look so great, so I also tried a zero-inflated Poisson distribution, but the fitted values of the parameter that controls the probability of extra zeros (referred to here as sigma) were small, so the result was effectively the same as the Poisson fit.
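For reference, the zero-inflated Poisson mixes a point mass at zero (with mixing probability $\sigma$, in the gamlss parameterization) with a Poisson($\mu$):

$$
P(Y = y) =
\begin{cases}
\sigma + (1 - \sigma)\, e^{-\mu}, & y = 0 \\[1ex]
(1 - \sigma)\, \dfrac{\mu^{y} e^{-\mu}}{y!}, & y > 0
\end{cases}
$$

so a fitted $\sigma$ near zero collapses the mixture back to a plain Poisson.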

Code
par(mfrow = c(2, 2))
cwd_counts |>
  group_by(treatment) |>
  nest() |>
  transmute(
    sigma = map_dbl(
      data,
      ~ gamlss::gamlssML(.x$count, family = gamlss.dist::ZIP())$sigma
    )
  )
# A tibble: 4 × 2
# Groups:   treatment [4]
  treatment  sigma
  <chr>      <dbl>
1 gs        0.134 
2 ha        0.0775
3 hd        0.267 
4 ld        0.0568

4.4 Homogeneity of variance

There seem to be some pretty big differences in variance between treatments, likely due to outliers. For linear regression, it is recommended that the maximum variance ratio be below 4.

Code
max_var <- load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(var = sd(load)^2, load = max(load), .groups = "drop_last") |>
  summarize(
    max_var_rat = paste("Max. var. ratio: ", round(max(var) / min(var))),
    x = 1.1,
    y = max(load) * 1.1,
    .groups = "drop"
  )

load2("long", "pre") |>
  ggplot(aes(treatment, load)) +
  geom_boxplot() +
  geom_text(data = max_var, aes(x, y, label = max_var_rat, hjust = "inward")) +
  facet_wrap(~class, scales = "free")

4.5 Zeros

We do have zeros, which matters if we want to employ a GLM family like the Gamma, which is only defined for positive values.

Code
load2("long", "pre") |>
  group_by(class, treatment) |>
  summarize(zeros = sum(load == 0), percent = zeros / n(), .groups = "drop") |>
  ggplot(aes(treatment, percent)) +
  geom_col() +
  geom_text(
    aes(label = if_else(zeros == 0, NA, zeros), y = percent / 2),
    color = "gray70",
    na.rm = TRUE
  ) +
  facet_wrap(~class) +
  scale_y_continuous(labels = scales::percent)

4.6 Correlation of response variables

I’m not sure if it’s important, but I was curious whether the various fuel loading classes were correlated with each other, either across the board or within a given treatment.

Code
suppressMessages(GGally::ggpairs(load2("wide", "pre", all_of(load_vars))))
Figure 4.11: Correlation among the response variables (fuel classes).

4.7 Independence

Because of how our data were collected, they are not independent. The current data are summarized at the transect level. At that level, we have two transects at each corner; because of spatial autocorrelation, these may be correlated with each other. Corners (and thus transects) are nested within plots, and plots are within treatments. Each plot received a different treatment. What I’m not clear about is: should I include a random effect for plots if I’m including a fixed effect for treatment?
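For what it’s worth, one way to encode this nesting is with nested random intercepts. The following is a sketch only, not a final model; it assumes the lme4 package and fits a single fuel class at a time. The usual split-plot argument is that a random effect for the experimental unit that received the treatment (here, the site-by-treatment plot) should be kept even with treatment as a fixed effect, so that treatment differences are judged against plot-to-plot variation.

Code
# a sketch only, assuming lme4 is installed; fits one fuel class at a time
library(lme4)
m <- lmer(
  load ~ treatment +
    (1 | site) + # replicate
    (1 | site:treatment) + # plot (the unit that received the treatment)
    (1 | site:treatment:corner), # corner; the residual is then the transect
  data = filter(load2("long", "pre"), class == "onehr")
)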

There are also questions about the level at which to summarize/model the data. I’ve already averaged stations within transects for several variables that were collected at the station level (two within each transect). What are the trade-offs of averaging at the corner level or, alternatively, analyzing our raw station data instead of averaging?

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x