
R in Ecotoxicology and Environmental Sciences

Collinearity

It has been quiet here over the last months - the main reason is that I’m working on my master’s thesis. I have already prepared some more examples from ‘Quantitative Ecotoxicology’, but haven’t gotten around to publishing them yet.

This post is about collinearity and its implications for linear models. The best way to explore this is by simulation - creating data with known properties and looking at what happens.

Create correlated random variables

The first problem for this simulation was: How can we create correlated random variables?

This simulation is similar to the one in Legendre & Legendre (2012), where it is also mentioned that one can use the Cholesky decomposition to create correlated variables:

What this function does:

  • Create two normal random variables (X1, X2 ~ N(0, 1))
  • Create the desired correlation matrix
  • Compute the Cholesky factorization of the correlation matrix R
  • Apply the resulting matrix to the two variables (this rotates, shears and scales the variables so that they have the desired correlation)
  • Create a dependent variable y following the model:

y ~ 5 + 7*X1 + 7*X2 + e with e ~ N(0, 1)

#############################################################################
# create two correlated variables and a dependent variable
# n : number of points
# p : correlation between the two variables

datagen <- function(n, p) {
  # random points N(0,1)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  X <- cbind(x1, x2)

  # desired correlation matrix
  R <- matrix(c(1, p, p, 1), ncol = 2)

  # use cholesky decomposition `t(U) %*% U = R`
  U <- chol(R)
  corvars <- X %*% U

  # create dependent variable after model:
  # y ~ 5 + 7 * X1 + 7 * X2 + e | e ~ N(0, 1)
  y <- 5 + 7*corvars[,1] + 7*corvars[,2] + rnorm(n, 0, 1)
  df <- data.frame(y, corvars)
  return(df)
}

Let’s see if it works. This creates two variables with 1000 observations each, a correlation of 0.8 between them, and a dependent variable.

df1 <- datagen(n = 1000, p = 0.8)

The correlation between X1 and X2 is, as desired, close to 0.8:

cor(df1)
##          y      X1      X2
## y  1.00000 0.94043 0.94715
## X1 0.94043 1.00000 0.79139
## X2 0.94715 0.79139 1.00000

And the data follows the specified model.

mod <- lm(y ~ X1 + X2, data = df1)
summary(mod)
## 
## Call:
## lm(formula = y ~ X1 + X2, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7375 -0.6712  0.0561  0.6225  2.9057 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0210     0.0311     161   <2e-16 ***
## X1            6.9209     0.0517     134   <2e-16 ***
## X2            7.0813     0.0498     142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.984 on 997 degrees of freedom
## Multiple R-squared:  0.995,	Adjusted R-squared:  0.995 
## F-statistic: 9.14e+04 on 2 and 997 DF,  p-value: <2e-16
pairs(df1)

plot of chunk example_datagen3

Methods to spot collinearity

Dormann et al. (2013) list eight methods to spot collinearity (see their Table 1). I will only show how to calculate two of them (but see the appendix of the Dormann paper for code for all eight methods):

Absolute value of correlation coefficients r

(cormat <- cor(df1[, 2:3]))
##         X1      X2
## X1 1.00000 0.79139
## X2 0.79139 1.00000

Dormann et al. (2013) found that ‘coefficients between predictor variables of r > 0.7 was an appropriate indicator for when collinearity begins to severely distort model estimation’.
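
As a quick check against this rule of thumb, one can flag predictor pairs that exceed the threshold (a minimal sketch, using the cormat object from above):

# flag predictor pairs with |r| > 0.7
which(abs(cormat) > 0.7 & upper.tri(cormat), arr.ind = TRUE)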

Variance Inflation Factors

require(car)
vif(mod)
##     X1     X2 
## 2.6759 2.6759

Which is equivalent to (for variable X1):

sum <- summary(lm(X1 ~ X2, data = df1))
1/(1 - sum$r.squared)
## [1] 2.6759
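
In general, the variance inflation factor for predictor $j$ is $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination obtained from regressing predictor $j$ on all other predictors.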

Unfortunately there are many ‘rules of thumb’ associated with VIF: > 10, > 4, …

Simulation 1: How does collinearity affect the precision of estimates?

Here we simulate datasets with correlations ranging from -0.8 to 0.8:

ps <- seq(-0.8, 0.8, 0.005)
n = 1000

sim1 <- lapply(ps, function(x) datagen(n = n, p = x))

Next we fit a linear model to each of the datasets and extract the coefficients table:

# Function to fit and extract coefficients
regfun <- function(x) {
    return(summary(lm(y ~ X1 + X2, data = x))$coefficients)
}
# run regfun on every dataset
res1 <- lapply(sim1, regfun)

Finally we extract the Standard Errors and plot them against the degree of correlation:

# extract Standard Errors from results
ses <- data.frame(ps, t(sapply(res1, function(x) x[, 2])))

# plot
require(ggplot2)
require(reshape2)
ses_m <- melt(ses, id.vars = "ps")
ggplot(ses_m, aes(x = ps, y = value)) + geom_point(size = 3) + facet_wrap(~variable) +
    theme_bw() + ylab("Std. Error") + xlab("Correlation")

plot of chunk plot1

It can be clearly seen that collinearity inflates the standard errors of the correlated variables, while the intercept is not affected.
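
This matches theory: with two predictors the standard errors should be inflated by roughly $\sqrt{VIF} = \sqrt{1 / (1 - r^2)}$ compared to the uncorrelated case. A minimal sketch to check this against the simulation results (using the ses data.frame from above):

# theoretical inflation factor at r = 0.8
sqrt(1 / (1 - 0.8^2))
# observed inflation: SE of X1 at r = 0.8 relative to r = 0
ses$X1[which.min(abs(ses$ps - 0.8))] / ses$X1[which.min(abs(ses$ps - 0))]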

With large standard errors the parameter estimates also become unstable, which is demonstrated in the next simulation:

Simulation 2: Are estimated parameters stable under collinearity?

Here I simulate 100 datasets for each level of correlation, in order to see how stable the estimates are.

The code creates the datasets, fits a linear model to each dataset and extracts the estimates.

n_sims <- 100
ps <- seq(-0.8, 0.8, 0.05)

# function to generate n_sims datasets for a given p
sim2 <- function(p) {
    sim <- lapply(1:n_sims, function(x) datagen(n = 1000, p = p))
    res <- lapply(sim, regfun)
    est <- t(sapply(res, function(x) x[, 1]))
    out <- data.frame(p, est)
    return(out)
}

res2 <- lapply(ps, function(x) {
    return(sim2(x))
})

Then we create a plot of the estimates for the three coefficients; each boxplot represents 100 datasets.

ests <- lapply(res2, melt, id.vars = "p")
ests <- do.call(rbind, ests)

ggplot(ests, aes(y = value, x = factor(p))) + geom_boxplot() + facet_wrap(~variable,
    scales = "free_y") + theme_bw() + geom_hline(aes(yintercept = 7), data = subset(ests,
    variable %in% c("X1", "X2")), col = "red", lwd = 1) + ylab("Estimate") +
    xlab("Correlation")

plot of chunk plot2

The red line indicates the true coefficients used to generate the data (7 for both X1 and X2). We see that the spread of the estimates increases as the correlation increases.

This is confirmed by looking at the standard deviation of the estimates:

sds <- data.frame(ps, t(sapply(res2, function(x) apply(x[, 2:4], 2, sd))))
sds_m <- melt(sds, id.vars = "ps")
ggplot(sds_m, aes(x = ps, y = value)) + geom_point(size = 3) + facet_wrap(~variable) +
    theme_bw() + ylab("SD of estimates from 100 simulations") + xlab("Correlation")

plot of chunk plot3

If the standard errors are large enough, the parameter estimates can become so variable that even their sign changes.
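
To actually see a sign flip we need a much smaller sample and an even stronger correlation than simulated above. The following sketch is purely illustrative (n = 25 and p = 0.999 are arbitrary choices of mine, not values from the simulations above):

# count negative estimates of the X1 coefficient (true value = 7)
# in 100 small, highly collinear datasets
set.seed(123)
flips <- replicate(100,
    coef(lm(y ~ X1 + X2, data = datagen(n = 25, p = 0.999)))["X1"] < 0)
sum(flips)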

References

  • Dormann, C., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J., Gruber, B., Lafourcade, B., Leitão, P., Münkemüller, T., McClean, C., Osborne, P., Reineking, B., Schröder, B., Skidmore, A., Zurell, D., Lautenbach, S. (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1): 27-46. doi:10.1111/j.1600-0587.2012.07348.x
  • Legendre, P., Legendre, L. (2012) Numerical Ecology. Elsevier.

Dispersion-based Weighting - a New Function in Vegan

Intro

When analysing multivariate community data, the first step is often to transform the abundance data in order to down-weight the contribution of highly abundant taxa to the dissimilarities. There are a lot of transformations available, for example:

  • log(x + 1), but see O’Hara & Kotze (2010)
  • ln(2x + 1), which is used often with mesocosm data
  • $x^{0.5}$
  • $x^{0.25}$
  • Presence/Absence
  • and many more

However, these are all global transformations (applied to every species in the dataset); a short sketch of how they look in R follows below.
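
A minimal sketch (using vegan’s dune data purely for illustration):

# a few global transformations applied to a community matrix
library(vegan)
data(dune)
d_log   <- log1p(dune)                     # log(x + 1)
d_ln2   <- log(2 * dune + 1)               # ln(2x + 1)
d_root4 <- dune^0.25                       # fourth root
d_pa    <- decostand(dune, method = "pa")  # presence/absence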

In an email discussion, Bob Clarke pointed me to one of his papers, Clarke et al. (2006):

Clarke, K. R., M. G. Chapman, P. J. Somerfield, and H. R. Needham. 2006. “Dispersion-based Weighting of Species Counts in Assemblage Analyses.” Marine Ecology Progress Series 320: 11–27. (link)

I somehow missed this paper, and it looks like the method hadn’t been implemented in R yet. So I wrote it up for myself, and the function was merged into vegan this week (see the Changelog).

Get it

Currently the function is only available in the development version of vegan. You can easily install it from GitHub using the devtools package. You can also take a look at the implementation here.

require(devtools)
install_github("vegan", "jarioksa")
require(vegan)

The new function is named dispweight and takes three arguments:

dispweight(comm, group, nperm = 1000)

The community data matrix, a vector describing the group structure and the number of permutations to use for the permutation test.

It returns a list of four elements (see the sketch below):

  • D: The dispersion index
  • p: p-value of the permutation test
  • weights: The weights applied to species
  • transformed: The transformed abundance matrix
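
A minimal sketch of how one might inspect these elements per species (this assumes the dpw object computed in the ‘Use it’ section below):

# dispersion index, permutation p-value and resulting weight for each species
head(data.frame(D = dpw$D, p = dpw$p, weight = dpw$weights))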

Use it

Here I take the dune dataset and use Management as the grouping variable.

First we take a look at the data via NMDS:

require(vegan)
data(dune)
data(dune.env)

# NMDS on fourth-root transformed abundances
un_nmds <- metaMDS(dune^0.25, distance = "bray")

# plot it
cols = rainbow(4)
plot(un_nmds, display = "sites", type = "n")
points(un_nmds, col = cols[dune.env$Management], pch = 16, cex = 1.5)
ordihull(un_nmds, groups = dune.env$Management, lty = "dotted")
ordispider(un_nmds, groups = dune.env$Management, label = TRUE)

plot of chunk plot_raw

Let’s run the NMDS on dispersion-weighted abundances:

# calculate weights
dpw <- dispweight(dune, dune.env$Management, nperm = 100)

# NMDS on dispersion weighted community data
disp_nmds <- metaMDS(dpw$transformed, distance = "bray")

# plot
plot(disp_nmds, display = "sites", type = "n")
points(disp_nmds, col = cols[dune.env$Management], pch = 16, cex = 1.5)
ordihull(disp_nmds, groups = dune.env$Management, lty = "dotted")
ordispider(disp_nmds, groups = dune.env$Management, label = TRUE)

plot of chunk plot_disp

In this example there is not a big difference, but as they write in their paper:

[…] we should not expect a dispersion-weighting approach to radically alter ordination and test results. Indeed, the improvements shown in the real examples of this paper are sometimes small and subtle […].

I haven’t tested the function very much, so let me know if you find any bugs! Note that the code isn’t optimised for speed and largely follows the paper.

Refs

  • Clarke, K. R., Chapman, M. G., Somerfield, P. J., Needham, H. R. (2006) Dispersion-based weighting of species counts in assemblage analyses. Marine Ecology Progress Series 320: 11-27. doi:10.3354/meps320011
  • O’Hara, R. B., Kotze, D. J. (2010) Do not log-transform count data. Methods in Ecology and Evolution 1: 118-122. doi:10.1111/j.2041-210X.2010.00021.x

Quantitative Ecotoxicology, Page 180, Example 4.10, Time-to-death

This is example 4.10 on page 180 of Quantitative Ecotoxicology. A follow-up to example 4.9.

Data

We use the data from the previous example and again add the status flag for censored observations:

# download data from github
require(RCurl)
url <- getURL("https://raw.github.com/EDiLD/r-ed/master/quantitative_ecotoxicology/data/TOXICITY.csv",
    ssl.verifypeer = FALSE)
TOXICITY <- read.table(text = url, header = TRUE)
# add status
TOXICITY$FLAG <- ifelse(TOXICITY$TTD > 96, 1, 2)

Compare models

We can fit parametric survival time models using the survreg() function, where the dist argument specifies the assumed distribution. The Gamma distribution is not supported by survreg (but see below).

require(survival)
# fit different models
mod_exp <- survreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY, dist = "exponential")
mod_wei <- survreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY, dist = "weibull")
mod_lnorm <- survreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY, dist = "lognormal")
mod_llog <- survreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY, dist = "loglogistic")

From the models we can access the log-likelihood via object$loglik[2]. Note that survreg returns two log-likelihoods, one for the intercept-only model and one for the model with covariates. We are interested in the second one.

# extract loglik and save into data.frame
df <- data.frame(model = c("exp", "weibull", "lnorm", "loglog"), logLik = c(mod_exp$loglik[2],
    mod_wei$loglik[2], mod_lnorm$loglik[2], mod_llog$loglik[2]))

and extract the AIC-values (extractAIC) for model comparisons:

# add AIC as column to data.frame
df$AIC <- c(extractAIC(mod_exp)[2], extractAIC(mod_wei)[2], extractAIC(mod_lnorm)[2],
    extractAIC(mod_llog)[2])
df
##     model  logLik    AIC
## 1     exp -1309.7 2625.4
## 2 weibull -1114.3 2236.7
## 3   lnorm -1121.8 2251.5
## 4  loglog -1118.1 2244.2

As with SAS, the Weibull model has the lowest AIC. Note that the log-likelihood values differ from SAS by a constant of ~920; therefore the AIC values also deviate by a constant of ~1840 (= 2 * 920).
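
As a quick sanity check, the AIC values can also be computed by hand from the log-likelihoods as $AIC = -2 \log L + 2k$, where k counts the regression coefficients plus, for the Weibull model, the scale parameter:

# reproduce the AIC of the weibull model (3 coefficients + 1 scale parameter)
-2 * mod_wei$loglik[2] + 2 * (length(coef(mod_wei)) + 1)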

Inspect model

From the summary we get the estimates and their standard errors, as well as a Wald test for the coefficients.

summary(mod_wei)
## 
## Call:
## survreg(formula = Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY, 
##     dist = "weibull")
##              Value Std. Error      z         p
## (Intercept)  7.849    0.08449  92.89  0.00e+00
## PPT         -0.295    0.00512 -57.54  0.00e+00
## WETWT        1.066    0.25604   4.16  3.15e-05
## Log(scale)  -1.189    0.04455 -26.69 6.09e-157
## 
## Scale= 0.305 
## 
## Weibull distribution
## Loglik(model)= -1114.3   Loglik(intercept only)= -1606.3
## 	Chisq= 983.9 on 2 degrees of freedom, p= 0 
## Number of Newton-Raphson Iterations: 9 
## n=480 (70 observations deleted due to missingness)

Note that the estimate of $0.0602 \pm 0.2566$ for WETWT in the book must be a typo: with such a standard error the estimate would correspond to $z \approx 0.23$ and be far from statistically significant, so I think the R result is correct here.

We see that the concentration has a negative effect on time to death and that bigger fish survive longer.

Plot model

From our model we can predict the median time to death for any salt concentration and fish wet weight.

To produce Figure 4.12 we first generate some new data for which we want to predict the median time to death. I use expand.grid here, which gives every combination of wet weight (from 0 to 1.5 g) and four salt concentrations.

Then I use the predict function to predict the median time to death from our model for all of these combinations (type = 'quantile', p = 0.5 specifies the median).

Afterwards this is plotted using the ggplot2 plotting system.

# newdata, all combinations of WETWT and PPT
newtox <- expand.grid(WETWT = seq(0, 1.5, length.out = 100), PPT = seq(12.5,
    20, by = 2.5))
# predict ttd for newdata
newtox$preds <- predict(mod_wei, newdata = newtox, type = "quantile", p = 0.5)

# plot
require(ggplot2)
ggplot(newtox, aes(x = WETWT, y = preds, col = factor(PPT), group = PPT)) +
    geom_line() + theme_bw() + labs(x = "Wet Weight (g)", y = "Median-Time-To-Death (h)",
    col = "Concentration (g/L)") + coord_cartesian(ylim = c(0, 100))

plot of chunk plot_predicts

Notes

We can also combine the data and the model in one plot. For demonstration, I fit a model with concentration only. We plot the Kaplan-Meier curves as shown in the previous example and then add the model curves using the predict function:

# plot KM
km <- survfit(Surv(TTD, FLAG) ~ PPT, data = TOXICITY)
plot(km, col = 1:7)

# Fit model only for Concentrations
mod_ppt <- survreg(Surv(TTD, FLAG) ~ PPT, data = TOXICITY, dist = "weibull")

# add model to plot
PPT_u <- sort(unique(TOXICITY$PPT))
for (i in seq_along(PPT_u)) {
    lines(predict(mod_ppt, newdata = list(PPT = PPT_u[i]), type = "quantile",
        p = 1:99/100), 99:1/100, col = i)
}

plot of chunk plot_model

To produce a plot similar to Figure 4.11 we can use the following code:

# opposite of %in% : 'not in'
`%!in%` <- Negate(`%in%`)

# remove 0 and 20.1 treatments
TOXICITY_sub <- TOXICITY[TOXICITY$PPT %!in% c(0, 20.1), , drop = TRUE]
# kaplan-meier estimates
km <- survfit(Surv(TTD, FLAG) ~ PPT, data = TOXICITY_sub)
fac <- factor(rep(sort(unique(TOXICITY_sub$PPT)), km$strata))
cols <- 1:5
# plot tranformed time vs. survival
plot(log(km$time), log(-log(km$surv)), col = cols[fac], pch = 16)
# add legend
legend("bottomright", legend = levels(fac), col = cols, pch = 16, cex = 0.7)

plot of chunk unnamed-chunk-7

To fit a gamma distribution we have to use another package: flexsurv. Its usage is (intentionally) similar to survival. However, the log-logistic distribution is not built in, but this can be fixed easily, see ?flexsurvreg:

#### Define loglogistic distribution for flexsurvreg
library(eha)
custom.llogis <- list(name = "llogis", pars = c("shape", "scale"), location = "scale",
    transforms = c(log, log), inv.transforms = c(exp, exp), inits = function(t) {
        c(1, median(t))
    })
require(flexsurv)
# fit all models using flexsurvreg
mod_flex_exp <- flexsurvreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY,
    dist = "exp")
mod_flex_wei <- flexsurvreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY,
    dist = "weibull")
mod_flex_lnorm <- flexsurvreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY,
    dist = "lnorm")
mod_flex_llog <- flexsurvreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY,
    dist = custom.llogis)
mod_flex_ggamma <- flexsurvreg(Surv(TTD, FLAG) ~ PPT + WETWT, data = TOXICITY,
    dist = "gengamma")

Comparing the results from flexsurvreg with survreg, we see that the estimates are identical for all models. Therefore I conclude that flexsurv is a viable alternative when a gamma distribution is needed.
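
A minimal way to put the Weibull fits side by side (a sketch; the two functions use different parameterisations of the baseline distribution, but the covariate effects should agree):

# coefficient tables of the two weibull fits
summary(mod_wei)$table
mod_flex_wei$res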

Code and data are available on my github-repo under file name ‘p180’.

Quantitative Ecotoxicology, Page 178, Example 4.9, Time-to-death

This is example 4.9 on page 178 of Quantitative Ecotoxicology - time-to-death data.

Thankfully, Prof. Newman provided me with the data for this example. You can get it from the github-repo (TOXICITY.csv).

require(RCurl)
url <- getURL("https://raw.github.com/EDiLD/r-ed/master/quantitative_ecotoxicology/data/TOXICITY.csv",
    ssl.verifypeer = FALSE)
TOXICITY <- read.table(text = url, header = TRUE)
head(TOXICITY)
##   TTD TANK  PPT WETWT STDLGTH
## 1   8    1 15.8 0.112     1.9
## 2   8    1 15.8 0.050     1.5
## 3   8    1 15.8 0.029     1.2
## 4   8    1 15.8 0.045     1.4
## 5   8    2 15.8 0.097     1.8
## 6   8    2 15.8 0.048     1.4
summary(TOXICITY)
##       TTD            TANK           PPT           WETWT      
##  Min.   : 8.0   Min.   : 1.0   Min.   : 0.0   Min.   :0.024  
##  1st Qu.:24.0   1st Qu.: 4.0   1st Qu.:10.3   1st Qu.:0.068  
##  Median :72.0   Median : 7.0   Median :11.6   Median :0.092  
##  Mean   :63.3   Mean   : 7.5   Mean   :11.7   Mean   :0.135  
##  3rd Qu.:97.0   3rd Qu.:11.0   3rd Qu.:15.8   3rd Qu.:0.129  
##  Max.   :97.0   Max.   :14.0   Max.   :20.1   Max.   :1.489  
##                                               NA's   :70     
##     STDLGTH    
##  Min.   :1.20  
##  1st Qu.:1.60  
##  Median :1.70  
##  Mean   :1.87  
##  3rd Qu.:2.00  
##  Max.   :4.90  
##  NA's   :70

The data consists of 5 columns:

  • TTD : Time to death
  • TANK : Tank
  • PPT : NaCl Concentration
  • WETWT : wet weight
  • STDLGTH : Standard length

Columns 4 and 5 have 70 NAs (no data available due to measurement error), but we won’t use these in this example. The observations with TTD = 97 are ‘survivors’, since the experiment ran for only 96 hours.

First we need to create a column FLAG for the status of the animal (dead/alive):

TOXICITY$FLAG <- ifelse(TOXICITY$TTD > 96, 1, 2)

So 1 denotes alive (censored) and 2 denotes dead.
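
A quick tabulation of the new column shows how many observations are censored and how many died during the experiment:

table(TOXICITY$FLAG)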

Then we can plot the data. Each line is a tank and colors denote the NaCl concentrations.

require(survival)
mod <- survfit(Surv(TTD, FLAG) ~ PPT + strata(TANK), data = TOXICITY)
plot(mod, col = rep(1:7, each = 2), mark.time = FALSE, xlab = "Hours", ylab = "% Survival")
legend("bottomleft", legend = sort(unique(TOXICITY$PPT)), col = 1:7, lty = 1)

plot of chunk plot_surv

We see a clear relationship between concentration and the survival curves. In this example we are interested in differences between the duplicates. We see that the two curves for the 11.6 g/L concentration are quite similar, while there is more divergence between tanks in the 13.2 g/L treatment.

We can test for differences using the survdiff function. With the rho argument we can specify the type of test: rho = 0 is a log-rank test and rho = 1 is equivalent to the Peto & Peto modification of the Gehan-Wilcoxon test.

First the log-rank test for each concentration:

survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 10.3, ], rho = 0)
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     10.3, ], rho = 0)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=9  40        6     9.02      1.01      2.22
## TANK=10 38       11     7.98      1.14      2.22
## 
##  Chisq= 2.2  on 1 degrees of freedom, p= 0.136
survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 10.8, ], rho = 0)
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     10.8, ], rho = 0)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=7 41       10     11.6     0.226     0.503
## TANK=8 39       12     10.4     0.253     0.503
## 
##  Chisq= 0.5  on 1 degrees of freedom, p= 0.478
survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 11.6, ], rho = 0)
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     11.6, ], rho = 0)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=5 37       20     18.9    0.0623     0.129
## TANK=6 41       20     21.1    0.0559     0.129
## 
##  Chisq= 0.1  on 1 degrees of freedom, p= 0.719
survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 13.2, ], rho = 0)
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     13.2, ], rho = 0)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=3 38       34     40.6      1.06       3.1
## TANK=4 40       37     30.4      1.41       3.1
## 
##  Chisq= 3.1  on 1 degrees of freedom, p= 0.0781
survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 15.8, ], rho = 0)
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     15.8, ], rho = 0)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=1 39       39     44.5     0.674      3.09
## TANK=2 40       40     34.5     0.868      3.09
## 
##  Chisq= 3.1  on 1 degrees of freedom, p= 0.0789

We could also run this in a for loop (here the Wilcoxon test):

for (i in sort(unique(TOXICITY$PPT)[-c(2, 7)])) {
    cat("\n", i, "\n")
    print(survdiff(Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == i,
        ], rho = 1))
}
## 
##  10.3 
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     i, ], rho = 1)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=9  40     5.41     8.21     0.952      2.28
## TANK=10 38    10.09     7.29     1.071      2.28
## 
##  Chisq= 2.3  on 1 degrees of freedom, p= 0.131 
## 
##  10.8 
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     i, ], rho = 1)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=7 41     8.75    10.33     0.240     0.596
## TANK=8 39    10.81     9.24     0.269     0.596
## 
##  Chisq= 0.6  on 1 degrees of freedom, p= 0.44 
## 
##  11.6 
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     i, ], rho = 1)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=5 37     15.4     14.8    0.0205    0.0524
## TANK=6 41     15.9     16.5    0.0184    0.0524
## 
##  Chisq= 0.1  on 1 degrees of freedom, p= 0.819 
## 
##  13.2 
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     i, ], rho = 1)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=3 38     17.8     23.5      1.37      5.02
## TANK=4 40     24.9     19.2      1.68      5.02
## 
##  Chisq= 5  on 1 degrees of freedom, p= 0.0251 
## 
##  15.8 
## Call:
## survdiff(formula = Surv(TTD, FLAG) ~ TANK, data = TOXICITY[TOXICITY$PPT == 
##     i, ], rho = 1)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## TANK=1 39     22.6     27.0     0.707      3.15
## TANK=2 40     27.8     23.4     0.814      3.15
## 
##  Chisq= 3.1  on 1 degrees of freedom, p= 0.076

Basically we get the same results as in the book:

None of the log-rank tests is statistically significant (at the 0.05 level). The Wilcoxon test for the 13.2 g/L treatment shows p < 0.05. This is also in agreement with the plot.

The $\chi^2$ values differ slightly but share the same trend - I suspect this is due to different data used.

With this dataset we can do much more. We already saw that there might be a relationship between survival time and concentration, but more on this later (example 4.10).

Code and data are available on my github-repo under file name ‘p176’.

Quantitative Ecotoxicology, Page 162, Example 4.7, Duplicate Treatments

This is a short one (example 4.7 on page 162 of Quantitative Ecotoxicology).

First we create the data as matrix:

TEST <- matrix(c(1, 19, 6, 14), byrow = TRUE, ncol = 2, dimnames = list(c("Tank_A",
    "Tank_B"), c("Number_Dead", "Number_Surviving")))
TEST
##        Number_Dead Number_Surviving
## Tank_A           1               19
## Tank_B           6               14

Then we can easily run Fisher’s Exact Test for this 2x2 table using the function fisher.test():

fisher.test(TEST)
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  TEST 
## p-value = 0.09148
## alternative hypothesis: true odds ratio is not equal to 1 
## 95 percent confidence interval:
##  0.0025451 1.2461419 
## sample estimates:
## odds ratio 
##    0.12883
fisher.test(TEST, alternative = "greater")
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  TEST 
## p-value = 0.9958
## alternative hypothesis: true odds ratio is greater than 1 
## 95 percent confidence interval:
##  0.0051465       Inf 
## sample estimates:
## odds ratio 
##    0.12883
fisher.test(TEST, alternative = "less")
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  TEST 
## p-value = 0.04574
## alternative hypothesis: true odds ratio is less than 1 
## 95 percent confidence interval:
##  0.00000 0.96589 
## sample estimates:
## odds ratio 
##    0.12883

The results are identical to those in the book.
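
As a side note, the one-sided p-value can be reproduced directly from the hypergeometric distribution, because Fisher’s Exact Test conditions on the table margins (a minimal sketch):

# P(at most 1 death in Tank_A, given 7 deaths among 20 + 20 fish)
phyper(1, m = 20, n = 20, k = 7)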