

R in Ecotoxicology and Environmental Sciences

Analysing Mesocosm Experiments Using Generalized Linear Models

Introduction

The first post of this blog was about analysing mesocosm data using Principal Response Curves (PRC). At the end of the second part of that post I mentioned alternatives to PRC. I picked up this topic for my Master’s Thesis, which resulted in a paper that was published two weeks ago.

Based on 11 mesocosm data sets (thanks to the co-authors who kindly shared their data) I compared 3 different methods to analyse such data. In this post I show how to analyse mesocosm data using Generalized Linear Models for multivariate data ($GLM_{mv}$). Compared to PRC these have a few advantages:

  • no need to choose a transformation of the community response; GLMs provide a natural way to model count data.
  • better representation of the responding taxa.
  • there is some indication that the models have higher power (Warton et al. (2012), Szöcs et al. (2015)*).

Nevertheless, there are also some drawbacks:

  • $GLM_{mv}$ are computationally more expensive.
  • The graphical presentation of results is more difficult (but see below).

In this post I will demonstrate how to analyse the chlorpyrifos data set also used in the first post.

Setup and data set

First we load some packages that I use in this post.

library(vegan)
library(mvabund)
library(reshape2)
library(plyr)
library(ggplot2)

The abundance data we will use come with the vegan package. For demonstration purposes and to save computation time I use only eight species (of the 178). Moreover, the data in this data set were log(10 * x + 1) transformed, so I back-transform them to raw counts.

data(pyrifos)
take <- c('binitent', 'olchaeta', 'caenhora', 'cloedipt',
          'chaoobsc', 'gammpule', 'libellae', 'agdasphr')
abu <- pyrifos[ , names(pyrifos) %in% take]
abu <- round((exp(abu) - 1)/10)
head(abu)
##        olchaeta binitent gammpule caenhora cloedipt libellae chaoobsc agdasphr
## w.4.c1        6        1        0      133      446        0        1        0
## w.4.c2       15        0        0       62      420        1        3        2
## w.4.c3       24        6        0       50      151        0        5        0
## w.4.c4       59        0        0      119      331        0        1        0
## w.4.c5        0        0        0      133      442        0        4        4
## w.4.c6        5       12        3      128      472        0        1        0

The explanatory variables (time, treatment and replicate number) are not shipped with vegan but can be easily generated:

env <- data.frame(time = rep(c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24), each = 12),
                  treatment = rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11),
                  replicate = gl(12, 1, length=132))
head(env)
##   time treatment replicate
## 1   -4       0.1         1
## 2   -4       0.0         2
## 3   -4       0.0         3
## 4   -4       0.9         4
## 5   -4       0.0         5
## 6   -4      44.0         6

As always, before the analysis we take a look at the raw data to get an impression of what’s going on.

plotraw <- data.frame(abu, env)
plotraw <- melt(plotraw, id.vars = c('treatment', 'time', 'replicate'))
ggplot(plotraw, aes(x = time, y = value + 1,
                   col = factor(treatment))) +
  geom_point() +
  geom_smooth(aes(group = treatment), se = FALSE) +
  facet_wrap(~variable) +
  scale_y_log10() +
  labs(y = 'Abundance + 1', x = 'Week') +
  geom_vline(aes(xintercept = 0))  +
  theme_bw() +
  theme(legend.justification = c(1, 0),
        legend.position = c(1, 0),
        legend.key = element_blank())

plot of chunk plotraw

We see that caenhora and cloedipt show a strong treatment-related negative response with a recovery after some weeks. This is also the case for chaoobsc, but weaker. olchaeta, binitent and gammpule might profit from the treatment. libellae and agdasphr are low in abundance and we cannot say much about their response.

Fit and check models

To fit GLMs to these eight species we use the mvabund package (Wang et al. (2012)). We first need to create an mvabund object from our abundance data:

abu <- mvabund(abu)

Then we can try to fit a Poisson GLM with treatment, time and their interaction as predictors:

mod_pois <- manyglm(abu ~ factor(treatment) * factor(time),
                    data = env, family = 'poisson')

But does this model fit the data? We can use residual plots to check the model:

plot(mod_pois)

plot of chunk plot_resid

The residual plot shows a pronounced fan shape, which indicates that this model does not fit the data very well! Let’s try a negative binomial model (which allows accounting for overdispersion):

mod_nb <- manyglm(abu ~ factor(treatment) * factor(time),
                  data = env, family = 'negative.binomial')
plot(mod_nb)

plot of chunk plot_resid_nb

This fits much better, so we will stick with this model.

Test hypotheses

To test hypotheses we must take the repeated-measures design into account. As in PRC, we do this by using restricted permutations (maybe multivariate mixed-effect models will become available in the future). One permutation design would be to permute whole time series of replicated ditches (plots), keeping the order of samplings (Within(type = 'none')). This is easily done via the permute package.

control <- how(within = Within(type = 'none'),
               plots = Plots(strata = env$replicate, type = 'free'),
               nperm=50)
permutations <- shuffleSet(nrow(abu), control = control)

The object permutations is a permutation matrix which we will use later in hypothesis testing.
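A quick sanity check (just a sketch): each row should be one permutation of the 132 samples, so we expect 50 rows and 132 columns.

dim(permutations)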

Treatment x Time Interaction

One hypothesis we could test is ‘The treatment effect is constant over time’, which is the interaction between treatment and time. To test if we have a statistically significant interaction, we first fit a model without the interaction term

mod_nb_nointeraction <- manyglm(abu ~ factor(treatment) + factor(time),
                                data = env, family = 'negative.binomial')

Then we compare the two models using a Likelihood-Ratio Test. This is done with the anova() method:

mod_nb_aov <- anova(mod_nb, mod_nb_nointeraction,
                    bootID = permutations,
                    p.uni = 'adjusted', test = 'LR')
## Using bootID matrix from input. 
## Time elapsed: 0 hr 0 min 44 sec

The output is twofold: first we get a multivariate test:

##                      Res.Df Df.diff Dev Pr(>Dev)  
## mod_nb_nointeraction    117                       
## mod_nb                   77      40 634     0.02 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This indicates that there is an interaction between treatment and time.

And second we retrieve a univariate test for every species. Here only the p-values are displayed:

printCoefmat(mod_nb_aov$uni.p, na.print = '')
##                      olchaeta binitent gammpule caenhora cloedipt libellae chaoobsc
## mod_nb_nointeraction                                                               
## mod_nb                 0.4902   1.0000   0.6863   0.0196   0.0196   1.0000   0.0392
##                      agdasphr
## mod_nb_nointeraction         
## mod_nb                   0.86

This indicates that caenhora, cloedipt and chaoobsc show a treatment effect that varies with time.

General treatment effect

We could also compare the model with interaction to a model containing only time as explanatory variable. This gives us a test for the total treatment effect (treatment + treatment x time):

mod_nb_null <- manyglm(abu ~ factor(time), data = env,
                       family = 'negative.binomial')
mod_treat_aov <- anova(mod_nb, mod_nb_null , bootID = permutations,
      p.uni='adjusted', test = 'LR')
## Using bootID matrix from input. 
## Time elapsed: 0 hr 0 min 35 sec
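The multivariate table from this comparison can be inspected in the same way as above, e.g. (a quick sketch):

mod_treat_aov$table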

Period of treatment effect

Now that we know that the treatment effect is varying with time, we could further explore this interaction by testing the treatment effect at every sampling date. This is similar to PRC.

mod_pt <- NULL
for(i in levels(factor(env$time))) {
  take_abu <- abu[env$time == i, ]
  take_env <- env[env$time == i, ]
  # model
  mod_pt[[i]]$mod <- manyglm(take_abu ~ factor(treatment), data = take_env)
  mod_pt[[i]]$aov <- anova(mod_pt[[i]]$mod, nBoot = 100,
                           p.uni = 'adjusted', test = 'LR', show.time="none")
  mod_pt[[i]]$sum <- summary(mod_pt[[i]]$mod, nBoot = 100,
                             p.uni = 'adjusted', test = 'LR')
}

For every week I fit a model with treatment as explanatory variable and then run a Likelihood-Ratio test. Moreover, I also keep the summary of the models (it will be used later for plotting). Note that I used only 100 resamples here to save computation time.

From the resulting list we can extract the information we are interested in. E.g. we could extract the p-values at every sampling week:

get_pvals <- function(x){
  comm <- c(community = x$aov$table[2, 4])
  spec <- x$aov$uni.p[2, ]
  c(comm, spec)
}
ldply(mod_pt, get_pvals)
##    .id community olchaeta binitent gammpule caenhora cloedipt libellae chaoobsc agdasphr
## 1   -4      0.37     1.00     1.00     0.65     1.00     1.00     1.00     0.33     0.65
## 2   -1      0.62     0.86     0.95     0.95     0.74     0.95     0.95     0.95     0.95
## 3  0.1      0.10     0.95     0.95     0.95     0.12     0.10     1.00     0.08     0.87
## 4    1      0.02     0.89     0.94     0.94     0.02     0.02     1.00     0.08     0.75
## 5    2      0.02     0.73     0.91     0.91     0.01     0.01     0.90     0.09     0.91
## 6    4      0.02     0.21     0.81     0.81     0.02     0.02     1.00     0.05     0.77
## 7    8      0.03     0.31     0.76     0.16     0.03     0.25     1.00     0.18     0.64
## 8   12      0.08     0.89     0.89     0.17     0.04     0.86     1.00     0.73     1.00
## 9   15      0.06     0.82     0.97     0.13     0.13     0.82     0.82     0.56     0.97
## 10  19      0.07     0.67     0.94     0.41     0.21     0.67     0.62     0.67     0.45
## 11  24      0.18     0.84     0.99     0.60     0.22     0.98     0.98     0.22     0.98

This indicates a treatment effect on community structure from the day of application till week 12.

Note that this test is different from the method described in the PRC post. There I used the log-transformed treatment as a continuous predictor (regression design). In contrast, here I use treatment as a categorical predictor (ANOVA design), which explains the differences.
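For illustration, a hypothetical sketch of such a regression design for a single week (here week 1), using the log-transformed dose as a continuous predictor - this is not the analysis run above:

take_abu <- abu[env$time == 1, ]
take_env <- env[env$time == 1, ]
# dose as continuous predictor (log(10 * x + 1), as in the PRC post)
mod_reg <- manyglm(take_abu ~ log(10 * treatment + 1), data = take_env)
anova(mod_reg, nBoot = 100, p.uni = 'adjusted', test = 'LR')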

Plotting treatment effect

Let’s further investigate what’s going on graphically. How does the treatment effect on the community develop over time? Which species are responsible for this effect, and when?

To plot the effect of treatment I extract the species deviances (a sum-of-squares equivalent). Species with a high deviance show a strong treatment effect. The community effect is then simply the sum of species effects/deviances.

And here is a plot of the treatment effect over time:

devs <- ldply(mod_pt, function(x) x$aov$uni.test[2, ])
plotdf <- melt(devs, id.vars = '.id')
ggplot(plotdf, aes(x = as.numeric(as.character(.id)), y = value, fill = variable)) +
  geom_area(col = 'grey15') +
  geom_vline(aes(xintercept = 0)) +
  theme_bw() +
  labs(x = 'time', y = 'Deviance (=Effect)')

plot of chunk plot_deviance

We see that the treatment effect increases after chlorpyrifos application (Lo and behold!) and then levels off (but not to the initial level). Moreover, we see that the contributions to the effect also change with time: shortly after application until week 8, caenhora, cloedipt and chaoobsc are responsible for the treatment effect. However, after week 8 the contribution of gammpule increases. Other taxa play only a minor role.

Of course, we could also extract this information numerically, similar to the explained variance in PRC (see the sketch below).
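A minimal sketch, reusing the devs data.frame created above (its first column, .id, holds the week): the percentage contribution of each species to the community deviance per week.

contrib <- devs
contrib[ , -1] <- round(100 * devs[ , -1] / rowSums(devs[ , -1]), 1)
contrib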

A PRC-like plot

We could also draw a PRC-like plot (using the summary objects from above). The procedure is always similar: extract the necessary information (coefficients in this case), prepare it, plot it.

get_coef <- function(x){
  x$sum$coefficients[-1, 1]
}
coefs <- ldply(mod_pt, get_coef)
coefs <- melt(coefs, id.vars = '.id')
ggplot(coefs, aes(x = as.numeric(as.character(.id)), y = value, col = variable)) +
  geom_line() +
  geom_vline(aes(xintercept = 0)) +
  theme_bw() +
  labs(x = 'Time', y = 'Coefficient') +
  scale_color_discrete('Treatment', labels = c(0.1, 0.9, 6, 44)) +
  theme(legend.position = 'bottom',
        legend.key = element_blank())

plot of chunk plot_coefs

This plot basically shows the same patterns as the PRC.

Getting a species plot is a bit harder. The species scores displayed in a standard PRC plot condense a lot of information: basically, they show the overall species responses. Therefore, I use the general treatment test (mod_treat_aov from above) for this plot.

I first extract and calculate the contribution of each species to the community response from this test:

contr <- data.frame(resp = mod_treat_aov$uni.test[2, ] / mod_treat_aov$table[2, 3])
contr$spec <- row.names(contr)

Then I just plot the species names according to their contribution.

ggplot(contr, aes(x = factor(1), y = resp * 100)) +
  geom_text(aes(label = spec)) +
  labs(y = 'Contribution [%]', x = '') +
  theme_minimal() +
  scale_x_discrete(breaks = NULL)

plot of chunk unnamed-chunk-16

The only difference to the PRC is that the direction of response is not displayed.

However, the plot showing the species contributions over time (see “Plotting treatment effect”) is much more informative.

Outlook

I hope this tutorial showed that applying GLMs for multivariate data to mesocosm data is not harder than performing a PRC in R. However, currently there is no alternative to R and no graphical user interface (GUI). PRC can also be run in the commercial CANOCO software package with a nice GUI. This may hinder the adoption of these models in ecotoxicology, and easy-to-use software is needed.

References

  • Warton, D.I., Wright, S.T., Wang, Y., 2012. Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution 3, 89–101.
  • Szöcs, E., Van den Brink, P.J., Lagadic, L., Caquet, T., Roucaute, M., Auber, A., Bayona, Y., Liess, M., Ebke, P., Ippolito, A., ter Braak, C.J.F., Brock, T.C.M., Schäfer, R.B., 2015. Analysing chemical-induced changes in macroinvertebrate communities in aquatic mesocosm experiments: a comparison of methods. Ecotoxicology. doi:10.1007/s10646-015-1421-0
  • Wang, Y., Naumann, U., Wright, S.T., Warton, D.I., 2012. mvabund- an R package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution 3, 471–474.

(*) Note: the main criticism from the reviewers was that power differences cannot be deduced from the analysis of these 11 data sets. E.g. Figure 1 in Szöcs et al. (2015) shows lower p-values for $GLM_{mv}$, which indicates a higher power. This annoyed me a little, so I did a (univariate) simulation study. It is currently under review and I’ll post about it soon.

Species Sensitivity Distributions (SSD) With R

Introduction

Species Sensitivity Distributions (SSD) are a central tool in ecological risk assessment (ERA). Species show different sensitivities to chemicals, and the variation between species can be described by a statistical distribution. From an SSD, the concentration at which x% of species are affected (the $HC_x$) can be derived. Usually an HC5 is derived (with its 95% confidence interval) and used in ERA.
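In other words, if $F$ is the cumulative distribution fitted to the toxicity values, then $HC_x = F^{-1}(x/100)$, i.e. the HC5 is simply the 5% quantile of the fitted distribution.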

Data

In this example I will generate an SSD for the insecticide Chlorpyrifos (CAS 2921-88-2). SSDs are generated from toxicity data (like EC50 / LC50 values). Such data are available e.g. from the US EPA ECOTOX database, ECHA or ETOX.

I prepared some data from the US EPA ECOTOX database for this post. I will skip the data cleaning and quality checks here - but note that these data have not been checked thoroughly and were prepared only for demonstration purposes. However, data cleaning and checking is a very important step of every data analysis.

You can read the data into R with the following lines:

# download data from github
require(RCurl)
url <- getURL("https://raw.githubusercontent.com/EDiLD/r-ed/master/post_ssd/ssd_data.csv",
    ssl.verifypeer = FALSE)
df <- read.table(text = url, header = TRUE, sep = ',', stringsAsFactors = FALSE)

A first look at the data

SSDs are typically displayed as a plot showing the fraction of affected species on the y axis and the concentration on the x-axis.

To calculate the fraction affected we order the species by their toxicity values and then calculate the fraction:

df <- df[order(df$val), ]
df$frac <- ppoints(df$val, 0.5)
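Here ppoints(x, 0.5) returns the Hazen plotting positions $\frac{i - 0.5}{n}$ for the $i$-th smallest of $n$ values, so the most sensitive species gets the smallest fraction.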

Then we can take a first look at the data:

require(ggplot2)
ggplot(data = df) +
  geom_point(aes(x = val, y = frac), size = 5) +
  geom_text(aes(x = val, y = frac, label = species), hjust = 1.1, size = 4) +
  theme_bw() +
  scale_x_log10(limits=c(0.0075, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')),
       y = 'Fraction of species affected')

plot of chunk unnamed-chunk-4

Fitting a distribution to the data

To fit a distribution to the data we can use the fitdistr() function from the MASS package or the more flexible fitdist() from the fitdistrplus package (there are also others). I will use the MASS package here to fit a lognormal distribution to this data.

require(MASS)
fit <- fitdistr(df$val, 'lognormal')
fit
##    meanlog     sdlog 
##   0.20235    1.55028 
##  (0.26204)  (0.18529)

The mean (meanlog) and standard deviation (sdlog) of the lognormal distribution were estimated from the data. We could fit and compare (e.g. by AIC) different distributions, but I stick with the lognormal here.
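For example, a minimal sketch of such a comparison, with the gamma and Weibull distributions as alternative candidates (fitdistr() may need extra starting values for other data sets):

fit_gamma <- fitdistr(df$val, 'gamma')
fit_weibull <- fitdistr(df$val, 'weibull')
# lower AIC = better fit
AIC(fit, fit_gamma, fit_weibull)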

Derive HC5

From the estimated parameters of the fitted distribution we can easily extract the HC5.

(hc5 <- qlnorm(0.05, meanlog = fit$estimate[1], sdlog = fit$estimate[2]))
## [1] 0.095596

Parametric bootstrap

To be more conservative, the lower limit of the confidence interval (CI) around the HC5 is sometimes used. This lower limit can be estimated from the data using a parametric bootstrap.

The idea is:

  • generate random values from the fitted distribution
  • fit the distribution to these random values
  • estimate the HC5 from this new distribution
  • repeat many times to assess the variability of HC5 values

Alternatively, a non-parametric bootstrap could be used (resample from the data, not from the fitted distribution); a sketch is shown further below.

In R we write a function (myboot()) that does steps 1-3 for us:

myboot <- function(fit, p){
  # resample from fitted distribution
  xr <- rlnorm(fit$n, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
  # fit distribution to new data
  fitr <- fitdistr(xr, 'lognormal')
  # return HCp
  hc5r <- qlnorm(p, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
  return(hc5r)
}

We repeat this function 1000 times and get the quantiles of the bootstrapped HC5 values:

set.seed(1234)
hc5_boot <- replicate(1000, myboot(fit, p = 0.05))
quantile(hc5_boot, probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 0.046027 0.102427 0.214411

So for this data and the lognormal distribution the HC5 would be 0.096 with a CI of [0.046; 0.214].
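The non-parametric alternative mentioned above (resampling the raw toxicity values instead of the fitted distribution) could look like this - a minimal sketch:

myboot_np <- function(vals, p){
  # resample the raw data with replacement
  xr <- sample(vals, replace = TRUE)
  # refit the lognormal and return the HCp
  fitr <- fitdistr(xr, 'lognormal')
  qlnorm(p, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
}
hc5_npboot <- replicate(1000, myboot_np(df$val, p = 0.05))
quantile(hc5_npboot, probs = c(0.025, 0.5, 0.975))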

A fancy plot

Finally, I generate a fancy SSD plot showing the fitted curve (red), the bootstrapped curves (blue), the CI (dashed) and the raw data (dots):

myboot2 <- function(fit, newxs){
  # resample
  xr <- rlnorm(fit$n, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
  # fit to resample
  fitr <- fitdistr(xr, 'lognormal')
  # predict for new data
  pyr <- plnorm(newxs, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
  return(pyr)
}
# newdata to predict
newxs <- 10^(seq(log10(0.01), log10(max(df$val)), length.out = 1000))
boots <- replicate(1000, myboot2(fit, newxs))

# fancy plot
require(reshape2)
# extract bootstrap values
bootdat <- data.frame(boots)
bootdat$newxs <- newxs
bootdat <- melt(bootdat, id = 'newxs')
# extract CI
cis <- apply(boots, 1, quantile, c(0.025, 0.975))
rownames(cis) <- c('lwr', 'upr')
# add fitted values
pdat <- data.frame(newxs, py = plnorm(newxs, meanlog = fit$estimate[1], sdlog = fit$estimate[2]))
# add CI
pdat <- cbind(pdat, t(cis))
# x coordinates for the species names (better would be to use the lower CI instead of the fit)
df$fit <- 10^(log10(qlnorm(df$frac, meanlog = fit$estimate[1], sdlog = fit$estimate[2])) -0.4)
# plot
ggplot()+
  geom_line(data = bootdat, aes(x = newxs, y = value, group = variable), col = 'steelblue', alpha = 0.05) +
  geom_point(data = df, aes(x = val, y = frac)) +
  geom_line(data = pdat, aes(x = newxs, y = py), col = 'red') +
  geom_line(data = pdat, aes(x = newxs, y = lwr), linetype = 'dashed') +
  geom_line(data = pdat, aes(x = newxs, y = upr), linetype = 'dashed') +
  geom_text(data = df, aes(x = fit, y = frac, label = species), hjust = 1, size = 4) +
  theme_bw() +
  scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000), limits = c(0.003, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')),
       y = 'Fraction of species affected')

plot of chunk unnamed-chunk-9

Note that the bootstrap function is slightly different here: for this plot I wanted a separate (blue) line for each bootstrap sample.

Web Scraping Pesticides Data With R

This week I had to find the CAS numbers for a bunch of pesticides. Moreover, I also needed information about the major groups of these pesticides (e.g. herbicides, fungicides, …), and some of the names were in German.

ETOX is quite useful for finding CAS numbers, even for German names, as it also has synonyms in its database.

Another, useful source is the Compendium of Pesticide Common Names by Allan Wood.

Since I had > 500 compounds in my list, this was not feasible to do manually. So I wrote two small functions (etox_to_cas() and allanwood()) to search and retrieve information from these two websites.

Both are available from my esmisc package on github.com. These are small functions using the RCurl and XML packages for scraping. They have not been tested very much and may not be very robust.

Query CAS from ETOX

require(esmisc)
etox_to_cas('2,4-D')
## [1] "94-75-7"

If you have a bunch of compounds you can use sapply() to feed etox_to_cas():

sapply(c('2,4-D', 'DDT', 'Diclopfop'), etox_to_cas)
##     2,4-D       DDT Diclopfop 
## "94-75-7" "50-29-3"        NA

Query CAS and pesticide group from Allan Wood

allanwood('Fluazinam')
##                                CAS                           activity 
##                       "79622-59-6" "fungicides (pyridine fungicides)"
sapply(c('Fluazinam', 'Diclofop', 'DDT'), allanwood)
##          Fluazinam                         
## CAS      "79622-59-6"                      
## activity "fungicides (pyridine fungicides)"
##          Diclofop                                         
## CAS      "40843-25-2"                                     
## activity "herbicides (aryloxyphenoxypropionic herbicides)"
##          DDT                                                                                                            
## CAS      "50-29-3"                                                                                                      
## activity "acaricides (bridged diphenyl acaricides; organochlorine acaricides)insecticides (organochlorine insecticides)"

Quantitative Ecotoxicology, Page 223, Example 5.1

This is example 5.1 on page 223 of Quantitative Ecotoxicology - reproduced with R.

Load and clean the data

Read the data into R:

df <- read.table(header = TRUE, text = 'conc A B C D
0 1 1 0.9 0.9
32 0.8 0.8 1 0.8
64 0.9 1 1 1 
128 0.9 0.9 0.8 1
256 0.7 0.9 1 0.5
512 0.4 0.3 0.4 0.2')
df
##   conc   A   B   C   D
## 1    0 1.0 1.0 0.9 0.9
## 2   32 0.8 0.8 1.0 0.8
## 3   64 0.9 1.0 1.0 1.0
## 4  128 0.9 0.9 0.8 1.0
## 5  256 0.7 0.9 1.0 0.5
## 6  512 0.4 0.3 0.4 0.2

Next we do some housekeeping: convert the data to long format and the concentration to a factor.

require(reshape2)
# to long
dfm <- melt(df, id.vars = 'conc', value.name = 'y', variable.name = 'tank')
# conc as factor
dfm$conc <- factor(dfm$conc)
head(dfm)
##   conc tank   y
## 1    0    A 1.0
## 2   32    A 0.8
## 3   64    A 0.9
## 4  128    A 0.9
## 5  256    A 0.7
## 6  512    A 0.4

Let’s have a first look at the data:

boxplot(y ~ conc, data = dfm,
        xlab = 'conc', ylab = 'Proportion surv.')

plot of chunk unnamed-chunk-4

Transform response

Next we apply the arcsine transformation:

dfm$y_asin <- ifelse(dfm$y == 1,
                     asin(sqrt(dfm$y)) - asin(sqrt(1/40)),
                     asin(sqrt(dfm$y))
                     )

This adds the transformed values as column y_asin to our data.frame. Survivals of 1 (100%) are transformed as

$y' = \arcsin(\sqrt{y}) - \arcsin\left(\sqrt{1/(4n)}\right)$

where n = number of animals per replicate (here n = 10, so 1/(4n) = 1/40).

All other values are transformed using

$y' = \arcsin(\sqrt{y})$

If we had observations with 0% survival, these zero values should be transformed using

$y' = \arcsin\left(\sqrt{1/(4n)}\right)$

by adding an extra ifelse().
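A sketch of the full transformation with all three cases (the extra ifelse() for 0% survival is not needed for this data set and is shown only for completeness; n = 10):

n <- 10
dfm$y_asin <- ifelse(dfm$y == 1, asin(sqrt(dfm$y)) - asin(sqrt(1/(4 * n))),
                     ifelse(dfm$y == 0, asin(sqrt(1/(4 * n))),
                            asin(sqrt(dfm$y))))
# gives identical values to the column created above, since there are no zeros here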

Let’s look at the transformed values:

head(dfm)
##   conc tank   y  y_asin
## 1    0    A 1.0 1.41202
## 2   32    A 0.8 1.10715
## 3   64    A 0.9 1.24905
## 4  128    A 0.9 1.24905
## 5  256    A 0.7 0.99116
## 6  512    A 0.4 0.68472
boxplot(y_asin ~ conc, data = dfm,
        xlab = 'conc', ylab = 'Transformed prop. surv.')

plot of chunk unnamed-chunk-6

Doesn’t look that different…

ANOVA

To fit an ANOVA to this data we use the aov() function:

mod <- aov(y_asin ~ conc, data = dfm)

And summary() gives the ANOVA table:

summary(mod)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## conc         5  1.575  0.3151    13.3 0.000016 ***
## Residuals   18  0.426  0.0237                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The within-treatment variance is termed Residuals and the between-treatment variance is named after the predictor conc. The total variance is simply the sum of those two and is not displayed.

R already performed an F-test for us, indicated by the F value (= ratio of the Mean Sq values) and Pr(>F) columns.
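As a quick sanity check (a sketch), the F value can be recomputed by hand from the mean squares:

aov_tab <- summary(mod)[[1]]
aov_tab[['Mean Sq']][1] / aov_tab[['Mean Sq']][2]  # between- / within-treatment Mean Sq, ~13.3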

Multiple Comparisons

Now that we know there is a statistically significant treatment effect, we might be interested in which treatments differ from the control group.

The Tukey contrasts mentioned in the book (comparing each level with every other level) can be easily computed with the multcomp package:

require(multcomp)
summary(glht(mod, linfct = mcp(conc = 'Tukey')))
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = y_asin ~ conc, data = dfm)
## 
## Linear Hypotheses:
##                Estimate Std. Error t value Pr(>|t|)    
## 32 - 0 == 0     -0.1472     0.1088   -1.35   0.7530    
## 64 - 0 == 0      0.0407     0.1088    0.37   0.9989    
## 128 - 0 == 0    -0.0762     0.1088   -0.70   0.9795    
## 256 - 0 == 0    -0.2211     0.1088   -2.03   0.3634    
## 512 - 0 == 0    -0.7273     0.1088   -6.69   <0.001 ***
## 64 - 32 == 0     0.1879     0.1088    1.73   0.5327    
## 128 - 32 == 0    0.0709     0.1088    0.65   0.9850    
## 256 - 32 == 0   -0.0740     0.1088   -0.68   0.9820    
## 512 - 32 == 0   -0.5802     0.1088   -5.33   <0.001 ***
## 128 - 64 == 0   -0.1170     0.1088   -1.08   0.8849    
## 256 - 64 == 0   -0.2619     0.1088   -2.41   0.2055    
## 512 - 64 == 0   -0.7681     0.1088   -7.06   <0.001 ***
## 256 - 128 == 0  -0.1449     0.1088   -1.33   0.7643    
## 512 - 128 == 0  -0.6511     0.1088   -5.98   <0.001 ***
## 512 - 256 == 0  -0.5062     0.1088   -4.65   0.0023 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

However, this leads to 15 comparisons (and tests), and we may not be interested in all of them. Note that, with $\alpha = 0.05$, we expect on average 1 false positive in 20 tests if we do not apply a correction for multiple testing.

An alternative would be to compare just the control group to the treatments. These are called Dunnett contrasts and lead to only 5 comparisons.

The syntax is the same, just change Tukey to Dunnett:

summary(glht(mod, linfct = mcp(conc = 'Dunnett')))
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Dunnett Contrasts
## 
## 
## Fit: aov(formula = y_asin ~ conc, data = dfm)
## 
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)    
## 32 - 0 == 0   -0.1472     0.1088   -1.35     0.54    
## 64 - 0 == 0    0.0407     0.1088    0.37     0.99    
## 128 - 0 == 0  -0.0762     0.1088   -0.70     0.93    
## 256 - 0 == 0  -0.2211     0.1088   -2.03     0.20    
## 512 - 0 == 0  -0.7273     0.1088   -6.69   <0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

The column Estimate gives us the difference in means between the control and the respective treatments, and Std. Error gives the standard error of these estimates. Both are combined into a t value (= Estimate / Std. Error), from which we can get a p-value (Pr(>|t|)).

Note that the p-values are already corrected for multiple testing, as indicated at the bottom of the output. If you want to change the correction method you can use:

summary(glht(mod, linfct = mcp(conc = 'Dunnett')), test = adjusted('bonferroni'))

This applies Bonferroni-correction, see ?p.adjust and ?adjusted for other methods.

Outlook

Warton & Hui (2011) demonstrated that the arcsine transform should not be used for analysing proportions, just as O’Hara & Kotze (2010) showed that count data should not be log-transformed.

In a future post I will show how to analyse this data without transformation using Generalized Linear Models (GLM), and perhaps add some simulations showing that GLMs can lead to increased statistical power for ecotoxicological data sets.
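As a teaser, a hypothetical sketch of such a model for this data set - a binomial GLM on the number of survivors out of n = 10 animals per tank (this is not the analysis from the book):

n <- 10
mod_glm <- glm(cbind(y * n, n - y * n) ~ conc, data = dfm, family = binomial)
summary(mod_glm)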

Note that I couldn’t find any reference to Generalized Linear Models in Newman (2012) or EPA (2002), although they have been around for more than 40 years now (Nelder & Wedderburn, 1972).

References

  • Warton, D. I., & Hui, F. K. (2011). The arcsine is asinine: the analysis of proportions in ecology. Ecology, 92(1), 3-10.

  • O’Hara, R. B., & Kotze, D. J. (2010). Do not log‐transform count data. Methods in Ecology and Evolution, 1(2), 118-122.

  • Newman, M. C. (2012). Quantitative ecotoxicology. Taylor & Francis, Boca Raton, FL.

  • EPA (2002). Methods for Measuring the Acute Toxicity of Effluents and Receiving Waters to Freshwater and Marine Organisms.

  • Nelder J.A., & Wedderburn R.W.M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society Series A (General) 135:370–384.

Web Scraping Chemical Data With R

I recently came across the problem of converting CAS numbers into SMILES and retrieving additional information about the compounds.

There are several sources around the web that provide chemical information, e.g. PubChem, ChemSpider and the Chemical Identifier Resolver.

I wrote up some functions to interact with these servers from R. You can find them in my esmisc package:

install.packages('devtools')
require(devtools)
install_github('esmisc', 'EDiLD')
require(esmisc)

These functions are very crude and need some further development (if you want to improve them, fork the package!). However, here’s a short summary:

Convert CAS to SMILES

Suppose we have some CAS numbers and want to convert them to SMILES:

casnr <- c("107-06-2", "107-13-1", "319-86-8")
Via Cactus
cactus(casnr, output = 'smiles')
## [1] "C(Cl)CCl"                       "C(C#N)=C"                      
## [3] "C1(Cl)C(Cl)C(Cl)C(Cl)C(Cl)C1Cl"
Via ChemSpider

Note that ChemSpider requires a security token. To obtain a token, please register at ChemSpider.

csid <- get_csid(casnr, token = token)
csid_to_smiles(csid, token)
## [1] "C(CCl)Cl"                                                   
## [2] "C=CC#N"                                                     
## [3] "[C@@H]1([C@@H]([C@@H]([C@@H]([C@H]([C@@H]1Cl)Cl)Cl)Cl)Cl)Cl"
Via PubChem
cid <- get_cid(casnr)
cid_to_smiles(cid)
## [1] "C(CCl)Cl"                       "C=CC#N"                        
## [3] "C1(C(C(C(C(C1Cl)Cl)Cl)Cl)Cl)Cl"

Retrieve other data from CAS

All these web resources provide additional data. Here is an example retrieving the molecular weights:

Via cactus
cactus(casnr, output = 'mw')
## [1] "98.9596"  "53.0634"  "290.8314"
Via ChemSpider
csid_to_ext(csid, token)$MolecularWeight
## [1] "98.95916"  "53.06262"  "290.82984"
Via Pubchem
cid_to_ext(cid)$mw
## [1] "98.959160"  "53.062620"  "290.829840"

ChemSpider and PubChem return the same values; however, the results from Cactus are slightly different.

Retrieve partitioning coefficients

Partition coefficients are another useful property. LOGKOW is a databank that contains experimental data, retrieved from the literature, on over 20,000 organic compounds.

get_kow() extracts the ‘Recommended values’ for a given CAS:

get_kow(casnr)
## [1] 1.48 0.25 4.14

This function is very crude. For example, it returns only the first hit if multiple hits are found in the database - a better way would be to ask for user input, as we did in the taxize package.

Outlook

Currently I have no time to extensively develop these functions. I would be happy if someone picks up this work - it’s fairly easy: just fork the repo and start.

In the future this could be turned into an rOpenSci package, as it is within their scope.