Species Sensitivity Distributions (SSD) are a central tool for ecological risk
assessment (ERA).
Species show different sensitivities to chemicals and the variation between
species can be described by a statistical distribution.
The concentration at which x% of species are affected ($HC_x$) can be derived
from an SSD.
Usually an HC5 is derived (with a 95% confidence interval) and used in ERA.
Data
In this example I will generate an SSD for the insecticide Chlorpyrifos
(CAS 2921-88-2).
SSDs are generated using data from toxicity experiments (like EC50 / LC50 values).
Such data are available, e.g., from the
US EPA ECOTOX database,
ECHA or
ETOX.
I prepared some data from the US EPA ECOTOX database for this post.
I will skip the data cleaning and data quality checks here. Note, however, that this data has not
been checked thoroughly and was prepared for demonstration purposes only.
In a real analysis, data cleaning and checking is a very important step.
You can read the data into R with these three lines:
# download data from github
require(RCurl)
url <- getURL("https://raw.githubusercontent.com/EDiLD/r-ed/master/post_ssd/ssd_data.csv", ssl.verifypeer = FALSE)
df <- read.table(text = url, header = TRUE, sep = ',', stringsAsFactors = FALSE)
A first look at the data
SSDs are typically displayed as a plot showing the fraction of affected species
on the y axis and the concentration on the x-axis.
To calculate the fraction affected we order the species by their toxicity values and then calculate the fraction:
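The calculation itself is not shown here. A minimal sketch, assuming a data frame with the columns `species` and `val` that the plot code uses. The toy values are made up, and the use of `ppoints()` with plotting positions (i - 0.5) / n is an assumption, not necessarily what was used for the original figure:

```r
# toy data (hypothetical values, not the ECOTOX data)
toy <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e"),
  val     = c(3.2, 0.05, 12, 0.4, 1.1),
  stringsAsFactors = FALSE
)
# order species from most to least sensitive
toy <- toy[order(toy$val), ]
# plotting positions (i - 0.5) / n as fraction of species affected
toy$frac <- ppoints(toy$val, a = 0.5)
toy
```

With the real data the same two steps would be applied to `df`, which gives the `frac` column used for plotting.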
require(ggplot2)
ggplot(data = df) +
  geom_point(aes(x = val, y = frac), size = 5) +
  geom_text(aes(x = val, y = frac, label = species), hjust = 1.1, size = 4) +
  theme_bw() +
  scale_x_log10(limits = c(0.0075, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')),
       y = 'Fraction of species affected')
Fitting a distribution to the data
To fit a distribution to the data we can use the fitdistr() function from the MASS package or
the more flexible fitdist() from the fitdistrplus package (there are also others).
I will use the MASS package here to fit a lognormal distribution to this data.
The mean (meanlog) and standard deviation (sdlog) of the log-transformed values are the two parameters estimated from the data.
We could fit and compare (e.g. by AIC) different distributions,
but I stick with the lognormal here.
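The fitting call is not shown here. A minimal sketch with `fitdistr()`, using simulated toy values in place of `df$val` (sample size and parameters are arbitrary assumptions). To my knowledge the maximum-likelihood estimates for the lognormal have a closed form, which makes the result easy to check by hand:

```r
library(MASS)

set.seed(1)
# toy toxicity values (hypothetical), standing in for df$val
x <- rlnorm(30, meanlog = 0, sdlog = 1.5)

fit <- fitdistr(x, "lognormal")
fit$estimate

# closed-form MLE: mean and (n-denominator) sd of the log-values
lx <- log(x)
mle <- c(meanlog = mean(lx), sdlog = sqrt(mean((lx - mean(lx))^2)))
mle
```

Both ways should give the same two numbers; `fit` additionally carries the standard errors and the sample size (`fit$n`).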
Derive HC5
From the estimated parameters of the fitted distribution we can easily extract the HC5.
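For a lognormal SSD the HC5 is simply the 5% quantile of the fitted distribution. A minimal sketch; the parameter values below are hypothetical placeholders for `fit$estimate`:

```r
# hypothetical parameter estimates (placeholders for fit$estimate)
meanlog <- 1.0
sdlog   <- 2.0

# HC5 = 5% quantile of the fitted lognormal distribution
hc5 <- qlnorm(0.05, meanlog = meanlog, sdlog = sdlog)

# equivalent calculation on the log scale
hc5_manual <- exp(meanlog + qnorm(0.05) * sdlog)
c(hc5, hc5_manual)
```

Any other $HC_x$ follows by changing the probability passed to `qlnorm()`.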
To be more conservative the lower limit of the confidence interval (CI) around the HC5 is sometimes used.
The lower limit of the CI can be estimated from the data using parametric bootstrap.
The idea is:
1. generate random values from the fitted distribution
2. fit the distribution to these random values
3. estimate the HC5 from this new fit
4. repeat many times to assess the variability of the HC5 values
Alternatively, a non-parametric bootstrap could be used (resampling from the data instead of from the fitted distribution).
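A rough sketch of what the non-parametric variant might look like. The function name `myboot_np()` and the toy data are hypothetical, and the lognormal is fitted via the closed-form estimates (mean and n-denominator sd of the log-values) to keep the example free of package dependencies:

```r
# non-parametric variant: resample the observations with replacement
# x : vector of toxicity values, p : fraction of species affected
myboot_np <- function(x, p) {
  xr <- sample(x, size = length(x), replace = TRUE)
  lx <- log(xr)
  # closed-form lognormal fit to the resampled values
  qlnorm(p, meanlog = mean(lx), sdlog = sqrt(mean((lx - mean(lx))^2)))
}

set.seed(1234)
x_toy <- rlnorm(30, meanlog = 0, sdlog = 1.5)   # hypothetical data
hc5_np <- replicate(1000, myboot_np(x_toy, p = 0.05))
q_np <- quantile(hc5_np, probs = c(0.025, 0.5, 0.975))
q_np
```

The parametric version used in the rest of this post differs only in how the new values are generated.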
In R we write a function (myboot()) that does steps 1-3 for us:
myboot <- function(fit, p){
  # resample from fitted distribution
  xr <- rlnorm(fit$n, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
  # fit distribution to new data
  fitr <- fitdistr(xr, 'lognormal')
  # return HCp
  hc5r <- qlnorm(p, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
  return(hc5r)
}
We repeat this function 1000 times and get the quantiles of the bootstrapped HC5 values:
set.seed(1234)
hc5_boot <- replicate(1000, myboot(fit, p = 0.05))
quantile(hc5_boot, probs = c(0.025, 0.5, 0.975))

##     2.5%      50%    97.5% 
## 0.046027 0.102427 0.214411
So for this data and the lognormal distribution the HC5 would be 0.096 with a CI of [0.046; 0.214].
A fancy plot
Finally, I generate a fancy SSD plot with predictions (red), with bootstrapped values (blue), CI (dashed) and the raw data (dots):
This week I had to find the CAS-numbers for a bunch of pesticides.
Moreover, I also needed information about the major groups of these pesticides (e.g. herbicides, fungicides, …), and some of the names were in German.
ETOX is quite useful for finding CAS numbers, even for German names, as they also have synonyms in their database.
Since I had > 500 compounds in my list, this was not feasible to do manually.
So I wrote two small functions (etox_to_cas() and allanwood()) to search and retrieve information from these two websites.
Both are available from my esmisc package on github.com.
These are small functions using the RCurl and XML packages for scraping.
They have not been tested very much and may not be very robust.
Query CAS from ETOX
require(esmisc)
etox_to_cas('2,4-D')

## [1] "94-75-7"
If you have a bunch of compounds you can use sapply() to feed etox_to_cas():
sapply(c('2,4-D', 'DDT', 'Diclopfop'), etox_to_cas)

##     2,4-D       DDT Diclopfop 
## "94-75-7" "50-29-3"        NA
Query CAS and pesticide group from Allan Wood
allanwood('Fluazinam')

##                                CAS                           activity 
##                       "79622-59-6" "fungicides (pyridine fungicides)"
To fit an ANOVA to this data we use the aov() function:
mod <- aov(y_asin ~ conc, data = dfm)
And summary() gives the ANOVA table:
summary(mod)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## conc         5  1.575  0.3151    13.3 0.000016 ***
## Residuals   18  0.426  0.0237                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The within-treatment variance is termed Residuals and the between-treatment variance is named according to the predictor conc. The total variance is simply the sum of those and not displayed.
R already performed an F-test for us, indicated by the F value (= ratio of the Mean Sq) and Pr(>F) columns.
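The F-test can be reproduced by hand from the table. A small sketch using the rounded values printed in the ANOVA table (so the result only matches up to rounding):

```r
# values read from the ANOVA table
ms_conc  <- 0.3151   # between-treatment mean square
ms_resid <- 0.0237   # within-treatment (residual) mean square
df_conc  <- 5
df_resid <- 18

# F value = ratio of the mean squares
f_value <- ms_conc / ms_resid
# p-value = upper-tail probability of the F distribution
p_value <- pf(f_value, df_conc, df_resid, lower.tail = FALSE)
c(F = f_value, p = p_value)
```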
Multiple Comparisons
Now that we know there is a statistically significant treatment effect, we might be interested in which treatments differ from the control group.
The Tukey contrasts mentioned in the book (comparing each level with every other level) can easily be computed with the multcomp package:
However, this leads to 15 comparisons (and tests), and we may not be interested in all of them. Note that at $\alpha = 0.05$ we expect a false positive in 1 out of 20 tests of a true null hypothesis if we do not correct for multiple testing.
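To see why this matters, here is a back-of-the-envelope sketch. It assumes independent tests, which pairwise comparisons are not, so the number is only a rough indication; `fwer()` is a hypothetical helper:

```r
k <- 6                     # number of concentration levels (incl. control)
n_pairs <- choose(k, 2)    # all pairwise comparisons
alpha <- 0.05

# probability of at least one false positive in m independent tests
fwer <- function(m, alpha) 1 - (1 - alpha)^m
c(comparisons = n_pairs, fwer = fwer(n_pairs, alpha))
```

Under this simplification, the chance of at least one false positive among the uncorrected tests is above 50%.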
An alternative is to compare only the control group to the treatments. These are called Dunnett contrasts and lead to only 5 comparisons.
The syntax is the same, just change Tukey to Dunnett:
The column Estimate gives us the difference in means between the control and the respective treatments, and Std. Error the standard error of these estimates. Both are combined into a t value (= Estimate / Std. Error), from which we can get a p-value (Pr(>|t|)).
Note that the p-values are already corrected for multiple testing, as indicated at the bottom of the output.
If you want to change the correction method you can use:
summary(glht(mod, linfct = mcp(conc = 'Dunnett')), test = adjusted('bonferroni'))
This applies Bonferroni-correction, see ?p.adjust and ?adjusted for other methods.
Outlook
Warton & Hui (2011) demonstrated that the arcsine transform should not be used for either binomial or non-binomial proportions. Similarly, O’Hara & Kotze (2010) showed that count data should not be log-transformed.
In a future post I will show how to analyse this data without transformation using Generalized Linear Models (GLM), and perhaps some simulations showing that using GLMs can lead to increased statistical power for ecotoxicological data sets.
Note that I couldn’t find any reference to Generalized Linear Models in Newman (2012) and EPA (2002), although they have been around for 40 years now (Nelder & Wedderburn, 1972).
References
Warton, D. I., & Hui, F. K. (2011). The arcsine is asinine: the analysis of proportions in ecology. Ecology, 92(1), 3-10.
O’Hara, R. B., & Kotze, D. J. (2010). Do not log‐transform count data. Methods in Ecology and Evolution, 1(2), 118-122.
Newman, M. C. (2012). Quantitative ecotoxicology. Taylor & Francis, Boca Raton, FL.
EPA (2002). Methods for Measuring the Acute Toxicity of Effluents and Receiving Waters to Freshwater and Marine Organisms.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3), 370-384.
All these web resources provide additional data. Here is an example retrieving the molecular weights:
Via cactus
cactus(casnr, output = 'mw')

## [1] "98.9596" "53.0634" "290.8314"
Via ChemSpider
csid_to_ext(csid, token)$MolecularWeight

## [1] "98.95916" "53.06262" "290.82984"
Via Pubchem
cid_to_ext(cid)$mw

## [1] "98.959160" "53.062620" "290.829840"
ChemSpider and PubChem return the same values; the results from cactus, however, are slightly different.
Retrieve partitioning coefficients
Partition coefficients are another useful property. LOGKOW is a databank that contains experimental data, retrieved from the literature, on over 20,000 organic compounds.
get_kow() extracts the ‘Recommended values’ for a given CAS:
get_kow(casnr)

## [1] 1.48 0.25 4.14
This function is very crude. For example, it returns only the first hit if multiple hits are found in the database - a better way would be to ask for user input, as we did in the taxize package.
Outlook
Currently I have no time to extensively develop these functions.
I would be happy if someone picks up this work - it’s fairly easy: just fork the repo and start.
In the future this could be turned into an rOpenSci package, as it is within their scope.
It has been quiet here over the last months - the main reason is that I’m working on my master’s thesis.
I have already prepared some more examples from ‘Quantitative Ecotoxicology’, but I haven’t got around to publishing them here.
This post is about collinearity and its implications for linear models. The best way to explore this is by simulation - I create data with known properties and see what happens.
Create correlated random variables
The first problem for this simulation was: How can we create correlated random variables?
This simulation is similar to the one in Legendre & Legendre (2012), where it is also mentioned that one could use the Cholesky decomposition to create correlated variables:
What this function does:
Create two normal random variables (X1, X2 ~ N(0, 1))
Create the desired correlation matrix
Compute the Cholesky factorization of the correlation matrix (R)
Apply the resulting matrix to the two variables (this rotates, shears and scales the variables so that they are correlated)
Create a dependent variable y according to the model:
y ~ 5 + 7*X1 + 7*X2 + e with e ~ N(0, 1)
#############################################################################
# create two correlated variables and a dependent variable
# n : number of points
# p : correlation
datagen <- function(n, p){
  # random points N(0,1)
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  X <- cbind(x1, x2)

  # desired correlation matrix
  R <- matrix(c(1, p, p, 1), ncol = 2)

  # use cholesky decomposition `t(U) %*% U = R`
  U <- chol(R)
  corvars <- X %*% U

  # create dependent variable after model:
  # y ~ 5 + 7 * X1 + 7 * X2 + e | e ~ N(0,1)
  y <- 5 + 7 * corvars[, 1] + 7 * corvars[, 2] + rnorm(n, 0, 1)
  df <- data.frame(y, corvars)
  return(df)
}
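Why the Cholesky step gives the desired correlation: for uncorrelated columns with unit variance, the covariance of `X %*% U` is `t(U) %*% U`, which equals R by construction. A quick standalone check of both facts (the seed and sample size are arbitrary choices):

```r
p <- 0.8
R <- matrix(c(1, p, p, 1), ncol = 2)
U <- chol(R)                          # upper triangular factor
all.equal(t(U) %*% U, R)              # U reproduces R

set.seed(42)
X <- cbind(rnorm(1e5), rnorm(1e5))    # uncorrelated N(0, 1) columns
r_hat <- cor(X %*% U)[1, 2]           # sample correlation after transformation
r_hat
```

The sample correlation should be close to p, up to sampling error.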
Let’s see if it works.
This creates two variables with 1000 observations with a correlation of 0.8 between them and a dependent variable.
df1 <- datagen(n = 1000, p = 0.8)
The correlation between X1 and X2 is, as desired, nearly 0.8:
cor(df1)

##          y      X1      X2
## y  1.00000 0.94043 0.94715
## X1 0.94043 1.00000 0.79139
## X2 0.94715 0.79139 1.00000
And the data follows the specified model.
mod <- lm(y ~ X1 + X2, data = df1)
summary(mod)

## 
## Call:
## lm(formula = y ~ X1 + X2, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7375 -0.6712  0.0561  0.6225  2.9057 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0210     0.0311     161   <2e-16 ***
## X1            6.9209     0.0517     134   <2e-16 ***
## X2            7.0813     0.0498     142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.984 on 997 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.995 
## F-statistic: 9.14e+04 on 2 and 997 DF,  p-value: <2e-16

pairs(df1)
Methods to spot collinearity
Dormann et al. (2013) list eight methods to spot collinearity (see their Table 1). I will only show how to calculate two of those (but see the Appendix of the Dormann paper for code for all methods):
Dormann et al. (2013) found that ‘coefficients between predictor variables of r > 0.7 was an appropriate indicator for when collinearity begins to severely distort model estimation’.
Variance Inflation Factors
require(car)
vif(mod)

##     X1     X2 
## 2.6759 2.6759
Which is equivalent to (for variable X1):
sum <- summary(lm(X1 ~ X2, data = df1))
1 / (1 - sum$r.squared)

## [1] 2.6759
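With only two predictors, the R² of regressing X1 on X2 is just their squared correlation, so the VIF can also be computed directly from the correlation matrix. A short sketch using the rounded correlation printed by `cor(df1)`:

```r
r <- 0.79139               # cor(X1, X2), as printed by cor(df1)
vif_x1 <- 1 / (1 - r^2)    # VIF = 1 / (1 - R^2), with R^2 = r^2
vif_x1
```

This agrees with the `vif()` output up to rounding, and it also shows how quickly the VIF grows as the correlation approaches 1.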
Unfortunately there are many ‘rules of thumb’ associated with VIF: > 10, > 4, …
Simulation 1: How does collinearity affect the precision of estimates?
Here we simulate datasets with correlations ranging from -0.8 to 0.8:
ps <- seq(-0.8, 0.8, 0.005)
n <- 1000
sim1 <- lapply(ps, function(x) datagen(n = n, p = x))
Next we fit a linear model to each of the datasets and extract the coefficients table:
# Function to fit and extract coefficients
regfun <- function(x){
  return(summary(lm(y ~ X1 + X2, data = x))$coefficients)
}
# run regfun on every dataset
res1 <- lapply(sim1, regfun)
Finally we extract the Standard Errors and plot them against the degree of correlation:
# extract Standard Errors from results
ses <- data.frame(ps, t(sapply(res1, function(x) x[, 2])))
# plot
require(ggplot2)
require(reshape2)
ses_m <- melt(ses, id.vars = "ps")
ggplot(ses_m, aes(x = ps, y = value)) +
  geom_point(size = 3) +
  facet_wrap(~variable) +
  theme_bw() +
  ylab("Std. Error") +
  xlab("Correlation")
It can be clearly seen that collinearity inflates the standard errors of the correlated variables. The intercept is not affected.
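This is also what theory suggests. A rough sketch, assuming (as in datagen()) predictors with unit variance and a residual standard deviation of 1, so that the standard error of a slope is approximately sigma / sqrt(n * (1 - p^2)); `se_approx()` is a hypothetical helper, not part of the simulation code:

```r
# approximate standard error of a slope for two predictors with correlation p
se_approx <- function(p, n, sigma = 1) sigma / sqrt(n * (1 - p^2))

ps_check <- c(0, 0.4, 0.8)
data.frame(p = ps_check, se = se_approx(ps_check, n = 1000))
```

For p = 0.8 and n = 1000 this gives roughly 0.053, in line with the standard errors of X1 and X2 in the model summary further up.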
With large standard errors, the parameter estimates also become more variable, which is demonstrated in the next simulation:
Simulation 2: Are estimated parameters stable under collinearity?
Here I simulate 100 datasets for each of a range of correlations in order to see how stable the estimates are.
The code creates the datasets, fits a linear model to each dataset and extracts the estimates.
n_sims <- 100
ps <- seq(-0.8, 0.8, 0.05)

# function to generate n_sims datasets, for a given p
sim2 <- function(p){
  sim <- lapply(1:n_sims, function(x) datagen(n = 1000, p = p))
  res <- lapply(sim, regfun)
  est <- t(sapply(res, function(x) x[, 1]))
  out <- data.frame(p, est)
  return(out)
}

res2 <- lapply(ps, function(x){
  return(sim2(x))
})
Then we create a plot of the estimates for the three coefficients; each boxplot represents 100 datasets.
The red line indicates the true coefficients used to generate the data (7 * X1 and 7 * X2). We see that the spread of the estimates increases as the correlation gets stronger, in either direction.
This is confirmed by looking at the standard deviation of the estimates:
sds <- data.frame(ps, t(sapply(res2, function(x) apply(x[, 2:4], 2, sd))))
sds_m <- melt(sds, id.vars = "ps")
ggplot(sds_m, aes(x = ps, y = value)) +
  geom_point(size = 3) +
  facet_wrap(~variable) +
  theme_bw() +
  ylab("SD of estimates from 100 simulations") +
  xlab("Correlation")
If the standard errors are large enough, the parameter estimates may become so variable that even their sign changes.
References
Carsten Dormann, Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime Marquéz, Bernd Gruber, Bruno Lafourcade, Pedro Leitão, Tamara Münkemüller, Colin McClean, Patrick Osborne, Björn Reineking, Boris Schröder, Andrew Skidmore, Damaris Zurell, Sven Lautenbach (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27-46. doi:10.1111/j.1600-0587.2012.07348.x
Pierre Legendre, Louis Legendre, (2012) Numerical ecology.