Eduard Szöcs

Data in Environmental Science and Eco(toxico-)logy

Species Sensitivity Distributions (SSD) with R

Introduction

Species Sensitivity Distributions (SSD) are a central tool for ecological risk assessment (ERA). Species show different sensitivities to chemicals and the variation between species can be described by a statistical distribution. A concentration at which x% of species are affected can be derived from SSDs (= $HC_x$ ). Usually, an HC5 is derived (with 95% confidence interval) and used in ERA.

Data

In this example, I will generate an SSD for the insecticide Chlorpyrifos (CAS 2921-88-2).

SSDs are generated using data of toxicity experiments (like EC_50 / LC50 values). Such kind of data is available e.g. from US EPA ECOTOX database, ECHA or ETOX.

I prepared some data from the US EPA ECOTOX database for this post. I will skip the data cleaning and data quality checks here - but note, this data has not been checked thoroughly and prepared only for demonstration purposes. However, data cleaning and checking is a very important step for every data analysis.

You can read the data into R with these three lines

# download data from github
require(RCurl)
url <- getURL("https://raw.githubusercontent.com/EDiLD/r-ed/master/post_ssd/ssd_data.csv",
    ssl.verifypeer = FALSE)
df <- read.table(text = url, header = TRUE, sep = ',', stringsAsFactors = FALSE)

A first look at the data

SSDs are typically displayed as a plot showing the fraction of affected species on the y-axis and the concentration on the x-axis.

To calculate the fraction affected we order the species by their toxicity values and then calculate the fraction:

df <- df[order(df$val), ]
df$frac <- ppoints(df$val, 0.5)

Then we can take a first look at the data:

require(ggplot2)
ggplot(data = df) +
  geom_point(aes(x = val, y = frac), size = 5) +
  geom_text(aes(x = val, y = frac, label = species), hjust = 1.1, size = 4) +
  theme_bw() +
  scale_x_log10(limits = c(0.0075, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')), 
       y = 'Fraction of species affected')

plot of chunk ssd_plot_raw

Fitting a distribution to the data

To fit a distribution to the data we can use the fitdistr() function from the MASS package or the more flexible fitdist() from the fitdistrplus package (there are also others). I will use the MASS package here to fit a lognormal distribution to this data.

require(MASS)
fit <- fitdistr(df$val, 'lognormal')
fit
##    meanlog     sdlog 
##   0.20235    1.55028 
##  (0.26204)  (0.18529)

The mean (meanlog) and standard deviation (sdlog) of the lognormal distribution were estimated from the data. We could fit and compare (e.g. by AIC) different distributions, but I stick with the lognormal here.

Derive HC5

From the estimated parameters of the fitted distribution we can easily extract the HC5.

(hc5 <- qlnorm(0.05, meanlog = fit$estimate[1], sdlog = fit$estimate[2]))
## [1] 0.095596

parametric Bootstrap

To be more conservative the lower limit of the confidence interval (CI) around the HC5 is sometimes used. The lower limit of the CI can be estimated from the data using parametric bootstrap.

The idea is:

  • generate random values from the fitted distribution
  • fit to these random values the distribution
  • estimate the HC5 from this new distribution
  • repeat many times to assess the variability of HC5 values

Alternatively, also non-parametric bootstrap could be used (resample from the data, not from the fitted distribution).

In R we write a function (myboot()) that does steps 1-3 for us:

myboot <- function(fit, p){
  # resample from fitted distribution
  xr <- rlnorm(fit$n, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
#   # for a nonparametric boostrap resample from data
#   xr <- sample(df$val, length(df$val), replace = TRUE)
  # fit distribition to new data
  fitr <- fitdistr(xr, 'lognormal')
  # return HCp
  hc5r <- qlnorm(p, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
  return(hc5r)
}

We repeat this function 1000 times and get the quantiles of the bootstrapped HC5 values:

set.seed(1234)
hc5_boot <- replicate(1000, myboot(fit, p = 0.05))
quantile(hc5_boot, probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 0.046027 0.102427 0.214411

So for this data and the lognormal distribution the HC5 would be 0.096 with a CI of [0.046; 0.214].

A fancy plot

Finally, I generate a fancy SSD plot with predictions (red), with bootstrapped values (blue), CI (dashed) and the raw data (dots):

myboot2 <- function(fit, newxs){
  # resample
  xr <- rlnorm(fit$n, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
  # fit to resample
  fitr <- fitdistr(xr, 'lognormal')
  # predict for new data
  pyr <- plnorm(newxs, meanlog = fitr$estimate[1], sdlog = fitr$estimate[2])
  return(pyr)
}
# newdata to predict
newxs <- 10^(seq(log10(0.01), log10(max(df$val)), length.out = 1000))
boots <- replicate(1000, myboot2(fit, newxs))
 
# fancy plot
require(reshape2)
# extract boostrap values
bootdat <- data.frame(boots)
bootdat$newxs <- newxs
bootdat <- melt(bootdat, id = 'newxs')
# extract CI
cis <- apply(boots, 1, quantile, c(0.025, 0.975))
rownames(cis) <- c('lwr', 'upr')
# add fitted values
pdat <- data.frame(newxs, py = plnorm(newxs, meanlog = fit$estimate[1], sdlog = fit$estimate[2]))
# add CI
pdat <- cbind(pdat, t(cis))
# x koordinates for species names (better use lower Ci instead of fit...)
df$fit <- 10^(log10(qlnorm(df$frac, meanlog = fit$estimate[1], sdlog = fit$estimate[2])) - 0.4)
# plot
ggplot() +
  geom_line(data = bootdat, aes(x = newxs, y = value, group = variable), col = 'steelblue', alpha = 0.05) + 
  geom_point(data = df, aes(x = val, y = frac)) +
  geom_line(data = pdat, aes(x = newxs, y = py), col = 'red') +   
  geom_line(data = pdat, aes(x = newxs, y = lwr), linetype = 'dashed') + 
  geom_line(data = pdat, aes(x = newxs, y = upr), linetype = 'dashed') + 
  geom_text(data = df, aes(x = fit, y = frac, label = species), hjust = 1, size = 4) +
  theme_bw() +
  scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000), limits = c(0.003, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')), 
       y = 'Fraction of species affected')

plot of chunk ssd_plot

Note, that the bootstrap function is slightly different: this is because for this plot I wanted to have for each bootstrap a different (blue) line.

fitdistrplus

** Update on 9. October 2015 **

Although writing a bootstrap function is straightforward, you need to write your own function and understand what’s going on. The fitdistrplus package makes the above steps even more easy:

  • 1 - fit the lognormal distribution
require(fitdistrplus)
fit2 <- fitdist(df$val, 'lnorm')
fit2
## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters:
##         estimate Std. Error
## meanlog  0.20235    0.26204
## sdlog    1.55028    0.18529

This gives identical results as above

  • 2 - extract HC5
hc5_2 <- quantile(fit2, probs = 0.05)
hc5_2
## Estimated quantiles for each specified probability (non-censored data)
##            p=0.05
## estimate 0.095596
  • 3 - bootstrap CI
fit2_boot <- bootdist(fit2, bootmethod = 'param', niter = 1000)
quantile(fit2_boot, probs = 0.05)
## (original) estimated quantiles for each specified probability (non-censored data)
##            p=0.05
## estimate 0.095596
## Median of bootstrap estimates
##           p=0.05
## estimate 0.10554
## 
## two-sided 95 % CI of each quantile
##          p=0.05
## 2.5 %  0.046434
## 97.5 % 0.226439

This performs a parametric bootstrap, for a non-parametric boostrap set bootmethod = 'nonparam'.

  • 4- plotting is again a bit cumbersome
# predict new distributions on a grid of 1000
newxs <- 10^(seq(log10(0.01), log10(max(df$val)), length.out = 1000))
pp <- apply(fit2_boot$estim, 1, function(x) plnorm(newxs, x[1], x[2]))
bootdat <- data.frame(pp)
# each column is the result from a bootstrap sample
 
# add x-values
bootdat$newxs <- newxs
require(reshape2)
# bring to long format
bootdat <- melt(bootdat, id.vars = 'newxs')
 
# get Ci from bootstraps
cis <- apply(pp, 1, quantile, c(0.025, 0.975))
rownames(cis) <- c('lwr' ,'upr')
pdat <- data.frame(newxs, py = plnorm(newxs, meanlog = fit$estimate[1], sdlog = fit$estimate[2]))
# add CI
pdat <- cbind(pdat, t(cis))
 
# add x coordinates for species names from fitted values
df$fit <- 10^(log10(qlnorm(df$frac, meanlog = fit2$estimate[1], sdlog = fit2$estimate[2])) - 0.4)
 
# plot it!
ggplot()+
  geom_line(data = bootdat, aes(x = newxs, y = value, group = variable), col = 'steelblue', alpha = 0.05) + 
  geom_point(data = df, aes(x = val, y = frac)) +
  geom_line(data = pdat, aes(x = newxs, y = py), col = 'red') +   
  geom_line(data = pdat, aes(x = newxs, y = lwr), linetype = 'dashed') + 
  geom_line(data = pdat, aes(x = newxs, y = upr), linetype = 'dashed') + 
  geom_text(data = df, aes(x = fit, y = frac, label = species), hjust = 1, size = 4) +
  theme_bw() +
  scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000), limits = c(0.003, max(df$val))) +
  labs(x = expression(paste('Concentration of Chlorpyrifos [ ', mu, 'g ', L^-1, ' ]')), 
       y = 'Fraction of species affected')

plot of chunk ssdplot2

Written on October 9, 2015