Data in Environmental Science and Eco(toxico-)logy
Species Sensitivity Distributions (SSD) with R
Species Sensitivity Distributions (SSDs) are a central tool in ecological risk assessment (ERA).
Species show different sensitivities to chemicals and the variation between
species can be described by a statistical distribution.
A concentration at which x% of species are affected, the hazardous concentration ($HC_x$), can be derived from an SSD.
Usually, an HC5 is derived (with 95% confidence interval) and used in ERA.
In this example, I will generate an SSD for the insecticide chlorpyrifos.
SSDs are generated from toxicity-test data (such as EC50 or LC50 values).
Such data are available, for example, from the US EPA ECOTOX database.
I prepared some data from the US EPA ECOTOX database for this post.
I will skip the data cleaning and quality checks here. Note, however, that this data has not
been checked thoroughly and was prepared for demonstration purposes only;
cleaning and checking your data is a crucial step in every data analysis.
You can read the data into R with these three lines:
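A minimal sketch of the import; the file name, separator and column names below are placeholders, not the actual file from the post, and a tiny stand-in CSV is written first so the snippet runs on its own:

```r
# write a tiny stand-in CSV so the snippet runs on its own;
# with the real data, point read.table() at your local ECOTOX export
writeLines(c("species,val", "sp1,0.12", "sp2,1.3", "sp3,5.1"), "ssd_data.csv")
df <- read.table("ssd_data.csv", header = TRUE, sep = ",")
head(df)
```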
A first look at the data
SSDs are typically displayed as a plot showing the fraction of affected species
on the y-axis and the concentration on the x-axis.
To calculate the fraction affected we order the species by their toxicity values and then calculate the fraction:
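As a sketch, with a toy data frame standing in for the ECOTOX values (the column names `species` and `val` and the toy concentrations are my assumptions):

```r
# toy toxicity values; with the real data, use your ECOTOX export
df <- data.frame(species = paste0("sp", 1:10),
                 val = c(0.12, 0.5, 1.3, 2.2, 3.5, 5.1, 8.0, 12, 20, 35))
# order species by their toxicity values
df <- df[order(df$val), ]
# fraction affected as plotting positions: (rank - 0.5) / n
df$frac <- ppoints(nrow(df), 0.5)
df
```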
Then we can take a first look at the data:
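A first plot could look like this (setup repeated with toy data so the snippet runs on its own; the axis labels are my choice):

```r
vals <- sort(c(0.12, 0.5, 1.3, 2.2, 3.5, 5.1, 8.0, 12, 20, 35))  # toy data
frac <- ppoints(length(vals), 0.5)
# fraction affected vs. concentration, on a log-scaled x-axis
plot(vals, frac, log = "x", pch = 16,
     xlab = "Concentration", ylab = "Fraction of species affected")
```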
Fitting a distribution to the data
To fit a distribution to the data we can use the fitdistr() function from the MASS package or
the more flexible fitdist() from the fitdistrplus package (there are also others).
I will use the MASS package here to fit a lognormal distribution to this data.
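A sketch of the fit; `fitdistr()` ships with MASS, and the simulated values below merely stand in for the real toxicity data:

```r
library(MASS)

set.seed(42)
vals <- rlnorm(20, meanlog = 1, sdlog = 1.2)  # stand-in for the toxicity values

# maximum-likelihood fit of a lognormal distribution
fit <- fitdistr(vals, "lognormal")
fit$estimate  # meanlog and sdlog
```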
The mean (meanlog) and standard deviation (sdlog) of the lognormal distribution were estimated from the data.
We could fit and compare several distributions (e.g. by AIC), but I will stick with the lognormal here.
From the estimated parameters of the fitted distribution we can easily extract the HC5.
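The HC5 is just the 5% quantile of the fitted lognormal (fit on stand-in data repeated so the snippet runs on its own):

```r
library(MASS)

set.seed(42)
vals <- rlnorm(20, meanlog = 1, sdlog = 1.2)  # stand-in for the toxicity values
fit <- fitdistr(vals, "lognormal")

# HC5 = concentration at which 5% of species are affected
hc5 <- qlnorm(0.05, meanlog = fit$estimate["meanlog"],
                    sdlog   = fit$estimate["sdlog"])
hc5
```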
To be more conservative the lower limit of the confidence interval (CI) around the HC5 is sometimes used.
The lower limit of the CI can be estimated from the data using parametric bootstrap.
The idea is:

1. generate random values from the fitted distribution
2. fit the distribution to these random values
3. estimate the HC5 from this new fit
4. repeat many times to assess the variability of the HC5 values
Alternatively, a non-parametric bootstrap could be used (resampling from the data rather than from the fitted distribution).
In R we write a function (myboot()) that performs steps 1-3 for us.
We then run this function 1000 times and take quantiles of the bootstrapped HC5 values:
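The bootstrap could be sketched like this; `myboot()` follows the steps above, while the stand-in data and the choice of a 95% interval are my assumptions:

```r
library(MASS)

set.seed(42)
vals <- rlnorm(20, meanlog = 1, sdlog = 1.2)  # stand-in for the toxicity values
fit <- fitdistr(vals, "lognormal")

# one parametric bootstrap iteration (steps 1-3)
myboot <- function(fit, p) {
  # 1. generate random values from the fitted distribution
  x <- rlnorm(fit$n, meanlog = fit$estimate["meanlog"],
                     sdlog   = fit$estimate["sdlog"])
  # 2. fit the lognormal distribution to these random values
  fitb <- fitdistr(x, "lognormal")
  # 3. estimate the HC_p from this new fit
  qlnorm(p, meanlog = fitb$estimate["meanlog"],
            sdlog   = fitb$estimate["sdlog"])
}

# 4. repeat 1000 times and take quantiles of the bootstrapped HC5 values
hc5_boot <- replicate(1000, myboot(fit, p = 0.05))
quantile(hc5_boot, probs = c(0.025, 0.5, 0.975))
```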
So for this data and the lognormal distribution the HC5 would be 0.096 with a CI of [0.046; 0.214].
A fancy plot
Finally, I generate a fancy SSD plot showing the fitted distribution (red), the bootstrapped curves (blue), the CI (dashed) and the raw data (dots).
Note that the bootstrap function is slightly different here: for this plot I wanted each bootstrap sample to yield its own (blue) line.
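A base-graphics sketch of such a plot; the original uses its own styling, and `myboot2()` is a hypothetical variant of the bootstrap function that returns a whole fitted curve per iteration, so each bootstrap sample can be drawn as a line:

```r
library(MASS)

set.seed(42)
vals <- sort(rlnorm(20, meanlog = 1, sdlog = 1.2))  # stand-in data
frac <- ppoints(length(vals), 0.5)
fit  <- fitdistr(vals, "lognormal")

# concentration grid for the fitted curves
newx <- 10^seq(log10(min(vals)), log10(max(vals)), length.out = 200)

# bootstrap variant returning a whole fitted curve per iteration
myboot2 <- function(fit) {
  x <- rlnorm(fit$n, meanlog = fit$estimate["meanlog"],
                     sdlog   = fit$estimate["sdlog"])
  fitb <- fitdistr(x, "lognormal")
  plnorm(newx, meanlog = fitb$estimate["meanlog"],
               sdlog   = fitb$estimate["sdlog"])
}
boots <- replicate(100, myboot2(fit))  # 200 x 100 matrix of curves

plot(vals, frac, log = "x", pch = 16,
     xlab = "Concentration", ylab = "Fraction of species affected")
# bootstrapped curves (blue, semi-transparent)
matlines(newx, boots, col = adjustcolor("blue", 0.05), lty = 1)
# prediction from the fitted distribution (red)
lines(newx, plnorm(newx, fit$estimate["meanlog"], fit$estimate["sdlog"]),
      col = "red", lwd = 2)
# dashed CI: pointwise quantiles of the bootstrapped curves
ci <- apply(boots, 1, quantile, probs = c(0.025, 0.975))
matlines(newx, t(ci), col = "black", lty = 2)
points(vals, frac, pch = 16)  # redraw the raw data on top
```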
**Update, 9 October 2015**
Writing a bootstrap function by hand is straightforward, but you have to maintain your own code and understand what is going on.
The fitdistrplus package makes the steps above even easier:
1. fit the lognormal distribution (this gives results identical to those above)
2. extract the HC5
3. bootstrap the CI
This performs a parametric bootstrap; for a non-parametric bootstrap set bootmethod = 'nonparam'.
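The three steps can be sketched together; the function names come from fitdistrplus, while the simulated stand-in values are my assumption:

```r
library(fitdistrplus)

set.seed(42)
vals <- rlnorm(20, meanlog = 1, sdlog = 1.2)  # stand-in for the toxicity values

# 1. fit the lognormal distribution
fit <- fitdist(vals, "lnorm")

# 2. extract HC5 (the 5% quantile of the fitted distribution)
quantile(fit, probs = 0.05)

# 3. parametric bootstrap CI
#    (set bootmethod = "nonparam" for a non-parametric bootstrap)
bs <- bootdist(fit, bootmethod = "param", niter = 1000)
quantile(bs, probs = 0.05)
```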