Eduard Szöcs

Data in Environmental Science and Eco(toxico-)logy

Quantitative Ecotoxicology, Page 33, Example 2.1, Winsorization

Get the data (Sulfate Concentrations from Savannah River (South Carolina) in mg / L)) from here and read it into R:

ALL <- read.table("p33.csv", header = TRUE, sep = ";")

So we have a data.frame with one variable and 21 observations:

str(ALL)
## 'data.frame':  21 obs. of  1 variable:
##  $ SO4: num  1.3 2.3 2.6 3.3 3.5 3.5 3.6 4 4.1 4.5 ...
ALL$SO4
##  [1] 1.3 2.3 2.6 3.3 3.5 3.5 3.6 4.0 4.1 4.5 5.2 5.6 5.7 6.1 6.2 6.5 6.9
## [18] 7.1 7.7 7.9 9.9

Winsorization replaces extreme data values with less extreme values. I have written a small function to run the winsorisation:

winsori <- function(x, width = 2) {
    # check if sorted
    if (is.unsorted(x)) 
        stop("Values must be sorted!")
    # get number of observations
    n <- length(x)
    # Replace lowest
    x[1:width] <- x[width + 1]
    # replace highest
    x[(n - width + 1):n] <- x[(n - width)]
    x
}

The function takes a ordered vector and replaces the 2 highest and 2 lowest values (can be changed by the ‘width’-Argument by their neighbors.

We can apply this function to our data and safe it as new column:

ALL$SO4_win <- winsori(ALL$SO4)
# display the first and 5 last rows
ALL[c(1:5, 17:21), ]
##    SO4 SO4_win
## 1  1.3     2.6
## 2  2.3     2.6
## 3  2.6     2.6
## 4  3.3     3.3
## 5  3.5     3.5
## 17 6.9     6.9
## 18 7.1     7.1
## 19 7.7     7.7
## 20 7.9     7.7
## 21 9.9     7.7

Worked as expected. The Winsorized mean and standard-deviation is:

# mean
mean(ALL$SO4_win)
## [1] 5.081
# standard deviation
sd(ALL$SO4_win)
## [1] 1.792

For the Winsorized Standard Deviation we need again a homemade function:

sw <- function(x, width = 2) {
    n <- length(x)
    sd(x) * (n - 1)/(n - 2 * width - 1)
}
sw(ALL$SO4_win)
## [1] 2.24

And lastly we calculate the mean for the trimmed data (remove two observation from each tail):

mean(ALL$SO4, trim = 2/21)
## [1] 5.065

Code and data are available at my github-repo under filename ‘p33’.

Written on December 1, 2012