Eduard Szöcs

Data in Environmental Science and Eco(toxico-)logy

Regulatory Acceptable Concentrations (RAC)

Intro

The German Environment Agency (UBA) recently published a list of regulatory acceptable concentrations (RAC) for 108 pesticides [Link]. I used it already in a previous project (paper is currently under review) and here I show how we can use R to:

  1. Download the list
  2. Digitize the pdf
  3. Retrieve additional data
  4. Create summary statistics to explore the RACs

I will use two ROpenSci packages for this tasks:

  1. tabulizer to digitize the pdf and
  2. my webchem package to retrieve information on the pesticides.

Download the data

First, we need to get the RAC-list from ETOX.

Here I download the file to a temporary file:

rac_original <- tempfile()
download.file('https://webetox.uba.de/webETOX/public/basics/literatur/download.do?id=284', 
                destfile = rac_original)

Digitize the data

Next we need to get the pdf into a tabular format. I use tabulizer for this task. We digitize the pdf table using extract_tables():

library("tabulizer")
rac_list <- extract_tables(rac_original)
str(rac_list)
## List of 2
##  $ : chr [1:71, 1:11] "1 " "2 " "3 " "4 " ...
##  $ : chr [1:38, 1:11] "70 " "71 " "72 " "73 " ...

This gives two lists with matrices because there are two pages. We combine both

rac_list <- do.call(rbind, rac_list)
str(rac_list)
##  chr [1:109, 1:11] "1 " "2 " "3 " "4 " "5 " "6 " "7 " ...

We see that there are 109 instead of 108 lines. Inspecting the raw pdf and the table we see that there is an addtional for Diuron with 2 CAS numbers and no indentifer in the first column. I simply delete this row:

rac_list <- rac_list[!rac_list[ ,1] == '', ]

Clean the data

Some more cleaning is needed before we can use this data efficiently:

  1. Delete unused columns
  2. Coerce to data.frame
# keep only selected columns and make data.frame
rac <- data.frame(rac_list[ , c(1, 2, 4, 6)], stringsAsFactors = FALSE)
head(rac)
##   X1              X2                      X3    X4
## 1 1         2,4‐D H      94‐75‐7 12.08.2015   1,1 
## 2 2   Acequinocyl I   57960‐19‐7 28.10.2015     9 
## 3 3   Acetamiprid I  135410‐20‐7 16.10.2015  0,24 
## 4 4     Aclonifen H   74070‐46‐5 24.09.2015  1,06 
## 5 5  Azoxystrobin F  131860‐33‐8 14.08.2015  0,55 
## 6 6   Benalaxyl‐M F   71626‐11‐4 14.08.2015    20
  1. split columns missed by tabulizer.
# split columns X2 and X3
library("tidyr")
rac <- separate(data = rac, col = X2, into = c("name", "type"), sep = -3)
rac <- separate(data = rac, col = X3, into = c("CAS", "date"), sep = " ")
# remove date 
rac$date <- NULL
head(rac)
##   X1          name type         CAS    X4
## 1 1         2,4‐D    H      94‐75‐7  1,1 
## 2 2   Acequinocyl    I   57960‐19‐7    9 
## 3 3   Acetamiprid    I  135410‐20‐7 0,24 
## 4 4     Aclonifen    H   74070‐46‐5 1,06 
## 5 5  Azoxystrobin    F  131860‐33‐8 0,55 
## 6 6   Benalaxyl‐M    F   71626‐11‐4   20
  1. Check CAS numbers
# this is needed to fix an encoding problem
rac$CAS <- gsub('‐', '-', rac$CAS)
library(webchem)
# check if all are valid cas numbers
# Note in a future release of webchem is.cas will be vectorized!
which(vapply(rac$CAS, is.cas, FUN.VALUE = TRUE) == FALSE)
## 16.10.2015 
##         39

One CAS is wrong (date in stead of CAS), which I set to NA

rac$CAS[39] <-  NA
  1. coerce rac to numeric (currently it is text)
rac$rac <- as.numeric(gsub(",", ".", rac$X4))
  1. final cleanup
# select columns
rac <- rac[ , c('name', 'CAS', 'rac')]
# remove substances without rac
rac <- rac[!is.na(rac$rac), ]
# trim whitespaces (leading & trailling)
rac$name <- gsub("^\\s+|\\s+$", "", rac$name)
# set cas for dodin to 2439-10-3
rac$CAS[rac$name == 'Dodin'] <- '2439-10-3'
head(rac)
##           name         CAS   rac
## 1        2,4‐D     94-75-7  1.10
## 2  Acequinocyl  57960-19-7  9.00
## 3  Acetamiprid 135410-20-7  0.24
## 4    Aclonifen  74070-46-5  1.06
## 5 Azoxystrobin 131860-33-8  0.55
## 6  Benalaxyl‐M  71626-11-4 20.00
# # save
# write.table(rac, file = '../files/rac.csv', sep = ';', row.names = FALSE)

You can download the cleaned file here.

Retrieve additional information

I search for additional information, like the activity, in Alan Wood’s Compendium of Pesticide Common Names:

aw_results <- aw_query(rac$CAS, type = 'cas')

This worked very well and we found information for all substances except for a Prothioconazole metabolite

# Check for which CAS no data (=NA) was found
which(vapply(aw_results, function(y) length(y) == 1, TRUE))
## 120983-64-4 
##          90

I remove this from the results and the rac list.

rac <- rac[!rac$CAS == '120983-64-4', ]
aw_results <- aw_results[!names(aw_results) == '120983-64-4']

Finally, I extract the subactivity and add it to the rac table:

rac$subactivity <- vapply(aw_results, function(y) y[['subactivity']][1], 'x')

From this I extract the activity and build 4 groups (insecticides, herbicides, fungicides and others):

rac$type <- gsub('.* (.*)$', '\\1', rac$subactivity)
# fix
rac$type[rac$type == 'insecticidesmolluscicides'] <- 'insecticides'
# build groups
rac$type <- ifelse(!rac$type %in% c('herbicides', 'fungicides', 'insecticides'),
                   'other',
                   rac$type)

Next, I search the Physprop database for $K_{OW}$ values.

pp_results <- pp_query(rac$CAS)
# Check for which CAS no data was found
which(vapply(pp_results, function(y) length(y) == 1, TRUE))
## 581809-46-3 500008-45-7 736994-63-1 163515-14-8 149961-52-4 158062-67-0 
##           9          15          23          32          35          44 
##  83066-88-0 658066-35-4 173159-57-4 144550-36-7 140923-17-7 374726-62-2 
##          45          50          56          59          61          66 
## 106700-29-2 137641-05-5 117428-22-5 178928-70-6 131929-63-0 203313-25-1 
##          81          82          83          89          94          95 
## 102851-06-9 153719-23-4 
##          97         101

We could not find data for 20 substances and I set their value to NA.

rac$kow <- sapply(pp_results, function(y) {
    # substances without data: NA
    if (length(y) == 1 && is.na(y))
      return(NA)
    out <- y$prop$value[y$prop$variable == 'Log P (octanol-water)']
    # substances without KOW: NA
    ifelse(length(out) == 0, NA, out)
  })

Summarize data

All the code above was just done to create few summary statistics…

Here the RACs splitted by type:

library('ggplot2')
ggplot(rac, aes(y = rac, x = type)) +
  geom_boxplot(fill = 'grey75') +
  geom_jitter(width = 0.2) +
  scale_y_log10() +
  theme_bw() +
  labs(x = '', y = 'RAC [ug/L]')

plot of chunk plot_type

We clearly see that insecticides have much lower RACs than the other groups of pesticides. However, there is no strong relationship between the $log~K_{OW}$ and the RAC for the different types.

ggplot(rac, aes(y = rac, x = kow, col = type)) +
  geom_point() +
  scale_y_log10() +
  theme_bw() +
  facet_wrap(~type) +
  labs(x = expression('log'~K[OW]), y = 'RAC [ug/L]')

plot of chunk plot_kow

Written on December 5, 2016