Data in Environmental Science and Eco(toxico-)logy
Build a local version of the EPA ECOTOX database
Databases of toxicity data (like EC50 of a species towards a chemical) play an important role in ecotoxicology: Their data is used to construct SSDs, to calculate Toxic Units (TU) or for prior information to experimental design.
Several such databases are available, maintained by different institutions and with a possible overlap.
The most comprehensive database (with more than 600,000 tests) may be the US EPA ECOTOX database.
Although you can query the ECOTOX database via the web interface, this is limited by the number of rows, your internet connection and maybe you need a more customized query.
Luckily, the US EPA provides all the data as download (here) and also updates it regularly (every 3 months).
In this post I will describe how you can build a local (= on your own computer / server) version of the US EPA ECOTOX database.
To follow this post you will need
basic R knowledge - I write everything as an R script because later on, I will also use R to analyse this data.
basic SQL knowledge - I will store the data in a PostgreSQL database. Basically, this is all SQL (wrapped into R)
1) Download the US EPA ECOTOX data from here.
2) Execute the .exe file (=Unzip the data to a folder). This will result in a folder (ecotox_ascii_06_15_2016) with the following contents:
The unpacked folder contains documentation of the downloaded data (AsciiDownloadHelp5.pdf and ECOTOX_ASCII_dd.html),
release notes (release_notes_06_15_2016.txt) and the data tables:
chemical_carriers.txt : Information pertaining to the carrier and/or positive control chemicals
reported for the test.
dose_response_details.txt: Detail dose response records
dose_response_links.txt: Ties dose response to its endpoint
dose_responses.txt: Parent dose response record containing sample size, effect measurement,
response site, observation duration etc.
doses.txt: Information pertaining to the dose-response dose.
media_characteristics.txt: Water chemistry and media characteristics parameters.
results.txt : Information pertaining to the endpoint or non-endpoint result or dose-
tests.txt: Information pertaining to the experimental design.
The subfolder ~/validation contains lookup tables.
For more information see the documentation files.
Get a running PostgreSQL instance:
I use PostgreSQL as database, but the following can be easily rewritten to other database systems.
You can install a local database on your computer or a server, see here or here for installation tutorials.
Create a database named 'ecotox'.
Copy the data
I will copy all downloaded tables to the database.
For this, I will use the RPostgreSQL package (which saves me writting up all the CREATE TABLE statements…).
First we specify the path to the downloaded and extracted folder (ecotox_ascii_06_15_2016)
Next we set the server details:
And connect to the database from R:
Now we are ready to copy the data:
First the 8 data tables.
This and the next steps may take some time…
Now that we have the tables in our database, I add primary keys to them (see also documentation):
For better performance of queries I also add (lots of) indexes to columns that may be used for joins or filters:
All tables are copied to the 'public' SCHEMA.
However, I prefer to keep the data in schemas other than public and move all tables to a schema named 'ecotox'
Now the lookup tables in the /validation folder.
The workflow is the same as above…
For data-cleaning I also created custom lookup tables to convert durations and concentrations, as well as to unify groups.
You can download these tables here (in the folder /data/conversions).
Use them at your own risk!
I have some more of these tables available (contact me if interested) and you can also easily create the for your needs.
We follow the same procedure for these files:
Custom PostgreSQL Functions
Some concentrations are stored as text in the database, to convert to numeric values I provide you a with function to convert these to numerics (see here):
This function tries to convert to a numeric and if this is not possible returns NULL.
Now all data of the US EPA ECOTOX database is in our database.
But before we quit the connection to do some maintenance:
Let’s see what we can do now with this database.
Query 1: All EC50/LC50 data for Chlorpyrifos towards D. magna, irrespective of effect, test duration, etc…
The first query is a rather simple one:
I query all LC50/EC50 values of D.manga for Chlorpyrifos (CAS#: 2921-88-2):
We get 41 entries, from different exposure times and endpoints.
There are many variables stored in the database! -
Here I show only a small subset of test duration, endpoint (e.g. $EC_x$), effect (e.g. mortality, growth, etc),
concentration type (e.g active ingredient or formulation)…
We can further filter to allow only acute tests (48h duration) and allow only Mortality (MOR) or Intoxication (ITX) effects.
The first condition is AFAIK not possible with the web-interface of the US EPA.
Which reduces the data to 19 entries.
You might note, that there are different units used.
It might be desirable to harmonize these to ug/L.
We can do that using the lookup table unit_convert I provide.
Moreover, I add another condition to take only active ingredients (results.conc1_type = 'A', see table concentration_type_codes for description of types).
You might notice that the queries can get quickly complicated (and SQL knowledge indispensable).
I have a custom query with several hundred lines of code…
I encourage everybody to explore the entries in this database and to understand their relationship and meaning.
It is a huge resource for ecotoxicologists!
Here’s a plot of the (log) distribution of EC50 values:
We see that the values spread over several orders of magnitude…
Query 2: All acute EC50/LC50 for Chlorpyrifos to build an SSD.
In a previous post I showed how to calculate Species Sensitivity Distribution (SSDs) using R.
However, I did not show how to retrieve the used data.
The following query is similar to the one above. However, I do not restrict to a specific taxon:
only EC50 or LC50
only 48h tests
only ITX or MOR effects
keep only data points that can be converted to $\mu g / L$.
I also added the taxonomic group (from a custom lookup-table).
We end up with 286 entries from 115 taxa.
So we have multiple entries per taxon.
For graphical display, I will aggregate them using the geometric mean.
[Note, that further data-quality and plausibility checks could and should (!) be done:
e.g. based on the solubility, baseline-tox etc…].
The aggregation could be also carried out in R:
We are left with data from 115 taxa.
Let’s create plot show the distribution of LC50 values across taxa:
We may see an ordering of sensitivity: fish < insects < crustaceans…
This was a rather technical post, but dealing with (big)data in ecotoxicology requires some technical expertise in data wrangling.
Nevertheless, it may be useful to others working with this data from the US EPA ECOTOX database.
It is an incredibly important resource for ecotoxicology.
Thanks @USEPA for making this available!
Hopefully, I showed that it is worth having a local mirror of this database:
it gives much more flexibility than the web interface and you can get directly the data you need in R (where can further process it).
I must admit: Because the database is so huge it need some time to get familiar with it and to use it efficiently (what are the tables? What are the variables therein? What are the meaning of the codes?).
I only showed two example queries here, but don’t take them seriously!
A robust use of this database requires also a lot of quality checks! (I’ll leave this for a future post…)
I have created few more custom lookup tables - if you are interested in these or have a question feel free to contact me.