Eduard Szöcs

Data in Environmental Science and Eco(toxico-)logy

Introducing the webchem package

Last year I wrote two posts about web-scraping chemical data (here and here). These were just a collection of functions living on GitHub. To make these functions available to a broader audience, I rewrote them, added new ones, and bundled them into a new R package: webchem.

Webchem is available on GitHub and CRAN and is part of the rOpenSci project.

Functionality

Webchem is useful for anyone working with chemical data (e.g. Daniel Münch uses it for his Database of Odor Responses; I use it for my work with monitoring data). It lets you retrieve information about chemicals from the web.

Currently it provides an interface to the Chemical Identifier Resolver, ChemSpider, PubChem, the Chemical Translation Service and the PAN Pesticide Database. If time permits, I will add other data sources (see issue page).

webchem in action

Let’s scrape some data about Imidacloprid:

Chemical Identifier Resolver

Search for CAS numbers with the Chemical Identifier Resolver:

library(webchem)
cir_query('Imidacloprid', representation = 'cas')
## [1] "138261-41-3" "105827-78-9"
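
Because a query can return more than one CAS number, a quick sanity check on each value can be handy. CAS numbers end in a check digit: the remaining digits are weighted by their position counted from the right, summed, and taken modulo 10. The following is a minimal base-R sketch of that rule; `is_valid_cas` is a hypothetical helper name and not a webchem function.

```r
# Validate the check digit of CAS numbers (base R only).
# `is_valid_cas` is a hypothetical helper, not part of webchem.
is_valid_cas <- function(cas) {
  vapply(cas, function(x) {
    # Expected layout: 2-7 digits, 2 digits, 1 check digit
    if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", x), "")[[1]])
    check  <- digits[length(digits)]
    body   <- rev(digits[-length(digits)])
    # Weight each digit by its position, counted from the right
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1), USE.NAMES = FALSE)
}

# The two CAS numbers from above, plus one with a wrong check digit
is_valid_cas(c("138261-41-3", "105827-78-9", "138261-41-4"))
## [1]  TRUE  TRUE FALSE
```

A passing check only means the number is well formed; it does not tell you which of several valid CAS numbers is the one you want.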

The Chemical Identifier Resolver is very powerful; see ?cir_query for the available representations. Here are just a few:

# SMILES
cir_query('Imidacloprid', representation = 'smiles')
## [1] "C1=CC(=NC=C1CN2C(=NCC2)N[N+](=O)[O-])Cl"
# InChIKey
cir_query('Imidacloprid', representation = 'stdinchikey')
## [1] "InChIKey=YWTYJOPNNQFBPC-UHFFFAOYSA-N"
# Molecular weight
cir_query('Imidacloprid', representation = 'mw')
## [1] "255.6633"
# number of rings
cir_query('Imidacloprid', representation = 'ring_count')
## [1] "2"
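
Note that the InChIKey above comes back with an "InChIKey=" prefix, while the other services in this post return the bare key. If you want to compare keys across sources, a small normalisation step helps. This is a sketch in base R; both helper names are hypothetical and not part of webchem. The format check assumes the standard InChIKey layout of 14, 10 and 1 uppercase letters separated by hyphens.

```r
# Hypothetical helpers (not part of webchem)

# Drop the "InChIKey=" prefix if present
strip_inchikey_prefix <- function(x) sub("^InChIKey=", "", x)

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated
is_inchikey_format <- function(x) grepl("^[A-Z]{14}-[A-Z]{10}-[A-Z]$", x)

key <- strip_inchikey_prefix("InChIKey=YWTYJOPNNQFBPC-UHFFFAOYSA-N")
key
## [1] "YWTYJOPNNQFBPC-UHFFFAOYSA-N"
is_inchikey_format(key)
## [1] TRUE

# The key in the csid_extcompinfo() output further down does not
# follow the standard layout
is_inchikey_format("YWTYJOPNNQFBPC-UHFFFAOYAZ")
## [1] FALSE
```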

chemspider

To use chemspider you need a security token (see here).

The workflow is similar to that of our taxize package:

  1. Query ID (csid)
  2. Use this ID to query more data

# get chemspider ID (`token` holds your ChemSpider security token)
csid <- get_csid('Imidacloprid', token = token)
csid
## [1] "77934"

Two functions retrieve data for this ID: csid_compinfo() returns the basic information and csid_extcompinfo() the extended information:

csid_compinfo(csid, token)
##                                                                                             CSID 
##                                                                                          "77934" 
##                                                                                            InChI 
## "InChI=1S/C9H10ClN5O2/c10-8-2-1-7(5-12-8)6-14-4-3-11-9(14)13-15(16)17/h1-2,5H,3-4,6H2,(H,11,13)" 
##                                                                                         InChIKey 
##                                                                    "YWTYJOPNNQFBPC-UHFFFAOYSA-N" 
##                                                                                           SMILES 
##                                                             "c1cc(ncc1CN2CCN=C2N[N+](=O)[O-])Cl"
csid_extcompinfo(csid, token)
##                                                                                            CSID 
##                                                                                         "77934" 
##                                                                                              MF 
##                                                                       "C_{9}H_{10}ClN_{5}O_{2}" 
##                                                                                          SMILES 
##                                                            "c1cc(ncc1CN2CCN=C2N[N+](=O)[O-])Cl" 
##                                                                                           InChI 
## "InChI=1/C9H10ClN5O2/c10-8-2-1-7(5-12-8)6-14-4-3-11-9(14)13-15(16)17/h1-2,5H,3-4,6H2,(H,11,13)" 
##                                                                                        InChIKey 
##                                                                     "YWTYJOPNNQFBPC-UHFFFAOYAZ" 
##                                                                                     AverageMass 
##                                                                                       "255.661" 
##                                                                                 MolecularWeight 
##                                                                                       "255.661" 
##                                                                                MonoisotopicMass 
##                                                                                    "255.052307" 
##                                                                                     NominalMass 
##                                                                                           "255" 
##                                                                                           ALogP 
##                                                                                             "0" 
##                                                                                           XLogP 
##                                                                                           "2.2" 
##                                                                                      CommonName 
##                                                                                  "Imidacloprid"

pubchem

The same workflow applies to pubchem:

cid <- get_cid('Imidacloprid')
cid
## [1] "86418"    "10130527" "16212231" "44470476" "71301282" "76308929"
## [7] "76327057"

This returns multiple matches; I use only the first one:

cid <- get_cid('Imidacloprid', first = TRUE)
cid
## [1] "86418"

cid_compinfo() then returns a lot of information from PubChem:

cid_info <- cid_compinfo(cid)

Here I display only selected entries (see ?cid_compinfo for a list):

# InChIKey
cid_info$InChIKey
## [1] "YWTYJOPNNQFBPC-UHFFFAOYSA-N"
# SMILES
cid_info$CanonicalSmiles
## [1] "C1CN(C(=N1)N[N+](=O)[O-])CC2=CN=C(C=C2)Cl"
# Molecular weight
cid_info$MolecularWeight
## [1] "255.661000"
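
The two-step workflow (query an ID, then query data with that ID) extends naturally to several compounds. The sketch below is one possible way to wrap it; `lookup_compounds` and the `fake_*` stand-ins are hypothetical names invented for illustration, and the stand-ins replace the web queries so that the example runs offline.

```r
# Hypothetical helper (not part of webchem): run the two-step workflow
# (name -> ID -> compound info) for several compound names.
# `id_fun` and `info_fun` are passed in, so any pair of lookup functions works.
lookup_compounds <- function(compounds, id_fun, info_fun) {
  out <- lapply(compounds, function(nm) {
    # A failed or empty ID lookup yields NULL instead of stopping the loop
    id <- tryCatch(id_fun(nm), error = function(e) NA)
    if (length(id) == 0 || is.na(id[1])) return(NULL)
    info_fun(id[1])
  })
  names(out) <- compounds
  out
}

# Offline stand-ins so the sketch runs without network access
fake_ids  <- c(Imidacloprid = "86418")
fake_id   <- function(nm) unname(fake_ids[nm])
fake_info <- function(id) list(CID = id)

res <- lookup_compounds(c("Imidacloprid", "NotAChemical"), fake_id, fake_info)
res$Imidacloprid$CID
## [1] "86418"
is.null(res$NotAChemical)
## [1] TRUE
```

With the functions shown above, a real call might look like lookup_compounds(c('Imidacloprid', 'Triclosan'), function(x) get_cid(x, first = TRUE), cid_compinfo), assuming the function signatures used in this post.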

Chemical Translation Service (CTS)

CTS is very useful if you need to query different identifiers for your compound (see the CTS page for a complete list).

Let’s use CTS to query the CAS number, InChIKey, ChemSpider ID, and PubChem CID:

# CAS
cts_convert('Imidacloprid', 'Chemical Name', 'cas')
## [1] NA
# InChIKey
cts_convert('Imidacloprid', 'Chemical Name', 'inchikey')
## [1] "YWTYJOPNNQFBPC-UHFFFAOYSA-N"
# Chemspider ID
cts_convert('Imidacloprid', 'Chemical Name', 'chemspider')
## [1] "77934"
# Pubchem ID
cts_convert('Imidacloprid', 'Chemical Name', 'pubchem cid')
## [1] "86418"

CTS also provides basic information from its database. However, you need the InChIKey for this:

inch <- cts_convert('Imidacloprid', 'Chemical Name', 'inchikey')
cts_compinfo(inch)[c(1, 3, 5)]
## $inchikey
## [1] "YWTYJOPNNQFBPC-UHFFFAOYSA-N"
## 
## $molweight
## [1] 255.66
## 
## $formula
## [1] "C9H10ClN5O2"

Note that I show only a subset of the data.

PAN Database

The PAN database stores a lot of information and might be particularly useful for ecotoxicologists:

pan_data <- pan('Imidacloprid', match = 'best')
# Matched Name
pan_data$`Chemical Name and matching synonym`
## [1] "Imidacloprid\nImidacloprid"
# CAS
pan_data$`CAS Number`
## [1] "105827-78-9, 138261-41-3"
# Use
pan_data$`Use Type`
## [1] "Insecticide"
# Class
pan_data$`Chemical Class`
## [1] "Neonicotinoid"
# Molecular Weight
pan_data$`Molecular Weight`
## [1] "255.7"
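
In the output above, PAN returns the CAS numbers as a single comma-separated string, and the matched-name field appears to use a newline as separator. To compare these values with results from other sources, they need to be split first. A minimal base-R sketch, using values copied from the output in this post rather than a live query:

```r
# Values copied from the output shown in this post
cas_pan <- "105827-78-9, 138261-41-3"
cas_cir <- c("138261-41-3", "105827-78-9")

# Split on commas and drop surrounding whitespace
cas_pan_vec <- trimws(strsplit(cas_pan, ",", fixed = TRUE)[[1]])
cas_pan_vec
## [1] "105827-78-9" "138261-41-3"

# Same set of CAS numbers, regardless of order?
setequal(cas_pan_vec, cas_cir)
## [1] TRUE

# Split the matched-name field at the newline
strsplit("Imidacloprid\nImidacloprid", "\n", fixed = TRUE)[[1]]
## [1] "Imidacloprid" "Imidacloprid"
```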

Written on April 26, 2015