Eduard Szöcs

Data in Environmental Science and Eco(toxico-)logy

Web scraping chemical data with R

Update: These functions have been integrated into the webchem package and removed from the esmisc package!

I recently needed to convert CAS numbers into SMILES and retrieve additional information about the compounds.

There are several sources on the web that provide chemical information, e.g. PubChem, ChemSpider and the Chemical Identifier Resolver.

I wrote some functions to interact with these servers from R. You can find them in my esmisc package:

install.packages('devtools')
library(devtools)
# current devtools versions expect the repository as 'user/repo'
install_github('EDiLD/esmisc')
library(esmisc)

These functions are very crude and need further development (if you want to improve them, fork the package!). Here’s a short summary:

Convert CAS to SMILES

Suppose we have some CAS numbers and want to convert them to SMILES:

casnr <- c("107-06-2", "107-13-1", "319-86-8")
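
Before sending CAS numbers to a web service, it can help to check that they are well-formed. The last digit of a CAS number is a check digit: the remaining digits are weighted 1, 2, 3, ... from right to left, summed, and taken modulo 10. The following is a minimal sketch of that check; is_cas() is a hypothetical helper written for this illustration and is not part of esmisc.

```r
# is_cas() is a hypothetical helper, not a function from esmisc or webchem.
# It checks the layout and the check digit of CAS Registry Numbers.
is_cas <- function(x) {
  vapply(x, function(cas) {
    # Expected layout: 2 to 7 digits, 2 digits, 1 check digit
    if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", cas)) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", cas), "")[[1]])
    check <- digits[length(digits)]
    # Remaining digits, reversed so that position 1 is the rightmost digit
    body <- rev(digits[-length(digits)])
    # Weighted sum modulo 10 must equal the check digit
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1), USE.NAMES = FALSE)
}

casnr <- c("107-06-2", "107-13-1", "319-86-8")
is_cas(casnr)
```

For "107-06-2" the weighted sum is 6*1 + 0*2 + 7*3 + 0*4 + 1*5 = 32, and 32 modulo 10 gives the check digit 2.
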

Via Cactus

cactus(casnr, output = 'smiles')

Via ChemSpider

Note that ChemSpider requires a security token. To obtain a token, please register at ChemSpider.

# replace the placeholder with your own ChemSpider security token
token <- '<your-token>'
csid <- get_csid(casnr, token = token)
csid_to_smiles(csid, token)

Via PubChem

cid <- get_cid(casnr)
cid_to_smiles(cid)
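
To give an idea of what such a lookup involves, here is a rough sketch of a query against the Chemical Identifier Resolver, which exposes results under URLs of the form https://cactus.nci.nih.gov/chemical/structure/<identifier>/<representation>. The helpers cir_url() and cir_query() are hypothetical names made up for this illustration; this is not necessarily how cactus() is implemented.

```r
# cir_url() and cir_query() are hypothetical helpers for illustration only.

# Build the request URL for each identifier
cir_url <- function(identifier, representation = "smiles") {
  encoded <- vapply(identifier, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste("https://cactus.nci.nih.gov/chemical/structure",
        encoded, representation, sep = "/")
}

# Fetch the first line of each response; NA if the request fails
cir_query <- function(identifier, representation = "smiles") {
  vapply(cir_url(identifier, representation), function(u) {
    tryCatch(readLines(u, warn = FALSE)[1],
             error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

casnr <- c("107-06-2", "107-13-1", "319-86-8")
cir_url(casnr)
# cir_query(casnr)  # needs network access, so it is left commented out here
```

Returning NA on failure keeps the output the same length as the input, so results can be bound back to the CAS numbers even when one request fails.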

Retrieve other data from CAS

All these web resources provide additional data. Here is an example retrieving the molecular weights:

Via Cactus

cactus(casnr, output = 'mw')

Via ChemSpider

csid_to_ext(csid, token)$MolecularWeight

Via PubChem

cid_to_ext(cid)$mw

ChemSpider and PubChem return the same values; however, the results from Cactus are slightly different.

Retrieve partition coefficients
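
The cause of these small differences was not investigated here. One plausible explanation, which is an assumption and has not been checked against the services, is that the sources use different atomic weight tables or rounding. The sketch below computes the molecular weights of the three example compounds from their molecular formulas with two sets of atomic weights to show the size of such an effect. The helper mol_weight() is a hypothetical name used only for this illustration.

```r
# Two sets of atomic weights: abridged standard values and older,
# more precise conventional values
aw_abridged <- c(C = 12.011, H = 1.008, N = 14.007, Cl = 35.45)
aw_older <- c(C = 12.0107, H = 1.00794, N = 14.0067, Cl = 35.453)

# Element counts of the example compounds, keyed by CAS number
formulas <- list(
  "107-06-2" = c(C = 2, H = 4, Cl = 2),  # 1,2-dichloroethane
  "107-13-1" = c(C = 3, H = 3, N = 1),   # acrylonitrile
  "319-86-8" = c(C = 6, H = 6, Cl = 6)   # delta-HCH
)

# mol_weight() is a hypothetical helper: sum of count times atomic weight
mol_weight <- function(counts, weights) {
  sum(counts * weights[names(counts)])
}

mw_abridged <- sapply(formulas, mol_weight, weights = aw_abridged)
mw_older <- sapply(formulas, mol_weight, weights = aw_older)

# Absolute difference between the two calculations per compound
abs(mw_abridged - mw_older)
```

With these inputs the two calculations differ only in the second or third decimal place, so differences of that order between sources are not surprising.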

Partition coefficients are another useful property. LOGKOW is a databank that contains experimental data, retrieved from the literature, on over 20,000 organic compounds.

get_kow() extracts the ‘Recommended values’ for a given CAS:

get_kow(casnr)

This function is very crude. For example, it returns only the first hit if multiple hits are found in the database - a better way would be to ask for user input, as we did in the taxize package.
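
Asking the user could look roughly like the following sketch. It uses utils::menu() in interactive sessions and falls back to the first hit otherwise. pick_hit() is a hypothetical helper for illustration; it is neither part of esmisc nor the approach used in taxize.

```r
# pick_hit() is a hypothetical helper, not part of esmisc or taxize.
# hits: character vector of candidate matches for one query
# ask:  whether to prompt the user when there are several hits
pick_hit <- function(hits, ask = interactive()) {
  if (length(hits) == 0) return(NA_character_)
  # Nothing to choose from, or prompting is not possible or wanted
  if (length(hits) == 1 || !ask) return(hits[1])
  choice <- utils::menu(hits, title = "Multiple hits found, please choose one:")
  # menu() returns 0 if the user exits without choosing
  if (choice == 0) NA_character_ else hits[choice]
}

# Non-interactive use falls back to the first hit
pick_hit(c("first candidate", "second candidate"), ask = FALSE)
```

Defaulting ask to interactive() keeps batch scripts running without a prompt, while interactive users get to resolve ambiguous matches themselves.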

Outlook

Currently I have no time to develop these functions extensively. I would be happy if someone picked up this work - it’s fairly easy: just fork the repo and start.

In the future this could be turned into an rOpenSci package, as it is within their scope.

Written on July 9, 2014