Chempound
Chempound is a server for archiving and searching the outputs of computational chemistry calculations. It can be used as a standalone tool for managing the files on a user's personal computer, or as a managed server for curating the data generated by a group or company.
The website for chempound can be found at:
http://www.chempound.net. This also contains links to download the latest version of the software and descriptions of how to use it.
An example of a chempound server containing the results of several thousand calculations can be found here:
http://quixote.ch.cam.ac.uk.
The rest of this page is a temporary placeholder for information that will be moved to the chempound website, so please ignore for the time being!
Documentation
The existing documentation for Chempound can be found here:
- this page: http://quixote.wikispot.org/Chempound
- the chempound website: http://www.chempound.net
- Jorge's repository: https://bytebucket.org/jestrada/quixote-docs/wiki/main/quixote-main.html
- Sam's in-press JODI paper: http://wwmm.ch.cam.ac.uk/~sea36/chempound/
Repositories
The repository for the chempound packages is hosted on bitbucket:
https://bitbucket.org/chempound
Using Chempound
With a functioning chempound repository in place, we can now start to query the data held within it.
For simple searches, we can just Browse through the files, or use the simple Search functionality on the web interface to pull out entries of interest.
This is fine for small, arbitrary searches, but Chempound also makes it very easy to automate searches and extract subsets of the data in a variety of ways.
Chempound uses a RESTful interface: the url for a particular calculation can return the data in a variety of formats, depending on how we make the request to the server.
The currently supported formats are:
As an example, take a computational chemistry calculation done with the Gaussian code and hosted on the Cambridge Chempound server. If we go to the url for the calculation with a browser, we get an html Splash Page with a human-readable summary of the calculation and the ability to view the structure in jmol:
http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8893/
We get this page because our browser has requested a text/html representation of the resource.
Getting json with Python
The following python script sets the http Accept header to ask for json, and then prints out the json returned.
import urllib2
import json
# url of the calculation we are interested in
url = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
# Set up a request object and add the Accept header to ask for json
request = urllib2.Request(url)
request.add_header('Accept','application/json' )
response = urllib2.urlopen(request)
# Can pass the response object to json.load, as it has a read() method
# This just creates a python dictionary, which we can query
json_output = json.load(response)
# Use json dumps method to write out formatted json
print json.dumps(json_output, sort_keys=True, indent=4)
This outputs:
{
"resources": [
{
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.gjf"
},
{
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.png"
},
{
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892_tn.png"
},
{
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.cml"
},
{
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.out"
}
],
"title": "C 36 H 28 P 2",
"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
}
Within chempound, the various files that make up the entry for the calculation are grouped together as an ORE object. The resources key of the json object holds these, and includes the uris of the original log file, the cml file, the png pictures generated by jmol, etc.
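Once the json has been loaded into a dictionary, the resources list can be sorted by file type so that a script can pick out (say) the cml file. The following is a minimal sketch in Python 3 (the scripts on this page are Python 2); it works on an abridged copy of the output above rather than a live request, and the helper name resources_by_extension is our own, not part of Chempound.

```python
import json
import os
from urllib.parse import urlparse

# Abridged copy of the json shown above (two of the five resources kept)
sample = """
{
    "resources": [
        {"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.cml"},
        {"uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.out"}
    ],
    "title": "C 36 H 28 P 2",
    "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
}
"""

def resources_by_extension(entry):
    """Map each file extension (e.g. '.cml') to the list of resource uris with that extension."""
    grouped = {}
    for resource in entry.get("resources", []):
        uri = resource["uri"]
        # Take the extension from the path part of the uri only
        extension = os.path.splitext(urlparse(uri).path)[1]
        grouped.setdefault(extension, []).append(uri)
    return grouped

entry = json.loads(sample)
by_ext = resources_by_extension(entry)
print(by_ext[".cml"][0])
```

Each uri found this way can then be fetched with a further request to download the file itself.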
Requesting the RDF
Chempound is built on RDF and a primary component is a triple store containing RDF statements describing the structure of the data, and its associated metadata.
If we query the url and request the rdf serialised as xml, we can receive an object that contains the full data of the object, including the links to the files. The following python script does this and prints out the resulting rdf/xml:
import urllib2
# url of the calculation we are interested in
url = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
# Set up a request object and add the Accept header to ask for rdf
request = urllib2.Request(url)
request.add_header('Accept','application/rdf+xml' )
response = urllib2.urlopen(request)
#print out what we got back
print response.read()
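The json and rdf scripts differ only in the value of the Accept header, so the content negotiation can be wrapped in one small helper. This is a sketch in Python 3, not part of Chempound: the short labels in MEDIA_TYPES and the function names are hypothetical, and decoding the response as UTF-8 is an assumption about the server.

```python
import urllib.request

# Media types used on this page; the short labels are our own invention
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
}

def build_request(url, fmt):
    """Return a Request for url whose Accept header asks for the given format."""
    request = urllib.request.Request(url)
    request.add_header("Accept", MEDIA_TYPES[fmt])
    return request

def fetch(url, fmt, timeout=30):
    """Send the request and return the body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url, fmt), timeout=timeout) as response:
        # Assumption: the server sends UTF-8 encoded text
        return response.read().decode("utf-8")

url = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
request = build_request(url, "rdf")
print(request.get_header("Accept"))
```

Calling fetch(url, "json") or fetch(url, "rdf") would then return the same resource in the two representations, provided the server is reachable.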
SPARQL queries
SPARQL is a query language for extracting data represented as RDF, in much the same way that SQL is a language for querying data in relational databases. As the data in Chempound is stored as RDF, SPARQL is the language of choice for making complex queries against the stored data.
A good, chemistry-related tutorial on SPARQL can be found here.
The chempound webserver provides a page where SPARQL queries can be typed into a webpage and the results returned as html or RDF. The SPARQL page on the Cambridge server can be found here.
The easiest way to get to grips with SPARQL is to dissect a simple query:
SELECT ?molecule
WHERE
{
?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .
}
The crucial line is the one stating: ?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .
This uses the RDF subject:predicate:object pattern. The subject is the variable molecule (variables in SPARQL are prefixed with a ?, although you can also use $), the predicate is a uri which references the CML schema, and the object is a string literal. The statement is then terminated by a full stop.
What this says is that we want to assign to the variable molecule all the entities where the cml formula property is "H 2 O 1".
The SELECT statement says that we want the query to return the molecule variable, which will contain the list of all objects that matched the statement.
If we run this against the cambridge chempound server we get back something like the following:
Variable Bindings Result
molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/258/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/261/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/262/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/263/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/264/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/265/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/266/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/267/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/268/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/269/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_1301_1350/1325/#molecule
These are the uris of all the water molecules in the database.
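To make the matching step concrete, here is a toy model in Python 3 of what a single triple pattern does. It is only a sketch of the idea: a real triple store is indexed and far more capable, and the subject uris below are made up for illustration.

```python
CML_FORMULA = "http://www.xmlcml.org/rdf-schema#formula"

# A toy triple store: (subject, predicate, object) tuples with made-up subjects
triples = [
    ("http://example.org/calc/1#molecule", CML_FORMULA, "H 2 O 1"),
    ("http://example.org/calc/2#molecule", CML_FORMULA, "C 1 H 4"),
    ("http://example.org/calc/3#molecule", CML_FORMULA, "H 2 O 1"),
]

def match(pattern, store):
    """Return one {variable: value} dict per triple that matches the pattern.

    Pattern terms starting with '?' are variables; every other term must
    equal the corresponding term of the triple exactly.
    """
    solutions = []
    for triple in store:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to the same value
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed, so this triple is a solution
            solutions.append(bindings)
    return solutions

for solution in match(("?molecule", CML_FORMULA, "H 2 O 1"), triples):
    print(solution["?molecule"])
```

Only the two triples whose object is "H 2 O 1" produce a binding for ?molecule, which mirrors the list of uris returned by the server.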
We can now look at a more advanced query:
PREFIX cml: <http://www.xmlcml.org/rdf-schema#>
SELECT ?formula ?inchi ?molecule
{
?molecule cml:formula ?formula .
?molecule cml:inchi ?inchi .
FILTER ( ?formula = "H 2 O 1" )
}
The first line is equivalent to declaring a namespace in xml, and associates a convenient label with a long uri, so that instead of writing <http://www.xmlcml.org/rdf-schema#>, we can just write cml.
We are now selecting 3 variables from our dataset, and they will be returned in the order we have listed them. The WHERE keyword has been omitted as it is optional.
The next two lines by themselves would select all entities in the database (and return them in the molecule variable) that have the cml properties formula and inchi. However, the FILTER restricts the results to those where the value contained in the formula variable is "H 2 O 1".
Discovering the available search terms
The data that is extracted into RDF and therefore available for searching in Chempound is determined by the convention and dictionaries that apply to the files in question.
Please follow these links for more information on conventions and dictionaries.
For CIF files, the CIF dictionary lists all the terms that are available.
For Computational Chemistry outputs, the CompChem dictionary lists the indexed terms.
To determine how best to search for data, it is usually useful to go to the splash page for a representative structure in chempound and download the RDF file. This will show the form of the RDF and how a query needs to be constructed.
For example, if we wish to search on the cell_measurement_temperature, looking at the RDF for a CIF file, we see it is structured as shown below:
<iucr:cell_measurement_temperature rdf:parseType="Resource">
<rdf:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">173.0</rdf:value>
<cml:units rdf:resource="http://www.xml-cml.org/unit/sik"/>
<cml:errorValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">2.0</cml:errorValue>
</iucr:cell_measurement_temperature>
If we just search for the cell_measurement_temperature, we get back the RDF resource rather than the number. We therefore need a second triple pattern to extract the value, which is done with the following query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
SELECT ?entry ?value
{
?entry cif:cell_measurement_temperature ?temp .
?temp rdf:value ?value .
}
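The reason two triple patterns are needed can be seen by reading the RDF/XML fragment directly: the temperature property points at a nested resource, and the number lives in that resource's rdf:value. The Python 3 sketch below parses the fragment with ElementTree. The wrapping rdf:RDF and rdf:Description elements and the entry uri are added for illustration, and the cml and iucr namespace uris are assumptions borrowed from the PREFIX lines on this page, not taken from a real file.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace uris: copied from the PREFIX lines on this page
CML = "http://www.xmlcml.org/rdf-schema#"
IUCR = "http://www.xml-cml.org/dictionary/cif/"

# The fragment shown above, wrapped in a root element that declares the prefixes
fragment = """
<rdf:RDF xmlns:rdf="%s" xmlns:cml="%s" xmlns:iucr="%s">
  <rdf:Description rdf:about="http://example.org/entry/">
    <iucr:cell_measurement_temperature rdf:parseType="Resource">
      <rdf:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">173.0</rdf:value>
      <cml:units rdf:resource="http://www.xml-cml.org/unit/sik"/>
      <cml:errorValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">2.0</cml:errorValue>
    </iucr:cell_measurement_temperature>
  </rdf:Description>
</rdf:RDF>
""" % (RDF, CML, IUCR)

root = ET.fromstring(fragment.strip())

# First hop: entry -> nested temperature resource
temperature = root.find(".//{%s}cell_measurement_temperature" % IUCR)
# Second hop: nested resource -> its value, error and units
value = float(temperature.find("{%s}value" % RDF).text)
error = float(temperature.find("{%s}errorValue" % CML).text)
units = temperature.find("{%s}units" % CML).get("{%s}resource" % RDF)
print(value, error, units)
```

The two find calls correspond to the two triple patterns in the query: one to reach the temperature resource, one to reach its rdf:value.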
A similar example for a CompChem file is shown below. This searches on a term in the compchem dictionary, and then filters the value for only those structures with a charge of 0.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX compchem: <http://www.xml-cml.org/dictionary/compchem/>
SELECT ?molecule ?charge
{
?molecule compchem:charge ?chargeR .
?chargeR rdf:value ?charge
FILTER ( ?charge = 0 )
}
Remote Chempound SPARQL queries with Python
The Chempound SPARQL page will return the results as html or rdf/xml. The rdf/xml can of course be saved and processed offline, but it is more useful to be able to query and download the results all from within a single script.
As the Chempound SPARQL endpoint exposes a RESTful API, we can query it directly. The following python script executes a SPARQL query against chempound and then saves the result as a csv (comma-separated values) file, so that the results of the query can be imported into a spreadsheet program for (e.g.) plotting a graph of the results.
import urllib
import urllib2
import xml.etree.ElementTree as ET
# The SPARQL query we want to execute
query="""SELECT ?molecule ?inchi
WHERE
{
?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .
?molecule <http://www.xmlcml.org/rdf-schema#inchi> ?inchi .
}"""
# url of the chempound SPARQL endpoint we want to query
baseurl = "http://quixote.ch.cam.ac.uk/sparql/"
# The "comma-separated values" file where the results should
# go so they can be imported into a spreadsheet program
csvFile = "/Users/jmht/sparql_results.csv"
# The real work starts here!
# SPARQL namespace - shouldn't need to change this
namespace="http://www.w3.org/2005/sparql-results#"
# NB: results format based on: http://www.w3.org/2001/sw/DataAccess/rf1/
# Set up our query to the SPARQL endpoint
# Encode the query string; passing it as the data argument makes urllib2 send a POST
urlparam = { "query" : query }
querystr=urllib.urlencode(urlparam)
request = urllib2.Request(baseurl,querystr)
# Add the header to state what we want back
request.add_header('Accept','application/sparql-results+xml')
# Get the results
response = urllib2.urlopen(request)
# We now have the results in SPARQL xml so we need to turn them into
# a csv file - we use etree to do this:
# http://effbot.org/zone/element-index.htm
# Parse results to create etree & get root element
etree = ET.parse(response)
root = etree.getroot()
# Sparql query always returns 2 elements: head and results
head,results = root[:]
# Get head and create dictionary for variables
resultsDict = {}
for var in head:
    resultsDict[var.get("name")] = []
# Loop through results adding the relevant bindings to the dictionary.
# Currently only support uri
nresults=len(results)
for result in results:
    for binding in result:
        # One element for each binding of type: uri, literal or label
        # currently only deal with uri
        if ( len(binding) == 1 and binding[0].tag == "{%s}uri" % namespace ):
            resultsDict[binding.get("name")].append(binding[0].text)
        else:
            raise RuntimeError("Results only supported for uri!")
# output as csv file
rfile = open(csvFile,'w')
# column headers
headers = resultsDict.keys()
rfile.write(",".join(headers)+"\n")
# data
for i in range(nresults):
    newline=[]
    for header in headers:
        newline.append(resultsDict[header][i])
    rfile.write(",".join(newline)+"\n")
rfile.close()
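The script above only accepts uri bindings, whereas a query that selects a value such as an InChI string will normally return literals as well. The following Python 3 sketch shows one way the conversion step could handle both. It runs on a small hand-written sample in the SPARQL Query Results XML Format rather than on real server output, and the function name results_to_rows is our own.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hand-written sample in the SPARQL results XML format (not real server output)
sample = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="molecule"/>
    <variable name="inchi"/>
  </head>
  <results>
    <result>
      <binding name="molecule"><uri>http://example.org/calc/1#molecule</uri></binding>
      <binding name="inchi"><literal>InChI=1S/H2O/h1H2</literal></binding>
    </result>
  </results>
</sparql>"""

def results_to_rows(xml_text):
    """Return (header, rows) extracted from a SPARQL results document."""
    root = ET.fromstring(xml_text)
    header = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        # Unbound variables are simply absent, so default every column to ""
        row = dict.fromkeys(header, "")
        for binding in result.findall("sr:binding", NS):
            # Each binding holds one child element: uri, literal or bnode
            row[binding.get("name")] = binding[0].text or ""
        rows.append([row[name] for name in header])
    return header, rows

header, rows = results_to_rows(sample)
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Using the csv module instead of joining on commas means that values which themselves contain commas (as many InChI strings do) are quoted correctly, and taking the column order from the head element keeps the columns in the order of the SELECT clause.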
Hacking Chempound
This section is for those who may be interested in altering or extending Chempound. It isn't intended to be a programmer's manual, more a brief overview of chempound's current structure and a walk-through on how to add additional CML data to the repository, which is expected to be the reason why most people would currently want to extend Chempound.
NB: additional information can be found in Jorge Estrada's repository.
Chempound is actually a very general tool for managing collections of objects (collected as ORE aggregates) and their associated data and metadata, using RDF for the data model. As such, almost all of the chemistry functionality is implemented using plugins, so the code that needs to be modified to change the chemistry behaviour is very localised.
Overview of the repositories
The repositories for the chempound packages are hosted on bitbucket:
https://bitbucket.org/chempound
Currently, there are 9 repositories, as detailed below:
- https://bitbucket.org/chempound/ - this contains the main server code. There is almost no chemistry-specific code here, apart from in the chempound-rdf-cml directory, which has a very small class to add some CML data to the RDF model.
- https://bitbucket.org/chempound/chemistry - this is where the most general chemistry code lives, and where the general functions to handle the conversion of data from CML are.
- https://bitbucket.org/chempound/chempound-client - the base classes for the command-line client are here (it is the client that actually handles the conversion of logfiles into CML and the generation of the jmol pictures etc), although there is no chemistry-specific code here.
- https://bitbucket.org/chempound/chempound-parent - this just contains the central maven pom.xml that is used to configure maven for chempound.
- https://bitbucket.org/chempound/compchem - all the code to handle the data associated with computational chemistry calculations (both server and client) lives here.
- https://bitbucket.org/chempound/crystallography - all the code to handle the crystallography-specific aspects of the data.
- https://bitbucket.org/chempound/deposit-client - TODO - not had to look at this yet.
- https://bitbucket.org/chempound/quixote-client - the code to drive the code-specific imports of compchem logfiles.
- https://bitbucket.org/chempound/quixote-repository - this is more code to package chempound for use by the quixote project and create the stand-alone chempound server war file.
A slightly more detailed view of the chemistry-specific repositories and their modules follows below.
| Repository | Modules | Description and important classes |
| | | Classes to handle the generic processing of CML datatypes and the conversion to RDF |
| | | * net.chempound.chemistry.cmlChemicalMine.java - mime types |
| | | * net.chempound.chemistry.Cml2RdfConverter.java - code to handle the conversion of generic, simple cml datatypes into RDF. |
| | | Base classes for the client-side conversion of files and the generation of images |
| | | Classes to drive jmol to generate the images, and also the jmol code itself |
| | | Classes to handle the chemistry-specific search page - if you want to add more chemistry search boxes, then you'll need to edit things here. |
| | | General code related to the compchem RDF data structures. The utility functions used by the freemarker templates to access the compchem data live here. |
| | | Code to handle the processing of chemical data on the server, such as display of the html pages and the freemarker templates. |
| | | The classes to handle importing code-specific logfiles (NWChem, Gaussian etc) using the jumbo-classes. These classes are used by the client, not chempound itself. The test cases for checking the imports also live here. |
| | | Code to test the various compchem-specific modules, as most do not contain any test code themselves. |
Adding New Data and editing the Splash page
Chempound extracts data from CML in accordance with the compchem convention. Provided that the data is a CML scalar, and is in the job's environment, initialization or finalization modules, with a dictRef (ideally) in the compchem dictionary, then the data will already be extracted into RDF.
If additional data needs to be extracted (such as is currently done for basis sets and dft functionals), then all that may be necessary is to edit the file CmlComp2RdfConverter.java to add the additional data to the RDF.
The html pages in chempound are generated using the freemarker template engine. The freemarker template that is used to generate the html page for each individual structure is the file comp.ftl (other template and css files are in the parent directory).
In order to facilitate extracting key RDF data for use with the freemarker templates, several classes are used. For adding new terms, the following files need to be edited:
- CompChemCalculation.java - this defines the interface that will be used by the freemarker template to access the data.
- CompChem.java - this creates the RDF terms that are used.
- CompChemCalculationImpl.java - this actually implements the functions to get the data.
When the new terms have been added, the tests should be updated, or a new test added in the directory https://bitbucket.org/chempound/compchem/src/ef32d64ba51b/compchem-importer/src/test/java/net/chempound/compchem
If the new terms are to be added to the chemistry search page, then the CompChemSearchProvider.java file will need to be edited, and suitable tests added to the file CompChemSearchIntegrationTest.java

