Chempound

version 39 (2012-04-23 09:18:30 by JensThomas)

    1. Chempound
    2. Documentation
    3. Repositories
    4. Using Chempound
      1. Getting json with Python
      2. Requesting the RDF
    5. SPARQL queries
      1. Discovering the available search terms
    6. Remote Chempound SPARQL queries with Python
    7. Hacking Chempound
      1. Overview of the repositories
      2. Adding New Data and editing the Splash page

Chempound

Chempound is a server for archiving and searching the outputs of computational chemistry calculations. It can be used as a standalone tool for managing the files on a user's personal computer, or as a managed server for curating the data generated by a group or company.

The Chempound website can be found at http://www.chempound.net. It also contains links to download the latest version of the software and descriptions of how to use it.

An example of a Chempound server containing the results of several thousand calculations can be found at http://quixote.ch.cam.ac.uk.

The rest of this page is a temporary placeholder for information that will be moved to the chempound website, so please ignore for the time being!

Documentation

The existing documentation for Chempound can be found here:

Repositories

The repository for the Chempound packages is hosted on Bitbucket: https://bitbucket.org/chempound

Using Chempound

With a functioning Chempound repository in place, we can now start to query the data held within it.

For simple searches, we can just Browse through the files, or use the simple Search functionality on the web interface to pull out entries of interest.

This is fine for small, ad hoc searches, but Chempound also makes it very easy to automate searches and extract subsets of the data in a variety of ways.

Chempound uses a RESTful interface: every calculation has its own URL, and the format in which the data is returned depends on how we make the request to the server.

The currently supported formats are:

As an example, take a computational chemistry calculation done with the Gaussian code and hosted on the Cambridge Chempound server. If we go to the URL for the calculation with a browser, we get an HTML Splash Page, with a human-readable summary of the calculation and the ability to view the structure in Jmol:

http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8893/

We get this page because our browser has requested a text/html representation of the resource.

Getting json with Python

The following Python script sets the HTTP Accept header to ask for JSON, and then prints out the JSON returned.

import json
import urllib.request

# URL of the calculation we are interested in
url = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"

# Set up a request object and add the Accept header to ask for JSON
request = urllib.request.Request(url)
request.add_header('Accept', 'application/json')
response = urllib.request.urlopen(request)

# The response object has a read() method, so it can be passed straight to
# json.load, which returns a Python dictionary that we can query
json_output = json.load(response)

# Use json.dumps to write out formatted JSON
print(json.dumps(json_output, sort_keys=True, indent=4))

This outputs:

{
    "resources": [
        {
            "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.gjf"
        },
        {
            "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.png"
        },
        {
            "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892_tn.png"
        },
        {
            "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.cml"
        },
        {
            "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/to-8892.out"
        }
    ],
    "title": "C 36 H 28 P 2",
    "uri": "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"
}

Within Chempound, the various files that make up the entry for the calculation are grouped together as an ORE object. The resources key of the JSON object holds these, and includes the URIs of the original log file, the CML file, the PNG images generated by Jmol, etc.
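Once loaded, the JSON is an ordinary Python dictionary, so individual files can be picked out of the resources list by their extension. The following is a minimal sketch rather than part of Chempound itself: the helper name find_resources is made up for illustration, and the entry dictionary is copied from the output shown above so that the example runs without network access. It is written for Python 3.

```python
import posixpath
from urllib.parse import urlparse

BASE = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"

# Copy of the JSON shown above, in the form json.load would return it
entry = {
    "resources": [
        {"uri": BASE + "to-8892.gjf"},
        {"uri": BASE + "to-8892.png"},
        {"uri": BASE + "to-8892_tn.png"},
        {"uri": BASE + "to-8892.cml"},
        {"uri": BASE + "to-8892.out"},
    ],
    "title": "C 36 H 28 P 2",
    "uri": BASE,
}


def find_resources(entry, extension):
    """Return the URIs in entry["resources"] whose path ends in extension.

    Hypothetical helper for illustration; not part of Chempound.
    """
    matches = []
    for resource in entry.get("resources", []):
        # Compare on the path only, so any query string is ignored
        path = urlparse(resource["uri"]).path
        if posixpath.splitext(path)[1].lower() == extension.lower():
            matches.append(resource["uri"])
    return matches


cml_files = find_resources(entry, ".cml")
print(cml_files)
```

Each URI found this way can then be downloaded with a further request if the file itself is needed.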

Requesting the RDF

Chempound is built on RDF and a primary component is a triple store containing RDF statements describing the structure of the data, and its associated metadata.

If we query the URL and request the RDF serialised as XML, we receive a document that contains the full data for the entry, including the links to the files. The following Python script does this and prints out the resulting RDF/XML:

import urllib.request

# URL of the calculation we are interested in
url = "http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8892/"

# Set up a request object and add the Accept header to ask for RDF/XML
request = urllib.request.Request(url)
request.add_header('Accept', 'application/rdf+xml')
response = urllib.request.urlopen(request)

# Print out what we got back (read() returns bytes; UTF-8 is assumed here)
print(response.read().decode('utf-8'))
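The JSON and RDF scripts differ only in the value of the Accept header, so the pattern can be wrapped in a small helper. This is a sketch under stated assumptions, not a Chempound API: the function names and the short format labels are invented here, only the three MIME types that appear on this page are listed, and the response body is assumed to be UTF-8. It is written for Python 3.

```python
import urllib.request

# MIME types used on this page; the short keys are labels chosen for this sketch
FORMATS = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
}


def build_request(url, fmt):
    """Return a Request for url that asks for the representation named by fmt."""
    if fmt not in FORMATS:
        raise ValueError("Unknown format: %s" % fmt)
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})


def fetch(url, fmt):
    """Fetch url in the given format and return the body as text (assumes UTF-8)."""
    with urllib.request.urlopen(build_request(url, fmt)) as response:
        return response.read().decode("utf-8")
```

With network access, calling fetch(url, "rdf") on the calculation URL used above should return the same RDF/XML as the previous script, and fetch(url, "json") the JSON.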

SPARQL queries

SPARQL is a query language for extracting data represented as RDF, in much the same way that SQL is a language for querying data in relational databases. As the data in Chempound is stored as RDF, SPARQL is the language of choice for making complex queries against the stored data.

A good, chemistry-related tutorial on SPARQL can be found here.

The Chempound webserver provides a page where SPARQL queries can be typed in and the results returned as HTML or RDF. The SPARQL page on the Cambridge server can be found here.

The easiest way to get to grips with SPARQL is to dissect a simple query:

SELECT  ?molecule
WHERE
{
      ?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .
}

The crucial line is the one stating: ?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .

This uses the RDF subject:predicate:object pattern. The subject is the variable molecule (variables in SPARQL are prefixed with a ?, although you can also use $), the predicate is a uri which references the CML schema, and the object is a string literal. The statement is then terminated by a full stop.

What this says is that we want to bind to the variable molecule all the entities whose CML formula property is "H 2 O 1".

The SELECT statement says that we want the query to return the molecule variable, which will contain the list of all objects that matched the statement.

If we run this against the Cambridge Chempound server we get back something like the following:

Variable Bindings Result

molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/258/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/261/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/262/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/263/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/264/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/265/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/266/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/267/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/268/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_251_300/269/#molecule
URI http://quixote.ch.cam.ac.uk/content/compchem/bangor/anna_1301_1350/1325/#molecule

These are the URIs of all the water molecules in the database.

We can now look at a more advanced query:

PREFIX cml: <http://www.xmlcml.org/rdf-schema#>

SELECT  ?formula ?inchi ?molecule
{
      ?molecule cml:formula ?formula .
      ?molecule cml:inchi ?inchi .
      FILTER ( ?formula = "H 2 O 1" )
}

The first line is equivalent to declaring a namespace in XML, and associates a convenient label with a long URI, so that instead of writing <http://www.xmlcml.org/rdf-schema#formula>, we can just write cml:formula.

We are now selecting 3 variables from our dataset, and they will be returned in the order we have listed them. The WHERE keyword has been omitted as it is optional.

The two triple patterns by themselves would select all entities in the database (and return them in the molecule variable) that have the CML properties formula and inchi. However, we are filtering the returned data to restrict the values returned to those where the value contained in the formula variable is "H 2 O 1".
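To see what the query is asking for, the same logic can be mimicked in a few lines of Python over an in-memory list of triples. This is only a sketch of the idea, not how Chempound or its triple store is implemented: the example.org URIs, the sample data and the function name are invented for illustration, and each molecule is assumed to have at most one formula and one InChI.

```python
CML = "http://www.xmlcml.org/rdf-schema#"

# Made-up (subject, predicate, object) triples for illustration only
triples = [
    ("http://example.org/mol/1", CML + "formula", "H 2 O 1"),
    ("http://example.org/mol/1", CML + "inchi", "InChI=1S/H2O/h1H2"),
    ("http://example.org/mol/2", CML + "formula", "C 1 H 4"),
    ("http://example.org/mol/2", CML + "inchi", "InChI=1S/CH4/h1H4"),
    # This molecule has a formula but no InChI, so it cannot match both patterns
    ("http://example.org/mol/3", CML + "formula", "H 2 O 1"),
]


def select_formula_inchi(triples, wanted_formula):
    """Mimic the query above: join the two patterns on the subject, then filter."""
    # ?molecule cml:formula ?formula
    formulas = {s: o for s, p, o in triples if p == CML + "formula"}
    # ?molecule cml:inchi ?inchi
    inchis = {s: o for s, p, o in triples if p == CML + "inchi"}

    rows = []
    for molecule, formula in formulas.items():
        # Both patterns must match the same ?molecule, then the FILTER applies
        if molecule in inchis and formula == wanted_formula:
            rows.append((formula, inchis[molecule], molecule))
    return rows


for row in select_formula_inchi(triples, "H 2 O 1"):
    print(row)
```

Only the first molecule is returned: the second fails the filter, and the third has no inchi triple, so the second pattern never matches it.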

Discovering the available search terms

The data that is extracted into RDF and therefore available for searching in Chempound is determined by the convention and dictionaries that apply to the files in question.

Please follow these links for more information on conventions and dictionaries.

For CIF files, the CIF dictionary lists all the terms that are available.

For Computational Chemistry outputs, the CompChem dictionary lists the indexed terms.

To determine how best to search for data, it is usually useful to go to the splash page for a representative structure in Chempound and download the RDF file. This shows the form of the RDF and therefore how a query needs to be constructed.

For example, if we wish to search on the cell_measurement_temperature, looking at the RDF for a CIF file, we see it is structured as shown below:

    <iucr:cell_measurement_temperature rdf:parseType="Resource">
      <rdf:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">173.0</rdf:value>
      <cml:units rdf:resource="http://www.xml-cml.org/unit/sik"/>
      <cml:errorValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">2.0</cml:errorValue>
    </iucr:cell_measurement_temperature>

If we just search for the cell_measurement_temperature, we are returned the RDF resource rather than the number itself. We therefore need a further step to extract the value, which is done with the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>

SELECT  ?entry ?value
{
      ?entry cif:cell_measurement_temperature ?temp  .
      ?temp rdf:value ?value .
}
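The two triple patterns in this query mirror the nesting in the RDF/XML: one hop from the entry to the temperature resource, and a second hop from that resource to its rdf:value. A minimal sketch that walks the same two hops over the fragment shown earlier is given below. It makes several assumptions: the fragment is wrapped in an rdf:RDF and rdf:Description element with a made-up example.org subject, the iucr prefix is assumed to be bound to the CIF dictionary namespace used in the query, and plain XML parsing stands in for a proper RDF library.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: iucr in the fragment is bound to the same namespace as cif in the
# query; check the namespace declarations of a real file before relying on this
CIF = "http://www.xml-cml.org/dictionary/cif/"
CML = "http://www.xmlcml.org/rdf-schema#"

# The fragment from the text, wrapped so that it is a complete document.
# The rdf:about value is a made-up subject for illustration.
document = """<rdf:RDF xmlns:rdf="{rdf}" xmlns:iucr="{cif}" xmlns:cml="{cml}">
  <rdf:Description rdf:about="http://example.org/entry/1">
    <iucr:cell_measurement_temperature rdf:parseType="Resource">
      <rdf:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">173.0</rdf:value>
      <cml:units rdf:resource="http://www.xml-cml.org/unit/sik"/>
      <cml:errorValue rdf:datatype="http://www.w3.org/2001/XMLSchema#double">2.0</cml:errorValue>
    </iucr:cell_measurement_temperature>
  </rdf:Description>
</rdf:RDF>""".format(rdf=RDF, cif=CIF, cml=CML)

root = ET.fromstring(document)
values = []
for entry in root.findall("{%s}Description" % RDF):
    # First hop: ?entry cif:cell_measurement_temperature ?temp
    for temp in entry.findall("{%s}cell_measurement_temperature" % CIF):
        # Second hop: ?temp rdf:value ?value
        value = temp.find("{%s}value" % RDF)
        values.append((entry.get("{%s}about" % RDF), float(value.text)))

print(values)
```

Stopping after the first hop would give only the intermediate element, which corresponds to being returned the RDF resource instead of the number.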

A similar example for a CompChem file is shown below. This searches on a term in the compchem dictionary, and then filters the value for only those structures with a charge of 0.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX compchem: <http://www.xml-cml.org/dictionary/compchem/>

SELECT  ?molecule ?charge
{
      ?molecule compchem:charge ?chargeR  .
      ?chargeR rdf:value ?charge
      FILTER ( ?charge = 0 )
}
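Both queries in this section share the same shape: an entity, a dictionary term pointing at an intermediate resource, and the rdf:value of that resource. A small helper can generate the unfiltered form of the query for any term. This is a hedged sketch: the function name and the variable names ?entry, ?resource and ?value are choices made here, and the inputs are inserted into the query without any validation, so it is only suitable for term names you have typed yourself.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_value_query(prefix, namespace, term):
    """Return a SPARQL query selecting each entity and the rdf:value of term.

    Hypothetical helper for illustration; inputs are not validated or escaped.
    """
    return "\n".join([
        "PREFIX rdf: <%s>" % RDF_NS,
        "PREFIX %s: <%s>" % (prefix, namespace),
        "",
        "SELECT ?entry ?value",
        "{",
        "      ?entry %s:%s ?resource ." % (prefix, term),
        "      ?resource rdf:value ?value .",
        "}",
    ])


query = build_value_query(
    "cif",
    "http://www.xml-cml.org/dictionary/cif/",
    "cell_measurement_temperature",
)
print(query)
```

Called with "compchem", the compchem dictionary namespace and "charge", the same helper gives the charge query without its FILTER line, which can then be added by hand.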

Remote Chempound SPARQL queries with Python

The Chempound SPARQL page will return the results as html or rdf/xml. The rdf/xml can of course be saved and processed offline, but it is more useful to be able to query and download the results all from within a single script.

As the Chempound SPARQL endpoint exposes a RESTful API, we can query it directly. The following Python script executes a SPARQL query against Chempound and then saves the result as a csv (comma-separated values) file, so that the results of the query can be imported into a spreadsheet program for (e.g.) plotting a graph of the results.

import csv
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# The SPARQL query we want to execute
query = """SELECT ?molecule ?inchi
WHERE
{
?molecule <http://www.xmlcml.org/rdf-schema#formula> "H 2 O 1" .
?molecule <http://www.xmlcml.org/rdf-schema#inchi> ?inchi .
}"""

# URL of the Chempound SPARQL endpoint we want to query
baseurl = "http://quixote.ch.cam.ac.uk/sparql/"

# The "comma-separated values" file where the results should
# go so they can be imported into a spreadsheet program
csvFile = "/Users/jmht/sparql_results.csv"


# The real work starts here!

# SPARQL results namespace - shouldn't need to change this
namespace = "http://www.w3.org/2005/sparql-results#"
# NB: results format based on: http://www.w3.org/2001/sw/DataAccess/rf1/

# Set up our query to the SPARQL endpoint.
# Encode the query as form data; supplying data makes this a POST request
urlparam = {"query": query}
querystr = urllib.parse.urlencode(urlparam).encode("ascii")
request = urllib.request.Request(baseurl, querystr)

# Add the header to state what we want back
request.add_header('Accept', 'application/sparql-results+xml')

# Get the results
response = urllib.request.urlopen(request)

# We now have the results in SPARQL XML so we need to turn them into
# a csv file - we use etree to do this:
# http://effbot.org/zone/element-index.htm

# Parse results to create etree & get root element
etree = ET.parse(response)
root = etree.getroot()

# The SPARQL results document has two child elements: head and results
head = root.find("{%s}head" % namespace)
results = root.find("{%s}results" % namespace)

# Column names come from the variable elements in head
headers = [var.get("name") for var in head.findall("{%s}variable" % namespace)]

# Loop through the results, building one row per result.
# Bindings of type uri and literal are supported; the query above returns
# ?molecule as a uri and ?inchi as a literal
supported = ("{%s}uri" % namespace, "{%s}literal" % namespace)
rows = []
for result in results:
    row = dict.fromkeys(headers, "")
    for binding in result:
        # Each binding holds one element of type uri, literal or bnode
        value = binding[0]
        if value.tag in supported:
            row[binding.get("name")] = value.text
        else:
            raise RuntimeError("Results only supported for uri and literal!")
    rows.append(row)

# Output as csv file; the csv module quotes any values containing commas
with open(csvFile, 'w', newline='') as rfile:
    writer = csv.writer(rfile)
    # column headers
    writer.writerow(headers)
    # data
    for row in rows:
        writer.writerow([row[header] for header in headers])

Hacking Chempound

This section is for those who may be interested in altering or extending Chempound. It isn't intended to be a programmer's manual, more a brief overview of Chempound's current structure and a walk-through on how to add additional CML data to the repository, which is expected to be the reason why most people would currently want to extend Chempound.

NB: additional information can be found in Jorge Estrada's repository

Chempound is actually a very general tool for managing collections of objects (collected as ORE aggregates) and their associated data and metadata, using RDF for the data model. As such, almost all of the chemistry functionality is implemented using plugins, so the code that needs to be modified to change the chemistry behaviour is very localised.

Overview of the repositories

The repositories for the Chempound packages are hosted on Bitbucket: https://bitbucket.org/chempound

Currently, there are 8 repositories as detailed below:

A slightly more detailed view of the chemistry-specific repositories and their modules follows below.

Repository: chemistry

    chemistry-common - Classes to handle the generic processing of CML datatypes and the conversion to RDF
        * net.chempound.chemistry.cmlChemicalMine.java - mime types
        * net.chempound.chemistry.Cml2RdfConverter.java - code to handle the conversion of generic, simple CML datatypes into RDF.
    chemistry-importer - Base classes for the client-side conversion of files and the generation of images
    chemistry-jmol-plugin - Classes to drive Jmol to generate the images, and also the Jmol code itself
    chemistry-search-structure - Classes to handle the chemistry-specific search page - if you want to add more chemistry search boxes, then you'll need to edit things here.

Repository: compchem

    compchem-common - General code related to the compchem RDF data structures. The utility functions used by the freemarker templates to access the compchem data live here.
    compchem-handler - Code to handle the processing of chemical data on the server, such as display of the HTML pages and the freemarker templates.
    compchem-importer - The classes to handle importing code-specific logfiles (NWChem, Gaussian etc) using the jumbo-classes. These classes are used by the client, not Chempound itself. The test cases for checking the imports also live here.
    compchem-test-harness - Code to test the various compchem-specific modules, as most do not contain any test code themselves.

Adding New Data and editing the Splash page

Chempound extracts data from CML in accordance with the compchem convention. Provided that the data is a CML scalar, and is in the job's environment, initialization or finalization modules, with a dictRef (ideally) in the compchem dictionary, then the data will already be extracted into RDF.

If additional data needs to be extracted (such as is currently done for basis sets and DFT functionals), then all that may be necessary is to edit the file CmlComp2RdfConverter.java to add the additional data to the RDF.

The HTML pages in Chempound are generated using the freemarker template engine. The freemarker template that is used to generate the HTML page for each individual structure is the file comp.ftl (other template and css files are in the parent directory).

In order to facilitate extracting key RDF data for use with the freemarker templates, several classes are used. For adding new terms, the following files need to be edited:

When the new terms have been added, the tests should be updated, or a new test added in the directory https://bitbucket.org/chempound/compchem/src/ef32d64ba51b/compchem-importer/src/test/java/net/chempound/compchem

If the new terms are to be added to the chemistry search page, then the CompChemSearchProvider.java file will need to be edited, and suitable tests added to the file CompChemSearchIntegrationTest.java
