Sachem

High-performance structure search.

Sachem is a high-performance cartridge for storing small molecules and searching them by structure. It was developed in 2017 at IOCB CAS, in the Bioinformatics research group of Jiří Vondrášek, by Jakub Galgonek and Mirek Kratochvíl.

The underlying technology is described in an article published in the Journal of Cheminformatics: https://doi.org/10.1186/s13321-018-0282-y

Download

You can download a source code snapshot using Git:

git clone --branch ver_1.0.0 --recursive https://bioinfo.uochb.cas.cz/gitlab/chemdb/sachem.git

Installation and running

Java 1.8 and PostgreSQL 9.6 or later, including the libpq header files, are required on the target machine. Installation of Sachem then follows the standard procedure:

$ cd sachem
$ autoreconf -i
$ ./configure
$ make
# make install

Database setup

After installation, you need to set up a PostgreSQL database for the indexes (replace sachem_user with your PostgreSQL username):

CREATE SCHEMA sachem;
ALTER ROLE sachem_user SET search_path TO sachem, public;
CREATE EXTENSION sachem_lucene WITH SCHEMA sachem;

(Note that the last step requires superuser privileges.)

Loading data

To fill the indexes with search data, you need to set up the data download infrastructure. This is done by filling in several values in "properties" files; example files are provided in the config/ subdirectory of the Sachem source. To use one of them, first fill in the values marked by % in the file, then run the supplied update script. For ChEBI data, this works:

$ vi config/chebi.properties   # ...fill in values marked by %
$ sachem-update-data config/chebi.properties

We provide scripts for handling several other databases' specifics, notably sachem-update-pubchem and sachem-update-drugbank. Example configs are supplied accordingly.

Custom datasets can be loaded as follows:

$ vi config/common-loader.properties   # ...fill in values marked by %
$ sachem-load-data config/common-loader.properties /sdf/data/directory

You can also put your SDF data into the table compounds and index them using the following SQL command:

SELECT "sachem_sync_data"();

The schemas of the compounds table and the sync function are, for practical reasons, simplified to the smallest possible amount of information:

CREATE TABLE compounds (
    id                    INT NOT NULL,
    molfile               TEXT NOT NULL,
    PRIMARY KEY (id)
);

CREATE FUNCTION sachem_sync_data(
    verbose boolean = false,
    optimize boolean = true
) RETURNS void;

The verbose parameter makes the function print the progress of the operation; the optimize parameter triggers index optimization on the inverted indexes (this affects only the Lucy and Lucene versions). Indexing failures (e.g. invalid molfiles or unknown atoms) are logged in the table sachem_molecule_errors.
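If your compounds come as an SDF file, each record has to be reduced to a bare molfile before it goes into the compounds table. The following is a minimal Python sketch of that step, assuming the usual SDF layout in which every record is terminated by a `$$$$` line and the molfile part ends at `M  END`. The function name `sdf_to_rows` and the sequential id numbering are illustrative choices and not part of Sachem.

```python
from typing import Iterable, Iterator, Tuple


def sdf_to_rows(lines: Iterable[str], first_id: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (id, molfile) pairs shaped like rows of the compounds table.

    Assumes records end with a '$$$$' line and that everything after
    'M  END' inside a record is SDF data fields, which are dropped here.
    """
    next_id = first_id
    record = []          # molfile lines collected for the current record
    in_molfile = True    # becomes False once 'M  END' has been seen
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            # End of record: emit it unless it was completely empty.
            if any(part.strip() for part in record):
                yield next_id, "\n".join(record) + "\n"
                next_id += 1
            record, in_molfile = [], True
        elif in_molfile:
            record.append(line)
            if line.startswith("M  END"):
                in_molfile = False
    # Tolerate a final record that lacks the '$$$$' terminator.
    if any(part.strip() for part in record):
        yield next_id, "\n".join(record) + "\n"
```

The resulting pairs can be inserted into compounds with any PostgreSQL client, for example through a parametrized `INSERT INTO compounds (id, molfile) VALUES (%s, %s)`, before sachem_sync_data() is called.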

For maintenance purposes (and to save some disk space), you may additionally want to run the function sachem_cleanup() to remove old versions of the indexes. Note that this function is run automatically if you use the supplied data loading scripts.

Fingerprint statistics

After all data is loaded and indexed, you need to generate statistics about fingerprint usage, which drive the search pruning heuristic (see the article section "Screening performance optimization by bit selection"). This is required only for the inverted-index-based Lucene and Lucy variants of Sachem. The file with the statistics is called fporder; you need to create it by running this SQL procedure:

SELECT "sachem_generate_fporder"();

(Note that fporder does not need to be updated after each change in the data, as the heuristic remains very effective even after major changes in the database. It should be updated when molecules with new, previously unindexed features are inserted.)
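When the compounds table is maintained by hand, the calls described above can be grouped into one routine. This is a sketch only: `refresh_indexes` is a hypothetical helper, `cursor` is assumed to be a Python DB-API cursor that uses the `%s` parameter style (as psycopg2 does), and the session is assumed to have the sachem schema on its search_path. The calls are issued in the order in which this document introduces them, which is an assumption rather than a documented requirement.

```python
def refresh_indexes(cursor, verbose=False, optimize=True, regenerate_fporder=False):
    """Re-index the compounds table and remove old index versions.

    `cursor` is assumed to be a DB-API cursor with '%s' placeholders.
    """
    # Index the current contents of the compounds table.
    cursor.execute("SELECT sachem_sync_data(%s, %s)", (verbose, optimize))
    if regenerate_fporder:
        # Only relevant for the Lucene and Lucy variants; worth doing after
        # molecules with previously unindexed features have been added.
        cursor.execute("SELECT sachem_generate_fporder()")
    # Drop old versions of the indexes to reclaim disk space.
    cursor.execute("SELECT sachem_cleanup()")
```

Leaving `regenerate_fporder` off by default mirrors the note above that the statistics rarely need to be rebuilt.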

Search and search functions

Substructure search is provided by a function with this signature:

CREATE FUNCTION sachem_substructure_search(
    query varchar,
    query_type int,
    top_n int = 0,
    graph_mode int = 0,
    charge_mode int = 2,
    isotope_mode int = 0,
    stereo_mode int = 0,
    tautomer_mode int = 0,
    vf2_timeout int = 5000
) RETURNS int;

The parameters are as follows:

query a molfile/SMILES/RGroup string query
query_type selects the type of the query string (0 for autodetection, 1 for SMILES, 2 for MDL (version V2000 or V3000), 3 for RGroup)
top_n, if positive, limits the output to the specified number of results
graph_mode (0 for substructure search, 1 for exact search)
charge_mode (0 for ignoring all charge information, 1 interprets unspecified query charge as zero charge, 2 interprets unspecified query charge as matching any charge)
isotope_mode (0 for ignoring all isotope information, 1 and 2 as with charge_mode)
stereo_mode (0 for ignoring stereochemistry annotations, 1 for strict stereochemistry matching)
tautomer_mode (0 for ignoring tautomerism, 1 for matching tautomer structures generated by InChI)
vf2_timeout specifies the timeout (in milliseconds) for the graph isomorphism matching; 0 disables the timeout
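Because most parameters have defaults, a client usually needs to pass only the query, its type, and the few modes it wants to change. The Python sketch below builds such a call using PostgreSQL named notation (`name => value`). The helper `substructure_call` and the `QueryType` enum are hypothetical names introduced for illustration, and the `%s` placeholders assume a DB-API driver such as psycopg2.

```python
from enum import IntEnum


class QueryType(IntEnum):
    """Values of query_type as listed above."""
    AUTODETECT = 0
    SMILES = 1
    MOLFILE = 2
    RGROUP = 3


def substructure_call(query, query_type=QueryType.AUTODETECT, **options):
    """Return (sql, params) for a sachem_substructure_search call.

    Optional arguments are passed by name, so anything left out keeps
    the default from the function signature.
    """
    allowed = ("top_n", "graph_mode", "charge_mode", "isotope_mode",
               "stereo_mode", "tautomer_mode", "vf2_timeout")
    unknown = set(options) - set(allowed)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))
    # Follow the order of the signature so the generated SQL is predictable.
    names = [name for name in allowed if name in options]
    placeholders = ["%s", "%s"] + [name + " => %s" for name in names]
    sql = "SELECT * FROM sachem_substructure_search(" + ", ".join(placeholders) + ")"
    params = [query, int(query_type)] + [int(options[name]) for name in names]
    return sql, params
```

For example, `substructure_call("c1ccccc1", QueryType.SMILES, top_n=10, stereo_mode=1)` yields a statement and parameter list that can be handed to `cursor.execute(sql, params)`.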

Similarity search is provided by this function:

CREATE FUNCTION sachem_similarity_search(
    query varchar,
    query_type int,
    cutoff float4,
    top_n int = 0)
RETURNS TABLE (compound int, score float4);

The parameters are as follows:

query, query_type and top_n are as with substructure search
cutoff specifies the minimum required similarity score of the results (values between 0 and 1)
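A similarity query can be prepared in the same way. The sketch below assumes the argument order query, query_type, cutoff, top_n, as described in the parameter list above; `similarity_call` is a hypothetical helper and the `%s` placeholders again assume a DB-API driver such as psycopg2. The explicit ORDER BY is a precaution, since this document does not state in which order the rows are returned.

```python
def similarity_call(query, cutoff, query_type=0, top_n=0):
    """Return (sql, params) for a sachem_similarity_search call."""
    # The cutoff is a similarity score, so it has to lie in [0, 1].
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must lie between 0 and 1")
    sql = (
        "SELECT compound, score "
        # Cast the cutoff so it matches the float4 parameter type.
        "FROM sachem_similarity_search(%s, %s, %s::float4, %s) "
        "ORDER BY score DESC"
    )
    return sql, [query, int(query_type), float(cutoff), int(top_n)]
```

Each returned row pairs a compound id with its score, matching the RETURNS TABLE clause of the function.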

Datasets used for testing

File	Download
1M dataset	dataset-1m.sdf.xz (262 MiB)
10M dataset	dataset-10m.sdf.xz (2.6 GiB)
94M dataset	dataset-94m.sdf.xz (26 GiB)