GDBChEMBL Similarity Search

What is a Fingerprint?

A fingerprint is a numerical representation of a molecule. It consists of a set of numbers that usually describe properties of the molecule, such as physicochemical properties, composition, topological features, or substructures. Once computed, fingerprints can be used for a variety of calculations, such as similarity searching or model building.

What is MQN?

MQN stands for Molecular Quantum Numbers and is one of the fingerprints developed in our group. MQN represents a molecule using 42 integer-valued descriptors of molecular structure, which count different types of atoms, bond types, polar groups, and topological features.

Reference: Classification of Organic Molecules by Molecular Quantum Numbers. K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.

What is ECfp4?

Extended-Connectivity Fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling. They are among the most popular similarity search tools in drug discovery and are used effectively in a wide variety of applications. ECfp4 encodes the substructure patterns of a molecule onto a bit string of length 1024 (the length is variable). ECFP generates the substructure patterns by growing circular layers around each atom up to a given diameter. In our browser we use ECFP with a diameter of 4 (ECfp4).
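The layer-by-layer idea can be illustrated with a minimal sketch. This is a simplified toy, not the published ECFP algorithm and not the code used by this browser: it ignores bond orders, charges, and duplicate-substructure removal, and the function name, the SHA-1 based hashing, and the hand-written ethanol graph are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def _hash(*parts: object) -> int:
    """Deterministic 32-bit hash of arbitrary printable values."""
    data = repr(parts).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")


def toy_circular_fingerprint(
    atoms: List[str],
    bonds: List[Tuple[int, int]],
    radius: int = 2,      # radius 2 corresponds to a diameter of 4
    n_bits: int = 1024,
) -> Set[int]:
    """Return the set of 'on' bit positions for a heavy-atom graph."""
    neighbors: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Layer 0: each atom is described by its element and its degree.
    ids = [_hash(sym, len(neighbors[i])) for i, sym in enumerate(atoms)]
    features = set(ids)

    # Each further layer combines an atom's identifier with the
    # identifiers of its direct neighbors from the previous layer.
    for layer in range(1, radius + 1):
        ids = [
            _hash(layer, ids[i], sorted(ids[j] for j in neighbors[i]))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the identifiers onto a fixed-length bit string.
    return {f % n_bits for f in features}


# Ethanol (C-C-O), heavy atoms only; a hand-written toy input.
ethanol_bits = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(ethanol_bits))
```

For real work, an established cheminformatics toolkit should be used to compute ECFP-type fingerprints rather than a sketch like this one.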

Reference: Extended-Connectivity Fingerprints. Rogers, D.; Hahn, M. J. Chem. Inf. Model. 2010, 50(5), 742-754.

What is MHFP6?

MHFP6 (MinHash fingerprint, up to six bonds) is a molecular fingerprint that encodes detailed substructures using the extended-connectivity principle of ECFP in a fundamentally different manner. This increases the performance of exact nearest neighbor searches in benchmarking studies and enables the use of locality-sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, in approximate nearest neighbor searches MHFP6 is faster than ECFP4 by two orders of magnitude while also decreasing the error rate. The GitHub repository can be found here: MHFP6 GitHub repo
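The MinHash step can be sketched as follows. This is a minimal illustration under stated assumptions, not the MHFP6 implementation: the substructure SMILES sets are hand-written placeholders rather than extracted from molecules, the salted BLAKE2b hashes stand in for whatever hash family the real implementation uses, and the signature length of 128 is an arbitrary choice.

```python
import hashlib
from typing import Iterable, List


def minhash(items: Iterable[str], n_hashes: int = 128) -> List[int]:
    """MinHash signature: the minimum hash of the set under each of
    n_hashes differently salted hash functions."""
    unique = set(items)
    if not unique:
        raise ValueError("cannot MinHash an empty set")
    signature = []
    for seed in range(n_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        s.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "little",
                )
                for s in unique
            )
        )
    return signature


def estimated_jaccard_distance(sig_a: List[int], sig_b: List[int]) -> float:
    """1 minus the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)


# Hypothetical substructure SMILES sets, written by hand for illustration.
shingles_a = {"C", "O", "CC", "CO", "CCO"}
shingles_b = {"C", "O", "CC", "CO", "CCCO", "CCC"}

sig_a = minhash(shingles_a)
sig_b = minhash(shingles_b)
print(estimated_jaccard_distance(sig_a, sig_b))
```

The fraction of agreeing positions estimates the Jaccard similarity of the two underlying sets, so two molecules can be compared using only their fixed-length signatures.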

Reference: A Probabilistic Molecular Fingerprint for Big Data Settings. D. Probst, J.-L. Reymond, J. Cheminform. 2018, 10, 66, doi:10.1186/s13321-018-0321-8.

What is the GDBChEMBL database?

GDBChEMBL is a subset of the GDB17 database (all virtually enumerated compounds with up to 17 heavy atoms). It is a collection of highly diverse molecular structures that are rich in substructures also frequently found among high-confidence-score molecules of ChEMBL 24.1.

Reference: xxx.xxx.xxx

How many molecular structures are present in GDBChEMBL?

~10 million

What is the ChEMBL17_DrugBank17_UNPD17 database?

ChEMBL_DrugBank_UNPD contains compounds from DrugBank, ChEMBL22, and the Universal natural product directory (UNPD). It contains only compounds with up to 17 heavy atoms.

How many compounds are present in the ChEMBL_DrugBank_UNPD database?

~128 thousand compounds

What are the nearest neighbors of a query molecule?

Nearest neighbors are the compounds in a database that are most similar to a query compound.

What similarity metric is used for the similarity calculation?

For MQN, the city block (Manhattan) distance is used. For ECfp4 and MHFP6, the Jaccard distance is used.
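Both distances are simple to compute. The sketch below shows the standard definitions on toy data; the function names and the short example vectors are assumptions for illustration (real MQN vectors have 42 entries), and a bit-string fingerprint is represented here as the set of its "on" bit positions.

```python
from typing import Sequence, Set


def city_block_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute differences between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |A intersect B| / |A union B| for two sets of 'on' bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


# Toy count vectors (shortened stand-ins for 42-value MQN vectors).
mqn_a = [6, 1, 0, 2]
mqn_b = [7, 1, 1, 0]
print(city_block_distance(mqn_a, mqn_b))  # |6-7| + |1-1| + |0-1| + |2-0|

# Toy bit sets: 2 shared bits out of 5 distinct bits in total.
bits_a = {1, 5, 9, 12}
bits_b = {1, 5, 7}
print(jaccard_distance(bits_a, bits_b))
```

A distance of 0 means the two fingerprints are identical; larger values mean less similar molecules. The Jaccard distance is bounded by 1, while the city block distance has no fixed upper bound.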

How is the similarity search implemented?

We use an approximate nearest neighbor search algorithm called Annoy (Approximate Nearest Neighbors Oh Yeah). The GitHub repository can be found here: Annoy GitHub repo

How does similarity searching with MQN-MHFP6 work?

Given a query molecule, we first retrieve the N nearest neighbors using MQN and then re-sort these MQN nearest neighbors by their MHFP6 Jaccard distance to the query.
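The two-stage procedure can be sketched as below. This is an illustration under several assumptions, not the site's implementation: stage 1 uses an exact brute-force scan where the site uses Annoy, the MHFP6 fingerprints are replaced by small hand-written feature sets compared with the exact Jaccard distance, and all names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Entry = Tuple[Sequence[int], Set[int]]  # (MQN-like vector, feature set)


def city_block_distance(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def two_stage_search(
    query: Entry, database: Dict[str, Entry], n: int
) -> List[Tuple[str, float]]:
    """Retrieve n neighbors by city block distance, then re-sort them
    by Jaccard distance to the query."""
    query_vec, query_set = query

    # Stage 1: the n nearest neighbors in MQN space (brute force here).
    candidates = sorted(
        database,
        key=lambda name: city_block_distance(query_vec, database[name][0]),
    )[:n]

    # Stage 2: re-sort only those candidates by Jaccard distance.
    scored = [
        (name, jaccard_distance(query_set, database[name][1]))
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy database; vectors are shortened stand-ins.
toy_db: Dict[str, Entry] = {
    "mol_a": ([6, 1, 0, 2], {1, 2, 3, 4}),
    "mol_b": ([6, 1, 1, 2], {1, 2, 9}),
    "mol_c": ([7, 1, 0, 2], {1, 2, 3, 5}),
    "mol_d": ([12, 4, 3, 0], {1, 2, 3, 4}),
}
toy_query: Entry = ([6, 1, 0, 2], {1, 2, 3, 4})

for name, dist in two_stage_search(toy_query, toy_db, n=3):
    print(name, round(dist, 2))
```

In this toy data, mol_d shares the query's feature set exactly but is never re-sorted because it falls outside the first three MQN neighbors. The choice of N therefore controls the trade-off between speed and the chance of missing such compounds.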