A fingerprint is a numerical representation of a molecule. It contains a set of numbers that usually describe properties of the molecule, such as physicochemical properties, composition, topological features, or substructures. Once the fingerprints of the molecules are computed, they can be used for a variety of calculations, such as similarity searching or model building.
MQN stands for Molecular Quantum Numbers and is one of the fingerprints developed in our group. MQN represents a molecule using 42 integer-valued descriptors of molecular structure, which count different types of atoms, bonds, polar groups, and topological features.
Reference: Classification of Organic Molecules by Molecular Quantum Numbers. K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.
Extended-Connectivity Fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling. They are among the most popular similarity search tools in drug discovery and are used effectively in a wide variety of applications. ECfp4 encodes substructure patterns from a molecule onto a bit string of length 1024 (the length is variable). ECFP generates the substructure patterns by considering the neighborhood of each atom in successive circular layers up to a given diameter. In our browser we use ECFP with a diameter of 4 (ECfp4).
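The step of encoding substructure patterns onto a fixed-length bit string can be illustrated with a small sketch. This is not the ECFP algorithm itself: it assumes the substructure patterns are already available as strings (the `patterns` list is hypothetical), and it uses SHA-1 as a stand-in hash function. The function name `fold_to_bits` is likewise only illustrative.

```python
import hashlib


def fold_to_bits(substructures, n_bits=1024):
    """Set one bit per substructure identifier in a bit string of length n_bits."""
    bits = [0] * n_bits
    for s in substructures:
        # Hash the identifier and map it to a position in the bit string.
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1  # different identifiers may collide on the same bit
    return bits


# Hypothetical substructure identifiers for a small molecule (assumption).
patterns = ["C", "CC", "CCO", "CO", "O"]
fp = fold_to_bits(patterns)
on_bits = [i for i, b in enumerate(fp) if b]
```

Because several patterns can map to the same position, a shorter bit string leads to more collisions, which is one reason the length is left variable.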
Reference: Extended-Connectivity Fingerprints. Rogers, D.; Hahn, M. J. Chem. Inf. Model. 2010, 50(5), 742-754.
MHFP6 (MinHash fingerprint, up to six bonds) is a molecular fingerprint that encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, in approximate nearest neighbor searches MHFP6 is faster than ECFP4 by two orders of magnitude while also decreasing the error rate. The GitHub repository can be found here: MHFP6 GitHub repo
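The MinHash step can be sketched in a few lines. This is a minimal illustration and not the MHFP6 implementation: it assumes the circular substructure SMILES have already been extracted (the two sets below are hypothetical), and it uses seeded SHA-1 hashes as a stand-in for the hash functions of the actual method. The names `minhash` and `estimated_jaccard_distance` are chosen for this example only.

```python
import hashlib


def _hash(item, seed):
    """Seeded 64-bit hash of a string (stand-in for one MinHash hash function)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash(items, n_perm=128):
    """Return a signature holding, for each seed, the minimum hash over the set."""
    items = list(items)
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(_hash(item, seed) for item in items) for seed in range(n_perm)]


def estimated_jaccard_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as the fraction of differing positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)


# Hypothetical substructure SMILES sets for two molecules (assumption).
mol_a = {"C", "CC", "CCO", "CO"}
mol_b = {"C", "CC", "CCN", "CN"}
distance = estimated_jaccard_distance(minhash(mol_a), minhash(mol_b))
```

The signature has a fixed length regardless of how many substructures a molecule has, and two signatures agree at a given position with a probability equal to the Jaccard similarity of the underlying sets. That property is what makes LSH-based approximate nearest neighbor search applicable.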
Reference: A Probabilistic Molecular Fingerprint for Big Data Settings. Daniel Probst, Jean-Louis Reymond, J. Cheminform. 2018, 10, 66, doi:10.1186/s13321-018-0321-8.
GDBChEMBL is a subset of the GDB17 database (virtually enumerated compounds of up to 17 heavy atoms). It is a collection of highly diverse molecular structures that are rich in substructures also frequently found among high-confidence-score molecules of ChEMBL 24.1.
Reference: xxx.xxx.xxx
~10 million compounds
ChEMBL_DrugBank_UNPD contains the compounds from DrugBank, ChEMBL22, and the Universal Natural Product Directory (UNPD). It only contains compounds of up to 17 heavy atoms.
~128 thousand compounds
Nearest neighbors are the compounds in the database that are most similar to a query compound.
For MQN, the city block (Manhattan) distance is used. For ECfp4 and MHFP6, the Jaccard distance is used.
We use an approximate nearest neighbor search algorithm called Annoy (Approximate Nearest Neighbors Oh Yeah). The GitHub repository can be found here: Annoy GitHub repo
Given a query molecule, we first retrieve the N nearest neighbors using MQN and then re-sort these MQN nearest neighbors by their MHFP6 Jaccard distance to the query.
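As a brief illustration of the two distance measures, the sketch below computes the city block distance between two count vectors and the Jaccard distance between two sets. The function names and the example values are assumptions made for this illustration: the vectors are shortened stand-ins for the 42 MQN values, and the sets stand for the positions of the on bits of two bit-string fingerprints.

```python
def city_block_distance(a, b):
    """Sum of absolute differences between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 minus the ratio of shared features to all features in either set."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


# Shortened, made-up count vectors (real MQN vectors have 42 entries).
d_mqn = city_block_distance([6, 1, 0, 2], [5, 1, 1, 4])  # 1 + 0 + 1 + 2 = 4

# Made-up on-bit positions of two bit-string fingerprints.
d_bits = jaccard_distance({1, 5, 9}, {5, 9, 12})  # 1 - 2/4 = 0.5
```

A distance of 0 means the two fingerprints are identical under the given measure. The Jaccard distance is bounded by 1, whereas the city block distance has no fixed upper bound.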
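The two-stage procedure can be sketched as follows. This is a simplified illustration under several assumptions, not the code of the browser: the first stage uses an exhaustive sort instead of Annoy, the second stage uses an exact set-based Jaccard distance on hypothetical substructure sets instead of MHFP6 fingerprints, and the record layout (`name`, `mqn`, `subs`) and the shortened count vectors are invented for the example.

```python
def city_block(a, b):
    """City block (Manhattan) distance between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard distance between two sets of substructure identifiers."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def two_stage_search(query, database, n):
    """Retrieve the n nearest neighbors by MQN, then re-sort them by Jaccard distance."""
    # Stage 1: keep the n entries closest to the query in MQN space.
    candidates = sorted(
        database, key=lambda m: city_block(query["mqn"], m["mqn"])
    )[:n]
    # Stage 2: re-sort only those candidates by substructure-based distance.
    return sorted(candidates, key=lambda m: jaccard(query["subs"], m["subs"]))


# Toy records with made-up values (assumption).
query = {"name": "q", "mqn": [2, 1, 0], "subs": {"C", "CC", "CCO"}}
database = [
    {"name": "a", "mqn": [2, 1, 0], "subs": {"C", "N"}},
    {"name": "b", "mqn": [2, 1, 1], "subs": {"C", "CC", "CCO"}},
    {"name": "c", "mqn": [3, 1, 1], "subs": {"C", "CC"}},
    {"name": "d", "mqn": [9, 9, 9], "subs": {"C", "CC", "CCO"}},
]

hits = [m["name"] for m in two_stage_search(query, database, n=3)]
```

In this toy data, entry `d` has the same substructure set as the query but is far away in MQN space, so it is dropped in the first stage and never reaches the re-sorting step. The final ranking therefore depends on the choice of N.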