Setup

The RDKit pacakge only supports conda installation.

!conda install -c rdkit rdkit -y

Install Conda and RDKit in Google Colab:

!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -c rdkit rdkit -y 

import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

Basic

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem

from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

rdkit.__version__
'2020.09.1'

Chem vs. AllChem

As mentioned in the Getting Started:

The majority of “basic” chemical functionality (e.g. reading/writing molecules, substructure searching, molecular cleanup, etc.) is in the rdkit.Chem module. More advanced, or less frequently used, functionality is in rdkit.Chem.AllChem.

If you find the Chem/AllChem thing annoying or confusing, you can use python’s “import … as …” syntax to remove the irritation:python from rdkit.Chem import AllChem as Chem

Get a RDKit molecule from SMILES. RDKit molecule enable several features to handle molecules: drawing, computing fingerprints/properties, molecular curation etc.

smiles = 'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'
mol = Chem.MolFromSmiles(smiles)
print(mol)
<rdkit.Chem.rdchem.Mol object at 0x7ff9f4f04e90>

The RDKit molecules can be directly printed in jupyter enviroment.

mol

Convert a RDKit molecule to SMILES.

smi = Chem.MolToSmiles(mol)
smi
'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'

Convert a RDKit molecule to InchiKey.

Chem.MolToInchiKey(mol)
'VSIUFPQOEIKNCY-UHFFFAOYSA-N'

Convert a RDKit molecule to coordinative representation (which can be stored in .sdf file).

mol_block = Chem.MolToMolBlock(mol)
print(mol_block)

     RDKit          2D

 23 24  0  0  0  0  0  0  0  0999 V2000
    5.2500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7500   -1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500   -1.2990    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -6.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2500   -1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.0000    2.5981    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.7500    3.8971    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.0000    5.1962    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.7500    6.4952    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    5.1962    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.5000    2.5981    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7500    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    2.5981    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  4  2  0
  3  5  1  0
  5  6  2  0
  6  7  1  0
  7  8  1  0
  8  9  2  0
  9 10  1  0
 10 11  1  0
 11 12  1  0
 12 13  1  0
 12 14  1  0
 10 15  2  0
 15 16  1  0
 16 17  1  0
 17 18  1  0
 17 19  1  0
 15 20  1  0
 20 21  2  0
 21 22  1  0
 22 23  2  0
 22  5  1  0
 21  8  1  0
M  END

SMILES Canonicalization

Chem.MolToSmiles(Chem.MolFromSmiles(smi))
'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'

Reading sets of molecules

Major types of molecular file formats:

  1. .csv file that includes a column of SMILES. See PandasTools section.
  2. .smi/.txt file that includes SMILES. Collect the SMILES as a list. The following code is an example to read a .smi file that contains one SMILES per line.
file_name = 'somedata.smi'

with open(file_name, "r") as ins:
    smiles = []
    for line in ins:
        smiles.append(line.split('\n')[0])
print('# of SMILES:', len(smiles))
  1. .sdf file that includes atom coordinates. Reading molecules from .sdf file. Code Example

Draw molecules in Jupter environment

Print molecules in grid.

smiles = [
    'N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1',
    'c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21',
    'C=C(C)C1Cc2c(ccc3c2OC2COc4cc(OC)c(OC)cc4C2C3=O)O1',
    'ClC(Cl)=C(c1ccc(Cl)cc1)c1ccc(Cl)cc1'
]

mols = [Chem.MolFromSmiles(smi) for smi in smiles]
Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))

Disable Error Message

from rdkit import RDLogger 
RDLogger.DisableLog('rdApp.*')    

Structure Standardization

PandasTools

PandasTools enables using RDKit molecules as columns of a Pandas Dataframe.

import pandas as pd
from rdkit.Chem import PandasTools
url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'

esol_data = pd.read_csv(url)
esol_data.head(1)
smiles logSolubility
0 N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... -0.77

Add ROMol to Pandas Dataframe.

PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)
smiles logSolubility ROMol
0 N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... -0.77 Mol

ROMol column stores rdchem.Mol object.

print(type(esol_data.ROMol[0]))
<class 'rdkit.Chem.rdchem.Mol'>

Draw the structures in grid.

PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="logSolubility", molsPerRow=4)

Adding new columns of properites use Pandas map method.

esol_data["n_Atoms"] = esol_data['ROMol'].map(lambda x: x.GetNumAtoms())
esol_data.head(1)
smiles logSolubility ROMol n_Atoms
0 N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... -0.77 Mol 32

Before saving the dataframe as csv file, it is recommanded to drop the ROMol column.

esol_data = esol_data.drop(['ROMol'], axis=1)
esol_data.head(1)
smiles logSolubility n_Atoms
0 N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... -0.77 32

Descriptors/Fingerprints

The RDKit has a variety of built-in functionality for generating molecular fingerprints/descriptors. A detialed description can be found here.

url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'

esol_data = pd.read_csv(url)
PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)
smiles logSolubility ROMol
0 N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c... -0.77 Mol

Morgan Fingerprint (ECFPx)

AllChem.GetMorganFingerprintAsBitVect Parameters:

  1. radius: no default value, usually set 2 for similarity search and 3 for machine learning.
  2. nBits: number of bits, default is 2048. 1024 is also widely used.
  3. other parameterss are ususlly left to default

More examples can be found in this notebook from my previous work.

radius=3
nBits=1024

ECFP6 = [AllChem.GetMorganFingerprintAsBitVect(x,radius=radius, nBits=nBits) for x in esol_data['ROMol']]
ECFP6[0]
<rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x7ff9f686bd50>

ECFP6 fingerprint for each molecule has 1024 bits.

len(ECFP6[0])
1024

Save as a .csv file for futher use (e.g., machine learning). I usually save (1) SMILES as index and (2) each bit as a column to the csv file.

ecfp6_name = [f'Bit_{i}' for i in range(nBits)]
ecfp6_bits = [list(l) for l in ECFP6]
df_morgan = pd.DataFrame(ecfp6_bits, index = esol_data.smiles, columns=ecfp6_name)
df_morgan.head(1)

Bit_0 Bit_1 Bit_2 Bit_3 Bit_4 Bit_5 Bit_6 Bit_7 Bit_8 Bit_9 ... Bit_1014 Bit_1015 Bit_1016 Bit_1017 Bit_1018 Bit_1019 Bit_1020 Bit_1021 Bit_1022 Bit_1023
smiles
N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

1 rows Ă— 1024 columns

ECFP6 (Count version)

ECFP6_counts = [AllChem.GetHashedMorganFingerprint(m,radius=radius, nBits=nBits) for m in mols]

print(ECFP6_counts[0])
print(len(list(ECFP6_counts[0])))
print(list(ECFP6_counts[0]))

<rdkit.DataStructs.cDataStructs.UIntSparseIntVect object at 0x7ff9f686bcb0>
1024
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0]

MACCS Keys

from rdkit.Chem import MACCSkeys

maccs_keys = [MACCSkeys.GenMACCSKeys(m) for m in mols]

print(maccs_keys[0])
print(len(maccs_keys[0]))
print(list(maccs_keys[0]))

<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x7ff9f600cf30>
167
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

Molecular Quantun Numbers (MQN) Descriptors

from rdkit.Chem import rdMolDescriptors

mqn_ds = [rdMolDescriptors.MQNs_(m) for m in mols]

print(len(mqn_ds[0]))
print(mqn_ds[0])

42
[20, 0, 0, 0, 0, 0, 0, 1, 0, 9, 2, 32, 15, 0, 1, 15, 3, 0, 7, 23, 12, 7, 7, 0, 0, 8, 5, 1, 0, 7, 11, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]

RDKit 2D Descriptors

from rdkit.Chem import Descriptors

desc_list = [n[0] for n in Descriptors._descList]
print(len(desc_list))
print(desc_list)

208
['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']
from rdkit.ML.Descriptors import MoleculeDescriptors

calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_list)

rdkit_desc = [calc.CalcDescriptors(m) for m in mols]

print(len(rdkit_desc[0]))
print(rdkit_desc[0])

208
(10.253328884035504, -1.7016049382716052, 10.253328884035504, 0.48660209297610124, 0.21751778620947723, 457.4320000000001, 430.2160000000003, 457.15841068399993, 178, 0, 0.1882658561787514, -0.3935665150995826, 0.3935665150995826, 0.1882658561787514, 0.8125, 1.375, 1.96875, 16.732837958750743, 9.980560326593041, 2.4752382296424567, -2.4250918642238615, 2.28493545052844, -2.6139256091055985, 5.216692900751268, -0.3337292050602669, 1.6549374848621088, 759.6629378441627, 23.41348460451408, 16.86251979526223, 16.86251979526223, 15.277294663077212, 9.99881597892019, 9.99881597892019, 7.601218129882768, 7.601218129882768, 5.431494058773282, 5.431494058773282, 3.5069302432183513, 3.5069302432183513, -1.73, 12121197.595769433, 24.90347401250815, 10.926355847078131, 5.251706120285141, 182.9353272788593, 54.693143579085124, 48.83173110198641, 18.684019846418728, 0.0, 0.0, 0.0, 0.0, 0.0, 5.261891554738487, 0.0, 30.33183534230805, 5.563451491696996, 0.0, 19.28298524181811, 54.693143579085124, 0.0, 5.261891554738487, 0.0, 0.0, 67.51575094840514, 13.213763929025836, 35.89528683400505, 0.0, 6.069221312792274, 0.0, 0.0, 0.0, 0.0, 110.37124025356763, 18.947451815200196, 11.33111286753076, 11.6674178794453, 30.33183534230805, 0.0, 0.0, 0.0, 202.32, 80.72951487743097, 41.00758331862342, 0.0, 0.0, 5.563451491696996, 0.0, 0.0, 30.33183534230805, 6.069221312792274, 0.0, 18.947451815200196, 21.59235497971106, 0.0, 0.0, 79.05525356967499, 0.48660209297610124, 0.0, 10.315179555141746, -16.47644309455915, -1.1396137696114104, 0.0, 0.65, 32, 7, 12, 0, 2, 2, 1, 0, 1, 12, 7, 12, 7, 0, 2, 2, 3, -3.108019999999997, 102.28160000000001, 0, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

In total, we got 208 descriptors. There are two major categories: (1) physicochemical properties and (2) Fraction of a substructure (e.g., 'fr_Al_COO'). For most of the molecules, you will get a lot zeros for the 2nd category descriptors. The following code will only compute the 1st category descriptors.

phc_desc = [i for i in desc_list if not i.startswith('fr_')]
len(phc_desc)
print(phc_desc)

['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR']
calc = MoleculeDescriptors.MolecularDescriptorCalculator(phc_desc)

rdkit_desc_sub = [calc.CalcDescriptors(m) for m in mols]

print(len(rdkit_desc_sub[0]))
print(rdkit_desc_sub[0])

123
(10.253328884035504, -1.7016049382716052, 10.253328884035504, 0.48660209297610124, 0.21751778620947723, 457.4320000000001, 430.2160000000003, 457.15841068399993, 178, 0, 0.1882658561787514, -0.3935665150995826, 0.3935665150995826, 0.1882658561787514, 0.8125, 1.375, 1.96875, 16.732837958750743, 9.980560326593041, 2.4752382296424567, -2.4250918642238615, 2.28493545052844, -2.6139256091055985, 5.216692900751268, -0.3337292050602669, 1.6549374848621088, 759.6629378441627, 23.41348460451408, 16.86251979526223, 16.86251979526223, 15.277294663077212, 9.99881597892019, 9.99881597892019, 7.601218129882768, 7.601218129882768, 5.431494058773282, 5.431494058773282, 3.5069302432183513, 3.5069302432183513, -1.73, 12121197.595769433, 24.90347401250815, 10.926355847078131, 5.251706120285141, 182.9353272788593, 54.693143579085124, 48.83173110198641, 18.684019846418728, 0.0, 0.0, 0.0, 0.0, 0.0, 5.261891554738487, 0.0, 30.33183534230805, 5.563451491696996, 0.0, 19.28298524181811, 54.693143579085124, 0.0, 5.261891554738487, 0.0, 0.0, 67.51575094840514, 13.213763929025836, 35.89528683400505, 0.0, 6.069221312792274, 0.0, 0.0, 0.0, 0.0, 110.37124025356763, 18.947451815200196, 11.33111286753076, 11.6674178794453, 30.33183534230805, 0.0, 0.0, 0.0, 202.32, 80.72951487743097, 41.00758331862342, 0.0, 0.0, 5.563451491696996, 0.0, 0.0, 30.33183534230805, 6.069221312792274, 0.0, 18.947451815200196, 21.59235497971106, 0.0, 0.0, 79.05525356967499, 0.48660209297610124, 0.0, 10.315179555141746, -16.47644309455915, -1.1396137696114104, 0.0, 0.65, 32, 7, 12, 0, 2, 2, 1, 0, 1, 12, 7, 12, 7, 0, 2, 2, 3, -3.108019999999997, 102.28160000000001)

Compute the similarity of a reference molecule and a list of molecules. Here is an example of using ECFP4 fingerprint to compute the Tanimoto Similarity (the default metric of DataStructs.FingerprintSimilarity.

chemfp is a set of command-line tools and a Python library for fingerprint generation and high-performance similarity search.

  1. compute fingerprints
ref_smiles = 'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'
ref_mol = Chem.MolFromSmiles(ref_smiles)
ref_ECFP4_fps = AllChem.GetMorganFingerprintAsBitVect(ref_mol,2)
ref_mol
bulk_ECFP4_fps = [AllChem.GetMorganFingerprintAsBitVect(x,2) for x in esol_data['ROMol']]
from rdkit import DataStructs

similarity_efcp4 = [DataStructs.FingerprintSimilarity(ref_ECFP4_fps,x) for x in bulk_ECFP4_fps]

We can also add the similarity_efcp4 to the dataframe and visualize the structure and similarity.

esol_data['Tanimoto_Similarity (ECFP4)'] = similarity_efcp4
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)

Sort the result from highest to lowest.

esol_data = esol_data.sort_values(['Tanimoto_Similarity (ECFP4)'], ascending=False)
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)

Native Method

This method is suitable for search substructures in a small dataset. The query substructure can be built from either SMILES or SMARTS. But the semantics of them are not exactly equivalent. In practice, SMARTS is more commonly used. SMARTS allows complex atom and bond expressions.

m = Chem.MolFromSmiles('c1ccccc1O')
patt = Chem.MolFromSmarts('ccO')
m.HasSubstructMatch(patt)
True

Here is an exmaple to find the matched molecules from a dataset.

patt = Chem.MolFromSmarts('c1ccncn1')
patt
matches = [m for m in esol_data['ROMol'] if m.HasSubstructMatch(patt)]
print(f'There are {len(matches)} matched molecules')
matches[0]
There are 48 matched molecules

MCS

from rdkit.Chem import rdFMCS
smiles = [
    'N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1',
    'c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21',
    'C=C(C)C1Cc2c(ccc3c2OC2COc4cc(OC)c(OC)cc4C2C3=O)O1',
    'ClC(Cl)=C(c1ccc(Cl)cc1)c1ccc(Cl)cc1'
]

mols = [Chem.MolFromSmiles(smi) for smi in smiles]

res = rdFMCS.FindMCS(mols)
res.smartsString
'[#6](-,:[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1)-,:[#6]'
res_mol = Chem.MolFromSmarts(res.smartsString)
res_mol

Highlight the MCS in molecules.

highlight_mcs = [mMol.GetSubstructMatch(res_mol) for mMol in mols]
Draw.MolsToGridImage(mols, 
                     highlightAtomLists = highlight_mcs,
                     subImgSize=(250,250), useSVG=False, molsPerRow=4)

Molecular Transformations

Molecular Decomposition

More Reading

  1. Offical documentation.
  2. Getting Started with the RDKit in Python
  3. The RDKit Book
  4. RDKit Cookbook

    This document provides example recipes of how to carry out particular tasks using the RDKit functionality from Python. The contents have been contributed by the RDKit community, tested with the latest RDKit release, and then compiled into this document.