My RDKit Cheatsheet
- Setup
- Basic
- Structure Standardization
- PandasTools
- Descriptors/Fingerprints
- Similarity Search
- Substructure Search
- MCS
- Molecular Transformations
- Molecular Decomposition
- More Reading
The RDKit
pacakge only supports conda
installation.
!conda install -c rdkit rdkit -y
Install Conda and RDKit in Google Colab:
!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
rdkit.__version__
Chem vs. AllChem
As mentioned in the Getting Started:
The majority of “basic” chemical functionality (e.g. reading/writing molecules, substructure searching, molecular cleanup, etc.) is in the
rdkit.Chem
module. More advanced, or less frequently used, functionality is inrdkit.Chem.AllChem
.If you find the Chem/AllChem thing annoying or confusing, you can use python’s “import … as …” syntax to remove the irritation:
python from rdkit.Chem import AllChem as Chem
Get a RDKit molecule
from SMILES. RDKit molecule
enable several features to handle molecules: drawing, computing fingerprints/properties, molecular curation etc.
smiles = 'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'
mol = Chem.MolFromSmiles(smiles)
print(mol)
The RDKit molecules can be directly printed in jupyter enviroment.
mol
Convert a RDKit molecule to SMILES.
smi = Chem.MolToSmiles(mol)
smi
Convert a RDKit molecule to InchiKey.
Chem.MolToInchiKey(mol)
Convert a RDKit molecule to coordinative representation (which can be stored in .sdf
file).
mol_block = Chem.MolToMolBlock(mol)
print(mol_block)
Chem.MolToSmiles(Chem.MolFromSmiles(smi))
Reading sets of molecules
Major types of molecular file formats:
-
.csv
file that includes a column ofSMILES
. SeePandasTools
section. -
.smi/.txt
file that includesSMILES
. Collect the SMILES as a list. The following code is an example to read a.smi
file that contains one SMILES per line.
file_name = 'somedata.smi'
with open(file_name, "r") as ins:
smiles = []
for line in ins:
smiles.append(line.split('\n')[0])
print('# of SMILES:', len(smiles))
-
.sdf
file that includesatom coordinates
. Reading molecules from.sdf
file. Code Example
Print molecules in grid.
smiles = [
'N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1',
'c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21',
'C=C(C)C1Cc2c(ccc3c2OC2COc4cc(OC)c(OC)cc4C2C3=O)O1',
'ClC(Cl)=C(c1ccc(Cl)cc1)c1ccc(Cl)cc1'
]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
import pandas as pd
from rdkit.Chem import PandasTools
url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'
esol_data = pd.read_csv(url)
esol_data.head(1)
Add ROMol
to Pandas Dataframe.
PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)
ROMol
column stores rdchem.Mol
object.
print(type(esol_data.ROMol[0]))
Draw the structures in grid.
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="logSolubility", molsPerRow=4)
Adding new columns of properites use Pandas
map method.
esol_data["n_Atoms"] = esol_data['ROMol'].map(lambda x: x.GetNumAtoms())
esol_data.head(1)
Before saving the dataframe as csv file, it is recommanded to drop the ROMol
column.
esol_data = esol_data.drop(['ROMol'], axis=1)
esol_data.head(1)
Descriptors/Fingerprints
The RDKit has a variety of built-in functionality for generating molecular fingerprints/descriptors. A detialed description can be found here.
url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'
esol_data = pd.read_csv(url)
PandasTools.AddMoleculeColumnToFrame(esol_data, smilesCol='smiles')
esol_data.head(1)
Morgan Fingerprint (ECFPx)
AllChem.GetMorganFingerprintAsBitVect
Parameters:
-
radius
: no default value, usually set 2 for similarity search and 3 for machine learning. -
nBits
: number of bits, default is 2048. 1024 is also widely used. - other parameterss are ususlly left to default
More examples can be found in this notebook from my previous work.
radius=3
nBits=1024
ECFP6 = [AllChem.GetMorganFingerprintAsBitVect(x,radius=radius, nBits=nBits) for x in esol_data['ROMol']]
ECFP6[0]
ECFP6 fingerprint for each molecule has 1024 bits.
len(ECFP6[0])
Save as a .csv
file for futher use (e.g., machine learning). I usually save (1) SMILES as index and (2) each bit as a column to the csv file.
ecfp6_name = [f'Bit_{i}' for i in range(nBits)]
ecfp6_bits = [list(l) for l in ECFP6]
df_morgan = pd.DataFrame(ecfp6_bits, index = esol_data.smiles, columns=ecfp6_name)
df_morgan.head(1)
ECFP6_counts = [AllChem.GetHashedMorganFingerprint(m,radius=radius, nBits=nBits) for m in mols]
print(ECFP6_counts[0])
print(len(list(ECFP6_counts[0])))
print(list(ECFP6_counts[0]))
from rdkit.Chem import MACCSkeys
maccs_keys = [MACCSkeys.GenMACCSKeys(m) for m in mols]
print(maccs_keys[0])
print(len(maccs_keys[0]))
print(list(maccs_keys[0]))
from rdkit.Chem import rdMolDescriptors
mqn_ds = [rdMolDescriptors.MQNs_(m) for m in mols]
print(len(mqn_ds[0]))
print(mqn_ds[0])
from rdkit.Chem import Descriptors
desc_list = [n[0] for n in Descriptors._descList]
print(len(desc_list))
print(desc_list)
from rdkit.ML.Descriptors import MoleculeDescriptors
calc = MoleculeDescriptors.MolecularDescriptorCalculator(desc_list)
rdkit_desc = [calc.CalcDescriptors(m) for m in mols]
print(len(rdkit_desc[0]))
print(rdkit_desc[0])
In total, we got 208 descriptors. There are two major categories: (1) physicochemical properties and (2) Fraction of a substructure (e.g., 'fr_Al_COO'). For most of the molecules, you will get a lot zeros for the 2nd category descriptors. The following code will only compute the 1st category descriptors.
phc_desc = [i for i in desc_list if not i.startswith('fr_')]
len(phc_desc)
print(phc_desc)
calc = MoleculeDescriptors.MolecularDescriptorCalculator(phc_desc)
rdkit_desc_sub = [calc.CalcDescriptors(m) for m in mols]
print(len(rdkit_desc_sub[0]))
print(rdkit_desc_sub[0])
Similarity Search
Compute the similarity of a reference molecule and a list of molecules. Here is an example of using ECFP4 fingerprint to compute the Tanimoto Similarity
(the default metric of DataStructs.FingerprintSimilarity.
chemfp is a set of command-line tools and a Python library for fingerprint generation and high-performance similarity search.
- compute fingerprints
ref_smiles = 'COC(=O)c1c[nH]c2cc(OC(C)C)c(OC(C)C)cc2c1=O'
ref_mol = Chem.MolFromSmiles(ref_smiles)
ref_ECFP4_fps = AllChem.GetMorganFingerprintAsBitVect(ref_mol,2)
ref_mol
bulk_ECFP4_fps = [AllChem.GetMorganFingerprintAsBitVect(x,2) for x in esol_data['ROMol']]
from rdkit import DataStructs
similarity_efcp4 = [DataStructs.FingerprintSimilarity(ref_ECFP4_fps,x) for x in bulk_ECFP4_fps]
We can also add the similarity_efcp4
to the dataframe and visualize the structure and similarity.
esol_data['Tanimoto_Similarity (ECFP4)'] = similarity_efcp4
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)
Sort the result from highest to lowest.
esol_data = esol_data.sort_values(['Tanimoto_Similarity (ECFP4)'], ascending=False)
PandasTools.FrameToGridImage(esol_data.head(8), legendsCol="Tanimoto_Similarity (ECFP4)", molsPerRow=4)
m = Chem.MolFromSmiles('c1ccccc1O')
patt = Chem.MolFromSmarts('ccO')
m.HasSubstructMatch(patt)
Here is an exmaple to find the matched molecules from a dataset.
patt = Chem.MolFromSmarts('c1ccncn1')
patt
matches = [m for m in esol_data['ROMol'] if m.HasSubstructMatch(patt)]
print(f'There are {len(matches)} matched molecules')
matches[0]
from rdkit.Chem import rdFMCS
smiles = [
'N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c1ccccc1',
'c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21',
'C=C(C)C1Cc2c(ccc3c2OC2COc4cc(OC)c(OC)cc4C2C3=O)O1',
'ClC(Cl)=C(c1ccc(Cl)cc1)c1ccc(Cl)cc1'
]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
res = rdFMCS.FindMCS(mols)
res.smartsString
res_mol = Chem.MolFromSmarts(res.smartsString)
res_mol
Highlight the MCS in molecules.
highlight_mcs = [mMol.GetSubstructMatch(res_mol) for mMol in mols]
Draw.MolsToGridImage(mols,
highlightAtomLists = highlight_mcs,
subImgSize=(250,250), useSVG=False, molsPerRow=4)
More Reading
- Offical documentation.
- Getting Started with the RDKit in Python
- The RDKit Book
-
RDKit Cookbook
This document provides example recipes of how to carry out particular tasks using the RDKit functionality from Python. The contents have been contributed by the RDKit community, tested with the latest RDKit release, and then compiled into this document.