Molecular standardization.

remove_mixture[source]

remove_mixture(mol)

Function to remove mixtures (entries that contain more than one molecule).

MolStandardizer[source]

MolStandardizer(smiles)

The main function for molecular standardization:

  • Sanitize with RDKit: sanitize mol; remove Hs; disconnect metals; normalize mol; reionize mol; recalculate stereochemistry.
  • Replace all atoms with the most abundant isotope for that element.
  • Remove counterions in the salts and neutralize the molecules.
  • Remove the mixture.

DSclearner[source]

DSclearner(df)

Standardize SMILES and remove the duplicates in a dataset.

The column containing the SMILES must be named 'smiles'.

Example 1: MolStandardizer

Original SMILES:

orig_smiles = ['[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl', # contains counterion and solvent
              'O=[13C]([O-])c1ccccc1', # contains an isotope label and a charged group
              'O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(=O)C[C@H](C)[C@H](N)C(O)=O', # mixture
              'C(=O)CC[C@@H](N)C(O)=O', # stereochemistry 1
              'C(=O)CC[C@H](N)C(O)=O', # stereochemistry 2
              'O=C1NC=CC=C1', # tautomer 1
              'Oc1ccccn1' # tautomer 2
              ]

Standardized SMILES:

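import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole  # enables inline molecule rendering in Jupyter

# Standardize every input; a rejected input (such as a mixture) shows up as NaN in the frame.
std_smiles = [MolStandardizer(smi) for smi in orig_smiles]

# Put inputs and outputs side by side and add rendered-molecule columns.
df = pd.DataFrame({'orig_smiles': orig_smiles, 'std_smiles': std_smiles})
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='orig_smiles', molCol='Before Standardization')
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='std_smiles', molCol='After Standardization')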
RDKit ERROR: [14:59:56] SMILES Parse Error: syntax error while parsing: NaN
RDKit ERROR: [14:59:56] SMILES Parse Error: Failed parsing SMILES 'NaN' for input: 'NaN'

(1) Solvents and metal counterions are removed.

df.iloc[0:1,]
   orig_smiles                                        std_smiles                  Before Standardization  After Standardization
0  [Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(C...  O=C(O)c1ccc(C[S](=O)=O)cc1  Mol                     Mol

(2) Less abundant isotopes are replaced by the most abundant isotope of the element.

df.iloc[1:2,]
   orig_smiles            std_smiles      Before Standardization  After Standardization
1  O=[13C]([O-])c1ccccc1  O=C(O)c1ccccc1  Mol                     Mol
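
To make the isotope step concrete, here is a minimal text-level sketch. It is not how MolStandardizer is implemented (that is not shown on this page); it only assumes that the isotope number is the run of digits directly after the opening bracket of a SMILES bracket atom. The name strip_isotope_labels is hypothetical. With RDKit atoms, the comparable operation would be clearing the label on each atom with Atom.SetIsotope(0).

```python
import re

# Matches an isotope number written directly after the opening bracket
# of a SMILES bracket atom, e.g. the "13" in "[13C]".
_ISOTOPE = re.compile(r"\[\d+")


def strip_isotope_labels(smiles: str) -> str:
    """Drop isotope numbers from bracket atoms; everything else is untouched."""
    return _ISOTOPE.sub("[", smiles)


print(strip_isotope_labels("O=[13C]([O-])c1ccccc1"))  # O=[C]([O-])c1ccccc1
```

The result still carries the brackets and the charge. Producing the canonical, neutralized form shown in the table (O=C(O)c1ccccc1) requires a cheminformatics toolkit, not string editing.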

(3) Mixtures of molecules are removed; no standardized SMILES is returned (NaN).

df.iloc[2:3,]
   orig_smiles                                        std_smiles  Before Standardization  After Standardization
2  O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C...  NaN         Mol                     None
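
One way to picture why row 2 is rejected while row 0 is kept: in SMILES, "." separates disconnected components. The sketch below is not the library's remove_mixture, whose criteria are not documented here. It assumes a hypothetical IGNORED_FRAGMENTS set and compares raw strings, so it only recognizes a solvent written exactly as listed, and it ignores the rare case of ring-closure digits that bond atoms across a dot.

```python
from typing import Optional

# Hypothetical list of fragments to ignore (water, chloroform). A real tool
# would match structures with a toolkit rather than compare raw strings.
IGNORED_FRAGMENTS = {"O", "C(Cl)(Cl)Cl"}


def single_component_or_none(smiles: str) -> Optional[str]:
    """Return the only non-ignored component, or None if there is not exactly one."""
    # "." separates disconnected components in SMILES.
    parts = [p for p in smiles.split(".") if p and p not in IGNORED_FRAGMENTS]
    return parts[0] if len(parts) == 1 else None


row0 = "[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl"
row2 = ("O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1"
        ".C(=O)C[C@H](C)[C@H](N)C(O)=O")

print(single_component_or_none(row0))  # one component left after dropping the solvent
print(single_component_or_none(row2))  # None: two components remain
```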

(4) Stereochemistry is kept.

df.iloc[3:5,]
   orig_smiles             std_smiles            Before Standardization  After Standardization
3  C(=O)CC[C@@H](N)C(O)=O  N[C@H](CCC=O)C(=O)O   Mol                     Mol
4  C(=O)CC[C@H](N)C(O)=O   N[C@@H](CCC=O)C(=O)O  Mol                     Mol

(5) Tautomers are kept; the two forms are not merged into one.

df.iloc[5:,]
   orig_smiles   std_smiles     Before Standardization  After Standardization
5  O=C1NC=CC=C1  O=c1cccc[nH]1  Mol                     Mol
6  Oc1ccccn1     Oc1ccccn1      Mol                     Mol

Example 2: DSclearner

sample_smiles = ['[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl', # contains counterion and solvent
              'O=[13C]([O-])c1ccccc1', # contains an isotope label and a charged group
              'O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(=O)C[C@H](C)[C@H](N)C(O)=O', # mixture
              'C(=O)CC[C@@H](N)C(O)=O', # stereochemistry 1
              'C(=O)CC[C@H](N)C(O)=O', # stereochemistry 2
              'O=C1NC=CC=C1', # tautomer 1
              'Oc1ccccn1', # tautomer 2
              'Oc1ccccn1' # duplicate of the previous entry
              ]

df_smiles = pd.DataFrame({'smiles': sample_smiles})
df_smiles.shape
(8, 1)

Dataset before Standardization:

df_smiles
smiles
0 [Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(C...
1 O=[13C]([O-])c1ccccc1
2 O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C...
3 C(=O)CC[C@@H](N)C(O)=O
4 C(=O)CC[C@H](N)C(O)=O
5 O=C1NC=CC=C1
6 Oc1ccccn1
7 Oc1ccccn1

Dataset after Standardization:

df_clean = DSclearner(df_smiles)
df_clean
smiles
0 O=C(O)c1ccc(C[S](=O)=O)cc1
1 O=C(O)c1ccccc1
3 N[C@H](CCC=O)C(=O)O
4 N[C@@H](CCC=O)C(=O)O
5 O=c1cccc[nH]1
6 Oc1ccccn1
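
The retained index labels (0, 1, 3, 4, 5, 6) can be reproduced with a small pure-Python sketch of the clean-up logic. It is an illustration under stated assumptions, not DSclearner's code: clean_smiles is a hypothetical helper, and the standardizer is replaced by a lookup table (SHOWN) filled with the input/output pairs displayed on this page, so that the example runs without RDKit. The mixture has no entry in the table and therefore maps to None.

```python
from typing import Callable, List, Optional, Tuple


def clean_smiles(
    smiles_list: List[str],
    standardize: Callable[[str], Optional[str]],
) -> List[Tuple[int, str]]:
    """Standardize each entry, then drop failures and later duplicates.

    Returns (original_index, standardized_smiles) pairs.
    """
    seen = set()
    kept = []
    for idx, smi in enumerate(smiles_list):
        std = standardize(smi)
        if std is None or std in seen:
            continue  # rejected input (e.g. a mixture) or duplicate
        seen.add(std)
        kept.append((idx, std))
    return kept


# Stand-in for the real standardizer: the pairs shown in the tables above.
SHOWN = {
    "[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl": "O=C(O)c1ccc(C[S](=O)=O)cc1",
    "O=[13C]([O-])c1ccccc1": "O=C(O)c1ccccc1",
    "C(=O)CC[C@@H](N)C(O)=O": "N[C@H](CCC=O)C(=O)O",
    "C(=O)CC[C@H](N)C(O)=O": "N[C@@H](CCC=O)C(=O)O",
    "O=C1NC=CC=C1": "O=c1cccc[nH]1",
    "Oc1ccccn1": "Oc1ccccn1",
}

sample = [
    "[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl",
    "O=[13C]([O-])c1ccccc1",
    "O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(=O)C[C@H](C)[C@H](N)C(O)=O",
    "C(=O)CC[C@@H](N)C(O)=O",
    "C(=O)CC[C@H](N)C(O)=O",
    "O=C1NC=CC=C1",
    "Oc1ccccn1",
    "Oc1ccccn1",
]

kept = clean_smiles(sample, SHOWN.get)
print([idx for idx, _ in kept])  # [0, 1, 3, 4, 5, 6]
```

Row 2 is dropped because the mixture yields no standardized SMILES, and row 7 is dropped because its standardized form repeats row 6. The two tautomers in rows 5 and 6 remain distinct and are both kept.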