Molecular standardization.
Example 1: MolStandardizer
Original SMILES:
orig_smiles = [
    '[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl',  # contains counterion and solvent
    'O=[13C]([O-])c1ccccc1',  # contains an isotope label
    'O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(=O)C[C@H](C)[C@H](N)C(O)=O',  # mixture
    'C(=O)CC[C@@H](N)C(O)=O',  # stereochemistry 1
    'C(=O)CC[C@H](N)C(O)=O',  # stereochemistry 2
    'O=C1NC=CC=C1',  # tautomer 1
    'Oc1ccccn1',  # tautomer 2
]
Standardized SMILES:
std_smiles = [MolStandardizer(smi) for smi in orig_smiles]
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
df = pd.DataFrame({'orig_smiles': orig_smiles, 'std_smiles': std_smiles})
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='orig_smiles', molCol='Before Standardization')
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='std_smiles', molCol='After Standardization')
(1) Solvents and metal counterions will be removed.
df.iloc[0:1,]
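The internals of MolStandardizer are not shown here. As a rough sketch of how solvent and counterion removal might work with plain RDKit, one option is to split the input into fragments and keep the one with the most heavy atoms. The helper name `keep_largest_fragment` and the simplified input SMILES are illustrative assumptions, not part of the original example.

```python
from rdkit import Chem


def keep_largest_fragment(smiles):
    """Return the canonical SMILES of the fragment with the most heavy atoms.

    Hypothetical helper for illustration; MolStandardizer may use a different rule.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Each disconnected component ('.'-separated part) becomes its own Mol.
    fragments = Chem.GetMolFrags(mol, asMols=True)
    largest = max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())
    return Chem.MolToSmiles(largest)


# Sodium benzoate plus chloroform: only the benzoate fragment is kept.
parent = keep_largest_fragment('[Na+].[O-]C(=O)c1ccccc1.ClC(Cl)Cl')
print(parent)
```

Note that this sketch leaves the carboxylate charged; a full standardizer would typically also neutralize the remaining fragment.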
(2) Non-abundant isotopes will be replaced by the most abundant isotope.
df.iloc[1:2,]
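One way to reproduce the isotope step with plain RDKit is to reset the isotope label on every atom, which makes the atom revert to its default (natural abundance) form in the output SMILES. This is a minimal sketch under that assumption; `strip_isotopes` is a hypothetical helper name and may not match what MolStandardizer does internally.

```python
from rdkit import Chem


def strip_isotopes(smiles):
    """Return canonical SMILES with all explicit isotope labels cleared.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)  # 0 means "no isotope specified"
    return Chem.MolToSmiles(mol)


# The 13C-labelled benzoate from the example list above.
unlabeled = strip_isotopes('O=[13C]([O-])c1ccccc1')
print(unlabeled)
```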
(3) Mixtures of molecules will be removed.
df.iloc[2:3,]
(4) Stereochemistry will be kept.
df.iloc[3:5,]
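To see why keeping stereochemistry matters, the two stereoisomers from the example list can be compared directly in RDKit. The sketch below only uses RDKit's own SMILES writer and does not call MolStandardizer: with stereo information the two inputs give different canonical SMILES, and without it they collapse to the same string.

```python
from rdkit import Chem

isomer_a = Chem.MolFromSmiles('C(=O)CC[C@@H](N)C(O)=O')
isomer_b = Chem.MolFromSmiles('C(=O)CC[C@H](N)C(O)=O')

# Default output keeps the chiral tags, so the enantiomers stay distinct.
iso_a = Chem.MolToSmiles(isomer_a)
iso_b = Chem.MolToSmiles(isomer_b)

# Dropping stereo information makes the two inputs indistinguishable.
flat_a = Chem.MolToSmiles(isomer_a, isomericSmiles=False)
flat_b = Chem.MolToSmiles(isomer_b, isomericSmiles=False)

print(iso_a != iso_b)    # stereo-aware strings differ
print(flat_a == flat_b)  # stereo-free strings match
```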
(5) Tautomers will be kept.
df.iloc[5:,]
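A quick way to check that two tautomers remain distinct records is to compare their canonical SMILES. The sketch below uses only RDKit's SMILES canonicalization, which normalizes atom ordering and aromaticity but does not move hydrogens between tautomeric forms; it is an illustration, not the MolStandardizer implementation.

```python
from rdkit import Chem

# 2-pyridone and 2-hydroxypyridine from the example list above.
pyridone = Chem.MolToSmiles(Chem.MolFromSmiles('O=C1NC=CC=C1'))
hydroxypyridine = Chem.MolToSmiles(Chem.MolFromSmiles('Oc1ccccn1'))

# Canonical SMILES differ, so the two tautomers are treated as separate entries.
print(pyridone != hydroxypyridine)
```

If merging tautomers were desired instead, RDKit's `rdMolStandardize.TautomerEnumerator` would be the usual starting point.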
Example 2: DSclearner
sample_smiles = [
    '[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(Cl)(Cl)Cl',  # contains counterion and solvent
    'O=[13C]([O-])c1ccccc1',  # contains an isotope label
    'O.[Na]O[13C](=O)c1ccc(C[S+2]([O-])([O-]))cc1.C(=O)C[C@H](C)[C@H](N)C(O)=O',  # mixture
    'C(=O)CC[C@@H](N)C(O)=O',  # stereochemistry 1
    'C(=O)CC[C@H](N)C(O)=O',  # stereochemistry 2
    'O=C1NC=CC=C1',  # tautomer 1
    'Oc1ccccn1',  # tautomer 2
    'Oc1ccccn1',  # duplicate of tautomer 2
]
df_smiles = pd.DataFrame({'smiles': sample_smiles})
df_smiles.shape
Dataset before Standardization:
df_smiles
Dataset after Standardization:
df_clean = DSclearner(df_smiles)
df_clean
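The sample list includes a deliberate duplicate, which suggests that DSclearner also deduplicates the dataset, although its implementation is not shown here. A minimal sketch of that idea, assuming duplicates are detected by canonical SMILES, could look like the following. The column name `canonical` and the helper `to_canonical` are hypothetical.

```python
import pandas as pd
from rdkit import Chem


def to_canonical(smiles):
    """Return RDKit canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Two different spellings of 2-hydroxypyridine plus its 2-pyridone tautomer.
df_demo = pd.DataFrame({'smiles': ['Oc1ccccn1', 'n1c(O)cccc1', 'O=C1NC=CC=C1']})
df_demo['canonical'] = df_demo['smiles'].map(to_canonical)

# Rows with the same canonical SMILES are collapsed; the first occurrence is kept.
df_unique = df_demo.drop_duplicates(subset='canonical').reset_index(drop=True)
print(df_unique)
```

Comparing canonical SMILES rather than the raw strings catches duplicates that are written differently, while still keeping tautomers as separate rows.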