Functions for `Random Split` and `Scaffold Split`
Example 1: Generating Scaffolds from SMILES
smiles = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
'O=C(O)c1ccccc1',
'N[C@H](CCC=O)C(=O)O',
'N[C@@H](CCC=O)C(=O)O',
'O=c1cccc[nH]1',
'Oc1ccccn1',
'Cc1cc(Oc2nccc(CCC)c2)ccc1',
'COc1cc(OC)c(S(=O)(=O)N2c3ccccc3CCC2C)cc1NC(=O)CSCC(=O)O',
'Nc1ccc(-c2nc3ccc(O)cc3s2)cc1',
'O=C(O)c1cccc(N2CCC(CN3CCC(Oc4ccc(Cl)c(Cl)c4)CC3)CC2)c1',
'CSc1c(C(=O)NC2C3CC4CC(C3)CC2C4)cnn1-c1ccc(C(=O)O)cc1'
]
# Compute the scaffold SMILES for each molecule
scaffolds = []
for smi in smiles:
    scaffolds.append(generate_scaffold(smi))
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
data = pd.DataFrame({'smiles': smiles, 'scaffold': scaffolds})
PandasTools.AddMoleculeColumnToFrame(data, smilesCol='smiles', molCol='Mol (SMILES)')
PandasTools.AddMoleculeColumnToFrame(data, smilesCol='scaffold', molCol='Mol (Scaffold)')
data.iloc[:,2:]
Example 2: Collecting scaffolds from a list of SMILES
lipo_data = pd.read_csv('../clean_data/Lipophilicity.csv')
lipo_data.shape
lipo_data.head(1)
scaffolds = scaffold_to_smiles(lipo_data.smiles, use_indices=True)
# Count the scaffolds that are represented by exactly one molecule
counts = 0
for i in scaffolds.keys():
    if len(scaffolds[i]) == 1:
        counts += 1
print(f'There are {counts} scaffolds ({counts/len(scaffolds):.2f} of all scaffolds) appearing only once.')
from rdkit import Chem

# Find the scaffold shared by the largest number of molecules
max_counts = 0
for i in scaffolds.keys():
    if len(scaffolds[i]) >= max_counts:
        max_counts = len(scaffolds[i])
        scaffold_max = i
print(f'The scaffold {scaffold_max} appears {max_counts} times, which is the most.')
Chem.MolFromSmiles(scaffold_max)
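The body of `scaffold_to_smiles` is not shown in this section. As a minimal sketch, assuming it simply groups molecules by their scaffold string and that `use_indices=True` stores row positions rather than SMILES, the grouping could look roughly like this. The helper name `group_by_scaffold` and the toy strings are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def group_by_scaffold(smiles, scaffolds, use_indices=True):
    """Group molecules by scaffold string (hypothetical helper, not the notebook's function)."""
    groups = defaultdict(set)
    for idx, (smi, scaffold) in enumerate(zip(smiles, scaffolds)):
        # Store the row position or the SMILES itself, depending on use_indices
        groups[scaffold].add(idx if use_indices else smi)
    return dict(groups)

# Toy input: one precomputed scaffold string per molecule (illustrative values only)
toy_smiles = ['Cc1ccccc1', 'Oc1ccccc1', 'CCO', 'Cc1ccncc1', 'Nc1ccccc1']
toy_scaffolds = ['c1ccccc1', 'c1ccccc1', '', 'c1ccncc1', 'c1ccccc1']

groups = group_by_scaffold(toy_smiles, toy_scaffolds)
# Scaffolds that occur once, and the scaffold with the most members
singletons = [s for s, members in groups.items() if len(members) == 1]
largest = max(groups, key=lambda s: len(groups[s]))
print(len(singletons), largest, sorted(groups[largest]))
```

Keeping indices instead of SMILES makes it easy to slice the original DataFrame afterwards.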
Example 3: Scaffold Split
splits, splits_index = scaffold_split(lipo_data.smiles, balanced=True, seed=0)
train, val, test = splits
print(len(train), len(val), len(test))
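How `scaffold_split` distributes scaffolds is not spelled out here. The sketch below shows one common way a balanced scaffold split can be done, under stated assumptions: whole scaffold groups are never divided, groups larger than half the validation or test budget are placed first, and the remaining groups are shuffled with the seed. The function name `scaffold_split_indices` and the 0.8/0.1/0.1 sizes are assumptions, not taken from the notebook.

```python
import random

def scaffold_split_indices(scaffold_to_indices, sizes=(0.8, 0.1, 0.1),
                           balanced=True, seed=0):
    """Assign whole scaffold groups to train/val/test (illustrative sketch)."""
    n_total = sum(len(g) for g in scaffold_to_indices.values())
    train_size, val_size, test_size = (frac * n_total for frac in sizes)
    groups = [sorted(g) for g in scaffold_to_indices.values()]

    if balanced:
        # Groups too large for val/test go first; the rest are shuffled
        def is_big(g):
            return len(g) > val_size / 2 or len(g) > test_size / 2
        big = [g for g in groups if is_big(g)]
        small = [g for g in groups if not is_big(g)]
        rng = random.Random(seed)
        rng.shuffle(big)
        rng.shuffle(small)
        groups = big + small
    else:
        # Deterministic variant: largest scaffold groups first
        groups.sort(key=len, reverse=True)

    train, val, test = [], [], []
    for g in groups:
        # Fill train first, then val; everything else lands in test
        if len(train) + len(g) <= train_size:
            train += g
        elif len(val) + len(g) <= val_size:
            val += g
        else:
            test += g
    return train, val, test

# Toy mapping from scaffold label to row indices (assumed input format)
toy_groups = {'A': {0, 1, 2, 3, 4, 5}, 'B': {6, 7}, 'C': {8}, 'D': {9}}
train_idx, val_idx, test_idx = scaffold_split_indices(toy_groups, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because groups are assigned as a unit, the resulting set sizes only approximate the requested fractions.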
Example 4: Random Split
splits, splits_index = random_split(lipo_data.smiles, seed=0)
train, val, test = splits
print(len(train), len(val), len(test))
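For comparison, a random split ignores scaffolds entirely. The following is a minimal sketch, assuming a seeded shuffle of row indices followed by slicing. The name `random_split_indices` and the 0.8/0.1/0.1 sizes are assumptions and may differ from what `random_split` does internally.

```python
import random

def random_split_indices(n_items, sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle row indices with a fixed seed and slice them into three sets (sketch)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # round() guards against floating-point error in the size products
    train_end = round(sizes[0] * n_items)
    val_end = train_end + round(sizes[1] * n_items)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = random_split_indices(100, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.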
Example 5: Add multiple splits to a dataset
lipo_splits = generate_folds(lipo_data, 'random', num_folds=10)
lipo_splits.head(1)
We can also add scaffold splits to the same dataset.
lipo_splits = generate_folds(lipo_splits, 'scaffold', num_folds=10)
lipo_splits.head(1)
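The column layout produced by `generate_folds` is not shown above. One plausible layout, sketched here purely as an assumption, is one label column per fold, where each row is marked `train`, `val`, or `test` and the fold number doubles as the random seed. The function `make_fold_labels` and the column names `random_fold_<k>` are hypothetical.

```python
import random

def make_fold_labels(n_items, num_folds=10, sizes=(0.8, 0.1, 0.1)):
    """Return {column_name: [label per row]} for repeated random splits (assumed layout)."""
    columns = {}
    train_end = round(sizes[0] * n_items)
    val_end = train_end + round(sizes[1] * n_items)
    for fold in range(num_folds):
        indices = list(range(n_items))
        random.Random(fold).shuffle(indices)  # fold number used as the seed
        labels = [None] * n_items
        for pos, idx in enumerate(indices):
            if pos < train_end:
                labels[idx] = 'train'
            elif pos < val_end:
                labels[idx] = 'val'
            else:
                labels[idx] = 'test'
        columns[f'random_fold_{fold}'] = labels
    return columns

fold_columns = make_fold_labels(20, num_folds=10)
print(list(fold_columns)[:3])
```

A dictionary of this shape could be attached to a DataFrame column by column, so several split types can sit side by side in one table.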