Diagnostic metrics for datasets.
Fragment Similarity
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
smi = 'O=C(O)c1ccc(C[S](=O)=O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
frgs = fragments_generator(smi)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in frgs])
smiles_list = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
'O=C(O)c1ccccc1',
'N[C@H](CCC=O)C(=O)O',
'N[C@@H](CCC=O)C(=O)O'
]
collect_fragments(smiles_list)
smi_list2 = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
'O=C(O)c1ccccc1',
'N[C@H](CCC=O)C(=O)O',
'N[C@@H](CCC=O)C(=O)O',
'O=c1cccc[nH]1',
'Oc1ccccn1',
'CSc1c(C(=O)NC2C3CC4CC(C3)CC2C4)cnn1-c1ccc(C(=O)O)cc1'
]
Functional Groups Similarity
smi = 'CSc1c(C(=O)NC2C3CC4CC(C3)CC2C4)cnn1-c1ccc(C(=O)O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
funcgps = funcgps_generator(smi)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in funcgps])
smiles_list = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
'O=C(O)c1ccccc1',
'N[C@H](CCC=O)C(=O)O',
'N[C@@H](CCC=O)C(=O)O'
]
collect_funcgps(smiles_list)
Scaffold Similarity
smi = 'O=C(O)c1ccc(C[S](=O)=O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
scaffold = generate_scaffold(smi)
Chem.MolFromSmiles(scaffold)
collect_scaffolds(smiles_list)
Nearest Neighbor Similarity (SNN)
SNN is the average Tanimoto similarity between each molecule in the test set and its nearest neighbor in the training set.
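The definition above can be sketched in a few lines of plain Python. This is only an illustration, not the library's implementation: the names `tanimoto` and `snn` are hypothetical, and each fingerprint is assumed to be a set of "on" bit indices, standing in for whatever molecular fingerprint is actually used.

```python
from typing import FrozenSet, Sequence

# Assumption: a fingerprint is modelled as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def snn(test_fps: Sequence[Fingerprint], train_fps: Sequence[Fingerprint]) -> float:
    """Mean, over test molecules, of the similarity to the closest training molecule."""
    if not test_fps or not train_fps:
        raise ValueError("both fingerprint lists must be non-empty")
    nearest = [max(tanimoto(t, r) for r in train_fps) for t in test_fps]
    return sum(nearest) / len(nearest)


# Toy fingerprints (made-up bit indices, not derived from real molecules).
train = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7})]
test = [frozenset({1, 2, 3}), frozenset({5, 6, 7, 8, 9})]

# Nearest-neighbor similarities are 3/4 and 3/5, so the mean is about 0.675.
snn_score = snn(test, train)
print(round(snn_score, 3))
```

A value close to 1 means that every test molecule has a very similar counterpart in the training set; a value close to 0 means the two sets share little structure.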
Internal Diversity
Internal diversity assesses the chemical diversity within a set of molecules. A higher value corresponds to higher diversity.
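The text does not give a formula, so the sketch below assumes one common definition: one minus the average Tanimoto similarity over all distinct pairs of molecules in the set. The function names are hypothetical, and fingerprints are again modelled as sets of "on" bit indices rather than real molecular fingerprints.

```python
from itertools import combinations
from typing import FrozenSet, Sequence

# Assumption: a fingerprint is modelled as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def internal_diversity(fps: Sequence[Fingerprint]) -> float:
    """Assumed definition: 1 - mean pairwise Tanimoto similarity over distinct pairs."""
    if len(fps) < 2:
        raise ValueError("need at least two fingerprints")
    sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
    return 1.0 - sum(sims) / len(sims)


# Toy sets: one with heavily overlapping bits, one with no overlap at all.
similar_set = [frozenset({1, 2, 3}), frozenset({1, 2, 3, 4}), frozenset({1, 2, 4})]
diverse_set = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]

div_similar = internal_diversity(similar_set)
div_diverse = internal_diversity(diverse_set)
print(div_similar < div_diverse)
```

Under this assumed definition the score lies between 0 (all molecules share the same fingerprint) and 1 (no two molecules share any bit), which matches the statement that higher values mean higher diversity.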