Diagnostic metrics for datasets.

mapper[source]

mapper(n_jobs)

Helper for parallel computation. Original implementation: https://github.com/molecularsets/moses/blob/master/moses/utils.py

Returns a function to use for map calls. If n_jobs == 1, the standard map is used. If n_jobs > 1, a multiprocessing pool is used. If n_jobs is a pool object, its map function is returned.
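The dispatch described above can be sketched as follows. This is an illustrative reading of the description, not the library's code: the name mapper_sketch and the choice to create and close a pool on every call are assumptions.

```python
from multiprocessing import Pool


def mapper_sketch(n_jobs):
    """Return a map-like callable chosen from n_jobs (hypothetical helper)."""
    if n_jobs == 1:
        # Serial case: wrap the built-in map so that a list is returned.
        def _serial_map(func, *iterables):
            return list(map(func, *iterables))
        return _serial_map

    if isinstance(n_jobs, int):
        # Parallel case (assumption): open a pool per call, close it afterwards.
        def _pool_map(func, iterable):
            with Pool(n_jobs) as pool:
                return pool.map(func, iterable)
        return _pool_map

    # Otherwise n_jobs is assumed to be an existing pool-like object.
    return n_jobs.map
```

With n_jobs=1, mapper_sketch(1)(len, ['CCO', 'c1ccccc1']) returns [3, 8]. In the parallel case, the mapped function has to be picklable, so it should be defined at module level.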

cos_similarity[source]

cos_similarity(train_counts, test_counts)

Computes cosine similarity between two (e.g., train and test) dictionaries of the form {smiles: count}. Elements absent from a dictionary are treated as zero: sim = <r, g> / ||r|| / ||g||, where r and g are the count vectors built from train_counts and test_counts.
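As a worked illustration of the formula, here is a minimal pure-Python sketch. The name cos_similarity_sketch is hypothetical, and returning NaN when either vector has zero norm is an assumption; the source does not say how that case is handled.

```python
import math
from collections import Counter


def cos_similarity_sketch(train_counts, test_counts):
    """Cosine similarity between two {key: count} dictionaries."""
    # Keys missing from one dictionary contribute a zero count.
    keys = set(train_counts) | set(test_counts)
    dot = sum(train_counts.get(k, 0) * test_counts.get(k, 0) for k in keys)
    norm_r = math.sqrt(sum(v * v for v in train_counts.values()))
    norm_g = math.sqrt(sum(v * v for v in test_counts.values()))
    if norm_r == 0 or norm_g == 0:
        # Assumption: similarity is undefined for an all-zero vector.
        return float("nan")
    return dot / norm_r / norm_g
```

For example, Counter({'a': 1, 'b': 1}) against Counter({'a': 1}) gives 1 / sqrt(2), and two dictionaries with no shared keys give 0.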

fingerprints_generator[source]

fingerprints_generator(smi)

collect_fingerprints[source]

collect_fingerprints(smi_list, n_jobs=1)

Generates Morgan fingerprints (radius=3, 1024 bits) for a list of SMILES.

average_agg_tanimoto[source]

average_agg_tanimoto(stock_vecs, gen_vecs, batch_size=5000, agg='max', device='cpu', p=1)

For each molecule in gen_vecs, finds the closest molecule in stock_vecs and returns the average Tanimoto score between these pairs.

Parameters:
stock_vecs: numpy array <n_vectors x dim>
gen_vecs: numpy array <n_vectors' x dim>
agg: max or mean
p: power for averaging: (mean x^p)^(1/p)
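The aggregation can be illustrated with a small NumPy sketch for binary fingerprints. It is a simplified reading of the description, not the library's implementation: batching and the device argument are omitted, the name average_agg_tanimoto_sketch is hypothetical, and scoring two empty fingerprints as 0 is an assumption.

```python
import numpy as np


def average_agg_tanimoto_sketch(stock_vecs, gen_vecs, agg="max", p=1):
    """Average aggregated Tanimoto similarity of gen_vecs against stock_vecs."""
    stock = np.asarray(stock_vecs, dtype=float)
    gen = np.asarray(gen_vecs, dtype=float)
    # Pairwise counts of shared on-bits, shape (n_gen, n_stock).
    shared = gen @ stock.T
    # |x or y| = |x| + |y| - |x and y| for binary vectors.
    union = gen.sum(axis=1, keepdims=True) + stock.sum(axis=1) - shared
    # Assumption: similarity is 0 when both fingerprints are empty.
    sim = np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)
    if agg == "max":
        per_gen = (sim ** p).max(axis=1)
    elif agg == "mean":
        per_gen = (sim ** p).mean(axis=1)
    else:
        raise ValueError("agg must be 'max' or 'mean'")
    # Power averaging per generated molecule: (agg x^p)^(1/p).
    per_gen = per_gen ** (1.0 / p)
    return float(per_gen.mean())
```

With stock = [[1, 1, 0, 0], [0, 0, 1, 1]] and gen = [[1, 1, 0, 0], [1, 0, 1, 0]], the per-molecule maxima are 1 and 1/3, so agg="max" returns 2/3.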

Fragment Similarity

fragments_generator[source]

fragments_generator(smi)

Fragments a molecule using BRICS and returns a list of fragment SMILES.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

smi = 'O=C(O)c1ccc(C[S](=O)=O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
frgs = fragments_generator(smi)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in frgs])

collect_fragments[source]

collect_fragments(smi_list, n_jobs=1)

Fragments a list of SMILES using BRICS and returns a dictionary of the form {'fragment smiles': count}.

smiles_list = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
               'O=C(O)c1ccccc1',
               'N[C@H](CCC=O)C(=O)O',
               'N[C@@H](CCC=O)C(=O)O']

collect_fragments(smiles_list)
Counter({'[16*]c1ccc([16*])cc1': 1,
         '[6*]C(=O)O': 2,
         '[8*]C[S](=O)=O': 1,
         '[16*]c1ccccc1': 1,
         'N[C@H](CCC=O)C(=O)O': 1,
         'N[C@@H](CCC=O)C(=O)O': 1})

smi_list2 = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
             'O=C(O)c1ccccc1',
             'N[C@H](CCC=O)C(=O)O',
             'N[C@@H](CCC=O)C(=O)O',
             'O=c1cccc[nH]1',
             'Oc1ccccn1',
             'CSc1c(C(=O)NC2C3CC4CC(C3)CC2C4)cnn1-c1ccc(C(=O)O)cc1']

Functional Groups Similarity

merge[source]

merge(mol, marked, aset)

funcgps_generator[source]

funcgps_generator(smi)

smi = 'CSc1c(C(=O)NC2C3CC4CC(C3)CC2C4)cnn1-c1ccc(C(=O)O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
funcgps = funcgps_generator(smi)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in funcgps])

collect_funcgps[source]

collect_funcgps(smi_list, n_jobs=1)

Finds all functional groups in a list of SMILES and returns a dictionary of the form {'FG smiles': count}.

smiles_list = ['O=C(O)c1ccc(C[S](=O)=O)cc1',
               'O=C(O)c1ccccc1',
               'N[C@H](CCC=O)C(=O)O',
               'N[C@@H](CCC=O)C(=O)O']

collect_funcgps(smiles_list)
Counter({'[*][C](=[O])[O][H]': 4,
         '[*][S](=[O])=[O]': 1,
         '[*][N]([H])[H]': 2,
         '[*][C]([H])=[O]': 2})

Scaffold Similarity

collect_scaffolds[source]

collect_scaffolds(smi_list, n_jobs=1)

Finds all scaffolds in a list of SMILES and returns a dictionary of the form {'scaffold': count}. Linear molecules have no scaffold and are represented as ''.

smi = 'O=C(O)c1ccc(C[S](=O)=O)cc1'
mol = Chem.MolFromSmiles(smi)
mol
scaffold = generate_scaffold(smi)
Chem.MolFromSmiles(scaffold)
collect_scaffolds(smiles_list)
Counter({'c1ccccc1': 2, '': 2})

Nearest Neighbor Similarity (SNN)

SNN is the average Tanimoto similarity between a molecule from the test set and its nearest neighbor molecule in the training set.

SNN[source]

SNN(train_smiles, test_smiles, n_jobs=1, device='cpu', fp_type='morgan', p=1)

Computes average max similarities of test SMILES to train SMILES

Internal Diversity

Internal diversity assesses the chemical diversity within a set of molecules. A higher value corresponds to a higher diversity.

internal_diversity[source]

internal_diversity(smi_list, n_jobs=1, device='cpu', p=1)

Computes internal diversity as: 1/|A|^2 sum_{x, y in AxA} (1-tanimoto(x, y))
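For the default p=1, the formula can be sketched directly in NumPy on binary fingerprint vectors. This is an illustration only: the name internal_diversity_sketch is hypothetical, it takes fingerprint arrays rather than SMILES, and scoring two empty fingerprints as similarity 0 is an assumption.

```python
import numpy as np


def internal_diversity_sketch(fps):
    """1/|A|^2 * sum over all ordered pairs (x, y) of (1 - tanimoto(x, y))."""
    fps = np.asarray(fps, dtype=float)
    # Pairwise counts of shared on-bits, shape (n, n).
    shared = fps @ fps.T
    counts = fps.sum(axis=1)
    # |x or y| = |x| + |y| - |x and y| for binary vectors.
    union = counts[:, None] + counts[None, :] - shared
    # Assumption: similarity is 0 when both fingerprints are empty.
    sim = np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)
    # The mean runs over all n^2 ordered pairs, self-pairs included.
    return float((1.0 - sim).mean())
```

Because the sum includes the pairs with x == y, which contribute zero, a set of |A| fully dissimilar molecules scores 1 - 1/|A| rather than 1. Two fingerprints with no shared bits therefore give 0.5.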