Train a SmilesPE learner



Requires the RDKit library.

Generate a new SMILES string for the same molecule.

Perform a randomization of a SMILES string. The input must be RDKit sanitizable.


corpus_augment(infile, outdir, cycles)

infile: line-separated SMILES file

outdir: directory in which to save the augmented SMILES files. Each round of augmentation is saved as a separate file, named infile_Ri.

cycles: number of rounds of SMILES augmentation


get_vocabulary(smiles, augmentation=0, exclusive_tokens=False)

Read text and return a dictionary that encodes the vocabulary.
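
As an illustration of what such a vocabulary dictionary might look like, here is a minimal sketch. It assumes each key is a space-separated string of atom-level tokens and each value is a count. The name toy_vocabulary and the simplified regular expression are hypothetical and are not the library's own tokenizer; augmentation and exclusive_tokens are not modelled.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not SmilesPE's tokenizer):
# bracket atoms first, then two-letter halogens, then any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def toy_vocabulary(smiles_list):
    """Map each space-separated atom-level token string to its count."""
    vocab = Counter()
    for smi in smiles_list:
        tokens = ATOM_PATTERN.findall(smi)
        vocab[" ".join(tokens)] += 1
    return vocab

vocab = toy_vocabulary(["CCO", "CCO", "c1ccccc1Cl", "C[NH3+]"])
```

Here the repeated SMILES "CCO" collapses into a single entry with a count of 2, which is what lets later steps weight pair frequencies by how often each string occurs.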


update_pair_statistics(pair, changed, stats, indices)

Minimally update the indices and frequency of symbol pairs. If we merge a pair of symbols, only pairs that overlap with occurrences of this pair are affected and need to be updated.
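
The claim that only overlapping pairs are affected can be checked on a small example. The sketch below does not reproduce update_pair_statistics. It simply recounts pairs before and after one merge (the helper pair_counts is hypothetical) and keeps the entries whose counts differ.

```python
from collections import Counter

def pair_counts(tokens):
    """Count adjacent symbol pairs in one token sequence."""
    return Counter(zip(tokens, tokens[1:]))

before = ("c", "1", "C", "C", "O", "N")
after = ("c", "1", "CC", "O", "N")  # the same sequence after merging ("C", "C")

# Difference in pair counts caused by the merge.
delta = pair_counts(after)
delta.subtract(pair_counts(before))
changed = {pair: d for pair, d in delta.items() if d != 0}
```

Only the merged pair and its immediate left and right neighbours show up in changed. The pairs ("c", "1") and ("O", "N") keep their counts, so an incremental update can skip them.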



Count frequency of all symbol pairs, and create index
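
A minimal sketch of this counting step follows. It assumes the vocabulary is a list of (tokens, frequency) entries and that the index records, for each pair, which entries contain it and how many times. The name toy_pair_statistics and this data layout are illustrative assumptions, not the library's exact structures.

```python
from collections import Counter, defaultdict

def toy_pair_statistics(vocab):
    """vocab: list of (tokens, freq), where tokens is a tuple of symbols."""
    stats = Counter()               # pair -> total frequency across the corpus
    indices = defaultdict(Counter)  # pair -> {entry index: occurrences in that entry}
    for i, (tokens, freq) in enumerate(vocab):
        for left, right in zip(tokens, tokens[1:]):
            stats[(left, right)] += freq
            indices[(left, right)][i] += 1
    return stats, indices

vocab = [(("C", "C", "O"), 2), (("C", "C", "C", "N"), 1)]
stats, indices = toy_pair_statistics(vocab)
```

The index makes it possible to visit only the entries that contain a given pair when that pair is merged, instead of rescanning the whole vocabulary.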


replace_pair(pair, vocab, indices)

Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'
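
For a single token sequence, the replacement can be sketched as below. The helper merge_pair is hypothetical and works on one tuple of symbols, whereas replace_pair operates on the whole vocabulary via the indices.

```python
def merge_pair(tokens, pair):
    """Return a new token tuple with each adjacent occurrence of pair fused."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])  # ('A', 'B') -> 'AB'
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return tuple(merged)

example = merge_pair(("A", "B", "C", "A", "B"), ("A", "B"))
overlap = merge_pair(("C", "C", "C", "O"), ("C", "C"))
```

Matches are consumed left to right, so in a run such as C C C only the first two symbols are fused in one pass.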


prune_stats(stats, big_stats, threshold)

Prune the statistics dict for efficiency of max(). The frequency of a symbol pair never increases, so pruning is generally safe (until the most frequent pair is less frequent than a pair we previously pruned). big_stats keeps the full statistics for when we need to access pruned items.
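
The idea can be sketched as follows, under the simplifying assumption that stats holds plain non-negative counts. The function toy_prune is hypothetical and omits any bookkeeping the real prune_stats may perform.

```python
def toy_prune(stats, big_stats, threshold):
    """Drop pairs below threshold from stats, keeping their counts in big_stats."""
    for pair, freq in list(stats.items()):  # copy items: we delete while iterating
        if freq < threshold:
            del stats[pair]
            big_stats[pair] = freq

stats = {("C", "C"): 10, ("C", "O"): 3, ("O", "N"): 1}
big_stats = dict(stats)  # full statistics, kept alongside the pruned dict
toy_prune(stats, big_stats, threshold=3)
```

After pruning, max() only has to scan the two remaining entries in stats, while the count for ("O", "N") is still recoverable from big_stats.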


learn_SPE(infile, outfile, num_symbols, min_frequency=2, augmentation=0, verbose=False, total_symbols=False)

Learn num_symbols SPE operations from infile and write to outfile.

infile: a list of SMILES

num_symbols: maximum total number of SPE symbols

min_frequency: the minimum frequency with which an SPE symbol must appear

augmentation: number of rounds of SMILES augmentation

verbose: if True, print the merging process

total_symbols: if True, the maximum total of SPE symbols = num_symbols - number of atom-level tokens
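
To show how num_symbols and min_frequency interact, here is a deliberately naive sketch of a pair-merging loop. It recounts all pairs on every iteration instead of updating them incrementally, takes pre-tokenized input, writes no file, and ignores augmentation and total_symbols. The names toy_learn and merge_pair and the tie-breaking rule are assumptions for illustration, not the behaviour of learn_SPE.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Fuse each adjacent occurrence of pair in a token tuple."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return tuple(merged)

def toy_learn(corpus, num_symbols, min_frequency=2):
    """corpus: list of token tuples. Returns the ordered list of merged pairs."""
    words = Counter(corpus)  # token tuple -> frequency
    merges = []
    for _ in range(num_symbols):  # at most num_symbols merges
        stats = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                stats[pair] += freq
        if not stats:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best, best_freq = max(stats.items(), key=lambda kv: (kv[1], kv[0]))
        if best_freq < min_frequency:
            break  # stop once no remaining pair is frequent enough
        merges.append(best)
        new_words = Counter()
        for tokens, freq in words.items():
            new_words[merge_pair(tokens, best)] += freq
        words = new_words
    return merges

corpus = [("C", "C", "O"), ("C", "C", "O"), ("C", "C", "N")]
merges = toy_learn(corpus, num_symbols=10)
```

With this corpus the loop stops after two merges even though num_symbols is 10, because the only remaining pair occurs once, which is below min_frequency.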