randomize_smiles[source]
randomize_smiles(smiles)
Requires the RDKit library.
Generate a new SMILES string for the same molecule.
Performs a randomization of the SMILES string, which must be RDKit sanitizable.
corpus_augment[source]
corpus_augment(infile,outdir,cycles)
infile: line separated SMILES file
outdir: directory in which to save the augmented SMILES files.
Each round of augmentation is saved as a separate file, named infile_Ri.
cycles: number of rounds for SMILES augmentation
get_vocabulary[source]
get_vocabulary(smiles,augmentation=0,exclusive_tokens=False)
Read text and return dictionary that encodes vocabulary
update_pair_statistics[source]
update_pair_statistics(pair,changed,stats,indices)
Minimally update the indices and frequencies of symbol pairs. When a pair of symbols is merged, only the pairs that overlap with occurrences of that pair are affected and need to be updated.
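The reason a minimal update is enough can be seen on a single token sequence. The sketch below is an illustration only, not the library's implementation: it recounts the pairs before and after a hypothetical merge of ('C', 'O') and prints the difference. The sequences and variable names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical token sequence before and after merging the pair ('C', 'O').
before = ("N", "C", "C", "O", "N", "C")
after = ("N", "C", "CO", "N", "C")

def pair_counts(symbols):
    """Count adjacent symbol pairs in one sequence."""
    return Counter(zip(symbols, symbols[1:]))

old, new = pair_counts(before), pair_counts(after)

# Keep only the pairs whose count actually changed.
delta = {
    pair: new[pair] - old[pair]
    for pair in set(old) | set(new)
    if new[pair] != old[pair]
}
print(delta)
```

Only the merged pair and its immediate neighbours appear in `delta`; the pair ('N', 'C'), which does not overlap the merge site, keeps its count of 2.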
get_pair_statistics[source]
get_pair_statistics(vocab)
Count frequency of all symbol pairs, and create index
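As a rough sketch of what counting pairs and building an index can look like, the example below assumes the vocabulary is a list of (token tuple, frequency) entries. That layout, and the name count_pairs, are assumptions for illustration and may differ from the structure get_pair_statistics actually expects.

```python
from collections import defaultdict

def count_pairs(vocab):
    """vocab: list of (tuple_of_symbols, frequency) entries (assumed layout)."""
    stats = defaultdict(int)                         # pair -> total frequency
    indices = defaultdict(lambda: defaultdict(int))  # pair -> {entry index: occurrences}
    for i, (symbols, freq) in enumerate(vocab):
        for left, right in zip(symbols, symbols[1:]):
            stats[(left, right)] += freq
            indices[(left, right)][i] += 1
    return stats, indices

vocab = [(("C", "C", "O"), 3), (("C", "C", "N"), 2)]
stats, indices = count_pairs(vocab)
print(dict(stats))
```

The index records which vocabulary entries contain each pair, so a later merge only has to visit those entries.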
replace_pair[source]
replace_pair(pair,vocab,indices)
Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'
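The merge itself can be sketched on a single token sequence. The helper merge_pair below is hypothetical and works on a plain tuple of symbols; the library function operates on the whole vocabulary through the index instead.

```python
def merge_pair(pair, symbols):
    """Return a new tuple in which every adjacent occurrence of `pair` is fused."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

result = merge_pair(("C", "C"), ("C", "C", "C", "O"))
print(result)
```

Scanning left to right means overlapping candidates are resolved greedily: in ('C', 'C', 'C', 'O') the first two symbols are merged and the third stays on its own.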
prune_stats[source]
prune_stats(stats,big_stats,threshold)
Prune the statistics dict for efficiency of max(). The frequency of a symbol pair never increases, so pruning is generally safe (until the most frequent pair is less frequent than a pair we previously pruned). big_stats keeps the full statistics for when we need to access pruned items.
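A simplified sketch of the pruning idea follows. The name prune and the plain dicts are assumptions for illustration, and the sketch ignores any bookkeeping the real function performs beyond moving low-frequency entries aside.

```python
def prune(stats, big_stats, threshold):
    """Move pairs with frequency below `threshold` from stats into big_stats."""
    for pair, freq in list(stats.items()):  # copy, since stats is mutated in the loop
        if freq < threshold:
            del stats[pair]
            big_stats[pair] = freq

stats = {("C", "C"): 10, ("C", "O"): 2, ("C", "N"): 1}
big_stats = {}
prune(stats, big_stats, threshold=3)
print(stats, big_stats)
```

After pruning, max() only scans the small stats dict, while big_stats still holds the counts of the removed pairs in case they are needed again.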
learn_SPE[source]
learn_SPE(infile,outfile,num_symbols,min_frequency=2,augmentation=0,verbose=False,total_symbols=False)
Learn num_symbols SPE operations from infile and write to outfile.
infile: a list of SMILES
num_symbols: maximum total number of SPE symbols
min_frequency: the minimum frequency at which an SPE symbol must appear.
augmentation: number of rounds of SMILES augmentation
verbose: if True, print the merging process
total_symbols: if True, the maximum total number of SPE symbols = num_symbols - number of atom-level tokens
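To illustrate how num_symbols and min_frequency interact, here is a toy learning loop. It is not the library's implementation: learn_merges is a hypothetical name, it skips augmentation and pruning, and it splits each string into single characters, which is an assumption that would mis-tokenize multi-character atoms such as Cl or Br.

```python
from collections import Counter

def learn_merges(sequences, num_merges, min_frequency=2):
    """Greedily learn up to num_merges pair merges from a list of strings."""
    corpus = [list(seq) for seq in sequences]  # character-level split (assumption)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(best)
        for idx, symbols in enumerate(corpus):
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            corpus[idx] = out
    return merges

merges = learn_merges(["CCO", "CCN", "CCO"], num_merges=3)
print(merges)
```

Although three merges are requested, only two are learned here: after ('C', 'C') and ('CC', 'O') are merged, the only remaining pair occurs once, which is below min_frequency.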