randomize_smiles
randomize_smiles(smiles)
Requires the RDKit library.
Generate a new, randomized SMILES string for the same molecule.
The input SMILES string must be RDKit-sanitizable.
corpus_augment
corpus_augment(infile, outdir, cycles)
infile: line-separated SMILES file
outdir: directory in which to save the augmented SMILES files. Each round of augmentation is saved as a separate file, named infile_Ri.
cycles: number of rounds of SMILES augmentation
get_vocabulary
get_vocabulary(smiles, augmentation=0, exclusive_tokens=False)
Read text and return a dictionary that encodes the vocabulary.
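As a rough illustration of what such a vocabulary dictionary could look like, the sketch below tokenizes each SMILES string at the atom level and counts identical token sequences. The regular expression is a commonly used atom-level SMILES pattern and the name `toy_vocabulary` is hypothetical; the tokenizer and the exact dictionary layout used by get_vocabulary are assumptions here and may differ.

```python
import re
from collections import Counter

# Assumption: a widely used atom-level SMILES pattern, not necessarily the
# tokenizer that get_vocabulary itself relies on.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def toy_vocabulary(smiles_list):
    """Map each space-joined token sequence to its number of occurrences."""
    vocab = Counter()
    for smi in smiles_list:
        tokens = SMILES_TOKEN.findall(smi.strip())
        vocab[" ".join(tokens)] += 1
    return vocab

vocab = toy_vocabulary(["CCO", "CCO", "c1ccccc1Cl"])
print(vocab)
```

Joining tokens with spaces keeps multi-character tokens such as `Cl` distinguishable from the two single tokens `C` and `l`.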
update_pair_statistics
update_pair_statistics(pair, changed, stats, indices)
Minimally update the indices and frequencies of symbol pairs. When a pair of symbols is merged, only pairs that overlap with occurrences of this pair are affected and need to be updated.
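The locality claim can be checked on a small example. The sketch below is not the library's update routine: it recounts pairs before and after merging ('C', 'C') in one token sequence and lists which pair counts differ. The helper name `pair_counts` is hypothetical.

```python
from collections import Counter

def pair_counts(symbols):
    """Count adjacent symbol pairs in one token sequence."""
    return Counter(zip(symbols, symbols[1:]))

# The same sequence before and after merging ('C', 'C') into 'CC'.
before = pair_counts(("N", "C", "C", "O", "C", "=", "O"))
after = pair_counts(("N", "CC", "O", "C", "=", "O"))

# Pairs whose count differs between the two states.
changed = {p for p in before.keys() | after.keys() if before[p] != after[p]}
print(sorted(changed))
```

Only the merged pair and the pairs touching its left and right neighbors change; the pairs ('O', 'C'), ('C', '=') and ('=', 'O') further along the sequence keep their counts, which is why an incremental update is sufficient.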
get_pair_statistics
get_pair_statistics(vocab)
Count the frequency of all symbol pairs and create an index.
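A minimal sketch of this kind of counting, assuming `vocab` is a list of (symbols, frequency) tuples; the actual input layout and return values of get_pair_statistics are not specified here, and `toy_pair_statistics` is a hypothetical name.

```python
from collections import defaultdict

def toy_pair_statistics(vocab):
    """Count adjacent symbol pairs and record which entries contain them.

    Assumption: `vocab` is a list of (symbols, frequency) tuples.
    """
    stats = defaultdict(int)                          # pair -> weighted frequency
    indices = defaultdict(lambda: defaultdict(int))   # pair -> {entry index: count}
    for i, (symbols, freq) in enumerate(vocab):
        for pair in zip(symbols, symbols[1:]):
            stats[pair] += freq
            indices[pair][i] += 1
    return stats, indices

vocab = [(("C", "C", "O"), 2), (("C", "C", "N"), 1)]
stats, indices = toy_pair_statistics(vocab)
print(dict(stats))
```

The index makes it possible to visit only the vocabulary entries that contain a given pair when that pair is later merged.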
replace_pair
replace_pair(pair, vocab, indices)
Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'
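The merge itself can be sketched for a single token sequence as follows. This illustrates the operation only; it does not reproduce replace_pair, which works on the whole vocabulary through `indices`. The name `merge_pair` is hypothetical.

```python
def merge_pair(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused."""
    first, second = pair
    merged = []
    i = 0
    while i < len(symbols):
        # Fuse the two symbols when they match the pair, scanning left to right.
        if i + 1 < len(symbols) and symbols[i] == first and symbols[i + 1] == second:
            merged.append(first + second)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

result = merge_pair(("C", "C", "O", "C", "C"), ("C", "C"))
print(result)
```

Scanning left to right means overlapping candidates are resolved greedily: in ('C', 'C', 'C') only the first two symbols are fused.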
prune_stats
prune_stats(stats, big_stats, threshold)
Prune the statistics dict for efficiency of max().
The frequency of a symbol pair never increases, so pruning is generally safe (until the most frequent pair is less frequent than a pair we previously pruned).
big_stats keeps full statistics for when we need to access pruned items.
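A simplified sketch of the pruning idea: entries below the threshold are removed from the working dict so that max() scans fewer items, while their last known frequency is kept in a second dict. The name `toy_prune` is hypothetical, and the bookkeeping in prune_stats may be more involved than shown.

```python
def toy_prune(stats, big_stats, threshold):
    """Move pairs rarer than `threshold` out of `stats` into `big_stats`."""
    for pair, freq in list(stats.items()):   # copy, since we delete while iterating
        if freq < threshold:
            del stats[pair]
            big_stats[pair] = freq

stats = {("C", "C"): 50, ("C", "O"): 3, ("C", "N"): 1}
big_stats = dict(stats)   # full copy that keeps every pair
toy_prune(stats, big_stats, threshold=5)
print(stats)
```

If the best remaining pair later falls below the threshold, the pruned entries can be restored from `big_stats` before choosing the next merge.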
learn_SPE
learn_SPE(infile, outfile, num_symbols, min_frequency=2, augmentation=0, verbose=False, total_symbols=False)
Learn num_symbols SPE operations from infile and write to outfile.
infile: a list of SMILES
num_symbols: maximum total number of SPE symbols
min_frequency: the minimum frequency with which an SPE symbol must appear
augmentation: number of rounds of SMILES augmentation
verbose: if True, print the merging process
total_symbols: if True, the maximum total number of SPE symbols is num_symbols minus the number of atom-level tokens
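To show how num_symbols and min_frequency interact, here is a toy version of a greedy pair-merging loop. It is a sketch under stated assumptions and not the learn_SPE implementation: it takes an in-memory dict of token tuples instead of files, recomputes pair statistics from scratch on every round rather than updating them incrementally, and ignores augmentation and total_symbols. The name `toy_learn_merges` is hypothetical.

```python
from collections import Counter

def toy_learn_merges(vocab, num_symbols, min_frequency=2):
    """Greedy pair merging. `vocab` maps token tuples to counts (assumed layout)."""
    vocab = dict(vocab)
    merges = []
    for _ in range(num_symbols):
        # Recount all adjacent pairs, weighted by how often each entry occurs.
        stats = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                stats[pair] += freq
        if not stats:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best, freq = max(stats.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        # Apply the merge to every entry of the vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges

merges = toy_learn_merges(
    {("C", "C", "O"): 5, ("C", "C", "N"): 3, ("C", "O"): 1},
    num_symbols=10,
    min_frequency=2,
)
print(merges)
```

In this run the loop stops after three merges, well short of num_symbols, because the only remaining pair ('C', 'O') occurs once and falls below min_frequency.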