Tokenize SMILES (Simplified Molecular-Input Line-Entry System) into substructure units.
SPE_Tokenizer(codes, merges=-1, glossaries=None, exclusive_tokens=None)
Tokenize SMILES based on the learned SPE tokens.
codes: the output file of learn_SPE()
merges: the number of learned SPE tokens to use. -1 means use all of them; e.g., 1000 means use the 1000 most frequent tokens.
glossaries: argument passed through to isolate_glossary()
exclusive_tokens: argument passed through to atomwise_tokenizer()
dropout: see BPE-Dropout: Simple and Effective Subword Regularization.
If dropout is set to 0, the segmentation is equivalent to standard BPE; if dropout is set to 1, the segmentation splits words into individual characters.
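A minimal usage sketch is shown below. The vocabulary file name ('SPE_ChEMBL.txt') and the exact output tokens are illustrative assumptions based on the pre-trained ChEMBL vocabulary distributed with the project.

```python
import codecs
from SmilesPE.tokenizer import SPE_Tokenizer

# Load a learned SPE vocabulary (file name/path is an assumption).
spe_vocab = codecs.open('SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vocab)

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
print(spe.tokenize(smi))
# Returns a space-separated string of SPE tokens, e.g.
# 'CC [N+](C) (C)C c1ccccc1 Br'
```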
encode(orig, bpe_codes, bpe_codes_reverse, cache, exclusive_tokens=None, glossaries_regex=None, dropout=0)
Encode a word based on a list of SPE merge operations, which are applied consecutively.
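encode() is the low-level routine that SPE_Tokenizer.tokenize() relies on. The rough sketch below shows how its dictionary arguments can be built; it assumes the subword-nmt convention this code was adapted from (merge pair mapped to priority, merged token mapped back to its pair), so treat the construction as illustrative rather than the library's exact internals.

```python
import codecs
from SmilesPE.tokenizer import encode

# Read the learned merge operations (file path is an assumption);
# each non-comment line is assumed to hold one space-separated token pair.
with codecs.open('SPE_ChEMBL.txt') as f:
    merges = [tuple(line.split()) for line in f.read().splitlines()
              if line and not line.startswith('#')]

# bpe_codes: merge pair -> priority (lower rank is applied first).
# bpe_codes_reverse: merged token -> the pair it was built from.
bpe_codes = {pair: rank for rank, pair in enumerate(merges)}
bpe_codes_reverse = {a + b: (a, b) for (a, b) in merges}

cache = {}  # assumed to memoize already-encoded inputs
tokens = encode('CC[N+](C)(C)Cc1ccccc1Br', bpe_codes, bpe_codes_reverse, cache)
print(tokens)  # sequence of SPE tokens for the input SMILES
```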
isolate_glossary(word, glossary)
Isolate a glossary present inside a word.
Returns a list of subwords in which all occurrences of 'glossary' are isolated.
For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is:
['1934', 'USA', 'B', 'USA']
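The documented behavior can be reproduced directly; the import path is an assumption (the same tokenizer module as the functions above).

```python
from SmilesPE.tokenizer import isolate_glossary

print(isolate_glossary('1934USABUSA', 'USA'))
# ['1934', 'USA', 'B', 'USA']
```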