Tokenize SMILES (Simplified Molecular-Input Line-Entry System) into substructure units.

class SPE_Tokenizer[source]

SPE_Tokenizer(codes, merges=-1, glossaries=None, exclusive_tokens=None)

Tokenize SMILES based on the learned SPE tokens.

codes: output file of learn_SPE()

merges: number of learned SPE tokens to use. -1 uses all of them; 1000 uses the 1000 most frequent.

exclusive_tokens: argument passed to atomwise_tokenizer()

glossaries: argument passed to isolate_glossary()

dropout: probability of skipping a merge; see "BPE-Dropout: Simple and Effective Subword Regularization". If dropout is set to 0, the segmentation is equivalent to standard BPE; if dropout is set to 1, no merges are applied and the segmentation splits words into distinct characters.
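The effect of the merges cutoff can be pictured as truncating a frequency-ranked list of merge operations. The sketch below is illustrative only: `select_merges` and the in-memory list of pairs are hypothetical names, not part of the library, and it assumes the learned codes are ordered from most to least frequent.

```python
def select_merges(ranked_pairs, merges=-1):
    """Map each kept pair to its rank; lower rank = more frequent.

    Hypothetical helper: `ranked_pairs` is assumed to be ordered from
    most to least frequent. merges=-1 keeps every pair.
    """
    kept = ranked_pairs if merges == -1 else ranked_pairs[:merges]
    return {pair: rank for rank, pair in enumerate(kept)}


ranked = [("C", "C"), ("C", "O"), ("c", "c")]
all_ranks = select_merges(ranked)      # keeps all three pairs
top_two = select_merges(ranked, 2)     # keeps only the two most frequent
```

A smaller cutoff yields a smaller vocabulary and therefore shorter, more atom-like tokens.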

encode[source]

encode(orig, bpe_codes, bpe_codes_reverse, cache, exclusive_tokens=None, glossaries_regex=None, dropout=0)

Encode a word using the list of SPE merge operations, which are applied consecutively.
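To make "applied consecutively" concrete, here is a minimal, simplified sketch of rank-ordered merging with optional dropout. It is not the library's encode(): `apply_merges` and `ranks` are hypothetical names, the input is assumed to be already split into atom-level tokens, and caching, glossaries and exclusive tokens are left out.

```python
import random


def apply_merges(tokens, ranks, dropout=0.0, rng=random):
    """Repeatedly merge the adjacent pair with the lowest rank.

    Hypothetical sketch: `ranks` maps a (left, right) token pair to its
    merge rank. Each candidate merge is skipped with probability `dropout`.
    """
    tokens = list(tokens)
    while len(tokens) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks and not (dropout and rng.random() < dropout)
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        _, i = min(candidates)  # best rank first, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


ranks = {("C", "C"): 0, ("CC", "O"): 1}
merged = apply_merges(["C", "C", "O"], ranks)                 # dropout=0: standard merging
unmerged = apply_merges(["C", "C", "O"], ranks, dropout=1.0)  # every merge is skipped
```

With dropout strictly between 0 and 1 the output varies from call to call, which is the regularization effect the BPE-Dropout paper describes.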

isolate_glossary[source]

isolate_glossary(word, glossary)

Isolate a glossary present inside a word.

Returns a list of subwords in which every occurrence of the glossary is isolated as its own element.

For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is: ['1934', 'USA', 'B', 'USA']
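The behaviour in this example can be reproduced with a short sketch. It is an illustration rather than the library's code: `isolate_glossary_sketch` is a hypothetical name, and it assumes the glossary should be matched as a literal string (hence `re.escape`).

```python
import re


def isolate_glossary_sketch(word, glossary):
    """Split `word` so that each occurrence of `glossary` stands alone.

    Hypothetical sketch: treats `glossary` as a literal string.
    """
    if not glossary or word == glossary or glossary not in word:
        return [word]
    # The capturing group keeps the glossary itself in the split result.
    parts = re.split("({})".format(re.escape(glossary)), word)
    # Drop the empty strings produced at the edges or between adjacent matches.
    return [part for part in parts if part]


pieces = isolate_glossary_sketch("1934USABUSA", "USA")
```

Here `pieces` matches the list given in the example above; a word that does not contain the glossary is returned unchanged as a one-element list.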