Tokenize SMILES (Simplified Molecular-Input Line-Entry System) into substructure units.
SPE_Tokenizer(codes, merges=-1, glossaries=None, exclusive_tokens=None)
Tokenize SMILES based on the learned SPE tokens.
codes: output file of learn_SPE()
merges: number of learned SPE tokens to use. -1 uses all of them; 1000 uses the 1000 most frequent.
exclusive_tokens: argument passed to atomwise_tokenizer()
glossaries: argument passed to isolate_glossary()
dropout: see "BPE-Dropout: Simple and Effective Subword Regularization". If dropout is set to 0, the segmentation is equivalent to standard BPE; if dropout is set to 1, the segmentation splits words into distinct characters.
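The effect of merges can be pictured as truncating a frequency-ranked list of merge operations. The sketch below is only an illustration of that idea, not SmilesPE's own loading code: the helper name select_merges and the toy pair list are hypothetical, and the real codes file format is not assumed here.

```python
def select_merges(ranked_pairs, merges=-1):
    """Keep the first `merges` pairs of a frequency-ranked list; -1 keeps all.

    Hypothetical helper for illustration, not part of SmilesPE.
    """
    if merges == -1:
        return list(ranked_pairs)
    return list(ranked_pairs[:merges])


# Toy merge list, most frequent pair first (made-up data).
ranked_pairs = [("C", "C"), ("C", "O"), ("CC", "O"), ("c", "c")]

all_pairs = select_merges(ranked_pairs)       # merges=-1: keep everything
top_two = select_merges(ranked_pairs, 2)      # keep the two most frequent
print(len(all_pairs), top_two)
```

A smaller merges value yields a smaller vocabulary and therefore shorter, more atom-like tokens.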
encode(orig, bpe_codes, bpe_codes_reverse, cache, exclusive_tokens=None, glossaries_regex=None, dropout=0)
Encode a word based on a list of SPE merge operations, which are applied consecutively.
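To make the idea of consecutive merge operations concrete, here is a minimal sketch of a greedy merge loop with BPE-dropout-style skipping. It is written under stated assumptions and is not the library's encode(): the function toy_encode, its arguments, and the toy merge list are all hypothetical, and it merges one pair occurrence per step for simplicity.

```python
import random


def toy_encode(tokens, ranked_pairs, dropout=0.0, rng=random):
    """Greedily apply ranked merges to a token list.

    Each candidate merge is skipped with probability `dropout`.
    Hypothetical helper for illustration, not SmilesPE's encode().
    """
    rank = {pair: i for i, pair in enumerate(ranked_pairs)}
    tokens = list(tokens)
    while len(tokens) > 1:
        # Adjacent pairs that are known merges and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and not (dropout and rng.random() < dropout)
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Atom-level tokens of CC(=O)O and a made-up ranked merge list.
atoms = ["C", "C", "(", "=", "O", ")", "O"]
ranked_pairs = [("C", "C"), ("=", "O"), ("(", "=O"), ("(=O", ")")]

print(toy_encode(atoms, ranked_pairs))               # ['CC', '(=O)', 'O']
print(toy_encode(atoms, ranked_pairs, dropout=1.0))  # input tokens unchanged
```

With dropout=0 every applicable merge is performed, and with dropout=1 every merge is skipped, so the input tokens come back unmerged; values in between give randomized segmentations of the same SMILES.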
isolate_glossary(word, glossary)
Isolate a glossary present inside a word.
Returns a list of subwords in which every occurrence of the glossary is isolated.
For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is:
['1934', 'USA', 'B', 'USA']
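The behaviour described above can be reproduced with a short sketch. This is an illustrative re-implementation, not the library's code: the name isolate_glossary_sketch is hypothetical, and the glossary is assumed to be a literal string rather than a regular expression.

```python
def isolate_glossary_sketch(word, glossary):
    """Split `word` so that every occurrence of `glossary` is its own item.

    Hypothetical helper for illustration; treats `glossary` as a literal string.
    """
    if word == glossary or glossary not in word:
        return [word]
    pieces = word.split(glossary)
    result = []
    for i, piece in enumerate(pieces):
        if piece:                      # skip empty strings at the edges
            result.append(piece)
        if i < len(pieces) - 1:        # a glossary sat between two pieces
            result.append(glossary)
    return result


print(isolate_glossary_sketch("1934USABUSA", "USA"))  # ['1934', 'USA', 'B', 'USA']
```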