Tokenize SMILES (Simplified Molecular-Input Line-Entry System) strings into units such as atom-level tokens or k-mers.
Tokenize a SMILES string at the atom level.
from SmilesPE.pretokenizer import atomwise_tokenizer  # assumes the SmilesPE package is installed
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)
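To illustrate what atom-level tokenization does, here is a minimal regex-based sketch. The pattern and the name `sketch_atomwise_tokenizer` are assumptions made for illustration; the actual `atomwise_tokenizer` may be implemented differently.

```python
import re

# Assumed pattern (hypothetical, for illustration only): bracket atoms,
# two-letter halogens, organic-subset atoms, aromatic atoms, branches,
# bonds, and ring-closure labels.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def sketch_atomwise_tokenizer(smi):
    """Split a SMILES string into atom-level tokens (illustrative sketch)."""
    tokens = SMILES_PATTERN.findall(smi)
    # If the tokens do not rebuild the input, some character was not covered.
    if "".join(tokens) != smi:
        raise ValueError(f"Could not tokenize every character of {smi!r}")
    return tokens

toks = sketch_atomwise_tokenizer('CC[N+](C)(C)Cc1ccccc1Br')
print(toks)
```

Note that the bracketed atom `[N+]` and the two-letter atom `Br` each come out as a single token rather than being split into characters.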
Tokenize a SMILES string at the atom level, keeping only the bracketed symbols listed in
exclusive_tokens. Bracketed symbols that are not in exclusive_tokens
are replaced with [UNK].
sep_tokens = ['[C@@H]', '[C@@]']
smi = 'CC(C)C[C@@H]1N2C(=O)[C@](NC(=O)[C@H]3CN(C)[C@@H]4Cc5c(Br)[nH]c6cccc(C4=C3)c56)(O[C@@]2(O)[C@@H]7CCCN7C1=O)C(C)C'
toks = atomwise_tokenizer(smi, exclusive_tokens=sep_tokens)
print(toks)
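The replacement rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: `sketch_mask_brackets` and its coarse token pattern are hypothetical and are not the library's own code.

```python
import re

# Coarse assumed pattern: a bracket atom, Br, Cl, a two-digit ring closure,
# or any other single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%[0-9]{2}|.")

def sketch_mask_brackets(smi, exclusive_tokens=None):
    """Tokenize, then replace bracket tokens outside exclusive_tokens with [UNK]."""
    tokens = TOKEN_RE.findall(smi)
    if exclusive_tokens is None:
        return tokens
    keep = set(exclusive_tokens)
    return [
        tok if (not tok.startswith('[') or tok in keep) else '[UNK]'
        for tok in tokens
    ]

# '[C@@H]' is kept because it is listed; '[O-]' is bracketed but unlisted.
masked = sketch_mask_brackets('C[C@@H](N)C(=O)[O-]', exclusive_tokens=['[C@@H]'])
print(masked)
```

Only tokens that start with a bracket are candidates for replacement, so plain atoms, bonds, and ring-closure digits pass through unchanged.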
Tokenize a SMILES string into 4-mers.
from SmilesPE.pretokenizer import kmer_tokenizer  # assumes the SmilesPE package is installed
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
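One plausible reading of k-mer tokenization is a sliding window over atom-level tokens. The sketch below assumes a stride of 1 and takes an already tokenized list as input. The function name `sketch_kmers` and the `stride` parameter are illustrative assumptions, not confirmed details of `kmer_tokenizer`.

```python
def sketch_kmers(tokens, ngram=4, stride=1):
    """Join each window of `ngram` consecutive tokens into one k-mer string."""
    # Stop so that every window is full length; incomplete trailing windows are dropped.
    last_start = len(tokens) - ngram
    return [
        ''.join(tokens[i:i + ngram])
        for i in range(0, last_start + 1, stride)
    ]

# Atom-level tokens for the fragment 'CC[N+](C)'.
atom_tokens = ['C', 'C', '[N+]', '(', 'C', ')']
kmers = sketch_kmers(atom_tokens, ngram=4)
print(kmers)
```

Because the window moves over tokens rather than characters, a bracketed atom such as `[N+]` counts as one unit of the 4-mer. An input shorter than `ngram` tokens yields an empty list in this sketch.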