Tokenize SMILES (Simplified Molecular-Input Line-Entry System) strings into atom-level tokens or k-mers.

atomwise_tokenizer

atomwise_tokenizer(smi, exclusive_tokens=None)

Tokenize a SMILES string at the atom level: (1) 'Br' and 'Cl' are each kept as a single two-character token; (2) any bracketed symbol, such as '[N+]', is treated as a single token.

exclusive_tokens: A list of the bracketed symbols you want to keep, e.g., ['[C@@H]', '[nH]']. Any bracketed symbol not in this list is replaced by '[UNK]'. Default is None, in which case all bracketed symbols are kept.

Tokenize a SMILES string at the atom level:

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)
['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
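
The two rules above (two-character halogens and bracketed symbols as single tokens) can be illustrated with a regular expression. This is a minimal sketch, not the library's implementation: the pattern and the name `sketch_atomwise_tokenize` are assumptions made for illustration, and the pattern only covers common SMILES symbols.

```python
import re

# Assumed pattern for illustration only; it is not guaranteed to match the
# regular expression used by atomwise_tokenizer itself.
# Order matters: bracketed symbols and 'Br'/'Cl' are tried before single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def sketch_atomwise_tokenize(smi):
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    # With a single capturing group, findall returns the matched tokens in order.
    return SMILES_TOKEN_PATTERN.findall(smi)


smi = "CC[N+](C)(C)Cc1ccccc1Br"
tokens = sketch_atomwise_tokenize(smi)
print(tokens)
```

For this input the sketch yields the same token list as the example above. Joining the tokens back together should reproduce the input string, which is a cheap sanity check that no character was silently dropped.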

Tokenize a SMILES string at the atom level, keeping only the bracketed symbols listed in exclusive_tokens. Bracketed symbols that are not in exclusive_tokens are replaced with '[UNK]':

sep_tokens = ['[C@@H]', '[C@@]']
smi = 'CC(C)C[C@@H]1N2C(=O)[C@](NC(=O)[C@H]3CN(C)[C@@H]4Cc5c(Br)[nH]c6cccc(C4=C3)c56)(O[C@@]2(O)[C@@H]7CCCN7C1=O)C(C)C'
toks = atomwise_tokenizer(smi, exclusive_tokens=sep_tokens)
print(toks)
['C', 'C', '(', 'C', ')', 'C', '[C@@H]', '1', 'N', '2', 'C', '(', '=', 'O', ')', '[UNK]', '(', 'N', 'C', '(', '=', 'O', ')', '[UNK]', '3', 'C', 'N', '(', 'C', ')', '[C@@H]', '4', 'C', 'c', '5', 'c', '(', 'Br', ')', '[UNK]', 'c', '6', 'c', 'c', 'c', 'c', '(', 'C', '4', '=', 'C', '3', ')', 'c', '5', '6', ')', '(', 'O', '[C@@]', '2', '(', 'O', ')', '[C@@H]', '7', 'C', 'C', 'C', 'N', '7', 'C', '1', '=', 'O', ')', 'C', '(', 'C', ')', 'C']
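
The replacement step seen in this output, where '[C@]', '[C@H]', and '[nH]' become '[UNK]', can be sketched as a filter over an already tokenized list. The helper name `sketch_replace_unlisted` is hypothetical, and the logic is an assumption inferred from the documented behavior rather than the library's own code.

```python
def sketch_replace_unlisted(tokens, exclusive_tokens=None):
    """Replace bracketed tokens not in exclusive_tokens with '[UNK]'.

    Hypothetical helper: behavior is inferred from the documentation above.
    """
    if not exclusive_tokens:
        # No allow-list given: keep every token unchanged.
        return list(tokens)
    keep = set(exclusive_tokens)
    return [
        "[UNK]" if tok.startswith("[") and tok not in keep else tok
        for tok in tokens
    ]


tokens = ["C", "[C@@H]", "1", "[C@]", "(", "[nH]", ")", "Br"]
filtered = sketch_replace_unlisted(tokens, exclusive_tokens=["[C@@H]", "[C@@]"])
print(filtered)
```

Only tokens that start with '[' are candidates for replacement, so plain atoms, ring-closure digits, bonds, and parentheses pass through untouched.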

kmer_tokenizer

kmer_tokenizer(smiles, ngram=4, stride=1, remove_last=False, exclusive_tokens=None)

tokens_to_mer

tokens_to_mer(toks)

Tokenize a SMILES string into 4-mers, i.e., overlapping windows of four consecutive atom-level tokens:

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']
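
The 4-mer output above is consistent with a sliding window over the atom-level tokens, where each window of `ngram` tokens is joined into one string and the window advances by `stride` tokens. The following is a minimal sketch under that assumption; `sketch_kmers` is a hypothetical name, and the `remove_last` and `exclusive_tokens` parameters of kmer_tokenizer are not modeled here.

```python
def sketch_kmers(tokens, ngram=4, stride=1):
    """Join each full window of `ngram` tokens into one k-mer string.

    Hypothetical helper: only full-length windows are emitted.
    """
    return [
        "".join(tokens[i:i + ngram])
        for i in range(0, len(tokens) - ngram + 1, stride)
    ]


# Atom-level tokens of 'CC[N+](C)(C)Cc1ccccc1Br', taken from the example above.
atom_tokens = ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C',
               'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']

kmers = sketch_kmers(atom_tokens, ngram=4, stride=1)
print(kmers)

# A larger stride skips starting positions, so fewer k-mers are produced.
print(sketch_kmers(atom_tokens, ngram=4, stride=2))
```

Note that a k-mer counts tokens rather than characters: 'CC[N+](' is a 4-mer even though it is seven characters long, because '[N+]' is a single token.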