A walkthrough of TMAP

7 minute read

Published:

The notebook version is here.

1. What is TMAP?

TMAP is a really cool interactive visualization for big data. Examples can be found here.

Tree MAP (TMAP) is an algorithm developed by Dr. Probst for visualization of very large high-dimensional data sets as minimum spanning trees.

Visualizations based on TMAP are better suited than t‑SNE or UMAP for the exploration and interpretation of large data sets due to their tree‑like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on.

The four projects mentioned are:

  1. A probabilistic molecular fingerprint for big data settings
  2. SmilesDrawer: parsing and drawing SMILES‑encoded molecular structures using client‑side JavaScript
  3. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web
  4. Visualization of very large high-dimensional data sets as minimum spanning trees

2. How it works?

TMAP consists of 4 phases:

  • LSH forest indexing
  • Construction of a c-approximate k-nearest neighbor graph
  • Calculation of a minimum spanning tree (MST) of the c-approximate k-nearest neighbor graph
  • Generation of a graph layout for the resulting MST

The first two steps are designed for big data setting and can be replaced by other k-nearest neighbor algorithms. In the resulting graph, the nodes are molecules (or other entities) and the edges are weighted by the distances between nodes. In the 3rd step, a tree is basicly a graph wihch doest not contrain cycle. A MST is the tree has the minimum sum of edge weights. The MST is ready for visualization. In the last step, we choose a layout to visualize the MST.

2. Code Walkthrough

Packages

Installation of (1) RDKit (2) TMAP (3) MHFP and (4) Faerun.

conda install -c rdkit rdkit
conda install -c tmap tmap
pip install mhfp
pip install faerun
import pandas as pd
import tmap
from faerun import Faerun
from mhfp.encoder import MHFPEncoder
from rdkit.Chem import AllChem

Data

Here we use ESOL dataset includes 1117 molecules.

url = 'https://raw.githubusercontent.com/XinhaoLi74/molds/master/clean_data/ESOL.csv'

df = pd.read_csv(url)
df.shape
(1117, 2)
df.head(1)
smileslogSolubility
0N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c...-0.77

Step 1: Compute descriptors

MHFP6

MHFP6 (MinHash fingerprint, up to six bonds) is a molecular fingerprint which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms.

# The number of permutations used by the MinHashing algorithm
perm = 512

# Initializing the MHFP encoder with 512 permutations
enc = MHFPEncoder(perm)

# Create MHFP fingerprints from SMILES
# The fingerprint vectors have to be of the tm.VectorUint data type
fingerprints = [tmap.VectorUint(enc.encode(s)) for s in df["smiles"]]

Data types for TMAP: VectorUnit and VectorFloat

Step 2: LSH indexing and coordinates generation

# Initialize the LSH Forest
lf = tmap.LSHForest(perm)

# Add the Fingerprints to the LSH Forest and index
lf.batch_add(fingerprints)
lf.index()
# Get the coordinates
x, y, s, t, _ = tmap.layout_from_lsh_forest(lf)

x and y are the coordinates of the nodes. s and t store the indexes of start nodes and to nodes in the MST, respectively.

Step 3: Plotting

# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
    "ESOL_Basic",
    {   "x": x, 
        "y": y, 
        "c": list(df.logSolubility.values), 
        "labels": df["smiles"]},
    point_scale=5,
    colormap = ['rainbow'],
    has_legend=True,
    legend_title = ['ESOL (mol/L)'],
    categorical=[False],
    shader = 'smoothCircle'
)

faerun.add_tree("ESOL_Basic_tree", {"from": s, "to": t}, point_helper="ESOL_Basic")

# Choose the "smiles" template to display structure on hover
faerun.plot('ESOL_Basic', template="smiles", notebook_height=750)

/images/TMAP/ESOL_Basic.html

3. Advances

3.1. Add more legends

Sometime we want to plot multiple labels for a single dataset. There are two types of lables, continues and categorical. The type of labels need to explicitly assign to the categorical in faerun.add_scatter.

We compute two categorical labels: (1) the number of rings (2) Linear molecules (is_linear): 0 if numrings > 1 and 1 if numrings = 0.

from rdkit.Chem import rdMolDescriptors
from rdkit import Chem
numrings = [rdMolDescriptors.CalcNumRings(Chem.MolFromSmiles(s)) for s in df["smiles"]]
set(numrings)
{0, 1, 2, 3, 4, 5, 6, 7, 8}
is_linear = [1 if r == 0 else 0 for r in numrings]

The molecules in ESOL datast contains rings from 0 to 8. Now we are going to plot three labels. We need to change setting in faerun.add_scatter, information of multiple labels are passed as lists.

# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
    "ESOL",
    {   "x": x, 
        "y": y, 
        "c": [list(df.logSolubility.values), numrings, is_linear], 
        "labels": df["smiles"]},
    point_scale=5,
    colormap = ['rainbow', 'Set1'],
    has_legend=True,
    categorical=[False, True, True],
    series_title = ['ESOL (mol/L)', 'Rings', 'is_linear'],
    legend_labels = [None, None, [(0, "No"), (1, "Yes")]],
    shader = 'smoothCircle'
)

faerun.add_tree("ESOL_tree", {"from": s, "to": t}, point_helper="ESOL")

# Choose the "smiles" template to display structure on hover
faerun.plot('ESOL', template="smiles", notebook_height=750)

/images/TMAP/ESOL.html

3.2. Use different descriptors/fingerprints

We can also use other descriptors/fingerprints. The descriptors/fingerprints need to be converted to Minhash vectors first. It supports binary, indexed, string and also int and float weighted vectors as input and returns a list of Minhash vectors (List of VectorUint)

Methods for different types of input:

batch_from_binary_array. MinHash vectors from binary arrays. The input vectors need to be a list of VectorUchar.

batch_from_int_weight_array. Create MinHash vectors from integer arrays (not only zeros and ones). The input vectors need to be a list of VectorUint.

batch_from_sparse_binary_array. Create MinHash vectors from sparse binary arrays. The input vectors need to be a list of VectorUint – A list of vectors containing indices of ones in a binary array.

batch_from_string_array. Create MinHash vectors from string arrays. The input vector is a list of list or string.

batch_from_weight_array. Create MinHash vectors from float arrays. The input vectors need to be a list of VectorFloat. – A list of vectors containing float values. Keyword Arguments: method (str) – The weighted hashing method to use (ICWS (default) or I2CWS).

Let’s try ECFP4 (binary).

bits = 1024

mols = [Chem.MolFromSmiles(s) for s in df['smiles']]
ECFP4_fps = [AllChem.GetMorganFingerprintAsBitVect(x,2,bits) for x in mols]
ecfp4_lists = [tmap.VectorUchar(list(fp)) for fp in ECFP4_fps]
# Initialize the Minhash
enc = tmap.Minhash(bits)

# Initialize the LSH Forest
lf_ecfp4 = tmap.LSHForest(bits)

# Add the Fingerprints to the LSH Forest and index
lf_ecfp4.batch_add(enc.batch_from_binary_array(ecfp4_lists))
lf_ecfp4.index()
x, y, s, t, _ = tmap.layout_from_lsh_forest(lf_ecfp4)
# Now plot the data
faerun = Faerun(view="front", coords=False)
faerun.add_scatter(
    "ESOL_ECFP4",
    {   "x": x, 
        "y": y, 
        "c": [list(df.logSolubility.values), numrings, is_linear], 
        "labels": df["smiles"]},
    point_scale=5,
    colormap = ['rainbow', 'Set1'],
    has_legend=True,
    categorical=[False, True, True],
    series_title = ['ESOL (mol/L)', 'Rings', 'is_linear'],
    legend_labels = [None, None, [(0, "No"), (1, "Yes")]],
    shader = 'smoothCircle'
)

faerun.add_tree("ESOL_ECFP4_tree", {"from": s, "to": t}, point_helper="ESOL_ECFP4")

# Choose the "smiles" template to display structure on hover
faerun.plot("ESOL_ECFP4",template="smiles", notebook_height=750)

/images/TMAP/ESOL_ECFP4.html

Leave a Comment