HDBSCAN Clustering Suite

PUBLISHED ON DEC 1, 2017 — CAL STATE LONG BEACH, RESEARCH

Purpose

During my time as a graduate researcher at Dr. Eric Sorin’s computational biophysics lab at Cal State Long Beach, I was tasked with analyzing protein folding simulation data associated with Alzheimer’s disease.

Once the simulations complete, the resulting data must be clustered to identify energetically stable states of the proteins. With this information, drugs can then be designed to inhibit the spread of Alzheimer's. However, clustering high-dimensional, high-volume data is an extremely difficult problem, both quantitatively and qualitatively. Assessing the results is a nontrivial task due to the arbitrary nature of many clustering algorithms. Furthermore, since clustering algorithms generally rely on Euclidean distance, the curse of dimensionality kicks in as the number of dimensions grows.

After months of researching many of the most cutting-edge clustering algorithms, I converged on HDBSCAN. Though not perfect, this algorithm gave us by far the most consistent and reliable results, so we decided to pursue it full blast.

After implementing and running HDBSCAN against our dataset millions of times, I realized I was repeating myself unnecessarily. HDBSCAN takes two hyperparameters, and some unknown combination of their values yields the most satisfactory results for a given dataset. So I thought it best to build a suite around HDBSCAN that lets someone run the algorithm on a dataset over a range of hyperparameter values, analyze the results, and store them for later use, all with a single command and a corresponding config file. And that is exactly what I implemented.
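To give a sense of what a single run looks like, below is a minimal sketch of the kind of sweep the suite automates; it is not the suite's actual code. It assumes the open-source hdbscan package (which is not listed under Technologies, so treat it as an assumption), and the file path and column selections are illustrative values borrowed from the parameter list below.

```python
# Minimal sketch of an HDBSCAN hyperparameter sweep, assuming the
# open-source `hdbscan` package. The path and column choices here are
# illustrative, not the suite's actual defaults.
import hdbscan
import pandas as pd

data = pd.read_csv("data/luteo-1796-1798.txt", sep=r"\s+", header=None)
X = data.iloc[:, 4:12].values  # columns included in the clustering

for min_cluster_size in range(2, 11):  # sweep one hyperparameter, 2..10
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                core_dist_n_jobs=4)  # worker threads
    labels = clusterer.fit_predict(X)
    n_clusters = labels.max() + 1  # noise points are labeled -1
    print(f"min_cluster_size={min_cluster_size}: {n_clusters} clusters")
```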

Users can specify the following parameters:

  • runs: Number of times to run HDBSCAN with a given set of parameters. Example: 50
  • data: Path to the dataset to be clustered. Example: "data/luteo-1796-1798.txt"
  • partition:column: Column used to partition the data. Example: 3
  • partition:start: Starting value for the partition; everything less is left out. Example: 6000
  • sample: Sample size drawn from the dataset after partitioning; set to 0 to use the entire dataset. Example: 12000
  • norm:method: Desired normalization method. Available methods: standard_score, feature_scale. Example: "feature_scale"
  • norm:columns: Columns to be normalized. Example: [4,5]
  • range: Columns to be included in the clustering, starting at 0. Example: 4,12
  • parameters:range: Whether the desired parameters form a range; set to false to run with individual values. Example: true
  • parameters:option: Cluster criterion, minimum cluster size or minimum sample size; see below for more information. Example: "min_cluster_size"
  • parameters:min: Range or set of values to use for each run. If range is set to true, [2,10,1] runs parameters from 2 to 10 in steps of 1; otherwise [2,5,10,30] runs parameters 2, 5, 10, and 30. Example: [2,10,1]
  • threads: Number of threads to use within the HDBSCAN algorithm. Example: 4

Given a specified set of parameters, the data will be clustered as many times as necessary.
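To make this concrete, here is a hypothetical config following the parameters above, along with the way a range spec might expand into per-run values. The post does not show the suite's actual file format, so the JSON layout and key nesting are assumptions; only the parameter names and example values come from the list above.

```python
# Hypothetical config illustrating the parameters above. The suite's real
# file format is not shown in this post, so JSON and this exact nesting
# are assumptions based on the parameter names.
import json

config = json.loads("""
{
  "runs": 50,
  "data": "data/luteo-1796-1798.txt",
  "partition": {"column": 3, "start": 6000},
  "sample": 12000,
  "norm": {"method": "feature_scale", "columns": [4, 5]},
  "range": [4, 12],
  "parameters": {"range": true, "option": "min_cluster_size", "min": [2, 10, 1]},
  "threads": 4
}
""")

# Expand parameters:min into the values used for each run: [2,10,1] is a
# start/stop/step spec when parameters:range is true (assumed inclusive
# of the stop value), otherwise a literal list such as [2, 5, 10, 30].
p = config["parameters"]
if p["range"]:
    start, stop, step = p["min"]
    values = list(range(start, stop + 1, step))
else:
    values = p["min"]
print(values)  # [2, 3, 4, 5, 6, 7, 8, 9, 10]
```

One nice property of expanding the values up front is that each run becomes independent, which makes it straightforward to spread the runs across threads.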

Results

Creating this suite streamlined the process of clustering simulation data, and it is now used by several members of the lab.

Technologies

  • Python
  • Additional Python libraries: scikit-learn, seaborn, pandas
