Cluster analysis approaches

Step 1: Performing K-means Clustering

Create K-means Clustering models:

Identify the number of clusters to derive

# 1. Determine the optimal number of clusters (elbow method)
# fviz_nbclust(key_measures_scaled_head, kmeans, method = "wss")
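
The elbow method named in the commented-out call above can be sketched in base R without any plotting package. This is a minimal illustration under stated assumptions: `toy_data` is simulated data standing in for `key_measures_scaled_head` (which is not shown in this report), and the range of k values is arbitrary.

```r
# Simulated data standing in for the scaled measures (hypothetical, not the source data)
set.seed(42)
toy_data <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
))

# Total within-cluster sum of squares for each candidate k
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_data, centers = k, nstart = 10)$tot.withinss
})

# Plot WSS against k; the "elbow" is the point where the curve starts to flatten
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read off the plot by eye, so it is a guide rather than a definitive answer for the number of clusters.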


Visualise K-means models:

Pairwise Scatter Plot Matrix

Plot classifications

Step 2: Performing Model-Based Clustering

Test the number of clusters in the model-based approach (latent profile analysis, LPA)


fit_output <- explore_model_fit(key_measures_rep_sub_sample, n_profiles_range = 1:12)

Select model

Plot model classifications

Step 3: Performing Density-Based Clustering

Optimal value of “eps” parameter
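
The report does not show how the eps value was chosen. One common heuristic, assumed here purely for illustration, is to plot the sorted distance from each point to its k-th nearest neighbour and look for a sharp bend. The sketch below uses base R only; `toy_data` is simulated and `k = 5` is an arbitrary choice that would normally be set in line with DBSCAN's `minPts` parameter.

```r
# Simulated two-cluster data (hypothetical, not the source data)
set.seed(42)
toy_data <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
)

# Illustrative neighbourhood size; usually tied to the minPts setting
k <- 5

# Distance from each point to its k-th nearest neighbour.
# After sorting, element 1 is the point itself (distance 0), so we take k + 1.
dist_matrix <- as.matrix(dist(toy_data))
kth_nn_dist <- apply(dist_matrix, 1, function(d) sort(d)[k + 1])

# Sorted k-NN distances; a sharp upward bend ("knee") suggests a candidate eps
plot(sort(kth_nn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
```

If the `dbscan` package is available, its `kNNdistplot()` function draws a comparable plot directly.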

Plot DBSCAN results

Heat map

Step 4: Evaluate Clustering Models

When evaluating which cluster model is most suitable, we isolate the cluster vectors from each of the model objects:

Next, we calculate three measures of model fit on a 25% sample of each cluster vector:

  1. Silhouette Score
  2. Davies-Bouldin Index (DB Index)
  3. Within-Cluster Sum of Squares (WCSS)

Explanation of Metrics:

Silhouette Score:

  • Interpretation: Measures how similar an object is to its own cluster compared to other clusters. Values range from -1 to 1.
  • Higher values indicate better-defined clusters. Negative values suggest that samples might be assigned to the wrong cluster.

DB Index (Davies-Bouldin Index):

  • Interpretation: Measures, for each cluster, how similar it is to the cluster it most resembles, then averages over clusters. Similarity here is the ratio of within-cluster scatter to the distance between cluster centroids.
  • Lower values indicate compact, well-separated clusters. Higher values suggest that clusters overlap or are not well separated.

WCSS (Within-Cluster Sum of Squares):

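  • Interpretation: Measures the total squared distance between each point and the centroid of its assigned cluster.
  • Lower values indicate that clusters are more compact. Higher values suggest that clusters are more spread out. Because WCSS tends to fall as the number of clusters grows, it favours models with more clusters.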
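
The three metrics can be computed from a data matrix and a vector of cluster labels. The sketch below uses the standard textbook definitions in base R. The evaluation code behind this report is not shown and may rely on package functions instead, so treat this as an illustration on simulated data (`toy_data`), not as the method used here.

```r
# Simulated data and cluster labels (hypothetical, not the source data)
set.seed(42)
toy_data <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
clusters <- kmeans(toy_data, centers = 2, nstart = 10)$cluster

# Cluster centroids, one row per cluster
ids <- sort(unique(clusters))
centroids <- t(sapply(ids, function(k) {
  colMeans(toy_data[clusters == k, , drop = FALSE])
}))

# WCSS: squared distance from each point to its own centroid, summed over all points
wcss <- sum((toy_data - centroids[match(clusters, ids), ])^2)

# Silhouette: s(i) = (b - a) / max(a, b), where a is the mean distance to the
# point's own cluster and b is the mean distance to the nearest other cluster
d <- as.matrix(dist(toy_data))
sil <- sapply(seq_len(nrow(toy_data)), function(i) {
  own <- clusters == clusters[i]
  if (sum(own) == 1) return(0)  # usual convention for singleton clusters
  a <- sum(d[i, own]) / (sum(own) - 1)
  b <- min(sapply(setdiff(ids, clusters[i]), function(k) mean(d[i, clusters == k])))
  (b - a) / max(a, b)
})
silhouette_score <- mean(sil)

# Davies-Bouldin: for each cluster, the worst-case ratio of combined scatter
# to centroid separation, averaged over clusters
scatter <- sapply(ids, function(k) {
  pts <- toy_data[clusters == k, , drop = FALSE]
  mean(sqrt(rowSums(sweep(pts, 2, centroids[match(k, ids), ])^2)))
})
centroid_dist <- as.matrix(dist(centroids))
db_index <- mean(sapply(seq_along(ids), function(i) {
  max((scatter[i] + scatter[-i]) / centroid_dist[i, -i])
}))

c(silhouette = silhouette_score, db_index = db_index, wcss = wcss)
```

The full distance matrix grows with the square of the number of rows, which is one practical reason to evaluate on a sample instead of the full data.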

For ease of comparison, we combine each of the eval-objects into a single table:

We can then visualise the evaluation metrics:

From the above visualisation, we can note:

  • Silhouette Score: lpa_clusters_5 has the highest score, indicating well-defined clusters.
  • DB Index: dbscan_clusters has the lowest index, indicating well-separated clusters.
  • WCSS: kmeans_clusters_12 has the lowest WCSS, indicating compact clusters.

Because no single model performs best across all three metrics, we can normalise each metric and create a composite measure. To do so, we sum the normalised scores across the 3 measures of model fit and divide by 3 to calculate the composite score.

An alternative approach is to apply a weight to each normalised score before summing. Taking a simple average assumes that each metric is equally important.
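
Both the equally weighted and the weighted composite can be sketched as follows. The report does not show how the metrics were normalised, so this assumes min-max scaling and assumes that the two lower-is-better metrics (DB Index and WCSS) are inverted before averaging. The model names, scores, and weights are invented for illustration and are not the results of this analysis.

```r
# Illustrative values only (hypothetical models and scores, not the source results)
eval_table <- data.frame(
  model      = c("model_a", "model_b", "model_c"),
  silhouette = c(0.40, 0.25, 0.10),
  db_index   = c(1.20, 0.80, 1.60),
  wcss       = c(500, 900, 300)
)

# Min-max normalisation to the range [0, 1]
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Silhouette is higher-is-better; DB Index and WCSS are lower-is-better,
# so they are inverted to put all three on a common "higher is better" scale
norm_scores <- data.frame(
  silhouette = normalise(eval_table$silhouette),
  db_index   = 1 - normalise(eval_table$db_index),
  wcss       = 1 - normalise(eval_table$wcss)
)

# Equal weighting: sum the three normalised scores and divide by 3
eval_table$composite_equal <- rowSums(norm_scores) / 3

# Alternative: weighted sum, with weights that sum to 1 (illustrative weights)
weights <- c(silhouette = 0.5, db_index = 0.25, wcss = 0.25)
eval_table$composite_weighted <- as.vector(as.matrix(norm_scores) %*% weights)

# Rank models from highest to lowest equally weighted composite score
eval_table[order(-eval_table$composite_equal), ]
```

Without the inversion step, a model with a high DB Index or a high WCSS would be rewarded, so the direction of each metric needs checking before scores are combined.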

The above composite score implies that the k-means model with 12 clusters is the most suitable model. Whether these clusters suit our understanding and intended use of the model, however, is open to debate and adjustment.

K-means (12 cluster) model output:

