```r
## Check for highly correlated variables. Highly correlated variables
## might need to be removed or combined.

# pull in clean variable names
cor_matrix <- cor(
  key_measures_rep_sample_ex_outlier |>
    pivot_longer(-person_id) |>
    left_join(key_measure_lookup, by = c("name" = "variable")) |>
    select(-name) |>
    pivot_wider(
      id_cols = "person_id",
      names_from = "variable_clean",
      values_from = "value"
    ) |>
    select(-person_id) # Exclude person_id
)

corrplot_plot <- corrplot::corrplot(
  cor_matrix,
  method = "square",
  type = "lower",
  tl.col = "black"
)
```
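To act on the plot, it can help to list the variable pairs that exceed a chosen threshold. A minimal sketch, assuming the `cor_matrix` computed above and an illustrative cutoff of 0.8:

```r
# a minimal sketch: list variable pairs whose absolute correlation
# exceeds an illustrative cutoff of 0.8
high_cor <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)

data.frame(
  var_1       = rownames(cor_matrix)[high_cor[, "row"]],
  var_2       = colnames(cor_matrix)[high_cor[, "col"]],
  correlation = cor_matrix[high_cor]
)
```

Any pairs flagged here are candidates for removal or combination before clustering.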
5. Assess Multicollinearity
```r
## Check for multicollinearity, which can affect the clustering results.

# Example using VIF: regress care_contacts on the other measures
vif_results <- car::vif(
  lm(care_contacts ~ ., data = key_measures_rep_sample_ex_outlier[, -1])
)
```
A Variance Inflation Factor (VIF) indicates the degree to which a predictor variable in a regression model is correlated with other predictor variables.
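Concretely, the VIF for predictor i is 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing predictor i on all of the other predictors.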
Interpretation:

- A VIF of 1 signifies no correlation; higher values represent increasing levels of multicollinearity.
- A VIF between 1 and 5 is considered moderately correlated.
- Values above 5 suggest potentially problematic multicollinearity that might require further investigation or corrective action, such as removing or combining highly correlated variables (a quick way to flag these is sketched below).
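Since `car::vif()` returns a named numeric vector here, a couple of lines can surface the problem predictors. A minimal sketch, assuming the `vif_results` object from above and the cutoff of 5 mentioned in the list:

```r
# a minimal sketch: sort the VIFs and flag predictors above the cutoff of 5
sort(vif_results, decreasing = TRUE)
vif_results[vif_results > 5]
```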
6. Dimensionality Reduction
Use PCA (Principal Component Analysis) to reduce the dimensionality of the data:
```r
pca_result <- prcomp(
  key_measures_rep_sample_ex_outlier |> select(-person_id),
  center = TRUE,
  scale. = TRUE
)
```
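It can also be worth inspecting the loadings to see which variables drive each component. A minimal sketch using the `pca_result` object above:

```r
# a minimal sketch: loadings of each variable on the first two components
round(pca_result$rotation[, 1:2], 2)
```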
Visualise the proportion of variance explained by each principal component:
```r
screeplot(pca_result, type = "lines")
```
If PC1 explains 23.82% of the variance and PC2 explains 12.84%, together they explain 36.66% of the variance. In other words, the first two principal components capture just over a third of the information in the dataset.
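These figures come straight from the PCA summary. A minimal sketch for extracting them from the `pca_result` object above:

```r
# proportion of variance explained by each component, and the running total
pca_summary <- summary(pca_result)
pca_summary$importance["Proportion of Variance", ]
pca_summary$importance["Cumulative Proportion", ]
```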