Preclustering data checks

Author

Alexander Lawless

Published

07-03-2025

knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, fig.width =12, fig.height = 9)

library(tidyverse)
library(janitor)

# Read in sample data
key_measures <-
  read_csv("key_measures_sample_cluster.csv") |>
  dplyr::select(2:13)

# Clean names for graphs
key_measure_lookup <-
  tribble(
    ~variable, ~variable_clean,
    "person_id"                 , "0. Person ID",
    "care_contacts"             , "A. Care contacts",
    "referrals"                 , "B. Referrals",
    "length_care"               , "C. Care length",
    "average_daily_contacts"    , "D. Average daily contacts",
    "contact_prop_aft_midpoint" , "E. Contact proportion after midpoint",
    "team_input_count"          , "F. Team input count",
    "intermittent_care_periods" , "G. Intermittent care periods",
    "specialist_contact_prop"   , "H. Specialist care proportion",
    "remote_contact_prop"       , "I. Remote care proportion",
    "avg_contact_duration"      , "J. Average contact duration",
    "acute_admissions_2223"     , "K. Acute admission count"
  )

# Load in pre-cluster check outputs
load("pre_cluster_check_outputs.RData")

Pre-cluster checks

Data summary

Data summary
Name                      key_measures
Number of rows            101668
Number of columns         12
_______________________
Column type frequency:
  character               1
  numeric                 11
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
person_id              0              1   15   15      0    101668           0

Variable type: numeric

skim_variable              n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
care_contacts                      0              1  -0.09  0.70  -0.50  -0.50  -0.34  -0.04   4.85  ▇▁▁▁▁
referrals                          0              1  -0.05  0.86  -0.64  -0.64  -0.64  -0.06   3.99  ▇▂▁▁▁
length_care                        0              1  -0.03  0.97  -1.27  -0.89  -0.27   0.77   2.03  ▇▅▃▃▂
average_daily_contacts             0              1  -0.01  0.98  -0.23  -0.23  -0.23  -0.23   5.03  ▇▁▁▁▁
contact_prop_aft_midpoint          0              1   0.00  1.00  -2.70  -0.59  -0.05   0.62   2.69  ▁▅▇▃▁
team_input_count                   0              1  -0.01  0.98  -0.58  -0.58  -0.58   0.62   4.21  ▇▂▁▁▁
intermittent_care_periods          0              1  -0.02  0.96  -1.14  -0.59  -0.04   0.51   3.25  ▇▆▁▁▁
specialist_contact_prop            0              1   0.00  1.00  -3.03   0.41   0.41   0.41   0.41  ▁▁▁▁▇
remote_contact_prop                0              1   0.01  1.01  -0.82  -0.82  -0.47   0.76   2.34  ▇▂▂▁▁
avg_contact_duration               0              1  -0.03  0.94  -1.85  -0.54   0.11   0.37   3.06  ▃▇▂▁▁
acute_admissions_2223              0              1  -0.05  0.90  -0.58  -0.58  -0.58   0.53   3.86  ▇▃▁▁▁

1. Check for Missing Values

sum(is.na(key_measures))
[1] 0
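
A single total of zero confirms there are no missing values anywhere. Had the total been non-zero, a per-column count would show where the gaps sit. A minimal sketch on a toy data frame (`toy_measures` and its values are invented for illustration and are not part of the real data):

```r
# Toy data standing in for key_measures (values are made up)
toy_measures <- data.frame(
  person_id     = c("a", "b", "c", "d"),
  care_contacts = c(0.2, NA, -0.5, 1.1),
  referrals     = c(NA, NA, 0.3, -0.6)
)

# Number of missing values in each column
missing_by_column <- colSums(is.na(toy_measures))
missing_by_column

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy_measures))
complete_rows
```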

2. Examine Data Distribution

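Several variables are strongly right-skewed in the summary above (for example referrals and average_daily_contacts, where the median equals the minimum), and specialist_contact_prop is left-skewed. Do the skewed variables need transforming before clustering?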
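
Skewness can be quantified before deciding on a transformation. The sketch below defines a moment-based skewness helper (`skewness` is a hypothetical helper written here, not a function from the report) and applies it to toy data. The measures in the summary above appear to be standardised already (means near 0, standard deviations near 1), so a log-type transform would, as an assumption, have to be applied to the raw values before scaling.

```r
# Moment-based sample skewness: 0 for symmetric data, > 0 for a long right tail
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

set.seed(42)
toy_raw <- data.frame(
  symmetric    = rnorm(1000),
  right_skewed = rexp(1000)   # long right tail, similar in shape to a count of contacts
)

# Skewness before any transformation
skew_before <- sapply(toy_raw, skewness)

# log1p() compresses the right tail; it needs non-negative raw values,
# so it would have to be applied before standardising
skew_after <- skewness(log1p(toy_raw$right_skewed))

round(skew_before, 2)
round(skew_after, 2)
```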

3. Outlier Detection

remove_outliers <-
  key_measures |>
  pivot_longer(cols = -person_id) |>
  group_by(name) |>
  mutate(quantile_99 = quantile(value, 0.99)) |>
  ungroup() |>
  filter(value > quantile_99) |>  # keep only values above the 99th percentile to identify the people to remove
  select(person_id) |>
  distinct()

key_measures_rep_sample_ex_outlier <-
  key_measures |>
  anti_join(remove_outliers, by = "person_id")
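
Because a person is dropped if any one of their variables exceeds its 99th percentile, the share of people removed can be noticeably higher than 1%. The sketch below repeats the same logic on toy data and reports that share (`toy_measures` and the `measure_*` columns are invented for illustration):

```r
library(dplyr)
library(tidyr)

set.seed(1)
toy_measures <- tibble(
  person_id = sprintf("p%03d", 1:500),
  measure_a = rnorm(500),
  measure_b = rnorm(500),
  measure_c = rnorm(500)
)

# People with at least one value above that variable's 99th percentile
toy_outliers <-
  toy_measures |>
  pivot_longer(cols = -person_id) |>
  group_by(name) |>
  filter(value > quantile(value, 0.99)) |>
  ungroup() |>
  distinct(person_id)

toy_ex_outlier <- anti_join(toy_measures, toy_outliers, by = "person_id")

# Share of people dropped; it can exceed 1% because the rule is applied per variable
share_removed <- 1 - nrow(toy_ex_outlier) / nrow(toy_measures)
share_removed
```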

Plot distribution with outliers removed:
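
The plotting code is not shown in this output. One possible way to draw such distributions, assuming ggplot2 and tidyr and using toy data (`toy_ex_outlier` and its columns are invented here), is a faceted histogram with one panel per variable:

```r
library(ggplot2)
library(tidyr)

set.seed(2)
toy_ex_outlier <- data.frame(
  person_id = sprintf("p%03d", 1:300),
  measure_a = rnorm(300),
  measure_b = rexp(300)
)

# Reshape to long format so each variable gets its own panel
distribution_plot <-
  toy_ex_outlier |>
  pivot_longer(cols = -person_id) |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~name, scales = "free") +
  labs(x = "Value", y = "Number of people")

distribution_plot
```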

4. Correlation Analysis

## Check for highly correlated variables. Highly correlated variables might need to be removed or combined.
# pull in clean variable names
cor_matrix <-
  cor(
    key_measures_rep_sample_ex_outlier |>
      pivot_longer(-person_id) |>
      left_join(key_measure_lookup, by = c("name" = "variable")) |>
      select(-name) |>
      pivot_wider(id_cols = "person_id",
                  names_from = "variable_clean",
                  values_from = "value") |>
      select(-person_id)  # exclude person_id before computing correlations
  )

corrplot_plot <-
  corrplot::corrplot(
    cor_matrix,
    method = "square",
    type = "lower",
    tl.col = "black"
    )
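
A correlation plot is easy to scan by eye, but it can also help to list the variable pairs whose absolute correlation exceeds a chosen cut-off. A minimal base R sketch on toy data follows; the 0.7 threshold and the `measure_*` columns are illustrative assumptions, not values taken from this analysis.

```r
set.seed(3)
x <- rnorm(200)
toy <- data.frame(
  measure_a = x,
  measure_b = x + rnorm(200, sd = 0.2),  # strongly related to measure_a
  measure_c = rnorm(200)                 # unrelated to the others
)

toy_cor <- cor(toy)

# Illustrative cut-off only; choose a threshold that suits the analysis
threshold <- 0.7

# Keep the upper triangle so each pair is considered once
pair_index <- which(upper.tri(toy_cor) & abs(toy_cor) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  variable_1  = rownames(toy_cor)[pair_index[, "row"]],
  variable_2  = colnames(toy_cor)[pair_index[, "col"]],
  correlation = toy_cor[pair_index]
)
high_pairs
```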

5. Assess Multicollinearity

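## Check for multicollinearity, which can affect the clustering results.
# care_contacts is the response only so that lm() can be fitted; VIFs are returned for the other ten variables.
vif_results <- car::vif(lm(care_contacts ~ ., data = key_measures_rep_sample_ex_outlier[, -1]))  # [, -1] drops person_id
vif_results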
                referrals               length_care    average_daily_contacts 
                 1.462757                  2.269194                  1.091457 
contact_prop_aft_midpoint          team_input_count intermittent_care_periods 
                 1.062328                  1.164460                  2.172074 
  specialist_contact_prop       remote_contact_prop      avg_contact_duration 
                 1.096525                  1.027449                  1.051520 
    acute_admissions_2223 
                 1.006754 

A Variance Inflation Factor (VIF) indicates the degree to which a predictor variable in a regression model is correlated with other predictor variables.

Interpretation:

  • A VIF of 1 indicates no correlation with the other variables; higher values indicate increasing multicollinearity.
  • A VIF between 1 and 5 indicates moderate correlation.
  • A VIF above 5 suggests potentially problematic multicollinearity that may need further investigation or corrective action, such as removing or combining highly correlated variables.

All VIFs reported above are below 2.3, so none of the variables shows problematic multicollinearity by this rule of thumb.
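
For reference, the VIF of variable j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing variable j on all the other variables. The sketch below computes this by hand in base R on toy data (`x1`, `x2`, `x3` are invented). Unlike a single `lm()` fit with one variable as the response, looping over every variable returns a VIF for each of them.

```r
set.seed(4)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # correlated with x1
x3 <- rnorm(n)                        # independent of the others
toy <- data.frame(x1, x2, x3)

# VIF for one variable: regress it on all the others and use 1 / (1 - R^2)
vif_by_hand <- sapply(names(toy), function(v) {
  model_formula <- reformulate(setdiff(names(toy), v), response = v)
  r_squared <- summary(lm(model_formula, data = toy))$r.squared
  1 / (1 - r_squared)
})

round(vif_by_hand, 2)
```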

6. Dimensionality Reduction

Using PCA (Principal Component Analysis) to reduce dimensionality.

pca_result <- prcomp(key_measures_rep_sample_ex_outlier |> select(-person_id), center = TRUE, scale. = TRUE)
summary(pca_result)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     1.6141 1.1909 1.0495 1.02811 1.00306 0.96419 0.89643
Proportion of Variance 0.2369 0.1289 0.1001 0.09609 0.09147 0.08451 0.07305
Cumulative Proportion  0.2369 0.3658 0.4659 0.56201 0.65347 0.73799 0.81104
                          PC8     PC9    PC10    PC11
Standard deviation     0.8636 0.79915 0.65199 0.51868
Proportion of Variance 0.0678 0.05806 0.03864 0.02446
Cumulative Proportion  0.8788 0.93690 0.97554 1.00000
Rotation (loading of each variable on each principal component):
                                  PC1          PC2          PC3          PC4
care_contacts              0.43366980  0.194192536  0.045505412  0.006796634
referrals                  0.47564153  0.229474622  0.003409089 -0.050505395
length_care                0.49169989 -0.230318392  0.067416092  0.216643901
average_daily_contacts    -0.10928438  0.501363870  0.261154197  0.364917100
contact_prop_aft_midpoint -0.09608990  0.391180001  0.347148843  0.487834834
team_input_count           0.28185713  0.223616694  0.051567712 -0.148633632
intermittent_care_periods  0.46208616 -0.304339192  0.058858233  0.281660755
specialist_contact_prop   -0.13317146 -0.443711511  0.154891758  0.488574145
remote_contact_prop       -0.05780243  0.009204438 -0.620398220  0.460656898
avg_contact_duration      -0.08049715 -0.292153503  0.624061518 -0.122841272
acute_admissions_2223      0.04720062  0.165080302  0.014041581 -0.109073040
                                  PC5         PC6         PC7         PC8
care_contacts             -0.01958420 -0.02826863  0.35204288  0.23982148
referrals                 -0.02533663 -0.12403610  0.03991884  0.01518506
length_care                0.09075143  0.18682692 -0.09659579  0.03491047
average_daily_contacts     0.06364801 -0.01514642  0.57307901  0.05346632
contact_prop_aft_midpoint  0.06388364  0.10049543 -0.62927856 -0.01026804
team_input_count          -0.23442194 -0.63420004 -0.18596268 -0.47853771
intermittent_care_periods  0.05264000  0.16745702 -0.08084928  0.03638223
specialist_contact_prop   -0.23000273 -0.12509549  0.30013545 -0.48755086
remote_contact_prop       -0.30008066 -0.34744844 -0.08664966  0.42043747
avg_contact_duration      -0.27835069 -0.35590046 -0.04253595  0.53839918
acute_admissions_2223     -0.83931706  0.49527458 -0.01459485 -0.05833579
                                  PC9         PC10          PC11
care_contacts              0.53012401  0.555373724  0.0004347104
referrals                  0.30188444 -0.778582548  0.0580354857
length_care               -0.28324530  0.018205994 -0.7207330867
average_daily_contacts    -0.43951831 -0.081179738 -0.0001053602
contact_prop_aft_midpoint  0.25174206  0.074225997  0.0094252634
team_input_count          -0.26400005  0.235284679 -0.0014987023
intermittent_care_periods -0.30642678  0.075196234  0.6876278253
specialist_contact_prop    0.34260054 -0.080714025 -0.0496201272
remote_contact_prop       -0.05109272 -0.018981612 -0.0382101690
avg_contact_duration      -0.04932893 -0.070977559 -0.0180387476
acute_admissions_2223     -0.07101875  0.003451356 -0.0021958714

Visualise the variance explained by each principal component:

screeplot(pca_result, type = "lines")

In the summary above, PC1 explains 23.69% of the variance and PC2 explains 12.89%, so together they explain 36.58%. The first two principal components therefore capture just over a third of the information in the dataset, and seven components are needed to exceed 80% (cumulative proportion 0.811).
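
The proportions reported by `summary()` can be reproduced from the component standard deviations, which also makes it easy to find how many components reach a target share of variance. A minimal sketch on toy data follows; the columns `a` to `d` and the 80% target are illustrative assumptions.

```r
set.seed(5)
toy <- data.frame(
  a = rnorm(200),
  b = rnorm(200),
  c = rnorm(200)
)
toy$d <- toy$a + rnorm(200, sd = 0.3)  # shares most of its variance with a

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components reaching an illustrative 80% target
n_components <- which(cum_var >= 0.80)[1]

round(prop_var, 3)
round(cum_var, 3)
n_components
```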

Additional visualisation:

biplot
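
The code behind the biplot is not shown in this output. One option, offered here as an assumption rather than the method used in the report, is base R's `biplot()`, which accepts a `prcomp` object and plots observations and variable loadings on the first two components. The sketch uses the built-in `USArrests` dataset so that it runs on its own.

```r
# PCA on a built-in dataset, scaled so each variable has unit variance
usa_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Observations as labels, variables as arrows, on PC1 and PC2
biplot(usa_pca, choices = 1:2, cex = 0.6)
```

With more than 100,000 people, plotting every observation as a label would be unreadable, so a random subsample or a loadings-only plot may be more practical.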