Package tour of sconn

An R package to connect to Databricks

Fran Barton

Mar 13, 2025

The problem

  • Some of us are very precious and need our familiar keyboard shortcuts and UI
  • Working in a Databricks notebook is fine, but for more advanced work in R it’s easier to use your local IDE
  • Initial attempts to connect using {sparklyr} alone didn’t work (for me)

How does {sconn} help?

  • Provides convenience functions to connect to (and disconnect from) our Databricks instance
  • Provides documentation for new users to get set up
  • Provides a place to track issues

Brief usage

library(sconn)
sc()

sc_disconnect()

Development history

  • Similar package used to connect to SQL Server databases
  • Connection calls are lazily bound, as promises, in a hidden environment (so they should not run at load time?)
  • The user-facing function retrieves the binding, which forces the promise and opens the connection

environments, Hadley’s version

Under the hood 1

sc_conn <- function() {
  check_vars()
  sparklyr::spark_connect(
    master = Sys.getenv("DATABRICKS_HOST"),
    cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
    token = Sys.getenv("DATABRICKS_TOKEN"),
    envname = Sys.getenv("DATABRICKS_VENV"),
    app_name = "sconn_sparklyr",
    method = "databricks_connect"
  )
}

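.onLoad <- function(...) {
  # Create a fresh environment to hold connections. Note that `<<-` only
  # assigns into the package namespace if `.conns` is already defined there.
  .conns <<- rlang::new_environment()
  # Bind the call as a promise: sc_conn() is not run until `sc` is accessed.
  rlang::env_bind_lazy(.conns, sc = sc_conn())
}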
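The body of check_vars() is not shown on the slide. As a rough idea of what such a check could look like, here is a minimal sketch that assumes it only needs to confirm that the four environment variables used by sc_conn() are set. The name check_vars_sketch() and the error wording are hypothetical; the real function in sconn may behave differently.

```r
# Hypothetical sketch only: sconn's real check_vars() may differ.
# The variable names are the ones read by sc_conn() above.
required_vars <- c(
  "DATABRICKS_HOST",
  "DATABRICKS_CLUSTER_ID",
  "DATABRICKS_TOKEN",
  "DATABRICKS_VENV"
)

check_vars_sketch <- function(vars = required_vars) {
  # Unset variables come back as empty strings
  values <- Sys.getenv(vars, unset = "")
  unset_vars <- vars[!nzchar(values)]
  if (length(unset_vars) > 0) {
    stop(
      "Missing environment variables: ",
      paste(unset_vars, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Failing early like this gives a new user a readable message before sparklyr attempts a connection. The variables would typically live in the user's .Renviron file.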

empty env

Under the hood 2

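sc <- function(hide_output = TRUE) {
  # Re-create the lazy binding if it is missing from .conns
  if (!rlang::env_has(.conns, "sc")) {
    rlang::env_bind_lazy(.conns, sc = sc_conn())
  }
  # Retrieving the value forces the promise, i.e. runs sc_conn() on first use
  sc <- rlang::env_get(.conns, "sc", default = NULL)
  # Return the connection invisibly unless the caller asks to see it
  if (hide_output) invisible(sc) else sc
}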

Lazy binding?

  • On load (library()/load_all()) the package should lazily bind the connection call to the .conns environment, without running it
  • When the user calls sc(), rlang::env_get() retrieves the binding, which forces it and opens the connection
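The behaviour described above can be tried without a Databricks cluster. The sketch below swaps sc_conn() for a hypothetical stand-in, fake_conn(), that only counts how often it runs. The names fake_conn, n_calls and conns are illustrative and not part of sconn.

```r
# Stand-in for sc_conn(): counts its own calls instead of opening a
# real Spark connection (hypothetical, for illustration only).
n_calls <- 0
fake_conn <- function() {
  n_calls <<- n_calls + 1
  list(open = TRUE)
}

conns <- rlang::new_environment()
rlang::env_bind_lazy(conns, sc = fake_conn())

# Binding alone does not run fake_conn()
calls_after_bind <- n_calls

# Checking that the binding exists does not run it either
has_sc <- rlang::env_has(conns, "sc")
calls_after_has <- n_calls

# The first retrieval forces the promise; later retrievals reuse the value
conn1 <- rlang::env_get(conns, "sc")
conn2 <- rlang::env_get(conns, "sc")
calls_after_get <- n_calls
```

Comparing calls_after_bind, calls_after_has and calls_after_get shows when the stand-in actually ran. In the package, the same mechanism would mean that Spark is only contacted the first time sc() is called.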

Image: Mr. Lazy (mrmen.fandom.com)

Remaining questions

  • Does .onLoad work the way I expect it to?
  • Does env_get always have to activate the connection? (sparklyr’s spark_connection_is_open() function also triggers it)
  • Connection time-outs and reconnection? How best to handle?
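One possible direction for the reconnection question, offered as a sketch rather than an answer: drop the existing binding and bind a new lazy one, so that the next retrieval opens a fresh connection. The helper reset_conn() and the stand-in fake_conn() are hypothetical and are not part of sconn or sparklyr. A real version would probably also need to close the stale connection first.

```r
# Hypothetical stand-in for sc_conn(): each call yields a new numbered
# "connection" so that a reconnection is visible.
n_opened <- 0
fake_conn <- function() {
  n_opened <<- n_opened + 1
  list(id = n_opened)
}

conns <- rlang::new_environment()
rlang::env_bind_lazy(conns, sc = fake_conn())

# Remove whatever is bound (forced or not) and bind a new lazy call.
# Nothing is opened here; the next env_get() does that.
reset_conn <- function(env = conns) {
  if (rlang::env_has(env, "sc")) {
    rlang::env_unbind(env, "sc")
  }
  rlang::env_bind_lazy(env, sc = fake_conn())
  invisible(env)
}

first <- rlang::env_get(conns, "sc")   # opens connection 1
reset_conn()                           # e.g. after detecting a time-out
second <- rlang::env_get(conns, "sc")  # opens connection 2
```

Detecting the time-out is the harder part, since (as noted above) asking sparklyr whether the connection is open forces the binding. This sketch leaves that question open.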

Further resources