A gentle introduction to Databricks

What the heck is Databricks?

16 January 2025

Intro

  • I insisted on having this slot in C&C
  • I think some people want to know what a thing does; others want to know what it is
  • This is the “is” part of the session

What the heck is Spark?

  • Databricks is everything now, and confusingly so
  • Let’s look at the story of Databricks, which starts with Spark
  • Spark was an attempt to improve on MapReduce (primarily Hadoop’s implementation of it)

What the heck is MapReduce?

  • MapReduce is a less analytically specific version of Split, apply, combine
  • What the heck is Split, apply, combine? (last layer of the onion, I promise!)
  • Hadley Wickham wrote about Split, apply, combine in the intro to {plyr}
    • (For the young people: {plyr} is what we had in the olden days before {dplyr}, which is data frame plyr, hence dplyr)

What the heck is Split, apply, combine?

  • Very often in an analysis you want to do the same thing to different groups
  • Split: divide a dataset up by age group
  • Apply: find the mean number of A&E attendances (for 2023/24, say) for each group
  • Combine: bring the results back together and put them in a table
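
As a sketch, the three steps above might look like this in Python (the age groups and attendance counts are made up for illustration, not real A&E data):

```python
from statistics import mean

# Hypothetical records: (age_group, attendances) pairs — illustrative only
records = [
    ("0-17", 2), ("0-17", 4),
    ("18-64", 1), ("18-64", 3), ("18-64", 5),
    ("65+", 6), ("65+", 8),
]

# Split: divide the dataset up by age group
groups = {}
for age_group, attendances in records:
    groups.setdefault(age_group, []).append(attendances)

# Apply: find the mean number of attendances for each group
means = {age_group: mean(values) for age_group, values in groups.items()}

# Combine: bring the results back together in one table
for age_group, avg in sorted(means.items()):
    print(f"{age_group}: {avg}")
```

In R this is what {dplyr}’s `group_by()` + `summarise()` does for you in two lines; the point here is only to show the three steps explicitly.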

What the heck was I talking about again?

  • MapReduce is essentially a programming model that relies on massive parallelisation to get jobs done quickly
  • Spark was a proposed improvement
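
A minimal single-machine sketch of the idea, using the classic word-count example (the documents are made up; a real MapReduce job would run the map and reduce steps in parallel across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the emitted pairs by key (word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)

documents = ["spark improves on mapreduce", "mapreduce runs on hadoop"]

# Each document could be mapped on a different machine, independently
mapped = chain.from_iterable(map_phase(d) for d in documents)
totals = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(totals)  # e.g. "on" and "mapreduce" each appear twice
```

Note the family resemblance to split, apply, combine: shuffle splits by key, reduce applies, and gathering the results combines.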

Spark > Hadoop

  • In-memory processing: this is much faster, especially for certain data science applications
  • More tools and toys: APIs, built-in modules for SQL, ML…
  • Fault tolerance: maintains all the fault tolerance of Hadoop, but works in-memory
  • Much greater flexibility in the way computation is done

The advent of Databricks

  • Spark was open sourced in 2010 and moved to the Apache Software Foundation in 2013 (Apache Spark)
  • Databricks was set up by Spark’s creators as the commercial company behind Apache Spark (Databricks still contributes to the open source version)

Commercial Spark

  • Databricks does the enterprise-y stuff you’d expect (think Posit)
    • Provides support to enterprises
    • Curates, manages, and verifies the code in a commercial version of Spark
    • Provides a platform to deploy and manage Spark, which is not simple

The advent of Delta Lake

  • The other important thing to know about Databricks is Delta Lake
  • Delta Lake is open source and was developed by Databricks to improve on existing data lakes

What the heck is a data lake?

  • Okay, one more
  • Like a data warehouse, but less structured
  • Widely used in data science and analytics
    • As opposed to data warehouses which are more for BI
  • Not either/or: often orgs have both

What does Delta Lake bring?

  • Scalability (particularly around simultaneous processing)
  • ACID transactions
  • What the heck is ACID? (Some will know; this is for those who don’t)

What the heck are ACID transactions?

  • Atomicity - each statement in a transaction (to read, write, update or delete data) is treated as a single unit
  • Consistency - ensures that transactions only make changes to tables in predefined, predictable ways
  • Isolation - isolation of user transactions ensures that concurrent transactions don’t interfere with or affect one another
  • Durability - ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure

(Source)
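
A rough sketch of atomicity, using sqlite3 from the Python standard library as a stand-in (this is not Delta Lake, just a convenient ACID-compliant database that happens to ship with Python): when one statement in a transaction fails, the whole transaction is rolled back.

```python
import sqlite3

# In-memory database with one table; age groups are illustrative
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attendances (age_group TEXT PRIMARY KEY, n INTEGER)")
conn.execute("INSERT INTO attendances VALUES ('0-17', 2)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE attendances SET n = 99 WHERE age_group = '0-17'")
        # This second statement violates the primary key, so the whole
        # transaction — including the UPDATE above — is rolled back
        conn.execute("INSERT INTO attendances VALUES ('0-17', 5)")
except sqlite3.IntegrityError:
    pass

# The update never became visible: all-or-nothing
n = conn.execute(
    "SELECT n FROM attendances WHERE age_group = '0-17'"
).fetchone()[0]
print(n)  # still 2
```

Delta Lake’s contribution is bringing this kind of guarantee to files in a data lake, where traditionally a half-finished write could leave corrupt or inconsistent data behind.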

Build a little lakehouse in your soul

  • Databricks enables a “lakehouse”: warehouse and lake together
  • Lots of whizzy toys are available on Databricks
    • There are so many now that it’s just confusing- “Generative AI”?
    • “you can search and discover data by asking a question in your own words”
  • Equally you can just write SQL against it

Why might we as an SU want databricks?

  • We started using it because it was fast
  • We can use Databricks on UDAL
  • It provides a way for us to jointly organise and share data and data architecture in a RAP-compliant way