A gentle introduction to Databricks
What the heck is Databricks?
16 January 2025
Intro
- I insisted on having this slot in C&C
- I think some people want to know what a thing does; others want to know what it is
- This is the "is" part of the session
What the heck is Spark?
- Databricks is everything now, and confusingly so
- Let's look at the story of Databricks, which starts with Spark
- Spark was an attempt to improve on MapReduce (primarily Hadoop's implementation)
What the heck is MapReduce?
- MapReduce is a less analytically specific version of Split, apply, combine
- What the heck is Split, apply, combine? (last layer of the onion I promise!)
- Hadley Wickham wrote about Split, apply, combine in the intro to {plyr}
- (For the young people: {plyr} is what we had in the olden days before {dplyr}, which is dataframe plyr, hence dplyr)
What the heck is Split, apply, combine?
- Very often in an analysis you want to do the same thing to different groups
- Split: divide a dataset up by age group
- Apply: find (for example) the mean number of A&E attendances for 2023/24 for each group
- Combine: bring the results back together and put them in a table (a quick sketch in code follows this list)
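Here's a minimal sketch of the pattern in Python with pandas; the age groups and attendance figures are made up for illustration:

```python
import pandas as pd

# Hypothetical A&E attendance data: one row per patient
df = pd.DataFrame({
    "age_group": ["0-17", "18-64", "65+", "18-64", "65+"],
    "attendances": [1, 3, 5, 2, 4],
})

# Split by age group, apply the mean, combine into one summary table
summary = df.groupby("age_group")["attendances"].mean().reset_index()
print(summary)
```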
What the heck was I talking about again?
- MapReduce is essentially a programming model that relies on massive parallelisation to get jobs done quickly (a single-machine sketch follows below)
- Spark was a proposed improvement
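To give a feel for the model, here's the classic word-count example sketched in plain Python on one machine; on a real cluster the map and reduce steps each run in parallel across many machines:

```python
from collections import defaultdict
from itertools import chain

docs = ["spark improves on mapreduce",
        "mapreduce splits work across machines"]

# Map: turn each document into (word, 1) pairs; each document is
# processed independently, so this step can be farmed out widely
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in docs)

# Shuffle + Reduce: group pairs by key and sum the counts; a real
# cluster does this per key in parallel, here we just loop
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
```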
Spark > Hadoop
- In-memory processing: this is much faster, especially for certain data science applications
- More tools and toys: APIs, built-in modules for SQL, ML…
- Fault tolerance: maintains all the fault tolerance of Hadoop, but works in-memory
- Much greater flexibility in the way computation is done (see the sketch below)
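A minimal sketch of what that looks like in PySpark, assuming pyspark is installed locally; the data and names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("0-17", 1), ("18-64", 3), ("65+", 5), ("18-64", 2)],
    ["age_group", "attendances"],
)

# cache() keeps the dataset in memory between jobs: this is where much
# of the speed-up over disk-based MapReduce comes from
df.cache()

# The same data can be queried through the DataFrame API...
df.groupBy("age_group").agg(
    F.mean("attendances").alias("mean_attendances")
).show()

# ...or through the built-in SQL module
df.createOrReplaceTempView("attendances")
spark.sql(
    "SELECT age_group, AVG(attendances) AS mean_attendances "
    "FROM attendances GROUP BY age_group"
).show()
```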
The advent of Databricks
- Spark was open-sourced in 2010 and moved to the Apache Software Foundation in 2013 (hence Apache Spark)
- Databricks was set up as the commercial version of Apache Spark (Databricks still contributes to the open-source version)
Commercial Spark
- Databricks does the enterprise-y stuff you’d expect (think Posit)
- Provides support to enterprises
- Curates, manages, and verifies the code in a commercial version of Spark
- Provides a platform to deploy and manage Spark, which is not simple
The advent of Delta Lake
- The other important thing to know about Databricks is Delta Lake
- Delta Lake is open source and was developed by Databricks to improve on existing data lakes
What the heck is a data lake?
- Okay, one more
- Like a data warehouse, but less structured
- Widely used in data science and analytics
- As opposed to data warehouses, which are more oriented towards BI
- Not either/or: organisations often have both
What does Delta Lake bring?
- Scalability (particularly around concurrent processing)
- ACID transactions
- What the heck is ACID? (Some will know; for those who don't, definitions and a short code sketch follow)
What the heck are ACID transactions?
- Atomicity - each statement in a transaction (to read, write, update or delete data) is treated as a single unit
- Consistency - ensures that transactions only make changes to tables in predefined, predictable ways
- Isolation - isolation of user transactions ensures that concurrent transactions don’t interfere with or affect one another
- Durability - ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure
(Source)
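To make that concrete, here's a minimal sketch of Delta Lake's transactional writes in PySpark. It assumes the delta-spark pip package; on Databricks itself `spark` is already configured and none of the setup boilerplate is needed. The path and data are made up:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local setup only (pip install delta-spark); skip this on Databricks
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Atomicity/durability: the write either fully commits to the
# transaction log or not at all; readers never see a half-written table
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# A second transaction creates version 1; version 0 is still readable
spark.createDataFrame([(3, "c")], ["id", "value"]) \
    .write.format("delta").mode("append").save("/tmp/demo_delta")

# Time travel: read the table as it looked before the append
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0).load("/tmp/demo_delta"))
v0.show()
```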
Build a little lakehouse in your soul
- Databricks enables a "lakehouse": warehouse and lake together
- Lots of whizzy toys are available on databricks
- There are so many now that it's just confusing: "Generative AI"?
- “you can search and discover data by asking a question in your own words”
- Equally you can just write SQL against it (a sketch follows)
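Roughly what that looks like in a Databricks notebook; the catalog, table, and column names below are hypothetical:

```python
# In a Databricks notebook `spark` is provided for you; the table
# referenced here (and its columns) are made up for illustration
result = spark.sql("""
    SELECT age_group,
           AVG(attendances) AS mean_attendances
    FROM main.urgent_care.ae_attendances
    WHERE financial_year = '2023/24'
    GROUP BY age_group
""")

display(result)  # Databricks' built-in rich table display
```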
Why might we as an SU want Databricks?
- We started using it because it was fast
- We can use Databricks on UDAL
- It provides a way that we can jointly organise and share data and data architecture in a RAP-compliant way