A gentle introduction to Databricks
What the heck is Databricks?
16 January 2025
Intro
- I insisted on having this slot in C&C
- I think some people want to know what a thing does; others want to know what it is
- This is the "is" part of the session
What the heck is Spark?
- Databricks is everything now, and confusingly so
- Let's look at the story of Databricks, which starts with Spark
- Spark was an attempt to improve on MapReduce (primarily Hadoop's implementation)
What the heck is MapReduce?
- MapReduce is a less analytically specific version of Split, apply, combine
- What the heck is Split, apply, combine? (last layer of the onion I promise!)
- Hadley Wickham wrote about Split, apply, combine in the intro to {plyr}
- (For the young people: {plyr} is what we had in the olden days before {dplyr}, which is dataframe plyr, hence dplyr)
What the heck is Split, apply, combine?
- Very often in an analysis you want to do the same thing to different groups
- Split: divide a dataset up by age group
- Apply: find (for example) the mean number of A&E attendances for 2023/24 for each group
- Combine: bring the results back together and put them in a table (a quick sketch in code follows this list)
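Here's a minimal sketch of the pattern in Python with pandas; the age groups and attendance figures are made up for illustration:

```python
import pandas as pd

# Hypothetical A&E attendance data: one row per patient
df = pd.DataFrame({
    "age_group": ["0-17", "18-64", "65+", "18-64", "65+"],
    "attendances": [1, 3, 5, 2, 4],
})

# Split by age group, apply the mean, combine into one summary table
summary = df.groupby("age_group")["attendances"].mean().reset_index()
print(summary)
```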
What the heck was I talking about again?
- MapReduce is essentially a programming model that relies on massive parallelisation to get jobs done quickly (a single-machine sketch follows below)
- Spark was a proposed improvement
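To give a feel for the model, here's the classic word-count example sketched in plain Python on one machine; on a real cluster the map and reduce steps each run in parallel across many machines:

```python
from collections import defaultdict
from itertools import chain

docs = ["spark improves on mapreduce",
        "mapreduce splits work across machines"]

# Map: turn each document into (word, 1) pairs; each document is
# processed independently, so this step can be farmed out widely
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in docs)

# Shuffle + Reduce: group pairs by key and sum the counts; a real
# cluster does this per key in parallel, here we just loop
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
```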
Spark > Hadoop
- In-memory processing: this is much faster, especially for certain data science applications
- More tools and toys: APIs, built-in modules for SQL, ML…
- Fault tolerance: maintains all the fault tolerance of Hadoop, but works in-memory
- Much greater flexibility in the way computation is done (see the sketch below)
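A minimal sketch of what that looks like in PySpark, assuming pyspark is installed locally; the data and names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("0-17", 1), ("18-64", 3), ("65+", 5), ("18-64", 2)],
    ["age_group", "attendances"],
)

# cache() keeps the dataset in memory between jobs: this is where much
# of the speed-up over disk-based MapReduce comes from
df.cache()

# The same data can be queried through the DataFrame API...
df.groupBy("age_group").agg(
    F.mean("attendances").alias("mean_attendances")
).show()

# ...or through the built-in SQL module
df.createOrReplaceTempView("attendances")
spark.sql(
    "SELECT age_group, AVG(attendances) AS mean_attendances "
    "FROM attendances GROUP BY age_group"
).show()
```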
The advent of Databricks
- Spark was open-sourced in 2010 and moved to the Apache Software Foundation in 2013 (hence Apache Spark)
- Databricks was set up as the commercial version of Apache Spark (Databricks still contributes to the open-source version)
Commercial Spark
- Databricks does the enterprise-y stuff you’d expect (think Posit)
- Provides support to enterprises
- Curates, manages, and verifies the code in a commercial version of Spark
- Provides a platform to deploy and manage Spark, which is not simple
The advent of Delta Lake
- The other important thing to know about Databricks is Delta Lake
- Delta Lake is open source and was developed by Databricks to improve on existing data lakes
What the heck is a data lake?
- Okay, one more
- Like a data warehouse, but less structured
- Widely used in data science and analytics
- As opposed to data warehouses, which are more oriented towards BI
- Not either/or: organisations often have both
What does Delta Lake bring?
- Scalability (particularly around concurrent processing)
- ACID transactions
- What the heck is ACID? (Some will know; for those who don't, definitions and a short code sketch follow)
What the heck are ACID transactions?
- Atomicity - each statement in a transaction (to read, write, update or delete data) is treated as a single unit
- Consistency - ensures that transactions only make changes to tables in predefined, predictable ways
- Isolation - isolation of user transactions ensures that concurrent transactions don’t interfere with or affect one another
- Durability - ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure
(Source)
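To make that concrete, here's a minimal sketch of Delta Lake's transactional writes in PySpark. It assumes the delta-spark pip package; on Databricks itself `spark` is already configured and none of the setup boilerplate is needed. The path and data are made up:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local setup only (pip install delta-spark); skip this on Databricks
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Atomicity/durability: the write either fully commits to the
# transaction log or not at all; readers never see a half-written table
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# A second transaction creates version 1; version 0 is still readable
spark.createDataFrame([(3, "c")], ["id", "value"]) \
    .write.format("delta").mode("append").save("/tmp/demo_delta")

# Time travel: read the table as it looked before the append
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0).load("/tmp/demo_delta"))
v0.show()
```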
Build a little lakehouse in your soul
- Databricks enables a "lakehouse": warehouse and lake together
- Lots of whizzy toys are available on databricks
- There are so many now that it's just confusing: "Generative AI"?
- “you can search and discover data by asking a question in your own words”
- Equally you can just write SQL against it (a sketch follows)
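Roughly what that looks like in a Databricks notebook; the catalog, table, and column names below are hypothetical:

```python
# In a Databricks notebook `spark` is provided for you; the table
# referenced here (and its columns) are made up for illustration
result = spark.sql("""
    SELECT age_group,
           AVG(attendances) AS mean_attendances
    FROM main.urgent_care.ae_attendances
    WHERE financial_year = '2023/24'
    GROUP BY age_group
""")

display(result)  # Databricks' built-in rich table display
```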
Why might we as an SU want Databricks?
- We started using it because it was fast
- We can use Databricks on UDAL
- It provides a way that we can jointly organise and share data and data architecture in a RAP-compliant way