
Insights


Data Modelling: Nested vs Flattened (LinkedIn)

" (...) Unpopular opinion: Nested data belongs in the data warehouse.
We’ve spent years flattening everything to fit neat rows and columns, but the world—and the data we collect—isn’t flat (...) "
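The trade-off in the excerpt can be made concrete: a nested record keeps line items with their parent, while the flattened form repeats the parent keys on every row. A minimal pure-Python sketch (the `order` shape is illustrative; in a warehouse this would be a STRUCT/ARRAY column):

```python
# A nested record: one order with its line items kept together.
order = {
    "order_id": 42,
    "customer": "acme",
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

def flatten(order):
    """Explode the nested items into flat rows, repeating the parent keys."""
    return [
        {"order_id": order["order_id"], "customer": order["customer"], **item}
        for item in order["items"]
    ]

rows = flatten(order)
# Each row now duplicates order_id/customer -- the cost of flat modeling.
```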

5 Signs That Your Data is Modeled Poorly

"(...) To be able to model your teams data properly, you need to be able to conceptualize relevant business entities and organize them in a way that is conducive to common questions asked within your organization. (...)"

The Lost Art and Science of Data Modeling

"(...) Then all that goodness mysteriously faded away without a whimper as the hype of NoSQL, Cloud, and Microservices occupied the whole stage. During this time the engineering team quietly co-opted the ownership of clean data design, and frankly, most of them didn’t know what they were doing (...)"

How to data model correctly: Kimball vs One Big Table (by Zach Wilson)

"(...) One big table data modeling sounds like a joke in some regards. The name reminds me of the “god controller” in full-stack development. Why would we have a table that has everything in it? Is that really the best abstraction that we can come up with? (...)"

Template project with PySpark and Databricks Asset Bundles

If your goal is to build a production-level, scalable, and automated Databricks ETL pipeline, check out this template project on my GitHub. It includes unit tests, integration tests, and GitHub Actions for CI/CD automation.
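For context, a Databricks Asset Bundle is driven by a `databricks.yml` at the project root. A minimal sketch of one (the bundle name and workspace host are placeholders, not taken from the template project):

```yaml
bundle:
  name: pyspark_etl_template   # placeholder name

targets:
  dev:
    mode: development   # dev mode prefixes deployed resources per developer
    default: true
    workspace:
      host: https://<your-workspace>.cloud.databricks.com
```

`databricks bundle deploy -t dev` then deploys the bundle's jobs and code to that target.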

Data Federation vs Data Ingestion

"(...) Many organizations find that combining data integration approaches creates an optimal balance between performance, cost, and operational efficiency (...)"

Medallion Architecture (YouTube)

Medallion Architecture Best Practices with Franco Patano (Databricks)

One Big Table vs. Dimensional Modeling on Databricks SQL

"(...) Dimensional Modeling: essential then, optional now — understanding the shift (...)"

Star Schema Data Modeling Best Practices on Databricks SQL

"(...) The Star Schema is a widely-used database design in data warehousing where transactional data is extracted, transformed, and loaded into schemas consisting of a central fact table surrounded by conformed dimension tables (...)"

How to data model correctly: Kimball vs One Big Table (by Zach Wilson)

"(...) I followed this philosophy when I was working at Airbnb on pricing and availability. We moved all the pricing data into a deduped listing-level table instead of an exploded-out listing-night level table and we saw intense gains in efficiency across the warehouse! (...)"

PySpark Style Guide

"(...) This opinionated guide to PySpark code style presents common situations we've encountered and the associated best practices based on the most frequent recurring topics across PySpark repos. (...)"

Notebooks vs IDEs (LinkedIn)

"(...) Despite what people may think, I use notebooks too. I can't deny that it is the easiest way to prototype. On hashtag#Databricks, sometimes the only way (some functionality related to feature engineering and delta tables just does not work in VS code). On the other hand, as an MLOps practitioner, I am against using notebooks outside of the prototype phase and see many challenges when transitioning from a notebook to production-ready code. (...)"

The Rise of The Notebook Engineer

"(...) 99% of Engineers and Data Folk who regularly use Notebooks as part of their development and production lifecycles … abuse, overuse, and do so at their own peril and the peril of their Data Platforms at large … and suffer the grave consequences as such. (...)"

Test, test, and then test again.

"(...) No tool, framework, or process can overcome an engineering culture that treats testing as an afterthought. Fixing this takes time, but small steps—asking for time to test, planning for testing in project roadmaps, and holding each other accountable—can shift the balance toward quality. (...)"

Testing and Development for Databricks Environment and Code

"Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case. (...)"

Best Practices for Unit Testing PySpark (YouTube)

Unit tests help you reduce production bugs and make your codebase easier to refactor. You will learn how to create PySpark unit tests that run locally and in CI via GitHub Actions.
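The core idea behind the video is keeping transformation logic in small functions you can call from a test. A pure-Python sketch of that pattern (with PySpark, the same assertion would run against `df.collect()` from a local `SparkSession.builder.master("local[1]")` session; the function and column names are invented):

```python
def add_total(rows):
    """Transformation under test: derive total = qty * price.
    The PySpark equivalent would be
    df.withColumn("total", F.col("qty") * F.col("price"));
    isolating the logic in one function is what makes it testable."""
    return [{**r, "total": r["qty"] * r["price"]} for r in rows]

def test_add_total():
    # Arrange: a tiny in-memory input, the same shape a local Spark test uses.
    rows = [{"qty": 2, "price": 3.0}]
    # Act + Assert: compare against an explicit expected output.
    assert add_total(rows) == [{"qty": 2, "price": 3.0, "total": 6.0}]

test_add_total()
```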

The Ultimate Guide to CI/CD for Data Engineering in Databricks

"(...) Today, while there is still no single “best” way to implement CI/CD for data engineering, the landscape has matured significantly. (...)"

Blue/Green pipelines in a medallion architecture

"(...) Ever wondered what Blue/Green pipelines look like in a medallion architecture? (...)"

Delta Lake vs Apache Iceberg. The Lake House Squabble

"(...) I’m sure Hudi might want to interject itself, but we all know that the two clear contenders are Delta Lake and Apache Iceberg (...)".

Config Driven Pipelines

"(...) It’s time to talk about evils and what to watch out for in config-driven data pipelines. These things are not made up, but what I’ve experienced and seen firsthand. They are the reality. (...)"

dbt on Databricks

"(...) Should you use dbt on Databricks? If you are a SQL based team and 50%+ of your pipelines are written in SQL, than you are doing yourself a disservice by NOT using SQL (...)"

Why Python Always Breaks. Long Live Python.

"(...) Python's actually a great language, dare I say the greatest? It's not the best overall (if there even is such a thing), and in many aspects, it will lose to its alternatives, but at the same time, it is also a terrific first choice for assorted problems.

If you want to make the most of it, though, you need to put in the time to understand it and grow in your skills. What ultimately makes or breaks most projects isn't the choice of language, but the developers responsible for its creation. (...)"

Should You Ditch Spark for DuckDB or Polars? (Benchmark)

"(...) I think the whole narrative that you should consider replacing your Spark workloads with DuckDB or Polars if your data is small is all hype (...)"

Avoiding Cloud and Platform Lock-in Is a Farce

“The cloud is like Hotel California. Your data can check in anytime you’d like, but it can never leave” ~ Brent Ozar

Getting Your Catalog in Order

“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton

Unity Catalog Architecture Patterns

"(...) In practice, effective scope design hinges on clear ownership. Scopes must be defined with accountable owners who are empowered to manage and govern the assets within their domain. Without ownership, scopes quickly become ineffective and unsustainable. (...)"

The Curse of Conway and the Data Space

"(...) It’s time for data and analytics engineers to identify as software engineers and regularly apply the practices of the wider software engineering discipline to their own sub-discipline. (...)"

Data Engineering is Not Software Engineering

"(...) Pretending like data and software are the same is counterproductive to the success of your data engineers (...)"

Is It Time to Say Goodbye to Data Engineers?

"(...) Today, we are going to dive into an issue I’ve noticed that seems to oscillate in the data world every few years. That is the removal of data engineers. After all, we tend to get in the way and slow things down, right? (...)"

Deep Dive into LLMs like ChatGPT (YouTube)

This is probably one of the best explanations of LLM architecture that I've ever seen.

Python AI Tutorial from a LangChain Engineer (YouTube)

Another excellent tutorial, this one covering RAG patterns for LLMs.

Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet (YouTube)

In this podcast, Aravind Srinivas, CEO of Perplexity, discusses the future of AI and search technologies. He shares insights into how Perplexity aims to revolutionize the way humans find answers on the Internet.

© 2010 by Prime
