The Modern Data Stack is Overcomplicated
Here's the architecture blueprint I wish I'd had on day one. Part 1 of a 10-part (probably) series exploring every layer of a modern data stack, and the trade-offs nobody talks about.
Let me paint you a picture.
You’ve just been hired as the first data engineer at a start-up or scale-up. The CEO wants dashboards, the Finance team wants reliable revenue metrics, and Marketing wants to track their attribution. The current “stack” is a collection of Google Sheets, multiple Zapier workflows that break every other Wednesday, and a PostgreSQL database someone spun up two years ago that nobody understands anymore.
Sound familiar?
Having seen the stack, you decide to see what else is out there. Within the first hour you’re hit with an avalanche of logos, company vs company comparisons, and LinkedIn posts that sell the confidence that a certain combination of tools is the “only way”.
Here’s the truth: most of this complexity is unnecessary on day one, and it stays that way for a lot longer than most companies would like you to believe.
What is this series about?
I’m the Lead Data Engineer for an omni-channel (Online, Retail, & Amazon) company in the UK. Over the past few years my team and I have built and rebuilt our data infrastructure from the ground up. Along the way we have made good decisions and bad ones. We picked tools that saved us months of work, and others that created more problems than they solved.
This series is the guide I wish someone had handed me at the start.
Over the next nine posts, I’m going to walk you through every layer of the Modern Data Stack. Not just which tool does what - you can read their docs for that. I want to talk about the decisions: why you’d choose one approach over another, what the real trade-offs are once you’re six months down the line, and where “best-practice” advice falls apart in the real world.
Here’s the series at a glance:
Architecture Overview: You are here
Data Ingestion: Connectors, event streams, custom pipelines
Data Warehousing: Where your data lives and why it matters more than you think
Transformation: dbt and beyond
Orchestration: Keeping everything running without losing your mind
Infrastructure as Code: The upfront cost that pays for itself (eventually)
Data Quality & Testing: What actually catches problems in production
Access Control & Governance: The boring stuff that will bite you if you ignore it
AI & ML Readiness: What “AI-ready” actually means from an engineering perspective
Lessons Learned: What I’d do differently if I started again tomorrow
The Data Stack Architecture
Before we dive deep into the different sections in later posts, let’s first zoom out. A modern data stack, in its simplest form, consists of five core concepts:
Get data in: You may have data siloed across SaaS platforms, production databases, event streams, or third-party APIs. The ingestion layer pulls all of that data into one centralised place. Some of it may need to arrive in near real-time; most of it doesn’t.
Store it: Cloud data warehouses act as central hubs. This is where all data lands, and where most of the compute happens. The choice you make here will almost certainly shape all downstream decisions.
Transform it: Raw data, as you likely know, is very messy. The transformation layer cleans, joins, and models that data into something useful. This is where your business logic sits and where a large portion of the engineering effort goes.
Keep it running: Orchestration ties everything together. Jobs need to run in the right order, at the right time, and someone needs to know when they don’t.
Make it trustworthy: Quality checks, access controls, documentation. This is the layer everyone skips at the start and regrets later. Prioritise this early.
That’s it, just five concepts! Everything else (the tooling debates, architectural patterns, and company comparisons) is a different way of addressing these five concepts.
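To make the five concepts concrete, here is a deliberately tiny sketch using only the Python standard library. It is an illustration under loud assumptions, not a recommended setup: an in-memory SQLite database stands in for a cloud warehouse, `graphlib.TopologicalSorter` stands in for an orchestrator, and every name in it (`RAW_ORDERS`, `revenue_by_channel`, the step functions) is hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical raw records standing in for a SaaS export or API response.
RAW_ORDERS = [
    {"order_id": 1, "channel": "online", "amount_gbp": "20.00"},
    {"order_id": 2, "channel": "retail", "amount_gbp": "35.50"},
    {"order_id": 3, "channel": "online", "amount_gbp": "4.50"},
]


def ingest(conn):
    """Get data in + store it: land raw records, untouched, in one place."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, channel TEXT, amount_gbp TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :channel, :amount_gbp)",
        RAW_ORDERS,
    )


def transform(conn):
    """Transform it: cast the text amounts and model revenue per channel."""
    conn.execute(
        """
        CREATE TABLE revenue_by_channel AS
        SELECT channel, SUM(CAST(amount_gbp AS REAL)) AS revenue_gbp
        FROM raw_orders
        GROUP BY channel
        """
    )


def check_quality(conn):
    """Make it trustworthy: fail loudly if the modelled data looks wrong."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM revenue_by_channel "
        "WHERE revenue_gbp IS NULL OR revenue_gbp < 0"
    ).fetchone()[0]
    if bad:
        raise ValueError(f"{bad} channel(s) have missing or negative revenue")


# Keep it running: each step maps to the set of steps it depends on.
STEPS = {"ingest": ingest, "transform": transform, "check_quality": check_quality}
DEPENDENCIES = {
    "ingest": set(),
    "transform": {"ingest"},
    "check_quality": {"transform"},
}


def run_pipeline(conn):
    """Run every step in dependency order and return the order used."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        STEPS[name](conn)
    return order


conn = sqlite3.connect(":memory:")  # assumption: stand-in for a cloud warehouse
order = run_pipeline(conn)
revenue = dict(conn.execute("SELECT channel, revenue_gbp FROM revenue_by_channel"))
print(order)
print(revenue)
```

Real tools replace each piece (a connector platform for `ingest`, a warehouse for SQLite, a scheduler for the dependency map), but the shape stays the same: land the data, model it, check it, and run the steps in the right order.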
A word on “best practices”
If you spend even 10 minutes searching in this space, you will hear very strong opinions: “dbt is the only sensible transformation tool”, “you must use Terraform”, “Airflow is dead”, and “Kafka for everything”.
Most of these opinions come from people working at a very different scale to you. A data platform built for a 10,000-person enterprise with a dedicated platform team looks vastly different to one built for a 200-person company with 3 data engineers.
Throughout this series, I’m going to be honest about where I agree with the consensus and where I don’t. I’ll also flag where “best practice” might be overkill for your situation, and where cutting corners will genuinely cost you more later.
The aim isn’t to tell you what to pick. It is to give you enough context to make your own informed decisions, and to understand the trade-offs you are actually making.
A Series for Data Engineers, Analysts, Data Scientists and Data Leaders
If you’re a data engineer, whether a first hire or part of a small team, and you’re building or rebuilding a data platform, this is for you. I will be making some assumptions throughout this series: you’ve written SQL, you’ve used a cloud warehouse, and you’ve likely wrestled with a cron job or two.
This series is also for analysts and data scientists who want to understand what’s happening under the hood of the tools they work with day in, day out. You’re welcome here too; I’ll keep things practical and jargon-free where I can.
Finally, this series is for the data leader staring at a blank Miro board wondering where to start. I won’t turn you into an engineer, but by the end of the series I hope to equip you with enough knowledge to know when a tool is solving a real problem versus adding yet another name to your list.
What’s next?
In Part 2, we’ll be digging into data ingestion: the first real architectural decision you’ll face. I’ll compare managed connector platforms such as Fivetran & Airbyte, event streaming with Kafka, and the “just write it in an AWS Lambda” approach that’s more common than people admit. We’ll cover cost, reliability, maintenance burden, and when each approach makes sense.
If you would like to follow along, subscribe so you don’t miss the next one.
See you there!
P.S. If you have any questions, or any war stories of your own, drop them in the comments.
Great post! The point about 'best practices' coming from people working at a completely different scale is something more people need to hear. Subscribed for the rest of the series and hearing more about your war stories!
Looks interesting! Looking forward to seeing the rest!