Monday, January 24, 2022

Building a Data Mesh Architecture in Azure – Part 1


Paul Andrew
Group Manager & Analytics Architect specialising in big data solutions on the Microsoft Azure cloud platform. Data engineering competencies include Azure Synapse Analytics, Data Factory, Data Lake, Databricks, Stream Analytics, Event Hub, IoT Hub, Functions, Automation, Logic Apps and of course the complete SQL Server business intelligence stack. Many years’ experience working within healthcare, retail and gaming verticals delivering analytics using industry leading methods and technical design patterns. STEM ambassador and very active member of the data platform community delivering training and technical sessions at conferences both nationally and internationally. Father, husband, swimmer, cyclist, runner, blood donor, geek, Lego and Star Wars fan!

Data Mesh vs Azure – Theory vs Practice

Use the tag Data Mesh vs Azure to follow this blog series.

Context and Background

The concepts and principles of a data mesh architecture have been around for a while now, and I've yet to see anyone else apply/deliver such a solution in Azure. I'm wondering if the concepts are so abstract that it's hard to translate the principles into real-world requirements, and maybe even harder to think about what technology you might actually need to deploy in your Azure environment.

Given this context (and certainly no fear of going first with an idea and being wrong 🙂 ) here’s what I think we could do to build a data mesh architecture in the Microsoft cloud platform – Azure.

As a reminder, the four data mesh principles:

  1. domain-oriented decentralised data ownership and architecture.
  2. data as a product.
  3. self-serve data infrastructure as a platform.
  4. federated computational governance.

Before digging in, it’s also worth saying that I have my own sceptical views on the ‘data mesh’. My current criticisms are:

  • The data mesh is very input driven and lacking the critical data insight and output components that ultimately deliver the business value. These are alluded to, but not fully represented as a core component.
  • Thinking about the first principle above, what stops disconnected domains from ingesting the same data, effectively duplicating effort and potentially stressing the underlying OLTP system? Or, to ask the question another way: what governs data ownership when multiple data products require the same dataset?
  • If we deliver self-serve data infrastructure, what data model should be targeted and where should it be built (without it becoming a monolithic data warehouse amongst the mesh)? Should this be part of the suggested virtualisation layer? Or is a single data model not the goal at the data product level?

That said, can we still deliver a data mesh and live in hope that we can use technology/innovation to overcome my criticisms? Right?!

Let’s explore this together in a series of blog posts where I’ll think about turning the theory into practice.

As always, I would very much welcome your feedback on my thinking and perspective here.

Let’s start breaking down this challenge, addressing the data mesh principles in turn.

1. Decentralised Data Ownership & Architecture

I’m going to explore this principle in a few different ways. Firstly:


When working in Azure, the natural way to decentralise and apply this first principle could be to use Resource Groups or separate Azure Subscriptions within the overall Azure Tenant (Azure AD domain). These logical structures of cloud organisation seem like a good way to separate the data ownership/architecture and allow a given data domain or business function to operate independently, within a Resource Group or Subscription.

That said, I’m very aware of the limitations Microsoft enforces at a Subscription level for things like role-based access control (RBAC). So, let’s caveat this first principle with a design question about solution scale. I like using Resource Groups as the logical nodes within the data mesh, but if your solution needs to scale further than the service limitations allow, you should go a level up and have Subscriptions acting as each node.

Visually, the effect is similar for the ultimate data owner; the re-charging of Azure consumption costs might just need a little more thought, weighing the use of Azure Tags against the natural charging bucket of a Subscription. All things to be aware of when we go from theory to practice.

[Diagrams: Azure Resource Groups as data mesh nodes vs Azure Subscriptions as data mesh nodes]
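To make the Resource-Group-as-node idea concrete, here's a minimal Azure CLI sketch that provisions one Resource Group per data domain, tagged so consumption costs can later be attributed back to each mesh node. The domain names, region and tag keys are illustrative assumptions on my part, not a recommendation or a standard.

```shell
# Sketch: one Resource Group per data mesh node (data domain).
# Domain names, region and tag keys below are illustrative assumptions.
for domain in sales marketing finance; do
  az group create \
    --name "rg-mesh-${domain}" \
    --location "uksouth" \
    --tags "meshNode=${domain}" "costCentre=${domain}-analytics"
done
```

Each node then gets its own RBAC assignments scoped at the Resource Group, keeping the domains operationally independent within the shared Subscription.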

To offer another perspective, what I suggest we avoid is going to a lower level than Resource Groups for data ownership, whereby separate Azure Storage Accounts or Azure Data Lake Storage accounts become the means of separating data ownership.

As a hierarchy: Tenant > Subscription > Resource Group > Storage (Resource) > Container.

Resource Group feels like the correct starting point for most solutions.
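On the cost re-charging point above: if every node's Resource Group carries a consistent tag (a hypothetical meshNode tag in this sketch), the Azure CLI can pull the nodes back out as a starting point for per-domain cost attribution.

```shell
# Sketch: list all mesh node Resource Groups by a hypothetical 'meshNode' tag,
# as a starting point for re-charging consumption costs per data domain.
az group list --tag meshNode --query "[].{node:name, tags:tags}" --output table
```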

That concludes what I wanted to cover in this initial blog post. I didn’t want to make it too long and it seemed to take me a lot of words to establish the context and background.

In the next post we’ll look at mesh node interactions and interfaces, with the same practical lens.

Many thanks for reading.
