
Module #2. Criteria for Qualifying as Unified

Learn about the advantages of qualifying as a unified system, including the criteria for qualification, the importance of a common set of building blocks, the hierarchical model, dynamic configuration management, and the unified metadata experience.

A common set of first-order solutions or building blocks

Why the hierarchical model works

The primary objective of the Infra Counterpart in Data Products is Unification: unification of Purpose, Execution, and Experience. Why unification? Because it eliminates the overheads of fragmented tooling: overly complex pipelines, messy overlaps in capabilities and features, and, as a result, a high cognitive load for data developers.

Hierarchical Model with Unique Building Blocks | Image by Author


Unification is achieved through two design principles working in sync:

  • Modularisation: Modularisation is established through the hierarchical model, or the building-block-like approach. A unified infrastructure identifies unique atomic capabilities that cannot be compromised in a data stack and serves them as ready-to-implement resources. These resources interoperate with each other and can be composed into higher-order solutions such as transformation and modeling. Examples of these fundamental resources include, but are not limited to, Storage, Compute, Cluster, Workflow, Service, Policy, Stack, and Secret.
  • Dynamic Configuration Management: DCM is the ability to drive change at scale by trimming the points of change management down to a single specification file. In prevalent data stacks, a single upstream change can lead to hundreds of tickets in downstream operations, where developers must manage multiple config files, dependencies, and environment specs, and generally work against the tide of configuration drift. With DCM, which is also powered by the building-block-like design, dependencies and configuration drift can often be handled through a single specification file that abstracts the underlying complexity (a minimal sketch follows this list).
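
To make DCM a little more concrete, here is a minimal sketch of the idea in Python: a single declarative specification lists the primitives once, and per-environment manifests are derived from it, so a change made in one place is mirrored everywhere. The names, fields, and `render` helper are illustrative assumptions, not part of any real DDP implementation.

```python
# Hypothetical sketch: a single declarative spec drives all environments.
# None of these names come from a real DDP implementation.
from dataclasses import dataclass


@dataclass
class Primitive:
    kind: str        # e.g. "storage", "compute", "workflow"
    name: str
    config: dict


# One specification (here, one Python object) declares the building blocks once.
spec = {
    "version": "v1",
    "primitives": [
        Primitive("storage", "raw-lake", {"tier": "standard", "retention_days": 30}),
        Primitive("compute", "batch-cluster", {"workers": 4}),
        Primitive("workflow", "daily-ingest", {"schedule": "0 2 * * *"}),
    ],
    "environments": ["dev", "staging", "prod"],
}


def render(spec: dict) -> dict:
    """Expand the single spec into per-environment manifests."""
    return {
        env: [{"kind": p.kind, "name": p.name, **p.config} for p in spec["primitives"]]
        for env in spec["environments"]
    }


# A single change (e.g. workers: 4 -> 8) is made once in `spec`
# and is mirrored into every environment on the next render.
manifests = render(spec)
print(manifests["prod"])
```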

As more tools pop up to solve ad-hoc challenges, each tool increasingly needs to become independently operable, driven by user feedback. For instance, say two different point tools are plugged into your data stack, one for cataloguing and another for governance. This creates the need not just to learn each tool's patterns and to integrate and maintain each from scratch, but eventually to run parallel tracks: the governance tool starts requiring a native catalog, and the catalog tool requires policies manageable within its own system.

"You end up with two catalogs and two policy engines with different objectives. Now consider the same problem at scale, beyond just two tools."


While unified data stacks come with building blocks that can be put together to enable unique capabilities, they don't restrict third-party integrations. In fact, they facilitate them with an interface to propagate their capabilities. The objective is to eliminate capability overlaps.

Deep Dive
The Infrastructure Subset of Data Products

What are primitives?

Every organisation comes with its own approach to data, and it might so happen that it is not happy with the base pattern. The good news is that it doesn't have to start from scratch. Architects can easily build higher-order patterns on top of primitives, the fundamental, atomic, non-negotiable units of any data stack, identified and embedded as part of a DDP with a unified architecture.

Once an architect or data engineer has access to low-level primitives, they can compose them to manifest higher-order, complex design patterns through a low-lying infrastructure platform that composes all moving parts through a unified architecture. Such an architecture identifies the fundamental atomic building blocks that are non-negotiable in a data stack. Let's call them primitives.

Examples of primitives | Image by Authors


Storage, compute, cluster, policy, service, and workflow are all examples of primitives. With these components packaged through infrastructure as code, developers can easily implement workload-centric development: they declaratively specify config requirements, and the infrastructure is provisioned and deployed with the respective resources in the respective environments. Whenever a point of change arises, the developer makes the required change in the declarative specification to mirror it across all dependent environments.
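
As a rough illustration of how such primitives might compose into a higher-order pattern, here is a small hypothetical sketch; the classes (`Storage`, `Compute`, `Policy`, `TransformationJob`) and their fields are assumptions made for clarity, not an actual DDP resource model.

```python
# Hypothetical sketch of composing primitives into a higher-order component.
from dataclasses import dataclass, field


@dataclass
class Storage:
    name: str
    format: str = "parquet"


@dataclass
class Compute:
    name: str
    cores: int = 4


@dataclass
class Policy:
    name: str
    rule: str = "mask:pii"


@dataclass
class TransformationJob:
    """A higher-order pattern assembled from atomic primitives."""
    name: str
    source: Storage
    sink: Storage
    compute: Compute
    policies: list = field(default_factory=list)


# Compose atomic primitives into a transformation workload.
job = TransformationJob(
    name="orders-cleansing",
    source=Storage("raw-orders"),
    sink=Storage("clean-orders"),
    compute=Compute("spark-pool", cores=8),
    policies=[Policy("pii-masking")],
)
print(job)
```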

Unification through Modularisation: Modularisation is possible through Infrastructure as Code that wraps around the finite set of primitives uniquely identified as essential to the data ecosystem.

Deep Dive
The Essence of Having Your Own Data Developer Platform

Standard uniform philosophy across capabilities

The first step to materializing any infrastructure is to create an infrastructure specification. A specification is like a PRD (product requirements document) for the infrastructure: you need to define the purpose, principles, SLOs, architecture, personas, and so on before even building or adopting the infrastructure.

A data developer platform (DDP) infrastructure specification is designed in a similar flavor. It is not the infrastructure but rather a specification that any and every platform developer can adopt or enhance to develop the infrastructure from the bottom up.

The DDP specification provides data professionals with a set of building blocks that they can use to build data products and services more quickly and efficiently. By providing a unified and standardized platform for managing data, a data developer platform can help organizations make better use of their data assets and drive business value.
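
For a rough sense of what such a specification covers, the skeleton below mirrors the PRD-like sections mentioned above (purpose, principles, SLOs, architecture, personas); the field names and values are illustrative assumptions, not the actual DDP spec.

```python
# Illustrative skeleton of an infrastructure specification.
# Field names and values are assumptions chosen for clarity.
ddp_spec = {
    "purpose": "Enable workload-centric, declarative data product development",
    "principles": ["unification", "modularisation", "open standards"],
    "slos": {"availability": "99.9%", "deploy_lead_time_hours": 24},
    "architecture": {
        "planes": ["control", "development", "data-activation"],
        "primitives": ["storage", "compute", "cluster", "policy", "service", "workflow"],
    },
    "personas": ["platform engineer", "data engineer", "analyst", "data steward"],
}

print(ddp_spec["architecture"]["primitives"])
```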

A Data Developer Platform Enabling the Data Product Construct | Image by Authors

DDP Capabilities | Image by Author


The entire DDP specification is aimed at being a community-driven spec and is open to modification, customization, and enhancement; datadeveloperplatform.org can shed more light and provide more context. Everything from purpose to architecture has been thoroughly outlined, with enough leeway for practitioners to improve the spec or adapt it to their own unique use cases.

Complementing instead of overlapping components

Evolutionary convergence of tools into common denominators: how did disparate tools evolve toward building common denominators?

Transition from patchwork solutions to primitive building blocks

Becoming data-first within weeks is possible through the high internal quality of the composable Data Operating System architecture: Unification through Modularisation. Modularisation is possible through a finite set of primitives that have been uniquely identified as essential to the data stack in its most fundamental form. These primitives can be specifically arranged to build higher-order components and applications.

They can be treated as artifacts that can be source-controlled and managed using a version control system. Every primitive can be considered an abstraction that allows you to enumerate specific goals and outcomes in a declarative manner instead of going through the arduous process of defining how to reach those outcomes.

Unifying pre-existing tooling for declarative management

Being artifact-first with open standards, a DDP is used as an architectural layer on top of any existing data infrastructure, enabling it to interact with heterogeneous components both native and external to the DDP. Thus, organizations can integrate their existing data infrastructure with new and innovative technologies without completely overhauling their existing systems.

It’s a complete self-service interface for developers to declaratively manage resources through APIs and CLI. Business users attain self-service through intuitive GUIs to directly integrate business logic into data models. The GUI interface also allows developers to visualize resource allocation and streamline resource management. This saves significant time and enhances productivity for developers, who can easily manage resources without extensive technical knowledge.

Central governance, orchestration, and metadata management

A DDP operates on a three-plane conceptual architecture where control is forked between one Control Plane for core global operations, one Development Plane where data developers create and manage standard configs, and one or more Data Activation Planes for domain-specific data operationalisation. The Control Plane helps data stewards govern the data ecosystem through unified management of vertical components.
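
A minimal sketch of how these three planes might divide responsibilities is shown below; the class and method names are hypothetical and only illustrate the separation of concerns, not a real DDP interface.

```python
# Minimal sketch of the three-plane split; names are hypothetical.
class ControlPlane:
    """Core global operations: governance, policy, metadata."""
    def enforce_policy(self, policy: str) -> None:
        print(f"[control] enforcing {policy} across the ecosystem")


class DevelopmentPlane:
    """Where data developers author and manage standard configs."""
    def register_config(self, name: str, spec: dict) -> None:
        print(f"[dev] registered config '{name}' with keys {sorted(spec)}")


class DataActivationPlane:
    """Domain-scoped operationalisation of data; one or more per organisation."""
    def __init__(self, domain: str) -> None:
        self.domain = domain

    def run_workload(self, workload: str) -> None:
        print(f"[{self.domain}] running workload '{workload}'")


control = ControlPlane()
dev = DevelopmentPlane()
sales = DataActivationPlane("sales")

dev.register_config("daily-ingest", {"schedule": "0 2 * * *", "compute": "batch"})
control.enforce_policy("pii-masking")
sales.run_workload("daily-ingest")
```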

Advantages of unique components over overlapping components

There are two broad patterns of evolution: Divergent and Convergent. These broad patterns apply to the data landscape as well.

The diversity of species on Earth is due to divergent evolution. Similarly, divergent evolution has resulted in the wide range of tools and services in the data industry known today as the MAD Landscape. Convergent evolution, on the other hand, creates variants of tools with shared features over time. For instance, rats and tigers are very different animals, but both have similar features such as whiskers, fur, limbs, and tails.

Image by Author


Convergent evolution results in common denominators across tooling solutions, meaning users pay for redundant capabilities and bear the cognitive overload that comes with them. Divergent evolution results in even higher integration costs and requires experts to understand and maintain each tool's unique philosophy.

Note that common denominators do not mean the point solutions are converging towards a unified solution. Instead, each point solution is developing capabilities that intersect with those of other point solutions, based on demand. These common capabilities have separate languages and philosophies and require niche experts.

For example, Immuta and Atlan are data governance and catalog solutions, respectively. However, Immuta is also developing a data catalog, and Atlan is adding governance capabilities. Customers tend to replace secondary capabilities with tools that specialize in them. This results in:

  • Time invested in understanding the language and philosophy of each product,
  • Redundant cost of onboarding two tools with similar offerings, and
  • High resource cost of niche experts; even more challenging since there's a dearth of good talent.

Deep Dive
Evolution of the Data Landscape: Fragmented Tools to Unified Interfaces

Infra isolation: Data Products, Domains, Projects, etc.

Corruption or disruptive changes in one track don't impact other tracks, while visibility into data from other tracks is not limited. Each metric enables certain decisions; thus, the parallel model enables both decision isolation and collaborative decision-making. Decisions are insulated as much as possible from sour decisions elsewhere while still benefiting from positive ones.

Yes, data products are also driven by the principle of infra isolation and, therefore, self-dependency. For the same reasons, this design also needs to be reflected in the metric layer that sits on top of the data products layer.

"It’s important to remind ourselves that bringing in the product ideology means reflection across all layers or verticals, even beyond data products. Adopting a new design approach means implementing it top-down instead of limited and incomplete implementation in selected areas."


One of the primary reasons behind transitioning into Data Product ecosystems is autonomy for business, data, and domain teams, which is currently not the case given the severe bottlenecks across centralised teams and spaghetti pipelines.

To enable such autonomy, data products need to become self-dependent - from root to tip. That is, from the bottom-most layer of infrastructure resources such as compute, storage, governance, and deployment infra to the top-most layer of consumer-facing data.

Representation of Data Products on a DDP Infrastructure Specification | Image by Author


"A tenancy pattern delineates access and authorization to specific groups of users to a specific data product (or set of data products). It does this by encapsulating isolations/namespaces on a common capability."


Due to the flexibility of isolation capabilities in Data Developer Platforms, data teams can also choose to isolate certain infra resources, such as workflows or policies, at the use-case or domain level. In DDPs, these isolations/namespaces are called Workspaces. Workspaces can be defined for data products, specific workflows and resources, dev and testing environments, domains, use cases, and much more; the structure is entirely up to the practices and comfort of the teams involved.
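
As a rough illustration of the tenancy pattern, the following hypothetical sketch scopes access to resources by workspace; the data model and access check are assumptions, not a documented DDP API.

```python
# Hypothetical sketch of workspace-based isolation on shared capabilities.
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str                      # e.g. a data product, domain, or environment
    resources: set = field(default_factory=set)
    members: set = field(default_factory=set)

    def can_access(self, user: str, resource: str) -> bool:
        # Access is scoped to the workspace: shared infra, isolated namespace.
        return user in self.members and resource in self.resources


checkout = Workspace(
    name="checkout-data-product",
    resources={"workflow:daily-ingest", "policy:pii-masking"},
    members={"alice"},
)

print(checkout.can_access("alice", "workflow:daily-ingest"))  # True
print(checkout.can_access("bob", "workflow:daily-ingest"))    # False
```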

Deep Dive
How to Build Data Products?
Data Modeling from the POV of a Data Product Developer

Dynamic Configuration Management

Let's all agree change is the only constant, especially so in the data space. Declarative, single-point change management therefore becomes paramount to developer experience and to the pace of development and deployment. This goes by the name of dynamic configuration management and is established through workload-centric development. A lot of heavy words, but what do they mean?

Let’s look at the flow of DDP to applications at a very high level:

A business use case pops up → The domain user furnishes the data model for the use case or updates an existing one → The data engineer writes code to enable the data model → The data engineer writes configuration files for every resource and environment to provision infrastructure components such as storage and compute for the code.

In prevalent scenarios, swamps of config files are written, resulting in heavy configuration drifts and corrupted change management since the developer has to work across multiple touchpoints. This leads to more time spent on plumbing instead of simply opening the tap or enabling the data.

Dynamic configuration management solves this by distilling all points of change into one point. How is this possible? Through a low-lying infrastructure platform that composes all moving parts through a unified architecture. Such an architecture identifies fundamental atomic building blocks that are non-negotiables in a data stack. Let’s call them primitives.

Storage, compute, cluster, policy, service, and workflow are all examples of primitives. With these components packaged through infrastructure as code, developers can easily implement workload-centric development: they declaratively specify config requirements, and the infrastructure is provisioned and deployed with the respective resources in the respective environments. Whenever a point of change arises, the developer makes the required change in the declarative specification to mirror it across all dependent environments.

Workload-Centric Development enabled through DDP | Image by Authors

Workload-centric development enables data developers to quickly deploy workloads by eliminating configuration drift and the vast number of config files through standard base configurations that do not require environment-specific variables. The system auto-generates manifest files for apps, enabling CRUD ops, execution, and meta storage on top.

Data developers can quickly spawn new applications and rapidly deploy them to multiple target environments or namespaces with configuration templates, abstracted credential management, and a single declarative workload specification. The impact is instantly realised in a visible increase in deployment frequency.
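
To make this tangible, here is a small assumed sketch of a single base workload spec from which target-specific manifests could be auto-generated, with credentials referenced only by name; none of the field names or the `deploy` helper come from a real DDP.

```python
# Sketch (assumed names): one base workload spec, no environment-specific variables.
base_workload = {
    "name": "orders-enrichment",
    "workflow": {"schedule": "0 3 * * *"},
    "compute": {"profile": "medium"},        # resolved per target by the platform
    "secrets": ["warehouse-credentials"],    # abstracted credential reference
}


def deploy(workload: dict, target: str) -> dict:
    """Auto-generate a target-specific manifest entry from the single base spec."""
    return {**workload, "target": target, "manifest": f"{workload['name']}-{target}.yaml"}


for target in ("dev", "staging", "prod"):
    print(deploy(base_workload, target))
```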

Deep Dive
The Essence of Having Your Own Data Developer Platform

Unified Metadata Experience: Universal Semantics, Metrics, etc.

There are multiple teams in an organisation, each dealing with its own share of data while frequently cross-referencing data from other teams to get more context. This leads to a big jumble of random words and jargon being thrown around, exactly how two tribes with different languages would interact with each other. They might mean the same thing, but both are frustrated because they don't realise they mean the same thing.

Semantics means the meaning and context behind data. It could be something as simple as a column name or as elaborate as a description or equation. Given there are so many teams and domains, each with its own standard way of defining the same data, it becomes extremely difficult to have a unified meaning for the data. This creates semantic untrustworthiness, which leads to multiple iterations between teams to understand and verify the meaning that has been described.

Semantic untrustworthiness also stems from the chaotic clutter of the MDS (modern data stack), overwhelmed with tools, integrations, and unstable pipelines. Another level of semantics is required to understand the low-level semantics, complicating the problem further.

Image by Author


A DDP solves this problem since the low-level layers are powered through a unified architecture. Irrespective of disparate domains and distributed teams in the layers above, all logic, definitions, ontologies, and taxonomies defined across multiple layers and verticals plug into a common business glossary that is embedded with the ability to accommodate synonymous jargon and diverging meanings.

The platform connects to all data source systems, resources, environments, and hierarchical business layers, allowing data producers to augment the glossary with business semantics to make data products addressable. This holds true even for metrics, which carry semantic meaning in a similar way. So now, when a marketing person refers to, say, the business development team's data, they don't have to struggle with new jargon, tally columns, or iterate with analysts to understand what the data means. Nor do they need to repeat the process every time the data changes. Simply put, the amount of time and effort saved here is tremendous.
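
As a closing illustration, the hypothetical sketch below shows how a shared glossary might reconcile synonymous jargon across teams; the structure, terms, and `resolve` helper are assumptions for illustration only.

```python
# Hypothetical sketch of a shared business glossary reconciling synonymous jargon.
from typing import Optional

glossary = {
    "customer": {
        "definition": "A person or org with at least one completed purchase",
        "synonyms": {"client", "account", "buyer"},
        "owners": ["business-development"],
    },
}


def resolve(term: str) -> Optional[str]:
    """Map any team's jargon onto the shared canonical term."""
    needle = term.lower()
    for canonical, entry in glossary.items():
        if needle == canonical or needle in entry["synonyms"]:
            return canonical
    return None


# A marketing analyst reading the business-development team's data model:
print(resolve("Client"))   # -> "customer"
print(resolve("buyer"))    # -> "customer"
```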