The following is a revised edition.
Today, data moves at lightning speed within diverse systems and processes; thus, ensuring integrity, quality, and compatibility is paramount in the data engineering space. That is where Data Contracts come into play.
So what is a data contract? A data contract can be understood as a formal agreement between data producers and data consumers. It assures that data meets the prescribed prerequisites of quality, governance, SLA, and semantics and is fit for consumption by downstream data pipelines.
Not surprisingly, data contracts in practice also play a key role in enabling organisations to transition into a unified data architecture and to leverage truly unified experiences across data governance, metadata, and semantics. Because they are created jointly by data producers and consumers, they also act as a good communication tool that promotes effective data collaboration across different personas.
To understand this better, Data Quality Camp brought together data experts who have notably voiced their opinions and shared innovations around data contracts during the past few years. It was an honor to be among this crowd and address some important open questions in the data engineering realm. The piece covers fundamentals as well as advanced specifics. A few interesting bytes →
Chad Sanderson simply explains Contracts as APIs for data. If implemented at every node in a lineage graph supporting some data product, the schema checks they facilitate are equivalent to end-to-end integration testing.
It seemed almost the entire panel loved the API analogy. Typically, an API is an interface that defines how two systems communicate. It has rules that direct and govern the exchange of information. Similarly, in the data engineering space, contracts do the same for data. A contract lays down a set of quality, semantics, and governance SLOs as codified checks and governs the exchange of data between any two exchange points.
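To make the "codified checks" idea concrete, here is a minimal sketch in Python. It is not taken from any panelist's implementation: the `DataContract` class, the `orders` dataset, and the two quality checks are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class DataContract:
    """Hypothetical contract: an expected schema plus named quality checks."""
    name: str
    schema: dict[str, type]  # column name -> expected Python type
    checks: dict[str, Callable[[Record], bool]] = field(default_factory=dict)

    def violations(self, record: Record) -> list[str]:
        """Return human-readable violations for a single record."""
        problems = []
        for column, expected in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected):
                problems.append(f"wrong type for {column}")
        # Run quality checks only once the schema holds, so each check
        # can assume its columns exist with the right types.
        if not problems:
            for check_name, check in self.checks.items():
                if not check(record):
                    problems.append(f"failed check: {check_name}")
        return problems


# Illustrative contract for an assumed "orders" dataset.
orders_contract = DataContract(
    name="orders",
    schema={"order_id": str, "amount": float, "currency": str},
    checks={
        "amount_is_positive": lambda r: r["amount"] > 0,
        "currency_is_three_letters": lambda r: len(r["currency"]) == 3,
    },
)

good = {"order_id": "A-1", "amount": 19.99, "currency": "USD"}
bad = {"order_id": "A-2", "amount": -5.0, "currency": "USD"}

print(orders_contract.violations(good))
print(orders_contract.violations(bad))
```

Much like an API rejecting a malformed request, the producer's output either satisfies the contract or is reported with a specific list of violations before it reaches a consumer.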
This thread had tons of contrasting opinions. Jean-Georges Perrin shared that they didn't put relationships in the data contract, but that it depends on which level of relationship you want to consider. They didn't define relationships in the physical structure of the data contracts at PayPal because they didn't need them at the time. But notably, he points out that if you want to have a data mesh of multiple data products, the contract should not describe the mesh and should stay within the boundary of the data product.
We have a different approach to this. In our implementation, Data Contracts have a one-to-one relationship with data entities to preserve the independence of these entities at the contract level. By design, relationships fall under the higher layer of the data model. If a contract for an entity included relations, the entity would depend on other entities to establish its SLOs. The contract and the entity it defines should be independently validatable. It's worth noting that this philosophy aligns with data engineering principles, where the structure and flow of data are crafted to ensure efficiency and reliability. Interestingly, there are different folks with different implementations on both sides of this debate.
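The separation described above can be sketched as follows. This is an illustrative toy, not our production implementation: the `CONTRACTS` and `RELATIONSHIPS` structures, the entity names, and both functions are assumptions made for the example.

```python
# Hypothetical per-entity contracts: each one lists only its own columns.
CONTRACTS = {
    "customer": {"customer_id": str, "email": str},
    "order": {"order_id": str, "customer_id": str, "amount": float},
}

# Relationships live one layer up, in the data model, not in any contract.
# Each tuple: (child entity, child column, parent entity, parent column).
RELATIONSHIPS = [
    ("order", "customer_id", "customer", "customer_id"),
]


def validate_entity(entity: str, rows: list[dict]) -> bool:
    """Validate rows of one entity against its own contract only."""
    schema = CONTRACTS[entity]
    return all(
        set(row) == set(schema)
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in rows
    )


def validate_relationships(data: dict[str, list[dict]]) -> list[str]:
    """Model-level check: every child key must resolve to a parent row."""
    errors = []
    for child, child_col, parent, parent_col in RELATIONSHIPS:
        parent_keys = {row[parent_col] for row in data[parent]}
        for row in data[child]:
            if row[child_col] not in parent_keys:
                errors.append(
                    f"{child}.{child_col}={row[child_col]} has no {parent}"
                )
    return errors


data = {
    "customer": [{"customer_id": "C1", "email": "a@example.com"}],
    "order": [{"order_id": "O1", "customer_id": "C2", "amount": 10.0}],
}

# The order entity passes its own contract without looking at customers...
print(validate_entity("order", data["order"]))
# ...while the dangling reference is caught at the data model layer.
print(validate_relationships(data))
```

The design choice is that `validate_entity` never needs another entity's data, so each contract can be checked in isolation, and cross-entity integrity becomes a separate concern of the model.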
Andrew Jones of GoCardless emphasizes that Data Products require an interface that defines the expectations around that data, the schema, the version, how it evolves, and so on – all of which are key parts of a Data Contract.
In fact, I'd say if those expectations are not defined, you don't have a data product. Or put another way, you don’t have a data product if you don't have a data contract around it.
A data product has four fundamental stages: Inputs, Transformations, SLOs, and Outputs, where SLOs are a bunch of requirements applicable to the other stages. Contracts come in handy at each stage. In the realm of data engineering, the careful orchestration of these stages ensures seamless data processing, enabling organisations to obtain optimal value from the data and boost performance.
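As a rough sketch of how SLOs can wrap the other stages, consider the toy data product below. The stage names follow the paragraph above, while the `SLOS` registry, the `enforce` helper, and the tax transformation are hypothetical and exist only to illustrate the flow.

```python
from typing import Callable

Row = dict
Check = Callable[[list[Row]], bool]

# Hypothetical SLOs, grouped by the stage they apply to.
SLOS: dict[str, dict[str, Check]] = {
    "inputs": {
        "not_empty": lambda rows: len(rows) > 0,
        "has_amount": lambda rows: all("amount" in r for r in rows),
    },
    "outputs": {
        "no_negative_totals": lambda rows: all(r["total"] >= 0 for r in rows),
    },
}


def enforce(stage: str, rows: list[Row]) -> None:
    """Raise ValueError if any SLO registered for this stage fails."""
    failed = [
        name for name, check in SLOS.get(stage, {}).items() if not check(rows)
    ]
    if failed:
        raise ValueError(f"{stage} violated SLOs: {failed}")


def transform(rows: list[Row]) -> list[Row]:
    """Toy transformation: add 20% tax to each amount."""
    return [{"total": round(r["amount"] * 1.2, 2)} for r in rows]


def run_data_product(rows: list[Row]) -> list[Row]:
    enforce("inputs", rows)      # contract on what comes in
    result = transform(rows)     # transformation stage
    enforce("outputs", result)   # contract on what goes out
    return result


print(run_data_product([{"amount": 10.0}]))
```

A failing input or output stops the run with the names of the violated SLOs, so a bad batch is caught at the stage boundary and never reaches consumers.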
Shane Murray of Monte Carlo shared that the value of Data Contracts largely depends on how data teams are organised and on the overall data ecosystem. That said, decentralization by design often widens the communication gap between data teams.
Different domains need to produce interoperable data products, or else Data Mesh can become Data Silo. He added that data contracts can instil trust in the underlying data product to encourage a “build once, use many times” model.
These were just the highlights from a handful of panellists. Feel free to dive into the entire piece here from Chad Sanderson: Practical Data Contracts. The piece covers opinions across a broad range of data experts including Ananth Packkildurai, Andrea Gioia, Andrew Jones, Chad Sanderson, Jean-Georges Perrin, Sarah Floris, Shane Murray, Shirshanka Das, and yours truly.
Data Contracts have seen amazing contributions from the community, especially from excellent data contract advocates such as Chad Sanderson, Andrew Jones, and Jean-Georges Perrin! In a different take this time, we’ll highlight some of the biggest stirs in the Data Conversations that helped us take grand strides.
The Rise of Data Contracts by Chad Sanderson was one of the most defining articles in the data contract evolution. He guides readers right from basics to understand the value of contracts.
A question I often get when talking about data contracts is “What happens to my existing pipelines? Do they go away?” In my opinion, NO.
How? Find out here.
Andrew Jones published his implementation story with Data Contracts, which was one of the biggest stirs to get the conversation started on Contracts.
It’s been 6 months since I introduced Data Contracts as our initiative to improve data quality at GoCardless. So, how are we getting on? What’s gone well, and what are the challenges we’ve faced?
Read more…
Jean-Georges Perrin, along with the PayPal team, released the first widely known data contract specification, which stirred several data practitioners and drove the community several steps towards data contract standardization.
💡 A data contract specification commonly defines a YAML format to describe the attributes of specific datasets. Learn more about the keys and values expected in a data contract’s YAML format from this link here.
In this data contract resource, you can find the data contract template with insightful explanations and examples. The piece also offers resources on open data contract standards that make its newly released version easier to understand.
In its current version (v2.1.1), PayPal’s data contract focuses on eight sections: demographics, dataset & schema, data quality (including schema validation aspects), pricing, stakeholders, roles, service-level agreement, and other properties. As you can read, it does not limit itself to a mere schema.
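To illustrate how a contract covering these eight sections might be checked for completeness, here is a small Python sketch. The section names come from the list above, but the snake_case keys, the `missing_sections` helper, and the draft contract are hypothetical and are not the specification's actual YAML keys.

```python
# Section names taken from the list above; the snake_case keys are
# hypothetical and are NOT the specification's actual YAML keys.
EXPECTED_SECTIONS = (
    "demographics",
    "dataset_and_schema",
    "data_quality",
    "pricing",
    "stakeholders",
    "roles",
    "service_level_agreement",
    "other_properties",
)


def missing_sections(contract: dict) -> list[str]:
    """Return the expected sections that are absent or empty."""
    return [s for s in EXPECTED_SECTIONS if not contract.get(s)]


# An assumed, partially filled draft, as it might look after parsing YAML.
draft_contract = {
    "demographics": {"name": "orders", "version": "0.1.0"},
    "dataset_and_schema": {"table": "orders", "columns": ["order_id", "amount"]},
    "data_quality": [{"rule": "amount > 0"}],
    "stakeholders": [{"name": "Data Platform Team"}],
}

print(missing_sections(draft_contract))
```

A check like this could run in CI so that a contract missing, say, its service-level agreement is flagged before it is published. For the real key names, refer to the open specification itself.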
View the open specification.
Economists are predicting that the developed world is heading for a recession this year, which means that budgets will be tight and technology and data teams will need to do more with less. Enterprises that make the best use of their available data to provide real solutions to customer problems, and that are able to effectively secure that data, will have a real competitive advantage.
Speakers Include - Wouter Van Groenestijn (Head of Data and Analytics - EYP), Deep Thomas (Group CDO - Nomura), Samuel Koh (Director - Alteryx), Robin Fong (VP & GM - Denodo), and many more experienced folks from the industry.
Event Date - 29 Aug 2023 | Mode - Offline | Register
The 2023 ANA Data, Analytics & Measurement Conference, presented by Google, will showcase the power of measurement and the value of a data-driven marketing strategy. The conference will go beyond the numbers – with a program set to inform and inspire you to harness all the data at your fingertips and bring the numbers to life.
Speakers Include - Andy Hasselwander (CAO - MarketBridge), Christine Turner (MD - Google), Marc Guldimann (Founder & CEO - Adelaide), Kyle Shank (Director - The Hershey Company), and more experienced people from the business world.
Event Date - 21 - 23 August, 2023 | Mode - Chicago and Virtual | Register
Here’s a breather for you for sticking around till the end!
Follow for more on LinkedIn and Twitter to get the latest updates on what's buzzing in the modern data space.