Are you feeding your data debt?

The four horsemen of data debt, generational changes, and potential solutions
Nov 22, 2022

Originally published on the Modern Data 101 Newsletter; the following is a revised edition.

Immature development and delivery processes force business users to build their own solutions that result in an ever-expanding universe of data silos, intensely fractured data environments, and general data heterogeneity. These examples can be considered as Data Debt, which stems naturally from the way companies do business, especially when they are running their business as a loosely connected portfolio or making “free rider” decisions about Data Management and Governance.

~ Petr Travkin, Data Architect

The rise in technical debt over the last few years is not a surprise


According to McKinsey’s extensive report, the majority of CIOs feel that tech debt has risen over roughly the past half decade, which is about when organizations en masse started leveraging data at scale.

Data was promoted as the new fuel for industries, and several vendors came up with ways to manage this fuel without considering that mismanaged data could make existing infrastructures flammable.

Organizations poured heavy investments into data technologies that were immature and solved only fractions of the larger problem. It was like installing a measly pipe to control the direction of a river and repairing the pipe every time it ruptured.

Image credit: Reddit


Organizations took the natural next step of installing several more of these pipes, or point solutions, to control and harness the power of data, when they probably needed just one holistic dam instead. This chaotic web of tools and their overheads resulted in massive data debt.

Organizations are consistently chasing the unified whole that could govern the entirety of the data ecosystem to make it more reliable. Lower data debt would make it easier for organizations to mine actionable business insights from raw data instead of going around in loops to fix and revive fragile processes and tools.

This article is a snapshot of data debt, the problems it creates for data teams, and an overview of potential solutions that are aligned with DataOps.

What is Data Debt?


To understand data debt, we must first get the hang of technical debt.

Technical debt is the cost of continuous rework incurred when a constructive, challenging, or time-consuming approach is replaced with an easier or instant solution to hold the fort. Technical debt rises over time as the reworked jobs and their consequent costs build up.

The concept of data debt is a simple extension of technical debt: instead of general technologies, the rework and its cost are triggered by data-specific technologies. Data debt is often used as a directional measure to decide which problem to prioritize and fix first. It also steers investment channels, often deflecting investments instead of harnessing them.

Given the rise of point solutions or multiple pipes that try to control data, the overheads behind each of those solutions have also increased. While the work to install and integrate each of these solutions is already significant, the rework cost triggered by these tools and processes is gradually killing the strategic, innovative, and scalable potential of data teams. Instead of moving forward, teams are stuck in a horizontal plane shuttling between points.

The Four Horsemen of Data Debt


Professionals started dabbling with data only recently, whereas software has been under experimentation for decades. As a result, data development is quite scattered and does not necessarily follow the optimized processes that are standard for software development. Sub-optimal processes and approaches are the prime reasons behind the skyrocketing cost of data debt.

Let’s zoom out to an analogy: the tale of the four horsemen. According to the Book of Revelation, their appearance brings forth the cataclysm of the apocalypse. The “four horsemen” make an interesting analogy for a pre-apocalyptic state, and no doubt the data world is internally at an apocalyptic stage, given the massive data debt shouldered every day by weak data pipelines and data teams.

Image credit: Wikimedia Commons

The four horsemen of data debt, an apocalyptic combination of processes, are a common issue in most organizations that are either not data-first or are on the steep and rocky climb to data-first status.

  1. Complex architecture and no accountability
    Going back to the river analogy, the expanse and vastness of data were barely realized when organizations first started leveraging data. As a result, the data stack they built was scattered across multiple tools that could barely manage pockets of the river.

    Even efforts to build a more modern data stack (MDS) resulted in a finer and more complex plumbing system instead of replacing the plumbing with a dam or a unified architecture that was free from a million parts demanding consistent maintenance.

    There is an unrealistic expectation that the MDS will fix the data debt, even though no one is accountable for each of the many components and integrations hosted by the MDS architecture.

  2. A wide gap between data and business teams
    Raw data has no value in the business context. It has to be processed and converted into business-friendly data models so it becomes viable enough to generate good insights. The data-to-insights journey, however, is not a simple one. It is loaded with countless pipelines, integrations, isolated environments, and many more components that business teams have no control over.

    Organizations today sit on considerable raw data, but most of them are not able to convert it into actionable or usable insights. Even if business teams generate some form of analytics insights, the delay from data to insights waters down the benefits and almost creates a negative customer experience.

    This friction is a result of disrupted communication between business and data teams, a leading reason why data stays dormant and is not easily converted into insights. Data often changes so fast that today’s data must be leveraged today instead of months later.

  3. Fragile data pipelines
    Change is one of the defining features of data, perhaps its most defining one. Data arrives from multiple sources, and something as simple as a consultant changing the column name of an attribute could break the workflow.

    Because no single source of truth is enforced across every touchpoint, such a change breaks all downstream processes. The result is a frustrated data engineer who is sandwiched between cross-functional teams, shouldering the blame for fragile data.

    Once a data pipeline breaks, data engineers have to kickstart the long, winding process of detecting the root cause, fixing bugs across multiple integrations, propagating the changes across various teams, and maintaining the updated system 365 days a year. Given the choice, would data engineers or analysts rather fix old pipelines or build strategic ones to mine actionable insights?

  4. Delayed delivery in isolated spurts
    While data flows through the entire organization, it gets duplicated and modified in pockets that do not communicate with each other. A data analyst might use tools like Tableau or Excel for analysis, but the insights from that analysis do not merge into the main data stream.

    Tools are isolated environments that create an instance of the data, and even if the data is written back to a central store, there is usually no standard process followed by organizations to avoid data duplication, discrepancy, or corruption.

    This also leads to delays in insight delivery, since users have to communicate with each other across multiple iterations to ensure that the data is coherent, updated, and uncorrupted. The isolated insights from the data analyst would reach, say, the sales or marketing associate in an Excel file, but when another associate asks for a similar report next week, the data has to be updated and the insights merged, re-processed, and delivered all over again.
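
The column-rename failure described under fragile data pipelines can be caught at the boundary, before it reaches downstream jobs. The following is a minimal sketch in Python, assuming a batch arrives as a list of dictionaries. The `check_contract` helper, the `ContractViolation` type, and the column names are hypothetical and chosen only for illustration; they do not refer to any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractViolation:
    column: str
    problem: str


def check_contract(rows, expected_columns):
    """Compare incoming rows against an expected {column: type} contract.

    Returns a list of ContractViolation; an empty list means the batch conforms.
    """
    violations = []
    for index, row in enumerate(rows):
        # Every expected column must be present and have the expected type.
        for column, expected_type in expected_columns.items():
            if column not in row:
                violations.append(ContractViolation(column, f"missing in row {index}"))
            elif not isinstance(row[column], expected_type):
                violations.append(ContractViolation(column, f"wrong type in row {index}"))
        # Columns that the contract does not know about are flagged as well.
        for column in sorted(row.keys() - expected_columns.keys()):
            violations.append(ContractViolation(column, f"unexpected in row {index}"))
    return violations


# Hypothetical contract and batch: upstream renamed "customer_id" to "cust_id".
EXPECTED = {"customer_id": int, "amount": float}
batch = [{"cust_id": 7, "amount": 19.99}]

violations = check_contract(batch, EXPECTED)
for violation in violations:
    print(f"{violation.column}: {violation.problem}")
```

Run against the renamed column, the check reports `customer_id` as missing and `cust_id` as unexpected, so the break surfaces where the data enters rather than in a report weeks later.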

Challenges as a result of Data Debt


If the above practices or processes are standard in your team or organization, it is evident that the organization is feeding data debt instead of solving it. Data debt results in scores of issues in the data ecosystem, including some of those mentioned in the previous section, but the issues can be grouped under three key pillars:

  • Data Untrustworthiness
    Data untrustworthiness is a result of discrepancies, corruption, and volatility in data, all of which are after-effects of data debt. At this point, it might help to revisit the meaning of data debt, which boils down to a simple choice: an effective but challenging approach, or an easier and faster one.

    Trustworthy data is only guaranteed when the entire journey of data is in good hands. In other words, it requires a concrete approach and architecture that track and maintain both data lineage and data quality. This feat is nearly impossible with a scattered collection of tools and integrations and with next to no accountability.

  • Data Swamp
    A data swamp, in layman's terms, is a chaotic dump of data that costs more to process than the revenue its insights would generate. The swamp is a result of vague data modeling and sub-optimal storage mechanisms.

    Organizations usually dump data haphazardly since data engineers do not have much clarity on business models or requirements. The reason behind this is a wide gap between business teams and data teams which, technically, is unavoidable unless the approach toward data modeling is revisited. Also, managing data exchanges from multiple points in a scattered architecture eventually creates a chaotic swamp.

  • High Cost
    The cost of processing data in a system clogged with debt is a burden not just in terms of finance, but also in terms of resources, effort, and time. Reworking data pipelines due to frequent breakages or fragility diverts most of the team’s brainpower to recovery instead of strategic tasks. Innovation, revenue, and scalability are all bogged down to repay the interest on data debt.
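
The point about tracking both data lineage and data quality, with someone accountable for each step, can be made concrete with a small sketch. This is only an illustration in Python under stated assumptions: `TrackedDataset`, `apply_step`, `null_rate`, the `orders` data, and the `data-eng` owner label are all hypothetical names, not part of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedDataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)  # ordered record of steps applied


def apply_step(dataset, step_name, transform, owner):
    """Run a transform and append a lineage entry naming the accountable owner."""
    new_rows = transform(dataset.rows)
    entry = {
        "step": step_name,
        "owner": owner,
        "rows_in": len(dataset.rows),
        "rows_out": len(new_rows),
    }
    return TrackedDataset(dataset.name, new_rows, dataset.lineage + [entry])


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None (a simple quality metric)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def drop_duplicates(rows):
    """Keep the first row seen for each order_id."""
    seen, unique = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    return unique


# Hypothetical raw data containing a duplicated order and missing amounts.
raw = TrackedDataset("orders", [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": None},
])

clean = apply_step(raw, "drop_duplicates", drop_duplicates, owner="data-eng")
print(clean.lineage)
print(null_rate(clean.rows, "amount"))
```

The design choice here is that every transformation returns a new dataset carrying its own history, so a consumer can see which step changed the row count and who owns it, instead of reconstructing that across isolated tools.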

Potential Solution: Replicating the Agile Revolution for Data


The most effective way to combat the four horsemen of data debt is to follow the battle cards of Agile development, which revolutionized the software industry and, most importantly, upgraded the SaaS experience. Agile development is in alignment with the ideologies of DataOps. While DataOps is a data development culture, Agile is better described as a methodology.

The Agile Manifesto has some key principles, such as being change-friendly, enabling early and continuous delivery, and prioritizing users over processes. The next article in the series will outline the top Agile principles and illustrate how exactly they could be replicated from the software world to the data ecosystem, gradually eradicating data debt one step at a time.

Since its inception, ModernData101 has garnered a select group of Data Leaders and Practitioners among its readership. We’d love to welcome more experts in the field to share their story here and connect with more folks building for better. If you have a story to tell, feel free to email your title and a brief synopsis to the Editor.