
Challenges of Multiple Data Products, Duplication Management, and Governance

5 min | Non-Disruptive Management of the Evolving Landscape Of Data Products: Part 1
Feb 8, 2024
Ayush Sharma

Originally published on Modern Data 101 Newsletter, the following is a revised edition.



Ayush Sharma is a Technical Lead and Sr. Consultant at ThoughtWorks, and we highly appreciate his contribution and readiness to share his knowledge among MD101 readers.

We actively collaborate with data experts and practitioners to bring the best resources to the community. If you have something to say on Modern Data practices and innovations, feel free to reach out to us! Note: All submissions are vetted for quality and relevance.


This is the first of a two-part series exploring the scope of data products in a dynamic landscape. This article delves into managing the evolving scope of individual data products in a Data Mesh architecture.

The building blocks of the Data Mesh architecture are data products: discrete items that make data usable to others. A good Data Mesh will contain many data products, but as their number grows, complexity and governance challenges follow. It is therefore essential to have a robust framework to manage dependencies, coordinate updates, and ensure data consistency across the ecosystem.

This involves defining clear ownership and accountability for each data product and assigning dedicated teams or individuals to handle development, maintenance, and evolution.

I. Challenges Of Multiple Data Products (DPs) & The Role Of Data Governance


1. Challenges caused by having multiple data products for the same data source
2. Impact of multiple DPs reading from the same operational system
3. Duplication due to different data products addressing the same business needs
4. The need for strong governance and ownership structures


Organizations attempting to shift into a federated data environment via Data Mesh require strong data governance to avoid the accidental explosion of data products that can lead to increased complexity, duplication of effort, and ownership issues.

When multiple data products are created for the same data source, they put a strain on the source systems, which can cause performance issues for transactional applications. While it’s common to have different data products reading various categories of data from the source systems, it is crucial to carefully design the relationships between data products and source data, considering the specific needs and requirements of each data product.

Image: Depicting reusability in data products | Source: Author


Leveraging source-oriented data products (SoDPs) and promoting data product reusability reduces the load on source systems. It also avoids unnecessary replication of data in the Data Mesh or data catalogue, so consumers can get the information they need from the appropriate data product without encountering redundant or conflicting data.

This helps:

1. Improve data quality
2. Simplify data discovery
3. Ensure consistency across the organization’s data landscape


By addressing complexities and governance issues, organizations can streamline their data product ecosystem, ensuring efficient collaboration, adherence to standards, and effective management of data assets. This helps teams to work cohesively, and fosters trust in the data products and the Data Mesh architecture.

To strike a balance between granularity and simplicity, it is crucial to create data products that are modular, agile, and aligned with the overall objectives of the organisation.

Duplication can also arise when different domain teams solve the same or similar data use cases. This is particularly visible in consumer-oriented data products (CODPs), which are built by combining multiple data products. It can happen when several teams need the same use case addressed, or when one team builds a data product for a specific use case that an existing data product already covers.

Managing Duplicates Using A Feedback-Based Ranking System


One way to manage duplications in the Data Mesh is by implementing a feedback-based ranking system. This system considers the maintenance of Service Level Objectives (SLOs) and Service Level Indicators (SLIs) of data products. A well-maintained data product that meets the required SLOs and SLIs will receive higher quality points, resulting in a higher ranking during the discovery phase. The data product will receive more visibility with a higher rank, leading to increased search and use. Hence, it promotes continuous improvement.

This approach is referred to as a “feedback-based ranking system” or a “quality-based ranking system.” It aligns with the principles of feedback-driven development, quality management, and data product lifecycle management. The implementation of such a system involves the following:

1. Data product discovery mechanisms
2. Feedback collection and analysis
3. Ranking algorithms
4. Retirement and archival processes

📝 Note from Editor: Learn more about SLOs and SLO evolution here: How to Build Data Products | Evolve: Part 4/4
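The ranking idea described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Slo` structure, the indicator names, the 0 to 100 point scale, and the rule that a missing SLI counts as a miss are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one service level indicator (hypothetical structure)."""
    indicator: str          # name of the SLI, e.g. "freshness_minutes"
    target: float           # threshold the measured SLI is compared against
    higher_is_better: bool  # True for availability, False for latency or freshness

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def quality_points(slos: list[Slo], slis: dict[str, float]) -> float:
    """Share of SLOs met, scaled to 0-100 (assumed scale). A missing SLI is a miss."""
    if not slos:
        return 0.0
    met = sum(
        1 for slo in slos
        if slo.indicator in slis and slo.is_met(slis[slo.indicator])
    )
    return 100.0 * met / len(slos)


def rank_for_discovery(products: dict[str, dict[str, float]], slos: list[Slo]) -> list[str]:
    """Order data product names from highest to lowest quality points."""
    return sorted(
        products,
        key=lambda name: quality_points(slos, products[name]),
        reverse=True,
    )


# Hypothetical SLOs and measured SLIs for two overlapping data products.
slos = [
    Slo("availability_pct", 99.0, higher_is_better=True),
    Slo("freshness_minutes", 60.0, higher_is_better=False),
]
products = {
    "orders_copy": {"availability_pct": 97.0, "freshness_minutes": 45.0},
    "orders_v1": {"availability_pct": 99.5, "freshness_minutes": 30.0},
}
ranking = rank_for_discovery(products, slos)
```

Here the better-maintained product surfaces first in discovery, while the duplicate that misses its availability target sinks in the list and becomes a candidate for retirement.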

As part of this ranking system, a fitness function can be used to evaluate the performance and quality of data products. A fitness function is a mathematical function that assesses the data product based on predefined criteria, such as data quality, reliability, timeliness, availability, user satisfaction, and adherence to service level agreements (SLAs). By incorporating relevant metrics and feedback from users, such as usage patterns, user ratings, response times, and data accuracy, the fitness function assigns scores or rankings to data products.

By leveraging a fitness function, the system can objectively evaluate and prioritize data products based on their performance against the defined criteria.
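One simple form such a fitness function could take is a weighted average of normalised metrics. The sketch below assumes every metric has already been scaled to the range 0 to 1; the criterion names and weights are hypothetical and would be set by the governance team in practice.

```python
def fitness(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metrics, each assumed to be normalised to 0..1.

    Criteria missing from `metrics` count as 0, so an unreported
    criterion lowers the score instead of being ignored.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, weight in weights.items():
        value = min(max(metrics.get(name, 0.0), 0.0), 1.0)  # clamp to 0..1
        score += weight * value
    return score / total_weight


# Hypothetical criteria and weights; data quality counts twice as much as the rest.
weights = {"data_quality": 4.0, "timeliness": 2.0, "availability": 2.0, "user_rating": 2.0}

candidate_a = {"data_quality": 0.5, "timeliness": 1.0, "availability": 1.0, "user_rating": 0.5}
candidate_b = {"data_quality": 1.0, "timeliness": 1.0, "availability": 1.0}  # no user ratings yet

score_a = fitness(candidate_a, weights)
score_b = fitness(candidate_b, weights)
```

A weighted average is only one choice. It keeps the score easy to explain to data product owners, at the cost of letting a strong metric mask a weak one.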

II. Versioning Data Products


Data product versioning is the systematic tracking and control of changes to a data product as it evolves. Like software versioning, it assigns a distinct identifier or label to each iteration or release of the data product. This identifier distinguishes one version from another and makes updates and changes easier to manage.

Image: Depicting version change on DP | Source: Author


Versioning a data product:

  • Provides a way to keep track of its evolution by documenting the changes made at each stage. This record is valuable for auditing purposes, troubleshooting, and understanding the lineage of the data (read more about data product evolution here).
  • Enables controlled and coordinated updates to the data product. When changes are made to the data schema, data processing logic, or underlying infrastructure, having different versions of the data product allows for a phased rollout or deployment strategy (more on change management in data products).
  • Facilitates compatibility management between data products and their dependencies. Clearly specifying the version dependencies between data products ensures that suitable versions are used together, avoiding compatibility issues and data inconsistencies (more on dependency management with Data Product Platform Bundles).
  • Enables effective collaboration and communication among teams working on the data product. It allows for discussions and decision-making based on specific versions and working with a shared understanding of the data product’s state (more on collaborative interfaces for data products).
  • Helps maintain data integrity and ensures a smooth evolution of the data product within the Data Mesh architecture.
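The compatibility point in the list above can be illustrated with a small dependency check. The sketch assumes a MAJOR.MINOR.PATCH identifier in which a major bump signals a breaking change, a convention borrowed from semantic versioning in software. The article does not prescribe a scheme, and the data product names are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        """Parse a 'MAJOR.MINOR.PATCH' string such as '2.1.0'."""
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def is_compatible(required: str, available: str) -> bool:
    """True when `available` shares the major version of `required`
    and is at least as new (assumed rule: major bumps are breaking)."""
    req, avail = Version.parse(required), Version.parse(available)
    return avail.major == req.major and avail >= req


def incompatible_dependencies(requirements: dict[str, str], published: dict[str, str]) -> list[str]:
    """Names of upstream data products that are missing or incompatible."""
    return [
        name
        for name, required in requirements.items()
        if name not in published or not is_compatible(required, published[name])
    ]


# Hypothetical consumer-oriented data product and its upstream requirements.
requirements = {"customer_sodp": "2.1.0", "orders_sodp": "1.4.0"}
published = {"customer_sodp": "2.3.1", "orders_sodp": "2.0.0"}

problems = incompatible_dependencies(requirements, published)
```

A check like this could run before a phased rollout to flag consumers that still depend on an older major version of an upstream data product.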

Conclusion


Navigating the complexities of evolving individual data products within a Data Mesh architecture demands a strategic approach. Robust governance, ownership structures, and source-oriented data products (SoDP) are essential to ensure streamlined operations and data consistency.

🔑 Key Takeaways

  • Leveraging a feedback-based ranking system enhances data product quality, aligning with principles of continuous improvement and lifecycle management.
  • Effective data product versioning empowers controlled updates, compatibility, collaboration, and data integrity.
  • SoDPs alleviate stress on source systems, promote reusability, and facilitate collaboration while ensuring data consistency and quality.

Keep watching this space for Part 2!

Author Connect


We highly appreciate this contribution from Ayush Sharma! Feel free to connect with him and learn more about Data Products and Data Mesh 💠

Find me on LinkedIn and GitHub📍