The following is a revised edition.
We added a penned (summarised) version below for those who prefer the written word, made easy for you to skim and record top insights! 📝
Additional note from community moderators: We’re presenting the insights as-is and do not promote any specific tool, platform, or brand. This is to simply share raw experiences and opinions from actual voices in the analytics space to further discussions.
Prefer watching over listening? Watch the Full Episode here ⚡️
We’re honoured to host Ryan Brown, a top Data Engineering voice who has donned multiple hats as an analyst, engineer, consultant, and now an Architect. He has over five years of experience working with ETL/data pipelines and visualization, focused on creating more optimised data infrastructures. Currently, he's serving as a Senior Data Engineering Architect at Trace3. We highly appreciate him joining the MD101 initiative and sharing his much-valued insights with us!
We’ve covered a RANGE of topics with Ryan. Dive in! 🤿
Before diving in, sign up to get notified when Episode 3 goes LIVE! ⏺️
Microsoft's Fabric platform is emerging as a powerful solution for data management, bringing together Power BI, Azure, and other data tools. In a recent implementation at an insurance company, we followed the medallion architecture (bronze, silver, gold) using Python and T-SQL to manage data layers across workspaces. A key challenge was the internal debate between using Databricks and Fabric, with Databricks eventually being selected. This led to a unique migration from Fabric to Databricks, a process that involved transforming code to run smoothly within the Databricks environment.
The migration leveraged command-line tools, like the fish shell and exa, to manage and modify file structures, converting Python-based configurations to SQL-compatible ones. This hands-on approach enabled fine-tuning of code, avoiding the limitations of drag-and-drop interfaces. By streamlining the conversion with shell scripting, I tackled complex tasks, ensuring optimal infrastructure performance. This project underscored the need for adaptable technical solutions as new tools like Fabric evolve rapidly in response to industry demands.
The second project involved exploring Microsoft's Fabric API capabilities for automating workspace and endpoint deployments. Fabric represents a progression from Power BI, incorporating new features like lakehouses and warehouses within its ecosystem. The project highlighted significant differences in API capabilities depending on authentication methods. While user authentication enabled more functionality, service account-based authentication (application registration) faced limitations, restricting access to certain endpoints.
This work focused on automating infrastructure deployment across development, testing, and production environments, addressing continuous integration and delivery (CI/CD) challenges. My main deliverable was a report detailing each endpoint’s functionality and limitations aimed at guiding future deployment strategies.
The current project involves building an Enterprise Data Warehouse (EDW) and performing ETL for a manufacturing company, using a SQL Server-centered architecture. The setup includes three SQL Server managed instances on Azure, with Azure Data Factory orchestrating the ETL processes. Stored procedures (sprocs) are used extensively, making this approach unique compared to more common cloud-native setups, like Databricks and Snowflake.
An interesting challenge arose when code that worked in development and QA environments failed in production. The issue centered on a single line in a merge statement for a soft delete mechanism, required to keep certain inactive records in the EDW. After troubleshooting with the DBA, the problem traced back to SQL Server optimizer settings—specifically, an option affecting table size calculations. This incident underscored some quirks of SQL Server, which differ from cloud platforms but still support robust ETL processes for large organizations.
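The soft-delete idea behind that merge statement can be illustrated outside SQL Server. The following is a minimal Python sketch, not the project's actual T-SQL: the `is_active` flag, the function name, and the sample rows are all hypothetical, and it does not reproduce the optimizer issue described above. It only shows the intended behaviour, where rows that disappear from the source are flagged inactive rather than removed.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def merge_with_soft_delete(target: Dict[int, Row], source: Dict[int, Row]) -> Dict[int, Row]:
    """Return a new target after an upsert plus soft delete.

    Rows present in the source are inserted or updated and marked active.
    Rows missing from the source are kept but flagged inactive, never removed.
    """
    merged: Dict[int, Row] = {}
    for key, row in target.items():
        if key in source:
            # Matched: update with source values and keep the row active.
            merged[key] = {**row, **source[key], "is_active": True}
        else:
            # Not matched by source: soft delete (hypothetical is_active flag).
            merged[key] = {**row, "is_active": False}
    for key, row in source.items():
        if key not in target:
            # Not matched by target: insert as a new active row.
            merged[key] = {**row, "is_active": True}
    return merged


# Hypothetical sample data standing in for the EDW table and a source extract.
edw = {
    1: {"name": "Widget", "is_active": True},
    2: {"name": "Gadget", "is_active": True},
}
extract = {1: {"name": "Widget v2"}, 3: {"name": "Gizmo"}}
result = merge_with_soft_delete(edw, extract)
```

Row 2 is absent from the extract, so it stays in `result` with `is_active` set to `False`, which is what keeps inactive records available in the warehouse.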
In reflecting on my proudest career moments, I recall a significant achievement while working for a major cloud vendor, where I simplified a complex Azure Data Factory and Databricks setup. Initially, the project involved hundreds of individual data pipelines tied to separate tables, making it difficult to manage. By collaborating with my lead, we restructured the system by logically grouping operations into fewer pipelines, using case activities to enhance orchestration and creating a configuration file to streamline source and destination management.
This approach allowed us to reduce the number of pipelines from over 100, each with seven layers of dependencies, to a more manageable framework. The result was a metadata-driven construct that significantly simplified the infrastructure, making it easier to understand and maintain. This experience not only improved efficiency but also highlighted the importance of abstraction in data orchestration.
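As a rough illustration of what "metadata-driven" means here, the sketch below uses a small configuration list and a dispatch table in place of one pipeline per table. It is an assumption-laden Python stand-in, not the Azure Data Factory implementation: the table names, the `load_type` field, and the handler functions are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical configuration: one entry per table instead of one pipeline per table.
TABLE_CONFIG: List[Dict[str, str]] = [
    {"source": "crm.customers", "destination": "bronze.customers", "load_type": "full"},
    {"source": "crm.orders", "destination": "bronze.orders", "load_type": "incremental"},
    {"source": "erp.invoices", "destination": "bronze.invoices", "load_type": "incremental"},
]


def full_load(entry: Dict[str, str]) -> str:
    """Describe a full reload for one configured table."""
    return f"truncate and reload {entry['destination']} from {entry['source']}"


def incremental_load(entry: Dict[str, str]) -> str:
    """Describe an incremental load for one configured table."""
    return f"append new rows to {entry['destination']} from {entry['source']}"


# The dispatch table plays the role of a case/switch step in the orchestrator.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "full": full_load,
    "incremental": incremental_load,
}


def plan(config: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group the configured tables by load type and build their steps."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in config:
        handler = HANDLERS[entry["load_type"]]
        grouped[entry["load_type"]].append(handler(entry))
    return dict(grouped)


steps = plan(TABLE_CONFIG)
```

Adding a table then means adding one configuration entry rather than building another pipeline, which is the abstraction the restructuring aimed for.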
I'm proud of my work in data engineering and data science, particularly on a project involving fuzzy matching for user-submitted contact information. This required developing a method using cosine similarity, which I learned to understand through linear algebra. Initially implemented in Pandas, the script struggled with loading CSVs into memory. I migrated it to Databricks and Spark, which led to unexpected results: while we improved data loading and output generation, parallelizing the cosine similarity computation didn’t enhance performance. Ultimately, our team transformed a process that took a day into a more efficient workflow integrated with a Power App.
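For readers unfamiliar with the technique, here is a minimal sketch of fuzzy matching with cosine similarity. It assumes character trigram counts as the vector representation, which is one common choice and not necessarily what the original script used; the function names and sample names are hypothetical, and it uses only the standard library rather than Pandas or Spark.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: str, candidates: List[str]) -> Tuple[str, float]:
    """Return the candidate most similar to the query, with its score."""
    query_vector = ngrams(query)
    scored = [(cosine_similarity(query_vector, ngrams(c)), c) for c in candidates]
    score, match = max(scored)
    return match, score


match, score = best_match("Jon Smith", ["John Smith", "Jane Doe", "Acme Corp"])
```

Each query is compared against every candidate, so the work grows with the product of the two list sizes. That is one plausible reason the similarity step itself is harder to speed up than loading and writing the data.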
My journey began as a supply chain analyst, where I ran SQL Server Express on my desktop and used Visual Basic scripts to automate tasks. I developed a data mart and reporting front end that addressed the supply chain’s needs, contrasting with the company's broader BI strategy. Throughout my career, I have consistently aimed to enhance technical processes and optimize performance in various projects, emphasizing a commitment to delivering better solutions.
While I haven't managed large-scale data systems specifically for financial data, my experience with order management, which has financial components, offers valuable insights into data flow management. I aim to deepen my understanding of financial data systems and enhance my technical skills to build optimized infrastructures that are financially viable for businesses. Ultimately, all data practitioners, including data engineers and analysts, must provide monetary value to their organizations, as our work supports business operations that drive revenue and control costs.
It's essential for us to recognize that while we may not be directly involved in sales, our contributions significantly impact how business users operate. Everything we do ties back to the financial health of the organization, emphasizing the need for a value-driven approach in our roles. This perspective shapes my career aspirations and highlights the importance of aligning technical expertise with business objectives.
My experience with the Azure Data Stack has profoundly influenced my approach to back-end infrastructure development, primarily by instilling an "Azure mindset." While I haven't deployed Airflow in production, I've mainly worked with Azure Data Factory, which plays a similar orchestration role to Airflow through a drag-and-drop interface and even includes a managed Airflow service.
When comparing Azure to Google Cloud, I’ve found that Google leans heavily towards a technical audience, often requiring users to engage with the command line interface (CLI). This appeals to my technical preferences, but it contrasts with Azure's approach, which aims to lower the skill barrier for users. Azure prioritizes accessibility, making it easy for a broader range of users to engage with data management tasks.
This ease of use in Azure Data Factory enables quick deployment of data pipelines, allowing users to start copying data with minimal friction. However, this can lead to challenges: the proliferation of numerous pipelines can create complexity that becomes hard to manage. Therefore, education and organizational strategies are crucial to ensure clarity and maintain control over the data infrastructure.
Overall, my insights reflect the balance between making data management accessible and the potential pitfalls of overwhelming complexity, emphasizing the importance of thoughtful organization and governance in data systems.
Data governance is increasingly crucial, but ensuring that governance frameworks are effective requires more than just setting them in place. At my current client, we're still navigating what effective governance looks like. Right now, it mainly involves managing SQL Server permissions—essentially, granting access rights.
However, I believe governance needs a balanced approach. While it's vital to protect sensitive data, overly restrictive policies can hinder productivity. Too often, organizations respond with "no" instead of finding solutions, which can drive employees toward shadow IT—where they bypass official channels and create their own solutions, often resulting in data scattered across personal systems like Excel.
Striking a balance is essential. We must safeguard data while enabling employees to perform their jobs efficiently. I envision an ideal data governance framework that incorporates automation. By integrating governance with security groups, access requests can be streamlined, minimizing delays. In my experience, many bottlenecks stem from the manual processes involved in granting access, which can leave consultants and team members waiting weeks before they can start working.
An effective governance strategy should automatically implement access changes based on organizational decisions. This way, when someone is granted access, it’s executed promptly without unnecessary friction. Ultimately, the goal is to protect data without obstructing workflow, fostering an environment where employees can operate effectively and securely.
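One way to picture governance tied to security groups is a mapping from groups to permissions, where approving a request only changes group membership and the grants follow automatically. The Python sketch below is purely illustrative: the group names, the permission strings, and the `approve_request` function are hypothetical and do not correspond to any specific identity or database product.

```python
from typing import Dict, Set

# Hypothetical mapping of security groups to database permissions.
GROUP_GRANTS: Dict[str, Set[str]] = {
    "edw-readers": {"SELECT on gold"},
    "edw-engineers": {"SELECT on gold", "SELECT on silver", "EXECUTE on etl"},
}


def effective_grants(user_groups: Set[str]) -> Set[str]:
    """Union of the permissions implied by a user's group memberships."""
    grants: Set[str] = set()
    for group in user_groups:
        grants |= GROUP_GRANTS.get(group, set())
    return grants


def approve_request(memberships: Dict[str, Set[str]], user: str, group: str) -> Set[str]:
    """Record an approved request and return the user's resulting permissions."""
    if group not in GROUP_GRANTS:
        raise KeyError(f"Unknown security group: {group}")
    memberships.setdefault(user, set()).add(group)
    return effective_grants(memberships[user])


memberships: Dict[str, Set[str]] = {}
reader_grants = approve_request(memberships, "new.consultant", "edw-readers")
engineer_grants = approve_request(memberships, "new.consultant", "edw-engineers")
```

The point of the design is that the organizational decision (approval) is the only manual step; nobody has to translate it into individual permission statements afterwards.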
Key Elements for Building Effective ETL Pipelines
Building effective ETL pipelines requires a thoughtful approach that balances modularity, configurability, rigorous testing, and clear communication of expectations to ensure that complex data sources are managed efficiently and effectively.
My journey as a data engineer has been shaped largely by my experiences with Azure, though I also have a fondness for big data platforms and traditional SQL databases. I enjoy exploring neglected areas in data technology, like implementing advanced architectures locally. I believe it’s essential for data professionals to understand business processes, especially in finance and accounting, to effectively demonstrate the value of our work.
Every data initiative should tie back to business value, focusing on how our efforts contribute to revenue generation or cost reduction. For instance, I once analyzed customer service data to reveal inefficiencies, emphasizing the importance of bridging the gap between our technical work and its impact on business outcomes. This approach ensures that our data efforts align with organizational goals, enhancing our understanding and communication of the value we provide.
When evaluating new tools as a data engineer, I consider several factors that determine their suitability for the project. For me, the most significant aspect is cost: not just the initial expense of the tool itself, but also the ongoing costs associated with maintaining it and the expertise needed. It's crucial to assess how much support exists in the market for the tool. However, I recognize that relying solely on well-supported options can lead to a groupthink mentality within the organization, which is why I admire the mavericks who are willing to explore deeper, less mainstream technologies.
I focus on understanding the total cost of ownership, including the learning curve for the team and how quickly they can become productive with the tool. It's also essential to evaluate how well the tool fits within the existing vendor landscape and whether its integration will introduce complexities.
A relevant example from my experience involved evaluating Microsoft Fabric against Databricks for a client. At that time, Fabric wasn't fully developed, while Databricks had a more mature ecosystem. Tool maturity and available features are critical to the decision-making process. If I were in charge of the evaluation, my priority would be on how the tool could help increase revenue or decrease costs rather than getting caught up in the vendor's sales pitch.
In every decision, the fundamental question is about the value we're delivering—whether that means increased revenue or reduced costs. As my career progresses, I know that I'll be making more of these decisions, but even at a junior level, it's valuable to keep this lens in mind.
Ultimately, our goal should be to crunch data and print profits. That's the mindset we need to adopt to ensure our data initiatives align with the organization's objectives.
When working with Azure Data Factory (ADF), I often face recurring challenges that can impact the efficiency of data engineering tasks. One of the biggest hurdles is that ADF operates as a Platform as a Service (PaaS) system. I've found that using the out-of-the-box connectors can lead to cryptic error messages that are difficult to decipher. This lack of clarity can be frustrating because it leaves you guessing about the underlying issues.
From my experience, I suspect that when we create ADF pipelines, we’re essentially generating a JSON specification. Every aspect of the pipeline, including dependencies, is represented in JSON. While I haven’t confirmed this with anyone from Microsoft, I have a hunch that there’s some internal API handling these JSON payloads, which then executes C# integration code behind the scenes. This abstraction means that when something goes wrong, you often lack visibility into the root cause, making it hard to address the issue effectively.
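To make the "pipeline as JSON" idea concrete, here is a simplified sketch that reads a pipeline-like JSON document and derives the order in which its activities could run. The field names follow the general shape of an ADF pipeline definition (`activities`, `dependsOn`), but the pipeline and activity names are hypothetical and real definitions carry many more fields. Nothing here reflects how the service executes pipelines internally.

```python
import json
from graphlib import TopologicalSorter
from typing import List

# Simplified, hypothetical pipeline definition; real ones have many more fields.
PIPELINE_JSON = """
{
  "name": "LoadOrders",
  "properties": {
    "activities": [
      {"name": "CopyOrders", "type": "Copy", "dependsOn": []},
      {"name": "MergeOrders", "type": "SqlServerStoredProcedure",
       "dependsOn": [{"activity": "CopyOrders",
                      "dependencyConditions": ["Succeeded"]}]},
      {"name": "RefreshMart", "type": "SqlServerStoredProcedure",
       "dependsOn": [{"activity": "MergeOrders",
                      "dependencyConditions": ["Succeeded"]}]}
    ]
  }
}
"""


def execution_order(pipeline_json: str) -> List[str]:
    """Return activity names in an order that respects their dependencies."""
    spec = json.loads(pipeline_json)
    graph = {
        activity["name"]: {dep["activity"] for dep in activity.get("dependsOn", [])}
        for activity in spec["properties"]["activities"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE_JSON)
```

Reading the definition this way is a reminder that the drag-and-drop canvas is a view over a dependency graph, which helps when reasoning about why a downstream activity did or did not run.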
To mitigate these challenges, I recommend leveraging ADF as the executor while executing your own scripts or stored procedures as much as possible. This approach gives you more control over the process, allowing you to implement your own testing logic and provide better visibility into what’s happening. If something fails, you can quickly dive in and resolve the problem rather than feeling stuck with an unhelpful error message. Overall, this strategy helps maintain a more manageable and transparent workflow when working with ADF.
When optimizing data workflows for financial analysis and decision-making, I find there are two key aspects to consider. The first is ensuring the infrastructure operates quickly and reliably, which is crucial across all domains. To achieve this, we need to avoid creating numerous small files that can slow down processes, especially when working with Spark jobs. Instead, we should focus on strategies like parallelization. For example, in Azure Data Factory (ADF), using a batch size of 10 or 20 in activities can enhance processing speed and efficiency.
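The batch-size idea can be sketched as a cap on how many operations run at once. The Python example below is an analogy rather than ADF configuration: `copy_table` is a hypothetical stand-in for a single copy operation, and the thread pool's `max_workers` plays the role of the batch size of 10 or 20 mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def copy_table(table: str) -> str:
    """Hypothetical stand-in for one copy operation."""
    return f"copied {table}"


def run_in_batches(items: List[str], work: Callable[[str], str], batch_size: int = 10) -> List[str]:
    """Run work over items with at most batch_size running concurrently.

    Results come back in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(work, items))


tables = [f"table_{i}" for i in range(25)]
results = run_in_batches(tables, copy_table, batch_size=10)
```

The right cap depends on what the source and destination systems can absorb; a larger number is not automatically faster if the bottleneck is on the database side.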
Interestingly, I’ve noticed that many organizations don’t deal with massive datasets. For instance, in a previous role at a national insurance broker, the entire data volume was less than 300 GB. In such cases, reaching for big data tools that require parallelization and multiple clusters might be overkill. It’s essential to evaluate whether the infrastructure being provisioned aligns with the actual data size. There’s no point in paying for extensive resources if the data can be managed efficiently with simpler tools.
The second aspect focuses on how we present data in a way that supports financial analysis. This requires a deep understanding of the organization’s business model and how various processes impact financial statements. For instance, if we consider a hypothetical scenario where a railway company is privately owned and produces rail cars, we need to align data with their cost components and production processes. The goal is to connect the data to financial outcomes, enabling stakeholders to understand the drivers behind the numbers.
Ultimately, financial analysis isn’t just about maintaining records; it’s about deriving insights that help track where money is going. By organizing data effectively, we can foster better decision-making and ensure that financial operations are transparent and manageable. This approach not only enhances reporting but also leads to a deeper understanding of financial performance across the organization.
The current data engineering stack faces significant challenges, particularly in linking data processes to financial outcomes. Many tools lack integration that directly ties data engineering efforts to profitability, highlighting a gap in metadata usage. Additionally, emerging technologies like Fabric are still underdeveloped, and there's a broader issue of adoption and mindset, especially among organizations that may not require extensive big data solutions.
To address these challenges, key skills are essential for data engineers. Proficiency in SQL and Python, along with financial and accounting awareness, is crucial for effective data modeling and organization. Engineers must optimize table structures to ensure efficient data processing while maintaining alignment with business goals.
I believe the future of analytics and data engineering is increasingly focused on semantic and logical models. Although I haven't implemented these concepts yet, I find them intriguing, especially the discussions around ontologies. They seem to enhance metadata by better organizing and connecting data in ways that align with what businesses care about.
I'm excited about the potential of semantic models and view them as a promising evolution in data modelling. I’m eager to learn more about these concepts, as I believe they will significantly improve how data is categorized and utilized in business contexts.
As I advance my skill set as a data engineer, I see the importance of understanding how my work impacts key performance indicators (KPIs) like ROI and ARR. I realize that being able to connect my efforts to financial metrics relies on my knowledge of finance and accounting. Without this understanding, I risk creating a disconnect with the business, leading to frustration when teams struggle to see the value of what I do.
This conversation has reinforced my belief that I need to be proactive in acquiring financial literacy. By bridging the gap between data engineering and finance, I can enhance my contributions to the organization and ensure my work aligns with business objectives, clearly demonstrating its value.
I’m not sure if data catalogs have genuinely simplified the lives of data engineers regarding resolving find-up requests, as I haven't implemented many myself. While tools like Purview seem cool for exposing and discovering data, I wonder if they might be addressing the wrong issue. Simply surfacing data may not solve the underlying problems.
The real challenge might lie in the financial linking of data and ensuring consistency in metrics. For instance, if we have multiple ways of calculating orders that don’t align with each other, what progress have we made? I don't have all the answers, but I find it an interesting question worth exploring.
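The metric-consistency problem is easy to reproduce on a toy dataset. In the sketch below, two hypothetical teams each compute "orders" from the same rows using a different definition; the order data, status values, and report names are invented for illustration.

```python
from typing import Any, Callable, Dict, List

Order = Dict[str, Any]

# Hypothetical order rows shared by every report.
orders: List[Order] = [
    {"id": 1, "status": "shipped", "amount": 120.0},
    {"id": 2, "status": "cancelled", "amount": 80.0},
    {"id": 3, "status": "shipped", "amount": 50.0},
    {"id": 4, "status": "returned", "amount": 50.0},
]


def orders_placed(rows: List[Order]) -> int:
    """Count every order, regardless of what happened to it later."""
    return len(rows)


def orders_fulfilled(rows: List[Order]) -> int:
    """Count only orders that shipped and were not cancelled or returned."""
    return sum(1 for row in rows if row["status"] == "shipped")


# Two reports that both label their number "orders".
METRICS: Dict[str, Callable[[List[Order]], int]] = {
    "sales_dashboard": orders_placed,
    "finance_report": orders_fulfilled,
}


def compare(rows: List[Order]) -> Dict[str, int]:
    """Compute each report's version of the metric side by side."""
    return {name: metric(rows) for name, metric in METRICS.items()}


comparison = compare(orders)
```

Both numbers are defensible, yet they disagree by a factor of two on the same data. A catalog can surface both tables without ever resolving which definition the business means.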
Modularity and configurability are essential to future-proof data infrastructure for handling complex or PII data sets. By structuring systems to allow for safe additions of new code, we can better document configurations and manage complexity. Although I haven't implemented specific solutions, I believe methods like data masking have limitations since masked data can still be vulnerable if accessed improperly.
Exploring encryption and ensuring that decryption keys are only accessible to authorized users could enhance security. It's crucial to assume that failures might occur, so in the worst-case scenario, we should ensure that data leaks won't compromise sensitive information. Clarity is key: if any data is leaked, it reflects poorly on how it was stored and managed.
I have heard about data products and appreciate their potential, particularly regarding branding. If they are indeed products, they imply a return on investment (ROI) and can provide value to users. Drawing an analogy from manufacturing principles, we should consider whether end users would be willing to pay for these products based on their usefulness and the ROI they offer.
In my view, data products can take various forms, such as dashboards or organized datasets. For instance, a data product for a manufacturing plant could be a detailed table that tracks workflow timings and aligns with the plant manager's goals. By surfacing key metrics and anomalies, these products can help management make informed decisions that benefit the organization. Ultimately, a successful data product should clearly map to financial and business processes, providing valuable insights for its users.
To stay updated on the latest developments in analytics tools and data products, I often lurk on LinkedIn. It's an amazing resource where I follow influential people, like Zach Wilson, to discover more professionals in the field.
As I connect with more individuals, LinkedIn's algorithm shows me relevant content, allowing me to read their Substacks and other materials. This approach helps me keep a pulse on emerging trends and insights in the analytics community.
In my free time, I enjoy studying languages and have been using copywork notebooks to practice Japanese, Chinese, and Thai. It's fascinating to explore Thai as it’s influenced by Indic languages, where consonants have vowels placed around them, reminiscent of the Devanagari script. This comparison enriches my understanding of language structures and their cultural influences.
Additionally, I’m working on my personal brand and preparing for an upcoming launch. While I can't reveal all the details right now, I encourage everyone to stay tuned for more information. It’s an exciting journey that I'm eager to share!
Get notified when we’re LIVE⏺️ with Season 1, Episode 3. Subscribe!
Connect with Ryan on LinkedIn! 🤝🏻