A Data Engineer's POV on the Missing Links in the Data Stack and More | S1:E4
Data Pipelines to Data Products, Shift to Unified Platforms, Nuances of On-Prem to Cloud and a dozen other hot takes!
Nov 28, 2024
Data Operations
Originally published on the Modern Data 101 Newsletter; the following is a revised edition.

We added a summarised version below for those who prefer the written word, made easy for you to skim and record top insights! 📝

Additional note from community moderators: We’re presenting the insights as-is and do not promote any specific tool, platform, or brand. This is to simply share raw experiences and opinions from actual voices in the analytics space to further discussions.

Prefer watching over listening? Watch the Full Episode here ⚡️


Introducing Abhinav Singh | Our Analytics Hero at Your Service 🦸🏻‍♂️

Abhinav Singh is a recognised Data Engineer with a strong community presence of over 40K engineers and analysts following his lead. He has over five years of experience in end-to-end ETL processes, pipeline development, reporting, and analysis across the real estate, telecom, and insurance industries. Abhinav has consistently collaborated with Fortune 500 companies, helping them solve complex business problems and enhance their data infrastructure. We highly appreciate him joining the MD101 initiative and sharing his much-valued insights with us!

We’ve covered a RANGE of topics with Abhinav. Dive in! 🤿

TOC

  • Experience in the Data Engineering Space
  • Shift of Majority Enterprise Projects to Cloud
  • Migration Challenges from On-Prem to Cloud-Based
  • Governance Nuances/Resistances of Cloud
  • The Distinction in Data Engineering b/w Enterprises & Startups
  • Recommendations for Getting Started as a Data Engineer
  • The Non-Negotiable Skills of a Data Engineer
  • Day-to-Day Toolset of a Data Engineer
  • Proactiveness: Secret to Knowledge-Sharing in Large Teams
  • The Missing Links in Current Data Analytics Stacks
  • The Industry-Wide Shift to The Unified Data Platform and Other Visible Trends
  • Taking Care of PII Data
  • A Data Engineer’s POV of Data Products
  • A Data Engineer’s POV of Data-as-a-Product Approach
  • What Would a Data Engineer Consider a Good Data Product
  • How Data Engineers Use Data Products to Build More Impactfully
  • Knowing The Real Abhinav Singh!


Experience in the Data Engineering Space

I bring over five years of expertise in data engineering and analytics, working for Fortune 500 companies across diverse domains such as telecom, real estate, and insurance. Currently, I am building a robust data infrastructure and reporting setup for a U.S.-based insurance client focused on casualty coverage for truck drivers. My technical toolkit includes cloud platforms, Spark, SQL, Python, and PySpark, which are integral to delivering scalable and efficient solutions.


Shift of Majority Enterprise Projects to Cloud

Cloud is undeniably the future of data engineering, and both freshers and experienced professionals should prioritize mastering it. The beauty of cloud platforms lies in their ability to handle administrative tasks like auto-scaling and infrastructure management, which lets engineers focus on development, the core of impactful projects. With most industries rapidly adopting cloud-based solutions (finance and banking being among the few exceptions), it’s clear that most projects will soon migrate to the cloud. Embracing this shift is key to staying relevant in the field.


Migration Challenges from On-Prem to Cloud-Based

With on-prem data infrastructure, you must handle administrative tasks, manage logging, and oversee the entire infrastructure yourself. Cloud infrastructure simplifies this significantly: you only need to learn the specific services relevant to your domain. In data engineering on Azure, for example, you’d work with Azure Data Lake Storage (ADLS) for storage, Azure Data Factory for ETL/ELT, and Synapse Analytics for processing.

These services are easy to learn, provided you have strong fundamentals in programming and database concepts. Once you grasp one cloud platform, transitioning to others becomes straightforward, as their services are similar, differing mainly in infrastructure and UI.


Governance Nuances/Resistances of Cloud

Security is a prime factor when dealing with sensitive data, and data masking mechanisms play a crucial role in addressing it. Traditionally, industries like banking and finance were hesitant to move to the cloud due to security concerns. However, with major cloud providers now demonstrating compliance with regulations such as HIPAA and CCPA, clients are becoming more aware and confident, making cloud adoption increasingly sensible.


The Distinction in Data Engineering b/w Enterprises & Startups

The approach to data engineering differs between large enterprises and startups, primarily in scope and learning opportunities. In large enterprises, like Fortune 500 companies, you’re often part of a big team, working on a specific aspect of a project for months, which might limit exposure to end-to-end architecture and development. In contrast, startups typically offer opportunities to build projects from scratch, providing a different and more holistic learning experience. Each has its pros and cons, and the choice depends on the work culture and learning path you want at that point in your career.


Recommendations for Getting Started as a Data Engineer

If I were starting my data engineering journey today, the first skill I'd focus on is SQL. It's the foundation for querying databases and understanding how data flows. From there, I'd prioritize learning Python, as it's versatile and widely used in the industry. Next, I'd explore big data technologies like Apache Hadoop or Spark to understand how to process large-scale data. Finally, I'd dive into cloud platforms like AWS, Azure, or GCP because cloud services are the future of data engineering, solving challenges like security and infrastructure management. While on-premise projects can help you grasp fundamentals, transitioning to cloud projects early is essential for staying ahead.


The Non-Negotiable Skills of a Data Engineer

To become a data engineer, the path starts with mastering SQL, progresses through learning Python, and then moves to big data technologies like Apache Spark before concluding with cloud services.

  1. SQL is foundational and non-negotiable for any data role. It enables you to interact with databases, which is essential when migrating or ingesting data from upstream systems to data lakes. SQL is like the "simple English" of databases.
  2. Python is the go-to programming language for building data pipelines, error handling, logging, and creating functions. It's widely used in data engineering projects due to its versatility and beginner-friendly nature.
  3. Apache Spark, a distributed engine for processing large datasets, is vital for big data projects. It’s one of the most popular open-source tools in the big data ecosystem.
  4. Cloud services come last: pick one major provider such as Azure, AWS, or GCP and focus on its tools for storage, ETL/ELT processes, and Spark-related tasks. In Azure, for example, ADLS, Data Factory, Databricks, and Azure DevOps are key components.

Each step builds on the previous one, creating a comprehensive skill set for tackling data engineering challenges effectively.
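
To make the first two steps concrete, here is a minimal sketch of SQL and Python working together in a tiny ingestion job, with the error handling and logging mentioned above. It is an illustration only, not code from the interview: SQLite's in-memory database stands in for a real upstream system, and the `orders` table, its columns, and the function names are hypothetical.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest_sketch")


def ingest_orders(conn, rows):
    """Load raw rows into a table, skipping (and logging) malformed ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    loaded = 0
    for row in rows:
        try:
            order_id, amount = int(row["order_id"]), float(row["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            # Error handling: one bad record should not stop the whole load.
            logger.warning("Skipping malformed row %r: %s", row, exc)
            continue
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        loaded += 1
    conn.commit()
    return loaded


def total_amount(conn):
    """Plain SQL does the aggregation once the data is loaded."""
    (total,) = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()
    return total


# Hypothetical sample data: the second row cannot be parsed as a number.
conn = sqlite3.connect(":memory:")
raw_rows = [
    {"order_id": "1", "amount": "10.5"},
    {"order_id": "2", "amount": "not-a-number"},
    {"order_id": "3", "amount": "4.5"},
]
loaded = ingest_orders(conn, raw_rows)
total = total_amount(conn)
```

With this sample input, two rows load and one is logged and skipped. The same shape (validate in Python, aggregate in SQL) carries over when the engine is Spark or a cloud warehouse rather than SQLite.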


Day-to-Day Toolset of a Data Engineer

The core data engineering projects I've worked on have been significant learning experiences. My expertise lies primarily in building data pipelines on Azure Cloud, though I’ve also explored AWS through POCs. For ELT ingestion, I’ve used Azure Data Factory to bring data from multiple sources into a Data Lake, layered with a medallion architecture. My toolkit includes Databricks, Spark, and Azure DevOps for CI/CD processes. On the fundamentals side, I frequently use Python, have some experience with Scala, and work with PySpark and SQL. These technologies form the foundation of my work.
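
For readers new to the medallion architecture mentioned above, the sketch below shows the general idea in plain Python: a bronze layer that keeps data as received, a silver layer that cleans and de-duplicates it, and a gold layer that holds business-level aggregates. This is a simplified illustration under assumed names; the record layout (`policy_id`, `state`, `premium`) and the functions are hypothetical, and a real implementation would typically use Spark tables on a data lake rather than Python lists.

```python
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze: keep records as received, adding only the source line number."""
    return [{"line_no": i, "raw": line} for i, line in enumerate(raw_lines, start=1)]


def to_silver(bronze):
    """Silver: parse, type, and de-duplicate; drop records that fail to parse."""
    seen, silver = set(), []
    for record in bronze:
        parts = record["raw"].split(",")
        if len(parts) != 3:
            continue
        policy_id, state, premium = (p.strip() for p in parts)
        try:
            premium_value = float(premium)
        except ValueError:
            continue
        if policy_id in seen:
            continue
        seen.add(policy_id)
        silver.append(
            {"policy_id": policy_id, "state": state.upper(), "premium": premium_value}
        )
    return silver


def to_gold(silver):
    """Gold: business-level aggregate, here total premium per state."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["state"]] += row["premium"]
    return dict(totals)


# Hypothetical sample input with one duplicate and one unparseable premium.
raw_lines = [
    "P1, tx, 100.0",
    "P2, ca, 250.0",
    "P1, tx, 100.0",
    "P3, tx, oops",
    "P4, tx, 50.0",
]
bronze = to_bronze(raw_lines)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw bronze records untouched means the silver and gold layers can be rebuilt whenever the cleaning rules change.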


Proactiveness: Secret to Knowledge-Sharing in Large Teams

To grow and expand your knowledge effectively, you must become like a sponge—absorbing insights from peers, observing their approaches, and learning from their expertise. Engage with colleagues from diverse backgrounds, examine their code, documentation, error handling, and even how they communicate through blogging. Apply these observations in your own style. For example, I once worked with a team lead with 12 years of experience who taught me how to interact with clients, gather requirements, and pitch the right technologies. By interacting, observing, and adapting, you stay ahead in a rapidly evolving tech landscape.


The Missing Links in Current Data Analytics Stacks

To excel in data engineering, focus on three key areas:

  1. Documentation – While development is enjoyable, documentation is often overlooked. It’s essential to focus more on thorough documentation to make solutions scalable and optimized.
  2. Scalability and Optimization – Always design with scalability in mind. Solutions should be created with both a microscopic lens for details and a broad vision to anticipate future challenges.
  3. Data Modeling – Data engineers, even with 2-3 years of experience, should have a basic understanding of data modeling. Knowledge of different data designs, dimension tables, schemas, and their scalability is crucial, as you'll eventually need to design data models yourself.

It’s difficult to enter the data domain without mastering these areas, especially the technical documentation, which often feels cumbersome. As someone who was once a beginner, I try to simplify concepts to make them accessible to others, aiming to be the mentor I once needed.
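
As a small illustration of the data modeling point, the sketch below builds a minimal star schema: one dimension table describing customers and one fact table recording sales that reference it. The table names, columns, and values are hypothetical, and SQLite is used only so that the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes, one row per customer.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        region TEXT NOT NULL
    );
    -- Fact table: measurable events, each pointing at a dimension row.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        amount REAL NOT NULL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'North'), (2, 'Globex', 'South');
    INSERT INTO fact_sales VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
    """
)

# Reporting queries join the fact table to its dimensions and aggregate.
sales_by_region = conn.execute(
    """
    SELECT d.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.region
    ORDER BY d.region
    """
).fetchall()
```

Separating descriptive attributes (dimensions) from measurable events (facts) lets new attributes be added to a dimension without rewriting the much larger fact table.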


The Industry-Wide Shift to The Unified Data Platform and Other Visible Trends

Data engineering is evolving towards data lakehouse architectures and unified platforms. The trend is moving away from using separate tools for different tasks (data ingestion, visualization, etc.) towards integrated solutions. Companies are now offering unified platforms, with products like Databricks, Snowflake, and Microsoft Fabric leading the charge. These platforms combine multiple capabilities into one, streamlining processes. Databricks, in particular, is a key technology to watch in the next four to five years, as it is expected to become a dominant tool in the field.


Taking Care of PII Data

To ensure data privacy, particularly in sectors like banking and finance, data masking is essential. For sensitive information such as credit card details, dates of birth, and addresses, tools like Databricks offer native functionalities to mask data in your code. It’s a must-do practice to handle sensitive data responsibly.
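
The sketch below shows what masking can look like at the code level, using only the Python standard library. It does not use or represent any Databricks masking feature; the helper names, the sample record, and the salt value are all hypothetical, and the choices (keep the last four card digits, keep only the birth year, hash the address) are assumptions made for illustration.

```python
import hashlib
import re


def mask_card_number(card_number):
    """Keep only the last four digits; replace the rest with asterisks."""
    digits = re.sub(r"\D", "", card_number)
    if len(digits) < 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_birth_date(iso_date):
    """Generalise a YYYY-MM-DD date of birth to the year only."""
    return iso_date[:4] + "-**-**"


def pseudonymise(value, salt):
    """Salted SHA-256 digest: a stable join key that hides the raw value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical record; the card number is a well-known test value.
record = {
    "card_number": "4111 1111 1111 1111",
    "date_of_birth": "1990-07-15",
    "address": "12 Example Street",
}
masked = {
    "card_number": mask_card_number(record["card_number"]),
    "date_of_birth": mask_birth_date(record["date_of_birth"]),
    "address": pseudonymise(record["address"], salt="demo-salt"),
}
```

Note that salted hashing is pseudonymisation rather than full anonymisation: low-entropy values can still be guessed if the salt leaks, so in practice the salt would be kept in a secrets store rather than in code.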


A Data Engineer’s POV of Data Products

A data product is a tool, service, or platform that utilizes a company's data to provide insights or streamline processes. For example, in my current scenario, a platform that handles reporting for different cases of a client could be considered a data product.


A Data Engineer’s POV of Data-as-a-Product Approach

To apply product thinking to data, the first step is to understand the business use case—what problem the business is trying to solve. For beginners, this is crucial. Once you grasp the business problem, you can determine the approach to solve it. This might involve coding or selecting the right technology. For example, if tasked with a project that has the purpose X, focus on how to reach X, break it down into steps, and choose the technology best suited to build a solution for that path.


What Would a Data Engineer Consider a Good Data Product

As a data engineer, it's essential to understand the business use cases. For instance, if the goal is to create a reporting platform, you need to know the KPIs that will be displayed. Once you understand the KPIs, you can design the data model to support them and build the reporting infrastructure. After that, develop a scalable data pipeline that supports the problem statement and ensures the reporting system can handle changing data over time. The pipeline should be flexible and adaptable to evolving data needs.


How Data Engineers Use Data Products to Build More Impactfully

A data product provides analytics tailored to your specific company or data project. If built correctly and aligned with its intended purpose, it offers valuable insights. For example, metrics like employee count or other KPIs can be used to develop further data products or solutions. The interconnectivity of data products is also crucial, as various teams, personas, and stakeholders are involved in decision-making. When these components communicate and work together, the decision-making process improves significantly.


Knowing The Real Abhinav Singh!

I’m super into fitness right now—it’s like my whole day and nutrition revolve around it. Honestly, it’s been my anchor over the past year. Besides that, I’m all about personal development—whether it’s health or finance, I love working on myself.

Speaking of good vibes, I had an amazing 10-day trip to Meghalaya recently. It’s such a stunning place—the hills, the waterfalls, just nature everywhere. It’s like the perfect recharge spot.

If I could pick a superpower, it’d definitely be mind-reading. I know it’s kind of cliché, but it’d make life so much easier—both personally and professionally.

And yeah, I really enjoyed this conversation and hope what I shared helps people looking to transition into data roles!


📝 Note from Editor
The above insights are summarised versions of Abhinav Singh’s actual dialogue. Feel free to refer to the transcript or play the audio/video to capture the true essence and details of his as-is insights. There’s also a lot more information and hidden bytes of wonder in the interview, so listen in for a treat!


Guest Connect 🤝🏻

Connect with me on LinkedIn 🙌🏻
