+

Data Feast Weekly

Get weekly insights on modern data delivered to your inbox, straight from our hand-picked curations!

The Analytics Engineering Flywheel, Shifting Left, & More With Madison Schott | S1:E1
The Analytics Engineering Flywheel, Shifting Left, & More With Madison Schott | S1:E1

The Analytics Engineering Flywheel, Shifting Left, & More With Madison Schott | S1:E1

5 min
|
Everything from Data Modeling and Governance to Industry Trends and Best Practices
Sep 19, 2024
Data Strategy
,
  and

Originally published on

Modern Data 101 Newsletter

,

the following is a revised edition.

We’re thrilled to launch Analytics Heroes, our first talk show and an ardent initiative to get closer to All Things Analytics. Here, you’ll find top Analytics Engineers sharing their best tips and hacks, their most painful challenges, how they overcame them, and how they made their way to roles and functions they love!
Watch the full episode

We added a penned version below for those who prefer the written word, made easy for you to skim and record top insights! 📝

Additional note from community moderators: We’re presenting the insights as-is and do not promote any specific tool, platform, or brand. This is to simply share raw experiences and opinions from actual voices in the analytics space to further discussions.

Introducing Madison Schott | Our First Analytics Hero 🦸🏻‍♂️

I'm currently an analytics engineer at Convertkit, which is a creator platform where you can sell products, create newsletters, and really just do a lot to educate and reach out to your audience. We are a data team of five, so pretty small.

Many of my projects revolve around reverse ETL and ensuring that I'm sending high-quality data to HubSpot and Intercom because many of our sales and communication processes revolve around those two platforms.

And then, I work very closely with DBT. Right now, I'm working on creating a new revenue model. So, there has been a lot of research into Stripe data, which is quite complex and definitely a beast to work with. So I'm kind of at those initial stages of just learning about all the intricacies of Stripe with Convertkit and then coming up with a plan to model the data.

We’ve covered a RANGE of topics with Madison. Dive in! 🤿

Data Modeling Solutions 🧩


Motivations

It's always fun to model data because it's like a little puzzle. It's like a fun game that analytics engineers get to do….You can think of modelling as a puzzle instead of individual pieces.

You can see the full picture with a data model, and you don't have to spend time putting those small pieces together. The analysts can just look at the data model as the full picture and get what they need without spending all the time putting the puzzle pieces together.

Battling Legacy Models

In my first role as an analytics engineer, we had a bunch of data models written within Domo by a consulting company that was improperly named. They were referencing each other in strange orders using awful SQL practices, which was pretty much as bad as you can imagine.

I think I spent a lot of time trying to understand what those models were doing, whereas I should have just started from scratch and not even referenced those models.

I know that's easier said than done because many stakeholders are asking questions based on these models. But in the end, it (starting from scratch) creates a more robust data model because you don't know if that logic in the model still applies or if those data sources are still used.

So, by starting from scratch, you can kind of begin that whole new flywheel process* (referenced later). So talking with the stakeholders, seeing what they need, what they use, that's like another thing. Just because there's a model for it doesn't mean anyone’s using what's in it. Legacy models often don't have documentation, which makes it harder.

So by starting from scratch and understanding all of your data sources, all of the strange edge cases and exactly what you're trying to build, it just ends up being a lot more reliable and less of a headache because you're not trying to understand code that's been poorly written.

Semantic Modeling

Semantic engineering can be super helpful, especially in creating that single source of truth. DBT has a semantic layer that allows you to create metrics based on different dimensions and allows them to be queried by analysts easily.

I think this is always going to be a huge problem that metric calculations get caught up in the BI layer.

So DBT is just adding another tool to the DBT toolkit that allows you to shift things left. And yeah, it can be super powerful.

I think you do need to make sure you have those foundational data models in place. So before you can really take advantage of this and do things the right way.

So, just like everything else, it's like a series of steps that you need to make sure you have the foundation, have the high-quality data and data models and the testing before you can implement it.

Shifting Left

Shifting left in the context of data mainly refers to taking all of the insights and metrics out of your BI tool or your end systems like Hubspot intercom and moving them as far upstream as possible.

If you're using a tool like DBT, instead of calculating metrics within your BI tool, you would calculate it within DBT instead. This allows you to create a single source of truth and then be able to use that one metric calculation in a bunch of different tools.

I think a lot of data teams run into the problem of, you know, creating metrics within their BI tool, creating metrics within something like HubSpot, or maybe another growth or marketing tool like Google Analytics. And then, you have three different answers for the same question.

It just makes stakeholders lose trust in the data because they don't understand why there are three different answers to the same question. They don't know what to depend on. So, by shifting all of this logic left, you can really create one central source of truth.

The Analytics Engineering Flywheel 🎡


The idea of a flywheel is something that the CEO and founder of Convertkit, Nathan Barry, talks a lot about, and something that we've really tried to implement a lot at Convertkit.

If your input directly affects your output, the idea of a flywheel is to put in the same amount of input but then continually 10X your output. So you're essentially doing less and less work with more and more results.

So, with analytics engineering specifically, I think of this a lot as taking the time to really understand the problem before you jump into it and looking at like all potential solutions, all potential problems that could go wrong, and then really taking the time to document everything.

I think documentation is something that many people don't want to spend the time to do because it is very time consuming. You have to write a lot of investigative queries. You have to ask many people questions who have more historical knowledge, especially if you're newer out of the company.

But once you document everything, it allows you to move a lot faster because you document a data source once, and then you can continually build on that and use it rather than trying to understand it every time you want to use it.

First, understand your problem, then document, test, do all that kind of investigative work, and then write your data models. Writing data models, in general, will help speed up any kind of analytics workflow because you're writing reusable data sets, and you're giving analysts a way to query this transformed data so they don't have to write it every time.

And then the last step is something like a peer review, where your team can review your data modelling logic, and then you can continually iterate.

So, after the review, you can go back to analysing if this model is actually solving the problem that you initially scoped out and then reiterate it. This process just continually moves faster and faster because you're building up that foundation within your data warehouse with lots of documentation that you can reuse in the future.

I think it is definitely appreciated, especially cuz data teams have so many requests flying at them. So, to be able to answer more and more requests with the core things that you've built is a really powerful thing.
Related Reads 📝
Continuous Feedback loops for data: [1], [2], [3]

Experience of Data Engineering Giants:
Snowflake & DBT 🐘


POV on DBT

I think with DBT, it's really easy to just use the tool without using it correctly. I think that the thing that I see a lot of people doing wrong is not taking the time to understand its benefits and how to properly set it up before diving into it.

And that's something I've realised with basically every role that I've had. You really need to take the time to understand your data and the foundational aspects and then build on top of that using the best practices.

Or you can just get yourself into a worse situation than you started with building up all this tech debt and then having to fix these issues on top of building for the future.

⚠️ I think DBT exceeds any other transformation tool out there. But the one con I would say is, I guess, the lack of governance...DBT doesn't always enforce certain best practices. So, it’s easy to use the tool in a way that it's not necessarily meant for.

So I think that you just have to be conscious of setting it up the right way. But DBT has a lot of great resources and I've written so many resources on how to use the tool correctly.

✅ The main benefit of DBT is making everything modular, and I find that a lot of people don't use it in that way. I love the DBT has built in documentation. I think that's one of the most valuable things to have your documentation right beside your code and just to make it so easy to find.

DBT has so many different testing options. They have tons of packages that have been built on top of DBT for data quality that you can use. And it's an open-source tool, so anyone can access DBT, which is a huge plus.

And there's just so much that you can do with it, just the way it automates everything, like with macros, with seeds, you have all these abilities to do everything within the tool that you don't get with any other data transformation tool.

If you're using a tool like DBT, instead of calculating metrics within your BI tool, you would calculate it within DBT instead. This allows you to create a single source of truth and then be able to use that one metric calculation in a bunch of different tools.

POV on Snowflake

Snowflake has really become an all-in-one data platform. Like, I swear, every few months, they're adding another major feature, which I don't use Snowflake as much anymore. I miss it. I wish I still used it compared to Redshift.

Now they have Snow Park where you can use Python and Java and not just SQL. They have the ability to create LLMS.

Cost Management

I think the biggest thing for optimization is just like managing costs. I've used Snowflake for my personal development projects, and I'll never forget one time I left a warehouse running, and I think I had a bill for $2000 the next day, which is not fun.

But it also makes you appreciate these cost hacks and how you can save money. And it just, when you do something like that, it makes you way more aware of how much you're spending on a tool like Snowflake.

There are a lot of modern data stack tools that have been built on top of Snowflake specifically for cost management, which I think are really cool tools that you can take advantage of. I definitely recommend looking at those.

And then I think just the basics of creating a bunch of different warehouses depending on what you're doing and using those to track the costs that you're running up.

If you use just one or two warehouses within Snowflake, it's really hard to pinpoint what is bringing those costs up. So, by creating different warehouses for your transformation, for your analysts, and your ingestion tools, it's easy to see what's spending the most. And then specifically targeting that, always remember not to keep it running!

Assessing Fitness for Tools/Platforms


We recently were looking at different data ingestion tools, specifically for Stripe data and revenue reporting, as I mentioned earlier.

1️⃣ Cost

Cost is always #1

More enterprise tools tend to be better. They're more reliable. They have lots of support behind them and ways to like quality checks, but they are very expensive.

One of those was Fivetran. In an ideal world, we would have gone with fivetran because it's just a very reliable tool and I've always had a good experience, but it is very expensive.

So that's like one thing: I try to always look for tools within the company's budget 1st.

2️⃣ Usecase Match

And then the second thing is, do they have what you need available? Do they solve the use case? We were looking for a connector to ingest Stripe data. So, if a connector tool doesn't ingest Stripe data, there's no point in using it.

3️⃣ Best Practices

Then, I always look at whether the company is following best practices.

So, with data ingestion, how are they handling schema propagation? How are they handling normalisation?

These are all things that I talked about in my new series when exploring data ingestion tools.

Related Reads 📝
Here’s one from Madison’s Data Ingestion series, check it out here.

4️⃣ Ease of Use/ Going Hands-On Before Buy

Lastly, a tool can look good on paper, and you can check all the boxes, but I think you just need to try it out for yourself.

I would never commit to a tool without actually trying it.

So one of the tools, for example, we were looking at, we thought it was going to be really good and work well for us. And then when it came to actually using it, it just wasn't ingesting the data at the speed that we needed. And there was just a lot of features that were falling short and we felt like the data wasn't reliable. In the end we decided not to go with it, but I think trialling these tools is honestly always key because you're going to get the best feel for it.

And I always recommend comparing it to doing proper data validation. You don't want to just assume everything is correct, even if the tool is saying the ingestion syncs are working correctly.

Try to compare multiple tools at once if you can and see what the data looks like in one versus the other versus what's already in your warehouse. </aside>

Addressing the Trends 🔥


The community has been peaking with conversations on Governance, AI, Data Products, and Catalogs. Here’s Madison’s take on them.

Can Catalogs make a difference to the day-to-days?

I'm a big fan of data catalogues, but I think it is hard to get stakeholders to actually use them.

If you're going to answer their question in Slack, for example, why would they take the time to look through a data catalog?

So, while data catalogs are powerful, I think it's more of a cultural shift that you have to focus on to actually get stakeholders to use them and look for the answers themselves within these tools.

And that requires a lot of education also on how to use them and making sure you're setting them up in a way that stakeholders finds valuable.

How about Data Products? 📦

I think data products are used more so in a business sense like you will look at a BI tool as a business product or a data product. But yeah, I haven't spent too much time looking into it because I feel like it's another one of those buzzwords that encompasses almost everything. It's like people giving a name to something that's existed for like 20 years, right.

I look at this data product as the very end stage, almost something that an analyst creates. So I think with data engineers, it's more so setting the foundation and producing high-quality data and data models along with documentation that then can eventually be used downstream.

For example, we're using Quicksite. Our data analysts will create dashboards and reports based on the data models that we create using Quicksite, and then stakeholders typically reference that.

POV on AI integration 🤖 with DE jobs

So, I haven't really used AI. I think AI is such a buzz topic right now, and I do think it's powerful, but I think many companies and teams aren't really at the place where they should be using it.

I know for us personally, like we're just trying to build the foundational data models and still understanding and documenting all of our data and that should be done before using any of these other tools.

And I think a lot of companies kind of dive in before fully understanding the quality of their data. And if you're not properly cleaning it and you're using AI on data that's messy and unreliable; it's just going to make the issue worse because you can't debug it in a way that you can if you're the one that has full control over it.

If anything, I think you can maybe utilize AI to kind of like help speed up your workflow. Still, I don't think it should necessarily replace analytics engineers or data engineers. Instead, it makes more sense to like simply using chatGPT to look up specific queries or to optimize your queries or looking up different concepts that you might need to have a better grasp on.

Governance in the Modern Data Stack ♻️

I think MDS has definitely helped (in governance) as long as you do things right.

I think it's a lot easier to just kind of put something in place, which is good and bad because, you know, sometimes maybe people that don't really understand data quality best practices are trying to take advantage of these modern data stack tools.

But as long as you have the right foundations, they can be really powerful, and it can be just as simple as before you're using a data warehouse, like make sure you write up documentation on how you're planning to use it, like the different databases, the different schemas, the users, the roles that you want to create. And this can really help make sure that you have good data governance.

I think a lot of times, data governance goes south when people aren't documenting things, and you're kind of just acting quickly and giving stakeholders access to data whenever they request it without really thinking out why you're giving it to them or documenting the fact that you're giving it to them.

And then you just end up with a bunch of different people who have access to your data that are then maybe using it to do their own calculations and aren't necessarily correct. I know that's definitely a problem we have where you've almost given too much access to the data, and then it empowers stakeholders to find the answers themselves, which is good to an extent.

But if they don't really know how to use SQL or understand all the nuances in the data, then it can again, like, create that distrust.

Whenever you are working with data, whoever you're asking the data about, it ultimately comes down to that trust factor. Yeah, that really that is very easy to lose too. I feel like once you lose it as a data team, it's really hard to gain it back. So you want to do everything to protect that as much as possible.

Guest Connect 🤝🏻

Check out Madison’s full profile on the Modern Data 101 Network.

More from Madison Schott
// Text truncation functionality const elements = document.querySelectorAll('[ms-code-truncate]'); elements.forEach((element) => { const charLimit = parseInt(element.getAttribute('ms-code-truncate')); // Helper function to recursively traverse the DOM and truncate text nodes const traverseNodes = (node, count) => { for (let child of node.childNodes) { if (child.nodeType === Node.TEXT_NODE) { if (count + child.textContent.length > charLimit) { child.textContent = child.textContent.slice(0, charLimit - count) + '...'; return count + child.textContent.length; } count += child.textContent.length; } else if (child.nodeType === Node.ELEMENT_NODE) { count = traverseNodes(child, count); } } return count; } // Create a clone to work on without modifying the original element const clone = element.cloneNode(true); traverseNodes(clone, 0); // Replace the original element with the truncated version element.parentNode.replaceChild(clone, element); }); });