Executive Summary
The fast-paced, data-centric business environment necessitates agile, reliable, and effective data analytics solutions. This document provides a comprehensive overview advocating for the implementation of Data Build Tool (dbt™) as a strategic asset in your data stack, particularly when used in conjunction with AWS services like Redshift, Glue, and S3, and integrated into CI/CD pipelines through platforms like GitHub, Azure DevOps, or others.
dbt empowers data teams to create production-ready data pipelines with ease, while adhering to software engineering best practices. It serves as a catalyst in eliminating data silos, ensuring data integrity, fostering collaboration, and streamlining the deployment of data products. dbt’s compatibility with AWS services amplifies its utility. When used with a data warehouse like Amazon Redshift, dbt optimizes data transformations and analytics. In combination with AWS Glue, it provides a streamlined, end-to-end ETL process. Its flexibility with AWS S3 offers additional storage options, making it a versatile tool in a data architect’s toolkit. While the technology does present a learning curve and may have limitations for extremely large datasets, the benefits significantly outweigh the challenges. The document argues that adopting dbt is not just a technological decision but a business imperative for organizations striving to be data-driven in today’s competitive landscape. By integrating dbt into your data architecture, organizations can achieve faster, more reliable, and cost-effective data analytics, thereby accelerating their data capabilities.
Introduction
While the company dbt Labs, Inc. does have multiple offerings through their SaaS product, for all purposes within this document we are referencing the open-source tooling dbt Core, which can be installed on user or machine systems (via software package management tools like pip or Homebrew). dbt Core ships with a command-line interface (CLI) for running your dbt project. The dbt CLI is free to use and available as an open-source project (https://github.com/dbt-labs/dbt-core).
Intended to help data engineers, data analysts, and data architects, this document advocates for the implementation of Data Build Tool (dbt™) alongside AWS services like Redshift and Glue. Implementing dbt can lead to the creation of agile, reliable, and effective data products. The modern business landscape requires organizations to move fast, make data-driven decisions, and be adaptive. Below is a framework that sets forth reasons why dbt should be part of your data stack to ship trusted data products faster.
What is dbt™?
https://www.getdbt.com/product/what-is-dbt
dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation, enabling anyone on the data team to safely contribute to production-grade data pipelines.
Analysts using dbt can transform their data by simply writing select statements, while dbt handles turning these statements into tables and views in a data warehouse. These select statements, or “models”, form a dbt project. Models frequently build on top of one another – dbt makes it easy to manage relationships between models, visualize those relationships, and assure the quality of your transformations through testing.
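As a minimal sketch (model and column names here are hypothetical), a downstream model builds on a staging model simply by selecting from it through dbt's ref() function; dbt infers the dependency and materializes the result in the warehouse:

-- models/marts/fct_daily_orders.sql (hypothetical model name)
-- Builds on an upstream staging model via ref(); dbt resolves the dependency
-- graph and turns this select statement into a table or view.
select
    order_date,
    count(*)         as order_count,
    sum(order_total) as total_revenue
from {{ ref('stg_orders') }}
group by order_date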
Engineers and analysts can deploy safely to dev environments, testing their assumptions and validating their assertions about the data. Git-based version control provides collaboration and a means to return to previous states, and it also fosters an ecosystem in which data team members can easily get their code peer reviewed. Within the CI/CD pipeline, teams are able to test every model prior to production and share dynamically generated documentation with all data stakeholders. Teams can write modular data transformations in .sql or .py files that can easily be reused – dbt handles the chore of dependency management.
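One concrete way dbt expresses such assertions is a singular test: a select statement saved under tests/ that should return zero rows. A small sketch, again assuming a hypothetical stg_orders model:

-- tests/assert_no_negative_order_totals.sql (hypothetical test)
-- A singular dbt test: it fails if this query returns any rows,
-- i.e. if any order has a negative total.
select *
from {{ ref('stg_orders') }}
where order_total < 0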
What dbt™ isn't
dbt is not a data warehouse or a database itself, but rather a tool that can be used in conjunction with a data warehouse to make it easier to work with and manage data. Additionally, dbt is not a programming language, but it does use a programming-like syntax to specify how data should be transformed and loaded into a data warehouse. It is also not a visualization tool, although it can be used in conjunction with visualization tools like Tableau or Looker to help users understand and analyze their data.
Why dbt™?
Eliminate Data Silos
Data engineers can now build models that connect with those built by analysts, each using the language they prefer. dbt supports modeling in SQL or Python, enabling a shared workspace for everyone who works on analytics code.
Reliability and Repeatability
dbt's version control integration means that all changes to data models are tracked. This is crucial in ensuring that past versions can be quickly restored, guaranteeing data integrity. With dbt, data teams can build observability into transformation workflows through scheduling, logging, and alerting. Branch protection policies ensure data moves through governed processes across dev, stage, and prod environments on every CI run.
Self-documenting Code
In dbt, metadata about data models is maintained alongside the code. For example, you can add descriptions for models and columns in the project's YAML files, which dbt uses to auto-generate documentation. This visibility is beneficial for cross-functional teams to understand data models.
Collaboration
dbt’s integration with Git for version control fosters collaboration. Team members can work on parallel branches and easily merge changes, encouraging a more team-based approach.
Code Reviews
For data engineering projects, code reviews ensure that the data transformations adhere to predefined quality standards, making the data more trustworthy.
Extensive Community Ecosystem
Through dbt Hub (https://hub.getdbt.com/), teams building with dbt can leverage community packages (from contributors like dbt Labs, Fivetran, and others) to refine the raw data in your warehouse. These community-available packages provide SQL macros and models that can be (re)used across dbt projects – from useful macros for performing data audits, to base models for Redshift system tables, to sophisticated privacy transformations that reduce reidentification risk but leave data usable for analysts.
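As an illustration (model and column names are hypothetical, and this assumes the community dbt_utils package has been added to the project's packages.yml), a package macro can be dropped straight into a model:

-- models/staging/stg_customers.sql (hypothetical)
-- Uses the generate_surrogate_key macro from the dbt_utils community package
-- to build a stable key from two natural-key columns.
select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'signup_date']) }} as customer_key,
    customer_id,
    signup_date,
    customer_name
from {{ source('raw', 'customers') }}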
Limitations
While dbt offers powerful features, it has a learning curve and requires a cultural shift toward treating data as code. Furthermore, for very large datasets, dbt’s in-database processing may limit performance.
How dbt™ Fits with CI/CD
Continuous Integration/Continuous Deployment (CI/CD) is a best practice in modern software engineering that involves automatically building, testing, and deploying code changes to production environments. In the context of data analytics, Data Build Tool (dbt) plays the equivalent role for data transformation, testing, and documentation. When dbt is integrated into a CI/CD pipeline, it enables automated data testing, version control, and deployment, thereby offering a comprehensive solution for agile data analytics. Azure DevOps serves as a robust platform to implement this integration, providing a variety of tools for code repository management, build and release pipelines, and more. Here's how dbt fits seamlessly with CI/CD, particularly when using Azure DevOps.
Automated Testing
Pre-Commit Hooks: Before a developer pushes code to the repository, pre-commit hooks can run dbt tests locally to ensure data models are correct.
Pipeline Steps: As part of the CI/CD pipeline in Azure DevOps, automated tests can be configured to run every time there’s a code change. This ensures that the data models and transformations are correct before they are deployed.
Quality Gates: dbt tests can act as quality gates in the pipeline. If a test fails, the pipeline can be configured to halt, preventing the deployment of faulty data models.
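To sketch how such a gate can be expressed (the test name and rule are hypothetical), a custom generic test is just a parameterized query that returns failing rows; once attached to a column in the project's YAML, any violating rows cause dbt test to fail and the pipeline to halt:

-- tests/generic/test_is_positive.sql (hypothetical custom generic test)
-- Returns the rows that violate the assertion; dbt marks the test as failed
-- when any rows come back, which the pipeline treats as a blocking gate.
{% test is_positive(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} <= 0

{% endtest %}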
Version Control
dbt Project in Azure Repos: Your dbt project can be stored in Azure Repos, which allows it to benefit from version control, branching, and pull requests, just like any other codebase.
Branch Policies: Azure DevOps allows you to set up branch policies, ensuring that dbt models can only be merged after passing specified criteria, like code reviews and successful test runs.
Artifact Management: The built and tested dbt artifacts can be stored in Azure Artifacts, providing a historical record and facilitating rollbacks if necessary.
Deployment Automation
Parameterized Runs: Azure DevOps pipelines can be parameterized to deploy dbt models to various environments (e.g., dev, staging, prod) based on the branch being merged (see the sketch after this list).
Automated Rollbacks: If a dbt model fails in a higher environment like staging or production, Azure DevOps pipelines can be configured to automatically rollback to the previous stable version.
Monitoring and Notifications: Azure DevOps provides monitoring tools and can be set to send notifications if the pipeline fails or if there are issues with the deployed dbt models.
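On the dbt side, environment-specific behavior pairs naturally with the parameterized runs above. A minimal sketch, assuming targets named dev, staging, and prod in profiles.yml and a hypothetical stg_events model, limits how much data non-production runs scan:

-- models/marts/fct_events.sql (hypothetical)
-- target.name reflects which profile target the run was parameterized with,
-- so dev and staging runs only process a recent slice of the data.
select
    event_id,
    user_id,
    event_type,
    event_at
from {{ ref('stg_events') }}
{% if target.name != 'prod' %}
where event_at >= dateadd(day, -3, current_date)
{% endif %}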
Collaboration and Access Control
Role-Based Access: Azure DevOps supports role-based access control, allowing you to specify who can perform actions like triggering pipelines or merging dbt models.
Collaboration Features: Azure Boards can be used for task tracking, linking your data tasks directly to your dbt development efforts.
Documentation: dbt’s inherent documentation generation can be integrated into the CI/CD process, keeping all stakeholders informed about the current state of the data models.
dbt and CI/CD together form an effective methodology for agile data transformation and analytics. By leveraging Azure DevOps as the platform for implementing this integration, organizations can automate data quality checks, maintain version control, and streamline the deployment process, resulting in faster, more reliable data analytics pipelines.
Complementing AWS Services
Redshift: dbt and Redshift can optimize data transformations and analytics.
Amazon Redshift is a fully managed data warehouse service, optimized for running complex queries and performing data analytics tasks. One of its key features is the ability to create materialized views, which pre-compute and store query results for faster retrieval. dbt can take advantage of Redshift’s features like materialized views to optimize data transformations. By utilizing these features, dbt models can be designed to run much faster, which is especially useful when dealing with large data sets or complex transformations. This can lead to reduced compute costs and faster time-to-insight for data analytics. Redshift uses a PostgreSQL-compatible SQL syntax, which is also the primary language for defining transformations in dbt. This compatibility means that you can seamlessly move SQL code between dbt and Redshift, thereby speeding up the development process. dbt’s Git-based version control workflow enables team members to collaborate on a single, unified model that leverages Redshift’s power. This ensures that changes are transparent, trackable, and reversible, adding a layer of reliability to your data processes.
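A hedged sketch of this pattern (model and column names are hypothetical; the materialized_view materialization is available in recent versions of the dbt-redshift adapter):

-- models/marts/mv_daily_sales.sql (hypothetical)
-- Configured as a Redshift materialized view, so repeated queries read
-- pre-computed results instead of re-aggregating the underlying model.
{{ config(materialized='materialized_view') }}

select
    region,
    date_trunc('day', sold_at) as sale_date,
    sum(amount)                as total_sales
from {{ ref('stg_sales') }}
group by 1, 2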
Glue: Glue handles data ingestion, while dbt™ can focus on the transformation layer.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for users to prepare and load data for analytics. AWS Glue can focus on data ingestion and schema management, preparing the raw data for subsequent transformations. Once the data is ingested into Redshift or another data warehouse, dbt can then take over to perform the transformations, tests, and data quality checks. This separation of concerns creates a modular data pipeline that is easier to manage, monitor, and scale. Glue’s Data Catalog feature acts as a centralized repository for metadata, allowing seamless transition into dbt. This improves discovery and enhances governance, as dbt can also document transformations, tests, and lineage. Glue and dbt complement each other by creating a more streamlined, end-to-end ETL process. Glue handles the initial steps, while dbt optimizes the data transformation layer, resulting in a more agile and efficient workflow.
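For instance (source and table names are hypothetical, and the source would be declared in a sources .yml file pointing at data Glue has ingested and catalogued), dbt's staging layer picks up exactly where Glue leaves off:

-- models/staging/stg_events.sql (hypothetical)
-- Reads the raw table that AWS Glue ingested and registered in the Data
-- Catalog, applies light cleanup, and feeds the downstream dbt models.
select
    event_id,
    cast(event_ts as timestamp) as event_at,
    lower(event_type)           as event_type,
    user_id
from {{ source('raw', 'events') }}
where event_id is not null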
S3: dbt™ can also read from and write to AWS S3 (for example, through Redshift Spectrum external tables or the dbt-glue adapter), making it versatile in handling various data storage solutions.
Amazon S3 (Simple Storage Service) is a widely used object storage service, suitable for storing large volumes of unstructured data. dbt’s capability to read from and write to S3 buckets (through the engine it runs against, such as Redshift Spectrum or AWS Glue) adds an extra layer of versatility. This means you’re not confined to using a data warehouse for storage; you can also use S3 for raw or transformed data, giving you more options in how you architect your data solutions. Many organizations use S3 as a data lake to store raw data. dbt’s compatibility with S3 allows it to read this raw data, transform it, and either load it into a data warehouse like Redshift for analytical workloads or write it back to S3 in a more structured format. S3 provides a cost-effective storage solution, especially for long-term data retention. By using dbt to write transformation outputs back to S3, you can optimize costs for storage while maintaining high data quality and accessibility.
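A minimal sketch, assuming a Redshift Spectrum external schema (here called spectrum_raw) already exists over an S3 prefix; dbt reads the raw objects and materializes a cleaned model in the warehouse, and with the dbt-glue adapter results can likewise be written back to the data lake:

-- models/staging/stg_clickstream.sql (hypothetical)
-- Reads raw clickstream files in S3 through a Spectrum external table and
-- materializes a typed, filtered version for downstream models.
select
    session_id,
    user_agent,
    cast(event_time as timestamp) as event_at
from spectrum_raw.clickstream_events
where event_time is not null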
Conclusion
In the fast-paced, data-centric business environment of today, the need for agile, reliable, and effective data analytics solutions cannot be overstated. dbt emerges as a game-changer in this context, offering a robust framework that fosters collaboration, enhances data integrity, and expedites the deployment of data products. Its SQL-first transformation workflow, Git-enabled version control, and natural fit within CI/CD pipelines make it an indispensable tool for data professionals, including data engineers, analysts, and architects.
dbt’s compatibility with AWS services further amplifies its efficacy. When paired with Amazon Redshift, dbt leverages advanced data warehousing capabilities to optimize transformations and analytics. Its synergy with AWS Glue simplifies the ETL process, creating a streamlined workflow from data ingestion to transformation. The ability to interact with AWS S3 offers additional storage flexibility, extending its utility across different storage paradigms.
Despite its learning curve and potential limitations for handling extremely large datasets, the advantages of implementing dbt, especially in conjunction with AWS services, far outweigh the challenges. With its focus on eliminating data silos, enforcing data reliability, enabling self-documenting code, and encouraging collaboration, dbt acts as a catalyst in the evolution of modern data stacks. When integrated into a CI/CD pipeline using platforms like Azure DevOps, it adds another layer of automation and governance, making your data operations not just agile but also resilient.
In summary, dbt should not be seen merely as an option but rather as a strategic asset for any organization serious about accelerating its data capabilities. Its power to transform data engineering practices, coupled with its seamless fit within the broader AWS ecosystem, makes it a compelling choice for modernizing your data stack. Adopting dbt is not just a technological decision; it’s a business imperative for organizations aiming to be data-driven in a competitive landscape.
Appendix
Use Cases / Articles
Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core
dbt has established itself as one of the most popular tools in the modern data stack, and is aiming to bring analytics engineering to everyone. The dbt tool makes it easy to develop and implement complex data processing pipelines, with mostly SQL, and it provides developers with a simple interface to create, test, document, evolve, and deploy their workflows. For more information, see docs.getdbt.com.
dbt primarily targets cloud data warehouses such as Amazon Redshift or Snowflake. Now, you can use dbt against AWS data lakes, thanks to the following two services:
AWS Glue Interactive Sessions, a serverless Apache Spark runtime environment managed by AWS Glue with on-demand access and a 1-minute billing minimum
AWS Lake Formation, a service that makes it easy to quickly set up a secure data lake
In this post, you’ll learn how to deploy a data pipeline in your modern data platform using the dbt-glue adapter built by the AWS Professional Services team in collaboration with dbt Labs.
With this new open-source, battle-tested dbt AWS Glue adapter, developers can now use dbt for their data lakes, paying for just the compute they need, with no need to shuffle data around. They still have access to everything that makes dbt great, including the local developer experience, documentation, tests, incremental data processing, Git integration, CI/CD, and more.
How SafetyCulture scales unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift
SafetyCulture runs an Amazon Redshift provisioned cluster to support unpredictable and predictable workloads. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing the models in Amazon Redshift. To balance the needs of unpredictable and predictable workloads, SafetyCulture used Amazon Redshift workload management (WLM) to flexibly manage workload priorities.
With plans for further growth in dbt Cloud workloads, SafetyCulture needed a solution that does the following:
Caters for unpredictable workloads in a cost-effective manner
Separates unpredictable workloads from predictable workloads to scale compute resources independently
Continues to allow models to be created and modified based on production data
Manage data transformations with dbt in Amazon Redshift
In this post, we demonstrate some features in dbt that help you manage data transformations in Amazon Redshift. We also provide the dbt CLI and Amazon Redshift workshop to get started using these features.
https://aws.amazon.com/blogs/big-data/manage-data-transformations-with-dbt-in-amazon-redshift/
Automating deployment of Amazon Redshift ETL jobs with AWS CodeBuild, AWS Batch, and dbt™
In this post, we show you how to automate the deployment of Amazon Redshift ETL jobs using AWS Batch and AWS CodeBuild. AWS Batch allows you to run your data transformation jobs without having to install and manage batch computing software or server clusters. CodeBuild is a fully managed continuous integration service that builds your data transformation project into a Docker image run in AWS Batch. This deployment automation can help you shorten the time to value. These two services are also fully managed and incur fees only when run, which optimizes costs.
We also introduce a third-party tool for the ETL jobs: dbt™, which enables data analysts and engineers to write data transformation queries in a modular manner without having to maintain the execution order manually. It compiles all code into raw SQL queries that run against your Amazon Redshift cluster to use existing computing resources. It also understands dependencies within your queries and runs them in the correct order. dbt™ code is a combination of SQL and Jinja (a templating language); therefore, you can express logic such as if statements, loops, filters, and macros in your queries. For more information, see dbt™ Documentation.
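As an illustration of that SQL-plus-Jinja combination (model and column names are hypothetical, not taken from the post above), a short loop can pivot a payments table, and dbt compiles the template to plain SQL before running it on Amazon Redshift:

-- models/marts/order_payments_pivoted.sql (hypothetical)
-- A Jinja loop that dbt compiles into one aggregate column per payment method.
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id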
Accelerating Data Teams with dbt Cloud & Snowflake
Modern businesses need modern data strategies, built on platforms that support agility, growth and operational efficiency.
Snowflake is the Data Cloud, a future-proof solution that simplifies data pipelines, so you can focus on data and analytics instead of infrastructure management.
dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines. It transforms data in the warehouse, leveraging cloud data platforms like Snowflake.
In this Quickstart, you will follow a step-by-step guide to using dbt with Snowflake, and see some of the benefits this tandem brings. https://quickstarts.snowflake.com/guide/data_teams_with_dbt_cloud/#0
dbt (Data Build Tool) Overview: What is dbt and What Can It Do for My Data Pipeline?
There are many tools on the market to help your organization transform data and make it accessible for business users. One that we recommend and use often, dbt (data build tool), focuses solely on making the process of transforming data simpler and faster. In this blog we will discuss what dbt is, how it can transform the way your organization curates its data for decision making, and how you can get started with using dbt (data build tool).
https://www.analytics8.com/blog/dbt-overview-what-is-dbt-and-what-can-it-do-for-my-data-pipeline/
Activating ownership with data contracts in dbt
Data mesh, data contracts, and shifting ownership to data producers have been presented as the solution. Data teams have largely bought into this promise, but only a fraction can confidently say that they’ve seen the expected impact from their efforts. One of the reasons for this is that contracts introduce yet another technical and organizational burden for already stretched engineers.
With dbt 1.5 and its support for data contracts, data teams have the opportunity to roll out contracts themselves. While it doesn’t solve the full problem, it provides a way to enforce quality at the intersection of teams that can be rolled out in days instead of months.
https://medium.com/@mikldd/activating-ownership-with-data-contracts-in-dbt-4f2de41c4657