## Introduction

Databricks is emerging as one of the main players in the MLOps and DevOps world. Over the last month, I experienced this first-hand as part of my project at Areto and decided to write a series of hands-on blog posts on how to implement an MLOps pipeline with Databricks. This is an ongoing learning process for me as well: I write each blog post after experimenting with different features and tools, so I like to see this series as collective learning. Any comments and suggestions are welcome.

In this blog post, we'll explore how to build a CI pipeline using the Databricks DBX tool and GitLab. This guide is designed for data engineers and data scientists who want to streamline their data processing workflows. We'll cover key concepts, practical applications, and step-by-step instructions to help you integrate these tools effectively into your data projects.

## Continuous Integration (CI)

**CI/CD (Continuous Integration/Continuous Delivery):** Continuous Integration and Continuous Delivery/Deployment (CI/CD) have become fundamental to modern software development practices. At its core, CI/CD is a method that emphasizes short, frequent updates to software through automated pipelines. While traditionally associated with software development, its principles are increasingly being adopted in data engineering and data science. The essence of CI/CD lies in its ability to automate the stages of software development, particularly building, testing, and deploying code. Continuous Integration begins with regular code commits to a shared repository branch, ensuring collaborative development without version conflicts. Each commit undergoes an automated build and test process, validating changes to enhance the quality and reliability of the final product.

**Databricks:** Databricks has emerged as a premier cloud-based platform, uniting the realms of data engineering, machine learning, and analytics. Databricks excels at handling large-scale data processing and complex analytical tasks, but to leverage its full potential, teams need to navigate its multifaceted environment effectively. Integrating CI/CD processes within Databricks environments can streamline workflows, ensuring consistent, reproducible, and scalable data operations. This integration is crucial for teams aiming to maintain agility and efficiency in their data projects, enabling them to deliver reliable, high-quality data solutions consistently.

**Databricks Workflows:** Databricks Workflows are particularly significant in the context of CI/CD, as they offer a robust platform for automating and orchestrating the data pipelines at the heart of continuous integration and deployment. Workflows allow for the seamless scheduling and execution of tasks, which is essential for maintaining the frequent, automated update cycles characteristic of CI/CD. In a CI/CD pipeline, Databricks Workflows can automatically process and test large datasets, ensuring that data transformations and analyses are consistently accurate and up to date. The diversity of tasks that can be executed within a workflow, such as notebooks for exploratory data analysis, Python scripts for structured data processing, and Python wheels for custom package deployment, aligns well with the varied needs of CI/CD pipelines. By integrating these tasks into CI/CD workflows, data teams can ensure that every aspect of their data processing and analysis is continuously tested and integrated into the larger data strategy.
This integration is key for developing resilient, scalable, and efficient data operations, enabling teams to deliver high-quality, reliable data products rapidly.

**Databricks CLI Extension (DBX):** The Databricks CLI Extension, or DBX, is a pivotal tool for integrating Databricks with CI/CD pipelines, enhancing the automation and management of data workflows. The ability to programmatically control and manipulate data processes is crucial to implementing a CI/CD workflow, and DBX fills this role effectively. It provides a command-line interface for interacting with various Databricks components, such as workspaces, workflows, and clusters, facilitating seamless integration into automated pipelines.

## Hands-on

In our hands-on example, we will create a (very) minimal project that uses dbx to deploy and run a Databricks workflow to manipulate, analyze, and test some data. This practical exercise showcases how Databricks can be used for real-world data analysis tasks. Our CI pipeline plays a crucial role in this process: it automates the deployment of our code as a Databricks workflow and validates its output. This validation step is vital to ensure the accuracy and reliability of our analysis. By the end of this exercise, you'll have a clear understanding of how to set up and run a Databricks workflow within a CI pipeline, and how such a setup helps you analyze and derive insights from large datasets. The example demonstrates not only the technical application of Databricks and CI principles but also the practical benefits of automated data analysis in a business context.

## Development pattern using Databricks GUI

There are various development patterns for implementing our CI pipeline. The common development workflow with dbx is to develop, test, and debug your code in your local environment and then use dbx to batch-run the local code on a target cluster. By integrating a remote repository with Databricks, we can use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab to automate running our remote repo's code on our clusters.

In this tutorial, for the sake of simplicity, we use the Databricks GUI to develop and test our code. We follow these steps:

1. Create a remote repository and clone it into our Databricks workspace. We use GitLab here.
2. Develop the program logic and test it inside the Databricks GUI. This includes Python scripts to build a Python wheel package, scripts to test the data quality using pytest, and a notebook to run pytest.
3. Push the code to GitLab. The git push triggers a GitLab Runner that builds, deploys, and launches our workflows on Databricks using dbx.

## Git integration

As the first step, we configure Git credentials and connect a remote repo to Databricks. Next, we create a remote repository and clone it into our Databricks repos. To allow our GitLab Runner to communicate with the Databricks API through dbx, we should add two environment variables, `DATABRICKS_HOST` and `DATABRICKS_TOKEN`, to our CI/CD pipeline configuration.

To generate a Databricks token, go to User Settings → Developer → Access tokens → Manage → Generate new token in your Databricks workspace. The Databricks host is the URL you see when you log into your Databricks workspace; it looks something like https://dbc-dc87ke16-h5h2.cloud.databricks.com/. The last part of the URL is your workspace ID, and you should leave it out.

Finally, we add the token and host to our GitLab CI/CD settings. For that, open your repo in GitLab and go to Settings → CI/CD → Variables → Add variable.
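Optionally, you can verify that the two variables work before the pipeline relies on them. The snippet below is not part of the tutorial project; it is a minimal sanity-check sketch that assumes the `requests` package is installed and that the token is allowed to list clusters.

```python
# Optional sanity check (not part of the tutorial repo): verify that
# DATABRICKS_HOST and DATABRICKS_TOKEN can reach the Databricks REST API.
import os

import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_TOKEN"]

# The Clusters API is a convenient, read-only endpoint for a connectivity test.
response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()

clusters = response.json().get("clusters", [])
print(f"Connected to {host}; found {len(clusters)} cluster(s).")
```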
## The project skeleton

The project is structured into several folders and key files, each serving a specific purpose:

- **.dbx folder:** contains `project.json`, which defines the configuration of your DBX project. It holds environment settings, dependencies, and other project-specific parameters.
- **conf folder:** contains `deployment.yml`, which outlines the deployment configurations and environment settings. It defines the workflows with their respective parameters and cluster configurations.
- **my_package folder:** our wheel package. This folder includes:
  - a `tasks` subfolder containing the main ETL task script `sample_etl_job.py`. The ETL task loads our dataset and creates two new tables.
  - the `common.py` file, which includes common utilities that provide access to components such as the SparkSession.
- **notebooks folder:** contains two Jupyter notebooks:
  - `explorative_analysis`: plots the distribution of different features in our dataset.
  - `run_unit_test`: used to execute pytest for unit testing.
- **tests folder:** dedicated to testing:
  - `conftest.py`: includes pytest fixtures.
  - `test_data.py`: contains unit tests that validate the data structure in our tables.

In addition to these folders, there are two important files in the root directory:

- `.gitlab-ci.yml`: the configuration file for GitLab's Continuous Integration (CI) service, defining the instructions and commands the CI pipeline should execute.
- `setup.py`: builds our Python wheel package. It defines the package's metadata, dependencies, and build instructions.

```
dbx-tutorial/
├─ .dbx/
│  ├─ project.json
├─ conf/
│  ├─ deployment.yml
├─ my_package/
│  ├─ tasks/
│  │  ├─ __init__.py
│  │  ├─ sample_etl_job.py
│  ├─ __init__.py
│  ├─ common.py
├─ tests/
│  ├─ conftest.py
│  ├─ test_data.py
├─ notebooks/
│  ├─ explorative_analysis
│  ├─ run_unit_test
├─ .gitlab-ci.yml
├─ setup.py
```

## Deployment configuration

**project.json** is crucial for defining your DBX project's configuration. This file can be generated automatically when you run `dbx init`. Check the dbx documentation for more details about the project file reference.

The `profile` option is used for local development. If you run the dbx command inside the CI tool, you need to specify the Databricks host and Databricks token environment variables, and they will overwrite the profile variable. Since we don't want to use the local environment but instead develop and run our code in the Databricks UI, we don't need to specify this option here; dbx will pick the host and token up from the CI/CD settings that we configured earlier.

The files that dbx automatically uploads will be stored in the `artifact_location`, tracked as an ML experiment. You can read more about this in the dbx documentation.

```json
{
  "environments": {
    "default": {
      "profile": "dbx-tutorial",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/dbx-tutorial",
        "artifact_location": "dbfs:/Shared/dbx/projects/dbx-tutorial"
      }
    }
  },
  "inplace_jinja_support": false,
  "failsafe_cluster_reuse_with_assets": false,
  "context_based_upload_for_execute": false
}
```
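Before moving on to `deployment.yml`, it helps to see how the wheel that the workflows will run is built. The tutorial's actual `setup.py` is not reproduced in this post, so the following is only a hedged sketch of what it might contain: the module path `my_package.tasks.sample_etl_job:entrypoint`, the version, and the exact contents of the `local` extra are my assumptions, although a `local` extra of some form is implied by the `pip install -e ".[local]"` commands in the CI configuration below. The key part is the `console_scripts` entry point named `etl_job`, which `deployment.yml` references.

```python
# setup.py -- illustrative sketch; the real file in the tutorial repo may differ.
from setuptools import find_packages, setup

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(exclude=["tests", "tests.*"]),
    setup_requires=["wheel"],
    # Extra dependencies for local development and the CI runner,
    # installed via `pip install -e ".[local]"`.
    extras_require={
        "local": ["pyspark", "pytest", "dbx"],
    },
    entry_points={
        "console_scripts": [
            # "etl_job" is the name referenced by entry_point in deployment.yml;
            # it points at a function (assumed here to be called entrypoint)
            # inside the ETL task module.
            "etl_job = my_package.tasks.sample_etl_job:entrypoint",
        ]
    },
)
```

With an entry point declared like this, dbx can build and upload the wheel, and the `python_wheel_task` in the workflow can call it by name on the cluster.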
**deployment.yml** outlines the deployment configurations and environment settings. Here, we use the Jobs 2.1 API and the wheel task format to define the workflows. Each workflow defines an object inside the Databricks workspace: a job or a Delta Live Tables pipeline.

```yaml
custom:
  existing_cluster_id: &existing_cluster_id
    existing_cluster_id: "1064-xxxxxx-xxxxxxxxx"

environments:
  default: # this should be the same environment name as you defined in project.json
    workflows:
      - name: "etl_job"
        tasks:
          - task_key: "main"
            <<: *existing_cluster_id
            python_wheel_task:
              package_name: "my_package"
              entry_point: "etl_job" # take a look at the setup.py entry_points section for details on how to define an entrypoint
          - task_key: "eda"
            <<: *existing_cluster_id
            notebook_task:
              notebook_path: "/Repos/<your DB username>/dbx-tutorial/notebooks/explorative_analysis"
              source: WORKSPACE
            depends_on:
              - task_key: "main"
      - name: "test_job"
        tasks:
          - task_key: "main"
            <<: *existing_cluster_id
            notebook_task:
              notebook_path: "/Repos/<your DB username>/dbx-tutorial/notebooks/run_unit_test"
              source: WORKSPACE
            libraries:
              - pypi:
                  package: pytest
```

Here, we define two workflows:

- **etl_job**: consists of two tasks.
  - The first task (`main`) is of type `python_wheel_task`. It runs the ETL task using the entry point that we define in the `setup.py` file.
  - The second task (`eda`) is of type `notebook_task`. It runs the `explorative_analysis` notebook after the main task completes successfully. Notice the use of the `depends_on` property.
- **test_job**: consists of a single `notebook_task`. It runs the notebook responsible for executing pytest.

**Note:** in the above example, we run the workflows on an existing cluster. You can also create and run a new cluster for every deployment and launch. For that, you should change the configuration as follows:

```yaml
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "11.3.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "Standard_E8_v3"

environments:
  default:
    workflows:
      - name: "etl_job"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            ....
```

## CI pipeline configuration

Our GitLab CI pipeline automates the testing and deployment of the Databricks project. It consists of two stages: test and deploy. In the test stage, the unit-test-job runs the unit tests by deploying and launching a separate test workflow. The deploy stage, activated upon successful completion of the test stage, handles the deployment of the main ETL workflow. In general, the pipeline follows these steps:

1. Build the project.
2. Push the build artifacts to the Databricks workspace.
3. Install the wheel package on the cluster.
4. Create the jobs in Databricks Workflows.
5. Run the jobs.

```yaml
image: python:3.9

stages: # List of stages for jobs, and their order of execution
  - test
  - deploy

unit-test-job: # This job runs in the test stage.
  stage: test
  script:
    - echo "Running unit tests... This will take about 60 seconds."
    - echo "Code coverage is 90%"
    - pip install -e ".[local]"
    - dbx deploy --deployment-file conf/deployment.yml test_job --assets-only
    - dbx launch test_job --from-assets --trace

deploy-job: # This job runs in the deploy stage.
  stage: deploy # It only runs when the job in the test stage completes successfully.
  script:
    - echo "Deploying application..."
    - echo "Install dependencies"
    - pip install -e ".[local]"
    - echo "Deploying Job"
    - dbx deploy --deployment-file conf/deployment.yml etl_job
    - dbx launch etl_job --trace
    - echo "Application successfully deployed."
    - echo "remove all workflows."
```
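The unit-test-job above deploys and launches the `test_job` workflow, whose notebook task runs the `run_unit_test` notebook. That notebook is not reproduced in this post; the sketch below shows one common way such a notebook could invoke pytest from a Databricks repo. The relative path and the pytest flags are my assumptions, not the tutorial's exact code.

```python
# notebooks/run_unit_test -- illustrative sketch of a notebook cell.
# Assumes the working directory is the notebook's folder inside the repo,
# and that pytest is installed on the cluster (the test_job workflow
# installs it via the libraries section of deployment.yml).
import sys

import pytest

# Avoid writing .pyc files into the read-only repo filesystem.
sys.dont_write_bytecode = True

# Run the tests in the repo's tests/ folder; -p no:cacheprovider avoids
# creating a .pytest_cache directory inside the repo.
retcode = pytest.main(["../tests", "-v", "-p", "no:cacheprovider"])

# Fail the notebook task (and therefore the test stage) if any test fails.
assert retcode == 0, "One or more unit tests failed"
```

The final assert is what propagates a test failure up to the workflow run and, through `dbx launch --trace`, to the GitLab pipeline.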
For workflow deployment and launch, we use two dbx commands:

- `dbx deploy`: deploys the workflow definitions to the given environment.
- `dbx launch`: launches the given workflow by its name in a given environment.

In the unit-test-job, we use the `--assets-only` flag to avoid creating a job definition in Databricks Workflows during deployment. For the launch command, we then use the `--from-assets` flag; if this flag is provided, the launch command searches for the latest deployment that was assets-only. Launched this way, the run does not appear in the Workflows UI; it runs as an untitled job submitted via the Runs Submit API, and you can see it on the Job Runs page of your Databricks workspace. Check out the dbx documentation to read about asset-based workflows and other commands.

If you check the Experiments page of your Databricks workspace, you will find the build artifacts of each run of your CI pipeline.

## Databricks Asset Bundles

Currently, Databricks recommends using Asset Bundles for CI/CD and offers a migration guide from dbx to bundles. If you understand the concepts in this post, it will be easy to switch to bundles. In the next post, I am going to explain how we can convert this project to an Asset Bundle project.

## Resources

- Repository for this tutorial. Make sure you update the `conf/deployment.yml` file with the correct `cluster_id` and DB username.
- dbx documentation
- Databricks Best Practices: Development loop and CI/CD on Databricks with dbx
- Databricks Jobs API
- CI/CD techniques with Git and Databricks Repos
- What is CI/CD on Azure Databricks?