Databricks on AWS - An Architectural Perspective (part 1)

Alberto Jaen

AWS Cloud Engineer

Alfonso Jerez

AWS Cloud Engineer

Databricks has become a reference product in the field of Data Platforms, providing a unified environment for engineering and analytical roles. Because not all organizations have the same types of workloads, Databricks has designed different plans that adapt to different needs, and the chosen plan has a direct impact on the design of the platform architecture.

This series of articles aims to address the integration of Databricks in AWS environments by analyzing the alternatives offered by the product with respect to architectural design. Given the length of the content, the series has been divided into two parts:

First part:

  • Workload plans and types
  • High level architecture
  • Networking
  • Identity and Access Management

 

Second part (coming soon):

  • Security
  • Scalability
  • Logging and monitoring
  • Deployment

Introduction

Databricks was created with the idea of developing a single environment in which different profiles such as Data Engineers, Data Scientists and Data Analysts can work collaboratively, without needing external service providers for the different functionalities each of them requires on a daily basis. Databricks was born from the collaboration of the founders of Spark, who published Delta Lake and MLflow as Databricks products following the open-source philosophy:

Spark, Delta Lake and MLflow partnership

This new collaborative environment made a strong impression at its launch because of the novelties it offered by integrating different technologies:

  • Spark: a distributed processing framework whose features include the ability to query Delta Lake tables with better cost and execution-time ratios than competing engines, optimizing analytical workloads.
  • Delta Lake: a storage layer for Spark. It combines the main advantages of Data Warehouses and Data Lakes by enabling the loading of structured and unstructured data through an enhanced version of Parquet that supports ACID transactions, thus ensuring the integrity of the data in the ETL processes carried out by Spark (a minimal example follows this list).
  • MLflow: a platform for Machine Learning lifecycle management, including experimentation, reusability, deployment and a centralized model registry.
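As a quick illustration of Spark and Delta Lake working together, the following sketch writes and reads a small Delta table. It assumes a Databricks notebook, where the `spark` session is already configured and Delta Lake is available; the path is purely illustrative.

```python
# Minimal sketch, assuming a Databricks notebook where `spark` is preconfigured
# and Delta Lake is available. The path below is illustrative.

# Write a small DataFrame as a Delta table; the transaction log provides ACID guarantees.
df = spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"])
df.write.format("delta").mode("overwrite").save("/tmp/demo/layers")

# Reading the same path returns a consistent snapshot, even with concurrent writers.
spark.read.format("delta").load("/tmp/demo/layers").show()
```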

Glossary

  • All Purpose Compute: designed for collaborative environments where the cluster is used simultaneously by Data Engineers and Data Scientists.
  • S3 Buckets: AWS object storage service.
  • Cross-account role: role made available for Databricks to assume from its own AWS account. It is used to deploy infrastructure and to assume other roles within AWS.
  • Data roles: roles with access/write permissions to S3 buckets that will be assumed by the cluster through the meta instance profile.
  • Instance profile: role that is provided to the cluster with access permissions to S3 buckets and by which all users with access to the cluster will have the same permissions.
  • Jobs Compute: focused on processes orchestrated through pipelines managed by Data Engineers that may involve auto-scaling in certain tasks.
  • Jobs Light Compute: designed for non-critical processes that do not involve a very high computational load.
  • Meta instance profile: role that is provided to the cluster with permissions to assume the data roles.
  • Identity Provider (IdP): entity that maintains the identity information of individuals within an organization.
  • Secure Cluster Connectivity (SCC): SSH reverse tunnel communication between Control Plane and cluster. It allows not having open ports or public IPs in the instances.
  • Security Token Service (STS): web service that allows requesting temporary credentials with limited privileges.
  • Security Assertion Markup Language (SAML): open standard used for authentication. Based on XML, web applications use SAML to transfer authentication data between two entities, the Identity Provider and the service in question.
  • SQL Compute: cluster reserved for SQL queries used to visualize the information stored in the Data Lake.
  • Virtual Private Cloud (VPC): logically isolated virtual network in AWS. 
  • VPC Endpoint: network component that allows connecting a VPC with the different services within AWS through AWS’ own network.
  • Workspace: shared environment to access all Databricks assets. It organizes the different objects (notebooks, libraries, etc…) in folders and manages access to computational resources such as clusters and jobs.

Architecture

High level architecture

Before starting to analyze the different alternatives provided by Databricks with respect to infrastructure deployment, it is useful to know the main components of the product:

High level diagram of the Data Plane Architecture (source: databricks)

Control Plane: hosts the Databricks back-end services needed to provide the graphical interface and the REST APIs for account and workspace management. These services are deployed in an AWS account owned by Databricks.

Data Plane: hosts all the necessary infrastructure for data processing: persistence, clusters, logging services, spark libraries, etc. The Data Plane is deployed in the customer’s AWS account and can be managed by:

  • Databricks (Databricks Managed VPC).
  • The customer (Customer Managed VPC).

DBFS: distributed storage system available for clusters. It is an abstraction on an object storage system, in this case S3, and allows access to files and folders without using URLs.

Databricks REST APIs: REST APIs made available by Databricks through which its resources can be managed programmatically.
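As a hedged illustration of this programmatic management, the sketch below lists the clusters of a workspace through the REST API; the workspace URL and personal access token are placeholders.

```python
import requests

# Minimal sketch: list the clusters of a workspace through the Databricks REST API.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

response = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```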

External Data Sources: possible data sources hosted outside the customer’s AWS account, such as a relational database or object storage service.

Workload plans and types

The price charged by Databricks is tied to the DBUs consumed by the cluster. This parameter reflects the processing capacity consumed and depends directly on the type of instances selected (an approximate calculation of the DBUs consumed per hour is provided when configuring the cluster).

The price per DBU depends on two main factors:

  • Computational factor: the definition of the cluster characteristics (Cluster Mode, Runtime, On-Demand/Spot Instances, Autoscaling, …), which results in the allocation of a particular workload type (Jobs Light Compute, Jobs Compute, SQL Compute or All-Purpose Compute).
  • Architecture factor: the customization of the architecture (e.g. a Customer Managed VPC), which in some aspects requires a Premium or even Enterprise subscription; the cost of each DBU increases with the level of the subscription.

The combination of both computational and architectural factors will define the final cost of each DBU per work hour.
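As a rough, illustrative calculation (the figures below are assumptions, not a quote), the final cost of a run can be estimated as follows:

```python
# Rough cost sketch for a single cluster run (figures are illustrative, not a quote).
# Total cost = DBUs/hour * hours * price per DBU (plan-dependent) + EC2 instance cost.

dbus_per_hour = 4.0      # estimate shown by Databricks when configuring the cluster
hours = 8                # duration of the workload
price_per_dbu = 0.15     # e.g. Jobs Compute under the Premium plan (see table below)
ec2_cost_per_hour = 1.2  # on-demand price of the underlying instances (AWS side)

databricks_cost = dbus_per_hour * hours * price_per_dbu
aws_cost = ec2_cost_per_hour * hours
print(f"Databricks: ${databricks_cost:.2f}  AWS: ${aws_cost:.2f}  Total: ${databricks_cost + aws_cost:.2f}")
```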

PLAN DESCRIPTIONS
  • Standard: One platform for your data analytics and ML workloads
  • Premium: Data analytics and ML at scale across your business
  • Enterprise: Data analytics and ML for your mission critical workloads

PRICE PER DBU
                            Standard      Premium       Enterprise
  Jobs Light Compute        $0.07/DBU     $0.10/DBU     $0.13/DBU
  Jobs Compute              $0.10/DBU     $0.15/DBU     $0.20/DBU
  SQL Compute               N/A           $0.22/DBU     $0.22/DBU
  All-Purpose Compute       $0.40/DBU     $0.55/DBU     $0.65/DBU
  Serverless SQL Compute    N/A           $0.55/DBU     $0.55/DBU

Cost per DBU according to the computational and architectural factors

The main features associated with each type of workload are summarized below:

WORKLOAD TYPES: Jobs Light Compute / Jobs Compute / All-Purpose Compute

FEATURES COMPARED:
  • Managed Apache Spark
  • Job scheduling with libraries
  • Job scheduling with notebooks
  • Autopilot clusters
  • Databricks Runtime for ML
  • Managed MLflow
  • Delta Lake with Delta Engine
  • Interactive clusters
  • Notebooks and collaboration
  • Ecosystem integrations

Characteristics by type of workload

The following summary reflects the features whose availability depends on the chosen plan, focusing on the integration with AWS. These features can be a determining factor in the selection of the plan.

PLANS: Standard / Premium / Enterprise

Data Plane Control
  • Databricks Managed VPC
  • Customer Managed VPC

Control Plane Networking
  • Cluster - Control Plane (Public Connection)
  • Cluster - Control Plane (Private Link)

AWS S3 Permissions Management
  • Instance Profile
  • Credentials Passthrough (SCIM)

Control Plane Encryption
  • Default DMK Key
  • DMK & CMK Keys

At Rest Encryption
  • S3 Bucket
  • S3 Bucket - EBS
Features related to the subscription plan.

Networking

The following diagram shows the communication channels between Control Plane and Data Plane:

Control Plane - Data Plane connections (source: databricks)

Databricks Managed VPC

This alternative is characterized by the Data Plane being managed by Databricks. The infrastructure is deployed through a cross-account role that Databricks assumes in order to set up and configure the necessary resources (a sketch of this role's trust policy is shown after the list). The implemented connections are as follows:

  • Cluster-root bucket connection through Gateway Endpoint
  • Cluster-STS and cluster-Kinesis connection via public network
  • Connection to metastore hosted on the Control Plane via NAT Gateway
  • Connecting to REST APIs through the NAT Gateway
  • Secure cluster connectivity: connection to the cluster control infrastructure with reverse SSH tunnel through the NAT Gateway
  • Cluster instances have SSH access to any IP
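The cross-account role mentioned above could be created with a trust policy similar to the following sketch; the Databricks principal and external ID are placeholders, since the real values are provided by Databricks during workspace creation.

```python
import json
import boto3

# Sketch of the cross-account role that Databricks assumes to deploy the Data Plane.
# The Databricks principal and external ID are placeholders provided by Databricks.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::<databricks-account-id>:root"},  # placeholder
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<databricks-external-id>"}},  # placeholder
    }],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="databricks-cross-account-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```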

Customer Managed VPC [1]

This alternative is characterized by the fact that the Data Plane is managed by the customer. The advantages of this alternative are the following:

  • Greater control over, and limitation of, the permissions required in the cross-account role used by Databricks. For example, this role no longer needs permission to create a VPC.
  • Management of the subnets within the VPC, so that an address range can be chosen to suit the use case.
  • Consolidation of VPCs, where several workspaces can use the same VPC to facilitate cluster management, leading to lower cost. It is important to note that multiple subnets within the same AZ cannot be made available to the same workspace.
  • Possibility of limiting outgoing connections through the instances' security groups, applying firewalls or provisioning endpoints so that all communication goes through internal channels.
  • Inclusion of VPC Endpoints for AWS STS and Kinesis.
  • Private Links for private communication with the Control Plane.
  • Possibility of implementing the metastore with AWS Glue in the customer's account (explained in the second article).

The details of the connections when implementing an internal metastore with Glue, VPC Endpoints for STS and Kinesis, and private connections are as follows (a provisioning sketch for these endpoints follows the list):

  • Cluster-root bucket connection via Gateway Endpoint
  • Cluster-STS and Cluster-Kinesis connection via Interface Endpoints
  • Connecting to REST APIs through Private Link
  • Secure cluster connectivity: connection to the cluster control infrastructure with reverse SSH tunnel via Private Link
  • Cluster instances have access only to the private network and its endpoints.
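The STS, Kinesis and S3 endpoints referenced above could be provisioned with boto3 along these lines; all VPC, subnet, security group and route table IDs below are placeholders.

```python
import boto3

# Hedged sketch: provisioning the endpoints mentioned above in a Customer Managed VPC.
ec2 = boto3.client("ec2", region_name="eu-west-1")

# Gateway endpoint so the cluster reaches the root S3 bucket without leaving AWS.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",                        # placeholder
    ServiceName="com.amazonaws.eu-west-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],               # placeholder
)

# Interface endpoints for STS and Kinesis, keeping both connections on the private network.
for service in ("sts", "kinesis-streams"):
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",                     # placeholder
        ServiceName=f"com.amazonaws.eu-west-1.{service}",
        VpcEndpointType="Interface",
        SubnetIds=["subnet-0123456789abcdef0"],            # placeholder
        SecurityGroupIds=["sg-0123456789abcdef0"],         # placeholder
        PrivateDnsEnabled=True,
    )
```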

For both Databricks Managed VPC and Customer Managed VPC, secure cluster connectivity uses a TLS certificate hosted in a HashiCorp Vault on the Control Plane.

The following image shows how, by opting for a Customer Managed VPC, the same VPC can be reused across different workspaces:

Diagram Databricks Managed VPC vs Customer Managed VPC (source: databricks)

It is important to note that a configuration transition from a Databricks Managed VPC to a Customer Managed VPC is not possible.

Connections through private network (Private Links)

When opting for a Customer Managed VPC, the secure connection of the cluster can be made through an internal channel by provisioning a Private Link between the Control Plane and the Data Plane (back-end). In the same way, a Private Link can be enabled so that the Data Plane reaches the REST APIs (back-end).

In addition, a transit VPC can also be enabled with a Private Link and Site-to-Site VPN through which the user will be able to make a private connection to the Control Plane (front-end).

The requirements to be able to deploy these Private Links are as follows:

  • Use SCC (Secure Cluster Connectivity, i.e. the reverse SSH tunnel between the cluster and the Control Plane). This is the channel used by default.
  • Contact a Databricks representative in order to configure the endpoints in your infrastructure.

All these connections can be seen in the following image:

Private Link implementation diagram (source: databricks)

The official documentation [2] describes the necessary steps to follow in order to establish these connections. 
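As a hedged illustration of one of those steps, the sketch below registers an AWS VPC endpoint with Databricks through the Account API; the account ID, credentials and endpoint ID are placeholders, and the exact workflow is the one described in [2].

```python
import requests

# Hedged sketch: registering an AWS VPC endpoint with Databricks so it can be used
# for Private Link (back-end or front-end). All values below are placeholders.
ACCOUNT_ID = "<databricks-account-id>"         # placeholder
AUTH = ("<account-admin-user>", "<password>")  # placeholder, basic auth on the Account API

resp = requests.post(
    f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/vpc-endpoints",
    auth=AUTH,
    json={
        "vpc_endpoint_name": "scc-backend-endpoint",
        "aws_vpc_endpoint_id": "vpce-0123456789abcdef0",  # placeholder
        "region": "eu-west-1",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```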

If Private Links are enabled on both the front-end and the back-end, Databricks can be configured to reject any connection made through public channels. In that case the default metastore hosted on the Control Plane cannot be used, because AWS does not yet support JDBC connections through Private Link; an external metastore would be needed, or one implemented with Glue on the Data Plane. See the Metastore Possibilities section in the next article for more information.

Identity and Access Management

Single sign-on (SSO)

Single Sign-On (SSO) allows users to authenticate through an Identity Provider (IdP) provided by the organization. SAML 2.0 support is required.

The two possible alternatives when using SSO are as follows:

  • Add users via their IdP and enable automatic user creation. If the user’s account does not exist at the time of login, it will be created automatically.
  • Add users in Databricks manually and disable automatic user creation. If the user’s account does not exist at the time of login, the user will not be able to log in.
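For the second alternative, one way to add users manually is through the SCIM API; the following is a hedged sketch in which the workspace URL, token and user name are placeholders.

```python
import requests

# Hedged sketch: creating a user manually through the SCIM API.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "jane.doe@example.com",                       # placeholder
    },
    timeout=30,
)
resp.raise_for_status()
```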

At the time of writing, the IdPs supported are as follows:

  • Microsoft Windows Active Directory
  • AWS SSO
  • Microsoft Azure Active Directory
  • Google Workspace (SSO v1.0 & v2.0)
  • Okta SSO
  • OneLogin SSO
  • Ping Identity SSO

More information can be accessed through the following link [3].

AWS S3 Access Control

This section describes the different alternatives for managing user access to the S3 buckets within the customer's infrastructure.

 

Instance profile

This method is based on attaching an instance profile to the EC2 instances that make up the cluster. It is characterized by the fact that all users with access to the cluster will have the same permissions, namely those of the instance profile.

The steps to be completed in order to perform this configuration are as follows:

  1. Create instance profile in AWS with the necessary permissions to access the buckets. It must be configured to trust the cross-account role.
  2. Add to the cross-account role the ability to assume this role.
  3. Register the instance profile within Databricks through the Databricks UI or API call.
  4. Launch a cluster with the instance profile attached.

It is important to note that Databricks checks whether it has the necessary permissions to assume the instance profile when it is registered in Databricks. This check is a dry run in which an attempt is made to launch an instance with this role without actually deploying it. If the cross-account role has tag restrictions (e.g. Vendor: "Databricks"), this check will fail because the dry run is executed without any tags.
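A hedged sketch of step 3, registering the instance profile through the REST API (workspace URL, token and ARN are placeholders):

```python
import requests

# Hedged sketch of step 3: registering the instance profile through the REST API.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/instance-profiles/add",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/databricks-s3-access",  # placeholder
        # "skip_validation": True,  # option to bypass the dry run described above when it fails due to tag restrictions
    },
    timeout=30,
)
resp.raise_for_status()
```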

Credentials passthrough (SCIM) 

This mechanism allows permissions to be defined at the Databricks user/group level through a meta instance profile associated with the cluster instances. The big difference between this alternative and the one discussed in the previous section is that different permissions can be granted to different users/groups using the same cluster.

The following image shows this relationship in more detail:

Diagram of the Meta Role relationship with respect to the Data Roles that can be assumed for each S3 Bucket (source: databricks)

Therefore, the cluster instances assume the meta instance profile that acts as a container for the different data roles that can be assumed. These data roles contain the permissions to access the different buckets. On the other hand, the SCIM API will be used to define which user groups can assume the different data roles embedded in the meta instance profile.

It is important to note that it is not possible to assume several data roles simultaneously in the same notebook. In addition, the instance profile must be registered as a meta instance profile, as opposed to the previous section where it was registered as an instance profile.
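A hedged sketch of granting a data role to a group through the SCIM API could look like the following; the workspace URL, token, group ID and role ARN are placeholders, and the exact payload is the one described in [4].

```python
import requests

# Hedged sketch: allowing a Databricks group to assume a data role through the SCIM API.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder
GROUP_ID = "<scim-group-id>"                                     # placeholder

resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/preview/scim/v2/Groups/{GROUP_ID}",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    json={
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{
            "op": "add",
            "path": "roles",
            "value": [{"value": "arn:aws:iam::<account-id>:role/data-role-bucket-a"}],  # placeholder
        }],
    },
    timeout=30,
)
resp.raise_for_status()
```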

All the information on this section can be found in the following link [4].

Credentials passthrough (SSO/SAML)

If the organization has an Identity Provider, credentials passthrough can be combined with SSO so that AWS IAM federation maintains the mapping between users and IAM roles within the Identity Provider. This is interesting for organizations that want to keep centralized management of their users and the privileges they have.

The following diagram explains the workflow:

AWS IAM Federation application diagram in Databricks (source: databricks)

  1. Configure the IdP with the AWS account so that the IdP can control which roles can be assumed.
  2. Users log in to Databricks via SAML SSO; the permissions on the roles are defined by the IdP.
  3. Databricks calls STS and assumes the roles for the user by sending the SAML response and obtaining temporary tokens (see the sketch after this list).
  4. When a user accesses S3 from a Databricks cluster, the Databricks runtime uses the temporary tokens so that the user assumes the role indicated by the SAML response.
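Step 3 of this workflow corresponds to an STS call similar to the following hedged sketch; the role ARN, IdP ARN and SAML assertion are placeholders, and in Databricks this exchange is performed by the platform itself rather than by user code.

```python
import boto3

# Hedged sketch of step 3: exchanging the SAML response for temporary AWS credentials via STS.
sts = boto3.client("sts")

response = sts.assume_role_with_saml(
    RoleArn="arn:aws:iam::<account-id>:role/data-role-bucket-a",        # placeholder
    PrincipalArn="arn:aws:iam::<account-id>:saml-provider/<idp-name>",  # placeholder
    SAMLAssertion="<base64-encoded-saml-response>",                     # placeholder
    DurationSeconds=3600,
)
credentials = response["Credentials"]  # temporary AccessKeyId / SecretAccessKey / SessionToken
```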

This approach therefore has certain similarities with the previous one: both need a meta instance profile assumed by the cluster instances and data roles with access permissions to the S3 buckets. The difference is that here the SAML response indicates which users can assume the different data roles, whereas with Credentials Passthrough (SCIM) it is Databricks itself that defines this mapping.

All information on this section can be found in the following documentation [5].

References

[1] Customer Managed VPC Databricks Guide. [link] (November 04, 2021)

[2] Enable AWS Private Link. [link] (December 17, 2021)

[3] SSO Set up Guide. [link] (September 28, 2017)

[4] Access S3 buckets using IAM credential passthrough with SCIM. [link] (December 3, 2017)

[5] Access S3 buckets using IAM credential passthrough SAML 2.0 federation. [link] (June 11, 2017)

AWS Cloud Engineer

I started my career with the development, maintenance and operation of multidimensional databases and Data Lakes. From there I became interested in data systems and cloud architectures, getting certified as an AWS Solutions Architect and joining the Cloud Native practice within Bluetab.

I am currently working as a Cloud Engineer developing a Data Lake with AWS for a company that organizes international sporting events.

AWS Cloud Engineer

During the last few years I have specialized as a Data Scientist in different sectors (banking, consulting, …), and the decision to switch to Bluetab was motivated by an interest in specializing as a Data Engineer and starting to work with the main Cloud providers (AWS, GCP and Azure).

I thus joined the Data Engineer practice group and collaborate on real data projects such as the current one with the Olympics.