Exploring Data Lake using Azure Synapse (or Databricks) — Azure AD Passthrough for Data Access Control

Inderjit Rana
Microsoft Azure

--

Objective

Lakehouse architectures are becoming popular in the cloud, where data is queried directly from the Data Lake instead of being loaded into a Data Warehouse. This article is specific to the Azure platform, and I highlight the benefit of Azure AD Passthrough for controlling access to data at the core Data Lake layer in such Lakehouse architectures.

Background — Data Exploration Tool Choices in Azure

As I have described in my previous posts, the Azure platform follows a one-size-does-not-fit-all approach and makes available a variety of tools that can be used for data lake exploration based on skills, usage patterns and cost requirements. Azure Synapse is the key service for data and analytics; it provides quite a few compute options to query data in the data lake, and in this post I will focus on the following:

  1. SQL Serverless Pool
  2. SQL Dedicated Pool
  3. Synapse Spark

The primary focus of this post is Azure Synapse, but it would be incomplete to leave out Azure Databricks, a premium Spark offering nicely integrated into the Azure platform. Azure Databricks is outside the Synapse umbrella but is another great option for Data Lake exploration; I will touch on it briefly and refer to a blog post that covers Azure AD Passthrough for Databricks.

If you are new to Azure Synapse you might want to check out my other article, Data Lake or Data Warehouse or a Combination of Both — Choices in Azure and Azure Synapse, where I share my views on Data Lake and Data Warehouse patterns along with guidance on choosing among the various available components in a data architecture.

Use Case Context — What problem are we solving here?

I have observed a common practice in large enterprises where data is collected by one team (say a Data Producer/Collector team) from various transactional stores into a Data Lake, and numerous downstream teams then consume that data for analytics use cases. In such scenarios it can be hard to standardize on a single tool for Data Lake exploration:

  • One set of users might come from a SQL background and be more comfortable with SQL queries; their usage might be ad-hoc in nature, where the Synapse SQL Serverless pay-per-query cost model is the ideal fit.
  • Another set of users might come from the Big Data world and prefer Spark with Python, Scala or R (Spark also makes Spark SQL available).
  • Maybe a particular team already has a Synapse SQL Dedicated Pool, prefers its predictable costs, and once in a while needs to query some datasets from the data lake directly using SQL (External Tables) without building data pipelines to load the data into the data warehouse.

The bottom line is that the data producer team might not be able to control the consumption tooling, so the question is how to set up access control for consumers. Does security need to be replicated in each querying tool? This is where Azure AD Passthrough comes in handy, controlling access to data at the core irrespective of the tooling used to query it. The benefit of using Azure AD Passthrough is a central spot for better visibility, maintainability and auditing of data lake assets, which is a huge plus, especially for large organizations.

Azure AD Passthrough — Definition

When you use the Azure platform you already have Azure Active Directory (Azure AD) as the identity and access management system, and end users connect to Azure Synapse and Azure Databricks with their Azure AD credentials. When these tools execute queries against a Data Lake built on ADLS Gen2 (Azure Storage), they have the capability to pass the end user's identity on to the storage layer, where the access control rules set up at the storage layer are honored. This capability of the client to pass the identity of the end user to the storage layer is referred to as Azure AD Passthrough.

Solution Architecture

The variety of tools available for Data Lake exploration is nice, but maintaining visibility and control over who has access to what can be challenging if access control needs to be set up in each individual tool used for querying. This is where it is a luxury to have Azure AD Passthrough, because you can enforce access control at the core data layer irrespective of which of the above-mentioned tools is used to query the data. It is a common requirement in regulated industries to audit access to data and be able to answer who is accessing a specific dataset; all of this comes pretty much out of the box because of Azure AD integration at the core data layer when Azure AD Passthrough is used. The following diagram depicts the idea.

Data Lake Exploration with various tools — Data Access Control Centralized with Azure AD Passthrough

Please note that being able to use Azure AD Passthrough is great, but there will be valid cases where you might want to break away from the purist approach. In this post I am just shedding light on the benefits so that you can weigh the pros and cons appropriately and make better decisions for your needs. For example:

  • In the Power BI Import model, data is cached inside Power BI for best performance; in this implementation data is loaded into Power BI using one set of credentials and end user security is enforced at the Power BI level. Azure AD is still used at the Power BI level, but you are breaking away from Azure AD Passthrough at the core data storage layer.
  • Power BI DirectQuery is the other mode, where Azure AD Passthrough is honored but performance might be slower than Import mode.

So, it is up to you to assess whether the performance difference is worth breaking the Azure AD Passthrough nicety.

Implementation Tips — Environment Setup

My previous article, Azure Synapse Environment Setup Key Considerations — Get started using ARM Template Provisioning Scripts, documents this in detail, but the following is a summary of items to keep in mind when implementing an Azure Synapse environment:

  • The general recommendation is to use a Managed VNET for Synapse.
  • The Storage Account default setting of Allow All Networks is usually switched to Selected Networks in enterprise environments.
  • When Storage Accounts are used with Selected Networks, Managed Private Endpoints to the Storage Account need to be created in the Synapse Managed VNET so that Synapse Spark can talk to the Storage Account.
  • When Storage Accounts are used with Selected Networks, Resource Access Rules on the Storage Account need to be set to allow connections from the Synapse SQL Dedicated and SQL Serverless pools.
  • Invest time in learning the access control model (RBAC + ACL) for Azure Data Lake Storage Gen2.
  • Azure Synapse can use multiple Storage Accounts; consider using a secondary Storage Account for data rather than the primary Storage Account.
  • Invest effort in learning Azure Synapse access control; you want to avoid situations where the end user does not have access to data in the Storage Account but does have access to the Synapse Managed Identity, which has permission on the Storage Account, as that would break Azure AD Passthrough.

Azure AD Passthrough for Azure Databricks is covered in detail in the article here (see Pattern 3 towards the end), so I will not go into those details here; the only thing I will add is related to network security. If the Storage Account is used with the Selected Networks setting, you will need to make sure Databricks is created in your VNET (referred to as VNET Injection); either of the two methods, a VNET Service Endpoint or a Private Endpoint for the Storage Account, should open the network path of communication between the Storage Account and Databricks.

Implementation Tips — SQL Query Setup

There are a few different methods to query data from the data lake, but they fall into two categories:
1. Openrowset — This is only available in Synapse SQL Serverless; it is good to keep in mind that views can be created to hide the OPENROWSET call (an example follows the first query below).
2. External Table — This can be used with both SQL Serverless and SQL Dedicated Pools.

Please review the public documentation section here for SQL Serverless examples (OPENROWSET, views over OPENROWSET, as well as External Tables) and here for SQL Dedicated Pool examples, which cover External Tables only.

The following example OPENROWSET query for SQL Serverless does not specify any credentials because the credentials of the end user executing the query are passed all the way through to the storage layer; the user could be running the query from the Synapse Studio web interface or from SSMS, logged in with an Azure AD identity either way.

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://xxxxstg.dfs.core.windows.net/data/dataset1/airports.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result];
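
As mentioned above, a view can be created to hide the OPENROWSET call from consumers. A minimal sketch, assuming a user database in the SQL Serverless pool (views cannot be created in master) and the same illustrative file path as the query above:

CREATE VIEW dbo.vwAirports
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://xxxxstg.dfs.core.windows.net/data/dataset1/airports.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result];

End users then query the view like any table (SELECT TOP 100 * FROM dbo.vwAirports), and it is still their own Azure AD identity that reaches the storage layer.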

The External Table method of querying always requires creation of an External Data Source; the OPENROWSET method can also be used with an External Data Source. Although most External Data Source creation examples show the data source set up with a credential object, the key point to note is that you can create an External Data Source without specifying any credentials at all; in that case the end user's identity gets passed to the storage layer, resulting in Azure AD Passthrough, which is the desired behavior if your intent is to use Azure AD Passthrough for access control at the core storage layer. This is true for both SQL Serverless and SQL Dedicated Pools.

SQL Serverless Example with Data Source for Azure AD Passthrough

CREATE EXTERNAL DATA SOURCE secondarystorage
WITH ( LOCATION = 'https://xxxxstg.dfs.core.windows.net/data' );

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'dataset1/airports.csv',
    DATA_SOURCE = 'secondarystorage',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2
)
WITH (
    airport_id int,
    city nvarchar(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    [state] nvarchar(250) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    [name] nvarchar(250) COLLATE Latin1_General_100_CI_AS_SC_UTF8
) AS [rows];

SQL Dedicated Pool example with Data Source for Azure AD Passthrough

CREATE EXTERNAL DATA SOURCE secondarystorage
WITH ( LOCATION = 'abfss://data@xxxxstg.dfs.core.windows.net', TYPE = HADOOP );
GO
CREATE EXTERNAL FILE FORMAT AirportsCsv
WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', FIRST_ROW = 2 ) );
GO
CREATE EXTERNAL TABLE [dbo].[ExtAirports]
(
    airport_id int,
    city varchar(50),
    [state] varchar(100),
    airport_name varchar(100)
)
WITH
(
    LOCATION = 'dataset1/airports.csv',
    DATA_SOURCE = secondarystorage,
    FILE_FORMAT = AirportsCsv
);
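
Once created, the external table can be queried like any regular table, and (per the note above about credential-less data sources) it is still the querying user's Azure AD identity that is presented to the storage layer:

SELECT TOP 100 * FROM [dbo].[ExtAirports];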

Troubleshooting Tips

In case the queries above are not working as expected, please double-check the following:

  1. The end user executing the query has the appropriate RBAC and/or ACL permissions on the Storage Account.
  2. If the Storage Account is set to Selected Networks only, then the Synapse Workspace must be allowed using Resource Access Rules as shown in the screenshot below.
Network Protected Storage — Allow connections from Synapse Workspace
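
If a query still fails, it can also be worth confirming that the external data source in use was not created with a database scoped credential, since a stored credential would be used in place of the end user's identity and passthrough would not apply. A quick check against the SQL metadata catalog (a data source created for passthrough should have no credential associated with it):

SELECT name, location, credential_id
FROM sys.external_data_sources;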

Implementation Tips — Synapse Spark

This is pretty straightforward, but if you are running into errors:

  • Storage Account network security is the most likely culprit; please check the Environment Setup section above. You need a Managed Private Endpoint to the Storage Account if the Storage Account is set up with Selected Networks instead of All Networks.
  • Missing permissions on the Storage Account for the user may be the other reason.
Synapse Spark — Azure AD Passthrough
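
For reference, a Synapse Spark notebook needs no credential configuration to read from the data lake; the signed-in notebook user's Azure AD identity is passed through to storage. A minimal Spark SQL sketch (run in a SQL cell of a notebook; the table name and folder path are illustrative placeholders, assuming airports.csv sits under dataset1 as in the earlier examples):

CREATE TABLE airports
USING CSV
OPTIONS (header = 'true', inferSchema = 'true')
LOCATION 'abfss://data@xxxxstg.dfs.core.windows.net/dataset1/';

SELECT * FROM airports LIMIT 100;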
