Databricks Part 8: Mounting Azure Data Lake Storage Gen2 (ADLS Gen 2) to DBFS

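DBFS uses dbutils to mount storage services: you can mount Azure Data Lake Storage Gen2 and Azure Blob Storage accounts into DBFS. A mount is a pointer to the Data Lake Storage Gen2 location, so no data is synced to local storage, yet users can still access the data in the remote file system. In effect, a mount creates a shared file system.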

Mounting requires five basic prerequisites:

Azure Data Lake Storage Gen2 account
Azure Application
Azure Key Vault
Databricks Workspace
Secret Scope

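I. Create an Azure Data Lake Storage Gen2 Account

In the Azure Portal, search for the Storage Accounts service and create a Data Lake V2 account.

1. Detailed steps for creating the Data Lake V2 account

Step 1: Configure the Basics tab

Select the Subscription and Resource group for the storage account.

Set Account Kind to StorageV2 (General purpose v2).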

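Step 2: Networking, keep the default values

Step 3: Data Protection, keep the default values

Step 4: Configure the Advanced tab

Enable "Hierarchical namespace". This feature is specific to Data Lake Storage Gen2 and gives the account a hierarchical file system.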

Step 5: Review + Create

2. Create a file system for Data Lake Storage Gen2

Open the resource page of the Data Lake Storage Gen2 account and select "Storage Explorer" under "Tools and SDKs".

In Storage Explorer, right-click CONTAINERS and select "Create file system".

A file system is essentially a top-level directory (a container), and you can create subdirectories inside it.
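As a small illustration of how a file system and its subdirectories are addressed, the sketch below builds the abfss:// URI that the mount command later in this article uses as its source. The helper name abfss_uri and the sample account, file system, and directory names are hypothetical; only the URI layout is taken from the mount command.

```python
def abfss_uri(file_system: str, storage_account: str, directory: str = "") -> str:
    """Build the ABFS URI for a file system (container) and an optional subdirectory."""
    if not file_system or not storage_account:
        raise ValueError("file_system and storage_account are required")
    base = f"abfss://{file_system}@{storage_account}.dfs.core.windows.net/"
    # Strip slashes so that "raw/2020/" and "/raw/2020" yield the same URI.
    return base + directory.strip("/")


# Hypothetical names, for illustration only.
print(abfss_uri("myfilesystem", "mystorageaccount"))
print(abfss_uri("myfilesystem", "mystorageaccount", "/raw/2020/"))
```

Passing a directory mounts only that folder rather than the whole file system.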

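II. Register an Application

To connect ADLS Gen 2 and Azure Databricks, we register an Azure AD application, which creates a service principal. The application is then used to generate an authentication key, called the Client Secret, which is stored in an Azure Key Vault instance.

1. Register the app

In the Azure Portal, search for the "Azure Active Directory" service, select "App registration" on the Overview page, and register an app.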

Enter the Name vic_test_app for the app and click "Register" to register it.

After registration completes, click "App registrations", find the app under "Owned applications", and click the newly registered app, vic_test_app.

The overview page of vic_test_app shows two important IDs for the app instance:

Application (client) ID: the ID of the registered app instance.

Directory (tenant) ID: the ID of the Azure AD tenant in which the app is registered.
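The Directory (tenant) ID is later embedded in the OAuth token endpoint used by the mount configuration. A minimal sketch of that step follows; the helper name oauth_token_endpoint and the all-zero sample ID are hypothetical, and the GUID format check is an added safeguard, not something the mount command itself requires.

```python
import re

# Azure AD tenant IDs are GUIDs: 8-4-4-4-12 hexadecimal digits.
_GUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def oauth_token_endpoint(tenant_id: str) -> str:
    """Return the OAuth 2.0 token endpoint for the given Directory (tenant) ID."""
    tenant_id = tenant_id.strip()
    if not _GUID.match(tenant_id):
        raise ValueError(f"not a GUID-shaped tenant ID: {tenant_id!r}")
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"


# Placeholder tenant ID, for illustration only.
print(oauth_token_endpoint("00000000-0000-0000-0000-000000000000"))
```

Validating the ID early catches the common mistake of pasting the Application (client) secret or a display name where the tenant ID belongs.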

2. Add an app secret (authentication key) to the app

Generate an authentication key for the app (also called an application password, client secret, or application secret); this key is used to authenticate as the app. Open the "Certificates & secrets" page and click "+ New client secret" to create a new client secret.

Enter a Description, select the expiration "In 1 year", and click "Add" to generate the new client secret.

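When you click Add, the client secret (authentication key) is displayed. Copy the Value field of the client secret right away: this is the only chance to do so, and the value can no longer be viewed once you leave the page or perform another operation.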

3. Data obtained

At this point we have three values from the registered app: the Application (client) ID, the Directory (tenant) ID, and the Client Secret value.

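III. Grant the Service Principal Access to the Data Lake V2 Account

To access data in the storage account, the service principal (created automatically when the app was registered) must be assigned an access role.

Go to the Azure Portal home page and open the Data Lake Storage Gen2 account under Storage Accounts. Click Access Control (IAM), then click "+ Add" and select "Add role assignment". On the "Add role assignment" page, select "Storage Blob Data Contributor" in the Role list, select "Azure AD user, group, or service principal" in the "Assign access to" list, pick the previously registered app in the "Select" list, and click "Save" to grant the permission.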

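IV. Create a Key Vault

The Key Vault service securely stores secrets such as keys, passwords, and certificates, so we store the Client Secret obtained from the registered app in Key Vault.

1. Create the Key Vault

Once the Key Vault has been created, add a Secret to it.

2. Save the Secret

Give the Secret a Name and store the Client Secret obtained from the registered app as its Value.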

3. Data obtained from Key Vault

In the Key Vault's Settings, click "Properties" to find the Vault URI and the Resource ID.

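V. Create an Azure Key Vault-backed Secret Scope

Databricks manages secrets through secret scopes. A secret scope is a collection of secrets, and each secret is uniquely identified by its name.

Step 1: Navigate to the Create Secret Scope page

Using your Databricks instance URL, navigate to the Create Secret Scope page. Note that this URI is case sensitive.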

https://<databricks-instance>#secrets/createScope

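Step 2: Enter the Secret Scope properties

Enter the Scope Name (Databricks treats secret scope names as case insensitive). Both the DNS Name and the Resource ID must be copied from the Key Vault.

The DNS Name is the Vault URI shown in the Key Vault properties.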
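Once the scope exists, it can be worth confirming from a notebook that the scope and the secret name are visible before attempting a mount. The sketch below is one way to do that under stated assumptions: the helper name secret_is_available is hypothetical, and the function takes the notebook's dbutils object as a parameter so it relies only on dbutils.secrets.listScopes() and dbutils.secrets.list().

```python
def secret_is_available(dbutils, scope_name: str, key_name: str) -> bool:
    """Return True if the scope is listed and contains a secret named key_name."""
    # Names are compared exactly as Databricks lists them.
    scope_names = {scope.name for scope in dbutils.secrets.listScopes()}
    if scope_name not in scope_names:
        return False
    key_names = {meta.key for meta in dbutils.secrets.list(scope_name)}
    return key_name in key_names


# In a Databricks notebook, where dbutils is predefined:
# secret_is_available(dbutils, "<scope-name>", "<key-name-for-service-credential>")
```

Only secret names are listed here; the secret values themselves are never printed.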

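VI. Mount Data Lake Storage Gen2

By creating the Azure Data Lake Storage Gen2 file system, registering the app, creating the Key Vault, and creating the Secret Scope, we have completed all preparation for mounting Data Lake Gen2 to DBFS and collected the following data: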

Client ID (a.k.a. Application ID)
Client Secret (a.k.a. Application Secret)
Directory ID (a.k.a. Tenant ID)
Databricks Secret Scope Name
Key Name for Service Credentials (from Azure Key Vault, it is the secret’s name)
File System Name
Storage Account Name
Mount Name
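To keep these values together, one option is a small settings object that derives the mount source, the mount point, and the OAuth configuration from them. This is only a sketch: the class name MountSettings, its field names, and the sample values are hypothetical, while the configuration keys are the ones used by the mount command in this article. The Client Secret is deliberately not stored in the object; it is passed in at call time after being read from the Secret Scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MountSettings:
    client_id: str
    tenant_id: str
    scope_name: str
    secret_key_name: str
    file_system: str
    storage_account: str
    mount_name: str

    @property
    def source(self) -> str:
        return f"abfss://{self.file_system}@{self.storage_account}.dfs.core.windows.net/"

    @property
    def mount_point(self) -> str:
        return f"/mnt/{self.mount_name}"

    def oauth_configs(self, client_secret: str) -> dict:
        """Build the extra_configs dictionary for dbutils.fs.mount."""
        return {
            "fs.azure.account.auth.type": "OAuth",
            "fs.azure.account.oauth.provider.type":
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
            "fs.azure.account.oauth2.client.id": self.client_id,
            "fs.azure.account.oauth2.client.secret": client_secret,
            "fs.azure.account.oauth2.client.endpoint":
                f"https://login.microsoftonline.com/{self.tenant_id}/oauth2/token",
        }


# Placeholder values, for illustration only.
settings = MountSettings(
    client_id="<client-id>", tenant_id="<directory-id>",
    scope_name="<scope-name>", secret_key_name="<key-name-for-service-credential>",
    file_system="myfilesystem", storage_account="mystorageaccount",
    mount_name="mymount",
)
print(settings.source, settings.mount_point)

# In a Databricks notebook, where dbutils is predefined:
# secret = dbutils.secrets.get(scope=settings.scope_name, key=settings.secret_key_name)
# dbutils.fs.mount(source=settings.source, mount_point=settings.mount_point,
#                  extra_configs=settings.oauth_configs(secret))
```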

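Databricks provides the mount command dbutils.fs.mount(), which mounts Azure Data Lake Storage Gen2 into DBFS. Mounting is a one-time operation: once it completes, the remote Data Lake Gen2 file system can be used as if it were local files.

1. Mount Azure Data Lake Storage Gen2

The Azure Data Lake Storage Gen2 account is mounted to DBFS using a service principal and OAuth 2.0 for authentication. The mount point is a pointer to the data lake storage: data is not synced locally, and any update to the data in the remote file system is visible through the mount.

Mounting a Data Lake Storage Gen2 file system currently supports only OAuth 2.0 credentials: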

######################################################################################
# Set the configurations. Here's what you need:
## 1.) Client ID (a.k.a Application ID)
## 2.) Client Secret (a.k.a. Application Secret)
## 3.) Directory ID
## 4.) File System Name
## 5.) Storage Account Name
## 6.) Mount Name
######################################################################################
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<client-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

######################################################################################
# Optionally, you can add <directory-name> to the source URI of your mount point.
######################################################################################
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

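Parameter notes:

<client-id>: the Application (client) ID of the registered app
<scope-name>: the name of the Secret Scope
<key-name-for-service-credential>: the name of the Secret in Azure Key Vault that holds the Client Secret
<directory-id>: the Directory (tenant) ID
<mount-name>: a DBFS path indicating where the Data Lake Store, or a folder within it, is mounted in DBFS
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"): retrieves the service credential from the Secret in the Secret Scope
<file-system-name>: the name of the file system
<storage-account-name>: the name of the storage account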

2. Access the mount point

Files under the mount point can be read through pyspark.sql:

df = spark.read.text("/mnt/%s/...." % "<mount-name>")
df = spark.read.text("dbfs:/mnt/<mount-name>/....")

Or through a SQL command:

%sql
select *
from csv.`/mnt/mount_datalakeg2/stword.csv`
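The same mounted file can be referred to in several ways depending on the API. The sketch below derives them from a mount name and a relative path; the helper name mount_paths is hypothetical, and the /dbfs/... form assumes the cluster exposes DBFS through its local /dbfs mount, which may not hold on every cluster type.

```python
def mount_paths(mount_name: str, relative_path: str = "") -> dict:
    """Return the common path forms for a file under a DBFS mount point."""
    rel = relative_path.strip("/")
    base = "/mnt/" + mount_name.strip("/")
    dbfs_path = f"{base}/{rel}" if rel else base
    return {
        "spark": dbfs_path,               # e.g. spark.read.text(...) or SQL csv.`...`
        "spark_uri": "dbfs:" + dbfs_path,  # explicit dbfs: scheme
        "local": "/dbfs" + dbfs_path,      # local file APIs such as open()
    }


# Mount name and file taken from the SQL example above.
for kind, path in mount_paths("mount_datalakeg2", "stword.csv").items():
    print(kind, path)
```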

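3. Refresh mount points

If a mount was created or changed from another cluster, refresh the mount cache on the current cluster: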

dbutils.fs.refreshMounts()

4. Unmount the mount point:

dbutils.fs.unmount("/mnt/<mount-name>")
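Unmounting a path that is not mounted raises an error, so a guarded variant can be convenient in notebooks that are re-run. The following is a minimal sketch; the helper name unmount_if_mounted is hypothetical, and it takes the notebook's dbutils object as a parameter and uses only dbutils.fs.mounts() and dbutils.fs.unmount().

```python
def unmount_if_mounted(dbutils, mount_point: str) -> bool:
    """Unmount mount_point if it is currently mounted; return True if it was."""
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)
        return True
    return False


# In a Databricks notebook, where dbutils is predefined:
# unmount_if_mounted(dbutils, "/mnt/<mount-name>")
```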

5. Mount programmatically using values retrieved from the Secret Scope

# Python code to mount and access Azure Data Lake Storage Gen2 Account to Azure Databricks with Service Principal and OAuth
# Author: Dhyanendra Singh Rathore

# Define the variables used for creating connection strings
adlsAccountName = "dlscsvdataproject"
adlsContainerName = "csv-data-store"
adlsFolderName = "covid19-data"
mountPoint = "/mnt/csvFiles"

# Application (Client) ID
applicationId = dbutils.secrets.get(scope="CSVProjectKeyVault",key="ClientId")

# Application (Client) Secret Key
authenticationKey = dbutils.secrets.get(scope="CSVProjectKeyVault",key="ClientSecret")

# Directory (Tenant) ID
tenantId = dbutils.secrets.get(scope="CSVProjectKeyVault",key="TenantId")

endpoint = "https://login.microsoftonline.com/" + tenantId + "/oauth2/token"
source = "abfss://" + adlsContainerName + "@" + adlsAccountName + ".dfs.core.windows.net/" + adlsFolderName

# Connecting using Service Principal secrets and OAuth
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": applicationId,
           "fs.azure.account.oauth2.client.secret": authenticationKey,
           "fs.azure.account.oauth2.client.endpoint": endpoint}

# Mounting ADLS Storage to DBFS
# Mount only if the directory is not already mounted
if not any(mount.mountPoint == mountPoint for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = source,
    mount_point = mountPoint,
    extra_configs = configs)
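After mounting, a quick listing of the mount point is a simple way to confirm that the service principal can actually read the storage. The sketch below assumes the entries returned by dbutils.fs.ls() expose name and size attributes; the helper name list_mount is hypothetical, and dbutils is passed in as a parameter.

```python
def list_mount(dbutils, mount_point: str) -> list:
    """Return (name, size) pairs for the entries directly under mount_point."""
    return [(entry.name, entry.size) for entry in dbutils.fs.ls(mount_point)]


# In a Databricks notebook, where dbutils is predefined:
# for name, size in list_mount(dbutils, "/mnt/csvFiles"):
#     print(name, size)
```

If the role assignment from section III is missing or has not yet propagated, the listing is typically where the authorization error first surfaces, since the mount call itself does not read any data.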

References:

Mounting & accessing ADLS Gen2 in Azure Databricks using Service Principal and Secret Scopes

Mount an ADLS Gen 2 to Databricks File System Using a Service Principal and OAuth 2.0 (Ep. 5)

Azure Data Lake Storage Gen2

Secret management
