Databricks Delta Lake Connector

The Databricks Delta Lake connector provides support for crawling database objects and profiling the data.

OvalEdge crawling is the process of collecting information about data from various data sources such as on-premises and cloud databases, Hadoop, visualization software, and file systems. When an OvalEdge crawler connects to a data source, it collects and catalogs all the data elements (i.e., metadata) and stores them in the OvalEdge data repository. The crawler creates an index for every stored data element, which can later be used for data exploration within the OvalEdge Data Catalog's smart search. OvalEdge crawlers can be scheduled to scan the databases regularly, so they always have an up-to-date index of the data elements.

Data Sources: The OvalEdge crawler integrates with various data sources to help users extract metadata and build a data catalog. This document describes how to make a connection to your Delta Lake instance and crawl the data.

Connect to the Data: Before crawling and building a data catalog, you must first connect to your data. OvalEdge requires users to configure a separate connection for each type of data source, entering the source credentials and database information for each. Once a data connection is made, a single click of the Crawl button starts the crawling process.

Connectivity Summary

Connectivity to Databricks Delta Lake is via a JDBC driver, which is included in the platform.

The driver used by the connector is given below:

Driver / API: Delta JDBC driver
Version: 2.6.17
Details: JDBC 4.2 compliant, uses JDK 8.0
Download: https://databricks.com/spark/jdbc-drivers-download
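
If you need to confirm that the Delta Lake JDBC driver jar is visible to the application, a quick classpath check like the sketch below can help. This is an illustrative example only; the driver class name com.simba.spark.jdbc41.Driver is taken from the Driver Name example later in this document, and other driver versions may expose a different class.

// DriverCheck.java - minimal sketch to confirm the Delta Lake JDBC driver
// is on the classpath. The class name matches the Driver Name example in
// this document; other driver releases may use a different class name.
public class DriverCheck {
    public static void main(String[] args) {
        try {
            Class.forName("com.simba.spark.jdbc41.Driver");
            System.out.println("Delta Lake JDBC driver found on the classpath.");
        } catch (ClassNotFoundException e) {
            System.out.println("Driver not found - add the driver jar to the OvalEdge jar path.");
        }
    }
}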

Technical Specifications

The connector capabilities are shown below:

Crawling

The crawler supports the following objects:

  • Tables
  • Table Columns (supported data types: STRING, TIMESTAMP, BIGINT)
  • Views

Profiling

Please refer to the Profiling Data article for more details on profiling.

  • Table Profiling: Row count, column count, view sample data
  • View Profiling: Row count, column count, view sample data (a view is treated as a table for profiling purposes)
  • Column Profiling: Min, max, null count, distinct count, top 50 values
  • Full Profiling: Supported
  • Sample Profiling: Supported

Lineage Building

  • Table lineage: Not supported
  • Column lineage: Not supported

Querying

  • Select: Supported
  • Insert: Not supported by default
  • Update: Not supported by default
  • Delete: Not supported by default
  • Joins within database: Supported
  • Joins outside database: Not supported
  • Aggregations: Supported
  • Group By: Supported
  • Order By: Supported

By default, the service account provided for the connector is used for all query operations. If the service account has write privileges, Insert / Update / Delete queries can also be executed.
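
As an illustration of the operations supported by default, the sketch below runs a read-only query that combines a join, an aggregation, GROUP BY, and ORDER BY over JDBC. The table and column names (orders, customers) are hypothetical, and the connection is assumed to have been opened as described in the Connection Details section below.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: runs a SELECT with a join, an aggregation, GROUP BY, and
// ORDER BY - the operations the connector supports by default. The table
// and column names are hypothetical.
public class QueryExample {
    static void runReport(Connection conn) throws Exception {
        String sql =
            "SELECT c.region, COUNT(*) AS order_count " +
            "FROM orders o JOIN customers c ON o.customer_id = c.id " +
            "GROUP BY c.region " +
            "ORDER BY order_count DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getLong("order_count"));
            }
        }
    }
}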

Connection Details

Pre-requisites:

The following prerequisites are required for connecting to Databricks Delta Lake using the OvalEdge application.

  • An admin/service account with read privileges.
  • A JDBC driver is provided by default. In case it needs to be changed, add the Databricks Delta Lake client drivers to the OvalEdge jar path so the application can communicate with the Databricks Delta Lake database.

Check the Configuration section for further details on how to add the drivers to the jar path.

To connect to the Databricks Delta Lake database using the OvalEdge application, complete the following steps.

  • Log in to the OvalEdge application.
  • In the left menu, click on the Administration module name; the sub-modules associated with Administration are displayed.
  • Click on the Crawler sub-module name; the Crawler Information page is displayed.
  • On the Crawler Information page, click the button to add a new connection. The Manage Connection with Search Connector pop-up window is displayed.
  • In the Manage Connection pop-up window, select the connection type as Delta Lake. The Manage Connection with Delta Lake specific details pop-up window is displayed.


Connection Type: Select the connection type as Delta Lake.

License Type: By default, the License Type is selected as Standard.

Connection Name: Enter a connection name for the Delta Lake connection. The name that you specify is a reference name to easily identify your Databricks Delta Lake database connection in OvalEdge. Example: Databricks Delta Lake Connection DB1

Server/IP Address: Enter the database instance URL (on-premises or cloud-based). Example: https://adb-7811813977693091.11.azuredatabricks.net/

Port: Enter the port number as 443.

Sid/Database: Enter the name of the database to connect to.

Driver Name: The JDBC driver class name for Databricks Delta Lake, which is auto-populated from the OvalEdge configuration. Example: com.simba.spark.jdbc41.Driver

HTTP Path: The HTTP path from the cluster information. Example: sql/protocolv1/o/7811813977693091/0717-094118-bathe927

Username: The user account login credential for Databricks Delta Lake authentication.

Password: The password or token (only for Databricks Delta Lake authentication).

Connection String: The Databricks Delta Lake connection string. Set the Connection String toggle button to automatic to build the string from the credentials provided, or enter the string manually (a connection sketch using this format follows the steps below).
Format: jdbc:spark://adb-7811813977693091.11.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7811813977693091/0717-094118-bathe927;AuthMech=3;AllowSelfSignedCerts=1;SocketTimeout=0

  • After entering the connection details in the required fields, click on the Validate button; once the entered connection details are validated, the Save and Save & Configure buttons are enabled.
  • Click on the Save button to establish the connection, or click on the Save & Configure button to establish the connection and configure the connection settings in one step. When you click on the Save & Configure button, the Connection Settings pop-up window is displayed, where you can configure the connection settings for the selected connector.
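
For reference, the sketch below shows how a JDBC client might open a connection using the connection string format documented above. This is a minimal, illustrative example: the workspace URL, HTTP path, and token are placeholders, and with AuthMech=3 the username is commonly the literal string token with a Databricks personal access token as the password.

import java.sql.Connection;
import java.sql.DriverManager;

// Minimal sketch: open a JDBC connection to Databricks Delta Lake using the
// connection string format documented above. The host, HTTP path, and token
// below are placeholders - substitute your own workspace values.
public class ConnectExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:spark://adb-1234567890123456.7.azuredatabricks.net:443/default"
                + ";transportMode=http;ssl=1"
                + ";httpPath=sql/protocolv1/o/1234567890123456/0000-000000-example1"
                + ";AuthMech=3;AllowSelfSignedCerts=1;SocketTimeout=0";
        // With AuthMech=3, the username is typically the literal string "token"
        // and the password is a Databricks personal access token.
        try (Connection conn = DriverManager.getConnection(url, "token", "<personal-access-token>")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}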

Once connectivity is established, additional configurations for Crawling and Profiling can be specified.


Crawler configurations:

  • Tables, Views, and Columns: If the checkbox is selected, the crawler crawls the tables, views, and columns in Databricks.
  • Procedure, Functions, Triggers & Views Source Code: If the checkbox is selected, the crawler crawls the procedures, functions, triggers, and view source code.
  • Relationship: If the checkbox is selected, the crawler crawls the tables that have primary and foreign key relationships.
  • User, Roles, Policies & Permissions: If the checkbox is selected, the crawler fetches the users and roles from the selected server, and the Remote Data Access Management options are enabled in the Remote Access tab.
  • Include Table Regex: Catalogs only the tables that match the regex search pattern added (see the sketch after this list).
  • Exclude Table Regex: Excludes the tables that match the regex search pattern added from the catalog.
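
As an illustration of how include/exclude regex filtering typically behaves, the sketch below applies an include pattern and an exclude pattern to a few table names. The patterns and table names are hypothetical, and OvalEdge's exact matching semantics may differ slightly.

import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: shows how include/exclude regex patterns might filter
// table names during crawling. Patterns and table names are hypothetical.
public class RegexFilterExample {
    public static void main(String[] args) {
        Pattern include = Pattern.compile("^sales_.*"); // crawl only sales_* tables
        Pattern exclude = Pattern.compile(".*_tmp$");   // but skip temporary tables
        List<String> tables = List.of("sales_2021", "sales_2021_tmp", "hr_employees");
        for (String t : tables) {
            boolean crawl = include.matcher(t).matches() && !exclude.matcher(t).matches();
            System.out.println(t + " -> " + (crawl ? "crawled" : "skipped"));
        }
    }
}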

Profiler Settings:

  • Profile Type: Auto - full profiling is performed if the row count of the table is less than the value in the Rowcount field; Sample - sample profiling is performed based on the sample profile size; Disabled - profiling is disabled completely (a sketch of the Auto decision follows this list).
  • No. of Threads: The number of threads used for profiling.
  • Query Timeout: The wait time for a query response.
  • Rowcount Constraint: If checked, profiling is performed based on the row count limit.
  • Profile Rules: Includes or excludes tables from profiling based on the regex search pattern added.
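
The sketch below mirrors the Auto profile-type decision described above: full profiling runs only when the table's row count is under the configured limit, and sample profiling is used otherwise. The method and threshold names are hypothetical and only restate the documented behavior.

// Illustrative sketch of the Auto profile-type decision described above.
// Method and threshold names are hypothetical.
public class ProfileTypeExample {
    enum Mode { FULL, SAMPLE }

    // Under Auto, profile the whole table only if it is small enough;
    // otherwise fall back to sample profiling.
    static Mode chooseMode(long tableRowCount, long rowCountLimit) {
        return tableRowCount < rowCountLimit ? Mode.FULL : Mode.SAMPLE;
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(50_000, 1_000_000));    // FULL
        System.out.println(chooseMode(5_000_000, 1_000_000)); // SAMPLE
    }
}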

Databricks RDAM

Databricks RDAM (Remote Data Access Management) describes the process of managing Databricks permissions through the OvalEdge application.

Please see the Databricks-RDAM article for more details.

Best Practices

  1. Use the Crawler options to crawl the selected metadata (tables, views, stored procedures, etc.) from Databricks. The system allows the user to filter out unwanted objects using the Include/Exclude Regex option, even on the selected objects.
  2. Use the Profiler options to configure the number of threads, the kind of profiling (Full/Sample), constraints on the number of rows for full profiling, the number of rows to view in table data, and the JDBC query timeout. The system also allows the user to skip profiling for unwanted or large tables/views, which helps achieve the best performance during profiling, and to schedule profiling during non-business days/hours.
  3. Set up Remote Data Access Authorization to allow users to use their own Databricks credentials to see table/view data.

FAQs

1. What should I know when upgrading my driver?

The Spark JDBC driver supports the JDBC 4.2 specification and includes the following JAR class library in the installation package:

JAR: SparkJDBC42.jar
JDBC Specification: JDBC 4.2
JDK Version: JDK 8.0

2. How much does the driver cost?

The JDBC driver for Databricks Delta Lake is available at no additional charge.

3. Can I use the driver to access Databricks Delta Lake from a Linux computer?

Yes! You can use the driver to access Databricks Delta Lake from Linux, Unix, and other non-Windows platforms. 

4. Which authentication types are supported by the Delta Lake JDBC driver?

The available authentication options are listed below.

Platform: Any platform
Authentication: By password/token


Copyright © 2019, OvalEdge LLC, Peachtree Corners GA USA