The Databricks Delta Lake connector provides support for crawling database objects and profiling the data.
OvalEdge crawling is the process of collecting information about data from various data sources such as on-premises and cloud databases, Hadoop, visualization software, and file systems. When an OvalEdge crawler connects to a data source, it collects and catalogs all the data elements (i.e., metadata) and stores them in the OvalEdge data repository. The crawler creates an index for every stored data element, which can later be used for data exploration within the OvalEdge Data Catalog's smart search. OvalEdge crawlers can be scheduled to scan the databases regularly, so they always have an up-to-date index of the data elements.
Data Sources: The OvalEdge crawler integrates with various data sources to help users extract metadata and build a data catalog. This document describes how to connect to your Delta Lake instance and crawl the data.
Connect to the Data: Before crawling and building a data catalog, you must first connect to your data. OvalEdge requires a separate connection to be configured for each type of data source, with the source credentials and database information entered for each. Once a data connection is made, a simple click of the Crawl button starts the crawling process.
Connectivity Summary
Connectivity to Databricks Delta Lake is via a JDBC driver, which is included in the platform.
The driver used by the connector is given below:

| Driver / API | Version | Details |
|---|---|---|
| Delta JDBC driver | 2.6.17 | JDBC 4.2 compliant; uses JDK 8.0 |
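As a quick sanity check outside of OvalEdge, the bundled Spark JDBC driver can be exercised directly. The sketch below is illustrative only: the driver class name and URL format follow the Simba-based 2.6.x Spark JDBC driver conventions, and the workspace host, HTTP path, and token are placeholders you must supply.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class DeltaLakeConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Register the Spark JDBC driver (class name per the 2.6.x driver).
        Class.forName("com.simba.spark.jdbc.Driver");

        // Placeholders: <workspace-host>, <org-id>/<cluster-id>, <personal-access-token>.
        String url = "jdbc:spark://<workspace-host>:443/default"
                + ";transportMode=http;ssl=1"
                + ";httpPath=sql/protocolv1/o/<org-id>/<cluster-id>"
                + ";AuthMech=3;UID=token;PWD=<personal-access-token>";

        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```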
Technical Specifications
The connector capabilities are shown below:
Crawling
| Feature | Supported Objects | Remarks |
|---|---|---|
| Crawling | Tables | |
| | Table Columns | Supported data types: STRING, TIMESTAMP, BIGINT |
| | Views | |
Profiling
Please refer to the Profiling Data article for more details on profiling.
| Feature | Support | Remarks |
|---|---|---|
| Table Profiling | Row count, column count, view sample data | |
| View Profiling | Row count, column count, view sample data | The view is treated as a table for profiling purposes |
| Column Profiling | Min, max, null count, distinct count, top 50 values | |
| Full Profiling | Supported | |
| Sample Profiling | Supported | |
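Under the hood, column profiling amounts to aggregate queries against each table. The sketch below is a hypothetical illustration of the kind of SQL that computes the statistics listed above for one column; the table and column names are placeholders, not OvalEdge's actual queries.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnProfileSketch {
    // Computes min, max, null count, and distinct count for one column
    // (sales.orders and order_ts are hypothetical names).
    static void profileColumn(Connection conn) throws Exception {
        String sql = "SELECT MIN(order_ts) AS min_value, "
                   + "MAX(order_ts) AS max_value, "
                   + "COUNT(*) - COUNT(order_ts) AS null_count, "
                   + "COUNT(DISTINCT order_ts) AS distinct_count "
                   + "FROM sales.orders";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            if (rs.next()) {
                System.out.printf("min=%s max=%s nulls=%d distinct=%d%n",
                        rs.getString("min_value"), rs.getString("max_value"),
                        rs.getLong("null_count"), rs.getLong("distinct_count"));
            }
        }
    }
}
```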
Lineage Building
| Lineage Entities | Details |
|---|---|
| Table lineage | Not supported |
| Column lineage | Not supported |
Querying
| Operation | Details |
|---|---|
| Select | Supported |
| Insert | Not supported by default |
| Update | Not supported by default |
| Delete | Not supported by default |
| Joins within database | Supported |
| Joins outside database | Not supported |
| Aggregations | Supported |
| Group By | Supported |
| Order By | Supported |
By default, the service account provided for the connector is used for any query operations. If the service account has write privileges, Insert/Update/Delete queries can be executed.
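For illustration, a read-only query exercising the supported operations (aggregation, GROUP BY, ORDER BY) might look like the following; the table and columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class DeltaLakeQuerySketch {
    // Runs a SELECT with aggregation, GROUP BY, and ORDER BY
    // within a single database (sales.orders is a placeholder).
    static void printRowCounts(Connection conn) throws Exception {
        String sql = "SELECT region, COUNT(*) AS row_count "
                   + "FROM sales.orders GROUP BY region "
                   + "ORDER BY row_count DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": "
                        + rs.getLong("row_count"));
            }
        }
    }
}
```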
Connection Details
Prerequisites:
The following prerequisites are required to connect to Databricks Delta Lake using the OvalEdge application.
- An admin/service account with read privileges.
- The JDBC driver is provided by default. If it needs to be changed, add the Databricks Delta Lake client drivers to the OvalEdge jar path to communicate with the Databricks Delta Lake database. See the Configuration section for details on how to add drivers to the jar path.
To connect to the Databricks Delta Lake database using the OvalEdge application, complete the following steps.
- Log in to the OvalEdge application.
- In the left menu, click on the Administration module name; the sub-modules associated with Administration are displayed.
- Click on the Crawler sub-module name; the Crawler Information page is displayed.
- On the Crawler Information page, click the + (add connection) button. The Manage Connection with Search Connector pop-up window is displayed.
- In the Manage Connection pop-up window, select the connection type as Delta Lake. The Manage Connection with Delta Lake specific details pop-up window is displayed.
| Property | Details |
|---|---|
| Connection Type | Select the connection type as Delta Lake. |
| License Type | By default, the License Type is set to Standard. |
| Connection Name | Enter a connection name for the Delta Lake connection. The name you specify is a reference name to easily identify your Databricks Delta Lake database connection in OvalEdge. Example: Databricks Delta Lake Connection DB1 |
| Server/IP Address | Enter the database instance URL (on-premises/cloud-based). |
| Port | Enter the port number as 443. |
| Sid/Database | Enter the name of the database to connect to. |
| Driver Name | The JDBC driver name for Databricks Delta Lake, auto-populated from the OvalEdge Configuration. |
| HTTP Path | The HTTP path from the cluster information. Example: sql/protocolv1/o/7811813977693091/0717-094118-bathe927 |
| Username | User account login credential (Databricks Delta Lake authentication). |
| Password | Password/token (only for Databricks Delta Lake authentication). |
| Connection String | The Databricks Delta Lake connection string. Set the Connection String toggle to automatic to build the string from the credentials provided, or enter the string manually. |
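When the Connection String toggle is set to manual, the string generally follows the shape below, with each placeholder corresponding to a field above. The exact format depends on the bundled driver version, so treat this as illustrative:

```
jdbc:spark://<Server/IP Address>:<Port>/<Sid/Database>;transportMode=http;ssl=1;httpPath=<HTTP Path>;AuthMech=3;UID=token;PWD=<Password/Token>
```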
- After entering the connection details in the required fields, click on the Validate button. The entered connection details are validated, and the Save and Save & Configure buttons are enabled.
- Click on the Save button to establish the connection, or click on the Save & Configure button to establish the connection and configure the connection settings in one step. When you click on the Save & Configure button, the Connection Settings pop-up window is displayed, where you can configure the connection settings for the selected connector.
Once connectivity is established, additional configurations for Crawling and Profiling can be specified.
| Property | Details |
|---|---|
| Crawler configurations | |
| Tables, Views, and Columns | If the checkbox is selected, the tables and columns in Databricks are crawled. |
| Procedures, Functions, Triggers & Views Source Code | If the checkbox is selected, the procedures, functions, triggers, and view source code are crawled. |
| Relationship | If the checkbox is selected, tables that have primary- and foreign-key relationships are crawled. |
| User, Roles, Policies & Permissions | If the checkbox is selected, users and roles are fetched from the selected server, and the Remote Data Access Management options are enabled in the Remote Access tab. |
| Include Table Regex | Catalogs the tables that match the regex search pattern added. |
| Exclude Table Regex | Skips the tables that match the regex search pattern added. |
| Profiler Settings | |
| Profile Type | Auto: full profiling is performed if the table's row count is less than the Rowcount field. Sample: sample profiling is performed based on the sample profile size. Disabled: profiling is disabled completely. |
| No. of Threads | Number of threads used for profiling. |
| Query Timeout | The wait time for a query response. |
| Rowcount Constraint | If checked, profiling is performed based on the row-count limit. |
| Profile Rules | Includes/excludes tables from profiling based on the regex search pattern added. |
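The Include/Exclude Table Regex and Profile Rules fields accept standard regular expressions. For example, assuming a hypothetical naming convention where staging tables end in _tmp or _staging, the settings might look like this (both patterns are illustrative):

```
Include Table Regex:  (sales|finance)_.*
Exclude Table Regex:  .*_(tmp|staging)$
```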
Databricks RDAM
Databricks RDAM (Remote Data Access Management) describes the process of managing Databricks permissions through the OvalEdge application.
Please see the Databricks RDAM article for more details.
Best Practices
- Use the Crawler options to crawl only the selected metadata (tables, views, stored procedures, etc.) from Databricks. The Include/Exclude Regex options let the user filter out unwanted objects, even among the selected object types.
- Use the Profiler options to configure the number of threads, the kind of profiling (full/sample), constraints on the number of rows for full profiling, the number of rows to view in table data, and the JDBC query timeout. The system also allows the user to skip profiling for unwanted or large tables/views, which helps achieve the best performance during profiling and permits profiling to run during non-business days/hours.
- Set up Remote Data Access Authorization to allow users to use their Databricks credentials to see the Table/View data.
FAQs
1. What should I know when upgrading my driver?
The Delta Lake JDBC driver supports the JDBC 4.2 specification and includes the following JAR class library in the installation package:
| JAR | JDBC Specification | JDK Version |
|---|---|---|
| SparkJDBC42.jar | JDBC 4.2 | JDK 8.0 |
The JDBC driver for Databricks Delta Lake is available at no additional charge.
2. Can I use the driver to access Databricks Delta Lake from a Linux computer?
Yes! You can use the driver to access Databricks Delta Lake from Linux, Unix, and other non-Windows platforms.
3. Which authentication types are supported by the Delta Lake JDBC driver for Databricks Delta Lake?
The table below lists available authentication options.
| Platform | Authentication |
|---|---|
| Any Platform | Authentication by password/token |
Copyright © 2019, OvalEdge LLC, Peachtree Corners GA USA