
Delta Lake Connector

Delta Lake is an open-source data lake technology that provides reliable data pipelines, transactional consistency, and data versioning capabilities to improve the reliability and quality of big data and machine learning workloads.

Unity Catalog is a solution for unified data and AI governance in lakehouse environments. It enables organizations to manage data, models, notebooks, and more across cloud platforms, and lets users securely access and collaborate on trusted assets. Unity Catalog offers centralized access control, auditing, lineage tracking, and data discovery for Delta Lake workspaces.

OvalEdge uses a JDBC driver to connect to the data source and perform data crawling, profiling, query execution, and lineage building.

Connector Capabilities

The following is the list of objects supported by the Delta Lake connector:

Functionality

Descriptions

Crawling

  • Tables, Views & Columns
  • Procedures, Functions, Triggers & Views Source Code

Profiling

  • Table Profiling: Row count, Column count, and View sample data
  • View Profiling: Row count, Column count, and View sample data
  • Column Profiling: Min, Max, Null count, Distinct count, and Top 50 values
  • Full Profiling

Lineage Building

  • Table Lineage
  • Column Lineage

Lineage Sources: Stored Procedures, Functions, Triggers, Views, SQL Queries 

Query Execution

  • Select
  • Joins within database
  • Aggregations
  • Group By
  • Order By
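The column profiling statistics listed above (Min, Max, Null count, Distinct, Top 50 values) can be illustrated with a minimal Python sketch. This is not OvalEdge's profiler; the function name `profile_column` and the sample data are hypothetical and serve only to show what each statistic measures.

```python
from collections import Counter

def profile_column(values, top_n=50):
    """Summarise one column: min, max, null count, distinct count, top values.

    Hypothetical helper for illustration; None stands in for a SQL NULL.
    """
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_count": len(values) - len(non_null),
        "distinct": len(counts),
        # most_common returns (value, frequency) pairs, most frequent first
        "top_values": counts.most_common(top_n),
    }

sample = [3, 1, None, 3, 7, None, 3]
profile = profile_column(sample)
print(profile)
```

For the sample column, the null count is 2, the distinct count is 3, and the most frequent value is 3.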

Prerequisites

This section lists the prerequisites to establish a connection between the connector and OvalEdge Application.

  1. Driver Details
  2. Service Account Permission
  3. Configure environment variables (Optional)

Driver Details

Drivers: The drivers used by the connector are given below:

Driver | Version | Details
--- | --- | ---
JDBC | V4.0.0 and above | The OvalEdge dependencies automatically include the JDBC driver, so the connector comes equipped with it by default.

Service Account Permissions

The following are the minimum privileges required for a service account to crawl and profile the data.

Operation | Permission
--- | ---
Connection Validation | SELECT
Crawl Schemas | SELECT
Crawl Tables | SELECT
Profile Schemas, Tables | SELECT
Lineage Building | SELECT

Note: The service account user should have the 'Can Manage' permission on the SQL Warehouse/cluster.

Note: To fetch the System Views from the data source, the Service Account user needs to have Read Access to the Public Synonyms.
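As a rough illustration of granting the SELECT privilege listed above, the sketch below generates Unity Catalog GRANT statements for a service account. It rests on assumptions not stated in this document: the `USE CATALOG` and `USE SCHEMA` grants reflect the general Unity Catalog rule that a principal needs usage on the parent catalog and schema before SELECT takes effect, and the principal, catalog, and schema names are hypothetical. Confirm the exact statements with your Databricks administrator before running them.

```python
def build_grants(principal, catalog, schema):
    """Return GRANT statements giving a principal read access to one schema.

    Assumption: Unity Catalog privilege model; all names are hypothetical.
    """
    target = f"`{principal}`"  # backticks quote principals containing @ or dots
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {target}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {target}",
        f"GRANT SELECT ON SCHEMA {catalog}.{schema} TO {target}",
    ]

grants = build_grants("svc_ovaledge@example.com", "main", "sales")
for statement in grants:
    print(statement)
```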

Establish Environment Variables (Optional)

This section describes the settings or instructions that you should be aware of before establishing a connection. If your environments have been configured, skip this step.

Configure Environment Names

The Environment Names allow you to select the environment configured for the specific connector from the dropdown list in the Add Connector pop-up window.
For consistency, you might want to consider crawling the same schema in both stage and production environments. The typical crawling environments are PROD, STG, or Temporary, and they may also include QA or other environments. 

Crawling a temporary environment can also be useful for schema comparisons, for example during application upgrades. The temporary environment can be deleted afterward.

Steps to Configure the Environment

  1. Navigate to Administration > System Settings 
  2. Select the Connector tab.
  3. Find the Key name “connector.environment”.
  4. Enter the desired environment values (PROD, STG) in the value column. 
  5. Click ✔ to save. 

Establish a connection

To establish a Delta Lake Unity Catalog Connection, fill in the required fields with the relevant information in the Manage Connector pop-up window:

  1. Log into the OvalEdge application
  2. In the left menu, click on the Administration module name and then on the Connectors sub-module name. The Connectors Information page will then be displayed.
  3. Click on + New Connector. The Add Connector pop-up window is displayed.
  4. Select the connection type Delta Lake Unity Catalog. The Add Connector pop-up window with the Delta Lake Unity Catalog details is displayed.

    Fields

    Details

    Connector Type

    The selected connection type ‘Delta Lake’ is displayed by default. The drop-down list allows the user to change the connector type if required.

    Credential Manager

    Select the option from the drop-down list where you want to save your credentials.

    • Database: When OvalEdge establishes a connection to the Delta Lake database, the connection is configured with the basic username and password of the service account in real-time.
    • HashiCorp: The credentials are stored in the HashiCorp database server and fetched from HashiCorp to OvalEdge.
    • AWS Secrets Manager: The credentials are stored in the AWS Secrets Manager database server and fetched from the AWS Secrets Manager to OvalEdge.
    • Azure Key Vault: The credentials are stored in the Azure Key Vault database server and fetched from the Azure Key Vault to OvalEdge.

    For more information on Credential Manager, refer to Credential Manager

    License Add-Ons

    All the connectors will have a Base Connector License by default, which allows you to crawl and profile to obtain metadata and statistical information from a data source. 

    OvalEdge supports various License Add-Ons based on the connector’s functionality requirements.

    • Select the Auto Lineage Add-On license that enables the automatic construction of the Lineage of data objects for a connector with the Lineage feature. 
    • Select the Data Quality Add-On license to identify, report, and resolve the data quality issues for a connector whose data supports data quality using DQ Rules/functions, Anomaly detection, Reports, and more.
    • Select the Data Access Add-On license that will enforce connector access via OvalEdge with the Remote Data Access Management (RDAM) feature enabled. 

    Connector Name*

    Enter a connection name for the Delta Lake database. Users can specify a reference name to identify the Delta Lake database connection in OvalEdge.

    Example: OvalEdge_Delta Lake_Connection

    Connector Environment

    The OvalEdge Environment dropdown menu is used to select the environment for crawling, such as PROD, STG, or Temporary, and may also include QA or other environments.

    Server*

    Specify the server address where the database instance is located.

    Example: adb-123456789.11.azuredatabricks.net

    Port*

    Enter the port number. The default port for connecting to a Delta Lake server is usually 443.

    Database_Type*

    Select the database type as Delta Lake_Unity_Catalog. 

    • If you opt for Delta Lake_Regular, the database will include the default catalog folder containing schemas. 
    • On the other hand, selecting Delta Lake_Unity_Catalog will result in a database with multiple catalog folders that include nested schemas.

    Database

    Enter the name of the database associated with the Database type.

    Driver*

    A JDBC driver is a Java library file with the extension .jar that connects to a database. By default, the driver details associated with the Delta Lake database will be auto-populated.

    Example: com.simba.spark.jdbc.Driver

    HTTP Path

    Enter the HTTP Path associated with Delta Lake. It identifies the specific legacy cluster or SQL warehouse to connect to.

    Example: sql/protocolv1/o/781181XXXXXXX/0717-094118-bathe927

    Lineage Fetching Mode

    Choose the mode for retrieving and displaying lineage details in OvalEdge by selecting either Query or API.

    Username* 

    Provide the service account username required to connect to the Delta Lake server.

    Note: This field may be auto-filled by the web browser with the current OvalEdge user login. Please enter the Delta Lake Service Account name if necessary.

    Password*

    Enter the service account password to gain access to the Delta Lake Server. In general, security measures involve token-based authentication for enhanced protection. 

    Connection String 

    A connection string configures the Delta Lake object. Toggle the button to automatically retrieve details from the provided credentials or manually enter the connection string. 

    Example: jdbc:spark://{server}:443/default;transportMode=http;ssl=1;httpPath={http_path}

    Plug-in Server

    Specify the server name if the data source library is running as a web server, similar to bridge-lite.

    Plug-in Port

    Enter the port number associated with the plugin server.

    Default Governance Roles*

    You can select a specific user or a team from the governance roles (Steward, Custodian, Owner) that get assigned for managing the data asset. 

    Note: The dropdown list displays all the configurable roles (single user or a team) as per the configurations made in the OvalEdge Security | Governance Roles section.  

    Admin Roles

    Select the required admin roles for this connector.

    To add Integration Admin Roles, search for or select one or more roles from the Integration Admin options and then click on the Apply button. 

    • The Integration Admin's responsibility includes creating a connector, configuring its crawling and profiling settings, and deleting connectors, schemas, or data objects.

    To add Security and Governance Admin roles, search for or select one or more roles from the list and then click on the Apply button. 

    The Security and Governance Admin is responsible for:

    • Configuring role permissions for the connector and its associated data objects.
    • Adding admins to set permissions for the connector's roles and associated data objects.
    • Updating governance roles.
    • Creating custom fields.
    • Developing Service Request templates for the connector.
    • Creating Approval workflows for the templates.

    No of Archive Objects*

    The "Number of archive objects" refers to the number of recent modifications made to the metadata of a dataset at the remote/source location. This feature is disabled by default. To enable it, toggle the Archive button and enter the desired number of objects to archive.

    For instance, if a user sets the count to 4 and the connection is crawled, it will retrieve the last 4 changes that occurred in the connector's remote/source. These changes can be observed in the 'version' column of the 'Metadata Changes' module.

    Select Bridge

    To enable OvalEdge to function as a SaaS application behind a customer's firewall, the OvalEdge Bridge is necessary. 

    • When a bridge has been set up, it will be displayed in a dropdown menu. Users can select the required Bridge ID.
    • The user can select "NO BRIDGE" when it is not configured.

    For more information, refer to Bridge Overview

  5. Click on the Validate button to validate the connection details. 
  6. Click on the Save button to save the connection.  Alternatively, the user can also directly click on the Save & Configure button that displays the Connection Settings pop-up window to configure the settings for the selected Connector. The Save & Configure button is displayed only for the Connectors for which the settings configuration is required.

Note: * (asterisk) indicates the mandatory field required to establish a connection. Once the connection is validated and saved, it will be displayed on the Connectors home page. 

Note: You can either save the connection details first, or validate the connection first and then save it.
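To show how the Server, Port, and HTTP Path fields above relate to a connection string, here is a minimal Python sketch that assembles a JDBC URL. It assumes the URL layout used by the Simba Spark driver named in the Driver field (`com.simba.spark.jdbc.Driver`); the helper `build_jdbc_url`, the `default` database, and the `AuthMech=3` setting (username/password or token authentication) are assumptions, and the string OvalEdge generates may differ. Credentials are deliberately left out of the URL.

```python
def build_jdbc_url(server, port=443, http_path="", database="default"):
    """Assemble a Simba Spark style JDBC URL from the connector fields.

    Hypothetical helper; property names follow the Simba Spark driver's
    URL convention and are not taken from the OvalEdge documentation.
    """
    parts = [
        f"jdbc:spark://{server}:{port}/{database}",
        "transportMode=http",      # connect over HTTP(S)
        "ssl=1",                   # encrypt the connection
        f"httpPath={http_path}",   # cluster or warehouse endpoint
        "AuthMech=3",              # username/password (or token) authentication
    ]
    return ";".join(parts)

url = build_jdbc_url(
    "adb-123456789.11.azuredatabricks.net",
    443,
    "sql/protocolv1/o/781181XXXXXXX/0717-094118-bathe927",
)
print(url)
```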

Connection Validation Errors 

Sl. No. | Error Message(s) | Description
--- | --- | ---
1 | Failed to establish a connection, please check the credentials | Invalid credentials were provided, or the user or role does not have access.
2 | java.sql.SQLException: [Simba][SparkJDBCDriver](500593) Communication link failure. Failed to connect to the server. | The credentials are invalid or the token has expired.

Note: If you have any issues creating a connection, please contact your assigned OvalEdge Customer Success Management (CSM) team.

Connector Settings 

Once the connection is validated successfully, various settings are provided to retrieve and display the information from the data source.  The connection settings include Crawler, Profiler, Query Policies, Access Instruction, Business Glossary Settings, and Others.

Connection Settings

Description

Crawler

Crawler settings are configured to connect to a data source and collect and catalog all the data elements in the form of metadata. 

Note: In Crawler Options, the user must select the Procedures, Functions, Triggers & Views Source Code checkbox to fetch lineage data.

Profiler

Profiling is the process of gathering statistics and informative summaries about the connected data source(s). Statistics can help assess the data source's quality before using it in an analysis. Profiling is always optional; crawling can be run without profiling.

Data Access

The Remote Access tab lists the data objects, along with the metadata and data permissions on those objects, that a user has been granted in a remote application.

  • Crawler Options
  • Data Access Management
  • Data Access Authorization

Query Policies

The Query Policies in the Crawler setting control access to the Query Sheet functions (Join, Union, SUM, or other aggregate functions). You can specify the roles and permissions that deny the use of a Query Sheet function. A role that has been denied by a policy cannot use that function in the Query Sheet.

Example: If the user selects the Role as “OE_HRADMIN,” Query Type as “JOIN,” and the Access Type as “DENY,” then the users associated with the OE_HRADMIN privileges are restricted from using the JOIN function in the Query Sheet page.

Access Instruction 

It allows the data owner to instruct others on using the objects in the application. 

Business Glossary Settings

The Business Glossary setting provides flexibility and control over how users view and manage term associations within the context of a business glossary at the connector level.

Notification

The Enable/Disable Metadata Changes Notifications option is used to set the change notification about the metadata changes of the data objects.

  • You can use the toggle button to set the Default Governance Roles (Steward, Owner, Custodian, etc.).
  • From the drop-down menu, you can select the role and team to receive the notification of metadata changes.

For more information, refer to the Connector Settings.
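The Query Policies behavior described above (a role denied a query type cannot use that function in the Query Sheet) can be sketched as a simple lookup. This is an illustrative model only, not OvalEdge's implementation; the `QueryPolicy` class, the `is_allowed` function, and the `OE_PUBLIC` role are hypothetical, while `OE_HRADMIN`, `JOIN`, and `DENY` come from the example in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryPolicy:
    role: str         # e.g. "OE_HRADMIN"
    query_type: str   # e.g. "JOIN", "UNION", "SUM"
    access_type: str  # "DENY" in the documented example

def is_allowed(policies, user_roles, query_type):
    """Return False if any of the user's roles is denied this query type."""
    wanted = query_type.upper()
    for policy in policies:
        if (
            policy.access_type == "DENY"
            and policy.query_type == wanted
            and policy.role in user_roles
        ):
            return False
    return True

policies = [QueryPolicy("OE_HRADMIN", "JOIN", "DENY")]
print(is_allowed(policies, {"OE_HRADMIN"}, "JOIN"))   # False
print(is_allowed(policies, {"OE_HRADMIN"}, "UNION"))  # True
print(is_allowed(policies, {"OE_PUBLIC"}, "JOIN"))    # True
```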

The Crawling of Schema(s)

You can use the Crawl/Profile option to select the specific schemas that need to be crawled, profiled, or unprofiled. For scheduled crawlers and profilers, the defined run date and time are displayed.

  1. Navigate to the Connectors page and click Crawl/Profile.
  2. Select the specific schemas that need to be crawled, profiled, unprofiled, or scheduled.
  3. Click Run to gather all metadata from the connected source into the OvalEdge Data Catalog.

Note: For more information on Scheduling, refer to Scheduling Connector

 

 

Copyright © 2024, OvalEdge LLC, Peachtree Corners GA USA