
Delta Lake Connector

Delta Lake is an open-source data lake technology that provides reliable data
pipelines, transactional consistency, and data versioning capabilities to improve the
reliability and quality of big data and machine learning workloads.

OvalEdge uses JDBC driver to connect to the data source, allowing users to crawl
and profile the data objects, build lineage, and execute queries.

Prerequisites

The following prerequisites are required to establish a connection between Delta Lake and OvalEdge: configured environment names, the Delta JDBC driver, and a service account with the minimum access permissions. Each is described in the sections below.


Configure Environment Names

This is a global configuration and may already have been established.
To create a distinction between environments for connections (e.g. prod, test, etc.), follow the steps below.

  1. Go to the Administration tab in the left panel menu and select Configuration
    from the dropdown.
  2. Select the Connector tab.
  3. In the OVALEDGE_APP configuration type, enter the desired environment
    values in the Value column.

Driver Requirements

The following driver is required:

Driver       Version   Details
Delta JDBC   2.6.17    JDBC 4.2 compliant; uses JDK 8.0

Download: https://databricks.com/spark/jdbc-drivers-download

User Permissions

The table below summarizes the minimum access permissions required by the
service account to perform the operations.

Operation               Minimum Access Permission
Connection Validation   Usage, Select
Crawling                Usage, Select, Reference
Profiling               Usage, Select
Query Execution         Select

Technical Specifications

The following is the list of objects and data types supported by the Delta Lake connector.

Crawler

  • Tables
  • Views
  • Columns
  • Functions
  • Procedures
  • Synonyms
  • Triggers
  • Roles
  • Users
  • Permissions
  • Usage statistics
  • Tasks
  • Pipes

Profiler

  • Table Profiling: Row count, Columns count, View sample data
  • View Profiling: Row count, Columns count, View sample data
  • Column Profiling: Min, Max, Null count, Distinct, Top 50 values (see the sketch after this list)
  • Full Profiling

Lineage Building

  • Table Lineage
  • Column Lineage
  • Lineage Sources: Stored procedures, functions, triggers, views, SQL queries

Query Execution

  • Select: Joins within database
  • Aggregations
  • Group By
  • Order By
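For reference, the basic column profiling statistics listed above (Min, Max, Null count, Distinct) correspond to simple aggregate queries against the source. The snippet below is an illustrative sketch of such a query issued over JDBC; it is not OvalEdge's actual profiling query, and the default.sales table and amount column are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ColumnProfileSketch {
        // Runs a single aggregate query that yields the basic column statistics
        // (min, max, null count, distinct count) for a placeholder column.
        static void profileColumn(Connection conn) throws Exception {
            String sql = "SELECT MIN(amount) AS min_val, MAX(amount) AS max_val, "
                       + "COUNT(*) - COUNT(amount) AS null_count, "
                       + "COUNT(DISTINCT amount) AS distinct_count "
                       + "FROM default.sales";   // placeholder schema.table
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.printf("min=%s max=%s nulls=%d distinct=%d%n",
                            rs.getString("min_val"), rs.getString("max_val"),
                            rs.getLong("null_count"), rs.getLong("distinct_count"));
                }
            }
        }
    }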

Establish a Connection

To establish a Delta Lake connection, complete the following steps:

  1. With your OvalEdge Administration access, log in to the OvalEdge
    application, navigate to the Administration module and click on Connectors.
  2. Click on the + icon (New Connector) on the right to open the Manage
    Connector window.
  3. Search for and select the desired connector; the Manage Connector pop-up
    with the selected connector details is displayed. Complete the required
    fields.



    Fields and Details

    Connection Type: By default, the selected connection type "Delta Lake" is displayed. If required, the dropdown menu allows the user to change the connector type.

    License Type: OvalEdge offers two types of licenses based on the customer's requirements. By default, the license type is displayed as "Auto Lineage." The user can select the license type as Standard or Auto Lineage.
    (i) Standard: The Standard license includes the crawler and profiler features but not the Auto Lineage functionality; users cannot build lineage for the selected database and schema-level objects.
    (ii) Auto Lineage: The Auto Lineage license allows building lineage in addition to the crawling and profiling features.

    Connection Name: Enter a connection name for Delta Lake. The name that you specify is a reference name to easily identify your Delta Lake database connection in OvalEdge.
    Example: Delta Lake Connection DB1

    Server/IP Address: Enter the database instance URL (on-premises/cloud-based).
    Example: adb-78xxxxxxxxxxxx091.11.yyyyyyyyyy.net/

    Port: Port number 443 is the default port used by Delta Lake when connecting to a server over the internet.

    Database: Provide the name of the database that is associated with the Delta Lake connection, if necessary.

    Driver Name: A JDBC driver is a Java library file with the extension .jar that connects to a database. The driver details associated with the Delta Lake database are auto-populated by default.
    Example: com.yyyyyyyyyyy.spark.jdbc41.Driver

    Path: The HTTP path. It helps in connecting to a legacy-specific cluster or to a SQL Warehouse.
    Example: sql/protocolv1/o/7811813977693091/0717-094118-bathe927

    Username: User account login credential for Delta Lake authentication. Enter the username used to access the required Delta Lake database.

    Password: Password/token (only for Delta Lake authentication). Enter the password used to access the required Delta Lake database.

    Connection String: Set the connection string toggle button to Automatic to build the string from the credentials provided. Alternatively, you can enter the string manually. A minimal JDBC sketch that uses this format is shown after these steps.
    Format: jdbc:spark://adb-7xxxxxxxxxxx1.11.yyyyyyyyy.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7811813977693091/0717-4118-bathe927;AuthMech=3;AllowSelfSignedCerts=1;SocketTimeout=0

    Note: An asterisk (*) indicates a mandatory field.
  4. Click the Validate button. The entered connection details are validated, and the Save & Configure button is enabled.
  5. Click Save to save the connection. Alternatively, click Save & Configure to
    save and configure additional connection settings.
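The connection string format above can be exercised directly with the Delta JDBC driver listed under Driver Requirements. The following is a minimal sketch, assuming the driver JAR is on the classpath; the host, HTTP path, and token values are placeholders, and the driver class string should be replaced with the exact value auto-populated in the Driver Name field.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DeltaLakeConnectionCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder values: replace with your workspace host, HTTP path, and credentials.
            String host = "adb-7xxxxxxxxxxx1.11.yyyyyyyyy.net";
            String httpPath = "sql/protocolv1/o/7811813977693091/0717-4118-bathe927";
            String user = "token";                       // or a workspace username
            String password = "<personal-access-token>"; // placeholder

            // Connection string assembled in the same format shown in the table above.
            String url = "jdbc:spark://" + host + ":443/default"
                    + ";transportMode=http;ssl=1"
                    + ";httpPath=" + httpPath
                    + ";AuthMech=3;AllowSelfSignedCerts=1;SocketTimeout=0";

            // Load the driver class exactly as shown in the Driver Name field.
            Class.forName("com.yyyyyyyyyyy.spark.jdbc41.Driver");

            // Validate the connection with a trivial query.
            try (Connection conn = DriverManager.getConnection(url, user, password);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                if (rs.next()) {
                    System.out.println("Connection validated: " + rs.getInt(1));
                }
            }
        }
    }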

Connection Validation Errors 

The table below summarizes common connection validation errors.

Sl. No: 1
Error Message: Failed to establish a connection; please check the credentials
Description: Invalid credentials were provided, or the user or role does not have access.

Sl. No: 2
Error Message: Errors while downloading the file
Description:
  • 403: Access denied (provide appropriate access to the user or role used in the connection)
  • 404: No such key (the object does not exist in the remote system)

Crawl/Profile

After connectivity is established, additional configurations for crawling and profiling
can be specified for the selected data source. Selecting Crawl/Profile allows the
user to select the specific schemas that need to be crawled, profiled, or unprofiled.

Connector Settings

After the connection is established successfully, follow the steps below to navigate to
the connector’s settings.

  1. Select the connector using the selection box.
  2. Click on the nine dots icon on the top right side of the page.
  3. Select Settings to configure additional settings for the connector.

The below sections describe each of the settings tabs.

Crawler

When an OvalEdge crawler connects to a data source, it collects and catalogs all the
data elements (i.e., metadata) and stores them in the OvalEdge database. The admin
user may select several options to crawl the data source system.

Crawler Options

The Tables, Views & Columns option is selected by default.

  • Tables, Views, and Columns: This option fetches the tables, views, and
    columns from the database.
  • Procedure, Functions & Views Source Code: This option fetches the
    procedures, functions, and views into OvalEdge.
  • Relationship: This option fetches the relationships between the tables into
    OvalEdge.
  • Users, Roles, Policies & Permissions: This option fetches the users and
    roles defined at the source. Selecting this option enables the “Remote
    Access” tab.

Crawler Rules

A user can write a regular expression to find matching character sequences in the
source system.

Crawler Logic Rules

A user can define the crawler rules that will either include/exclude tables, views,
columns, procedures, and functions. The regex rule is defined by using logical
patterns that start with, end with, or have a middle character.

  • Starts with: Syntax: a[(.*)]
    The rule matches object names that begin with the specified character(s),
    here "a".
  • Ends with: Syntax: [(.*)]a
    The rule matches object names that end with the specified character(s),
    here "a".
  • Middle Characters: Syntax: [(.*)]a[(.*)]
    The rule matches object names that contain the specified character(s)
    anywhere within the name.

When setting up the regex rules, the user can write rules that either include or
exclude tables, views, columns, procedures, and functions that start with, end
with, or have middle characters as defined in the rule. The logical pattern for
each rule is as follows:

Include Schema Regex
Syntax: "a[(.*)]"
Sample: If a user only wants schemas that start with "a", they need to write "a[(.*)]" in the Include Schema Regex field and run the crawl job. Only the schemas that start with the character "a" are then displayed.

Exclude Schema Regex
Syntax: "[(.*)]a"
Sample: If a user wants to exclude schemas that end with "a", they need to write "[(.*)]a" in the Exclude Schema Regex field and run the crawl job. The schemas that end with the character "a" are then excluded.

Include Table Regex
Syntax: "e[(.*)]"
Sample: If a user only wants tables that start with "e", they need to write "e[(.*)]" in the Include Table Regex field and run the crawl job. Only the tables that start with the character "e" are then displayed.

Exclude Table Regex
Syntax: "[(.*)]e"
Sample: If a user wants to exclude tables that end with "e", they need to write "[(.*)]e" in the Exclude Table Regex field and run the crawl job. The tables that end with the character "e" are excluded, and the remaining tables are displayed.

Include Column Regex
Syntax: "a[(.*)]"
Sample: If a user only wants columns that start with "a", they need to write "a[(.*)]" in the Include Column Regex field and run the crawl job. Only the columns that start with the character "a" are then displayed.

Exclude Column Regex
Syntax: "n[(.*)]"
Sample: If a user wants to exclude columns that start with "n", they need to write "n[(.*)]" in the Exclude Column Regex field and run the crawl job. The columns that start with the character "n" are excluded, and the remaining column details are displayed.

Include Procedure & Function Regex
Syntax: "s[(.*)]"
Sample: If a user only wants procedures and functions that start with "s", they need to write "s[(.*)]" in the Include Procedure & Function Regex field and run the crawl job. Only the procedures and functions that start with the character "s" are then displayed.

Exclude Procedure & Function Regex
Syntax: "s[(.*)]"
Sample: If a user wants to exclude procedures and functions that start with "s", they need to write "s[(.*)]" in the Exclude Procedure & Function Regex field and run the crawl job. The procedures and functions that start with the character "s" are not displayed.

Note: Apart from "[(.*)]", users can also use "%" as the syntax for a regex rule.
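The bracketed syntax above is OvalEdge's own rule notation; conceptually the rules behave like ordinary start-with/end-with regular expressions. The snippet below is an illustrative sketch only (not the connector's implementation) showing how an include rule for names starting with "a" and an exclude rule for names ending with "t" would filter a hypothetical list of schema names.

    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class CrawlerRegexDemo {
        // Returns the names kept after applying an include pattern and an exclude pattern.
        // Patterns are ordinary Java regular expressions matched against the full name.
        static List<String> filter(List<String> names, String includeRegex, String excludeRegex) {
            Pattern include = Pattern.compile(includeRegex);
            Pattern exclude = Pattern.compile(excludeRegex);
            return names.stream()
                    .filter(n -> include.matcher(n).matches())   // keep names matching the include rule
                    .filter(n -> !exclude.matcher(n).matches())  // drop names matching the exclude rule
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> schemas = List.of("analytics", "audit", "sales", "beta");
            // "a(.*)" -> starts with "a";  "(.*)t" -> ends with "t"
            System.out.println(filter(schemas, "a(.*)", "(.*)t"));
            // Prints [analytics]: "audit" is dropped by the exclude rule,
            // "sales" and "beta" by the include rule.
        }
    }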

Profiler

To profile a data source, the process starts by selecting the desired data source
and configuring the profile settings to increase the efficiency and performance of
the database (while the profile job is running). There are many attributes that can
be specified in the profile settings.

Profiler Options

  • Table and Columns: By default, the checkbox for Tables and Columns is
    selected. It will profile and provide the statistics of the database.
  • Views and Columns: This selection will profile and provide the statistics of
    the views and columns.

Profiler Rules

Profiler rules will compute and collect the statistics of a data source to analyze
and explore the data assets. There are settings that can be configured to
increase the efficiency and performance of the database.

By default, when a data source is profiled, all the rows in a dataset are analyzed
to collect the statistics. When setting up the regex rules, the user can write
rules that either include/exclude tables and columns that start with, end with,
or have middle characters as defined in the rule.

  • Include Regex: Enter the specific table, view, and column names that start
    with, end with, or have middle characters that should be included in profiling.
  • Exclude Regex: Enter the specific table, view, and column names that start
    with, end with, or have middle characters that should be excluded from
    profiling.

Note: Profiler Rules work similarly to Crawler Rules.

Profiler Settings

Parameters and Description

Order: The order number refers to the sequence in which profiling is performed.

Day: The day of the week on which profiling is set to run.

Start/End Time: The start and end time between which profiling is set to run.

Number of Threads: A thread is a process in which a query is executed on a database to do single or multiple tasks. The number of threads determines the number of parallel queries executed on the data source.

Profile Type: There are four types of data profiling.

  • Sample: The profiling is based on a given Sample Profile Size. The column statistics (such as Min, Max, Distinct, and Null Count) may differ from those of a full profile. To execute a sample profile, select the profile type "Sample" and enter a Sample Profile Size (the count of records to be profiled).
  • Auto:
    • If the Row Count Constraint checkbox is selected (set to True) and the total table row count (e.g., 1000) is higher than the configured Row Count Limit (e.g., 100), sample profiling is performed using the count specified in the Sample Profile Size.
    • If the Row Count Constraint checkbox is selected (set to True) and the table row count (e.g., 100) is less than the configured Row Count Limit (e.g., 1000), all the rows of the table are profiled without considering the Row Count Limit.
    Note: A profile type set to "Auto" always depends on the Row Count Limit when the Row Count Constraint is set to True. A sketch of this decision logic is shown after this table.
  • Query:
    • If the table row count is less than the Row Count Limit, profiling is executed on the entire table.
    • If the table row count exceeds the Row Count Limit, profiling is skipped for those tables to avoid performance issues.
  • Disabled: This profile type prevents profiling on the selected data source.

Row Count Constraint: The Row Count Constraint option only applies when the Profile Type is set to Auto.

  • If the Row Count Constraint checkbox is selected (set to True) and the configured Row Count Limit (e.g., 100) is less than the total table row count (e.g., 1000), sample profiling is performed using the count specified in the Sample Profile Size.
  • If the Row Count Constraint checkbox is selected (set to True) and the configured Row Count Limit (e.g., 1000) is greater than the total table row count (e.g., 100), the query is executed without considering the Row Count Limit.

Row Count Limit: Enter the maximum number of rows considered for profiling.

Sample Data Count: Enter the total number of rows shown on the table's data page in the Data Catalog.

Sample Profile Size: Enter the number of rows to be considered for sample profiling.

Query Timeout: Enter the number of seconds a query can run on the remote database before it times out.
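The interaction between the Auto profile type, the Row Count Constraint, the Row Count Limit, and the Sample Profile Size described above can be summarized as a small decision function. This is an illustrative sketch only, not OvalEdge's internal implementation; the parameter names simply mirror the settings in the table.

    public class ProfileTypeDecision {

        // Returns the number of rows that would be profiled for a table when the
        // profile type is "Auto", following the rules described in the table above.
        static long rowsToProfile(long tableRowCount,
                                  boolean rowCountConstraint,
                                  long rowCountLimit,
                                  long sampleProfileSize) {
            if (!rowCountConstraint) {
                // Assumption: without the constraint, the full table is profiled.
                return tableRowCount;
            }
            if (tableRowCount > rowCountLimit) {
                // Table is larger than the limit: fall back to sample profiling.
                return Math.min(sampleProfileSize, tableRowCount);
            }
            // Table is within the limit: profile every row.
            return tableRowCount;
        }

        public static void main(String[] args) {
            // 1000-row table, limit 100, sample size 250 -> sample profiling of 250 rows.
            System.out.println(rowsToProfile(1000, true, 100, 250));
            // 100-row table, limit 1000 -> all 100 rows are profiled.
            System.out.println(rowsToProfile(100, true, 1000, 250));
        }
    }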

Remote Access

This module allows OvalEdge to manage access & users at the source system.
There are two options available:

  1. Remote Data Access Management
  2. Data Access Authorization

Note: The Remote Policies feature allows OvalEdge to manage masking and
restriction policies at the source system. It is configured through the
Security module.

Remote Data Access Management 

Remote data access management includes the following setting options:

  1. None
  2. OvalEdge is Master
  3. Remote System is Master

None: Remote data access management is disabled. No remote users are crawled into OvalEdge.

OvalEdge is Master

When OvalEdge is the Master, OvalEdge can be used to create, modify and delete the users at the source system.

Remote System is Master

When the remote system is the Master, changes to the users in the remote system are reflected in OvalEdge.

Data Access Authorization

The Data Access Authorization is included in the crawler-specific connector
settings to validate the user credentials and allow that particular user to access
the query sheet and queries in the data catalog.

OvalEdge follows OvalEdge data permissions

When this option is selected, the OvalEdge application uses a service account
for creating the connection in the Data Catalog’s Data tab, Data Catalog’s Code
tab, and Query Sheet.

OvalEdge follows Remote system permissions

In this case, users are prompted to enter the remote data source credentials on the
Data Catalog’s Code tab, Data Catalog’s Data tab, and Query Sheet. The same
credentials are used to establish the connection and fetch the data. These credentials are cached and cleared after 7 days of inactivity.

Query Policies

Query Policies are included in the crawler-specific connector settings for the
connectors that support querying from the Query Sheet. They allow you to
restrict, for a specific role, the use of selected query types
(UNION, SUM, AVG, etc.) on the Query Sheet when fetching data.

To limit the type of queries that could be executed, complete the following steps:

  1. From the connector’s settings, navigate to the Query Policies tab. A data
    grid with the existing policy details (Role, Query Type, Access Type, and
    Delete columns) is displayed.
  2. To add a new query policy, click the +Add Query Policy button. An inline
    row with Role, Query Type, Access Type, and Cancel options is displayed.
  3. Select the desired Role, Query Type, and Access Type from the respective
    dropdowns and click the Save Changes button.
  4. The selected Query Policy will be displayed in the Datagrid.

    Example: If you select the Role as “OE_ADMIN,” the Query Type as “JOIN,” and
    the Access Type as “DENY,” then users associated with the OE_ADMIN role are
    restricted from using the JOIN function on the Query Sheet page.
In the Query Sheet, if a user associated with the OE_ADMIN role tries to use
the JOIN function, the error message “User does not have permission on
this <TableName> to perform this Function: Join” is displayed.

Access Instructions

This section allows you to add notes about the connector.

To add access instructions, follow the steps below.

  1. From the connector’s settings, navigate to the Access Instructions tab.
  2. Enter the instructions.
  3. Click the Save Changes button.

The notes added in the Access Instructions field are displayed on the data
object’s summary page in the Data Catalog.

Others

In this section, you can configure notifications so that the Data Owner, the
Data Steward, and any other specified user roles are notified of metadata changes.

Copyright © 2019, OvalEdge LLC, Peachtree Corners GA USA