Delta Lake is an open-source data lake technology that provides reliable data
pipelines, transactional consistency, and data versioning capabilities to improve the
reliability and quality of big data and machine learning workloads.
OvalEdge uses JDBC driver to connect to the data source, allowing users to crawl
and profile the data objects, build lineage, and execute queries.
Prerequisites
The following are the prerequisites required for establishing the connection between Delta lake and OvalEdge.
The drivers used by the connector are given below:
Driver |
Version |
Details |
Delta JDBC |
2.6.17 |
Is JDBC 4.2 compliant, uses JDK 8.0 |
Configure Environment Names
This is a global configuration and may already have been established.
To create a distinction between environments for connections (e.g. prod, test, etc.), follow the steps below.
- Go to the Administration tab in the left panel menu and select Configuration
form the dropdown.
Select the Connector tab. - In the OVALEDGE_APP configuration type, enter the desired environment
values in the value column. - In the OVALEDGE_APP configuration type, enter the desired environment
values in the value column.
Driver Requirements
The following driver is required:
Driver | Version | Details |
Delta JDBC | 2.6.17 | It is JDBC 4.2 compliant. It uses JDK 8.0 https://databricks.com/spark/jdbc-drivers-download |
User Permissions
The table below summarizes the minimum access permissions required by the
service account to perform the operations.
Operation |
Minimum Access Permission |
---|---|
Connection Validation |
Usage, Select |
Crawling |
Usage, Select, Reference |
Profiling |
Usage, Select |
Query Execution |
Select |
Technical Specifications
The following is the list of objects and data types supported by the Delta Lake connector.
Crawler |
Profiler |
Lineage Building |
Query Execution |
Tables Views Columns Functions Procedures Synonyms Triggers Roles Users Permissions Usage statistics Tasks Pipes |
Table Profiling: Row count, Columns count, and View sample data View Profiling: Row count, Columns count, View sample data Column Profiling: Min, Max, Null count, distinct, top 50 values Full Profiling |
Table Lineage Column Lineage Lineage Sources: Stored procedures, functions triggers, views, SQL queries |
Select: Joins within database Aggregations Group By Order By |
Establish a Connection
To establish a Delta Lake connection, complete the following steps:
- With your OvalEdge Administration access, log in to the OvalEdge
application, navigate to the Administration module and click on Connectors. - Click on the + icon (New Connector ) to the right to open the Manage
Connector window. - Search and select the desired connector, and the Manage Connector pop-
up with the selected connector details is displayed. Complete the required
fields.
Fields
Details
Connection Type
Select the connection type as Delta Lake
By default, the selected connection type ‘Delta Lake’ is displayed. If required, the dropdown menu allows the user to change the connector type.
License Type
OvalEdge offers two types of licenses based on the customer's requirements. By default, the License type is displayed as ‘Auto Lineage.’ The user can select the license type as Standard or Auto Lineage.
(i) Standard: The standard license has crawler and profiler features and doesn’t have Auto Lineage functionality, where the users are not allowed to build lineage for the selected database and schema-level objects.
(ii) Auto Lineage: The Auto lineage license allows to build a lineage in addition to the Crawling and Profiling feature.
Connection Name
Select a connection name for Delta Lake. The name that you specify is a reference name to easily identify your Delta Lake database connection in OvalEdge.
Example: Delta Lake Connection DB1
Server/IP Address
Enter the database instance URL (on-premises/cloud-based)
Provide format
Example: adb-78xxxxxxxxxxxx091.11.yyyyyyyyyy.net/
Port
Port number 443 is the default port used by Delta Lake when connecting to a server over the internet.
Database
Provide the name of the database that is associated with the delta lake connection, if necessary.
Driver Name
A JDBC driver is a Java library file with the extension .jar that connects to a database. The driver details associated with the delta lake database will be auto-populated by default.
Example: com.yyyyyyyyyyy.spark.jdbc41.Driver
Path
HTTP Path
(Example: sql/protocolv1/o/7811813977693091/0717-094118-bathe927)
Helps in connecting with the legacy-specific cluster or with the SQL WarehouseUsername
User account login credential for Delta Lake Authentication.
Enter the Username to access the required database of Delta lake connectors.
Password
Password/Token (only for Delta Lake Authentication)
Enter the password to access the required database of delta lake connectors.
Connection String
Set the Connection string toggle button to automatic, to get the details automatically from the credentials provided. Alternatively, you can manually enter the string.
Format: jdbc:spark://adb-7xxxxxxxxxxx1.11.yyyyyyyyy.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7811813977693091/0717-4118-bathe927;AuthMech=3;AllowSelfSignedCerts=1;SocketTimeout=0
- Click on the Validate button. The entered connection details will be validated, and the Save & Configure button is enabled.
- Click Save to save the connection. Alternatively, click Save & Configure to
save and configure additional connection settings.
Connection Validation Errors
The table below summarizes common connection validation errors.
Sl.No |
Error Message(s) |
Description |
1 |
Failed to establish a connection; please check the credentials |
Invalid credentials provided or user or role does not have access. |
2 |
Errors while downloading the File. |
403: Access denied [Provide appropriate access to user or role using in connection] 404: No such key [The object does not exist in the remote.] |
Crawl/Profile
After connectivity is established, additional configurations for crawling and profiling
can be specified for the selected data source. Selecting Crawl/Profile allows the
user to select the specific schemas that need to be crawled, profiled, or unprofiled.
Connector Settings
After the connection is established successfully, follow the steps below to navigate to
the connector’s settings.
- Select the connector using the selection box.
- Click on the nine dots icon on the top right side of the page.
- Select settings to configure additional settings for the connector.
The below sections describe each of the settings tabs.
Crawler
When an OvalEdge crawler connects to a data source, it collects and catalogs all the
data elements (i.e., metadata) and stores them in the OvalEdge database. The admin
user may select several options to crawl the data source system.
Crawler Options
The Tables, Views & Columns option is selected by default.
- Tables, Views, and Columns: This option fetches the tables, views, and
columns from the database. - Procedure, Functions & Views Source Code: This option fetches the
procedures, functions, and views into OvalEdge. - Relationship: This option fetches the relationships between the tables into
the OvalEdge. - Users, Roles, Policies & Permissions: This option fetches the users and
roles defined at the source. Selecting this option enables the “Remote
Access” tab.
Crawler Rules
A user can write a regular expression to find matching character sequences in the
source system.
Crawler Logic Rules
A user can define the crawler rules that will either include/exclude tables, views,
columns, procedures, and functions. The regex rule is defined by using logical
patterns that start with, end with, or have a middle character.
- Starts with: Syntax: a[(.*)]
The rule matches when the regex pattern matches the beginning of the
rule with a[(.*)]. - Ends with: Syntax : [(.*)]a
The rule matches when the regex pattern matches the ending of the
rule with [(.*)]a - Middle Characters: Syntax : [(.*)]a[(.*)]
The rule matches when the regex pattern matches anywhere within the
expression.
The logical pattern for each rule is as follows:
When setting up the regex rules, the user will be able to write rules that will either
include/exclude tables, views, columns, procedures, and functions that start with,
end with, or have middle characters as defined in the rule.
Regex |
Syntax |
Sample |
---|---|---|
Include Schema Regex |
"a[(.*)]" |
If a user only wants schemas that start with "a" then the user needs to write "a[(.*)]" in the Include Schema Regex field. After that they need to run the crawl job, then the results for the schemas that start only with character "a” are displayed. |
Exclude Schema Regex |
"[(.*)]a" |
If a user only wants schemas that end with "a" then the user needs to write "[(.*)]a" in the Exclude Schema Regex field. After that they need to run the crawl job, then the results for the tables and views that end with the character "a” are excluded. |
Include Table Regex |
"e[(.*)]" |
If a user only wants a Table that starts with "e" then the user needs to write "e[(.*)]" in the Include Table Regex field. After that they need to run the crawl job, then the results for the tables and views that end only with the character "e” are displayed. |
Exclude Table Regex |
"[(.*)]e" |
If a user only wants a Table that ends with "e," then the user needs to write "[(.*)]e" in the Exclude Table Regex field. After that, they need to run the crawl job, then the results for the tables and views that end only with the character "e” are excluded, and the remaining tables are displayed. |
Include Column Regex |
"a[(.*)]" |
If a user only wants schemas that start with "a," then the user needs to write "a[(.*)]" in the Include Column Regex field. After that, they need to run the crawl job, then the results for the schemas that start only with character "a” are displayed. |
Exclude Column Regex |
"n[(.*)]" |
If a user only wants a Column that starts with "n," then the user needs to write "n[(.*)]" in the Exclude Column field. After that they need to run the crawl job, then the results for the column that start only with the character "n” are excluded and the remaining column details are displayed. |
Include Procedure & Function Regex: |
"s[(.*)]" |
If a user only wants Procedure & Functions that start with "s," then the user needs to write "s[(.*)]" in the Include Procedure & Function Regex field. After that, they need to run the crawl job, then the results for the column that start only with the character "s” are displayed. |
Exclude Procedure & Function Regex: |
"s[(.*)]" |
If a user only wants Procedure & Functions that start with "s," then the user needs to write "s[(.*)]" in the Exclude Procedure & Function Regex field. After that, they need to run the crawl job, then the results for the column that start with the character "s” are not displayed. |
Note: Apart from "[(.*)]" users can also user "%" as a syntax for Regex rule.
Profiler
To profile a data source, the process starts by selecting the desired data source
and configuring the profile settings to increase the efficiency and performance of
the database (while the profile job is running). There are many attributes that can
be specified in the profile settings.
Profiler Options
- Table and Columns: By default, the checkbox for Tables and Columns is
selected. It will profile and provide the statistics of the database. - Views and Columns: This selection will profile and provide the statistics of
the views and columns.
Profiler Rules
Profiler rules will compute and collect the statistics of a data source to analyze
and explore the data assets. There are settings that can be configured to
increase the efficiency and performance of the database.
By default, when a data source is profiled, all the rows in a dataset are analyzed
to collect the statistics. When setting up the regex rules, the user can write
rules that either include/exclude tables and columns that start with, end with,
or have middle characters as defined in the rule.
- Include Regex: Enter the specific table, view, and column names that start with, end with, or have central characters that are included for profiling.
- Exclude Regex: Enter the specific table, view, and column names
that start with, end with, or have central characters that are excluded
for profiling.
Note: Profile Rules work similar to Crawler Rules.
Profiler Settings
Parameters | Description |
Order | The order number refers to the sequence in which the profiling is performed. |
Day | Day indicates the day of the week profiling is set to run. |
Start/End Time | The start and end time that profiling is set to perform. |
Number of Threads | A thread is a process in which a query is executed on a database to do single or multiple tasks. The number of threads determines the number of parallel queries executed on the data source. |
Profile Type |
There are four types of data profiling.
|
Row Count Constraint | The Row Count Constraint option only applies when the Profile Type is selected as Auto.
executed without considering the count mentioned in the Row count Limit. |
Row Count Limit | Enter the maximum number of rows considered for profiling. |
Sample Data Count | Enter the total number of rows within the table data page in the Catalog. |
Sample Profile Size | Select this option for considering the total number of rows in profiling. |
Query Timeout | Enter the number of seconds for the query to run on the remote database before it times out. |
Remote Access
This module allows OvalEdge to manage access & users at the source system.
There are two options available:
- Remote Data Access Management
- Data Access Authorization
Note: The Remote Policies feature (in the above image) allows OvalEdge to manage
masking and restriction policies at the source system. It is configured through the
Security module.
Remote Data Access Management
Remote data access management includes the following setting options
- None
- OvalEdge is Master
- Remote is Master
None: Remote data access management is disabled. No remote users are crawled into OvalEdge.
OvalEdge is Master
When OvalEdge is the Master, OvalEdge can be used to create, modify and delete the users at the source system.
Remote System is Master
When the remote system is the Master, changes to the users in the remote system are reflected in OvalEdge.
Data Access Authorization
The Data Access Authorization is included in the crawler-specific connector
settings to validate the user credentials and allow that particular user to access
the query sheet and queries in the data catalog.
OvalEdge follows OvalEdge data permissions
When this option is selected, the OvalEdge application uses a service account
for creating the connection in the Data Catalog’s Data tab, Data Catalog’s Code
tab, and Query Sheet.
OvalEdge follows Remote system permissions
In this case, users are prompted to enter the remote data source credentials on the
Data Catalog’s Code tab, Data Catalog’s Data tab, and Query Sheet. The same
credentials are used to establish the connection and fetch the data. These credentials are cached and cleared after 7 days of inactivity.
Query Policies
The Query Policies are included in the crawler-specific connector settings for the
connectors that support querying from Query Sheet. Here the system allows you,
under the specific role, to restrict the usage of the selected query type
(UNION/SUM/AVG, etc.) on the Query Sheet for fetching the required data.
To limit the type of queries that could be executed, complete the following steps:
- From the connector’s settings, navigate to the Query Policies tab. The data
grid with existing Policy details with Role, Query Type, Access Type, and
Delete columns is displayed. - To include new query policies, click the +Add Query Policy button in the
inline row with Role, Query Type, Access Type, and Cancel options displayed. - Select the desired Role, Query Type, and Access Type from the respective
dropdowns and click the Save Changes button. - The selected Query Policy will be displayed in the Datagrid.
Example: If you select the Role as “OE_ADMIN,” Query Type as “JOIN,” and
the Access Type as “DENY,” then the users associated with the OE_ADMIN
privileges are restricted to use the JOIN function on the Query Sheet page.
the JOIN function, then the error message “User does not have permission on
this <TableName> to perform this Function: Join” is displayed.
Access Instructions
This section allows you to add notes about the connector.
To add access instructions, follow the steps below.
- From the connector’s settings, navigate to the Access Instructions tab.
- Enter the instructions.
- Click the Save Changes button.
The notes added in the Access Instructions field are displayed on the data
object’s summary page in the data catalog as shown below.
Others
In this section, you can send notifications to the Data Owner and Data Steward in
case of any metadata changes by including specific user roles.
Copyright © 2019, OvalEdge LLC, Peachtree Corners GA USA