Data Warehouse

Amazon Redshift Connector

The Amazon Redshift connector is used to pull the metadata existing in the Amazon Redshift connector database and helps the users to crawl the metadata and profile the sample data and build lineage to view the movement of the crawled data displaying the relationship between the objects and its profile statistics.

Redshift

Data Sources

OvalEdge crawler integrates with various data sources to help the users to extract metadata and build a data catalog. In this document, you can see how to make a connection to your Amazon Redshift and crawl the Views, Stored Procedures, and Functions.

Crawling

Crawling is a process of collecting information about data from various data sources like on-premise and cloud databases, Hadoop, visualization software, and file systems. When an OvalEdge crawler connects to a data source, it collects and catalogs all the data elements (i.e., metadata) and stores it in OvalEdge.  OvalEdge crawlers can be scheduled to scan the databases regularly.

Connect to the Data

Before you can crawl and build a data catalog, you must first connect to your data. OvalEdge requires users to configure a separate connection for each data source type. The users must enter the source credentials for each type of connectivity. Once a data connection is made, a simple click of the Crawl button starts the crawling process.       

Connector Capabilities  

Technical Specification 

A technical specification for the Amazon Redshift connector contains information about the Crawler, Profiler, and Query Execution and also information about supported objects, supported data types, and user permissions.

Crawling

Feature

Supported Objects

Supported Data Types

Crawling

Tables

-

Table Columns

Bigint, Boolean, Char, Date, Datetime, Datetime2, Time Stamp, Decimal (P, S), Float, Int, RealSmallint, Time, Varchar

Views, Stored Procedures, Functions

-

Profiling

Feature

Supported Objects

Remarks

Table Profiling

Row Count, Columns Count, View Sample data

-

View Profiling

Row Count, Columns Count, View sample data

The view is treated as a table for profiling purposes.

Column Profiling

Min, Max, Null Count, Distinct, Top 50 values

-

Full Profiling

Supported

-

Sample Profiling

Supported

-

Lineage Building

Lineage entities

Details

Table Lineage

Supported

Column Lineage

Supported

Lineage Sources

Stored Procedures, functions, triggers, views, SQL queries (from Query Sheet)

Querying 

Operation 

Details

Select

Supported

Insert

Not supported by default.

Update

Not supported by default.

Delete

Not supported by default.

Joins within database

Supported

Joins outside database

Not supported

Aggregations

Supported

Group By

Supported

Order By

Supported

By default, the service account provided for the connector will be used for any query operations. If the service account has to write privileges, then Insert / Update / Delete queries can be executed.

Connection Details

To use the Amazon Redshift connector, the details specified in the following section should be available.

Pre-requisites

  • By default, the service account provided for the connector will be used for any query operations. If the service account has write privileges, then Insert/Update/Delete queries can be executed.
  • The minimum privileges required are

Operation 

Minimum Access Permission For Service Account

Connection Validation

Select and Usage

Crawling

Select, Usage , Reference, and Execution 

Profiling 

Select 

Query execution

Select 

Add a connection

To connect to the Amazon Redshift connector using the OvalEdge application, complete the following steps.

  1. Log in to the OvalEdge application.
  2. Navigate to Administration > Connectors module.
  3. To add a new connection, click on the +AddNewConnector icon. A manage connection pop-up is displayed to select a connector. Search for, or select an Amazon Redshift connector. The Manage Connection form is displayed, where information for setting up a connection is provided. 

Field Name

Description

Connection Type

Select Amazon Redshift connector.

By default, the selected connection type is displayed as Amazon Redshift. If required, the connection type can be changed, and depending on the connector selected; fields are displayed accordingly.

License Type

By default, License type is Auto Lineage.


License Type: 

In a license type, the permissions are specified based on the requirements of the customer. Users have the option to select the license type as Standard or Auto Lineage.  The connector license is categorized into the:

(i) Standard: The standard connectors may not have Auto Lineage functionality. It will not build the lineage for the selected database.


(ii) Auto Lineage: Additionally, auto lineage connectors have Auto Lineage functionality. It will build the lineage for the selected database.


See, License Types for more information. 

Connection Name

Enter the name of the connection, the connection name specified in the Connection Name textbox will be a reference to the Amazon Redshift connection in the OvalEdge application.

Environment

By default, there is no selection of environment from the dropdown list when there is no environment is configured for Redshift. 


Environment: The purpose of the environment field is to help users to understand that the new connector is established in an environment available at the development, production, and QA.

Server

Enter the fully qualified server name or IP address related to the Redshift database instance URL (on-premises/cloud-based)

Example: 34.54.23.43

Port

By default, port number 5439, related to the Redshift database, is displayed. If needed, a new port number can be provided.

Database

Enter the name of the database.


It is a user-specified database name to crawl the metadata of selected connectors.

Driver

By default, the driver details associated with the connector database are displayed, and it is non-editable.

Username

By default, the username of the OvalEdge application is displayed. If required, users can manually enter the user's credential of the OvalEdge application. 

Password

By default, the OvalEdge application password is displayed.  If required, users can manually enter the user's credential of the OvalEdge application. 

Connection String

A connection string is used to configure the Amazon Redshift connection object. Amazon Redshift connections are specified by key-value pairs in a connection string.


The connection string has auto and manual modes. 

By default, it is set to Auto mode, where connection string details are fetched automatically and displayed in the connection string textbox. Alternatively, the user can manually enter the string by setting the toggle button to manual mode. 


Example: jdbc:amazon redshift://{server}:5439/{sid}

Plugin Server

Provide the server name if you are running this as a plugin.

Plugin Port

Provide the port number on which the plugin is running.

Governance Roles

The admin will select a specific user or a team from the governance roles (Steward, Custodian, Owner) that get assigned to the data asset. 

The dropdown list displays all the configured users (single user or a team) as per the configurations made in the Security > Governance Roles page.  

No. of Archived Objects

By default, the number of archive objects is set to disable mode. Click on the Archive toggle button and enter the number of objects you wish to archive.

No. of archive objects:  It is the count of the number of last modifications made in the metadata data of a Remote/source. 

For example, if you update the count as 4 in the ‘No. of archive object’ field, and then the connection is crawled. It will provide the last 4 changes that occurred in the remote/source of the connector. You can observe these changes in the ‘version’ column of the ‘Metadata Changes’ module. 

Select Bridge

Bridge: To collect data from both on-premises and cloud sources, data administrators need to configure Cloud Bridge by specifying a data source connection and unique parameters.

Note: If Redshift Connector is configured with a bridge, the Bridge ID will be shown in the Bridge dropdown menu when bridges are configured and added, or it will be displayed as "NO BRIDGE". 

 

Manage Connector

Error Validation Details 

Error Message(s)

Description

Connection Timeout

The server is not Up and Running. 

BadSQLgrammar

A SQL query may fail with a bad SQL grammar exception

The table or view does not exist

Do not have permission or any syntax error in the query.

Connection Settings 

Once the connection is established successfully, there are various settings provided to fetch and analyze the information from the data source. The connection setting includes Crawler, Profiler, Remote Access, Access Instruction, Lineage, and Others. Based on the connector selected the option on the setting page will differ. 

To configure the connector settings, select the Redshift Connection Name from the Crawler Information page and click on the Nine dots option and select the Settings options. The Connection Settings pop-up window is displayed. 

Crawler Settings

The crawler setting will help to get the information of the data source based on the crawler setting provided. Crawler setting has various settings options, including Crawler options, Crawler Rules, Query Log Setting.

Crawler Settings

Details

Tables, Views, and Columns

Select the checkbox to crawl the tables, views, and columns existing in the Oracle Service Cloud database into the OvalEdge.

Note: By default, the checkbox for Tables, Views, and Columns is selected.

Procedure Functions & Relationship

Select the checkbox to determine relationships between the tables for the selected schemas existing in the Amazon Redshift Connector database into the OvalEdge.

Crawler Rules: Include Regex

Enter the specific schema, table, views, and column names that start with, end with, or have middle characters that are included for crawling.

Crawler Rules: Exclude Regex

Enter the specific schema, table, views, and column names that start with, end with, or have middle characters that are excluded for crawling.

Profiler Settings

Profiler Settings

Details

Tables and Columns

Select the checkbox to profile the tables and columns that exist in the Oracle Service Cloud database in the OvalEdge.

Note: By default, the checkbox for Tables and Columns is selected.

Views and Columns

Select the checkbox to profile the views and columns that exist in the Oracle Service Cloud database in the OvalEdge.

Note: By default, the checkbox for Views and Columns is selected.

Profile Rules: Include Regex

Enter the specific table, view, and column names that start with, end with or have middle characters that are included for profiling.

Profile Rules: Exclude Regex

Enter the specific table, view, and column names that start with, end with or have middle characters that are excluded for profiling.

To configure the Profile Setting,

Click on the Edit icon that allows the Admin user to configure the profiler setting for the selected data source. There are many attributes you can specify in the profile settings. 

The attributes are as follows,

Columns

Description

Order

The order number is the sequence in which the profiling is done. 

Day

The day of the week on which profiling is set to run.

Start/End Time

The start and end time at which profiling is set to perform.

Number of Threads

Thread is a process where a query is executed on a database to do single or multiple tasks. The number of threads determines the number of parallel queries executed on the data source. 

Profile Type

There are four main types of data profiling. 

  • Sample 

The profiling is performed based on a given Sample Profile Size. The data on columns ( statistics such as Min, Max, Distinct, and Null Count) will be different compared to full profiles as it is calculated only on sample size. 
To execute a sample profile, select the profile type as “Sample” and enter a sample profile size(count of records to be profiled).

  • Auto 
    • If the Row Constraint checkbox is selected (Set as True) and if the total Table Row Count (1000)  is greater than configured Rowcount Limit (100), then the sample profiling is performed by considering the count mentioned in the Sample Profile Size.
    • If the Row Constraint checkbox is selected (Set as True) and if the  Table Row Count (100) is less than configured Rowcount Limit (1000), then all the rows of the table will be profiled without considering the count mentioned in the Rowcount Limit.

Note:  A profile type set to “Auto” will always depend on the Rowcount limit when the row count constraint must be set as “True.” 

  • Query:
    • If the entered table row count is less than the Rowcount Limit, then the profiling is executed on the entire table. 
    • If the input table row count exceeds the Rowcount Limit, then the profiling skips execution for those tables to avoid performance issues.
  • Disabled: 
    Disabled profile type prevents profiling on the selected data source

Row Count Constraint

The Row Count Constraint option is applicable only when the Profile Type is selected as Auto.

  • If the Row Constraint checkbox is selected (Set as True) and if the configured Rowcount Limit (100) is less than the total Table Row Count (1000) then the sample profiling is performed by considering the count mentioned in the Sample Profile Size.

If the Row Constraint checkbox is selected (Set as True) and if the configured Rowcount Limit (1000) is greater than the total Table Row Count (100) then the query is executed without considering the count mentioned in the Rowcount Limit. 

Row Count Limit

Enter the maximum number of rows that should be considered for profiling.

Sample Profile Size

Select this option for considering the total number of rows in profiling.

Query Timeout

Enter the number of seconds for the query to run on the remote database before it times out.

Crawler/Profiler

Once the setting for profiling has been configured, go back to the Crawler screen and click “Crawl/Profile” to begin the profiling. 

Remote Access

This Remote Access tab lists the data objects and the meta and data permissions on these objects that a user is assigned access to in a remote application.

Remote Data Access Management

Remote data access management has three ways to connect a remote database

  • None: When you crawl any FileFolders/Buckets, all the users and roles from the remote source will come into the Remote Users tab and Remote Roles tab in the Administration > Users & Roles.
  • Remote System is a master: In the Remote Access tab, the user selects an option of a Remote system is the master, and when you crawl a remote connection, all the users and roles available in the remote source about that FileFolders/Buckets connection are displayed in the OvalEdge (Administration - > Users & Roles). 
  • At the time of crawling, the user permission available on that FileFolders/Buckets will also be reflected in the Users & Roles | Remote users and Remote roles tab. You will be able to log in with that user's default password; then, you can change it on the first login.
  • When this option is selected, the admin users cannot create, update or delete the users or roles will also be reflected in the Security, FileFolders/Buckets tab.
  • OvalEdge is a master: When OvalEdge is the master, users can assign Roles and User-based permissions to Objects. For that admin, users can use the existing Users and Roles, or it can create new Users and Roles and then assign them.
  • At the time of Crawling, users and roles assigned to the FileFolders/Buckets are displayed.
  • When this option is selected, the admin users can create, update or delete the users or roles. This will get reflected or added in remote sources as well. It also considers the roles permissions and FileFolders/Buckets permissions. Security FileFolders/Buckets level permission can be updated from OvalEdge.

Note: Remote is master, or OE is master in the Remote Access will not work unless Users, Roles, Policies & Permissions are not checked.

Data Access Authorization (Data Permissions)

The Data Access Authorization is about the data permissions is to ensure that the right user is accessing the query sheet and queries in the data catalog. Here, the system validates the user credentials and allows that user to access the query sheet and queries in the data catalog.

  • OvalEdge follows OvalEdge data permissions: When this option is selected, the OvalEdge application uses a service account used to create the connection in the 
    • Query Sheet page
    • Data Catalog > Table > Data page and 
    • Data Catalog > Code page to fetch the required data
  • OvalEdge follows Remote system permissions: When this option is selected, then the users are prompted to enter the remote data source credentials in the 
    • Query Sheet page, 
    • Data Catalog > Table > Data page and 
    • Data Catalog > Query page 

The submitted credentials are used to establish the connection to fetch the required data. Here, these credentials will be cached and cleared after seven days of inactivity.

Note: Remote is master, or OE is master in the Remote Access will not work unless Users, Roles, Policies & Permissions are not checked.

Remote Policy

Sync OvalEdge policy with Remote:  You can select the check box to assign the OvalEdge policy with the remote. When selected, this option enables various predefined OvalEdge policy schemes to be applied on the remote connection. 

Query Policy

The Query Policies in the Crawler setting are used to provide the right access to the Query sheet functions (Join, Union, SUM, or aggregate functions). You can specify the desired roles and permission to deny the usage of the query sheet function. A role that has been denied policy permission will not have access to those functions in the Query Sheet.

  1. Click on the Query Policies tab, and the data grid with existing Policies details with Role, Query Type, Access Type, and Delete columns are displayed. 
  2. To include new query policies, click on the +Add Query Policy button.  
  3. Select the desired Role, Query Type, and Access Type from the respective dropdowns and click on the Save Changes button

Example: If the user selects the Role as “OE_HRADMIN,” Query Type as “JOIN” and the Access Type as “DENY,” then the users associated with the OE_HRADMIN privileges are restricted from using the JOIN function in the Query Sheet page.

Access Instruction

It allows the admin user to write the instructions and guide the business user to crawl the data source. Instructions are nothing, but some related information about the data source connection can be provided here. it could have the information resources like links, Images, or Videos that help business users of that particular data source connection. 

For example: When an admin user saves the access instructions and crawls the schema, the saved instruction will appear in the data catalog after clicking on the schema. It's just providing info or guidelines on a specific schema.

  • You can provide the instruction in Crawler > Setting page 
  • Click on the Access Instruction tab
  • Enter the instructions 

Click the Save Changes button. Once you add the access instruction for a specific connection in the crawler settings, it will appear in the connection hierarchy like a database.

Other

The Send Metadata Changes Notifications option is provided in the Others tab of the settings page. The purpose is to send notifications on any metadata changes to data owners or stewards. 

 Send Metadata Notification to:

  1. Select whether the notifications for the Data Owner and Data Steward under the specific Roles need to be sent or not.
  2. Select the desired role from the Roles dropdown menu and click on the Save Changes button. The notifications will be successfully sent to select Data Owner and Data Steward.