Data Pipeline

The Data Pipeline serves as a standard agreement between data producers and consumers on the format in which data is represented. In this context, several client services send event messages to Kafka, adhering to Data Pipeline schemas that are structured in JSON format.
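For illustration only, the snippet below shows how a producer service might publish a JSON event that conforms to such a schema, using the kafka-python package. The broker address, topic name, and event fields are hypothetical examples and are not defined by OvalEdge.

```python
# Minimal sketch: publish a JSON event to a Kafka topic with kafka-python.
# The broker address, topic name, and event fields are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "user_signup",
    "user_id": 12345,
    "occurred_at": "2024-01-01T12:00:00Z",
}

# The topic is typically named after the pipeline; the event must match the agreed JSON schema.
producer.send("user-signup-events", value=event)
producer.flush()
```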

Data in the Data Lake is organized into numerous buckets and sub-buckets. These subdivisions are grouped by topic, named after the corresponding Kafka pipeline. Each topic can have any number of associated schemas, and once the schemas are specified, files are further categorized by their date and time of entry.
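As a rough illustration of this layout, the sketch below lists files under a hypothetical topic/schema/date prefix using boto3. The bucket name and key structure are assumptions made for the example, not the exact layout used by OvalEdge.

```python
# Sketch: list Data Lake files under a topic/schema/date prefix with boto3.
# Assumed key layout: <topic>/<schema>/<yyyy>/<mm>/<dd>/<file>
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="oe-pipeline",                              # hypothetical bucket name
    Prefix="user-signup-events/user_signup_v1/2024/01/01/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```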

To enhance data source visibility, OvalEdge has introduced a data pipeline connector. This connector retrieves schemas from GitHub, fetches tables and columns from JFrog, and performs data profiling using Amazon S3.

Connector Capabilities

The following are the data objects and functionalities supported by the Data Pipeline connector.

Crawler
  • Schemas
  • Tables
  • Columns

Profiler
  • Table Profiling: Row count, Column count, and View sample data
  • Column Profiling: Min, Max, Null count, Distinct, Top 50 values
  • Full Profiling
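To make the profiling statistics above concrete, the sketch below computes the same measures with pandas over a local Parquet file. It only illustrates what the profiler reports; it is not the connector's implementation, and the file name is hypothetical.

```python
# Sketch: the table- and column-level statistics listed above, computed with pandas.
import pandas as pd

df = pd.read_parquet("events.parquet")      # hypothetical file

# Table profiling: row count, column count, sample data
print("rows:", len(df), "columns:", len(df.columns))
print(df.head())

# Column profiling: min, max, null count, distinct count, top 50 values
for col in df.columns:
    print(col, df[col].min(), df[col].max(), df[col].isna().sum(), df[col].nunique())
    print(df[col].value_counts().head(50))
```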

Prerequisites

GitHub Connector

To retrieve the schemas when building the Data Pipeline connector, you must first establish the GitHub connector and note its GitHub connection ID. This ID is used when setting up the Data Pipeline connection.

For more information on establishing the GitHub connection, refer to the GitHub connector article.

Note: Read permission must be granted on the repositories located at the target links.
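A quick way to verify this read permission is to call the GitHub REST API with the token used by the GitHub connection, as in the sketch below. The repository path reuses the example from this article, and the token is a placeholder.

```python
# Sketch: check that a personal access token can read the target repository.
import requests

response = requests.get(
    "https://api.github.com/repos/ovaledgeindia/oetest1",
    headers={"Authorization": "Bearer <personal-access-token>"},
)
# 200 indicates the token can read the repository; 403/404 indicates it cannot.
print(response.status_code)
```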

JFrog Repository Details

To retrieve the tables and columns when building the Data Pipeline connector, you must enter the JFrog repository details: JFrog Host Name, JFrog Repo Name, JFrog API Key, and JFrog Files Prefix.

For more information on JFrog repository details, refer to the Additional Information section.
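As an optional sanity check of these details, the sketch below lists the repository contents through the Artifactory storage REST API, authenticating with the JFrog API Key. The host and repository names reuse the examples from this article; the API key is a placeholder.

```python
# Sketch: list the contents of a JFrog Artifactory repository with the storage API.
import requests

response = requests.get(
    "https://jfrog.ovaledge.net/artifactory/api/storage/oe-test-repo/",
    headers={"X-JFrog-Art-Api": "<jfrog-api-key>"},
)
response.raise_for_status()
for child in response.json().get("children", []):
    print(child["uri"], "(folder)" if child.get("folder") else "(file)")
```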

S3 Connector

To profile the schemas, tables, and table columns within the OvalEdge application when creating the Data Pipeline connector, you must first set up the S3 connector and note the S3 Connection ID and S3 Bucket Name. These details are used when setting up the Data Pipeline connection.

For more information on establishing the S3 connection, refer to the S3 connector article.

Service Account Permissions

The following are the minimum privileges required for a service account to crawl the data available in the JFrog repository.

Operation: JFrog API Key
Minimum Access Permission: The API key is generated from a username and password. The username must have read permission on the respective repository.

Operation: JFrog Repo
Minimum Access Permission: Read permission
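For reference, the sketch below shows one way an API key has commonly been created from a username and password through the Artifactory security REST API. JFrog is deprecating API keys (see the link under the JFrog API Key field later in this article), so confirm the process supported by your JFrog version.

```python
# Sketch: create an Artifactory API key for a user with read permission on the repository.
# The host, username, and password are placeholders.
import requests

response = requests.post(
    "https://jfrog.ovaledge.net/artifactory/api/security/apiKey",
    auth=("<username-with-read-permission>", "<password>"),
)
print(response.json().get("apiKey"))
```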

Establish Environment Variables

This section describes the settings or instructions that you should be aware of prior to establishing a connection. If your environments have been configured, skip this step.

Configure Environment Names

The Environment Names allow you to select the environment configured for the specific connector from the drop-down list in the Add Connector pop-up window.

You might want to consider crawling the same schema in both stage and production environments for consistency. The typical environments for crawling are PROD, STG, or Temporary, and may also include QA or other environments. Additionally, crawling a temporary environment can be useful for schema comparisons, especially during application upgrade assistance, and that environment can be deleted afterward.

Steps to Configure the Environment 

  1. Log into the OvalEdge application.
  2. Navigate to Administration > System Settings.
  3. Select the Connector tab.
  4. Find the key name “connector.environment”.
  5. Enter the desired environment values (PROD, STG) in the Value column.
  6. Click ✔ to Save.

Establish a Connection

To connect to Data Pipeline using the OvalEdge application, complete the following steps:

  1. Log in to the OvalEdge application.
  2. Navigate to Administration > Connectors module.
  3. Click on the + (New Connector) button at the top right of the page.
  4. The Add Connector pop-up window is displayed, where you can search for the Data Pipeline connector.

  5. The Add Connector with Connector Type specific details pop-up window is displayed. Enter the relevant information to configure the Data Pipeline connection.
    Note: The asterisk (*) denotes mandatory fields required for establishing a connection.

    Field Name

    Description

    Connector Type

    By default, the selected connector type ‘Data Pipeline’ is displayed.

    Connector Settings

    License Add-Ons

    All connectors have a Base Connector License by default, which allows you to crawl and profile a data source to obtain its metadata and statistical information.

    OvalEdge supports various License Add-Ons based on the connector’s functionality requirements.

    • Select the Auto Lineage Add-On license to enable automatic construction of lineage of data objects for a connector that supports the Lineage feature.
    • Select the Data Quality Add-On license to identify, report, and resolve data quality issues for a connector whose data supports data quality, using DQ Rules/functions, anomaly detection, reports, and more.
    • Select the Data Access Add-On license to enforce connector access via OvalEdge with the Remote Data Access Management (RDAM) feature enabled.

    Connector Name*

    Enter a connector name for the Data Pipeline connection. The name you specify is a reference for your Data Pipeline connection in OvalEdge.

    Example: DataPipeline_Connection

    Connector Environment

    The Connector Environment drop-down list allows you to select the environment configured for the connector.

    For example, PROD or STG (based on the values configured for connector.environment in the OvalEdge configuration).
    The purpose of the environment field is to help you identify which type of system environment (Production, STG, or QA) a connector connects to.

    Note: The steps to set up environment variables are explained in the prerequisites section. 

    GitHub ConnectionID*

    The connection ID of the GitHub connector.

    Example: 1001

    GitHub Config Path*

    Specify the URL of the GitHub repository where all the target YAML files are located. 

    Example: https://github.com/ovaledgeindia/oetest1/blob/main/file1.yaml


    To know more, refer to the Additional Information section.

    JFrog Host Name*

    The JFrog host is the server or instance where JFrog's Artifactory repository manager is installed and hosted. This server hosts the repositories where you store your software artifacts.

    Example: https://jfrog.ovaledge.net


    To know more, refer to the Additional Information section.

    JFrog Repo Name*

    The user-defined name of the JFrog repository created in JFrog Artifactory.

    Example: oe-test-repo


    To know more, refer to the Additional Information section.

    JFrog API Key*

    The API key used to authenticate REST API calls.

    To generate an API key: https://jfrog.com/help/r/jfrog-platform-administration-documentation/jfrog-api-key-deprecation-process


    Note: The API key is generated from a username and password. The username must have read permission on the respective repository.

    JFrog Files Prefix*

    The prefixes of the ZIP files located in the JFrog repository.

    Example: *-ovaledge-json-schema, *-dp-json-schema


    To know more, refer to the Additional Information section.

    S3 ConnectionID*

    The connection ID of the S3 connector.

    Example: 1000

    S3 Bucket Name*

    The name of the AWS S3 bucket where the Parquet files are hosted.

    Example: OE-Pipeline

    Default Governance Roles

    Default Governance Roles*

    You can select a specific user or team for the governance roles (Steward, Custodian, Owner) that are assigned to manage the data asset.

    Note: The drop-down list displays all the configurable roles (a single user or a team) as per the configurations made in the OvalEdge Security | Governance Roles section.

    Admin Roles

    Admin Roles*

    Select the required admin roles for this connector.

    • To add Integration Admin Roles, search for or select one or more roles from the Integration Admin options, and then click on the Apply button.
      The responsibility of the Integration Admin includes configuring crawling and profiling settings for the connector, as well as deleting connectors, schemas, or data objects.
    • To add Security and Governance Admin roles, search for or select one or more roles from the list, and then click on the Apply button.
      The Security and Governance Admin is responsible for:
      • Configuring role permissions for the connector and its associated data objects.
      • Adding admins to set permissions for roles on the connector and its associated data objects.
      • Updating governance roles.
      • Creating custom fields.
      • Developing Service Request templates for the connector.
      • Creating approval workflows for the templates.

    No Of Archive Objects*

    The number of archive objects indicates the number of recent metadata modifications made to a dataset at a remote/source location. By default, the archive objects feature is deactivated. However, users may enable it by clicking the Archive toggle button and specifying the number of objects they wish to archive. 


  6. After filling in all the connection details, select the appropriate button based on your preferences. 
    1. Validate: Click on the Validate button to verify the connection details. This ensures that the provided information is accurate and enables successful connection establishment.
    2. Save: Click on the Save button to store the connection details. Once saved, the connection will be added to the Connectors home page for easy access.
    3. Save & Configure: For certain Connectors that require additional configuration settings, click on the Save & Configure button. This will open the Connection Settings pop-up window, allowing you to configure the necessary settings before saving the connection.
Once the connection is validated and saved, it will be displayed on the Connectors home page.

Connection Validation Errors

Error Message: error_validate_connection
Description: An alert message is displayed when the provided details are incorrect.

Note: If you have any issues creating a connection, please contact your assigned GCS team.

Connector Settings

Once the connection is established successfully, various settings are provided to fetch and analyze the information from the data source.

Connection Settings

Description

Crawler 

Crawler settings are configured to connect to a data source and to collect and catalog all the data elements in the form of metadata. See the crawler options to set the crawler's behavior in the Crawler & Profiler Settings.

Profiler

The process of gathering statistics and informative summaries about the connected data source(s). Statistics can help assess the data source's quality before using it in an analysis. Profiling is optional; crawling can be run without profiling. For more information, refer to Crawler & Profiler Settings.

Business Glossary Settings

The Business Glossary settings provide flexibility and control over how users view and manage term associations within the context of a business glossary at the connector level.

Access Instructions

Access Instructions allow the data owner to instruct other users on how to use the objects in the application.

Note: For more information, refer to the Connector Settings.

The Crawling of Schema(s)

The Crawl/Profile option allows you to select specific schemas for the following operations: crawl, profile, crawl & profile, or profile unprofiled. For any scheduled crawlers and profilers, the defined run date and time are displayed.

  1. Navigate to the Connectors page, and click on the Crawl/Profile option.
  2. Select the required Schema(s).
  3. Click on the Run button to gather all the metadata from the connected source into the OvalEdge Data Catalog.

Note: For more information on scheduling, refer to the Scheduling Connector article.

Additional Information

Below are the reference screenshots that are helpful in understanding the various fields of a Data Pipeline connector.

  • GitHub Config Path
    Specify the URL of the GitHub repository where all the target YAML files are located.

  • JFrog Host Name
    The JFrog host is the server or instance where JFrog's Artifactory repository manager is installed and hosted. This server hosts the repositories where you store your software artifacts.

  • JFrog Repo Name
    The user-defined name of the JFrog repository created in JFrog Artifactory.

  • JFrog Files Prefix
    The prefixes of the ZIP files located in the JFrog repository, as illustrated in the sketch below.
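The sketch below shows how such wildcard prefixes could be matched against ZIP file names. The file names are hypothetical; the patterns are the examples from this article, with a trailing * appended so they also match the version and extension that follow.

```python
# Sketch: match ZIP file names against the configured JFrog file prefixes.
from fnmatch import fnmatch

prefixes = ["*-ovaledge-json-schema", "*-dp-json-schema"]
files = [
    "orders-ovaledge-json-schema-1.2.zip",   # hypothetical file names
    "users-dp-json-schema-0.9.zip",
    "readme.txt",
]

for name in files:
    # Append "*" so the pattern also matches the trailing version and ".zip" extension.
    if any(fnmatch(name, pattern + "*") for pattern in prefixes):
        print("matched:", name)
```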