
HDFS

HDFS (Hadoop Distributed File System) is the underlying storage layer that holds both the data and the metadata for various datasets. The metadata is often stored in HDFS files.
OvalEdge uses the Apache Hadoop common library to establish a connection with the HDFS data source for metadata crawling and profiling.

Connector Capabilities

The connector capabilities are shown below:

Crawling

| Features | Supported Objects | Remarks |
| --- | --- | --- |
| Crawling | Buckets | While crawling the root, File Folders/Files are cataloged by default. |

Please see this article Crawling Data for more details on crawling.

Profiling

Please see Profiling Data for more details on profiling.

| Feature | Support | Remarks |
| --- | --- | --- |
| File Profiling | Row count, Columns count, View sample data | Supported File Types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC |
| Sample Profiling | Supported | |
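
The file profiling statistics listed above (row count, column count, and sample data) can be illustrated with a minimal sketch for the CSV case. This is not OvalEdge's profiler; the function name `profile_csv` and the `sample_size` parameter are hypothetical, and the sketch assumes the first line of the file is a header row.

```python
import csv
import io


def profile_csv(text, sample_size=5):
    """Return row count, column count, and a small sample for CSV text.

    Assumption: the first row is a header naming the columns.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        # Empty input: nothing to profile.
        return {"row_count": 0, "column_count": 0, "columns": [], "sample": []}

    row_count = 0
    sample = []
    for row in reader:
        row_count += 1
        if len(sample) < sample_size:
            # Keep only the first few rows as sample data.
            sample.append(dict(zip(header, row)))

    return {
        "row_count": row_count,
        "column_count": len(header),
        "columns": header,
        "sample": sample,
    }


if __name__ == "__main__":
    data = "id,name,city\n1,Ana,Lima\n2,Bo,Oslo\n3,Cy,Rome\n"
    print(profile_csv(data, sample_size=2))
```

The other supported file types (XLS, XLSX, JSON, AVRO, PARQUET, ORC) would need format-specific readers, which are outside this sketch.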

 

Pre-requisites

To use the connector, the following need to be available:

  • Connection details as specified in the following section should be available.
  • Service account, for crawling and profiling. The minimum privileges required are:
    • Connection validate
    • Crawl FileFolders
    • Catalog files/folders
    • Profile files/folders

Note: All the above privileges must have Read access permission.
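
Before configuring the connector, it can help to confirm that the service account actually has Read access on the paths to be crawled. The sketch below assumes the standard HDFS owner/group/other permission model with an octal mode string such as "750"; it ignores HDFS ACLs, and the function name `has_read_access` and the account name in the example are hypothetical.

```python
def has_read_access(permission, owner, group, user, user_groups):
    """Check the read bit that applies to `user` for an octal mode like "750".

    Only one permission class is evaluated: owner if the user owns the path,
    otherwise group if the user belongs to the path's group, otherwise other.
    HDFS ACL entries are not considered in this sketch.
    """
    mode = int(permission, 8)
    if user == owner:
        return bool(mode & 0o400)
    if group in user_groups:
        return bool(mode & 0o040)
    return bool(mode & 0o004)


if __name__ == "__main__":
    # Hypothetical service account that belongs to the "hadoop" group.
    print(has_read_access("750", "hdfs", "hadoop", "svc_ovaledge", ["hadoop"]))
```

Note that listing a directory needs the read bit, while descending into it also needs the execute bit, so a real check would test both for folders.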

Drivers

The drivers used by the connector are given below:

Configuring Environment Variables

Configuring environment names enables you to select the appropriate environment from the drop-down list when adding a connector. This allows for consistent crawling of schemas across different environments, such as production (PROD), staging (STG), or temporary environments. It also facilitates schema comparisons and assists in application upgrades by providing a temporary environment that can later be deleted.

Before establishing a connection, it is important to configure the environment names for the specific connector. If your environments have been configured, skip this step.

Steps to Configure the Environment

  1. Log into the OvalEdge application.
  2. Navigate to Administration > System Settings.
  3. Select the Connector tab.
  4. Find the key name “connector.environment”.
  5. Enter the desired environment values (PROD, STG) in the Value column.
  6. Click ✔ to Save.
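
To illustrate how a value such as "PROD, STG" could become the entries of the environment drop-down list, here is a minimal sketch. It assumes the value is a comma-separated list; how OvalEdge actually parses the "connector.environment" setting is not documented here, and `parse_environments` is a hypothetical helper.

```python
def parse_environments(value):
    """Split a comma-separated setting value into unique, upper-cased names.

    Assumption: environments are separated by commas and case is not
    significant. Order of first appearance is preserved.
    """
    names = []
    for item in value.split(","):
        name = item.strip().upper()
        if name and name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    print(parse_environments("PROD, STG"))
```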

Service Account Permissions

A service account is required for crawling and profiling. By default, the service account provided for the connector will be used for any query operations. If the service account has a write privilege, insert, update, and delete queries can be executed. The minimum privileges required are listed below.

| Operation | Access Permission |
| --- | --- |
| Connection Validation | Select, Usage |
| Crawling | Select, Usage, Reference, and Execution |
| Profiling | Read, Select |

Establish a Connection

To connect to HDFS using the OvalEdge application, complete the following steps:

  1. Log in to the OvalEdge application.
  2. Navigate to Administration > Connectors.
  3. Click on the + (New Connector) icon.
  4. The Add Connector pop-up window is displayed, and you can search for the HDFS connector.
  5. The Add Connector with Connector Type specific details pop-up window is displayed. Enter the relevant information to configure the HDFS connection.

    Note: An asterisk (*) denotes a mandatory field for establishing a connection.

    HDFS (Non-Kerberos Authentication)

    Field Name

    Description

    Connector Type

    This field allows you to select the connector from the drop-down list provided. By default, 'HDFS' is displayed as the selected connector type.

    Credential Manager*

    Select the option from the drop-down menu where you want to save your credentials:

    OE Credential Manager: The HDFS connection is configured with the basic Username and Password of the service account in real time when OvalEdge establishes a connection to the HDFS data source. Users must manually add the credentials if the OE Credential Manager option is selected.

    HashiCorp: The credentials are stored in the HashiCorp database server and fetched from HashiCorp to OvalEdge.  

    AWS Secrets Manager: The credentials are stored in the AWS Secrets Manager database server and fetched from the AWS Secrets Manager to OvalEdge.

    For more information on Azure Key Vault, refer to Azure Key Vault.

    For more information on Credential Manager, refer to Credential Manager.

    License Add-Ons

    All the connectors will have a Base Connector License by default, which allows you to crawl and profile to obtain metadata and statistical information from a data source. 

    OvalEdge supports various License Add-Ons based on the connector’s functionality requirements.

    • Select the Data Quality Add-On license to identify, report, and resolve the data quality issues for a connector whose data supports data quality using DQ Rules/functions, Anomaly detection, Reports, and more.

    Connector Environment

    The Connector Environment drop-down list allows you to select the environment configured for the connector.

    For example, you can select PROD or STG (based on the items configured in the OvalEdge configuration for the connector environment).

    The purpose of the environment field is to help you identify which connector is connecting to which type of system environment (Production, STG, or QA).

    Note: The Configuring Environment Variables section explains setting up environment variables.

    Connector Name*

    The connector name identifies the HDFS connection in the OvalEdge application.

    WebHdfs URL*

    Ex: hdfs://3x.1x.3x.5x:8xxx

    Default Governance Roles

    Steward*

    Select the Steward from the drop-down list options.

    Custodian*

    Select the Custodian from the drop-down list options.

    Owner*

    Select the Owner from the drop-down list options.

    Governance Roles 4, 5, 6*

    Select the respective user from the drop-down options.

    Note: The drop-down list displays all the configurable roles (single user or a team) as per the configurations made in the OvalEdge Security > Governance Roles section.

    Admin Roles

    Integration Admins*

    To add Integration Admin Roles, search for or select one or more roles from the Integration Admin options, then click the Apply button.
    The Integration Admin's responsibilities include configuring crawling and profiling settings for the connector and deleting connectors, schemas, or data objects.

    Security and Governance Admins*

    To add Security and Governance Admin roles, search for or select one or more roles from the list and then click the Apply button.
    The Security and Governance Admin is responsible for:

    • Configuring role permissions for the connector and its associated data objects.
    • Adding admins to set permissions for the connector's roles and associated data objects.
    • Updating governance roles.
    • Creating custom fields.
    • Developing Service Request templates for the connector.
    • Creating approval workflows for Service Request templates.

    No. of Archive Objects*

    The number of archive objects indicates the number of recent metadata modifications made to a dataset at a remote/source location. By default, the archive objects feature is deactivated. However, users may enable it by clicking the Archive toggle button and specifying the number of objects they wish to archive. 

    Select Bridge

    With the OvalEdge Bridge component, any cloud-hosted server can connect with any on-premise or public cloud data source(s) without modifying firewall rules. A bridge provides real-time control, making data movement between source and destination easy. For more information, refer to Bridge Overview.

     

    HDFS (Kerberos Authentication)

    Field Name

    Description

    Keytab*

    This is a file input field where you select the keytab file used for Kerberos authentication.

    Principal*

    The Kerberos principal used to validate the keytab.
    Ex: PRINCIPAL:ovaledge/ecx-18-x20-1x4-2x9.us-east-2.compute.amazonaws.com@US-EAST-2.COMPUTE.INTERNAL

    Krb5-Configuration File*

    This configuration file is used to set up the Kerberos realm.

  6. After entering all the required connection details, select the appropriate option based on your preferences:
    1. Validate: Click the Validate button to verify the connection details. This ensures that the provided information is accurate and enables successful connection establishment.
    2. Save: Click on the Save button to store the connection details. Once saved, the connection will be added to the Connectors home page for easy access.
    3. Save & Configure: For certain Connectors requiring additional configuration settings, click the Save & Configure button. This will open the Connection Settings pop-up window, allowing you to configure the necessary settings before saving the connection.
  7. Once the connection is validated and saved, it will be displayed on the Connectors home page.

Note: You can either save the connection details first or validate the connection first and then save it.
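
The WebHdfs URL field in the table above takes a value shaped like the example hdfs://host:port. A quick format check before clicking Validate can catch missing hosts or ports. The sketch below is an illustration only: `parse_hdfs_url` is a hypothetical helper, the host name in the demo is a placeholder, and restricting the scheme to "hdfs" is an assumption based on the example in this article.

```python
from urllib.parse import urlsplit


def parse_hdfs_url(url, allowed_schemes=("hdfs",)):
    """Return (host, port) from a URL such as hdfs://host:port.

    Raises ValueError when the scheme is unexpected or when the host or
    port is missing or malformed.
    """
    parts = urlsplit(url.strip())
    if parts.scheme not in allowed_schemes:
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("missing host")
    port = parts.port  # urlsplit itself rejects a non-numeric port
    if port is None:
        raise ValueError("missing port")
    return parts.hostname, port


if __name__ == "__main__":
    # Placeholder host and port, not taken from a real cluster.
    print(parse_hdfs_url("hdfs://namenode.example.com:8020"))
```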

Connection Validation Details

| S.No | Error Message(s) | Description |
| --- | --- | --- |
| 1 | Connection Time Out | The browser could not establish a connection to the server in time. |
| 2 | The file path does not exist | The JSON file path was entered incorrectly. |
| 3 | Can't find the Kerberos realm | The configuration file was not provided correctly. |

Note: If you have issues creating a connection, please contact your assigned OvalEdge Customer Success Management (CSM) team.
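
For the "Can't find the Kerberos realm" error, one thing worth checking is whether the realm in the principal is declared in the Krb5 configuration file. The sketch below assumes the common krb5.conf layout, in which each realm appears in the [realms] section as `REALM = {` on a single line; it is a simplified reader, not a full krb5.conf parser, and all function names and sample values are hypothetical.

```python
import re


def realm_of(principal):
    """Return the realm of a principal written as primary[/instance]@REALM."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError("principal must look like primary[/instance]@REALM")
    return realm


def realms_in_krb5_conf(text):
    """Collect realm names declared in the [realms] section of krb5.conf text."""
    realms = set()
    section = None
    depth = 0  # current nesting level of { } blocks
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if depth == 0 and line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section == "realms" and depth == 0:
            match = re.match(r"^([^\s=]+)\s*=\s*\{", line)
            if match:
                realms.add(match.group(1))
        depth += line.count("{") - line.count("}")
    return realms


def realm_is_configured(principal, krb5_text):
    """True when the principal's realm is declared under [realms]."""
    return realm_of(principal) in realms_in_krb5_conf(krb5_text)
```

A usage example would pass the principal entered in the connector form together with the text of the uploaded configuration file; a False result suggests the wrong file was selected or the realm name is misspelled.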

Connector Settings

Once the connection is successfully established, various settings are provided to fetch and analyze the information from the data source.

The connection settings include Crawler, Profiler, Query Policies, Access Instruction, Business Glossary Settings, and Notification.

To view the Connector Settings page:

  1. Go to the Connectors page.
  2. From the 9-dots menu, select the Settings option.
  3. This will display the Connector Settings page, where you can view all the connector settings.
  4. When you have finished making your desired changes, click on Save Changes. All setting changes will be applied to the metadata.

    The following is a list of connection settings and their corresponding descriptions.

    | Connection Settings | Description |
    | --- | --- |
    | Crawler | Crawler settings are configured to connect to a data source and collect and catalog all the data elements in metadata. |
    | Access Instruction | Access Instruction allows the data owner to instruct others on using the objects in the application. |
    | Business Glossary Settings | The Business Glossary Settings provide flexibility and control over how users view and manage term association within a business glossary at the connector level. |

    Note: For more information, refer to the Connector Settings.

Crawling of Folders/Files

While crawling root Files/Folders, all the folders and files existing in that specific root path will be cataloged by default.
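
The default behavior described above, where everything under the root path is cataloged, amounts to a recursive walk of the folder tree. The sketch below shows the idea with an in-memory stand-in for the file system. It is not the OvalEdge crawler: `crawl` and `list_status` are hypothetical names, and the entry shape (dicts with "pathSuffix" and "type") is modeled on the FileStatus objects returned by the WebHDFS LISTSTATUS operation.

```python
def crawl(list_status, root="/"):
    """Yield (path, type) for every folder and file under `root`.

    `list_status(path)` is a caller-supplied function returning a list of
    dicts with "pathSuffix" and "type" ("FILE" or "DIRECTORY") keys.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in list_status(current):
            path = current.rstrip("/") + "/" + entry["pathSuffix"]
            yield path, entry["type"]
            if entry["type"] == "DIRECTORY":
                # Descend into sub-folders so nested files are cataloged too.
                stack.append(path)


if __name__ == "__main__":
    # In-memory stand-in for an HDFS tree (hypothetical paths).
    fake_tree = {
        "/": [
            {"pathSuffix": "sales", "type": "DIRECTORY"},
            {"pathSuffix": "readme.csv", "type": "FILE"},
        ],
        "/sales": [{"pathSuffix": "2024.parquet", "type": "FILE"}],
    }
    for path, kind in crawl(lambda p: fake_tree.get(p, []), "/"):
        print(kind, path)
```

Passing the listing function as a parameter keeps the walk independent of how the cluster is reached, so the same logic can be exercised without a live HDFS connection.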

Additional Information

| Parameters | Description |
| --- | --- |
| Security | This concerns the technical user's SSL certificate and IAM roles, which are set in the backend. |
| Proxy | This connector supports proxy configuration. |

 


Copyright © 2024, OvalEdge LLC, Peachtree Corners, GA, USA.