HDFS (Hadoop Distributed File System) is the storage layer of Apache Hadoop, where datasets and, in many cases, their metadata are stored as files.
OvalEdge uses the Apache Hadoop common library to establish a connection with the HDFS data source for metadata crawling and profiling.
Connector Capabilities
The connector capabilities are shown below:
Crawling
Features | Supported Objects | Remarks |
---|---|---|
Crawling | Buckets | When the root path is crawled, its File Folders / Files are cataloged by default |
Please see the Crawling Data article for more details on crawling.
Profiling
Please see Profiling Data for more details on profiling.
Feature | Support | Remarks |
---|---|---|
File Profiling | Row count, column count, view sample data | Supported File Types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC |
Sample Profiling | Supported | |
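To make the file profiling statistics concrete, the following is a minimal sketch of how a row count, column count, and sample could be computed for CSV content. It is an illustration only, not OvalEdge's implementation. The helper names `is_profilable` and `profile_csv` are hypothetical, and the sketch assumes that the first CSV row is a header that is not counted as data.

```python
import csv
import io

# File types listed as supported for profiling in the table above.
SUPPORTED_EXTENSIONS = {"csv", "xls", "xlsx", "json", "avro", "parquet", "orc"}


def is_profilable(filename: str) -> bool:
    """Return True if the file extension is one of the listed supported types."""
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in SUPPORTED_EXTENSIONS


def profile_csv(text: str, sample_size: int = 5) -> dict:
    """Compute row count, column count, and a small sample for CSV text.

    Assumption: the first row is a header and is not counted as a data row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "row_count": len(data),
        "column_count": len(header),
        "sample": data[:sample_size],
    }


if __name__ == "__main__":
    print(is_profilable("orders.csv"))
    print(profile_csv("id,name\n1,a\n2,b\n3,c\n", sample_size=2))
```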
Pre-requisites
To use the connector, the following must be available:
- The connection details specified in the following section.
- A service account for crawling and profiling. The minimum privileges required are:
  - Connection validate
  - Crawl FileFolders
  - Catalog files/folders
  - Profile files/folders
Note: All of the above privileges require Read access permission.
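As a rough illustration of what Read access means at the HDFS level, the sketch below checks the read bit that applies to a user, given file status fields in the shape returned by the WebHDFS REST API (`owner`, `group`, and an octal `permission` string). The function name `has_read_access` and the account name `svc_ovaledge` are hypothetical. The sketch ignores ACLs and superuser privileges, so treat it as a first check rather than a definitive answer.

```python
def has_read_access(status: dict, user: str, groups: set) -> bool:
    """Check the read bit that applies to `user` for one file status entry.

    Follows the basic HDFS permission model: owner bits apply to the owner,
    group bits apply to members of the file's group, and the "other" bits
    apply to everyone else. ACLs and superuser rights are not considered.
    """
    mode = int(status["permission"], 8)  # e.g. "640" -> 0o640
    if user == status["owner"]:
        return bool(mode & 0o400)
    if status["group"] in groups:
        return bool(mode & 0o040)
    return bool(mode & 0o004)


if __name__ == "__main__":
    # Hypothetical status entry and service account name.
    entry = {"owner": "hdfs", "group": "analytics", "permission": "640"}
    print(has_read_access(entry, "svc_ovaledge", {"analytics"}))
```

For directories, HDFS also requires the execute bit to access children of the directory, so a service account typically needs both read and execute on the folders it crawls.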
Drivers
The drivers used by the connector are given below:
- Driver / API: Apache Hadoop Common
- Version: 2.7.3 (latest version is 3.3.1)
- Details: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.7.7
Configuring Environment Variables
Configuring environment names enables you to select the appropriate environment from the drop-down list when adding a connector. This allows for consistent crawling of schemas across different environments, such as production (PROD), staging (STG), or temporary environments. It also facilitates schema comparisons and assists in application upgrades by providing a temporary environment that can later be deleted.
Before establishing a connection, it is important to configure the environment names for the specific connector. If your environments have been configured, skip this step.
Steps to Configure the Environment
- Log into the OvalEdge application.
- Navigate to Administration > System Settings.
- Select the Connector tab.
- Find the key name “connector.environment”.
- Enter the desired environment values (PROD, STG) in the Value column.
- Click ✔ to Save.
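The steps above show the environment values entered as a list such as PROD, STG. As a small sketch of how such a value could be interpreted, the following splits the text into individual environment names. It assumes the values are comma-separated and normalizes them to upper case. Both are assumptions made for illustration, and `parse_environments` is a hypothetical helper, not an OvalEdge function.

```python
def parse_environments(value: str) -> list:
    """Split a comma-separated setting into unique, upper-cased names, keeping order."""
    names = []
    for item in value.split(","):
        name = item.strip().upper()
        if name and name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    print(parse_environments("PROD, STG"))
```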
Service Account Permissions
A service account is required for crawling and profiling. By default, the service account provided for the connector will be used for any query operations. If the service account has a write privilege, insert, update, and delete queries can be executed. The minimum privileges required are listed below.
Operation | Access Permission |
---|---|
Connection Validation | Select, Usage |
Crawling | Select, Usage, Reference, and Execution |
Profiling | Read, Select |
Establish a Connection
To connect to HDFS using the OvalEdge application, complete the following steps:
- Log in to the OvalEdge application.
- Navigate to Administration > Connectors.
- Click the + (New Connector) icon.
- In the Add Connector pop-up window that is displayed, search for and select the HDFS connector.
- In the Add Connector pop-up window with the Connector Type specific details, enter the relevant information to configure the HDFS connection.
Note: An asterisk (*) denotes a mandatory field for establishing a connection.
HDFS (Non-Kerberos Authentication)
Field Name
Description
Connector Type
This field allows you to select the connector from the drop-down list provided. By default, 'HDFS' is displayed as the selected connector type.
Credential Manager*
Select the option from the drop-down menu where you want to save your credentials:
- OE Credential Manager: The HDFS connection is configured with the basic Username and Password of the service account in real time when OvalEdge establishes a connection to the HDFS database. Users must manually add the credentials if the OE Credential Manager option is selected.
- HashiCorp: The credentials are stored in the HashiCorp database server and fetched from HashiCorp to OvalEdge.
- AWS Secrets Manager: The credentials are stored in the AWS Secrets Manager database server and fetched from the AWS Secrets Manager to OvalEdge.
For more information on Azure Key Vault, refer to Azure Key Vault.
For more information on Credential Manager, refer to Credential Manager.
License Add-Ons
All connectors have a Base Connector License by default, which allows you to crawl and profile to obtain metadata and statistical information from a data source.
OvalEdge supports various License Add-Ons based on the connector’s functionality requirements.
- Select the Data Quality Add-On license to identify, report, and resolve data quality issues for a connector whose data supports data quality checks, using DQ Rules/functions, Anomaly detection, Reports, and more.
Connector Environment
The Connector Environment drop-down list allows you to select the environment configured for the connector.
For example, you can select PROD or STG (based on the items configured in the OvalEdge configuration for the connector environment).
The purpose of the environment field is to help you identify which type of system environment (Production, STG, or QA) the connector is connecting to.
Note: The Configuring Environment Variables section explains how to set up the environment names.
Connector Name*
The connector name identifies this HDFS connection in the OvalEdge application.
WebHdfs URL*
Ex: hdfs://3x.1x.3x.5x:8xxx
Default Governance Roles
Steward*
Select the Steward from the drop-down list options.
Custodian*
Select the Custodian from the drop-down list options.
Owner*
Select the Owner from the drop-down list options.
Governance Roles 4, 5, 6*
Select the respective user from the drop-down options.
Note: The drop-down list displays all the configurable roles (single user or a team) as per the configurations made in the OvalEdge Security > Governance Roles section.
Admin Roles
Integration Admins*
To add Integration Admin Roles, search for or select one or more roles from the Integration Admin options, then click the Apply button.
The Integration Admin's responsibilities include configuring crawling and profiling settings for the connector and deleting connectors, schemas, or data objects.
Security and Governance Admins*
To add Security and Governance Admin roles, search for or select one or more roles from the list and then click the Apply button.
The Security and Governance Admin is responsible for:
- Configuring role permissions for the connector and its associated data objects.
- Adding admins to set permissions for the connector's roles and associated data objects.
- Updating governance roles.
- Creating custom fields.
- Developing Service Request templates for the connector.
- Creating approval workflows for Service Request templates.
No. of Archive Objects*
The number of archive objects indicates the number of recent metadata modifications made to a dataset at a remote/source location. By default, the archive objects feature is deactivated. However, users may enable it by clicking the Archive toggle button and specifying the number of objects they wish to archive.
Select Bridge
With the OvalEdge Bridge component, any cloud-hosted server can connect with any on-premise or public cloud data source(s) without modifying firewall rules. A bridge provides real-time control, making data movement between source and destination easy. For more information, refer to
HDFS (Kerberos Authentication)
Field Name
Description
Keytab*
Select the keytab file used for Kerberos authentication.
Principal*
The Kerberos principal used to validate the keytab.
Ex: PRINCIPAL:ovaledge/ecx-18-x20-1x4-2x9.us-east-2.compute.amazonaws.com@US-EAST-2.COMPUTE.INTERNAL
Krb5-Configuration File*
This configuration file is used to set up the Kerberos realm.
- After entering all the required connection details, select the appropriate option based on your preferences:
  - Validate: Click the Validate button to verify the connection details. This ensures that the provided information is accurate and enables successful connection establishment.
  - Save: Click the Save button to store the connection details. Once saved, the connection will be added to the Connectors home page for easy access.
  - Save & Configure: For certain Connectors requiring additional configuration settings, click the Save & Configure button. This will open the Connection Settings pop-up window, allowing you to configure the necessary settings before saving the connection.
- Once the connection is validated and saved, it will be displayed on the Connectors home page.
Note: You can either save the connection details first or validate the connection first and then save it.
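Before clicking Validate, it can help to check that the URL value is well formed. The sketch below follows the `hdfs://host:port` shape of the example given for the WebHdfs URL field and extracts the host and port with Python's standard library. The function name `parse_hdfs_url` and the address in the usage lines are hypothetical, and the check only covers syntax. It does not contact the cluster.

```python
from urllib.parse import urlsplit


def parse_hdfs_url(url: str) -> tuple:
    """Return (host, port) from a URL such as hdfs://host:port, or raise ValueError."""
    parts = urlsplit(url.strip())
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    port = parts.port  # raises ValueError if the port is not a valid number
    if port is None:
        raise ValueError("URL has no port")
    return parts.hostname, port


if __name__ == "__main__":
    # Hypothetical address, used only to show the expected shape.
    print(parse_hdfs_url("hdfs://10.0.0.5:8020"))
```

A masked or mistyped value, such as a port that contains letters, fails with a `ValueError` instead of reaching the connector.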
Connection Validation Details
S.No | Error Message(s) | Description |
---|---|---|
1 | Connection Time Out | The browser could not establish a connection to the server in time. |
2 | The file path does not exist | The JSON file path was entered incorrectly. |
3 | Can't find the Kerberos realm | The configuration file was not provided correctly. |
Note: If you have issues creating a connection, please contact your assigned OvalEdge Customer Success Management (CSM) team.
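For the "Can't find the Kerberos realm" error, one possible cause is that the realm in the principal is not declared in the Krb5 configuration file. The following is a minimal sketch that compares the two. It assumes a conventional krb5.conf layout with a `[realms]` section containing `REALM = { ... }` blocks. The helper names and the sample realm `EXAMPLE.COM` are hypothetical, and the parser is deliberately simple, so it does not handle `include` directives.

```python
import re

# Hypothetical sample file content in the conventional krb5.conf layout.
SAMPLE_KRB5_CONF = """
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
"""


def realms_in_krb5_conf(text: str) -> set:
    """Collect realm names declared as `NAME = {` at the top level of [realms]."""
    realms, section, depth = set(), None, 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section, depth = header.group(1), 0
            continue
        if section == "realms":
            if depth == 0:
                entry = re.match(r"(\S+)\s*=\s*\{", line)
                if entry:
                    realms.add(entry.group(1))
            # Track nesting so settings inside a realm block are not mistaken for realms.
            depth += line.count("{") - line.count("}")
    return realms


def principal_realm(principal: str) -> str:
    """Return the realm part of a principal written as primary/instance@REALM."""
    _, sep, realm = principal.rpartition("@")
    if not sep or not realm:
        raise ValueError("principal has no @REALM suffix")
    return realm


if __name__ == "__main__":
    realm = principal_realm("ovaledge/host.example.com@EXAMPLE.COM")
    print(realm in realms_in_krb5_conf(SAMPLE_KRB5_CONF))
```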
Connector Settings
Once the connection is successfully established, various settings are provided to fetch and analyze the information from the data source.
The connection settings include Crawler, Profiler, Query Policies, Access Instruction, Business Glossary Settings, and Notification.
To view the Connector Settings page:
- Go to the Connectors page.
- From the nine-dots menu, select the Settings option.
- This will display the Connector Settings page, where you can view all the connector settings.
- When you have finished making your desired changes, click on Save Changes. All setting changes will be applied to the metadata.
The following is a list of connection settings and their corresponding descriptions.
Connection Settings | Description |
---|---|
Crawler | Crawler settings are configured to connect to a data source and collect and catalog all the data elements in metadata. |
Access Instruction | Access Instruction allows the data owner to instruct others on using the objects in the application. |
Business Glossary Settings | The Business Glossary Settings provide flexibility and control over how users view and manage term association within a business glossary at the connector level. |
Crawling of Folders/Files
While crawling root Files/Folders, all the folders and files existing in that specific root path will be cataloged by default.
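To illustrate what cataloging the contents of a root path involves, the sketch below separates a directory listing into folders and files. It uses the JSON shape of a WebHDFS `LISTSTATUS` response (`FileStatuses.FileStatus` entries with `pathSuffix` and `type`) as sample input. How the OvalEdge crawler lists paths internally is not described here, so this is an assumption made for illustration, and `split_listing` is a hypothetical helper.

```python
import json


def split_listing(liststatus_json: str, root: str) -> dict:
    """Separate a LISTSTATUS response into folder and file paths under `root`."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    base = root.rstrip("/")
    result = {"folders": [], "files": []}
    for entry in statuses:
        path = f"{base}/{entry['pathSuffix']}"
        if entry["type"] == "DIRECTORY":
            result["folders"].append(path)
        elif entry["type"] == "FILE":
            result["files"].append(path)
        # Other types, such as SYMLINK, are skipped in this sketch.
    return result


if __name__ == "__main__":
    # Hypothetical listing of a root path with one folder and one file.
    sample = json.dumps({"FileStatuses": {"FileStatus": [
        {"pathSuffix": "sales", "type": "DIRECTORY"},
        {"pathSuffix": "orders.csv", "type": "FILE"},
    ]}})
    print(split_listing(sample, "/"))
```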
Additional Information
Parameters | Description |
---|---|
Security | This concerns the technical user's SSL certificate and IAM roles set in the backend. |
Proxy | This connector supports proxy configuration. |
Copyright © 2024, OvalEdge LLC, Peachtree Corners, GA, USA.