HDFS

Connectivity Summary

An out-of-the-box connector is available for HDFS. It supports crawling database objects and profiling sample data.

The Hadoop Common driver allows you to connect to an HDFS server and use port forwarding and file transfer.

Connectivity to HDFS is via the Hadoop Common library, which is included in the platform.

The drivers used by the connector are given below:

Driver / API: Apache Hadoop Common

Version: 2.7.3 (latest version is 3.3.1)

Details: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.7.7

Connector Capabilities

The connector capabilities are shown below:

Crawling

Feature: Crawling

Supported Objects: Buckets

Remarks: While crawling, root File Folders / Files will be cataloged by default.

Please see the article Crawling Data for more details on crawling.

Profiling

Please see Profiling Data for more details on profiling.

Feature: File Profiling
Support: Row count, column count, view sample data
Remarks: Supported file types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC

Feature: Sample Profiling
Support: Supported
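As an illustration of the file profiling statistics listed above, the following is a minimal sketch of how a row count, a column count, and sample data could be derived for a CSV file. It is not the connector's implementation: the function name `profile_csv` and the `sample_size` parameter are hypothetical, the first row is assumed to be a header, and the input is CSV text that has already been read rather than a live HDFS file.

```python
import csv
import io


def profile_csv(text, sample_size=5):
    """Return row count, column count, and sample rows for CSV text.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])  # empty list when the input has no rows
    rows = list(reader)
    return {
        "row_count": len(rows),        # data rows, header excluded
        "column_count": len(header),   # number of header fields
        "sample": rows[:sample_size],  # first few rows as sample data
    }


if __name__ == "__main__":
    data = "id,name\n1,alpha\n2,beta\n3,gamma\n"
    print(profile_csv(data, sample_size=2))
```

The other supported formats (for example AVRO, PARQUET, and ORC) would need format-specific readers, which this sketch does not cover.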

 

By default, the service account provided for the connector will be used for any user operations. If the service account has write privileges, then Insert / Update / Delete operations can be executed.

Pre-requisites

To use the connector, the following need to be available:

  • Connection details, as specified in the following section.
  • Service account, for crawling and profiling. The minimum privileges required are:
    • Connection validate
    • Crawl FileFolders
    • Catalog files/folders
    • Profile files/folders

Note: All of the above privileges require Read access permission.

Connection Details

The following connection settings should be added to connect to an HDFS server:

                             

  • Database Type: HDFS
  • Connection Name: Enter a connection name for the HDFS server. The name you specify is a reference name that helps you identify your HDFS server connection in OvalEdge.
    Example: hdfs Server Connection
  • License Type: Standard
  • Authentication: Kerberos / Non-Kerberos
  • WebHdfsUrl: IP address of the server and the port on which HDFS is running.
    Example: hdfs://3.140.32.52:8020
  • KeyTab: File path of the keytab file, including the file name.
    Example: D/Keytabs/ovaledge.keytab
  • Principal: Kerberos principal that matches the keytab; it is used for authentication.
    Example: ovaledge/ec2-18-220-154-229.us-east-2.compute.amazonaws.com@US-EAST-2.COMPUTE.INTERNAL
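
Before entering the settings above, it can help to sanity-check their shape. The sketch below is one possible pre-check, not part of OvalEdge: the function `validate_hdfs_settings` and the use of a plain dictionary keyed by the field names are assumptions made for illustration. It also assumes the URL follows the `hdfs://host:port` form shown in the example and that a Kerberos principal ends in `@REALM`.

```python
from urllib.parse import urlsplit


def validate_hdfs_settings(settings):
    """Return a list of problems found in a dict of connection settings.

    Hypothetical helper; keys mirror the field names in the list above.
    """
    problems = []

    url = urlsplit(settings.get("WebHdfsUrl", ""))
    if url.scheme != "hdfs":
        problems.append("WebHdfsUrl should start with hdfs://")
    try:
        port = url.port  # raises ValueError for a malformed port
    except ValueError:
        port = None
    if not url.hostname or port is None:
        problems.append("WebHdfsUrl needs both a host and a port")

    # KeyTab and Principal are only checked for Kerberos authentication.
    if settings.get("Authentication") == "Kerberos":
        if not settings.get("KeyTab"):
            problems.append("KeyTab path is required for Kerberos")
        name, sep, realm = settings.get("Principal", "").rpartition("@")
        if not (sep and name and realm):
            problems.append("Principal should look like name@REALM")

    return problems


if __name__ == "__main__":
    print(validate_hdfs_settings({
        "WebHdfsUrl": "hdfs://3.140.32.52:8020",
        "Authentication": "Non-Kerberos",
    }))
```

An empty list means no problems were detected; it does not confirm that the server is reachable or that the keytab and principal actually match.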

Once connectivity is established, additional configurations for Crawling and Profiling can be specified.

Crawler configurations

  • Crawler Options: FileFolders/Buckets are enabled by default.
  • Crawler Rules: Include and exclude regex rules apply to FileFolders and Buckets only, not to files.

Profiler Settings

  • Profile Options: No profile options exist.
  • Profile Rules: No profile rules exist.
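
To illustrate the crawler rules described above, here is a minimal sketch of include and exclude regex filtering that is applied to folders only while files pass through unchanged. The exact matching semantics used by the connector are not documented here, so this is an assumption: a folder is kept when it matches at least one include pattern (or no include patterns are given) and matches no exclude pattern. The function `select_for_crawl` and the `(path, is_folder)` input shape are hypothetical.

```python
import re


def select_for_crawl(entries, include=(), exclude=()):
    """Filter (path, is_folder) pairs with include/exclude regex rules.

    Hypothetical helper: rules are evaluated for folders only, and files
    pass through untouched.
    """
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    selected = []
    for path, is_folder in entries:
        if is_folder:
            # Drop folders that match no include rule (when rules are given).
            if include_res and not any(r.search(path) for r in include_res):
                continue
            # Drop folders that match any exclude rule.
            if any(r.search(path) for r in exclude_res):
                continue
        selected.append(path)
    return selected


if __name__ == "__main__":
    entries = [("/data/sales", True), ("/data/tmp", True), ("/data/tmp.csv", False)]
    print(select_for_crawl(entries, include=[r"^/data/"], exclude=[r"/tmp$"]))
```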

Points to note

  1. Supported File Types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC
  2. FileManager shows details only for the Files/FileFolders that the user has access to.
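
As a quick way to tell in advance whether a file falls under the supported file types listed above, the following sketch checks the file extension against that list. It assumes the file type can be inferred from the extension, which the connector may or may not do; the names `SUPPORTED_TYPES` and `is_supported_file` are hypothetical.

```python
from pathlib import PurePosixPath

# Supported file types as listed above, in lowercase.
SUPPORTED_TYPES = {"csv", "xls", "xlsx", "json", "avro", "parquet", "orc"}


def is_supported_file(path):
    """Return True when the file extension is one of the supported types."""
    suffix = PurePosixPath(path).suffix  # e.g. ".csv", or "" when absent
    return suffix.lstrip(".").lower() in SUPPORTED_TYPES


if __name__ == "__main__":
    for p in ["/data/sales/2021.CSV", "/data/readme.txt"]:
        print(p, is_supported_file(p))
```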