Connectivity Summary
An out-of-the-box connector is available for HDFS. It supports crawling of file folders and files and profiling of sample data.
The Hadoop Common driver allows you to connect to an HDFS server and supports port forwarding and file transfer.
Connectivity to HDFS is established through the Hadoop Common library, which is included in the platform.
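As a rough sketch, connecting through the Hadoop Common API and listing a root folder could look like the following. The endpoint is the example address used later in this document; running this requires the hadoop-common dependency on the classpath and a reachable NameNode, so treat it as illustrative only:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint; replace with your own NameNode host and port.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://3.140.32.52:8020");

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://3.140.32.52:8020"), conf)) {
            // List the root folder to confirm connectivity.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println((status.isDirectory() ? "dir  " : "file ") + status.getPath());
            }
        }
    }
}
```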
The drivers used by the connector are given below:
- Driver / API: Apache Hadoop Common
- Version: 2.7.3 (latest version is 3.3.1)
- Details: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.7.7
Connector Capabilities
The connector capabilities are shown below:
Crawling
| Feature | Supported Objects | Remarks |
|---|---|---|
| Crawling | Buckets | While crawling the root, File Folders / Files are cataloged by default |
Please see the article Crawling Data for more details on crawling.
Profiling
Please see Profiling Data for more details on profiling.
| Feature | Support | Remarks |
|---|---|---|
| File Profiling | Row count, Columns count, View sample data | Supported File Types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC |
| Sample Profiling | Supported | |
By default, the service account provided for the connector will be used for any user operations. If the service account has write privileges, then Insert / Update / Delete operations can be executed.
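To illustrate the metrics above, here is a minimal sketch of how a row count and column count could be derived from a small CSV sample. The `profile` helper is hypothetical (not part of the connector) and assumes a simple comma delimiter with a header line and no quoted fields:

```java
import java.util.List;

public class CsvSampleProfile {
    // Hypothetical helper: returns {rowCount, columnCount} for a CSV sample.
    // Assumes the first line is a header and fields contain no embedded commas.
    static int[] profile(List<String> lines) {
        int rows = Math.max(0, lines.size() - 1);                    // data rows, header excluded
        int cols = lines.isEmpty() ? 0 : lines.get(0).split(",", -1).length;
        return new int[] {rows, cols};
    }

    public static void main(String[] args) {
        List<String> sample = List.of("id,name,amount", "1,alice,10", "2,bob,20");
        int[] p = profile(sample);
        System.out.println("rows=" + p[0] + ", columns=" + p[1]); // prints rows=2, columns=3
    }
}
```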
Pre-requisites
To use the connector, the following need to be available:
- Connection details as specified in the following section should be available.
- Service account for crawling and profiling. The minimum privileges required are:
  - Validate the connection
  - Crawl file folders
  - Catalog files/folders
  - Profile files/folders
Note: All the above operations require Read access permission.
Connection Details
The following connection settings should be added for connecting to an HDFS server:
- Database Type: HDFS
- Connection Name: Select a connection name for the HDFS server. The name you specify is a reference name to easily identify your HDFS server connection in OvalEdge.
  Example: hdfs Server Connection
- License Type: Standard
- Authentication: Kerberos / Non-Kerberos
- WebHdfsUrl: IP of the server with the port on which HDFS is running.
  Example: hdfs://3.140.32.52:8020
- KeyTab: File path of the keytab file, including the file name.
  Example: D/Keytabs/ovaledge.keytab
- Principal: Principal that should match the keytab for authentication purposes.
  Example: ovaledge/ec2-18-220-154-229.us-east-2.compute.amazonaws.com@US-EAST-2.COMPUTE.INTERNAL
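For the Kerberos option, hadoop-common's `UserGroupInformation` API is the usual way to log in with a keytab and principal. The sketch below reuses the example values from this document; it needs a reachable KDC and the hadoop-common dependency, so it is a configuration illustration rather than a runnable test:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        // Enable Kerberos authentication for the Hadoop client.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are the example values from this document.
        UserGroupInformation.loginUserFromKeytab(
                "ovaledge/ec2-18-220-154-229.us-east-2.compute.amazonaws.com@US-EAST-2.COMPUTE.INTERNAL",
                "D/Keytabs/ovaledge.keytab");

        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```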
Once connectivity is established, additional configurations for Crawling and Profiling can be specified.
| Property | Details |
|---|---|
| Crawler configurations | |
| Crawler Options | File Folders / Buckets are enabled by default |
| Crawler Rules | Include and exclude regex apply to File Folders and Buckets only, not to files |
| Profiler Settings | |
| Profile Options | No profile options exist for this connector |
| Profile Rules | No profile rules exist for this connector |
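To show how include/exclude regex rules typically behave, here is a small standalone sketch that filters folder names with `java.util.regex`. The `filter` helper and the rule ordering (include first, then exclude) are assumptions for illustration, not the connector's actual implementation:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CrawlerRuleFilter {
    // Hypothetical helper: keeps folders matching the include regex,
    // then drops those matching the exclude regex.
    static List<String> filter(List<String> folders, String include, String exclude) {
        Pattern inc = Pattern.compile(include);
        Pattern exc = Pattern.compile(exclude);
        return folders.stream()
                .filter(f -> inc.matcher(f).matches())
                .filter(f -> !exc.matcher(f).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> folders = List.of("sales_2021", "sales_2022", "tmp_scratch");
        // Include anything starting with "sales", exclude 2021 data.
        System.out.println(filter(folders, "sales.*", ".*2021")); // prints [sales_2022]
    }
}
```

Note that, per the table above, such rules apply only to File Folders and Buckets, not to individual files.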
Points to note
- Supported File Types: CSV, XLS, XLSX, JSON, AVRO, PARQUET, ORC
- The File Manager shows details only for the files and file folders that the user has access to.