File Manager - File Explorer

This document provides a comprehensive guide to the OvalEdge File Explorer, part of File Manager, which allows Author Users to manage, organize, and analyze files and folders across various data lake systems, such as HDFS, S3, NFS, and more.

Key functionalities:

  • Manage Files and Folders: Upload, download, delete, and organize files and folders within data lakes (upload is available for NFS connections only).
  • Cataloging: Categorize files and folders for better visibility and data profiling.
  • Folder OvalSight: Gain insights into folder structure, file types, size distribution, and more.
  • Search and Filter: Find specific files and folders based on various criteria.

Getting Started

  • Connect to Data Lakes: Use connectors or upload tools to add data lakes like S3, HDFS, or NFS.
  • Catalog Data Lakes: Cataloging makes files and folders visible in the Data Catalog. First-level items are cataloged automatically during the crawl; deeper levels are cataloged manually.
  • Explore Files and Folders: The File Explorer displays detailed information about files and folders within a data lake connection.

Key Features

  • Supported File Formats: Upload and manage various file formats, such as CSV, JSON, Parquet, and more. Users can configure allowed upload formats.
  • Data Lake OvalSight: Provides a high-level overview of a data lake's structure, size, and file distribution.
  • Folder OvalSight: Analyze folder contents, including file types, sizes, and overall structure. Users can access Folder OvalSight from the File Manager or Data Catalog.
  • Tree View: Navigate data folders and subfolders visually and access detailed Folder Analysis information.
  • Data Lake Search: Search for files and folders across an entire data lake connection using keywords.
  • System Settings: Administrators and Author Users can configure the maximum size for uploaded files, set the file types that users can catalog, and define the number of file entries displayed per page.

Cataloging Data Lakes

Cataloging data lakes before using File Explorer is essential. This allows Author Users to view all files and folders within the lake. OvalEdge offers various methods for adding data:

  • Connectors: OvalEdge integrates with data lake systems like Hadoop, Amazon S3, Google Cloud Storage, and more. These connectors are available on the "Connectors" page.
  • Upload Tools: Author Users can also upload files and folders directly to an NFS connection using the "Upload File" or "Upload Folder" tools.

Crawling with Connectors

  1. Navigate: Go to Administration > Crawler.
  2. Add Connection: Select the file system (NFS/S3/HDFS/Azure/Drive) and enter the database name.
  3. Provide Credentials: Enter and validate the connection details in the "Manage Connection" window, then save.
  4. Crawl Data: Click "Crawl/Profile" to initiate the process. Upon successful completion, folders and files will appear in the File Explorer.

Note:

  • File Explorer shows all connected files and folders.
  • The Data Catalog displays only first-level folders and files.
  • Author Users must manually catalog additional levels from File Explorer.

Example:

  • In S3, "Hospital" (Level 1) gets automatically cataloged (visible in Data Catalog).
  • To view "Departments" (Level 2) or "General Medicine" (Level 3), users must catalog them manually from File Explorer.
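
The level rule in this example can be pictured with a short sketch. This is illustrative Python, not OvalEdge code: the function names `folder_level` and `is_auto_cataloged` are hypothetical, and the sketch assumes that a level is simply the number of "/"-separated segments in a path.

```python
def folder_level(path: str) -> int:
    """Return the 1-based depth of a path, e.g. 'Hospital/Departments' -> 2."""
    # Ignore leading/trailing slashes and empty segments.
    segments = [s for s in path.split("/") if s]
    return len(segments)


def is_auto_cataloged(path: str) -> bool:
    """Assumption: only Level 1 items are cataloged by the crawl itself."""
    return folder_level(path) == 1


paths = [
    "Hospital",
    "Hospital/Departments",
    "Hospital/Departments/General Medicine",
]
for p in paths:
    if is_auto_cataloged(p):
        status = "cataloged by the crawl"
    else:
        status = "catalog manually from File Explorer"
    print(f"Level {folder_level(p)}: {p} -> {status}")
```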

Uploading via NFS

Author Users can upload files and folders directly to the NFS data lake connection.

  1. Access Upload: In File Explorer, select the NFS data lake and click the 9-Dots icon to access the "Upload" option.
  2. Choose File/Folder: Select "File" or "Folder" on the upload page.
  3. Browse and Upload: Browse the computer directory to select the file or folder, then initiate the upload.
  4. Create Directory (Optional): Use the 9-Dots icon to create a new directory if needed.
  5. Verify and Finish: A successful upload will highlight the file in green. Click "Finish" to complete.

Supported File Formats for Upload

The File Explorer supports specific file types. Author Users can configure these types through the "config.file.types.to.be.cataloged" setting in System Settings (OTHERS tab) by entering the valid file formats for upload.


Supported File Formats for Profiling

Once uploaded and cataloged in the Data Catalog, users can profile these file formats:

  • CSV (.csv): Comma Separated Values store tabular data, with each line representing a record and commas separating fields.
  • JSON (.json): JavaScript Object Notation stores simple data structures for easy data interchange between applications and servers.
  • Parquet (.parquet): Apache Parquet is a columnar file format that is efficient for storing and processing large datasets.
  • ORC (.orc): Optimized Row Columnar files are used in the Hadoop ecosystem for structured data storage.
  • XLSX (.xlsx): Microsoft Excel Open XML Spreadsheet format.
  • XLS (.xls): Microsoft Office Excel spreadsheet format containing rows and columns of data.
  • Avro (.avro): Apache Avro is a data serialization framework for efficient data exchange with features like schema evolution.
  • Gzip (.gz): Compressed files using the gzip algorithm for reduced size and faster transmission.

Selecting a Data Lake

The "File Explorer" lists all available file connections with the following columns:

  • Connector Name
  • Connector Type (NFS, S3, etc.)
  • Created By (username)
  • Last Modified On
  • Data Lake OvalSight
  • Last OvalSight Scan

Author Users can use the search icons in the respective columns to search for connections by name or creator, and can filter connections by type.

Users can click a connection name to explore its files and folders in File Explorer.

Exploring Folders

Clicking a connection name in the File Explorer displays the contents of the selected connection, with the following details for each folder or file:

  • Type: File or Folder
  • Folder/File Name: File or Folder name
  • File Type: Extension (e.g., .csv, .xlsx) - empty for folders
  • Catalog: OvalEdge requires users to categorize (catalog) files and folders for better metadata management, which is essential before profiling data. A check mark symbol indicates a cataloged file or folder.
    • Automatic Cataloging: The first level of data (typically top-level folders) is cataloged automatically when the connection is crawled.
    • Manual Cataloging: Author Users can manually catalog second-level or deeper folders and individual files using two methods:
      • Plus Sign (+): In File Explorer, click the "+" icon next to each file or folder.
      • 9-Dots: Use the 9-Dots menu in File Explorer to catalog multiple files/folders simultaneously.
  • Folder OvalSight: Users can "Run Folder OvalSight" on a selected folder at the first level and restrict it to the second level. In the Folder OvalSight column, the icon turns green when the Folder Scan is completed. The Folder OvalSight provides an in-depth analysis of nested levels within a folder, detailing all subfolders and files up to the last level.
  • File Size: File size
  • Last Modified: Date of last modification in the source system
  • Last OvalSight Scan: Date and time of the last Run Folder OvalSight.
  • Preview Link: Copy the link to view the Data Catalog Summary Page in another browser tab.

File Explorer Details

The File Explorer displays detailed information for each file within a folder:

  • Type: File or Folder
  • Folder/File Name: File or Folder name
  • File Type: Extension (e.g., .csv, .xlsx)
  • Catalog: OvalEdge requires users to categorize (catalog) files for better metadata management, which is essential before profiling data. A check mark symbol indicates a cataloged file or folder.
    • Automatic Cataloging: The first level of data (typically top-level folders) is cataloged automatically when the connection is crawled.
    • Manual Cataloging: Author Users can manually catalog second-level or deeper files using two methods:
      • Plus Sign (+): Click the "+" icon next to each file.
      • 9-Dots: Use the 9-Dots menu in File Explorer or Data Catalog to catalog multiple files/folders simultaneously.
  • Folder OvalSight: Users can "Run Folder OvalSight" on a selected folder at the first level and restrict it to the second level. In the Folder OvalSight column, the icon turns green when the Folder Scan is completed. The Folder OvalSight provides an in-depth analysis of nested levels within a folder, detailing all subfolders and files up to the last level.
  • File Size: File size
  • Last Modified: Date of last modification in the source system
  • Last OvalSight Scan: Date and time of the last Run Folder OvalSight.
  • Preview Link: Copy the link to view the Data Catalog Summary Page of the file or folder in another browser tab.

User Actions

The 9-Dots icon in the top right corner provides a menu for various actions on files and folders:

  • View File: Preview data in raw or table format.
    • Raw View: Unstructured data view.
    • Table View: Structured data view.
  • Download File: Download a file from File Explorer to the user's computer.
  • Delete File: Permanently remove a file from the OvalEdge application (it cannot be recovered there). The file is not deleted from the source system.
  • Upload File: Upload a new file or folder (NFS connection only).
  • Catalog Files/Folders: Categorize files/folders for visibility in the Data Catalog (required for profiling). A check mark symbol indicates a cataloged file or folder.
  • Run Folder OvalSight: Users can "Run Folder OvalSight" on a selected folder at the first level and restrict it to the second level. 

To learn more about Data Lake and Folder OvalSight, refer to: File Manager - Data Lake OvalSight

Data Lake OvalSight

A comprehensive dashboard is available for each data lake connection. When the Folder Scan is completed, the icon in the Folder OvalSight column turns green, and users can open the dashboard by clicking the “OvalSight” icon.

Folder OvalSight Summary provides detailed statistics for insights into the selected connector and its folders, subfolders, and files. Users can find the following on the dedicated dashboard:

  • Folder Details Tiles: Displays the folder's level, total subfolders, empty subfolders, last folder level, and total file count.
  • Top 10 File Formats Donut Chart: This chart displays the top 10 file formats with the most files in the selected folder. It shows file names, total files, and percentages. Other file types can be viewed by clicking the "View All" button. A central indicator shows the total number of file types in the folder.
  • File Size Range Analysis Donut Chart: Groups the files in the selected folder into five size ranges (<100KB, 100KB-1MB, 1MB-10MB, 10MB-100MB, >100MB). Clicking on a chart segment shows all files within that size range. A central indicator shows the total size of the selected folder.
  • Folder Modification Summary Bar Graph: Displays the last modification dates of subfolders within the selected folder, organized by quarters (Q1, Q2, Q3, Q4). Hovering over a bar shows the folder count for each quarter, and clicking a bar provides a detailed view of folders modified within that quarter.
  • File Modification Summary Bar Graph: Displays the last modification dates of files within the selected folder, organized by quarters (Q1, Q2, Q3, Q4). Hovering over a bar shows the file count for each quarter, and clicking a bar provides a detailed view of files modified within that quarter.
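
The size-range and quarterly groupings described above can be approximated offline with a small sketch. This is illustrative Python rather than the dashboard's implementation. The helper names `size_range` and `quarter` and the sample file names are hypothetical, and two boundary choices are assumptions: 1 KB is taken as 1024 bytes, and each range includes its lower bound (so a file of exactly 100 MB is counted as ">100MB").

```python
from collections import Counter
from datetime import date

KB = 1024        # assumption: binary units
MB = 1024 * KB

# (label, exclusive upper bound in bytes); the last range has no upper bound.
SIZE_RANGES = [
    ("<100KB", 100 * KB),
    ("100KB-1MB", 1 * MB),
    ("1MB-10MB", 10 * MB),
    ("10MB-100MB", 100 * MB),
]


def size_range(size_bytes: int) -> str:
    """Map a file size to one of the five ranges shown on the donut chart."""
    for label, upper in SIZE_RANGES:
        if size_bytes < upper:
            return label
    return ">100MB"


def quarter(modified: date) -> str:
    """Map a last-modified date to Q1..Q4 by calendar month."""
    return f"Q{(modified.month - 1) // 3 + 1}"


# Hypothetical sample data: (name, size in bytes, last modified).
files = [
    ("a.csv", 50 * KB, date(2024, 2, 10)),
    ("b.parquet", 5 * MB, date(2024, 5, 1)),
    ("c.json", 300 * MB, date(2024, 11, 30)),
]

size_counts = Counter(size_range(size) for _, size, _ in files)
quarter_counts = Counter(quarter(modified) for _, _, modified in files)
print(dict(size_counts))
print(dict(quarter_counts))
```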

Folder OvalSight List View

Folder OvalSight List View provides insights into the structure and contents of folders within Amazon S3, Azure Data Lake Storage, CIFS, and NFS data lakes. It helps users understand folder organization and extract key details about files and subfolders, including folder name, folder level (1=top level), folder type, size, catalog, file count, file types, file sizes, and sample files.

Folder OvalSight Tree View

The Folder OvalSight Tree View lets users navigate their folders and subfolders, with a breadcrumb trail for moving quickly through the folder hierarchy. It provides key details about folders, including folder name, OvalSight summary, file OvalSight, folder level (1=top level), folder type, size, catalog, file count, file types, file sizes, and sample files.

Data Lake Search

Data Lake Search lets users search for files and folders across a connection. Enter a file or folder name in the search bar. The search is part of the Folder OvalSight job, so it returns results only for folders on which Author Users have run Folder OvalSight.

Search Results

The search returns a list with the following details:

  • Name: Name of the folder or file.
  • Level: Hierarchical position within the folder structure.
  • Type: Folder or file.
  • Catalog: Option to categorize the item (required for profiling).
  • Size (KB): Size of the folder or file.
  • Last Modified: Date of the latest modification in the source system (reflected upon Data Lake re-cataloging by admins).

System Settings of File Explorer

System settings for the File Explorer allow administrators and Author Users to control its behavior and display. The settings are grouped as follows:

Search and Display

  • Cataloged File Types (config.file.types.to.be.cataloged): Set the file types users can catalog (default: csv, conf, env, sh, properties, txt, yaml, xlsx, json, ddl, sql, hql, parquet).
  • Folder Analysis (config.folder.enable.folder.analysis): Enable/disable the Folder Analysis feature (default: true).
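
To see which files the default list would cover, the following sketch checks a file name against the value of config.file.types.to.be.cataloged. It is illustrative Python only. The helper names `parse_types` and `can_catalog` are hypothetical, and the sketch assumes the setting is a comma-separated list of extensions matched case-insensitively, which the product may or may not do.

```python
from pathlib import PurePosixPath

# Default value of config.file.types.to.be.cataloged, as listed above.
DEFAULT_CATALOG_TYPES = (
    "csv, conf, env, sh, properties, txt, yaml, xlsx, json, ddl, sql, hql, parquet"
)


def parse_types(setting: str) -> set[str]:
    """Split a comma-separated setting into a set of lowercase extensions."""
    return {t.strip().lower() for t in setting.split(",") if t.strip()}


def can_catalog(file_name: str, allowed: set[str]) -> bool:
    """Return True if the file's extension is in the allowed set.

    Files without an extension yield an empty string and return False.
    """
    ext = PurePosixPath(file_name).suffix.lstrip(".").lower()
    return ext in allowed


allowed = parse_types(DEFAULT_CATALOG_TYPES)
for name in ["patients.csv", "REPORT.JSON", "image.png", "README"]:
    print(name, "->", can_catalog(name, allowed))
```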

Upload Limits

  • Maximum File Size (ovaledge.filesize.limit): Set the maximum size (in bytes) for uploaded files (default: 2097152 bytes, i.e., 2 MB).
  • Maximum Files per Upload (ovaledge.fileupload.maxfiles): Set the maximum number of files allowed in a single upload (default: 10). This setting applies to files uploaded using the File API.
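
A client-side pre-check against these two limits might look like the sketch below. It is illustrative Python, not part of OvalEdge. The function `check_upload` is hypothetical, the constants simply mirror the documented defaults, and the sketch assumes a file whose size equals the limit is still accepted.

```python
MAX_FILE_SIZE_BYTES = 2_097_152  # default of ovaledge.filesize.limit (2 * 1024 * 1024)
MAX_FILES_PER_UPLOAD = 10        # default of ovaledge.fileupload.maxfiles


def check_upload(files: dict[str, int]) -> list[str]:
    """Return a list of problems for a batch mapping file name -> size in bytes.

    An empty list means the batch is within both limits.
    """
    problems = []
    if len(files) > MAX_FILES_PER_UPLOAD:
        problems.append(f"too many files: {len(files)} > {MAX_FILES_PER_UPLOAD}")
    for name, size in files.items():
        # Assumption: a size exactly equal to the limit is allowed.
        if size > MAX_FILE_SIZE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_FILE_SIZE_BYTES}")
    return problems


# Hypothetical batch: one small file and one file above the size limit.
print(check_upload({"small.csv": 1_000, "big.parquet": 3_000_000}))
```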

List View

  • Rows per Page (filemanager.pagination.row.limit): Define the number of file entries displayed per page (default: 100).
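
As a rough guide to how the row limit affects navigation, the sketch below computes the number of pages and the row offsets for a given page. It is illustrative Python. The helpers `page_count` and `page_slice` are hypothetical and assume simple fixed-size pages.

```python
import math

ROWS_PER_PAGE = 100  # default of filemanager.pagination.row.limit


def page_count(total_entries: int, rows_per_page: int = ROWS_PER_PAGE) -> int:
    """Number of pages needed to list all entries."""
    return math.ceil(total_entries / rows_per_page)


def page_slice(page: int, rows_per_page: int = ROWS_PER_PAGE) -> tuple[int, int]:
    """Return the 0-based [start, end) row offsets for a 1-based page number."""
    start = (page - 1) * rows_per_page
    return start, start + rows_per_page


print(page_count(250))  # a folder with 250 entries
print(page_slice(3))    # offsets covered by the third page
```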

Copyright © 2024, OvalEdge LLC, Peachtree Corners GA USA