ETLs

Airflow

Connectivity Summary

An out-of-the-box connector is available for Airflow to support crawling of datasets (i.e., Airflow DAGs and tasks) and lineage building.

The connectivity to Airflow is via JDBC, which is included in the platform.

Connector Capabilities

The connector capabilities are listed below:

Crawling

Supported Objects: Jobs

Remarks: Fetches all Airflow DAGs and tasks from AirflowDB.

Please see the Crawling Data article for more details on crawling.
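
The crawler reads DAG and task metadata from the Airflow metadata database (AirflowDB). The sketch below is only a minimal illustration of that idea, assuming a reachable metadata database and SQLAlchemy installed; the connection string is a placeholder, and the table and column names (dag, task_instance) depend on the Airflow version. It is not the connector's actual implementation.

    # Minimal sketch: list DAGs and their tasks straight from the Airflow
    # metadata database. Replace the connection string with your own;
    # table/column names (dag, task_instance) vary by Airflow version.
    from sqlalchemy import create_engine, text

    # Hypothetical connection string -- adjust host, credentials, and database.
    engine = create_engine("postgresql://airflow:airflow@airflow-host:5432/airflow")

    with engine.connect() as conn:
        dags = conn.execute(
            text("SELECT dag_id, fileloc FROM dag WHERE is_paused = false")
        ).fetchall()
        for dag_id, fileloc in dags:
            print(f"DAG (dataset): {dag_id}  file: {fileloc}")
            tasks = conn.execute(
                text("SELECT DISTINCT task_id FROM task_instance WHERE dag_id = :d"),
                {"d": dag_id},
            ).fetchall()
            for (task_id,) in tasks:
                print(f"  task (child dataset): {task_id}")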

Lineage Building

Operation                               Details
Table - Table Lineage                   Supported
Table - File Lineage                    Supported
File - Table Lineage                    Supported
Column Lineage - File Column Lineage    Supported

Querying

Operation                  Details
Select                     Supported
Insert                     Not supported by default
Update                     Not supported by default
Delete                     Not supported by default
Joins within database      Supported
Joins outside database     Supported
Aggregations               Supported
Group By                   Supported

Pre-requisites

To use the connector, the following must be available:

  1. Connection details as specified in the following section.

  2. A service account for crawling. The minimum privileges required are:

     Operation              Access Permission
     Connection validate    Should have permission for the specified path

Connection Details

The following connection settings should be added to connect to Airflow:

Airflow Connector

  • Connection (Database) Type: AirflowDB
  • License Type: Standard or Auto Lineage
  • Connection Name: Select a connection name for Airflow. The name you specify is a reference name used to easily identify the Airflow connection in OvalEdge.
    Example: Airflow1
  • Server: IP address of the Airflow server
  • Remote Dag Path: Enter the path on the Airflow server where all the DAGs (Python files) are located.
  • Local Dag Path: Enter the path on the local/OvalEdge server where the same DAGs (Python files) are present. Both paths must contain the same number of DAG files (see the sketch after this list).
  • Username: Provide a valid username
  • Password: Provide a valid password
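
A quick way to confirm that the Remote Dag Path and Local Dag Path hold the same DAG files is to count the Python files in each. The sketch below is a minimal, assumed example for the local (OvalEdge) side only; the path is illustrative, and the remote count must be checked on the Airflow server itself.

    # Minimal sketch: count the DAG (.py) files under the local DAG path so the
    # number can be compared with the count on the Airflow server. The path is
    # an example and should be replaced with your Local Dag Path.
    from pathlib import Path

    local_dag_path = Path("/home/ovaledge/dags")   # example local (OvalEdge) path
    local_dags = sorted(p.name for p in local_dag_path.glob("*.py"))

    print(f"Local DAG files: {len(local_dags)}")
    for name in local_dags:
        print(" ", name)
    # Compare this count with the number of DAG files in the Remote Dag Path
    # on the Airflow server; both must match for the connection to validate.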

Once connectivity is established, additional configurations for Crawling and Profiling can be specified.

Points to be noted
    1. Airflow requires its DAG path on the remote (Airflow) server, its DAG path on the local (OvalEdge) server, the IP address, the username, and the password. All fields are mandatory. The connection will succeed only if the connection details are correct and the Local DAG Path is valid.

    2. All the DAGs must be copied from the remote server (Airflow) to the local server (OvalEdge).

    3. Airflow DAGs are considered datasets, and the tasks of each DAG are considered child datasets.

    4. Each DAG has Python code associated with it, which must be copied from Airflow to OvalEdge. The connector reads this Python code to create the dataset; it can do so successfully only if the remote DAG path is correct and the corresponding local DAG file exists. A minimal example DAG is sketched after this list.

    5. The Airflow web UI can be accessed using the URL http://{host:port}/admin/airflow/login
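
A minimal DAG file of the kind the connector reads might look like the following sketch. All names, the schedule, and the commands are illustrative (Airflow 2.x style imports are assumed); the DAG itself would surface as a dataset and each of its tasks as a child dataset.

    # Illustrative DAG: the DAG (example_etl) would be crawled as a dataset,
    # and each task below (extract, load) as a child dataset.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> load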