Preparing to run the Apache Airflow collector

Warning

This collector is in public preview. It has passed our standard testing, but it is not yet widely adopted. You might encounter unforeseen edge cases in your environment. data.world is committed to promptly addressing any issues with public preview collectors. If you face any problems, please report them through your Customer Success Director, implementation team, or support team for assistance.

Setting up prerequisites for running the collector

Make sure the machine you run the collector from meets the following hardware and software requirements.

Table 1. Hardware and software requirements

Hardware

Note: The following specs are based on running one collector process at a time. Adjust the hardware if you are running multiple collectors at the same time.

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software

  • Docker: Install Docker on the machine.

data.world specific objects

  • Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

  • Allowlist IPs and domains



Enabling the REST API in Apache Airflow

Prerequisites

  • You must have Airflow Version 2.10.4 or later.

  • Ensure that CORS is enabled by following the instructions in the Airflow documentation on enabling CORS.

  • Confirm that the REST API is not disabled in your configuration.
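For reference, CORS is configured in the [api] section of airflow.cfg. A minimal sketch is shown below; the origin value is a placeholder for your own client, and the exact option names can vary by Airflow version, so check the Airflow configuration reference for your release:

```ini
[api]
access_control_allow_headers = origin, content-type, accept
access_control_allow_methods = POST, GET, OPTIONS, DELETE
access_control_allow_origins = https://example-client.your-domain.com
```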

To enable the REST API:

Important

For detailed instructions on enabling the REST API, refer to the Airflow documentation.

  1. Open your airflow.cfg file located in the Airflow home directory.

  2. Find the auth_backends setting in the file.

  3. Ensure it contains the airflow.api.auth.backend.basic_auth value. Here is what the full entry should look like:

    auth_backends = airflow.api.auth.backend.session,airflow.api.auth.backend.basic_auth
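To double-check the resulting configuration, you can parse airflow.cfg with Python's standard-library configparser. This is a small sketch, not part of the collector; the `basic_auth_enabled` helper and the sample fragment are illustrative only:

```python
from configparser import ConfigParser

def basic_auth_enabled(cfg_text: str) -> bool:
    """Return True if basic_auth appears in the [api] auth_backends setting."""
    config = ConfigParser()
    config.read_string(cfg_text)
    backends = config.get("api", "auth_backends", fallback="")
    return "airflow.api.auth.backend.basic_auth" in backends

# Example: a minimal airflow.cfg fragment with both backends enabled.
sample = """
[api]
auth_backends = airflow.api.auth.backend.session,airflow.api.auth.backend.basic_auth
"""
print(basic_auth_enabled(sample))  # → True
```

In a real environment you would read the file from your Airflow home (for example with `config.read(...)`) instead of passing a string.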