Preparing to run the Amazon Managed Streaming for Kafka (MSK) Collector

Setting up prerequisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item	Requirement
Hardware
Note: The following specifications assume that one collector process runs at a time. Increase the hardware accordingly if you run multiple collectors at the same time.
RAM	8 GB
CPU	2 GHz processor
Software (Docker or Java Runtime Environment)
Docker	Download and install Docker.
Java Runtime Environment	OpenJDK 17 is supported.
data.world specific objects
Dataset	You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.
Network connection
Allowlist IPs and domains	Follow these instructions to configure your network. Use these tools to check network connections before running the collector.
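
As a quick way to confirm the software requirement in the table, the following minimal Python sketch checks whether a docker or java executable is on the PATH of the current machine. The function names are illustrative and not part of the collector. The sketch only detects that an executable exists; it does not verify that the Java runtime is version 17.

```python
import shutil


def find_runtimes(candidates=("docker", "java")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in candidates}


def has_supported_runtime(found):
    """Return True if at least one of the candidate runtimes was found."""
    return any(path is not None for path in found.values())


if __name__ == "__main__":
    found = find_runtimes()
    for name, path in found.items():
        print(f"{name}: {path or 'not found on PATH'}")
    if not has_supported_runtime(found):
        print("Neither Docker nor a Java runtime was found on this machine.")
```

Only one of the two runtimes is needed, which is why the check passes as soon as either executable is found.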

Setting up Amazon MSK

Network considerations

The collector requires a network path to the Kafka cluster, and the port configured for each listener must be open. Coordinate with your IT department to establish the required infrastructure and configuration.

Here are some options:

Create an EC2 instance or similar resource within the same VPC as the running cluster and deploy the collector there, ensuring network routing to the cluster.
Set up a VPN on the machine where the collector will run to make the collector network routable to the cluster.
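
Whichever option you choose, it can help to confirm from the collector machine that the broker ports are reachable before running the collector. The following is a minimal sketch that attempts a plain TCP connection to each broker endpoint. The host name and port in the example are hypothetical placeholders; use the bootstrap broker addresses and listener ports of your own cluster. A successful TCP connection only shows that the network path is open; it does not test authentication or TLS.

```python
import socket
from typing import Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def check_endpoints(endpoints: Iterable[Endpoint], timeout: float = 3.0) -> Dict[Endpoint, bool]:
    """Try a TCP connection to each (host, port) and record whether it succeeded."""
    results: Dict[Endpoint, bool] = {}
    for host, port in endpoints:
        try:
            # create_connection resolves the host name and opens a TCP socket.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[(host, port)] = False
    return results


if __name__ == "__main__":
    # Hypothetical broker endpoint; replace with your cluster's values.
    brokers = [("b-1.example-cluster.example.com", 9094)]
    for (host, port), reachable in check_endpoints(brokers).items():
        print(f"{host}:{port} -> {'reachable' if reachable else 'NOT reachable'}")
```

If an endpoint is reported as not reachable, review the routing, security group, or VPN configuration with your IT department.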

Setting up a user

Work with your Amazon MSK administrator to set up a user that the collector will use to authenticate to the Amazon MSK cluster.

Setting up permissions for topics and topic producers

For the collector to harvest metadata about a topic (including partitions, consumers, consumer groups, and schemas), the cluster user passed to the collector must have DESCRIBE permission on the topic. If the user does not have DESCRIBE permission, the collector does not see that topic and cannot write any information about it to the catalog graph.
For the collector to harvest information about a topic’s producers, the cluster user passed to the collector must have READ permission on the topic. If the user lacks READ permission, the collector writes a warning message indicating that producer information cannot be harvested. A user with READ permission on a topic automatically has DESCRIBE permission as well.
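
The permission rules above can be summarized in a small Python sketch. This is an illustrative model of the rules as described here, not the collector's implementation; the function and key names are hypothetical. It also assumes that no producer warning is written for a topic the collector cannot see at all.

```python
from typing import Dict, Iterable, Set

# Per the text above, READ on a topic automatically grants DESCRIBE.
IMPLIED_PERMISSIONS: Dict[str, Set[str]] = {"READ": {"DESCRIBE"}}


def effective_permissions(granted: Iterable[str]) -> Set[str]:
    """Expand the granted permissions with the ones they imply."""
    perms = {p.upper() for p in granted}
    for perm in list(perms):
        perms |= IMPLIED_PERMISSIONS.get(perm, set())
    return perms


def harvest_outcome(granted: Iterable[str]) -> Dict[str, bool]:
    """Describe what can be harvested for one topic, given the user's permissions."""
    perms = effective_permissions(granted)
    topic_visible = "DESCRIBE" in perms
    producers = "READ" in perms
    return {
        "topic_metadata": topic_visible,   # partitions, consumers, groups, schemas
        "producer_info": producers,
        # Assumption: the warning applies only to topics the collector can see.
        "producer_warning": topic_visible and not producers,
    }


if __name__ == "__main__":
    for granted in ([], ["DESCRIBE"], ["READ"]):
        print(granted, harvest_outcome(granted))
```

In this model, granting only READ is enough for both topic metadata and producer information, because READ carries DESCRIBE with it.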
