# Kafka Connect
<p>Kafka Connect is a tool for reliably and scalably streaming data between Kafka and other systems. Connectors make it easy to move large data sets into and out of Kafka.</p>
## Kafka Connect Types
<b>1. Source Connector:</b> A source connector ingests entire databases and streams table updates to Kafka topics. It can also collect metrics from all of your application servers into Kafka topics, making the data available for stream processing with low latency.
<b>2. Sink Connector:</b> A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch, or into batch systems such as Hadoop, for offline analysis (see the sketches below).
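As a rough illustration of the difference, the two FileStream connectors that ship with Apache Kafka can be configured as a source and a sink. This is only a sketch; the file paths and topic name are assumptions for the example:

```properties
# Source connector sketch: read lines from a local file and produce them to a topic.
name=file-source-example
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=demo-topic
```

```properties
# Sink connector sketch: consume the same topic and write records to a local file.
name=file-sink-example
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=demo-topic
file=/tmp/output.txt
```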

Kafka Connect makes it easy to stream data from numerous sources into Kafka, and from Kafka out to numerous targets, with hundreds of connectors available.
<br>
## How Kafka Connect Works
Kafka Connect runs in its own process, separate from the Kafka brokers. It is distributed, scalable, and fault tolerant, giving you the same features you know and love about Kafka itself.

<br><br>
There are literally hundreds of different connectors available for Kafka Connect; installing one is sketched just after this list. Some of the most popular ones include:
* RDBMS (Oracle, SQL Server, Db2, Postgres, MySQL)
* Cloud object stores (Amazon S3, Azure Blob Storage, Google Cloud Storage)
* Message queues (ActiveMQ, IBM MQ, RabbitMQ)
* NoSQL and document stores (Elasticsearch, MongoDB, Cassandra)
* Cloud data warehouses (Snowflake, Google BigQuery, Amazon Redshift)
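Connector plugins are usually installed into the worker's plugin path. A minimal sketch, assuming the Confluent Hub client (`confluent-hub`) is available on the worker host and using the JDBC connector purely as an example:

```bash
# Sketch: install a connector plugin from Confluent Hub (the connector name is illustrative).
confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest

# Restart the Connect worker afterwards so it picks up the new plugin.
```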
## Kafka Connect Terminologies
<b>1. Connectors:</b> A connector coordinates and manages the copying of data between Kafka and another system. Creating a connector instance sets up and manages that data stream. All of the classes a connector uses are packaged in a connector plugin.
<b>2. Tasks:</b> Tasks do the actual work of copying data to or from Kafka. Each connector instance coordinates a set of tasks that perform the copying.
<b>3. Workers:</b> Connectors and tasks are logical units of work; workers are the running processes that execute them.
<b>There are two types of workers:</b>
<p>A. <b>Standalone Workers</b>: A single process executes all connectors and tasks. This is the simplest mode and needs the least configuration, but it offers limited functionality and scalability and provides no fault tolerance beyond external monitoring.
B. <b>Distributed Workers</b>: Multiple worker processes, configured with the same group id, execute the connectors and tasks. The cluster automatically schedules work across all active workers and rebalances it when a worker is added, fails, or shuts down (see the sketch just below).</p>
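A minimal sketch of the worker settings that distinguish the two modes; the broker address, group id, and topic names are illustrative assumptions:

```properties
# Standalone worker (sketch): offsets are kept in a local file.
bootstrap.servers=127.0.0.1:9092
offset.storage.file.filename=/tmp/standalone.offsets
```

```properties
# Distributed worker (sketch): workers sharing the same group.id form one
# Connect cluster and keep their state in internal Kafka topics.
bootstrap.servers=127.0.0.1:9092
group.id=connect-cluster-demo
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
```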
<b>4. Converters:</b> Converters are the code that translates data between Kafka Connect and the external system. Tasks use converters to change data between the raw bytes stored in Kafka and Kafka Connect's internal data format.
<b>5. Transforms:</b> Transforms apply simple, lightweight modifications to messages. A transform is a function that takes a single record as input, modifies it, and outputs the record.
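For example, converters are normally configured on the worker, while transforms are declared on a connector. A minimal sketch, assuming a FileStreamSource connector and the built-in HoistField transform; all names and values here are illustrative:

```properties
# Converters are usually set on the worker, e.g.:
#   value.converter=org.apache.kafka.connect.json.JsonConverter
#   value.converter.schemas.enable=false

# Connector config sketch with a single message transform.
name=file-source-with-transform
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=demo-topic

# Wrap each line read from the file in a field called "line" before it is
# written to Kafka, using the built-in HoistField transform.
transforms=WrapLine
transforms.WrapLine.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.WrapLine.field=line
```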
<br>
## Popular Connectors
* JDBC Source and Sink
* JMS Source
* Elasticsearch Service Sink
* Amazon S3 Sink
* HDFS 2 Sink
* Replicator
* ActiveMQ Source and Sink
* Amazon S3 Source and Sink
<br>
## Setting Up a Kafka Connector
#### Running the FileStreamSourceConnector in Standalone Mode
1. Start our Kafka cluster:
`$ docker-compose up kafka-cluster`
2. Start an interactive tools container, with the current directory mounted into it:
`docker run --rm -it -v "$(pwd)":/tutorial --net=host landoop/fast-data-dev:cp3.3.0 bash`
3. Inside the container, go to the demo directory:
`cd /tutorial/source/demo-1`
4. Create the topic we will write to, with 3 partitions:
`kafka-topics --create --topic demo-1-standalone --partitions 3 --replication-factor 1 --zookeeper 127.0.0.1:2181`
5. Launch the connector in standalone mode. Usage: `connect-standalone worker.properties connector1.properties [connector2.properties connector3.properties]`
`connect-standalone worker.properties file-stream-demo-standalone.properties`
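The two properties files referenced in step 5 are not reproduced in this document; the following is a minimal sketch of what they typically contain for this demo. The offset file name, REST port, and source file name are assumptions:

```properties
# worker.properties (sketch) -- standalone worker configuration
bootstrap.servers=127.0.0.1:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
# Standalone mode keeps source offsets in a local file.
offset.storage.file.filename=standalone.offsets
# Use a port other than 8083, which is already taken by the Connect worker
# running inside the fast-data-dev container.
rest.port=8086
```

```properties
# file-stream-demo-standalone.properties (sketch) -- the connector itself
name=file-stream-demo-standalone
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Read lines from this file (create it in the current directory first)...
file=demo-file.txt
# ...and produce them to the topic created in step 4.
topic=demo-1-standalone
```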
#### Running the FileStreamSourceConnector in Distributed Mode
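The distributed walkthrough is not written out here. As a rough sketch: the fast-data-dev image already runs a distributed Connect worker on port 8083, so the connector is created through the Connect REST API instead of a local properties file. The topic and connector names below are illustrative:

```bash
# Create the topic the distributed connector will write to (sketch).
kafka-topics --create --topic demo-2-distributed --partitions 3 --replication-factor 1 --zookeeper 127.0.0.1:2181

# Register the FileStreamSource connector with the distributed worker's REST API.
curl -s -X POST -H "Content-Type: application/json" http://127.0.0.1:8083/connectors -d '{
  "name": "file-stream-demo-distributed",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "demo-file.txt",
    "topic": "demo-2-distributed"
  }
}'

# Check that the connector and its task are running.
curl -s http://127.0.0.1:8083/connectors/file-stream-demo-distributed/status
```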
<br><br><br>
## Start Kafka with Confluent Cloud
1. Browse to the sign-up page: https://www.confluent.io/confluent-cloud/tryfree/ and fill in your contact information and a password. Then click the Start Free button and wait for a verification email.

2. Click the link in the confirmation email and then follow the prompts (or skip) until you get to the Create cluster page. Here you can see the different types of clusters that are available, along with their costs. For this course, the Basic cluster will be sufficient and will maximize your free usage credits. After selecting Basic, click the Begin configuration button.

3. Choose your preferred cloud provider and region, and click <b>Continue</b>.

4. Review your selections and give your cluster a name, then click Launch cluster. This might take a few minutes.
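Once the cluster is running, clients (including Kafka Connect workers) authenticate to it with an API key over SASL_SSL. A minimal connection sketch; the bootstrap server, API key, and secret below are placeholders taken from your own cluster's client settings and API key pages:

```properties
# Connection settings sketch for a Confluent Cloud cluster (all values are placeholders).
bootstrap.servers=<YOUR_BOOTSTRAP_SERVER>
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<API_KEY>" password="<API_SECRET>";
```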