What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform, originally created at LinkedIn and now maintained as a project of the Apache Software Foundation, with Confluent among its major contributors. Kafka helps you decouple data streams and systems to achieve a few goals:
- A distributed, resilient, fault-tolerant architecture
- Horizontal scalability
- High performance
What Problem Does It Help Solve?
Kafka is used as a transport mechanism for data. Here are some common applications:
- Messaging systems
- Activity tracking
- Gathering metrics from different locations
- Gathering logs
- Stream processing
- Decoupling of system dependencies
Netflix, for example, describes Apache Kafka as the de facto standard for its eventing, messaging, and stream-processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio-wide communications, providing the high durability and the linearly scalable, multi-tenant architecture that Netflix requires to operate its systems.
Basic Concepts
- Topic: a particular stream of data (similar to a table in a database)
- Topics are split into partitions
- Messages within each partition are ordered
- Each message within a partition gets an incremental ID, called an offset
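
The relationship between topics, partitions, and offsets can be pictured with a small in-memory model. This is only an illustrative sketch, not how Kafka stores data: the `Topic` and `Partition` classes and the `page-views` topic name are hypothetical, and each partition is modeled as a plain Python list whose index serves as the offset.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only, ordered log; the list index doubles as the offset."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message


@dataclass
class Topic:
    """A named stream of data, split into a fixed number of partitions."""
    name: str
    partitions: list

    @classmethod
    def create(cls, name, num_partitions):
        return cls(name, [Partition() for _ in range(num_partitions)])


# Hypothetical topic with three partitions.
topic = Topic.create("page-views", num_partitions=3)

first = topic.partitions[0].append("user-1 viewed /home")
second = topic.partitions[0].append("user-1 viewed /cart")
other = topic.partitions[1].append("user-2 viewed /home")

print(first, second, other)  # prints "0 1 0"
```

Offsets grow independently in each partition, so an offset is only meaningful together with its partition number.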
- A Kafka cluster is composed of multiple brokers (servers)
- Each broker is identified by its ID
- Producers write data to topics
- Producers automatically know which broker and partition to write to
- If a broker fails, producers recover automatically
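
One way to picture how a producer decides where to write is to hash a message key to a partition and then look up the broker hosting that partition. The sketch below rests on several assumptions that are not in the text above: messages carry a key, the partition-to-broker table and its broker IDs are made up, and `zlib.crc32` is only a stand-in hash (the Java Kafka client, for instance, uses murmur2 for keyed messages).

```python
import zlib

# Hypothetical cluster metadata: partition number -> ID of the broker hosting it.
PARTITION_TO_BROKER = {0: 101, 1: 102, 2: 103}


def route(key: bytes):
    """Return (partition, broker_id) for a message key.

    crc32 is a stand-in hash used for illustration only.
    """
    partition = zlib.crc32(key) % len(PARTITION_TO_BROKER)
    return partition, PARTITION_TO_BROKER[partition]


# The same key always maps to the same partition, so its messages stay in order.
assert route(b"user-1") == route(b"user-1")

partition, broker_id = route(b"user-1")
print(f"key user-1 -> partition {partition} on broker {broker_id}")
```

Because the mapping is deterministic, all messages with the same key land in the same partition and keep their relative order.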
On the consumer side:
- Consumers read data from a topic
- Consumers automatically know which broker to read from
- If a broker fails, consumers know how to recover
- Data is read in order within each partition
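
Ordered reads and recovery can be sketched with a consumer that tracks how far it has read in a partition. The sketch assumes recovery works by remembering the offset of the next unread message (a "committed" offset) and resuming from it. The `consume` helper and the sample log are hypothetical, and a real consumer would persist this offset rather than keep it in a local variable.

```python
def consume(log, committed_offset, max_messages):
    """Read up to max_messages from one partition, starting at committed_offset.

    Returns the batch (in partition order) and the new committed offset.
    """
    batch = log[committed_offset:committed_offset + max_messages]
    return batch, committed_offset + len(batch)


# Hypothetical contents of a single partition.
partition_log = ["m0", "m1", "m2", "m3", "m4"]

# The first consumer reads two messages, then stops (for example, after a failure).
first_batch, committed = consume(partition_log, 0, 2)

# A restarted consumer resumes from the committed offset: nothing is skipped or re-read.
second_batch, committed = consume(partition_log, committed, 10)

print(first_batch, second_batch, committed)
# prints "['m0', 'm1'] ['m2', 'm3', 'm4'] 5"
```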
Kafka vs. Storm, Hadoop, and Spark
Storm
- Distributed real-time processing
- Stateless: data is streamed
- Stream abstraction
- Micro-batch processing (through the Trident API; core Storm processes one tuple at a time)
Kafka
- Distributed message broker
- Focused on transferring messages; data is stored on the filesystem
- Uses the publish-subscribe paradigm
- Stream processing
Hadoop
- Distributed processing
- State-based: data is static and stored
- MapReduce cluster-computing paradigm
- Batch processing
Spark
- Distributed processing
- Stateless / Stateful
- Resilient distributed dataset (RDD)
- Batch processing (plus micro-batch stream processing through Spark Streaming)