What is Apache Kafka?
Apache Kafka is an open-source distributed event-streaming platform, originally created at LinkedIn and now maintained as an Apache Software Foundation project. Kafka helps you decouple data streams and systems to achieve a few goals:
- Distributed, resilient, fault-tolerant architecture
- Horizontal scalability
- High performance
What Problem Does it Help Solve?
Kafka is used as a transport mechanism for data. Here are some common applications:
- Messaging systems
- Activity tracking
- Metrics gathering from different locations
- Log gathering
- Stream processing
- Decoupling of system dependencies
In Netflix's own words: "Netflix embraces Apache Kafka as the de-facto standard for its eventing, messaging, and stream processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio wide communications. It provides us with the high durability and linearly scalable, multi-tenant architecture required for operating systems at Netflix."
Basic Concepts
- Topics: a particular stream of data (similar to a table in a database)
- Topics are split into partitions
- Each partition is ordered
- Each message within a partition gets an incremental ID, called an offset
![](https://cdn.prod.website-files.com/5f3acb2672fdcd05b7611500/5fdb9d2bc36ee9b27ffd3ec5_JVKiWQkGLAHVGY31vkKPoioVEeSPhKGt4A1TwUM9_ycAkKvQewO-m3pavoZohbJDoI4nIBJhjkVbNih8TVrHIyEPlpXaHPKVqsWfTZ8WNQvTTbtNq6jhmVydO4hIdK1nkv6oDQgZ.png)
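To make the relationship between topics, partitions, and offsets concrete, here is a minimal in-memory sketch. It is not Kafka's implementation or client API; the `Topic` and `Partition` classes and the `page_views` topic name are hypothetical, chosen only to show that each partition is an ordered, append-only log and that offsets are assigned per partition.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An ordered, append-only log; a record's offset is its index in the log."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record


@dataclass
class Topic:
    """A named stream of data split into a fixed number of partitions."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]


# Hypothetical topic with three partitions.
topic = Topic("page_views", num_partitions=3)

first = topic.partitions[0].append("user-1 viewed /home")
second = topic.partitions[0].append("user-1 viewed /pricing")
other = topic.partitions[1].append("user-2 viewed /docs")

print(first, second, other)  # 0 1 0
```

Note that the offset `0` appears twice: offsets only have meaning inside a single partition, which is also why ordering is guaranteed per partition rather than across the whole topic.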
- A Kafka cluster is composed of multiple brokers (servers)
- Each broker is identified by its ID
![](https://cdn.prod.website-files.com/5f3acb2672fdcd05b7611500/5fdb9d2a3a5cd3333dbda169_hmNEAI3YT4QE7CRRvMrb_Lj3uKLrMtErqweBU54H9YwOM1oyCldvO9yegOazxJPVySF6dVdxpaAtMc34amrT0V_3-feYo4fGWYA2BVPSUOGQje4PMtwZ1fPqtF5DhxUIywLCZyWz.png)
- Producers write data to topics
- Producers automatically know which broker and partition to write to
- If a broker fails, producers recover automatically
![](https://cdn.prod.website-files.com/5f3acb2672fdcd05b7611500/5fdb9d2bc221ce809eda10f5_Vru6lFZgf4z9-64g9HPEjhFWV2d8RIilB2wZ8L4vYPMnZ8-iYrhO2Bjmyc2x35jnpxGgbaYiplkP_zWtYZV3Go7fnNm8ThuA9-TYf3SR2JLDNoYEvQ9LIJi5z3TYjrPL9wb4PD_J.png)
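One way a producer can decide which partition to write to is by hashing the record key, so that records with the same key always land in the same partition and keep their relative order. The sketch below illustrates that idea only. The `choose_partition` function is hypothetical, and CRC32 is an assumption standing in for whatever hash a real Kafka client uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash.

    Assumption: CRC32 is a stand-in hash for illustration; the only property
    that matters here is that the mapping is deterministic.
    """
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one ordered log per partition

events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in events:
    p = choose_partition(key, NUM_PARTITIONS)
    partitions[p].append((key, value))

# All events for user-1 sit in one partition, in the order they were sent.
p1 = choose_partition(b"user-1", NUM_PARTITIONS)
user1_events = [v for k, v in partitions[p1] if k == b"user-1"]
print(user1_events)  # ['login', 'logout']
```

Because the mapping depends only on the key and the partition count, the application never has to name a partition explicitly.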
On the consumer side, consumers:
- Read data from a topic
- Know which broker to read from
- Know how to recover if a broker fails
- Read data in order within each partition
![](https://cdn.prod.website-files.com/5f3acb2672fdcd05b7611500/5fdb9d2b3a5cd31d59bda16a_24rgQ-dVti-vLnmRScIpugTdnGAVvIYRjDT7kmnC9FJUjA1JAia0QB4pz1mhkG6KTb5y_ciEtp8wAGGinhjVHy8UA8VpDkF78m92U-c6AlpfIhEPJAlCnIc2KsdEf1To6vTX91-S.png)
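In-order reads and recovery both come down to tracking an offset. The following is a simplified sketch, not the Kafka consumer API: the `Consumer` class and its `poll` method are hypothetical, and the saved offset is held in a plain variable, whereas a real deployment would persist it.

```python
class Consumer:
    """Reads one partition in order and remembers the next offset to read."""

    def __init__(self, partition_log):
        self.log = partition_log  # the partition: an ordered list of records
        self.position = 0         # offset of the next record to read

    def poll(self, max_records=2):
        """Return up to max_records records, starting at the current position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


partition_log = ["a", "b", "c", "d", "e"]

consumer = Consumer(partition_log)
first_batch = consumer.poll()     # reads offsets 0 and 1
saved_offset = consumer.position  # next offset to read: 2

# Simulate a restart: a new consumer resumes from the saved offset.
restarted = Consumer(partition_log)
restarted.position = saved_offset
rest = restarted.poll(max_records=10)

print(first_batch, rest)  # ['a', 'b'] ['c', 'd', 'e']
```

Since the consumer only moves forward through the log, resuming from the saved offset neither skips nor re-reads records in this simplified model.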
Kafka vs. Storm, Hadoop, and Spark
Storm
- Distributed real-time processing
- Stateless; data is streamed
- Stream abstraction
- Micro-batch processing (through the Trident API; core Storm processes one tuple at a time)
Kafka
- A distributed message broker
- Transfers messages; data is stored on the filesystem
- Uses the publish-subscribe paradigm
- Stream processing
Hadoop
- Distributed processing
- State-based; data is static and stored
- MapReduce cluster computing paradigm
- Batch Processing
Spark
- Distributed processing
- Stateless / Stateful
- Resilient distributed dataset (RDD)
- Batch processing, plus micro-batch stream processing