Batch vs. Streaming: A Simple Analogy
Most traditional data processing is batch processing.
- Analogy: It's like getting your physical mail delivered. The post office collects all the mail for the day (the batch), sorts it, and delivers it to you in one go. You process your batch of mail once per day.
- In Data: You collect a day's worth of log files, then run a large job at midnight to process them all at once.
Streaming, on the other hand, is about processing data in near real-time as it arrives.
- Analogy: It's like receiving text messages. You get notified and can read and react to each message the moment it arrives.
- In Data: As users click on your website, each click event is sent to a processing system immediately, allowing you to update a "live" dashboard of user activity.
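The contrast can be sketched in a few lines of plain Python (the event list and counter names here are made up for illustration): a batch job waits until the whole collection exists and processes it once, while a streaming handler updates its state the moment each event arrives.

```python
events = ["click:/home", "click:/pricing", "click:/docs"]

def run_batch(collected):
    """Batch: wait until the whole day's events are collected, then process once."""
    return {"total_clicks": len(collected)}

def run_streaming(event_stream):
    """Streaming: update state the moment each event arrives."""
    state = {"total_clicks": 0}
    for _ in event_stream:
        state["total_clicks"] += 1
        yield dict(state)  # a "live" view after every single event
```

The batch function produces one answer at the end; the streaming generator produces an up-to-date answer after every event, which is exactly the property a live dashboard needs.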
Apache Kafka: The Central Nervous System for Data
To handle streaming data, you need a system that can reliably receive and queue massive volumes of events from many sources. This is where Apache Kafka comes in.
Kafka is a distributed event streaming platform. Think of it as a highly scalable, fault-tolerant, and durable "inbox" for your data streams.
Core Kafka Concepts
- Producer: An application that sends (produces) messages. For example, a web server could be a producer that sends a message for every user click.
- Consumer: An application that reads (consumes) messages. A fraud detection system could be a consumer that reads a stream of financial transactions.
- Topic: A named category or feed to which messages are sent. A producer writes to a topic, and a consumer reads from a topic (e.g., user_clicks, iot_sensor_readings).
- Broker: A single Kafka server. Kafka is usually run as a cluster of multiple brokers for fault tolerance and scalability.
A producer sends a record to the user_clicks topic. A consumer (like Spark) can then read from that topic to process the click events in real time.
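As a sketch, a producer for that topic might look like the following. This assumes the third-party `kafka-python` package and a broker at `localhost:9092`; the `user_id`/`page` fields and `encode_click` helper are hypothetical, chosen only to illustrate the message format.

```python
import json

def encode_click(user_id: str, page: str) -> bytes:
    """Serialize a click event into the JSON bytes we publish to Kafka."""
    return json.dumps({"user_id": user_id, "page": page}).encode("utf-8")

try:
    # Hypothetical wiring: requires `pip install kafka-python` and a running broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("user_clicks", encode_click("u42", "/pricing"))
    producer.flush()  # block until the record is actually delivered
except Exception:
    # No broker available; the message format above is the part that matters here.
    pass
```

Consumers on the other side of the topic receive exactly these bytes, which is why the Spark example later in this article has to parse a binary `value` column.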
Spark Structured Streaming: Processing Streams with DataFrames
Spark Structured Streaming is a stream-processing engine built on the Spark SQL engine. Its key innovation is treating a data stream as an unbounded table: a table that grows continuously as new records arrive.
You can define a query on this "table" using the exact same DataFrame API you use for batch processing. Spark takes care of continuously running your query and updating the results as new data arrives in the stream.
Putting It All Together: A Simple Pipeline
Let's build a simple pipeline that reads messages from a Kafka topic, performs a transformation, and prints the result to the console.
Scenario: We have a Kafka topic named logs where messages are being sent in the format {"level": "INFO", "message": "User logged in"}. We want to count the number of logs for each level in real time.
PySpark Code:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
    .appName("KafkaStreamingExample") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0") \
    .getOrCreate()

# 1. Read from the Kafka stream. This creates our "unbounded table".
streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "logs") \
    .load()

# The data from Kafka arrives in a binary 'value' column. We need to parse it.
json_schema = StructType([
    StructField("level", StringType()),
    StructField("message", StringType())
])

parsed_df = streaming_df.select(from_json(col("value").cast("string"), json_schema).alias("data")) \
    .select("data.*")

# 2. Define the transformation (our query on the stream).
# This is the exact same DataFrame code you'd use in a batch job.
log_counts_df = parsed_df.groupBy("level").count()

# 3. Start the query and write the output to the console.
# The 'update' output mode prints only the counts that changed in each micro-batch.
query = log_counts_df.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
```
When you run this, Spark will connect to Kafka, and your console will continuously update with the latest counts of INFO, ERROR, and WARN logs as they are produced.
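For a single record, the parsing step in the pipeline — casting the binary `value` to a string and applying `from_json` — is equivalent to this plain-Python decoding (the `parse_log_record` helper is ours, shown only to make the transformation concrete):

```python
import json

def parse_log_record(raw_value: bytes) -> dict:
    """Decode one Kafka record's binary value into its level/message fields,
    mirroring col("value").cast("string") followed by from_json."""
    data = json.loads(raw_value.decode("utf-8"))
    return {"level": data.get("level"), "message": data.get("message")}

parse_log_record(b'{"level": "INFO", "message": "User logged in"}')
# -> {'level': 'INFO', 'message': 'User logged in'}
```

Spark applies this same decoding to every record in every micro-batch before the `groupBy("level").count()` aggregation runs.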