From Batch to Live: Transitioning to Spark Structured Streaming
Table of Contents
- Introduction
- A Quick History of Apache Spark and Streaming
- What Is Structured Streaming?
- Setting Up Your Environment
- Spark Structured Streaming Fundamentals
- Streaming DataFrames and Datasets
- Sources and Sinks
- Triggers
- Output Modes
- Core Transformations and Operations
- Basic Transformations
- Aggregations
- Joins
- Event-Time Processing and Watermarking
- Checkpointing and Fault Tolerance
- Stateful Streaming
- Windows and Time-Based Operations
- Micro-Batch vs. Continuous Processing
- Advanced Topics
- Monitoring and Debugging
- Performance Tuning
- Integrating with Kafka
- Custom Sinks
- Handling Late Data
- Putting It All Together
- Conclusion and Further Reading
1. Introduction
Apache Spark is a powerful, open-source distributed computing system for processing large datasets efficiently. It gained popularity thanks to its versatility, offering different APIs for batch processing, SQL, machine learning, and graph analytics. Over time, Spark evolved to support streaming capabilities, enabling data engineers and scientists to build end-to-end pipelines that can process and analyze live data.
Traditionally, many organizations focused on batch-oriented data processing pipelines, which meant running scheduled jobs (e.g., overnight) to process and transform data in bulk. While this approach works for a number of use cases, it doesn’t cater well to scenarios requiring near real-time updates—like fraud detection, dynamic pricing, IoT data analysis, and so on.
Spark Structured Streaming is Apache Spark’s powerful framework for building continuous, near real-time data pipelines. This blog post explores everything from getting started with Spark Structured Streaming to advanced concepts such as watermarking, windowing, and stateful stream processing. By the end, you will be well equipped to transition your workloads from batch to live, unlocking the potential of continuous data processing in your organization.
2. A Quick History of Apache Spark and Streaming
Spark started as a flexible engine for distributed batch processing, designed to outperform Hadoop’s MapReduce with in-memory computation. Over time, Spark added several APIs:
- RDD (Resilient Distributed Dataset): Spark’s original low-level API.
- DataFrame: High-level API built on top of RDDs, inspired by data frames in R and Python’s pandas.
- Dataset: Type-safe, object-oriented API for Scala and Java, offering more compile-time type safety than DataFrames.
- Structured Streaming: High-level streaming API built on top of DataFrames and Datasets for near real-time data processing.
Before Structured Streaming, Spark offered Discretized Stream (DStream), which was micro-batch based. That earlier streaming engine worked well but required working with RDDs and transformations that were different from batch APIs, creating complexity in unifying batch and streaming pipelines. Structured Streaming unified the APIs, making it possible to handle streaming data using the same DataFrame and Dataset transformations that are used for batch analytics.
3. What Is Structured Streaming?
Structured Streaming is an extension of the Spark SQL engine that allows you to process data streams using the same abstractions and operations you would in batch mode. In practice, Spark Structured Streaming treats streaming data as an unbounded table, where new data continually arrives. As new rows appear in this unbounded table, Spark incrementally updates the results of a query.
Key Features
- End-to-end exactly-once semantics: Structured Streaming can ensure that each record affects the results exactly once, provided the source is replayable (e.g., Kafka or files), checkpointing is enabled, and the sink is idempotent or transactional.
- Same APIs for batch and streaming: You only need to define a transformation pipeline once, and Spark can run it in either batch or streaming mode with minimal changes (see the short sketch after this list).
- Fault tolerance: By design, Spark can recover from node failures, restarts, and partial job failures while preserving processing guarantees.
- Efficient state management: When using features like aggregations, Spark manages state internally and optimizes for memory usage, scale, and fault tolerance.
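To make the “same APIs” point concrete, here is a minimal sketch (the schema, paths, and app name are hypothetical) in which the transformation is identical and only the read/write calls change between batch and streaming:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("BatchVsStreaming").master("local[*]").getOrCreate()

// Hypothetical schema and paths, used only for illustration
val schema = new StructType().add("level", StringType).add("message", StringType)

// Batch version
spark.read.schema(schema).json("data/logs")
  .filter(col("level") === "ERROR")
  .write.mode("overwrite").parquet("out/errors_batch")

// Streaming version: the transformation is identical; only read/write change
val streamQuery = spark.readStream.schema(schema).json("data/logs")
  .filter(col("level") === "ERROR")
  .writeStream
  .format("parquet")
  .option("path", "out/errors_stream")
  .option("checkpointLocation", "out/errors_stream_chk")
  .start()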
4. Setting Up Your Environment
To follow along with the examples provided in this blog post, you’ll need a functional Spark environment. Common ways to set this up include:
- Installing Spark locally from the Apache website.
- Using a Docker container that has Spark pre-configured.
- Setting up a cloud-based environment (Databricks, EMR on AWS, HDInsight on Azure, etc.).
For local development, installing Spark standalone (along with Python or Scala) is often sufficient.
Example: Local Installation Steps for Spark (Mac/Linux)
- Download Spark from the official website: https://spark.apache.org/downloads.html.
- Unpack the downloaded file:
tar -xvf spark-3.3.1-bin-hadoop3.tgz
- Add the Spark bin directory to your PATH, for instance:
export SPARK_HOME=~/spark-3.3.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
- Verify the installation:
spark-shell --version
Once your environment is ready, you can follow the examples in Scala, Python, or even Spark Notebooks.
5. Spark Structured Streaming Fundamentals
Streaming DataFrames and Datasets
Structured Streaming operates on DataFrames (or Datasets in Scala/Java) in a streaming context. Essentially, you define a logical query on streaming data, and Spark runs it continuously in micro-batches (or continuous processing if configured) as new data arrives.
For example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("StructuredStreamingExample")
  .master("local[*]")
  .getOrCreate()

// Read streaming data from a directory of JSON files
val lines = spark.readStream
  .format("json")
  .schema(/* your schema here */)
  .load("path/to/your/streaming/data")

// Transform the data
val transformed = lines.withColumn("processed_timestamp", current_timestamp())

// Write the results to the console
val query = transformed.writeStream
  .format("console")
  .start()
query.awaitTermination()
In this snippet:
- readStream indicates we’re reading data in a streaming fashion.
- We specify the source using .format("json") and the path to the data.
- We transform the stream using DataFrame operations, such as adding a new column.
- Finally, we start a streaming query that writes results to a sink.
Sources and Sinks
In Spark Structured Streaming, a source is where your streaming data originates. Sources are often message systems like Kafka, socket streams, or file directories. A sink is where your transformed data goes—possible sinks include files, memory, console, Kafka, or custom implementations. A minimal example using the rate source and console sink follows the lists below.
Common Sources:
- Files (e.g., CSV, JSON, Parquet)
- Kafka
- Socket (for testing or simple demos)
- Rate source (useful for testing)
- Custom sources
Common Sinks:
- Console (for debugging)
- Memory (for smaller, in-memory tables)
- File (e.g., write Parquet files incrementally)
- Kafka
- Foreach sink (for custom handling)
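For a first experiment that needs no external infrastructure, the built-in rate source pairs nicely with the console sink. A minimal sketch for local testing (the rowsPerSecond value is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RateToConsole").master("local[*]").getOrCreate()

// The rate source generates (timestamp, value) rows at a configurable rate
val rateDF = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// The console sink prints each micro-batch; useful only for debugging
val query = rateDF.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()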
Triggers
A trigger controls the frequency at which Spark checks for new data and updates the output. The main triggers are:
- Fixed-interval micro-batch: Processes data in micro-batches at a user-specified interval, for example:
.trigger(Trigger.ProcessingTime("10 seconds"))
- Continuous: Uses the experimental continuous processing mode (lowest latency, but with certain limitations); the interval given to Trigger.Continuous is the checkpoint interval, not a batching interval.
.trigger(Trigger.Continuous("5 seconds"))
- Once: Processes all available data once and then stops.
.trigger(Trigger.Once())
- Default (no trigger specified): Spark runs micro-batches back to back, starting a new one as soon as the previous one finishes and new data is available.
Output Modes
Structured Streaming supports different ways of updating the query’s output:
- Append: Only newly processed rows are written to the sink. Works for queries that do not modify or remove previously processed data.
- Complete: The entire result table is recalculated and written to the sink. Often used for aggregations and reporting.
- Update: Only changes to the result table are written to the sink. Particularly useful with aggregations, but not supported by all sinks.
Example:
import org.apache.spark.sql.streaming.Trigger

val query = transformed.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
6. Core Transformations and Operations
Basic Transformations
Most DataFrame transformations (e.g., select, where, withColumn) used in batch contexts remain valid in Structured Streaming. However, keep in mind:
- Some batch transformations (like certain random splits) aren’t fully supported or behave differently in streaming.
- Certain actions (e.g., collect()) are not allowed in streaming contexts because the data is unbounded.
For basic manipulation, you might do something like:
import org.apache.spark.sql.functions._
// Filter messages with certain conditions
val filtered = lines.filter(col("level") === "ERROR")

// Select specific columns
val selected = filtered.select("timestamp", "message")
Aggregations
When dealing with real-time data, aggregations (count, sum, average, etc.) over unbounded streams need special care with time constraints or triggers. Simple aggregations without a window can grow unbounded over time. For example:
// Group by a particular column, aggregating in real time
val counts = lines.groupBy("category").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
In this example, outputMode("complete") is used because the entire table of category counts changes with every new record; Spark must store and update the entire result set.
Joins
Structured Streaming supports:
- Stream-static join: Joining a streaming dataset with a static (batch) dataset.
- Stream-stream join: Joining two streaming Datasets with support for time boundaries. This is a more advanced feature that typically requires watermarks and windowing.
Example (stream-static join):
val staticDF = spark.read.format("csv").option("header", "true").load("static_data.csv")
val joined = lines.join(staticDF, Seq("id"), "left_outer")
Example (stream-stream join with time constraints):
// Assume both streams have been parsed so that userId and timestamp columns exist
val clicks = spark.readStream.format("kafka").load().as("clicks")
val impressions = spark.readStream.format("kafka").load().as("impressions")

val joinedStream = clicks.join(
  impressions,
  expr("""
    clicks.userId = impressions.userId AND
    clicks.timestamp >= impressions.timestamp - interval 5 minutes AND
    clicks.timestamp <= impressions.timestamp + interval 5 minutes
  """)
)
The exact syntax may vary depending on your schema, but the key point is defining a time window (or range) within which the join should occur.
7. Event-Time Processing and Watermarking
One of the most important aspects of real-time processing is dealing with event time, as opposed to processing time. Event time is the time at which events actually occurred, while processing time is the time at which Spark processes these events.
Why Does This Matter?
- Late data might arrive because of network delays, system outages, or any number of unforeseen issues.
- Relying solely on processing time can lead to incorrect results if data from the past arrives after an aggregation window closes.
Watermarking
Watermarks help Spark efficiently handle late data by bounding how far back in time the engine must keep state. For example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Assume "timestamp" is an event-time field and the Kafka key carries a grouping key
val withEventTime = events
  .selectExpr("CAST(key AS STRING) AS someKey", "CAST(value AS STRING) AS value", "timestamp")
  .withColumn("eventTime", to_timestamp(col("timestamp")))

// Add watermark: tolerate data up to 10 minutes late
val aggregates = withEventTime
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("someKey"))
  .count()

val query = aggregates.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
query.awaitTermination()
Here, the watermark tells Spark that data arriving more than 10 minutes behind the latest eventTime seen so far can be treated as late and dropped from the aggregation (or routed elsewhere for separate handling). This mechanism prevents the state from growing indefinitely.
8. Checkpointing and Fault Tolerance
Structured Streaming provides fault tolerance by automatically keeping track of cumulative progress. However, you must configure a checkpoint directory where Spark tracks state:
val query = transformed.writeStream
  .format("console")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .start()
When Spark recovers from a failure, it refers to the checkpoint directory to restore the last known state and pick up from where it left off. This ensures exactly-once or at-least-once semantics, depending on the source and sink.
Checkpointing Best Practices
- Use a reliable, fault-tolerant storage system (e.g., HDFS, S3, Azure Blob Storage).
- Each streaming query must have a unique checkpoint location.
- Don’t delete the checkpoint directory while a query is active; this can corrupt the query’s state.
9. Stateful Streaming
Algorithms that maintain some state (e.g., counters, machine learning models, anomaly detectors) over a data stream need stateful streaming. Structured Streaming allows you to manage state through aggregations, mapGroupsWithState, and flatMapGroupsWithState. These operations let you store per-key aggregates or custom states that persist across micro-batches.
Example: mapGroupsWithState
case class UserEvent(userId: String, action: String, timestamp: Long)
case class UserState(userId: String, actionCount: Long)

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

def updateState(userId: String, events: Iterator[UserEvent], state: GroupState[UserState]): UserState = {
  val currentState = state.getOption.getOrElse(UserState(userId, 0L))
  val updatedCount = events.size + currentState.actionCount
  val newState = UserState(userId, updatedCount)
  state.update(newState)
  newState
}

// Assuming we have a streaming Dataset[UserEvent] called userEvents
// (import spark.implicits._ provides the implicit encoders)
import spark.implicits._
val stateful = userEvents
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(updateState)

// This stateful Dataset can be written to a sink
val query = stateful.writeStream
  .format("console")
  .outputMode("update")
  .start()
When new events for a specific user arrive, updateState is called with an iterator of those events for the key. You can update the existing state and persist it, enabling use cases like user session tracking or real-time analytics.
10. Windows and Time-Based Operations
Windowing is crucial for periodic reporting and summarizing streaming data. Common window functions include tumbling windows, sliding windows, and session windows.
Tumbling Window
A tumbling window is a fixed-size window that doesn’t overlap. For example:
val windowedCounts = withEventTime
  .groupBy(window(col("eventTime"), "15 minutes"))
  .count()
If you receive event timestamps of 10:00:10, 10:00:20, and 10:14:59, all of these belong to the window [10:00:00, 10:15:00).
Sliding Window
A sliding window can overlap. For instance:
val slidingCounts = withEventTime
  .groupBy(window(col("eventTime"), "15 minutes", "5 minutes"))
  .count()
Each window is 15 minutes long, but they start in increments of 5 minutes. This can be useful for finer-grained analysis.
Session Window
Session windows group events that occur close together in time into the same session and are defined by a gap duration rather than a fixed length. Since Spark 3.2, Structured Streaming supports them through the session_window function (available in SQL and the DataFrame API).
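As a rough sketch, assuming Spark 3.2+, the withEventTime stream from the watermarking section, and a hypothetical userId column, a session window with a 5-minute gap could look like this:

import org.apache.spark.sql.functions.{col, session_window}

// Events for the same userId that arrive within 5 minutes of each other
// fall into one session; a new session starts after a 5-minute gap
val sessionCounts = withEventTime
  .withWatermark("eventTime", "10 minutes")
  .groupBy(session_window(col("eventTime"), "5 minutes"), col("userId"))
  .count()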
11. Micro-Batch vs. Continuous Processing
By default, Structured Streaming operates in micro-batch mode, where each trigger creates a micro-batch. The latency in micro-batch mode is usually on the order of seconds. For workloads requiring very low latency, Spark introduced an experimental continuous processing mode. However, it comes with restrictions on supported operations.
- Micro-batch: Most widely used, supports the full set of Structured Streaming features.
- Continuous: Experimental, limited feature set, potential for sub-second latency.
In practical terms, most production workloads rely on micro-batches due to the availability of more operators and robust support in the ecosystem.
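As an illustration, the sketch below (a hypothetical Kafka-to-Kafka pass-through with only map-like operations, which is what continuous mode supports) differs from a micro-batch job only in its trigger:

import org.apache.spark.sql.streaming.Trigger

// Continuous mode: records are processed as they arrive;
// the interval controls how often progress is checkpointed
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("key", "value")               // map-like operations only
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output_topic")
  .option("checkpointLocation", "/tmp/checkpoints/continuous_example")
  .trigger(Trigger.Continuous("1 second"))  // swap back to ProcessingTime for micro-batch
  .start()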
12. Advanced Topics
Monitoring and Debugging
Monitoring your streaming job is essential for production stability. Spark’s web UI (usually at http://<driver-host>:4040) provides several useful views:
- Structured Streaming UI: In recent Spark versions, there is a dedicated tab for Structured Streaming, showing the status of ongoing queries, rate of data processing, and more.
- Logs: Always review Spark driver and executor logs for errors or performance bottlenecks.
- Metrics: Expose Spark metrics to external systems like Prometheus or use Spark listeners for custom metric aggregation.
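For the listener-based route, Spark provides StreamingQueryListener, which you register on the session. A minimal sketch that only prints per-batch progress (in practice you would forward these numbers to your metrics system):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Forward these values to Prometheus, StatsD, etc. instead of printing
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} rowsPerSec=${p.processedRowsPerSecond}")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
})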
Performance Tuning
- Memory Tuning: Increase executor memory or use more executors if you’re dealing with large stateful operations.
- Batch Size: Adjust how frequently micro-batches occur; longer trigger intervals produce larger batches and better throughput, while shorter intervals reduce latency.
- Aggregation Optimization: Use partitioning strategies or broadcast joins when beneficial.
- Serialization: Choose efficient data formats (e.g., Parquet, Avro for Kafka) to reduce overhead.
- Shuffle Partitions: Tweak spark.sql.shuffle.partitions to optimize the number of partitions used during shuffle operations.
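As a small illustration of the last two points, shuffle parallelism and the trigger interval are typically set when the session and query are defined. The values below are arbitrary starting points, and aggregates refers to the stream built in the watermarking section:

import org.apache.spark.sql.streaming.Trigger

// Fewer shuffle partitions than the default (200) often suits small or medium streaming state
spark.conf.set("spark.sql.shuffle.partitions", "64")

// A longer trigger interval trades latency for larger, more efficient micro-batches
val tunedQuery = aggregates.writeStream
  .format("console")
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoints/tuned_query")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()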
Integrating with Kafka
Kafka is often used for real-time data pipelines, so Spark provides native Kafka source/sink support. Here’s a minimal example:
val spark = SparkSession.builder.appName("KafkaExample").getOrCreate()
// Reading from Kafka
val inputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "input_topic")
  .load()

import org.apache.spark.sql.functions._

// Convert Kafka's value (binary) to string
val dataDF = inputDF.selectExpr("CAST(value AS STRING) as message")

// Processing your stream
val processedDF = dataDF
  .withColumn("processed_at", current_timestamp())

// Writing to Kafka
val query = processedDF.selectExpr("CAST(message AS STRING) as key", "CAST(message AS STRING) as value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("topic", "output_topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka_example")
  .start()
query.awaitTermination()
Make sure you set checkpointLocation so that Spark can manage state and offsets.
Custom Sinks
When none of the built-in sinks meet your requirements (e.g., you need to write data to a non-supported database or a custom external system), you can implement a ForeachWriter in Scala/Java or a similar mechanism in Python.
Example skeleton of a ForeachWriter in Scala:
import org.apache.spark.sql.ForeachWriter
import org.apache.spark.sql.Row

class CustomSinkWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    // Open a connection or session; return true to process this partition
    true
  }

  override def process(value: Row): Unit = {
    // Write the row to the custom sink
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Close the connection/session
  }
}
Then you attach it to your streaming DataFrame:
val query = processedDF.writeStream
  .foreach(new CustomSinkWriter)
  .start()
Handling Late Data
Even with watermarks, data can arrive late beyond the watermark threshold. These records are often dropped from aggregate calculations, but you can handle them separately by:
- Writing late data to a side sink for further manual inspection.
- Using a longer watermark delay (a grace period) to strike a balance between state size and result completeness.
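One hedged way to implement the first idea, assuming the withEventTime stream and category column from the earlier examples (and the imports used there), is to run a second query that archives every raw event, so that anything dropped by the watermark can still be inspected or replayed later:

// Main aggregation: events later than the 10-minute watermark are dropped here
val onTimeCounts = withEventTime
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("category"))
  .count()

// Side query: archive every raw event (no watermark applied), so late arrivals
// can be audited or backfilled with a batch job later
val rawArchive = withEventTime.writeStream
  .format("parquet")
  .option("path", "/data/raw_events_archive")
  .option("checkpointLocation", "/tmp/checkpoints/raw_archive")
  .start()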
13. Putting It All Together
Let’s combine multiple concepts into a simplified end-to-end pipeline example:
- Data Source: Kafka topic receiving JSON messages.
- Parsing and Filtering: Convert messages to a structured format, filter out invalid data.
- Aggregation: Maintain rolling counts per category using event-time with a tumbling window.
- Late Data Handling: Use a watermark of 10 minutes.
- Sink: Write aggregated results back to a different Kafka topic.
Example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object EndToEndExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("EndToEndStructuredStreaming")
      .getOrCreate()

    // 1. Kafka source
    val rawStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input_topic")
      .load()

    // 2. Parsing and filtering
    val parsedStream = rawStream.selectExpr("CAST(value AS STRING) as jsonValue")
      .select(from_json(col("jsonValue"), /* schema */).as("data"))
      .select("data.*")
      .filter(col("category").isNotNull && col("timestamp").isNotNull)

    // Convert to event time
    val withEventTime = parsedStream
      .withColumn("eventTime", to_timestamp(col("timestamp")))

    // 3. Aggregation with watermark
    val aggregated = withEventTime
      .withWatermark("eventTime", "10 minutes")
      .groupBy(
        window(col("eventTime"), "5 minutes"),
        col("category")
      )
      .count()

    // 4. Sink: Write to Kafka
    val query = aggregated
      .selectExpr(
        "CAST(category AS STRING) AS key",
        "CAST(concat_ws(',', window.start, window.end, category, count) AS STRING) AS value"
      )
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output_topic")
      .option("checkpointLocation", "/tmp/checkpoints/end_to_end")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}
What’s happening here:
- We read from a Kafka source, parse the JSON messages, and select valid fields.
- We add a column eventTime and define a watermark of 10 minutes.
- We perform a tumbling window aggregation of 5 minutes on eventTime.
- We output the transformed data back to a Kafka topic, ensuring each micro-batch is tracked via checkpointing.
14. Conclusion and Further Reading
Transitioning from batch to real-time analytics is a critical step for many data-driven organizations. Spark Structured Streaming offers a unified, high-level API that reduces the complexity of building and managing streaming pipelines. By following best practices—such as setting proper watermarks, checkpointing, and monitoring state—you can achieve robust, fault-tolerant streaming applications.
Structured Streaming can handle everything from simple file-based streaming to intricate pipelines involving message systems like Kafka, complex aggregations, stateful operations, and time-windowed analyses. As you become more familiar with the framework, you can extend these patterns to support larger, more demanding systems.
Additional Resources
- Official Spark Structured Streaming Documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
- Databricks Structured Streaming Guide: https://docs.databricks.com/spark/latest/structured-streaming
- Apache Kafka Documentation for Spark: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
- Books:
- “Learning PySpark” by Tomasz Drabas and Denny Lee
- “Spark: The Definitive Guide” by Bill Chambers and Matei Zaharia
Armed with these insights and examples, you should be well on your way to building powerful, real-time data pipelines with Spark Structured Streaming. Whether you’re looking to perform real-time fraud detection, IoT analytics, or dynamic user interactions, Structured Streaming offers a robust and unified framework to bring your data architecture into the real-time era. Enjoy your journey into continuous data processing!