
The Secret Sauce: Combining Batch and Streaming with Feature Stores#

Modern machine learning systems have grown increasingly sophisticated, requiring both fresh data (streaming) and historically enriched data (batch) to inform predictive models. Feature stores are emerging as the indispensable framework to effectively manage, store, and serve these data-driven attributes for machine learning pipelines. Whether you’re just starting your exploration into feature stores or have been grappling with how to integrate them with both batch and streaming data sources, this guide aims to give you a clear understanding of what feature stores are, how they work, and how to unlock their power by combining both batch and streaming data.

In this blog post, we will start from the basics—explaining the main concepts behind feature stores—and gradually move towards more complex scenarios, ultimately unveiling how batch and streaming architectures can coexist within a robust feature store infrastructure. Throughout the article, you’ll see examples, code snippets, and tabular comparisons that will help you grasp the essential points. By the end, you’ll be equipped with knowledge suitable for a production-level, enterprise environment.


Table of Contents#

  1. Introduction to Feature Stores
  2. Batch Data Fundamentals
  3. Streaming Data Fundamentals
  4. Why Combine Batch and Streaming in a Feature Store?
  5. Key Design Principles
  6. Basic Architecture Overview
  7. Sample Setup and Code Examples
  8. Maintaining Data Consistency
  9. Operationalizing the Feature Store
  10. Advanced Topics
  11. Putting It All Together
  12. Conclusion

Introduction to Feature Stores#

Feature stores are centralized data management layers tailored for machine learning (ML). They abstract the complexity of sourcing data from multiple places—databases, data lakes, message queues, APIs, and more—and ensure that features (i.e., the attributes used to train and power ML models) remain consistent, up to date, and easily discoverable.

Key aspects of a feature store include:

  • Storing precomputed features in a format optimized for fast retrieval.
  • Enforcing consistent transformations so that the same logic is applied during training and inference.
  • Serving features in both batch (offline) and online (low-latency) contexts.

Why Feature Stores?#

  1. Consistency: One of the biggest challenges in ML is the “training-serving skew.” A feature store ensures that the same data transformations are used when training your model and when serving predictions.
  2. Reuse and Collaboration: Data scientists can re-use existing features, streamlining the development process and preventing duplication of effort.
  3. Governance: Centralizing features in a store makes auditability and compliance controls more straightforward.

For a small ML project, you might store your data in CSV files or Python notebooks. That might suffice initially. However, once data grows in volume, velocity, and variety, a feature store becomes a critical component for robust, large-scale ML systems.


Batch Data Fundamentals#

“Batch data” usually refers to large chunks of data processed periodically. This data might come from:

  • Data warehouses or data lakes (e.g., structured or unstructured files).
  • Periodic exports from transactional databases.
  • Logs and usage data aggregated daily or hourly.

Characteristics of Batch Data#

  • Volume: Often large in size, accumulated over time.
  • Latency: Usually processed at fixed intervals, such as hourly or daily, so not typically real-time.
  • Cost-Efficient for Historical Computation: Ideal for computing long-term aggregates, trends, or advanced analytics that don’t require immediate availability.

Batch data is well suited to exploration, feature engineering on historical data, and training ML models that do not strictly need real-time updates. However, many production ML models also require more up-to-date signals, which leads us to streaming data.
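
A minimal sketch of the kind of batch job described above: computing a 7-day rolling average of clicks per user from historical event logs. The function name and the sample data are illustrative, not part of any particular platform.

```python
# Sketch: a daily batch job computing a 7-day rolling average of clicks
# per user. Names and data are illustrative.
from datetime import date, timedelta
from collections import defaultdict

def seven_day_avg_clicks(events, as_of):
    """events: list of (user_id, event_date, clicks); as_of: date of the batch run."""
    window_start = as_of - timedelta(days=7)
    totals = defaultdict(int)
    for user_id, event_date, clicks in events:
        # Only include events inside the 7-day window ending at the run date
        if window_start <= event_date < as_of:
            totals[user_id] += clicks
    return {user: round(total / 7, 2) for user, total in totals.items()}

events = [
    ("user_1", date(2024, 1, 1), 14),
    ("user_1", date(2024, 1, 3), 7),
    ("user_2", date(2024, 1, 5), 21),
    ("user_1", date(2023, 12, 20), 100),  # outside the window, ignored
]
features = seven_day_avg_clicks(events, as_of=date(2024, 1, 8))
print(features)  # {'user_1': 3.0, 'user_2': 3.0}
```

In a real pipeline this aggregation would run in a distributed engine over a warehouse or data lake, but the windowing logic is the same.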


Streaming Data Fundamentals#

“Streaming data” involves continuous, near-real-time flows of information. Typical sources include:

  • User interactions (clicks, views, events) on a website or mobile app.
  • Sensor data from IoT devices.
  • System logs and monitoring metrics that update in real-time.

Characteristics of Streaming Data#

  • Velocity: Data arrives continuously and must be processed almost immediately.
  • Low-Latency Requirements: Business use cases often demand real-time or near-real-time insights (e.g., fraud detection).
  • Potentially Smaller Batches: Data is often processed event-by-event or in micro-batches within seconds or milliseconds.

In isolation, streaming data can power real-time ML. However, to enrich streaming features, you often need historical or batch-derived context. This synergy—combining batch and streaming—is what makes a feature store truly valuable.
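
To make the event-by-event model concrete, here is a minimal sketch of a handler that keeps a per-user feature fresh the instant an event arrives. The event source is simulated; in production it would be a consumer on Kafka or a similar bus, and the field names are illustrative.

```python
# Sketch: event-at-a-time processing that keeps a per-user "last seen"
# feature current as events arrive. Event shape is illustrative.
from datetime import datetime, timezone

online_features = {}

def handle_event(event):
    """Update the online feature row for a user as soon as an event arrives."""
    online_features[event["user_id"]] = {
        "last_page": event["page"],
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }

for event in [{"user_id": "user_1", "page": "home"},
              {"user_id": "user_1", "page": "checkout"}]:
    handle_event(event)

print(online_features["user_1"]["last_page"])  # checkout
```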


Why Combine Batch and Streaming in a Feature Store?#

A feature store that integrates both batch and streaming data sources solves several “pain points” in modern ML:

  1. Freshness: Streaming updates ensure features are current up to the second or minute.
  2. Historical Context: Batch processes aggregate historical data to create features such as rolling averages, frequency counts, or user lifetime value.
  3. Hybrid Use Cases: Many real-world scenarios need both: for example, a recommendation system that uses real-time behavior data (streaming) and long-term user preferences (batch).
  4. Unified Data Pipeline: Developers and data scientists interact with a single interface (the feature store) that abstracts away complexities around data ingestion, transformation, and access.

By marrying batch and streaming capabilities, feature stores cater to both offline training (where historical data is crucial) and online inference (where low-latency updates are a must).


Key Design Principles#

When designing a feature store to handle both batch and streaming data, several guiding principles emerge:

  1. Immutable Storage: Raw data and derived features are often stored in an immutable format for auditing and reproducibility.
  2. Stateful Transformations: The transformations that generate features need to handle state, particularly in streaming modes (for rolling windows, accumulations, or dynamic thresholds).
  3. Schema Management: Data schemas must be tracked and version-controlled to avoid breaking changes.
  4. Time as a First-Class Citizen: Time-indexed data is critical for advanced analytics, time-travel queries, and ensuring consistent training data sets.
  5. Latency-Aware Serving Layers: Allowing both offline queries (analytics, training features) and online queries (real-time inference) with sub-second response times.
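
Principles 1 and 4 can be sketched together: an append-only (immutable) feature log keyed by entity and event time, so historical values are never overwritten and point-in-time reads remain possible. The structure below is illustrative, not a real storage engine.

```python
# Sketch: an append-only feature log keyed by (entity_id, event_time).
# Rows are never mutated, which keeps time-travel reads possible.
from datetime import datetime

feature_log = []  # append-only: rows are never updated or deleted

def write_feature(entity_id, name, value, event_time):
    feature_log.append({"entity_id": entity_id, "name": name,
                        "value": value, "event_time": event_time})

def read_as_of(entity_id, name, as_of):
    """Return the latest value written at or before `as_of` (a time-travel read)."""
    rows = [r for r in feature_log
            if r["entity_id"] == entity_id and r["name"] == name
            and r["event_time"] <= as_of]
    return max(rows, key=lambda r: r["event_time"])["value"] if rows else None

write_feature("user_1", "lifetime_value", 10.0, datetime(2024, 1, 1))
write_feature("user_1", "lifetime_value", 25.0, datetime(2024, 2, 1))
print(read_as_of("user_1", "lifetime_value", datetime(2024, 1, 15)))  # 10.0
```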

Basic Architecture Overview#

While the specifics vary by platform, a robust feature store often has at least four key components:

  1. Data Ingestion

    • Batch Ingestion: Periodically fetches data from batch sources (data lake, warehouse), then transforms and stores it in both an offline storage system (for historical queries) and an online store (for low-latency lookups).
    • Streaming Ingestion: Uses streaming frameworks (e.g., Apache Kafka, Apache Flink, Spark Structured Streaming) to process real-time data. These updates feed the same feature store.
  2. Transformation and Aggregation

    • The logic that converts raw data into usable features. In batch mode, this might involve large-scale distributed systems or scheduled jobs. In streaming mode, this could be an always-running real-time service.
  3. Offline Storage

    • Typically a scalable data lake or data warehouse solution (e.g., Hadoop HDFS, cloud-based object storage, or a columnar store). Used for historical data and large-scale feature engineering.
  4. Online Serving Layer

    • A low-latency database or key-value store. The final, processed features are stored here so your application can quickly retrieve the features needed for inference.

The essence of “combining batch and streaming” is ensuring both ingestion pipelines can co-exist and harmonize. Streaming ingestion updates the online store in near-real-time, while batch ingestion ensures the historical offline store remains comprehensive and up to date.


Sample Setup and Code Examples#

In this section, we’ll walk through a simplified, code-based illustration of how you might combine batch and streaming mechanisms in a mock feature store. We’ll outline:

  1. A minimal Python-based setup to emulate a feature store infrastructure.
  2. Batch ingestion of user data.
  3. Streaming ingestion of real-time events.
  4. Serving features for an inference request.

This example is not production-grade but offers conceptual clarity on how the pieces fit together.

Setting Up a Mock Feature Store#

Below is a skeletal Python class to illustrate the core functionality of adding and retrieving features. In an actual implementation, you’d interface with real storage systems (like PostgreSQL, Cassandra, Redis, etc.).

import time
from datetime import datetime
from typing import Dict, Any


class MockFeatureStore:
    def __init__(self):
        # In-memory dictionaries to represent offline and online stores
        self.offline_store = {}
        self.online_store = {}

    def ingest_batch(self, entity_id: str, features: Dict[str, Any], timestamp: datetime):
        """
        Simulate a batch ingestion by storing data in both offline and online stores.
        """
        # Offline store: we keep a history of features keyed by (entity_id, timestamp)
        key = (entity_id, timestamp.isoformat())
        self.offline_store[key] = features
        # Online store: we only keep the latest version of features for instant lookups
        self.online_store[entity_id] = {
            'features': features,
            'ingest_time': datetime.utcnow().isoformat()
        }

    def ingest_stream(self, entity_id: str, features: Dict[str, Any]):
        """
        Simulate a streaming ingestion by updating the online store with new data.
        """
        # We assume the timestamp is 'now' for streaming events
        timestamp = datetime.utcnow()
        key = (entity_id, timestamp.isoformat())
        self.offline_store[key] = features
        # Update online store with latest features
        self.online_store[entity_id] = {
            'features': features,
            'ingest_time': timestamp.isoformat()
        }

    def get_online_features(self, entity_id: str):
        """
        Retrieve the most recent feature set for a given entity.
        """
        if entity_id not in self.online_store:
            return None
        return self.online_store[entity_id]['features']

    def get_offline_features(self, entity_id: str, start_time: datetime, end_time: datetime):
        """
        Retrieve historical features within a time range.
        """
        results = []
        for (eid, ts_str), feat in self.offline_store.items():
            if eid == entity_id:
                ts = datetime.fromisoformat(ts_str)
                if start_time <= ts <= end_time:
                    results.append((ts, feat))
        return results

Batch Ingestion Example#

Below is how we might simulate a daily batch ingestion of user data. Let’s assume this is being run at regular intervals (e.g., midnight each day).

from datetime import datetime, timedelta
import random


def run_batch_ingestion(feature_store: MockFeatureStore):
    current_time = datetime.utcnow()
    users = ['user_1', 'user_2', 'user_3']
    for user in users:
        # Simulate features derived from historical logs, e.g., 7-day rolling average
        user_features = {
            '7_day_avg_clicks': random.randint(5, 15),
            '7_day_avg_time_on_site': round(random.uniform(10, 50), 2)
        }
        # Use a timestamp representing the batch job's reference time
        batch_timestamp = current_time - timedelta(days=1)
        feature_store.ingest_batch(entity_id=user, features=user_features, timestamp=batch_timestamp)
        print(f"[Batch] Ingested features for {user} at {batch_timestamp.isoformat()}")

In a real setting, you’d pull this data from a data warehouse or a precomputed set of transformations run by Spark/MapReduce. The result is that your feature store now has updated historical data and the latest daily snapshot in the online store.

Streaming Ingestion Example#

Next, let’s process real-time events (e.g., user interactions on a web platform). We might have a streaming job that runs continuously, listening to a message queue or an event bus:

import time
import random


def run_streaming_ingestion(feature_store: MockFeatureStore):
    users = ['user_1', 'user_2', 'user_3']
    for i in range(10):
        for user in users:
            # Example real-time events: user's last click time or last page viewed
            stream_features = {
                'last_clicked_page': f"page_{random.randint(1, 10)}",
                'session_length': random.uniform(0.5, 3.0)
            }
            feature_store.ingest_stream(entity_id=user, features=stream_features)
            print(f"[Stream] Ingested streaming features for {user} at {time.time()}")
        # Simulate a small delay in event arrival
        time.sleep(1)

In a more robust solution, you’d leverage a distributed stream processing framework (e.g., Apache Flink or Kafka Streams) to compute rolling aggregates, windowed counts, or other advanced features on the fly.
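
The rolling aggregates such a framework would compute can be sketched framework-free: a sliding 30-second click count per user, maintained incrementally rather than recomputed from scratch on every event. The window length and names are illustrative.

```python
# Sketch: an incrementally maintained sliding-window count, the kind of
# aggregate a stream processor would keep as state. Illustrative only.
from collections import defaultdict, deque

WINDOW_SECONDS = 30
windows = defaultdict(deque)  # user_id -> deque of event timestamps

def record_click(user_id, ts):
    """Add an event, evict anything older than the window, return the count."""
    window = windows[user_id]
    window.append(ts)
    while window and window[0] <= ts - WINDOW_SECONDS:
        window.popleft()
    return len(window)

assert record_click("user_1", 100) == 1
assert record_click("user_1", 110) == 2
assert record_click("user_1", 131) == 2   # the event at t=100 has expired
```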

Serving Features for Online Inference#

Finally, whenever your ML model needs to make a prediction, you can perform a quick online lookup:

def make_online_prediction(feature_store: MockFeatureStore, user: str):
    features = feature_store.get_online_features(user)
    if features is None:
        print(f"No features found for {user}, cannot make a prediction.")
        return None
    # Imagine simple logic: if session_length > 1.5, we label the user as "engaged"
    is_engaged = features.get('session_length', 0) > 1.5
    prediction = "engaged_user" if is_engaged else "casual_user"
    print(f"Prediction for {user}: {prediction} based on features: {features}")
    return prediction


# Example usage
fs = MockFeatureStore()
run_batch_ingestion(fs)
run_streaming_ingestion(fs)
make_online_prediction(fs, 'user_1')

This code snippet demonstrates how a feature store unifies batch and streaming data for immediate availability. Of course, real deployments are far more complex, but the principle remains the same.


Maintaining Data Consistency#

Data consistency is paramount:

  • Training-Serving Skew: Mismatched transformations between training data (often derived from batch) and the real-time data used for serving can degrade model performance. Solve this by defining transformations once, so they’re applied to both batch and stream data paths.
  • Event Time vs. Processing Time: In streaming systems, the difference between when an event occurred and when it was processed can create artifacts in your data. Carefully handle time-stamping.
  • Schema Evolution: Over time, new attributes are introduced, and existing ones evolve. Feature stores need versioning mechanisms with backward-compatible transformations.

In advanced setups, you’ll often see “time-travel queries” or point-in-time lookups to handle training consistency. These queries ensure you only use data that would have been available at the exact point in time when a prediction was made in the past.
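
The first bullet above, defining transformations once, can be sketched directly: a single feature function applied on both the batch (training) path and the streaming (serving) path, so the model sees identical logic in both. The function and field names are illustrative.

```python
# Sketch: one shared transformation applied on both the batch and the
# streaming path, eliminating training-serving skew by construction.
def session_engagement(raw):
    """Shared feature logic: identical for offline training and online serving."""
    return {"engaged": raw["session_length"] > 1.5,
            "pages_per_minute": round(raw["page_views"] / raw["session_length"], 2)}

# Batch path: applied row-by-row over a historical extract
training_rows = [session_engagement(r) for r in
                 [{"session_length": 2.0, "page_views": 6},
                  {"session_length": 0.5, "page_views": 1}]]

# Streaming path: the same function applied to one live event
live = session_engagement({"session_length": 2.0, "page_views": 6})

assert training_rows[0] == live  # no training-serving skew
```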


Operationalizing the Feature Store#

Monitoring and Observability#

It’s crucial to monitor:

  • Data Freshness: Track latency from the source to the store.
  • Quality Checks: Validate schema, distribution shifts, missing values, and outliers.
  • Throughput and Performance Metrics: Measure ingestion speed, query response times, and memory usage.

Logging and observability are best backed by dedicated APM (Application Performance Monitoring) tools or integrated solutions like Prometheus + Grafana.
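
A freshness check of the kind listed above can be sketched in a few lines: flag any entity whose online features are older than a staleness budget. The five-minute threshold and the store layout mirror the mock store from earlier and are illustrative.

```python
# Sketch: a data-freshness monitor that flags entities whose online
# features exceed a staleness budget. Threshold is illustrative.
from datetime import datetime, timedelta, timezone

STALENESS_BUDGET = timedelta(minutes=5)

def stale_entities(online_store, now):
    """online_store maps entity_id -> {'ingest_time': ISO-8601 string}."""
    return [eid for eid, row in online_store.items()
            if now - datetime.fromisoformat(row["ingest_time"]) > STALENESS_BUDGET]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
store = {"user_1": {"ingest_time": "2024-01-01T11:58:00+00:00"},
         "user_2": {"ingest_time": "2024-01-01T11:00:00+00:00"}}
print(stale_entities(store, now))  # ['user_2']
```

In practice the same check would feed an alerting system such as Prometheus rather than a print statement.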

Scaling Considerations#

Scaling a feature store to handle both batch and streaming workloads can be resource-intensive. You can approach scaling in several ways:

  1. Horizontal Scaling: Add more instances of your ingestion services and storage backends.
  2. Partitioning: Partition data by entity or time range to reduce indexing overhead.
  3. Caching Strategy: Use caching layers (e.g., Redis) for real-time lookups to reduce load on primary databases.
  4. Autoscaling Policies: Dynamically spin up or down resources based on throughput requirements.

Advanced Topics#

Feature Engineering in Real Time#

In real-time ML, feature engineering must happen on the fly:

  1. Windowed Aggregations: Calculating rolling means, sums, or counts over short windows (e.g., the last 30 seconds).
  2. Time-Decayed Features: Weighting recent events more than older ones.
  3. Feature Interaction: Combining multiple event streams to create new composite features (e.g., ratio of clicks to impressions).

Depending on your streaming framework of choice, you might use built-in functions for windowed aggregations or develop custom stateful operators.

Stateful Stream Processing#

For advanced pipelines, stream processing frameworks allow you to maintain state across events rather than computing each event individually. This is critical for features like:

  • Session Tracking: Holding the state of a user’s session as events come in.
  • Incremental Aggregations: Continuously updating counters or sums without re-processing all historical data.
  • Complex Event Processing: Identifying patterns across multiple streams (e.g., anomaly detection).
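
Session tracking, the first bullet above, illustrates what "state across events" means in practice. Below is a framework-free sketch: per-user state carried across events, with a session considered closed after a 30-minute gap. The gap length and structure are illustrative.

```python
# Sketch: stateful session tracking. State survives between events; a gap
# longer than 30 minutes starts a new session. Illustrative only.
SESSION_GAP = 30 * 60  # seconds

sessions = {}  # user_id -> {"start": ts, "last": ts, "events": n}

def on_event(user_id, ts):
    state = sessions.get(user_id)
    if state is None or ts - state["last"] > SESSION_GAP:
        # First event, or the gap was exceeded: start a fresh session
        state = {"start": ts, "last": ts, "events": 0}
    state["last"] = ts
    state["events"] += 1
    sessions[user_id] = state
    return state

on_event("user_1", 0)
on_event("user_1", 60)
state = on_event("user_1", 10_000)   # > 30 minutes later: a new session
print(state["events"])  # 1
```

A real stream processor would keep this state in fault-tolerant, partitioned storage rather than a process-local dictionary.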

Time-Travel Queries#

Time-travel queries let data scientists re-run experiments or debug models using historical snapshots:

  • In the offline store, you can keep versions or use immutable storage with timestamps.
  • When training a model for historical data, you only retrieve features valid at the time of each event.

This ensures that your model sees the real-world data that was available at training time, preventing data leakage.


Putting It All Together#

Let’s illustrate how the various pieces come together in a real ML workflow:

  1. Raw Data Sources:

    • Batch data from a data lake (containing historical logs, user attributes).
    • Streaming data from real-time interactions or IoT devices.
  2. Ingestion and Transformation:

    • A scheduled job processes daily or hourly historical data, computing aggregated features.
    • A streaming job continuously updates features like current session state, last click time, or incremental counts.
  3. Offline Store (e.g., data lake or warehouse):

    • Stores large volumes of historical feature data, used for training new ML models.
    • Allows time-travel queries for reproducibility.
  4. Online Store (e.g., NoSQL or in-memory KV store):

    • Stores the most recent set of features for low-latency access by your production ML model.
  5. Model Training:

    • Periodically pull data from the offline store with labeled outcomes.
    • Train new model versions, verifying performance on historical data.
  6. Online Inference:

    • The production service queries the online store for current features (including those updated by streaming).
    • The ML model makes a real-time prediction.
  7. Monitoring:

    • Monitor data quality, model performance, and latency metrics.
    • Trigger alerts or corrections if data anomalies are detected.

This workflow ensures your ML system benefits from both the depth of historical context and the immediacy of real-time signals, all governed by a single feature store.


Conclusion#

Feature stores have emerged as a cornerstone in modern ML infrastructures, bridging the gap between batch and streaming data pipelines. By separating the concerns of data ingestion, transformation, storage, and serving, a well-designed feature store provides:

  • Consistency across training and serving (reduced skew).
  • Scalability across large-scale historical data and real-time updates.
  • Flexibility to accommodate advanced feature engineering and time-travel capabilities.

For practitioners, the journey typically starts with batch-based features, but as your use cases mature, streaming becomes indispensable for real-time decision-making. Integrating both batch and streaming into a unified feature store architecture is the “secret sauce” for delivering up-to-date and contextually rich ML models—be it for fraud detection, personalized recommendations, or dynamic pricing.

As you move forward, take care to:

  • Define and maintain clear, versioned transformations for both data paths.
  • Ensure robust monitoring and observability.
  • Design your schema and data pipeline with an emphasis on time management and consistency.

Armed with these best practices, you’ll be ready to tackle the complexities of real-world machine learning with confidence, leveraging your feature store as a unified platform for both batch and streaming data.

https://science-ai-hub.vercel.app/posts/c37dbc8b-6282-4506-b069-83e213d02c51/4/
Author: AICore
Published: 2025-01-28
License: CC BY-NC-SA 4.0