Data Pipelines · Architecture · Engineering

5 Data Pipeline Patterns Every AI Team Needs

Priya Nair · December 8, 2025 · 5 min read

Before your AI model can produce a single insight, your data infrastructure needs to reliably deliver clean, consistent, timely data. In practice, most AI failures aren't model failures — they're data failures in disguise.

Over hundreds of pipeline implementations, we've converged on five patterns that show up in almost every production system we build. Here they are.

1. The Dead Letter Queue Pattern

Every pipeline will encounter records it can't process. Maybe the schema is malformed. Maybe a required field is null. Maybe the data violates a business rule.

The naive approach: fail the pipeline or silently drop the record. Both are catastrophic.

The right approach: route unprocessable records to a dead letter queue (DLQ) — a separate data store where failed records wait for human review or automated retry.

# Kafka consumer configuration
consumer:
  on_error: route_to_dlq
  dlq:
    topic: events.dlq
    retention_days: 30
    metadata:
      - error_type
      - error_message
      - original_offset
      - failed_at

DLQs give you:

  • Zero data loss — nothing is silently dropped
  • Debugging context — you can inspect exactly what failed and why
  • Replay capability — fix the bug, reprocess the records

We treat DLQ depth as a first-class SLO. If it starts growing, something's wrong and we want to know immediately.
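The routing logic itself is simple. Here's a minimal in-memory sketch of the pattern — the `DeadLetterQueue` class, the `user_id` validation rule, and the record shapes are illustrative stand-ins, not our production code:

```python
import time

class DeadLetterQueue:
    """In-memory stand-in for a DLQ topic such as events.dlq."""

    def __init__(self):
        self.records = []

    def publish(self, record, error):
        # Capture the same metadata fields the config above lists
        self.records.append({
            "payload": record,
            "error_type": type(error).__name__,
            "error_message": str(error),
            "failed_at": time.time(),
        })

    def depth(self):
        return len(self.records)


def process(record):
    # Illustrative business rule: user_id is required
    if "user_id" not in record:
        raise ValueError("missing required field: user_id")
    return record["user_id"]


def consume(records, dlq):
    results = []
    for record in records:
        try:
            results.append(process(record))
        except Exception as exc:
            dlq.publish(record, exc)  # never drop, never crash the pipeline
    return results


dlq = DeadLetterQueue()
good = consume([{"user_id": 1}, {"name": "no id"}, {"user_id": 2}], dlq)
```

The pipeline keeps moving, the bad record is preserved with its full error context, and `dlq.depth()` is exactly the number you'd alert on.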

2. The Schema Registry Pattern

Schemas change. Product teams add fields. Source systems are updated. The format of an event that was valid in January may be invalid in July.

Unmanaged schema evolution is one of the most common sources of pipeline failures in production systems. The solution is a schema registry.

A schema registry is a centralized store of all schema versions for all your event types. Every producer registers its schema before publishing. Every consumer declares which schema versions it can handle.

When a new version is published, the registry validates backward/forward compatibility before allowing it through.

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema
from confluent_kafka.schema_registry.avro import AvroDeserializer

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

# Producer registers its schema under the topic's value subject
schema_id = registry.register_schema(
    "user-events-value",
    Schema(schema_str, "AVRO")
)

# Consumer validates on read — incompatible changes fail loudly
deserializer = AvroDeserializer(registry, schema_str)

With a schema registry, you can evolve your data model with confidence and catch breaking changes before they reach production.
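For intuition about what the registry's compatibility check is doing, here's a deliberately simplified backward-compatibility rule over flat field maps. The real Avro resolution algorithm also handles type promotion, aliases, and nesting — this sketch captures only the classic "new required field" failure mode:

```python
def is_backward_compatible(old_fields, new_fields):
    """Backward compatible: a consumer on the new schema can still read
    records written with the old one. The classic rule: every field added
    in the new schema must carry a default, so old records (which lack
    the field) can still be decoded. Simplified for illustration."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True


v1 = {"user_id": {}, "email": {}}
v2 = {**v1, "plan": {"default": "free"}}  # added with a default: compatible
v3 = {**v1, "plan": {}}                   # added as required: breaks old data
```

The registry runs a check like this (generalized to full Avro semantics) on every `register_schema` call, which is exactly why breaking changes never make it past CI.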

3. The Watermark Pattern for Late Data

Real-time streaming systems face an unavoidable problem: data arrives late. A mobile app event generated at 14:00:00 might not arrive at your pipeline until 14:02:47, because of network delays, device offline periods, or batching.

If you're doing time-windowed aggregations — which you are if you're building dashboards or training models on recent data — late arrivals will corrupt your results if not handled correctly.

Watermarks solve this. A watermark is a timestamp that tells the system "we've seen all events with timestamp ≤ T." Events that arrive after their window's watermark are either:

  • Dropped (if you need strict latency guarantees)
  • Processed in a corrective window (if accuracy matters more)
  • Routed to a late data handler
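The mechanics are easy to sketch. With bounded out-of-orderness, the watermark simply trails the largest event timestamp seen so far by the allowed lateness. The class below is an illustration (timestamps are plain seconds), not any framework's API:

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max event timestamp seen so far minus allowed lateness."""

    def __init__(self, max_out_of_orderness):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        # Watermarks only move forward, even if events arrive out of order
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.max_out_of_orderness

    def is_late(self, event_time):
        # The watermark asserts "all events <= watermark have been seen",
        # so anything at or below it arrives after its window closed.
        return event_time <= self.current()


wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=30)
for t in [100, 110, 145]:
    wm.observe(t)
# Watermark is now 145 - 30 = 115: an event stamped 112 is late,
# an event stamped 120 is still within the allowed lateness bound.
```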

Apache Flink has excellent native watermark support:

DataStream<Event> events = env
    .fromSource(kafkaSource, WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
        .withIdleness(Duration.ofMinutes(1)),
        "Kafka Source");

This strategy tells Flink to tolerate events arriving up to 30 seconds out of order: the watermark trails the largest event timestamp seen so far by 30 seconds, and a window only closes once the watermark passes its end. Events arriving later than that are handled separately, and the withIdleness setting keeps an idle partition from stalling watermark progress for everyone else.

4. The Idempotent Consumer Pattern

Network failures happen. Consumers crash mid-process. At-least-once delivery means you will process the same message more than once.

If your processing isn't idempotent — if processing the same record twice produces different results — you'll get data corruption. Duplicate counts. Double charges. Conflicting state.

The fix is to design every consumer to be idempotent: processing the same record N times produces the same result as processing it once.

The practical approach: use a deduplication key.

from datetime import timedelta

def process_event(event: Event, db: Database) -> None:
    # Fast path: skip events we've already processed
    if db.exists("processed_events", event.id):
        logger.info(f"Skipping duplicate event {event.id}")
        return

    # Process the event
    result = compute_result(event)

    # Atomically write the result and mark the event processed. The
    # unique insert inside the transaction is the real guarantee: if two
    # consumers race past the exists() check, only one transaction commits.
    with db.transaction():
        db.write("results", result)
        db.insert("processed_events", event.id, ttl=timedelta(days=7))

The deduplication key is typically the event's unique ID. With a 7-day TTL on the deduplication store, you're protected against the vast majority of real-world duplicate scenarios.

5. The Circuit Breaker Pattern

Your pipeline calls downstream services: databases, APIs, model inference endpoints, third-party enrichment services. All of these can fail or become slow.

Without protection, a slow downstream service will cause your pipeline to back up, exhaust connection pools, and eventually take down the whole system.

The circuit breaker pattern prevents cascade failures by monitoring downstream health and stopping calls when failure rates exceed a threshold.

from circuitbreaker import circuit

@circuit(
    failure_threshold=5,       # Open after 5 consecutive failures
    recovery_timeout=30,       # Allow a test call after 30 seconds
    expected_exception=ServiceError
)
async def enrich_record(record: dict) -> dict:
    return await enrichment_api.call(record)

A circuit breaker has three states:

  • Closed (normal): Calls pass through. Failures are counted.
  • Open (failing): Calls are rejected immediately. Fast-fail preserves resources.
  • Half-open (recovering): A test call is allowed. If it succeeds, the circuit closes.
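The state machine itself fits in a few dozen lines. This is a single-threaded sketch with an injected clock for testability — production breakers like the circuitbreaker library add thread safety, sliding failure windows, and metrics on top:

```python
import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN  # allow one test call through
            else:
                raise RuntimeError("circuit open: fast-failing")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self.state = self.OPEN       # test call failed: reopen
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = self.CLOSED
        self.failures = 0
```

Injecting the clock means the open → half-open transition can be exercised in a unit test without sleeping for the recovery timeout.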

Combined with DLQs, circuit breakers give you a pipeline that degrades gracefully instead of catastrophically.

Putting It Together

These five patterns aren't mutually exclusive — they complement each other. A production-grade pipeline at PrismGraph typically uses all five simultaneously:

Pattern              | Problem it solves
---------------------|--------------------------------------------------
Dead Letter Queue    | Data loss from processing failures
Schema Registry      | Breaking schema changes
Watermarks           | Late-arriving event data
Idempotent Consumers | Duplicate processing under at-least-once delivery
Circuit Breaker      | Downstream service failures

The combination gives you a system that can handle the full spectrum of real-world failure modes without human intervention.


Priya Nair is Co-founder & CTO at PrismGraph Technologies. PhD in distributed systems, formerly led data infrastructure at a Series D fintech startup.