Chapter 12: Pipeline Architecture
fsa architecture-styles pipeline
Status: Notes complete
Overview
The Pipeline architecture style (also called the Pipes and Filters pattern) structures an application as a unidirectional sequence of processing steps — filters — connected by data channels — pipes. Data flows in one direction from a producer through transformations to a consumer, with each filter performing a single, well-defined, stateless operation on the data it receives. It is one of the oldest and most recognizable architecture patterns, forming the conceptual backbone of Unix shell pipelines, ETL systems, compilers, and modern data engineering platforms. The style trades the ability to handle complex stateful logic for exceptional testability, simplicity, and composability.
Topology
Pipeline Flow (unidirectional)
──────────────────────────────►
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ │ │ │ │ │ │ │ │ │
│ Producer │───►│ Filter │───►│ Filter │───►│ Filter │───►│ Consumer │
│ (Source) │ │(Transform│ │ (Tester) │ │(Transform│ │ (Sink) │
│ │ │ #1) │ │ │ │ #2) │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │
│ ┌──────┴───────┐
│ │ Discard / │
│ │ Dead-letter │
│ └──────────────┘
│
Fan-out variant:
┌─────────────────────────────┐
│ Filter A │
│ ┌──────────┐ ┌──────────┐ │
│ │ Branch 1 │ │ Branch 2 │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────┘
A pipeline has a fixed beginning (a Producer) and a fixed end (one or more Consumers). Each pipe carries data in one direction only — there is no feedback loop, no bidirectional communication, and no shared state between filters. A typical pipeline has 3–10 filters; complex ETL pipelines may have 20+. The key design constraint is that each filter must be independently executable and produce no side effects that affect other filters.
Style Specifics
Filters
Filters are the processing units of a pipeline. There are exactly four types:
Producer (Source)
The entry point of the pipeline. Generates or ingests data and places it into the first pipe. Does not receive data from any upstream filter. Examples: reading from a flat file, polling a database table, consuming from a message queue, receiving HTTP POST requests.
Transformer
Receives data from an upstream pipe, applies a transformation, and places the result in the downstream pipe. The most common filter type. Must be stateless — given the same input, it always produces the same output, with no dependency on previous inputs or external state. Examples: parsing JSON to a domain object, normalizing field names, enriching a record with a lookup value, computing a derived field, encoding/decoding, format conversion.
Tester (Router)
Receives data and evaluates a condition. Based on the condition, either passes data forward to one of several downstream pipes, drops the data, or routes it to an error/dead-letter channel. Does not transform data content — only routes or filters it. Examples: filtering out records with missing required fields, routing orders to different fulfillment paths based on order type, quality check pass/fail routing.
Consumer (Sink)
The terminal filter. Receives data from the final pipe and produces the pipeline’s output — persists to a database, writes to a file, publishes to a topic, calls a downstream API. Does not place data into any downstream pipe. Examples: writing to a data warehouse, publishing to a downstream message queue, updating a database record, generating a report file.
Pipes
Pipes are the data channels that connect filters. Properties:
- Unidirectional: Data flows from one filter to the next, never backward.
- Point-to-point: The most common variant. One filter output connects to exactly one filter input.
- Fan-out: One filter’s output is duplicated into multiple parallel downstream pipes. Used when multiple independent processing branches are needed.
- Implementation options: In-memory queues or collections (simplest, in-process), message queues (Kafka topics, SQS queues, RabbitMQ exchanges) for distributed pipelines, or named pipes/channels in streaming frameworks.
The pipe defines the data contract between filters — the schema of the data passing through. When filters are owned by different teams, the pipe’s schema is the team interface contract.
Data Topologies
Pipeline data flows typically originate from and terminate at external storage systems:
Input sources (feeding the pipeline)
- Flat files (CSV, JSON, XML, Parquet) from file system, SFTP, S3
- Relational database tables (polled via SQL, CDC streams, or database triggers)
- Message queues and streaming platforms (Kafka, Kinesis, SQS, Pub/Sub)
- HTTP endpoints (webhook receivers, API polling)
Output sinks (pipeline destinations)
- Data warehouses (Snowflake, BigQuery, Redshift) — most common in ETL
- Relational databases (transformed records persisted to operational DB)
- File systems or object storage (S3, GCS, ADLS) for downstream batch consumers
- Downstream message queues for chained pipelines
Filter state policy: Filters must not maintain persistent state between executions. If a transformation requires a lookup (e.g., currency code to currency name), that lookup must come from an external, read-only reference store — not from the filter’s own memory. This statelessness is what makes filters independently scalable and testable.
Cloud Considerations
Cloud platforms have first-class pipeline primitives:
AWS
- AWS Step Functions: Visual workflow orchestrator. Each state is a filter. Supports choice states (Tester), transformation states (Transformer via Lambda), and parallel states (fan-out). Native error handling with retry and catch.
- AWS Glue: Managed ETL service. Glue jobs are transformer filters; Glue triggers are producers/consumers.
- Amazon Kinesis Data Firehose: Streaming pipeline with built-in transformation (via Lambda filters) and delivery to S3/Redshift/OpenSearch.
- AWS Lambda: Serverless functions as filters — naturally stateless, independently scalable.
Azure
- Azure Data Factory: Managed ETL pipelines. Activities map directly to filters. Supports conditional routing, fan-out, and error handling natively.
- Azure Logic Apps: Low-code pipeline orchestration. Connectors as producers/consumers; transformations as filters.
GCP
- Cloud Dataflow: Apache Beam-based streaming and batch pipeline execution. Beam’s
PTransformis the filter abstraction. - Cloud Composer (Apache Airflow): DAG-based pipeline orchestration for complex dependency graphs.
Framework-level
- Apache NiFi: Visual pipeline builder. FlowFiles are the data, processors are filters, connections are pipes. First-class back-pressure, provenance, and error routing.
- Apache Kafka Streams: Stream processing with filter, map, and branch operations — stateful and stateless transformations as filter chains.
- Apache Camel: Enterprise Integration Patterns — router, transformer, splitter, aggregator maps directly to filter types.
Serverless filters (Lambda, Cloud Functions, Azure Functions) are the natural cloud implementation of the stateless transformer filter — pay-per-execution, auto-scaling, no server management.
Common Risks
Pipeline Performance Bottleneck (Slowest Filter Limits Throughput): The pipeline’s throughput is constrained by its slowest filter — this is Amdahl’s Law applied to pipelines. If one transformer takes 5 seconds per record and all others take 10ms, the pipeline throughput is limited to ~12 records/minute regardless of parallelism in other filters. Mitigation: profile per-filter latency, parallelize the slow filter, or decompose it into smaller sub-filters. In cloud implementations, independently scale the slow filter’s execution environment.
Error Handling Complexity (What to Do When a Filter Fails?): A malformed or unexpected record can fail a transformer. If the pipeline has no error handling, the failure propagates and may block the entire pipeline. Mitigation: implement dead-letter channels — a parallel pipe that catches failed records, preserves them for inspection, and allows the main pipeline to continue. Establish explicit retry policies (transient failures) vs. skip-and-log policies (poison messages).
Dead-Letter Queue Accumulation (Poison Messages): A record that consistently causes a filter to fail — a “poison message” — will block a queue-based pipeline indefinitely if not handled. Mitigation: implement a maximum retry count with automatic routing to a dead-letter queue after N failures. Monitor dead-letter queues actively; a growing DLQ is a canary for a systemic data quality or schema problem.
Schema Drift: When the data schema at a pipe changes (a new field added, a field renamed), all downstream filters that depend on that field are affected. In large pipelines, this creates a hidden coupling problem. Mitigation: use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) to enforce backward/forward compatibility at every pipe.
Governance
Statelessness fitness functions: Automated checks that verify filters do not maintain instance-level state. In Java/Kotlin, this can be enforced by requiring all filter classes to be @Stateless (EJB) or by scanning for instance variable usage outside constructors. In functional implementations (pure functions), statelessness is enforced by the type system.
Per-filter latency monitoring: Each filter should emit a processing latency metric (e.g., a Prometheus histogram per filter name). Governance rule: if any filter’s p99 latency exceeds a defined threshold, alert and investigate. This prevents the slowest-filter bottleneck from being invisible.
Pipeline throughput SLAs: End-to-end pipeline latency (from producer ingestion to consumer completion) should have an SLA (e.g., 99% of records processed within 5 minutes of ingestion). This is monitored via a timestamp attached to each record at the producer stage, compared at the consumer stage.
Dead-letter queue zero policy: The DLQ should be monitored for growth. A governance fitness function can alert when the DLQ depth exceeds N records — signaling a data quality regression or a schema-breaking change.
Team Topology
Filters can be owned by different teams. The pipe (data schema) is the team interface contract — analogous to an API contract in microservices:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Data Eng Team │ │ Analytics Team │ │ ML Team │
│ │ │ │ │ │
│ [Producer] │──Schema──►[Transformer #1] │──Schema──►[Transformer #2] │
│ [Transformer] │ v1.2 │[Tester] │ v2.0 │[Consumer] │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Owns: ingestion Owns: enrichment Owns: ML features
Conway’s Law implication: If a single team owns the whole pipeline, filter decomposition is primarily a design/testability concern. If multiple teams own different filter segments, the pipe schemas become explicit cross-team contracts and must be versioned and evolved carefully. Schema registries and contract testing (Pact, Avro schema compatibility checks) become mandatory.
Team ownership of individual filters also enables independent deployment of filter logic — if filters run as serverless functions or containers, each team can update their filters without coordinating a full pipeline redeployment.
Architectural Characteristics Ratings
| Characteristic | Rating | Notes |
|---|---|---|
| Overall agility | ★★★★☆ | Filters are independently modifiable; adding a new filter is low-risk |
| Ease of deployment | ★★★★☆ | Individual filters can be deployed independently in distributed implementations |
| Testability | ★★★★★ | Stateless filters are the easiest possible thing to unit test — pure functions |
| Performance | ★★★☆☆ | Limited by slowest filter; serialization overhead in distributed pipes |
| Scalability | ★★★★☆ | Individual filters can be scaled independently in cloud/distributed pipelines |
| Ease of development | ★★★★★ | Simple mental model; each filter has one job; no distributed system concerns |
| Simplicity | ★★★★★ | The simplest distributed-processing mental model; flows in one direction only |
| Overall cost | ★★★★★ | Serverless filters cost only per execution; no idle infrastructure |
When to Use
- ETL and data transformation pipelines: Ingest → Validate → Transform → Enrich → Load. The canonical use case.
- Event and log processing: Stream of events from application logs, IoT sensors, or audit trails that need filtering, parsing, and routing.
- Batch processing workflows: Nightly report generation, billing computation, data migration jobs — fixed start/end, no user interaction required.
- Compiler/interpreter stages: Lexing → Parsing → Semantic Analysis → Code Generation — each stage is a filter.
- Media processing pipelines: Video transcoding, image resizing, audio normalization — a series of independent format-conversion steps.
- Data quality and validation workflows: Input validation → Normalization → Business rule checks → Output — stateless checks at each stage.
- Integration/messaging workflows: Inbound message → Decode → Route → Transform → Send. Apache Camel is purpose-built for this.
When Not to Use
- Interactive user interfaces: Users need bidirectional communication — a pipeline’s unidirectional model does not fit.
- Complex stateful business logic: If processing step N needs to know what happened in step N-3 (e.g., a fraud detection engine that maintains per-user session state), a stateless filter model is the wrong abstraction.
- Real-time bidirectional communication: Chat, collaborative editing, live dashboards requiring server push — these require request/response or publish-subscribe patterns, not pipelines.
- Transactional workflows with complex rollback: If a failure mid-pipeline requires compensating transactions across previous steps, the pipeline model lacks native support. Consider a saga pattern or orchestrated workflow instead.
- Low-latency, high-frequency request/response: A REST API handling 10,000 req/sec with sub-10ms SLA does not fit a pipeline — each request would traverse multiple filter stages with overhead at each step.
Examples and Use Cases
Data warehouse ETL: A nightly batch job ingests raw transaction data from an operational database (Producer), validates and normalizes records (Transformer), enriches with currency conversion rates from a reference table (Transformer), filters out test transactions (Tester), and loads the result into a data warehouse (Consumer). The stateless nature of each step makes it trivially restartable after failure.
Log processing pipeline: Application servers emit structured JSON logs to a Kafka topic (Producer). A first filter normalizes log levels across services (Transformer). A second filter routes error-level logs to a PagerDuty alert channel and warning-level logs to a slow analysis queue (Tester/Router). A final filter indexes all logs into Elasticsearch (Consumer). Apache NiFi or Kafka Streams is the natural implementation.
Image processing service: A user uploads a photo (Producer). A first filter scans for inappropriate content (Tester). A second filter resizes to thumbnail (Transformer). A third filter compresses and optimizes (Transformer). A fourth filter stores in S3 and updates the database (Consumer). Each filter runs as an independent Lambda function; the pipe is an SQS queue with dead-letter queue configuration.
Key Takeaways
- Four filter types: Producer (source), Transformer (converts), Tester (routes/filters), Consumer (sink). Every filter in every pipeline is one of these four types.
- Statelessness is mandatory: Each filter must produce the same output for the same input, regardless of what preceded it. This constraint is what makes filters independently testable, deployable, and scalable.
- Testability is the standout characteristic (★★★★★): A stateless transformer is a pure function. Pure functions are trivially unit tested with zero setup. No mocks, no containers, no test doubles needed.
- Slowest filter governs throughput: Pipeline performance is bounded by the slowest stage. Monitor per-filter latency; independently scale or optimize the bottleneck filter.
- Dead-letter queues are essential, not optional: Any production pipeline must handle poison messages explicitly. An unhandled filter failure can silently stall the entire pipeline indefinitely.
- Pipes define team contracts: In multi-team pipelines, the data schema on the pipe is the team interface contract. Schema registries and contract testing are the pipeline equivalent of API versioning.
- Cloud has native pipeline primitives: AWS Step Functions, Azure Data Factory, Apache NiFi, Kafka Streams, and Apache Camel are purpose-built implementations of this pattern. You rarely need to implement pipes and filters from scratch.
- Simplicity is the primary advantage: The unidirectional mental model is easy to understand, debug, and onboard new developers to. Every filter has exactly one job.
- Not for interactive or stateful workflows: If your system needs bidirectional communication, user interaction, or complex per-entity state accumulation, a pipeline is the wrong architectural choice.
Related Resources
- ch09-architecture-styles-foundations — Monolith vs. distributed, fallacies of distributed computing
- ch15-event-driven-architecture — The more complex style for event-driven needs; compare broker topology to pipeline
- ch19-choosing-architecture-style — Style selection decision criteria
- ch20-architectural-patterns — Orchestration vs. Choreography — compare to pipeline’s unidirectional control flow
Last Updated: 2026-05-29