Chapter 12 Flashcards — Pipeline Architecture Style

flashcards fsa pipeline-architecture


What is the pipeline architecture style?
?
An architecture style where data flows unidirectionally through a sequence of independent processing steps (filters) connected by data channels (pipes). Each filter performs a single, stateless operation. Also called the Pipes and Filters pattern.


What are the four filter types in a pipeline architecture?
?

  1. Producer — data source; entry point of the pipeline
  2. Transformer — converts, enriches, or reformats data
  3. Tester — routes or filters data based on a condition (no transformation)
  4. Consumer — data sink; terminal output of the pipeline
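The four roles can be sketched as plain Python generators chained together. This is a minimal illustration, not a reference implementation: the function names and the `name,amount` record format are assumptions made up for the example.

```python
from typing import Iterable, Iterator


def produce(lines: Iterable[str]) -> Iterator[str]:
    """Producer: entry point; ingests raw records and emits them downstream."""
    for line in lines:
        yield line


def parse(records: Iterable[str]) -> Iterator[dict]:
    """Transformer: reformats each 'name,amount' string into a dict."""
    for record in records:
        name, amount = record.split(",")
        yield {"name": name.strip(), "amount": float(amount)}


def only_positive(records: Iterable[dict]) -> Iterator[dict]:
    """Tester: passes or drops records on a condition, without changing them."""
    for record in records:
        if record["amount"] > 0:
            yield record


def collect(records: Iterable[dict]) -> list:
    """Consumer: terminal sink; here it simply materializes a list."""
    return list(records)


raw = ["alice, 10.5", "bob, -3", "carol, 7"]  # hypothetical input
result = collect(only_positive(parse(produce(raw))))
print(result)
```

Each function only knows about the iterable it reads from, so any filter can be swapped or tested without touching the others.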

What is a Producer filter?
?
A Producer is the entry point of a pipeline. It generates or ingests data and places it into the first pipe. It receives no data from any upstream filter. Examples: reading a CSV file, polling a database table, consuming from a Kafka topic.


What is a Transformer filter?
?
A Transformer receives data from an upstream pipe, applies a transformation, and places the result in the downstream pipe. It must be stateless: the same input always produces the same output. Examples: parsing JSON, normalizing fields, format encoding, and currency conversion when the exchange rate arrives with the record rather than from a live lookup.


What is a Tester filter?
?
A Tester (or Router) evaluates a condition on incoming data and routes it to one of several downstream pipes, drops it, or sends it to a dead-letter channel. It does not transform the data’s content — it only routes it. Examples: quality check pass/fail routing, filtering out invalid records.
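A Tester can be sketched as a function that sorts records into named downstream pipes. The pipe names, the `score` field, and the 0.8 threshold are hypothetical; the point is that records come out exactly as they went in.

```python
def route(records):
    """Tester: assigns each record to one downstream pipe, content untouched."""
    pipes = {"passed": [], "failed": [], "dead_letter": []}
    for record in records:
        if not isinstance(record, dict) or "score" not in record:
            pipes["dead_letter"].append(record)  # shape we cannot evaluate
        elif record["score"] >= 0.8:
            pipes["passed"].append(record)
        else:
            pipes["failed"].append(record)
    return pipes


pipes = route([{"id": 1, "score": 0.93}, {"id": 2, "score": 0.41}, "garbage"])
print(pipes)
```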


What is a Consumer filter?
?
A Consumer is the terminal filter. It receives data from the final pipe and produces the pipeline’s output — writes to a database, file, or downstream system. It does not place data into any downstream pipe.


What is a pipe in the pipeline architecture style?
?
A pipe is a unidirectional data channel connecting two filters. Data flows from one filter to the next, never backward. Pipes define the data schema contract between filters. Implementations: in-memory queues, Kafka topics, SQS queues, message broker channels.
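As one possible in-memory implementation, a pipe can be a bounded `queue.Queue` between filters running in separate threads. The sentinel object used to signal end-of-stream is an assumption of this sketch, not something the style prescribes.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def producer(out_pipe):
    for n in range(5):
        out_pipe.put(n)
    out_pipe.put(SENTINEL)


def doubler(in_pipe, out_pipe):
    while True:
        item = in_pipe.get()
        if item is SENTINEL:
            out_pipe.put(SENTINEL)  # pass the end marker downstream
            break
        out_pipe.put(item * 2)


def consumer(in_pipe, sink):
    while True:
        item = in_pipe.get()
        if item is SENTINEL:
            break
        sink.append(item)


pipe_a = queue.Queue(maxsize=2)  # bounded: a full pipe blocks the upstream filter
pipe_b = queue.Queue(maxsize=2)
sink = []
threads = [
    threading.Thread(target=producer, args=(pipe_a,)),
    threading.Thread(target=doubler, args=(pipe_a, pipe_b)),
    threading.Thread(target=consumer, args=(pipe_b, sink)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sink)  # [0, 2, 4, 6, 8]
```

Data only moves forward: no filter holds a reference to the pipe behind its upstream neighbor.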


What is the statelessness requirement for filters, and why is it fundamental?
?
Each filter must produce the same output for the same input with no dependency on previous inputs or external mutable state. This constraint makes filters independently testable (pure functions), independently deployable, and independently scalable. Stateful logic in a filter breaks all three properties.
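The contrast can be made concrete with two hypothetical filters: one whose output depends only on its input, and one that carries a running total between calls.

```python
def normalize_email(record: dict) -> dict:
    """Stateless: the output depends only on the input record."""
    return {**record, "email": record["email"].strip().lower()}


class RunningTotal:
    """Stateful: the output depends on every record seen before this one."""

    def __init__(self):
        self.total = 0.0

    def __call__(self, record: dict) -> dict:
        self.total += record["amount"]
        return {**record, "running_total": self.total}


same_input = {"email": " Ada@Example.COM ", "amount": 5.0}

stateless_a = normalize_email(same_input)
stateless_b = normalize_email(same_input)

stateful = RunningTotal()
stateful_a = stateful(same_input)
stateful_b = stateful(same_input)

print(stateless_a == stateless_b)  # True
print(stateful_a == stateful_b)    # False
```

Two instances of `RunningTotal` running in parallel would each see only part of the stream, which is why the stateful version cannot simply be scaled out.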


Why does the pipeline style score ★★★★★ on testability?
?
Stateless transformer filters are pure functions: given input X, they always return output Y. Pure functions need essentially no setup to unit test: no mocks or other test doubles, no containers, no database stubs. Few things in software are simpler to test.
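A sketch of what such a test looks like, using a hypothetical `to_cents` transformer and bare `assert` statements:

```python
def to_cents(record: dict) -> dict:
    """Transformer under test: adds an integer amount in cents."""
    return {**record, "amount_cents": round(record["amount"] * 100)}


def test_to_cents_converts_amount():
    assert to_cents({"id": 1, "amount": 12.34}) == {
        "id": 1,
        "amount": 12.34,
        "amount_cents": 1234,
    }


def test_to_cents_does_not_mutate_input():
    original = {"id": 1, "amount": 0.5}
    to_cents(original)
    assert original == {"id": 1, "amount": 0.5}


# No fixtures or setup: call the tests directly (a runner such as pytest also works).
test_to_cents_converts_amount()
test_to_cents_does_not_mutate_input()
```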


Why does the pipeline style score ★★★☆☆ on performance?
?
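Pipeline throughput is bounded by the slowest filter. This is a bottleneck effect; Amdahl’s Law makes the related point that the part of a workload you cannot speed up limits the overall gain. Additionally, distributed pipe implementations (Kafka, SQS) add serialization and network overhead at each stage. For workloads that need sub-millisecond latency, that per-stage overhead accumulates across the pipeline.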


What is the slowest filter bottleneck risk and how is it mitigated?
?
The slowest filter governs the entire pipeline’s throughput — no matter how fast other filters are, you can’t exceed the pace of the bottleneck. Mitigation: monitor per-filter latency, parallelize the slow filter’s execution (run N instances), or decompose it into smaller sub-filters with finer granularity.
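As a back-of-the-envelope model, pipeline throughput is the minimum over all filters of (records per second per instance × number of instances). The rates below are invented, and the model assumes instances scale linearly, which real systems only approximate.

```python
def pipeline_throughput(filters: dict) -> int:
    """Records/second for the whole pipeline: the slowest stage's capacity."""
    return min(rate * instances for rate, instances in filters.values())


# filter name -> (records/second per instance, number of instances); made-up figures
filters = {"parse": (5000, 1), "enrich": (400, 1), "load": (2000, 1)}
before = pipeline_throughput(filters)

filters["enrich"] = (400, 4)  # run four instances of the bottleneck filter
after = pipeline_throughput(filters)

print(before, after)  # 400 1600
```

Speeding up `parse` or `load` would change nothing here; only work on the bottleneck moves the result.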


What is a dead-letter queue (DLQ) and why is it essential in pipelines?
?
A dead-letter queue is a separate channel that receives records that a filter repeatedly fails to process (poison messages). Without a DLQ, a single malformed record can block the entire pipeline indefinitely. A DLQ preserves failed records for inspection while allowing the main pipeline to continue.


What is a poison message in a pipeline?
?
A poison message is a record that consistently causes a filter to throw an error, usually due to malformed data, an unexpected schema, or a bug in the filter. Without a max-retry policy, it is retried indefinitely. Mitigation: set a maximum retry count; after N failures, route the record to the dead-letter queue automatically.
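The retry-then-dead-letter policy can be sketched in a few lines. The `run_with_dlq` helper, the retry limit of 3, and the use of plain lists as queues are assumptions for illustration; a real broker would provide its own redelivery and DLQ settings.

```python
import json

MAX_RETRIES = 3  # hypothetical policy value


def parse_json(raw: str) -> dict:
    """The filter being protected: fails on malformed input."""
    return json.loads(raw)


def run_with_dlq(records, filter_fn, max_retries=MAX_RETRIES):
    """Apply filter_fn to each record; after max_retries failures, dead-letter it."""
    processed, dead_letter = [], []
    for raw in records:
        last_error = None
        for _attempt in range(max_retries):
            try:
                processed.append(filter_fn(raw))
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = exc
        else:
            # The loop never hit `break`: every attempt failed.
            dead_letter.append(
                {"record": raw, "error": repr(last_error), "attempts": max_retries}
            )
    return processed, dead_letter


processed, dead_letter = run_with_dlq(['{"id": 1}', "{broken", '{"id": 2}'], parse_json)
print(processed)
print(dead_letter)
```

The malformed record ends up in `dead_letter` for inspection, and the record behind it is still processed.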


What is schema drift and how is it managed in pipelines?
?
Schema drift occurs when the data schema at a pipe changes (field added, renamed, or removed), breaking downstream filters that depend on that field. Managed with a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) that enforces backward/forward compatibility on every pipe’s schema.
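To show the kind of rule a registry enforces, here is a deliberately simplified backward-compatibility check: a reader on the new schema must still be able to read data written with the old one. The dict-based schema format and the `backward_compatible` function are invented for this sketch and are not the algorithm any particular registry uses.

```python
def backward_compatible(old: dict, new: dict) -> list:
    """Return a list of problems; an empty list means `new` can read `old` data."""
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    return problems


old = {"id": {"type": "int"}, "email": {"type": "string"}}

added_with_default = {**old, "country": {"type": "string", "default": "unknown"}}
renamed = {"id": {"type": "int"}, "email_address": {"type": "string"}}

print(backward_compatible(old, added_with_default))  # []
print(backward_compatible(old, renamed))
```

Under this rule a rename looks like a removal plus an addition without a default, which is why renames are the classic drift failure.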


How does the pipeline architecture differ from event-driven architecture?
?
Pipeline: unidirectional, sequential, batch/stream processing, data transformation focus, no feedback loops, stateless filters.
Event-driven: bidirectional communication possible, event producers don’t control downstream processing, filters/handlers can trigger new events, supports complex stateful orchestration.
Pipelines are simpler; event-driven is more flexible.


Name four cloud-native pipeline implementations.
?

  1. AWS Step Functions — visual workflow; states = filters
  2. Azure Data Factory — managed ETL; activities = filters
  3. Apache NiFi — visual pipeline builder; processors = filters
  4. Kafka Streams — stream processing with map/filter/branch operations as filter chains

What is AWS Step Functions and how does it map to pipeline concepts?
?
AWS Step Functions is a visual workflow orchestrator. Each state is a filter: Choice states are Testers, Task states calling Lambda are Transformers, and terminal states are Consumers. It supports parallel states (fan-out pipes), retry policies, and error catching — first-class pipeline primitives.
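The mapping can be illustrated with a small state machine definition in Amazon States Language, built here as a Python dict. The state names, Lambda ARNs, and the `$.valid` field are hypothetical; `missing_targets` is a helper written for this sketch that only checks that every transition points at a defined state.

```python
import json

ARN = "arn:aws:lambda:us-east-1:123456789012:function:"  # placeholder account/region

state_machine = {
    "Comment": "Hypothetical order pipeline",
    "StartAt": "ParseOrder",
    "States": {
        "ParseOrder": {  # Transformer
            "Type": "Task",
            "Resource": ARN + "parse-order",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SendToDeadLetter"}],
            "Next": "IsValid",
        },
        "IsValid": {  # Tester
            "Type": "Choice",
            "Choices": [{"Variable": "$.valid", "BooleanEquals": True,
                         "Next": "StoreOrder"}],
            "Default": "SendToDeadLetter",
        },
        "StoreOrder": {"Type": "Task", "Resource": ARN + "store-order", "End": True},
        "SendToDeadLetter": {"Type": "Task", "Resource": ARN + "dead-letter", "End": True},
    },
}


def missing_targets(definition: dict) -> list:
    """Names referenced by StartAt/Next/Default/Choices/Catch but never defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return sorted(targets - set(states))


definition_json = json.dumps(state_machine, indent=2)
print(missing_targets(state_machine))  # []
```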


What is fan-out in a pipeline?
?
Fan-out is when one filter’s output is duplicated into multiple parallel downstream pipes, allowing independent parallel processing branches. Example: after parsing an order, fan-out to simultaneously update inventory, trigger fulfillment, and send a confirmation email — three independent Consumer paths.
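The order example can be sketched with a thread pool that hands each branch its own copy of the record. The three branch functions are stand-ins that only return a string; in practice they would be separate filters behind separate pipes.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def update_inventory(order):
    return f"reserved {order['qty']} x {order['sku']}"


def trigger_fulfillment(order):
    return f"ship order {order['id']}"


def send_confirmation(order):
    return f"email sent to {order['email']}"


def fan_out(record, branches):
    """Duplicate the record into every branch and run the branches in parallel."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {
            name: pool.submit(branch, copy.deepcopy(record))  # independent copies
            for name, branch in branches.items()
        }
        return {name: future.result() for name, future in futures.items()}


order = {"id": 42, "sku": "WIDGET", "qty": 3, "email": "ada@example.com"}
results = fan_out(order, {
    "inventory": update_inventory,
    "fulfillment": trigger_fulfillment,
    "confirmation": send_confirmation,
})
print(results)
```

Copying the record per branch keeps one branch from seeing another branch's changes.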


When should you use the pipeline architecture style? (give three use cases)
?
(1) ETL pipelines: Ingest → Validate → Transform → Load. (2) Event/log processing: filtering, parsing, and routing streams of events. (3) Batch workflows: nightly report generation, billing computation, media transcoding — any fixed-start/fixed-end processing chain with no user interaction.
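The Ingest → Validate → Transform → Load chain from use case (1) might look like this with only the standard library. The CSV content, the `payments` table, and the choice to silently drop invalid rows are assumptions of the sketch; a production pipeline would more likely route rejects to a dead-letter channel.

```python
import csv
import io
import sqlite3

RAW = "id,amount\n1,10.50\n2,abc\n3,4.25\n"  # hypothetical input file contents


def ingest(text):
    """Producer: read CSV rows as dicts."""
    yield from csv.DictReader(io.StringIO(text))


def validate(rows):
    """Tester: drop rows whose amount is not numeric."""
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        yield row


def transform(rows):
    """Transformer: convert to (id, amount_cents) tuples."""
    for row in rows:
        yield (int(row["id"]), round(float(row["amount"]) * 100))


def load(rows, conn):
    """Consumer: write the rows to the target table."""
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(validate(ingest(RAW))), conn)
stored = conn.execute("SELECT id, amount_cents FROM payments ORDER BY id").fetchall()
print(stored)  # [(1, 1050), (3, 425)]
```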


When should you not use the pipeline architecture style? (give three conditions)
?
(1) Interactive UIs — users require bidirectional communication; pipelines are unidirectional. (2) Complex stateful logic — if a processing step must know about prior steps’ state, stateless filters are the wrong model. (3) Real-time bidirectional communication — chat, live dashboards, collaborative editing require pub-sub or request-response patterns.


What does the pipeline style score on ease of development, and why?
?
★★★★★. The mental model is among the simplest of any architecture style: data flows in one direction, and each filter has exactly one job. New developers can understand the system from a flow diagram. No distributed systems concepts are required for a basic implementation.


What is the Conway’s Law implication for multi-team pipelines?
?
When multiple teams own different filter segments, the pipe schema becomes the team interface contract, analogous to an API contract in microservices. Schema registries and contract testing (e.g., Avro compatibility checks) become mandatory. With those in place, each team can deploy its filter segment independently, provided its changes keep the pipe schemas compatible.


What makes serverless functions a natural implementation for pipeline filters?
?
Serverless functions (Lambda, Cloud Functions, Azure Functions) are stateless, independently deployable, auto-scaling, and pay-per-execution — all of which map directly to the pipeline filter’s requirements. The function IS the filter; the trigger (SQS, Kinesis, HTTP) IS the pipe.
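A sketch of a Transformer filter as a Lambda function triggered by SQS. The `handler(event, context)` signature and the `Records`/`body` layout follow the standard SQS event shape; the order fields and the returned list are assumptions, and a real filter would publish its output to the next pipe rather than return it.

```python
import json


def handler(event, context):
    """Lambda entry point: SQS delivers a batch of messages under event['Records']."""
    results = []
    for message in event["Records"]:
        order = json.loads(message["body"])  # the message body is a JSON string
        results.append(
            {"order_id": order["id"], "total_cents": round(order["total"] * 100)}
        )
    return results


# Local invocation with a hand-built event; no AWS account is needed for this.
sample_event = {
    "Records": [
        {"messageId": "1", "body": json.dumps({"id": "A-1", "total": 19.99})},
        {"messageId": "2", "body": json.dumps({"id": "A-2", "total": 5.0})},
    ]
}
output = handler(sample_event, None)
print(output)
```

The function keeps nothing between invocations, so the platform can run as many copies as the queue depth demands.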


What are the Characteristics Ratings for the pipeline architecture? (summarize)
?
Testability ★★★★★, Ease of development ★★★★★, Simplicity ★★★★★, Cost ★★★★★, Overall agility ★★★★☆, Ease of deployment ★★★★☆, Scalability ★★★★☆, Performance ★★★☆☆.


What is the primary architectural limitation of the pipeline style?
?
Performance: bounded by the slowest filter. The style is not suited for low-latency, high-frequency request/response workloads. It is optimized for throughput-oriented batch and streaming use cases where individual record latency is less critical than total data volume processed.


Priority: MEDIUM — Pipeline architecture is a foundational pattern tested on architecture style selection questions; focus on filter types, statelessness, and use cases.

Last Updated: 2026-05-29