Data Pipeline
A data pipeline moves data from one place to another, passing data packets through a series of stages one at a time.
The key considerations for designing a data pipeline are:
- Latency: The total time it takes a single data packet to pass through the pipeline.
- Throughput: How much data can be fed into the pipe per unit of time.
Stages
- Data Extraction: extract data from one or more sources
- Data Ingestion: ingest data into the pipeline
- Data Transformation
- Data Loading: load data into the destination
- Scheduling or Triggering
- Monitoring
- Maintenance and Optimization
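The core extract, transform, and load stages can be sketched as plain functions chained together. This is a minimal illustration, not a production design: the in-memory CSV string, the `is_large` field, the 5.0 threshold, and the `warehouse` list are all hypothetical stand-ins for a real source, transformation rule, and destination.

```python
import csv
import io

# Hypothetical in-memory source; a real pipeline would read a file, API, or database.
RAW_CSV = "id,amount\n1,10.5\n2,4.0\n3,7.25\n"


def extract(raw: str) -> list:
    """Extraction: read rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list) -> list:
    """Transformation: cast types and derive a new (assumed) field."""
    return [
        {
            "id": int(row["id"]),
            "amount": float(row["amount"]),
            "is_large": float(row["amount"]) > 5.0,  # illustrative threshold
        }
        for row in rows
    ]


def load(rows: list, destination: list) -> int:
    """Loading: append rows to the destination and return how many were written."""
    destination.extend(rows)
    return len(rows)


def run_pipeline(raw: str, destination: list) -> int:
    """Chain the stages: extract -> transform -> load."""
    return load(transform(extract(raw)), destination)


warehouse = []  # stand-in for the real destination
loaded = run_pipeline(RAW_CSV, warehouse)
```

Scheduling, monitoring, and maintenance would wrap around `run_pipeline` rather than live inside it, which keeps each stage easy to test on its own.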
Monitoring keys
- Latency
- Throughput
- Warnings, errors, and failures: from the network, source, destination, etc.
- Utilization rate: how fully the pipeline uses its resources
- Logging and alerting
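One way these measurements might be collected is a small wrapper around each stage call. The sketch below is an assumption-laden illustration: `PipelineMetrics` and its field names are hypothetical, "latency" here is the processing time of a single wrapped call rather than end-to-end latency, and "utilization" is approximated as busy time divided by wall-clock time.

```python
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger("pipeline")


@dataclass
class PipelineMetrics:
    """Hypothetical collector for latency, throughput, errors, and utilization."""
    latencies: list = field(default_factory=list)
    errors: int = 0
    busy_seconds: float = 0.0
    started_at: float = field(default_factory=time.perf_counter)

    def record(self, func, packet):
        """Run func(packet), timing it and counting failures."""
        start = time.perf_counter()
        try:
            return func(packet)
        except Exception as exc:
            self.errors += 1
            logger.warning("packet %r failed: %s", packet, exc)  # logging/alerting hook
            return None
        finally:
            elapsed = time.perf_counter() - start
            self.latencies.append(elapsed)
            self.busy_seconds += elapsed

    def summary(self) -> dict:
        wall = time.perf_counter() - self.started_at
        count = len(self.latencies)
        return {
            "packets": count,
            "errors": self.errors,
            "avg_latency_s": sum(self.latencies) / count if count else 0.0,
            "throughput_per_s": count / wall if wall > 0 else 0.0,
            "utilization": self.busy_seconds / wall if wall > 0 else 0.0,
        }


metrics = PipelineMetrics()
results = [metrics.record(int, raw) for raw in ["1", "2", "oops", "4"]]
report = metrics.summary()
```

In this run the third packet fails to parse, so it is counted as an error and logged while the remaining packets continue through.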
Performance
Several factors affect the performance of a pipeline:
- Unbalanced load
Some stages take longer to process a packet than others. The slowest stage becomes a bottleneck and lowers the performance of the entire pipeline.
Solution: Parallelize the workers for the heavy stage.
- Stage synchronization
Data does not always flow smoothly from one stage to the next.
Solution: Put an I/O buffer between the stages. It regulates the flow, improves throughput, and distributes load across parallelized stages.
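Both solutions can be sketched together with a bounded queue acting as the buffer and several threads acting as parallel workers for the heavy stage. The stage functions, the squaring "work", and the worker and buffer sizes are all hypothetical choices for illustration.

```python
import queue
import threading

SENTINEL = object()  # marks the end of input for one worker


def light_stage(items, out_buffer):
    """Fast upstream stage: put() blocks when the buffer is full (backpressure)."""
    for item in items:
        out_buffer.put(item)


def heavy_worker(in_buffer, results):
    """Slow downstream stage; several copies of this run in parallel."""
    while True:
        item = in_buffer.get()
        if item is SENTINEL:
            break
        results.put(item * item)  # stand-in for expensive processing


def run(items, n_workers=4, buffer_size=8):
    buffer = queue.Queue(maxsize=buffer_size)  # I/O buffer between the two stages
    results = queue.Queue()                    # unbounded, so workers never block on output
    workers = [
        threading.Thread(target=heavy_worker, args=(buffer, results))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    light_stage(items, buffer)
    for _ in workers:
        buffer.put(SENTINEL)  # one sentinel per worker so each one stops
    for worker in workers:
        worker.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)  # workers finish in arbitrary order


squares = run(range(20))
```

The bounded buffer keeps the fast stage from running arbitrarily far ahead, and whichever worker is free takes the next item, which spreads the load. In CPython, threads mainly help when the heavy stage waits on I/O; for CPU-bound work, separate processes would likely be a better fit.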
Batch vs Streaming
Batch pipelines:
- operate on batches of data
- can run periodically
- can be initiated by data size or other triggers
- focus on accuracy, not latency
Streaming pipelines:
- ingest data in rapid succession (e.g. credit card transactions, social media events)
- are used for real-time results
- process events as they happen
- can load events into storage
- let users publish/subscribe to the event stream
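The contrast can be illustrated with a toy example: a batch function that waits for a size trigger before processing, and a tiny in-process publish/subscribe stream that handles each event on arrival. `batch_pipeline`, `EventStream`, the amounts, and the alert threshold of 30 are all hypothetical; a real system would typically use a scheduler and a message broker.

```python
from typing import Callable, Iterable, Iterator, List


def batch_pipeline(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Batch: accumulate events and process them once the size trigger fires."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield sum(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        yield sum(batch)


class EventStream:
    """Streaming: a minimal in-process publish/subscribe stream (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[int], None]] = []

    def subscribe(self, handler: Callable[[int], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: int) -> None:
        for handler in self._subscribers:
            handler(event)  # each event is processed as it happens


amounts = [5, 10, 20, 40, 80]

batch_totals = list(batch_pipeline(amounts, batch_size=2))

stream = EventStream()
storage: List[int] = []
alerts: List[int] = []
stream.subscribe(storage.append)  # events can be loaded into storage
stream.subscribe(lambda amount: alerts.append(amount) if amount > 30 else None)
for amount in amounts:
    stream.publish(amount)
```

The batch version produces nothing until a batch fills, while each streaming subscriber reacts to every event immediately.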
Lambda Architecture
It combines batch and streaming data, balancing accuracy and latency.
- batch data -> batch layer -> serving layer
- streaming data -> speed layer -> serving layer
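The two paths into the serving layer can be sketched with simple event counts. This is a deliberately simplified illustration: the `LambdaSketch` class, its method names, and the use of `Counter` as the view format are assumptions, and a real deployment would run the layers as separate systems.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of batch, speed, and serving layers for counting events by key."""

    def __init__(self) -> None:
        self.master_dataset = []     # append-only record of every event
        self.batch_view = Counter()  # accurate view, rebuilt by the batch layer
        self.speed_view = Counter()  # low-latency view of events since the last batch run

    def ingest(self, key: str) -> None:
        """New events go to the master dataset and to the speed layer."""
        self.master_dataset.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view from all data, then reset the speed view."""
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        """Serving layer: merge the batch view with the speed view."""
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaSketch()
for key in ["a", "b", "a"]:
    pipeline.ingest(key)
before_batch = pipeline.query("a")  # served entirely from the speed view

pipeline.run_batch()
pipeline.ingest("a")
after_batch = pipeline.query("a")   # batch view plus one recent event
```

Queries stay current between batch runs because the speed view covers whatever the batch view has not yet absorbed.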
Modern data pipeline features
- Automation: the pipeline is fully automated
- Ease of use: ETL rule recommendations
- Drag-and-drop UI
- Transformation support: enables complex calculations
- Security and compliance: data encryption and regulatory compliance (GDPR, HIPAA, etc.)