Data Pipeline
Data pipelines move data from one place to another. Data packets pass through the pipeline sequentially, one after another.
The key considerations for designing a data pipeline are:
- Latency: how long a single data packet takes to pass through the pipeline.
- Throughput: how much data can be fed into the pipeline per unit of time.
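To make the two measures concrete, here is a minimal sketch that times a single stage. The names `measure` and `slow_stage` are made up for illustration, and the `time.sleep` call only stands in for real work.

```python
import time

def measure(process, packets):
    """Return (average latency in seconds, throughput in packets per second)."""
    latencies = []
    start = time.perf_counter()
    for packet in packets:
        t0 = time.perf_counter()
        process(packet)
        latencies.append(time.perf_counter() - t0)  # time spent on this packet
    elapsed = time.perf_counter() - start            # time spent on all packets
    avg_latency = sum(latencies) / len(latencies)
    throughput = len(latencies) / elapsed
    return avg_latency, throughput

def slow_stage(packet):
    time.sleep(0.001)  # stand-in for real processing work (assumption)
    return packet

avg_latency, throughput = measure(slow_stage, list(range(20)))
print(f"avg latency: {avg_latency * 1000:.2f} ms, throughput: {throughput:.0f} packets/s")
```

Because this stage handles one packet at a time, throughput is roughly the inverse of latency. With parallel workers the two numbers can move independently.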
Stages
- Data Extraction: extract data from one or more sources
- Data Ingestion: ingest data into the pipeline
- Data Transformation
- Data Loading: load data into the destination
- Scheduling or Triggering
- Monitoring
- Maintenance and Optimization
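The extraction, transformation, and loading stages can be sketched as chained generator functions. This is only an illustration: the in-memory CSV, the `warehouse` list, and the function names are assumptions, not part of any particular tool.

```python
import csv
import io

# Made-up sample input standing in for a real source.
RAW_CSV = "user,amount\nalice,10.5\nbob,not-a-number\ncarol,4\n"

def extract(text):
    """Extraction: read raw records from the source (here an in-memory CSV)."""
    yield from csv.DictReader(io.StringIO(text))

def transform(records):
    """Transformation: parse amounts and drop records that fail to parse."""
    for record in records:
        try:
            yield {"user": record["user"], "amount": float(record["amount"])}
        except ValueError:
            continue  # skip malformed rows

def load(records, destination):
    """Loading: append cleaned records to the destination and count them."""
    count = 0
    for record in records:
        destination.append(record)
        count += 1
    return count

warehouse = []  # hypothetical destination
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

Since each stage is a generator, records flow through one at a time rather than being materialized between stages.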
Key monitoring metrics
- Latency
- Throughput
- Warnings, errors, and failures: from the network, source, destination, etc.
- Utilization rate: how fully the pipeline uses its resources.
- Logging and alerting
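A rough sketch of how these metrics might be checked is shown here. The thresholds, the metric dictionary keys, and the logger name are all assumptions chosen for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.monitor")  # hypothetical logger name

def utilization(busy_seconds, window_seconds, workers=1):
    """Fraction of the available worker time that was spent doing work."""
    return busy_seconds / (window_seconds * workers)

def check(metrics, max_latency_s=1.0, max_error_rate=0.01):
    """Return alert messages for metrics that cross their (assumed) thresholds."""
    alerts = []
    if metrics["latency_s"] > max_latency_s:
        alerts.append(f"latency {metrics['latency_s']:.2f}s exceeds {max_latency_s:.2f}s")
    error_rate = metrics["errors"] / max(metrics["processed"], 1)
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    for alert in alerts:
        log.warning(alert)  # alerting is reduced to a log line in this sketch
    return alerts

sample = {"latency_s": 2.4, "processed": 1000, "errors": 3}  # made-up metrics
alerts = check(sample)
print(f"utilization: {utilization(busy_seconds=45, window_seconds=60):.0%}")
```

In a real setup the alerts would go to a paging or notification system rather than a log line.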
Performance
Several factors affect the performance of the pipeline:
- Unbalanced load: some stages take longer to process data than others, and the slowest stage limits the performance of the entire pipeline. Solution: parallelize the workers for the heavy stage.
- Stage synchronization: data does not always flow smoothly from one stage to the next. Solution: put an I/O buffer between the stages. The buffer regulates the flow, improves throughput, and distributes load across parallelized stages.
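Both solutions can be sketched together with a bounded queue acting as the buffer and several threads serving the heavy stage. The buffer size, worker count, and stage functions are assumptions for illustration only.

```python
import queue
import threading
import time

BUFFER_SIZE = 8  # assumed capacity of the buffer between the two stages
NUM_WORKERS = 4  # assumed number of parallel workers for the heavy stage
STOP = object()  # sentinel telling a worker to shut down

stage_buffer = queue.Queue(maxsize=BUFFER_SIZE)
results = []
results_lock = threading.Lock()

def light_stage(items):
    """Fast upstream stage: put() blocks whenever the buffer is full."""
    for item in items:
        stage_buffer.put(item)
    for _ in range(NUM_WORKERS):
        stage_buffer.put(STOP)  # one sentinel per worker

def heavy_stage():
    """Slow downstream stage, run by several workers in parallel."""
    while True:
        item = stage_buffer.get()
        if item is STOP:
            break
        time.sleep(0.01)  # stand-in for expensive processing
        with results_lock:
            results.append(item * item)

workers = [threading.Thread(target=heavy_stage) for _ in range(NUM_WORKERS)]
for worker in workers:
    worker.start()
light_stage(range(20))
for worker in workers:
    worker.join()
print(sorted(results))
```

The bounded queue applies backpressure: the fast stage pauses when the buffer fills up, and whichever worker is free takes the next item, so load spreads across the workers without explicit coordination.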
Batch vs Streaming
Batch pipelines:
- operate on batches of data
- can run periodically
- can be triggered by data size or other triggers
- focus on accuracy, not latency
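A size-based trigger could look roughly like the following. `SizeTriggeredBatcher` is a made-up helper, and summing each batch is only a placeholder for real batch processing.

```python
class SizeTriggeredBatcher:
    """Collect records and process them once enough have accumulated."""

    def __init__(self, batch_size, process_batch):
        self.batch_size = batch_size
        self.process_batch = process_batch
        self._pending = []

    def add(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.batch_size:  # the size trigger
            self.flush()

    def flush(self):
        """Process whatever is pending, e.g. from a periodic schedule."""
        if self._pending:
            batch, self._pending = self._pending, []
            self.process_batch(batch)

totals = []
batcher = SizeTriggeredBatcher(batch_size=3, process_batch=lambda b: totals.append(sum(b)))
for n in range(1, 8):
    batcher.add(n)
batcher.flush()  # process the final partial batch
print(totals)
```

A periodic run would call `flush()` from a scheduler instead of waiting for the size threshold.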
Streaming pipelines:
- ingest data in rapid succession (e.g. credit card transactions, social media events)
- produce real-time results
- process events as they happen
- can load events into storage
- let users publish/subscribe to the event stream
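The publish/subscribe idea can be illustrated with a tiny in-memory event stream. This is a sketch under simplifying assumptions: the `EventStream` class, the topic name, and the 1000 threshold are invented, and a real deployment would use a message broker rather than direct function calls.

```python
from collections import defaultdict

class EventStream:
    """Minimal in-memory publish/subscribe hub (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver the event to every subscriber as soon as it is published."""
        handlers = self._subscribers.get(topic, ())
        for handler in handlers:
            handler(event)
        return len(handlers)

storage = []  # events loaded to storage
flagged = []  # events handled in real time

def flag_large(event):
    if event["amount"] > 1000:  # assumed threshold
        flagged.append(event)

stream = EventStream()
stream.subscribe("transactions", storage.append)
stream.subscribe("transactions", flag_large)

stream.publish("transactions", {"card": "A", "amount": 25})
stream.publish("transactions", {"card": "B", "amount": 5000})
print(len(storage), len(flagged))
```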
Lambda Architecture
It combines batch and streaming data to balance accuracy and latency.
- batch data -> batch layer -> serving layer
- streaming data -> speed layer -> serving layer
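The two paths above can be sketched with page-view counting as a stand-in workload. The function and class names and the sample events are assumptions. The point is only that the serving layer answers queries by merging an accurate but stale batch view with a fresh, incrementally updated one.

```python
from collections import Counter

def batch_layer(historical_events):
    """Recompute an accurate view from the full historical dataset."""
    return Counter(event["page"] for event in historical_events)

class SpeedLayer:
    """Incrementally update a view from events no batch run has covered yet."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1

def serving_layer(batch_view, speed_view, page):
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view[page] + speed_view[page]

# Made-up sample events.
historical = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
recent = [{"page": "home"}]

batch_view = batch_layer(historical)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

print(serving_layer(batch_view, speed.view, "home"))
```

After each batch run, the speed layer's view would be reset for the events that the new batch view now covers. That step is omitted here.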
Modern data pipeline features
- Automation: the pipeline is fully automated
- Ease of use: ETL rule recommendations
- Drag-and-drop UI
- Transformation support: enables complex calculations
- Security and compliance: data encryption and regulatory compliance (GDPR, HIPAA, etc.)