ETL
ETL stands for: Extraction, Transformation, and Loading.
Extraction
The data source can be:
- paper
- web
- audio/video
- financial reports
- transactions
- social media
- weather
- IoT
- ...
Some use cases:
- Integrating disparate structured data sources via APIs
- Capturing events via APIs
- Monitoring with edge computing
Transformation
Transformation includes:
- Data typing (e.g. typing a CSV record into a UserData object)
- Data structuring (e.g. JSON file to DB table)
- Anonymizing, encrypting
- Cleaning
- Normalizing
- Filtering, sorting, aggregating
- Joining data sources
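A few of the steps above (typing, cleaning, filtering, sorting) can be sketched in Python as follows. This is only an illustration: the fields of UserData, the helper parse_users, and the sample CSV are all assumptions, since the notes do not define them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class UserData:
    # Hypothetical fields, chosen only for illustration.
    user_id: int
    name: str
    age: int


def parse_users(csv_text: str) -> list[UserData]:
    """Type each CSV row into a UserData record, skipping rows that cannot be typed."""
    users = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            users.append(
                UserData(int(row["user_id"]), row["name"].strip(), int(row["age"]))
            )
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # cleaning: drop malformed or incomplete rows
    return users


# Made-up sample input: the second data row has an invalid age.
raw = "user_id,name,age\n1, Ada ,36\n2,Linus,not-a-number\n3,Grace,45\n"
users = parse_users(raw)

# Filtering and sorting on the typed records.
over_40 = sorted((u for u in users if u.age >= 40), key=lambda u: u.name)
```

Working on typed records rather than raw strings means that later steps such as the age filter compare integers instead of text.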
Schema-on-read vs Schema-on-write
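Schema-on-write predefines the schema and enforces it when data is written, while schema-on-read stores the data as-is and applies a schema only when the data is read.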
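To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 and json modules. Under schema-on-write the structure is declared before any data is stored and violations are rejected at write time. Under schema-on-read raw payloads are kept unchanged and interpreted when queried. The table name, field names, and the helper read_celsius are assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-on-write: the table structure is fixed before any data is stored.
conn.execute("CREATE TABLE readings (sensor TEXT NOT NULL, celsius REAL NOT NULL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("s1", 21.5))
try:
    conn.execute("INSERT INTO readings VALUES (?, ?)", ("s2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the record violating the schema is refused at write time

# Schema-on-read: raw payloads are stored unchanged, even incomplete ones.
raw_store = ['{"sensor": "s1", "celsius": 21.5}', '{"sensor": "s2"}']


def read_celsius(payloads):
    """Apply the schema at read time; records lacking the field yield None."""
    return [json.loads(p).get("celsius") for p in payloads]


values = read_celsius(raw_store)
```

The trade-off visible here is that schema-on-write catches bad records early, while schema-on-read accepts everything and leaves handling of missing fields to each reader.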
Loading
The following aspects should be considered when designing the loading step.
Full vs Incremental
Full loading loads the entire initial history into a database.
Incremental loading then inserts only the data that is new or changed since the previous load.
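The difference can be sketched with sqlite3. The orders table, its columns, and the use of an updated_at high-water mark to detect new rows are assumptions for illustration. Other change-detection strategies exist.

```python
import sqlite3


def full_load(conn, rows):
    """Rebuild the target table from the complete history."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def incremental_load(conn, rows):
    """Upsert only rows newer than the latest updated_at already loaded."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()
    fresh = [r for r in rows if r[2] > watermark]  # ISO dates compare as text
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", fresh)
    conn.commit()
    return len(fresh)


conn = sqlite3.connect(":memory:")
full_load(conn, [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")])

# A later extract: order 2 changed, order 3 is new, order 1 is unchanged.
loaded_count = incremental_load(
    conn,
    [(2, 25.0, "2024-01-03"), (3, 30.0, "2024-01-03"), (1, 10.0, "2024-01-01")],
)
```

Only the two rows past the watermark are written. The unchanged row is skipped, which is what keeps incremental loads cheap compared with repeating a full load.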
Scheduled vs On-demand
Batch vs Streaming
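Batch loading moves data in large chunks at intervals, while streaming loads each record as it arrives. In between, we have micro-batch loading, which loads small batches at short intervals.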
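As a rough sketch of micro-batching, the generator below groups a stream of events into small lists before each load. It batches by count only. Real systems often also flush on a time interval, which is omitted here. The names micro_batches and load, and the stand-in stream, are hypothetical.

```python
from itertools import islice


def micro_batches(stream, size):
    """Group an event stream into lists of at most `size` items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # stream exhausted
        yield batch


sink = []


def load(batch):
    """Hypothetical loader; a real one would write the batch to a database."""
    sink.append(batch)


# range(7) stands in for an incoming event stream.
for batch in micro_batches(range(7), 3):
    load(batch)
```

With size set to 1 this degenerates to record-at-a-time streaming, and with a very large size it behaves like a single batch load.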