Cleaning data
Dirty data
Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying to solve.
The types of dirty data are:
- Duplicate data
- Outdated data
- Incomplete data
- Incorrect/inaccurate data
- Inconsistent data
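Three of these types (duplicate, incomplete, and inconsistent data) can be spotted with simple checks. The sketch below is one possible way to do that in plain Python; the sample records, field names, and helper functions are all hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

# Hypothetical sample records; names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ana Silva", "country": "US", "email": "ana@example.com"},
    {"id": 2, "name": "Ben Okoro", "country": "usa", "email": ""},
    {"id": 1, "name": "Ana Silva", "country": "US", "email": "ana@example.com"},
    {"id": 3, "name": "Chen Wei", "country": "United States", "email": None},
]


def find_duplicates(rows):
    """Return rows that exactly repeat an earlier row."""
    seen = set()
    dupes = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            dupes.append(row)
        else:
            seen.add(key)
    return dupes


def find_incomplete(rows, required):
    """Return rows where any required field is missing, None, or blank."""
    return [r for r in rows if any(r.get(f) in (None, "") for f in required)]


def count_spellings(rows, field):
    """Count the distinct values of a field to surface inconsistent formatting."""
    return Counter(r[field] for r in rows if r.get(field))


duplicates = find_duplicates(records)
incomplete = find_incomplete(records, ["name", "email"])
country_spellings = count_spellings(records, "country")
```

Here `country_spellings` contains three different spellings for what is probably the same country, which is a hint of inconsistent data. Outdated and incorrect data usually cannot be detected this mechanically and need a comparison against a trusted source.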
People who work on cleaning data
- Data engineers:
  They transform data into a useful format for analysis and give it a reliable infrastructure.
  They develop, maintain, and test databases, data processors, and related systems.
- Data warehousing specialists:
  They develop processes and procedures to effectively store and organize data.
  They make sure that data is available, secure, and backed up to prevent loss.
Common data-cleaning pitfalls
- Not checking for spelling errors
- Forgetting to document errors
- Not checking for misfielded values
- Overlooking missing values
- Only looking at a subset of the data
- Losing track of business objectives
- Not fixing the source of the error
- Not analyzing the system prior to data cleaning
- Not backing up your data prior to data cleaning
- Not accounting for data cleaning in your deadlines/process
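A few of these pitfalls (not backing up, forgetting to document errors, overlooking missing values) can be guarded against in code. The following is a minimal sketch, assuming the data lives in a CSV file; the file contents, the `.bak` naming, and the helpers `backup_file` and `clean_rows` are hypothetical choices for illustration.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def backup_file(path):
    """Copy the raw file next to itself with a .bak suffix before cleaning."""
    src = Path(path)
    dst = src.with_name(src.name + ".bak")
    shutil.copy2(src, dst)
    return dst


def clean_rows(rows, required):
    """Trim whitespace and log every change or problem instead of fixing silently."""
    cleaned, log = [], []
    for i, row in enumerate(rows):
        new_row = {}
        for field, value in row.items():
            stripped = value.strip() if isinstance(value, str) else value
            if stripped != value:
                log.append((i, field, "trimmed whitespace"))
            new_row[field] = stripped
        for field in required:
            if new_row.get(field) in (None, ""):
                log.append((i, field, "missing value"))
        cleaned.append(new_row)
    return cleaned, log


# Demo on a throwaway file so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("city,zip\n Lagos ,100001\nLima,\n")
    backup = backup_file(raw)  # back up first, clean second
    backup_matches = backup.read_text() == raw.read_text()
    with raw.open(newline="") as f:
        rows = list(csv.DictReader(f))

cleaned, log = clean_rows(rows, required=["city", "zip"])
```

The `log` list is the documentation of what was found and changed. Keeping it alongside the cleaned data makes it easier to trace an error back to its source later.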
Working with multiple data sources
- Data merging: the process of combining two or more datasets into a single dataset
- Compatibility: describes how well two or more datasets are able to work together
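To make the two terms concrete, here is a small sketch that checks whether two datasets are compatible on a shared key and then merges them. The datasets, the `customer_id` key, and the helpers `check_compatibility` and `merge_on_key` are hypothetical; a real project might use a database join or a dataframe library instead.

```python
def check_compatibility(left, right, key):
    """Return a list of problems that would make merging on `key` unreliable."""
    problems = []
    for name, rows in (("left", left), ("right", right)):
        if any(key not in r for r in rows):
            problems.append(f"{name}: some rows lack the key '{key}'")
    left_types = {type(r[key]).__name__ for r in left if key in r}
    right_types = {type(r[key]).__name__ for r in right if key in r}
    if left_types != right_types:
        problems.append(
            f"key '{key}' has different types: "
            f"{sorted(left_types)} vs {sorted(right_types)}"
        )
    return problems


def merge_on_key(left, right, key):
    """Inner join: combine rows from both datasets that share a key value."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical datasets: the same key is stored as int in one and str in the other.
customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]
orders = [{"customer_id": "1", "total": 30.0}, {"customer_id": "3", "total": 12.5}]

problems = check_compatibility(customers, orders, "customer_id")

# Make the datasets compatible first, then merge.
orders_fixed = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
problems_after = check_compatibility(customers, orders_fixed, "customer_id")
merged = merge_on_key(customers, orders_fixed, "customer_id")
```

Merging the original `orders` directly would silently produce no matches, because `1` and `"1"` are different values. Checking compatibility before merging is what surfaces that kind of mismatch.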