Cleaning data

Dirty data

Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying to solve.

Common types of dirty data include:

  • Duplicate data

  • Outdated data

  • Incomplete data

  • Incorrect/inaccurate data

  • Inconsistent data
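
Several of the types above can be detected with simple checks. The following is a minimal sketch in plain Python, not a definitive method: the sample records, the field names (id, name, country), and the helper functions are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ana", "country": "US"},
    {"id": 1, "name": "Ana", "country": "US"},   # duplicate row
    {"id": 2, "name": "", "country": "usa"},     # missing name, inconsistent country
    {"id": 3, "name": "Li", "country": "US"},
]

def find_duplicates(rows):
    """Return one copy of each row that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]

def find_incomplete(rows):
    """Return rows that have any empty or None field."""
    return [r for r in rows if any(v is None or v == "" for v in r.values())]

def find_inconsistent(rows, field, allowed):
    """Return rows whose value for `field` is not in the allowed set."""
    return [r for r in rows if r[field] not in allowed]

duplicates = find_duplicates(records)
incomplete = find_incomplete(records)
inconsistent = find_inconsistent(records, "country", {"US"})
```

Outdated and incorrect data usually cannot be caught by structural checks like these; they tend to need a trusted reference to compare against.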

People who work on cleaning data

  • Data engineers:

    They transform data into a format that is useful for analysis and build reliable infrastructure for it.

    They develop, maintain, and test databases, data processors and related systems.

  • Data warehousing specialists:

    They develop processes and procedures to effectively store and organize data.

    They make sure that data is available, secure, and backed up to prevent loss.

Common data-cleaning pitfalls

  • Not checking for spelling errors

  • Forgetting to document errors

  • Not checking for misfielded values

  • Overlooking missing values

  • Only looking at a subset of the data

  • Losing track of business objectives

  • Not fixing the source of the error

  • Not analyzing the system prior to data cleaning

  • Not backing up your data prior to data cleaning

  • Not accounting for data cleaning in your deadlines/process

Working with multiple data sources

  • Data merging: the process of combining two or more datasets into a single dataset

  • Compatibility: describes how well two or more datasets are able to work together
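
The two ideas above can be illustrated together: check compatibility first, then merge. This is a minimal sketch in plain Python; the datasets, the customer_id key, and the helper functions key_types and merge_on are hypothetical, and the merge assumes the key is unique in the left dataset.

```python
# Hypothetical datasets; names and fields are illustrative only.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Li"},
]
orders = [
    {"customer_id": "1", "total": 30.0},  # key stored as text in this source
    {"customer_id": "2", "total": 12.5},
    {"customer_id": "3", "total": 8.0},   # no matching customer
]

def key_types(rows, key):
    """Return the set of type names used for `key` across the rows."""
    return {type(r[key]).__name__ for r in rows}

def merge_on(left, right, key):
    """Combine rows that share the same key value (an inner join).

    Assumes `key` is unique in `left`.
    """
    index = {r[key]: r for r in left}
    return [{**index[r[key]], **r} for r in right if r[key] in index]

# Compatibility check: do both datasets store the key as the same type?
compatible = key_types(customers, "customer_id") == key_types(orders, "customer_id")

# The key types differ here, so convert the text keys to integers before merging
orders_fixed = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
merged = merge_on(customers, orders_fixed, "customer_id")
```

Merging without the conversion step would match no rows at all, because the integer 1 and the text "1" are different keys.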