Lecture 5 (Processing Data)

Resources

You can find examples and motivation in the resources.

Summary

In this lecture we will discuss how to clean and structure datasets. In reality, the raw data we are presented with is typically fraught with errors and almost always in a form that is difficult to work with. We would like to process the data in a way that removes errors and makes it easy to work with. We call this preparing the data. To prepare the data we have to keep a few things in mind; the focus of this lecture will be on structuring data, dealing with missing data, and dealing with outliers.

Structuring the Data

There are many ways of structuring data, and some are better than others; this becomes more apparent over time as you work with data. The most popular format follows these 3 rules:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.

Unfortunately, most of the data that you will find is not of this form. To comply with these rules we will have to wrangle the data in some fashion, mostly by pivoting. Data that are compliant with the rules above are often easier to work with, and most libraries handle this format much more easily. As you might have realised, the data used so far is already of this form, because it has already been processed. An added benefit of structuring data this way is that relational (SQL) databases store data in the same form.
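To make the pivoting step concrete, here is a minimal sketch using pandas (assuming that is the library in use); the column names and figures are made up for illustration:

```python
import pandas as pd

# Hypothetical "wide" table: one column per year instead of a year variable.
wide = pd.DataFrame({
    "country": ["Norway", "Sweden"],
    "2019": [5.33, 10.23],
    "2020": [5.37, 10.35],
})

# melt pivots the year columns into (variable, value) pairs, so that each
# variable is a column and each observation is a row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
print(tidy)
```

After melting, every row is one observation (a country in a given year) and every cell holds a single value, which is exactly the format described by the three rules above.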

Cleaning the Data

Now that we have structured our data into a form that is easy to work with, we need to think about cleaning it. In most real-world data, there are always some observations that contain errors. For instance, data is missing or the data doesn't make sense for one reason or another. To finish processing the data we need to do a few more things.

Missing Data

As you might have noticed already, the data we work with contains missing values. So far, we have ignored them for the sake of simplicity. However, ignoring missing values without justification is bad practice. One should always have a reason for removing data. This requires some thought and analysis, maybe even a few rounds of EDA, to really understand the data. If you have a reasonable guess about the value of the missing data, then imputing the value is fine. If there is no information to be found, then discarding the entry or variable is fine. Either way, documenting what you did and why is important.
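Both options are a one-liner in pandas; the sketch below uses made-up columns and a median imputation purely as an example of a "reasonable guess":

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "income": [52000, 48000, np.nan, np.nan],
})

# Inspect how much is missing per column before deciding what to do.
print(df.isna().sum())

# Option 1: impute with a reasonable guess, e.g. the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Option 2: discard entries where no sensible imputation exists.
df = df.dropna(subset=["income"])
```

Whichever option you pick, record it (for instance in a comment or a notebook cell) so that the decision can be revisited later.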

Outliers

An outlier is an entry that differs significantly from the other entries. You will often find these kinds of values in data, and the reason for an outlier can vary: it can be a data entry error, a measurement error, a sampling error, a processing error, and there are of course natural outliers. Many statistical methods exist for detecting outliers; we will only use EDA to detect them and then delete, transform, impute, or accept them. Once again, whatever you choose, be transparent and document what you have done.
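One common rule of thumb you may see during EDA is to flag values outside 1.5 times the interquartile range; this is just one possible detection method, sketched here on invented numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300, 11])  # 300 is a suspicious entry

# Flag values outside 1.5 * IQR of the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # inspect before choosing to delete/transform/impute/accept
```

Note that the rule only flags candidates; whether 300 is an error or a natural outlier still requires judgement.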

Remarks

Real-world data is often "ugly", and a great amount of work is needed to make it clean and understandable. Try to think about all the things you have to account for when working with raw data, for instance duplicate observations (see the sketch below). Use your best judgement.
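As one small example of such a check, pandas can flag and drop exact duplicate rows; the data here is invented:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "value": [10, 10, 20]})

# Exact duplicate rows are often data-entry artifacts; inspect, then drop.
print(df[df.duplicated(keep=False)])
df = df.drop_duplicates()
```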