Lecture 4 (Exploratory Data Analysis)
Resources
- P4DA: chapter 13
- R4DS: chapter 7
- YouTube video (~40 min)
Summary
So far in the course, we have gone through the machinery and basic tools for transforming data and generating graphs. We are now ready to dive into the topic of Exploratory Data Analysis (EDA).
EDA is the process of systematically building an understanding of the data using the tools we have studied, that is, by transforming and visualising data. We have already performed EDA without knowing it. For instance, the function `head` gives us a quick glance at the structure of the data. By using `head`, we have answered the question: what is the structure of the data?
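As a minimal sketch of that first question, here is how it might look in Python with pandas. The table is made-up illustration data, not a course dataset, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "plant_id": ["p1", "p2", "p3", "p4", "p5", "p6"],
        "height_cm": [12.5, 15.0, 9.8, 20.1, 17.3, 11.0],
        "flowering": [True, False, False, True, True, False],
    }
)

# head() returns the first five rows by default:
# a quick look at which columns exist and what the values look like.
first_rows = df.head()
print(first_rows)

# Natural follow-up questions: what type is each column, and how big is the table?
print(df.dtypes)
print(df.shape)
```

Each call answers one small question about the structure of the data, which is exactly the kind of step EDA is built from.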
Asking questions is key to EDA. That is how we unwrap the complexity of the data, by asking and answering questions.
“There are no routine statistical questions, only questionable statistical routines.”
— Sir David Cox
(quote borrowed from R4DS)
Every question we ask of the data leads to a deeper understanding of it. Which questions can we ask (and answer) to fill the gaps in our knowledge of the data? That, of course, depends on the data we are working with. EDA is therefore not a strict framework but a research methodology driven by curiosity. For example: can one of the variables explain another? Why or why not? Answering these types of questions gives us deep insight into the data. We will not be able to answer every question, but knowing which questions we cannot answer shows us the limits of our dataset.
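One possible first pass at the question "can one of the variables explain another?" is to look at pairwise correlation. The sketch below assumes pandas and uses invented toy data with hypothetical column names. A correlation is only a hint about a linear relationship, not an explanation, so treat the result as a prompt for the next question.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6],
        "exam_score": [52, 55, 61, 64, 70, 73],
        "shoe_size": [42, 38, 44, 39, 41, 40],
    }
)

# Pearson correlation (the default for Series.corr):
# values near +1 or -1 suggest a strong linear relationship,
# values near 0 suggest little linear relationship.
corr_hours = df["hours_studied"].corr(df["exam_score"])
corr_shoes = df["shoe_size"].corr(df["exam_score"])

print(f"hours_studied vs exam_score: {corr_hours:.2f}")
print(f"shoe_size vs exam_score: {corr_shoes:.2f}")
```

In this toy data, hours studied tracks the exam score closely while shoe size does not. The "why or why not?" part still has to be answered by plotting the data and thinking about how it was collected.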
The process of EDA is iterative. Once a set of questions has been answered, new questions will arise. This continues until there are no more questions to ask or until you are satisfied with the understanding you have gained. You will get better and more efficient at posing questions with practice. So: practice, practice, practice.
Look at the resources above for examples of EDA.