Lecture 6 (Combining Data)
Resources
You can find examples and motivation in the resources.
Summary
In this lecture, we are discussing joins. Joining tables allow us to combine different sources of data into a single dataframe. As a data scientist, you will this happens more often than not. As mentioned previously, data is never nice and in the form we want to work with, in most cases it is scattered over multiple data sources, for instance, databases, the internet, experimental data etc. Joining enables us to combine many datasets into a single one.
Joins
Join is the operation of combining different tables into a single table. There are many different types of joins that are relevant an each has the effect of mutating the table to increase or decrease the number of variables in the final table. The 4 most common merges we will encounter are left join
, right join
, inner join
and outer join
each serving a distinct rule for combining the tables. The following image illustrates these operations.
Image from: https://www.alphacodingskills.com/sql/img/sql-join.PNG
Here, we are always working refereing to combining two tables. In the case where there are many more tables to be combined, we can split it up into combining two tables at a time.
The most common way to work is to iteratively expand a dataset by combining other datasets. That is why, we will work left join
a lot. Nevertheless, combining the other methods enable us to create more complex combinations of datasets.
Learning how to join tables naturally allows us to work with SQL type databases. Joins are essential, when working SQL, since storing data efficently involves splitting data up into different tables in a certain way. We will discuss more about this in the comming lecture.