Lecture 7 (Basics of SQL)
Resources
- W3, SQL Basics
- R4DS chapter 19
- SQLite Python, R
You can find examples and motivation in the resources.
Summary
In this lecture we will go through the basics of SQL. This includes the basic syntax for making queries to the database. This includes both retrieving and storing data in the database.
SQL
What is SQL? Structured Query Language (SQL) is a computer language that enables us to communicate and manipulate SQL databases(db) and it is one of the most widely used languages for db management.
There are many flavours (versions) of SQL. However, they all share the basic syntax as below and often only differ in the way the basic syntax is extended.
SQL is a Relational Database Management System (RDBMS), meaning that the data is stored in tables that are seperated based on the scope of the database. A nice illustration of how data is stored can be seen in the figure below.
Image borrowed from:https://r4ds.hadley.nz/
Here, the tables are seperated to reflect differring objects. We also see that the tables are linked or related to each other through specific variables (keys). This is where the name relational database comes from.
Basic Syntax
Writing an SQL query(a question or a requests) is similiar to speaking with the database. Once we get a grasp of the basic syntax we will be able to query the database for data in whatever form that is.
We use SELECT
to determine the variables we want and FROM
to specify the table we want it from. Following the image above, we can make the following query.
SELECT carrier, dest, tailnum
FROM flights;
SELECT carrier, dest, tailnum
FROM flights;
Make sure to end the query with ;
We can use SELELCT *
to get all variables in the table. The query above is the simplest query one can make in SQL.
We can add a layer of complexity by filtering the table. We can do this by using the WHERE
command. Continuing the example, we can make the following query.
SELECT carrier, dest, tailnum
FROM flights
WHERE dest='sweden';
SELECT carrier, dest, tailnum
FROM flights
WHERE dest='sweden';
Furthermore, we can use the logical operators AND
, OR
, NOT
to perform logical chaining of conditions. The basic syntax is the following.
SELECT variable1, variable2, ... -- * if you want all variables
FROM table
WHERE condition1 AND/OR/NOT condition2 AND/OR/NOT condition3 ...;
SELECT variable1, variable2, ... -- * if you want all variables
FROM table
WHERE condition1 AND/OR/NOT condition2 AND/OR/NOT condition3 ...;
To aggregate multiple tables into a single table we can use ... JOIN
. The syntax is similar to the previous lecture. ON
specifies the key(s) to join on. Continuing the example above the syntax is the following.
SELECT carrier, dest, tailnum, airlines.names
FROM flights
LEFT JOIN airlines ON flights.carrier = airlines.carrier;
SELECT carrier, dest, tailnum, airlines.names
FROM flights
LEFT JOIN airlines ON flights.carrier = airlines.carrier;
Using the syntax described above, we can chain multiple operations to generate the desired table. You can see more operations on W3.
Since the SQL syntax has been adopted by both pandas
and tidyverse
, it will not be difficult to learn the content above.
Storing Data
To start storing data, we need to create a table to store it in. We can do this using the CREATE
command. The query is as follows.
CREATE TABLE table_name (
...
);
CREATE TABLE table_name (
...
);
To our table, we need to add columns (variable), we can do this by specifying the name of the column and datatype. We can also specify constraints on the variable. The query is as follows.
CREATE TABLE table_name (
Variable1 datatype constraint,
Variable2 datatype constraint,
...
);
CREATE TABLE table_name (
Variable1 datatype constraint,
Variable2 datatype constraint,
...
);
Common constrains are NOT NULL
, UNIQUE
, PRIMARY KEY
, FOREIGN KEY
. There are more constraints that can be added. They can be found on W3.
The unique identifier (id) of a row should be constrained with UNIQUE
and it must also be filled (NOT NULL
). The PRIMARY KEY
incorporates both constraints. Therefore, it is good practice to specify the row id using the PRIMARY KEY
constraint.
An example of creating a table can look something like the following quer query.
CREATE TABLE Persons (
name VARCHAR(255) NOT NULL,
age INT, -- It does not need to be specified
id_number VARCHAR PRIMARY KEY
);
CREATE TABLE Persons (
name VARCHAR(255) NOT NULL,
age INT, -- It does not need to be specified
id_number VARCHAR PRIMARY KEY
);
Specific datatypes can be looked up on W3.
To change a table that already exists we can use the ALTER
command. Specific ways to alter tables can be found on W3.
Putting it all together, the SQL query looks as follows.
CREATE TABLE Persons (
Name VARCHAR(255) NOT NULL,
Age INT, -- It does not need to be specified
ID_number VARCHAR(255) PRIMARY KEY
);
CREATE TABLE Computers (
Computer_id INT PRIMARY KEY,
Model VARCHAR(255),
Brand VARCHAR(255),
Owner VARCHAR(255) FOREIGN KEY REFERENCES Persons(ID_number)
)
CREATE TABLE Persons (
Name VARCHAR(255) NOT NULL,
Age INT, -- It does not need to be specified
ID_number VARCHAR(255) PRIMARY KEY
);
CREATE TABLE Computers (
Computer_id INT PRIMARY KEY,
Model VARCHAR(255),
Brand VARCHAR(255),
Owner VARCHAR(255) FOREIGN KEY REFERENCES Persons(ID_number)
)
Here, we create two tables and use FOREIGN KEY
to link the owner of the computer to a specific person using the id-number. If we later realise that we want a person to have a variable specifying occupation, we can use the following command.
ALTER TABLE Persons
ADD Occupation VARCHAR(255);
ALTER TABLE Persons
ADD Occupation VARCHAR(255);
With a table created, we can add entries to the table. This can be done using the INSERT
command. The full syntax looks as follows.
INSERT INTO table_name (variable1, variable2, ...)
VALUES (value1, value2, ...)
INSERT INTO table_name (variable1, variable2, ...)
VALUES (value1, value2, ...)
Note that the order of the values should follow the order of the variables. Furthermore, if the constraints are violated we will get an error.
In the example above the query may look like
INSERT INTO Persons (Name, ID_number)
VALUES ('Taariq', '19000101-xxxx');
INSERT INTO Persons (Name, ID_number)
VALUES ('Taariq', '19000101-xxxx');
Since, the Age variable does not need to be filled in, the query will populate the Persons
table and have an empty cell for the Age column.
We are only scratching the surface of SQL, there is a lot more to learn. However, with this basic knowledge we will be able to interact with databases. You can find more information in the listed resources (W3).
There are various libraries in Python and R to interact with SQL databases. We will primarily be working with SQLite databases since they are more accessible. Popular libraries are sqlite3 for Python and dbplyr for R.