TDM 20100: Project 12 — SQL
Motivation: We have used two SQL databases but we have not (yet) built a database of our own.
Context: It is straightforward to build a new database from a collection of csv files.
Scope: In SQLite, we demonstrate the setup for building a new database.
Dataset(s)
This project will use the following dataset:
-
/anvil/projects/tdm/data/flights/subset/*(flight data)
Questions
Question 1 (2 pts)
First, open a terminal and combine the data from the subset flight csv files as follows. (We are storing the resulting file in the $SCRATCH directory because it is very large. We are also removing NA values (using awk) and removing the header from each file (using grep).
|
The file that we are about to build on the next line will be large, and so it will take a few minutes to run. |
cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | awk -F, -v OFS=, '{for (i=1; i<=NF; i++) if ($i == "NA") $i=""};1' | grep -v Year >$SCRATCH/myflightdata.csv
The airports.csv data has a few extra commas:
cat /anvil/projects/tdm/data/flights/subset/airports.csv | sed 's.Union County, Troy Shelton .Union County Troy Shelton.g' | sed 's.Savage, Sr.Savage Sr.g' | sed 's.Baton Rouge Metropolitan, Ryan .Baton Rouge Metropolitan Ryan.g' | sed 's.Lawrence County Airpark,Inc.Lawrence County Airpark Inc.g' | sed 's.Westport, NY.Westport NY.g' | sed 's.Pullman/Moscow,ID.Pullman/Moscow ID.g' | sed 's.Reading Muni,Gen Carl A Spaatz.Reading Muni Gen Carl A Spaatz.g' | sed 's.Richard Lloyd Jones, Jr.Richard Lloyd Jones Jr.g' | sed 's.Toccoa, R G Le Tourneau .Toccoa R G Le Tourneau .g' | sed 's.\\"Bud\\" Barron .Bud Barron.g' | sed 's."..g' >$SCRATCH/mycleanairports.csv
The carriers.csv data has double quotes that we do not want:
cat /anvil/projects/tdm/data/flights/subset/carriers.csv | sed 's."..g' | awk -F, '{if (NF == 3) {print $1","$2 $3} else {print $0}}' >$SCRATCH/mycleancarriers.csv
The plane-data.csv sometimes only has 1 column, and sometimes has 9 columns. We clean this up too:
cat /anvil/projects/tdm/data/flights/subset/plane-data.csv | awk -F, '{if (NF == 9) {print $0} else {print $1",,,,,,,,"}}' >$SCRATCH/mycleanplanedata.csv
Now, also in the terminal, make a new SQLite file. We also make this file in the $SCRATCH directory, so that we do not fill up your home directory:
sqlite3 $SCRATCH/newflightdatabase.db
(Whenever we want to quit the sqlite3 program, we can hit CONTROL-D but do NOT YET hit CONTROL-D, because we still need to build the database.)
Now we tell SQLite that our files are in ASCII format:
.mode ascii
and the files to be imported are comma separated:
.separator "," "\n"
and we make tables for the data, first for the flight data:
CREATE TABLE flights(
"Year" INTEGER,
"Month" INTEGER,
"DayofMonth" INTEGER,
"DayOfWeek" INTEGER,
"DepTime" INTEGER,
"CRSDepTime" INTEGER,
"ArrTime" INTEGER,
"CRSArrTime" INTEGER,
"UniqueCarrier" TEXT,
"FlightNum" INTEGER,
"TailNum" TEXT,
"ActualElapsedTime" INTEGER,
"CRSElapsedTime" INTEGER,
"AirTime" INTEGER,
"ArrDelay" INTEGER,
"DepDelay" INTEGER,
"Origin" TEXT,
"Dest" TEXT,
"Distance" INTEGER,
"TaxiIn" INTEGER,
"TaxiOut" INTEGER,
"Cancelled" INTEGER,
"CancellationCode" INTEGER,
"Diverted" INTEGER,
"CarrierDelay" INTEGER,
"WeatherDelay" INTEGER,
"NASDelay" INTEGER,
"SecurityDelay" INTEGER,
"LateAircraftDelay" INTEGER
);
and for the airports data:
CREATE TABLE airports(
"iata" TEXT,
"airport" TEXT,
"city" TEXT,
"state" TEXT,
"country" TEXT,
"lat" NUMERIC,
"long" NUMERIC
);
and for the carriers data:
CREATE TABLE carriers(
"Code" TEXT,
"Description" TEXT
);
and for the plane data:
CREATE TABLE planes(
"tailnum" TEXT,
"type" TEXT,
"manufacturer" TEXT,
"issue_date" TEXT,
"model" TEXT,
"status" TEXT,
"aircraft_type" TEXT,
"engine_type" TEXT,
"year" INTEGER
);
Next, import the actual data into the tables that we created above. The first one will take a few minutes to run!
|
The first import statement will take all of the data from the huge file we built at the start, and put that data into our database. So it will take a few minutes to run. |
|
In all 4 of these import statements, |
.import --skip 1 /anvil/scratch/x-mdw/myflightdata.csv flights
and the airports data:
.import --skip 1 /anvil/scratch/x-mdw/mycleanairports.csv airports
and the carriers data:
.import --skip 1 /anvil/scratch/x-mdw/mycleancarriers.csv carriers
and the planes data:
.import --skip 1 /anvil/scratch/x-mdw/mycleanplanedata.csv planes
Next, we want to build indices for the flight data:
CREATE INDEX ix_flights_covering ON flights(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
and for the airports data:
CREATE INDEX ix_airports_covering ON airports(iata,airport,city,state,country,lat,long);
and for the carriers data:
CREATE INDEX ix_carriers_covering ON carriers(Code,Description);
and for the planes data:
CREATE INDEX ix_planes_covering ON planes(tailnum,type,manufacturer,issue_date,model,status,aircraft_type,engine_type,year);
Finally, you can exit from SQLite by typing: CONTROL-D.
Afterwards, check the size of the file that you created, and indicate the size of the file in your solutions (it should be approximately 17 GB)
ls -la --block-size=G $SCRATCH/newflightdatabase.db
-
Because all of the work for Question 1 happens in the terminal, the only thing that we are asking you to put into the Jupyter Lab notebook for Question 1 is the output from this command:
ls -la --block-size=G $SCRATCH/newflightdatabase.dbwhich prints the file size for the database that you built in Question 1. This line should show that your database is approximately 17 GB.
Question 2 (2 pts)
|
Back in the regular Jupyter Lab notebook, using the
but (of course) change the |
Join the flights and the airports table, matching the Origin column to the iata column. Find the total number of flights in the database for each Origin airport that is located in Texas. For each Origin airport in Texas, print the total number of flights and the 3-letter Origin airport code.
-
For each
Originairport in Texas, print the total number of flights and the 3-letterOriginairport code.
Question 3 (2 pts)
-
From the
flightstable, find the 10 most popularTailNumvalues, according to how many times that eachTailNumappears in theflightstable. For each of these top 10TailNum, list theTailNumand the number of flights on thatTailNum. -
Notice that the 5 most popular
TailNumvalues are: (blank), UNKNOW, 0, NKNO, 000000. Ignoring these top 5 most popular values, in part b, we want you to consider (only) the 6th most popularTailNumvalue, which should beN525. You can read about this 6th most popular airplane here: www.flightaware.com/live/flight/N525 For only this 6th most popular airplane, withTailNumequal toN525, please make a separate query of theflightstable that shows the top 5Originairports for this plane’s flights. (Hint: This airplane has departed 2952 times from Dallas Love FieldDALand also 2146 times from Phoenix’s Sky Harbor International AirportPHX.)
-
For each of these top 10
TailNum, list theTailNumand the number of flights on thatTailNum. -
After identifying the 6th most popular airplane (from part a; which is the first valid airplane; it should have
tailnumequal toN525), now find the top 5Originairports for this specific plane’s flights. For each of these top 5Originairports for this plane, find the three-letter code of theOriginairport and the number of times that this specific airplane departed from each suchOrigin.
Question 4 (2 pts)
Now let’s revisit question 3, but this time we will JOIN the flights table and the planes table ON the TailNum value. Group the results according to the TailNum and find the 10 most popular values, listing the TailNum value and the number of flights for each such TailNum.
|
Notice that the invalid tail numbers from question 3 are gone (because they do not appear in the |
-
JOIN the
flightstable and theplanestable, to find the 10 most popular values, listing theTailNumvalue and the number of flights for each suchTailNum.
Question 5 (2 pts)
Join the flights and the carriers table, matching the UniqueCarrier column to the Code column. Find the total number of flights in the database for each UniqueCarrier. For each UniqueCarrier, print the UniqueCarrier value, the Description value, and also the total number of flights for that UniqueCarrier. (Hint: Your query results should have 29 rows altogether.)
-
For each
UniqueCarrier, print theUniqueCarriervalue, theDescriptionvalue, and also the total number of flights for thatUniqueCarrier.
Submitting your Work
We have now built on the same skills that we learned for the movies database and the baseball database, but this time, we developed our own database of airplane flights and answered questions about the database that we built!
-
firstname-lastname-project12.ipynb
|
You must double check your You will not receive full credit if your |