Introduction to data

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. Since this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The data

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

First, we’ll load the nycflights dataset. Download the data here, and load it into JASP.

The data set nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable.

We will refer to this data format as a data frame, a term that will be used throughout the labs. For this data set, each observation is a single flight.

The codebook (description of the variables) can be accessed by loading this page.

One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.

carrier: Two letter carrier abbreviation.
- 9E: Endeavor Air Inc.
- AA: American Airlines Inc.
- AS: Alaska Airlines Inc.
- B6: JetBlue Airways
- DL: Delta Air Lines Inc.
- EV: ExpressJet Airlines Inc.
- F9: Frontier Airlines Inc.
- FL: AirTran Airways Corporation
- HA: Hawaiian Airlines Inc.
- MQ: Envoy Air
- OO: SkyWest Airlines Inc.
- UA: United Air Lines Inc.
- US: US Airways Inc.
- VX: Virgin America
- WN: Southwest Airlines Co.
- YV: Mesa Airlines Inc.

The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:

How delayed were flights that were headed to Los Angeles?
How do departure delays vary by month?
Which of the three major NYC airports has the best on time percentage for departing flights?

Analysis

Departure delays

Let’s start by examining the distribution of departure delays of all flights with a histogram. Select Descriptives, then move the variable dep_delay to the Variables box. Click Plots, then check the Distribution plots box. (The values on the y-axis are in scientific notation, where 2.0e+04 means \(2.0 \times 10^4\), or 20,000.)

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. JASP attempts to choose a reasonable number of bins. (They are working to allow the user to change the number of bins, though this feature is not available yet.)
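To see why the choice of bins matters, here is a minimal Python sketch (separate from JASP, and not how JASP chooses its bins) that counts the same values using two different bin widths. The delay values and the helper name bin_counts are made up for illustration; they are not taken from nycflights.

```python
from collections import Counter

# Made-up departure delays in minutes; NOT values from nycflights.
delays = [-4, -1, 0, 2, 6, 9, 14, 31]

def bin_counts(values, width):
    """Count how many values fall in each bin, keyed by the bin's left edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# The same data looks different depending on the bin width.
wide_bins = bin_counts(delays, 10)
narrow_bins = bin_counts(delays, 5)
print(wide_bins)
print(narrow_bins)
```

With the wider bins, most of the values pile into a single bar; with the narrower bins, the same values spread over several shorter bars.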

If you want to visualize only the delays of flights headed to Los Angeles, you first need to filter the data for flights with that destination. To do this, return to the data, where you will see an icon that looks like a funnel at the top left part of the data (to the left of V1 and above row 1).

Click this icon to bring up the filter menu. We want to use only flights where the destination is Los Angeles, so our condition is dest=LAX. Drag the dest variable to the middle of the filter menu, then click the equal sign (or type =), then type LAX after the equal sign. Click Apply pass-through filter at the bottom of the filter menu to filter your data, and you will see parts of your data greyed out, leaving only the requested data.

You can return to your histogram of dep_delay, and you will see that it now uses only flights where the destination is Los Angeles. You could also create a new graph (or any other analysis), and it will use only the data that passes the filter.
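If it helps to think of the filter in code, it behaves like a row-by-row test: a flight is kept only when its dest value equals LAX, and every later analysis sees only the kept rows. The Python sketch below is an illustration only, not what JASP runs internally, and the three records are made up rather than taken from nycflights.

```python
# Made-up toy records for illustration; these are NOT rows from nycflights.
flights = [
    {"dest": "LAX", "dep_delay": 12},
    {"dest": "SFO", "dep_delay": -3},
    {"dest": "LAX", "dep_delay": 0},
]

# Keep only the rows that pass the condition dest = LAX.
lax_flights = [f for f in flights if f["dest"] == "LAX"]

# Any later summary uses only the rows that passed the filter.
lax_delays = [f["dep_delay"] for f in lax_flights]
print(len(lax_flights), lax_delays)
```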

You can also obtain numerical summaries for these flights in the Descriptive Statistics table, created by default when you select the dep_delay variable in the Descriptives menu.

Filters can also be based on multiple conditions. Suppose you are interested in flights headed to San Francisco (SFO) in February. Bring up the filter menu, where we will give it two criteria. As with the last filter, we will enter dest=SFO. If you apply this filter and look at the very bottom of the JASP window, you will see that of the original 32,735 flights, we are now only considering 1,345 of them. To add the next condition, we click month, which starts a second line in our filter menu, then finish the line to make month=2.

This leaves us with 68 flights.

You can create more complicated filters as well. If you are interested in either flights headed to SFO or in February, you can use the “or” operator, which looks like a capital V (in the same row as the equal sign). By dragging variables and typing in values, a formula of \((dest=SFO) \lor (month=2)\) will give this. To do this, you will need to create \(dest=SFO\) and \(month=2\) on their own lines, which can then be dragged to the left and to the right of the \(\lor\) operator.
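The difference between the two kinds of combined filter can be sketched in Python. Two conditions that must both hold correspond to "and", while the \(\lor\) operator corresponds to "or". This is only an illustration of the logic under that reading; the records and the helper names to_sfo and in_february are made up, not taken from nycflights or from JASP.

```python
# Made-up toy records for illustration; these are NOT rows from nycflights.
flights = [
    {"dest": "SFO", "month": 2},
    {"dest": "SFO", "month": 7},
    {"dest": "LAX", "month": 2},
    {"dest": "BOS", "month": 5},
]

def to_sfo(flight):
    """True when the flight is headed to SFO."""
    return flight["dest"] == "SFO"

def in_february(flight):
    """True when the flight departed in month 2."""
    return flight["month"] == 2

# Both conditions must hold: flights to SFO in February.
both = [f for f in flights if to_sfo(f) and in_february(f)]

# At least one condition must hold: flights to SFO, or in February, or both.
either = [f for f in flights if to_sfo(f) or in_february(f)]

print(len(both), len(either))
```

An "or" filter can never keep fewer rows than the matching "and" filter, since every row that satisfies both conditions also satisfies at least one.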

Filter the data to include only flights headed to LAX in March. How many flights meet these criteria?
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary statistics for various groups in your data frame. We’ll go back to including all of our data, so you can double click the trash can in the filter menu to remove the filter, and we’ll go back to the Descriptives menu.

To get the summary statistics of departure delays for each origin airport, we can use dep_delay as our variable, and put origin in the Split box.

Note that we could also add a histogram here, and JASP will create a histogram for each origin airport.
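Splitting by a variable amounts to sorting the rows into groups and summarizing each group separately. The sketch below shows that idea in Python using only the standard library. The delay values are made up, not taken from nycflights, and the quartile rule chosen here (method="inclusive") is an assumption, so the IQR may not match JASP's output exactly on real data.

```python
from collections import defaultdict
from statistics import mean, median, quantiles

# Made-up (origin, dep_delay) pairs; NOT rows from nycflights.
records = [
    ("JFK", -2), ("JFK", 0), ("JFK", 5), ("JFK", 10), ("JFK", 60),
    ("EWR", 0), ("EWR", 2), ("EWR", 8), ("EWR", 15), ("EWR", 20),
    ("LGA", -5), ("LGA", -1), ("LGA", 3), ("LGA", 4),
]

# Step 1: split the delays into one list per origin airport.
by_origin = defaultdict(list)
for origin, delay in records:
    by_origin[origin].append(delay)

# Step 2: compute the same summaries within each group.
summary = {}
for origin, delays in by_origin.items():
    # Assumed quartile convention; other conventions give slightly different values.
    q1, _, q3 = quantiles(delays, n=4, method="inclusive")
    summary[origin] = {
        "mean": mean(delays),
        "median": median(delays),
        "iqr": q3 - q1,
    }

for origin, stats in summary.items():
    print(origin, stats)
```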

Click the Statistics menu below the Plots menu within Descriptives. Check the IQR option in Dispersion and Median in Central Tendency to add these values to our table. Which origin airport has the most variable departure delays?

Departure delays by month

Which month would you expect to have the highest average delay departing from an NYC airport?

Let’s think about how you could answer this question, and illustrate the importance of types of variables. You could get descriptive statistics for the departure delays using month as your split variable. If you try to do this, you will see that JASP will not accept month in the Split box. This is because Split only allows ordinal and nominal variables. Go back to your data, click the name of the month variable to switch it to ordinal, then create your summary table of departure delays by month.

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

On time departure rate for NYC airports

Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate. Also suppose that, for you, a flight that is delayed for less than 5 minutes is basically “on time.” You consider any flight delayed for 5 minutes or more to be “delayed”.

In order to determine which airport has the best on time departure rate, you can first classify each flight as “on time” or “delayed”, then calculate the percentage of flights that are on time for each origin airport.

We’ll begin by making a new variable. Click the plus sign to the right of the other columns, and select Text, since we will make this a nominal variable. Give it the name dep_type, and click Create Column. We’re going to create this variable based on a criterion using an if-then structure, so on the right hand side of the Computed Column panel, scroll down until you see ifElse(y) and click that. You will see in the filter box ifElse(test,then,else). Our test is dep_delay<5, so enter that in the spot that says test. To do this, click dep_delay (which will start a new line), then <, then type 5 in the spot after the less than sign. Click and drag the less than sign to move the entire line into the spot that says test. Now, we replace then with what we want to happen when the test condition is satisfied, so replace that with on time. Otherwise, we want the label to be delayed, so type that into the else spot. Then click Compute column to see the new column created.

Open Descriptives, and create a table using the newly created variable dep_type, split by origin. Click the box for Frequency tables to see summaries of the variable for each origin.
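The two steps, labeling each flight and then tabulating the labels by origin, can be sketched in Python as follows. This only illustrates the logic of ifElse(dep_delay<5, on time, delayed) followed by a frequency table; it is not JASP's implementation, and the flight values are made up rather than taken from nycflights.

```python
from collections import Counter

# Made-up (origin, dep_delay) pairs; NOT rows from nycflights.
flights = [
    ("JFK", 3), ("JFK", 40),
    ("LGA", -2), ("LGA", 4), ("LGA", 5),
    ("EWR", 12),
]

def dep_type(dep_delay):
    """Mirror the ifElse rule: under 5 minutes counts as on time."""
    return "on time" if dep_delay < 5 else "delayed"

# Frequency table: count each (origin, label) combination.
counts = Counter((origin, dep_type(delay)) for origin, delay in flights)
totals = Counter(origin for origin, _ in flights)

# On time departure rate: on time flights divided by all flights, per origin.
on_time_rate = {
    origin: counts[(origin, "on time")] / totals[origin] for origin in totals
}
print(on_time_rate)
```

Note that a delay of exactly 5 minutes is labeled delayed, because the test is strictly less than 5.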

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

More Practice

Create a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance.
Replicate the following plot. This plot uses the ggplot2 color palette, but you can use any color palette you like.

Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines. You can create a filter, and use the “or” (\(\lor\)) to select only flights that are from those airlines. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.