Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. Since this is a large dataset, along the way you’ll also learn the indispensable skills of data processing and subsetting.
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.
As usual, you can import this data from Rguroo’s OpenIntro
Dataset Repository. The dataset we want to import is nycflights
. Locate the dataset in the repository and click on the information icon to the right of the dataset name to read some information about the dataset. In particular, you can see the codebook (description of the variables) on that page. One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.
carrier
: Two letter carrier abbreviation.
9E
: Endeavor Air Inc.AA
: American Airlines Inc.AS
: Alaska Airlines Inc.B6
: JetBlue AirwaysDL
: Delta Air Lines Inc.EV
: ExpressJet Airlines Inc.F9
: Frontier Airlines Inc.FL
: AirTran Airways CorporationHA
: Hawaiian Airlines Inc.MQ
: Envoy AirOO
: SkyWest Airlines Inc.UA
: United Air Lines Inc.US
: US Airways Inc.VX
: Virgin AmericaWN
: Southwest Airlines Co.YV
: Mesa Airlines Inc.When you are done reading the information about the data, import the dataset to your Data toolbox.
Tip: Sometimes, when you get lists, like the ones shown in the figure above, the columns’ elements are not alphabetically ordered. To order them alphabetically, click on the column header. In the screenshot above, we have clicked on the headings Repository
and Dataset
in the first column of the top and bottom panels to make the column elements alphabetically ordered.
The dataset nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. Rguroo calls this data format a Rguroo dataset or data frame, which is a term that will be used throughout the labs. For this dataset, each observation (or each row) is a single flight.
To view the names of the variables and some summary information, right-click the nycflights dataset and select Dataset Summary. You can also double-click the dataset name to view the first 100 rows and 15 columns of the data frame.
The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:
Let’s start by examining the distribution of departure delays of all flights with a histogram. Since this is a plot, we need to open the Plots toolbox, click Create Plot, and select Histogram.
We start by selecting the nycflights Dataset
and the Variable
dep_delay to plot.
Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. Each software has a default method for determining the bin locations and widths. Rguroo’s default method is the “Freedman-Diaconis” method, named after two famous statisticians, David Freedman and Persi Diaconis (see the Method
dropdown in the Bins & Bars
section of the Histogram dialog box, shown in the figure above). click the Preview
button to see the default histogram.
A note about the border color: The default border color for bins in Rguroo is light gray. In this example, the bins are so small that with the default border color the bars blend into the background! So, we have changed the border color to black in the screenshot above. You should do the same for this example.
In Rguroo you can easily define the bin width you want to use. To define your own bins, click and open the Bins, Bars, Smoothing
section. In the Bins & Bars
tab go to the Bin Breakpoints
section. In this section there are different methods for setting the bin locations and bin widths. Let’s select the Center
and Width
option. As shown in the figure below, type 100 in the Center
box and type 15 in the Width
box. click the Preview
button to see the histogram.
Once you have seen the histogram, go back to the Bin Breakpoints
section in the dialog box and change to a bin width of 150, as shown below.
Tip: To view all three histograms at the same time, either copy and paste each graph to your report separately, or make each one in a separate tab by selecting the Histogram dialog box three times, with each dialog box consisting of one of the three options.
If you want to visualize only delays of flights headed to Los Angeles, you need to first filter the data for flights with the destination (LAX). There are two options: One is to filter the data within the Histogram function. Another option is to use the Subset function in the Data toolbox to obtain the subset of the data consisting of only flights with LAX destination, and then use this dataset to plot your histogram. We explain both methods below.
To filter the data with the destination to LAX, we use the R code dest == "LAX"
. Specifically, in the Variable
box in the dialog box of Histogram, we write dep_delay[dest == "LAX"]
, as shown in the figure below. LAX
is in quotation marks since it is a character string.
Let’s understand this command. Square brackets that follow a variable name are used to extract a subset of elements of the variable. For example, if you write dep_delay[3:10]
, the result is a new variable that consists of the third through the tenth component of the variable dep_delay
. In our example, the command dest == "LAX"
is a new variable with logical values of TRUE
corresponding to every destination that is LAX and FALSE
corresponding to destinations that are not LAX. So, the final result of dep_delay[dest == "LAX"]
in Rguroo is that only elements with destination LAX will be included from the variable dep_delay
. That is, the values corresponding to the FALSE
elements of dest == "LAX"
will be dropped.
As noted earlier, another way to get a histogram of departure delays for LAX is to get a subset of the data that only includes data values with destination LAX, and then use that dataset in the Histogram function to plot the variable dep_delay. Below, we show two approaches for obtaining a subset of the data: one utilizes the filtering functionality in Rguroo’s dataset editor, while the other employs Rguroo’s Subset function. Note that subsetting a dataset (typically, to select specific observations) is also referred to as filtering. For straightforward subsets like the one in this lab, using the dataset editor is more convenient. However, Rguroo’s subset editor provides the capability to subset data based multiple criteria.
To obtain a dataset containing only data values with the destination LAX using Rguroo’s dataset editor, follow these steps:
Open the dataset editor by going to the Data toolbox and double-clicking on the name of the dataset, nycflights. Alternatively, you can right-click on the dataset name and select the Edit
option.
In the dataset editor, locate the filter tab , which is vertically displayed on the right-hand side of the editor. You will see a list of the names of the variables.
Expand the dest dropdown menu to display the list of available destinations and uncheck the (Select All)
checkbox to remove all current selections.
Check the checkbox next to LAX. You can also type “LAX” in the search box to quickly find and select it.
Save the newly created subset of data as lax_flights, by typing the name in the the Save As textbox.
If you have already used the dataset editor to subset your data and do not wish to learn about Rguroo’s Subset function, you can skip this subsection and proceed to the next section, “Making a histogram.”
To subset the data using Rguroo’s subset function, go to the Data toolbox, select Functions, and then click on Subset. A dialog box titled Data Subset should appear.
As usual, we first need to select the nycflights Dataset
. Next, we want to filter to get specific observations in the dataset (rows), so we look at the Row Selection
section. There are two buttons in this box. The Sequence
button is used when we know the specific row numbers we want to include. The Logical Expression
button is used when we know the condition(s) that the rows that we want to include from the dataset should satisfy. In our case here, we want the Logical Expression
option. Specifically, we want to subset based on the logical expression dest == "LAX"
.
In the Logical Expression Creator
dialog, click the green plus sign at the bottom right to add a logical expression. The figure below shows the logical expression that we need to specify.
Let’s take a second to decipher the logical expression we’ve created.
dest
is the name of the variable we want to filter on. You can select this directly from the left dropdown menu.==
means “if it’s equal to”. You can select this directly from the middle dropdown menu.LAX
is the value we want dest
to be equal to. You can usually select this directly from the right dropdown menu, but there are so many flight destinations that Rguroo won’t list them. Instead, we can directly type the value in. We write it inside single quotation marks since it is a character string and not a number.Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the Data Subset dialog with the Logical Expression Creator
and a series of logical operators. The most commonly used logical operators for data analysis are as follows:
==
means “equal to”!=
means “not equal to”>
or <
means “greater than” or “less than”>=
or <=
means “greater than or equal to” or “less than or equal to”Save your new data subset as lax_flights.
Once you have saved a subset in Rguroo, you can use it as an Rguroo dataset. For example, you can make a histogram of the departure delays of only LAX flights, using the lax_flights dataset that you created.
You can also obtain a custom numerical summary of the departure delays for these flights by going to the Data toolbox, clicking Function and then selecting Summary Statistic:
Summary statistics: Some useful summary statistics for a single numerical variable are as follows:
Mean
Median
Standard Deviation
Variance
IQR
(interquartile range)Minimum
Maximum
See if you can find each of those summary statistics in the Rguroo dialog. Check the boxes for those statistics, then View the output. Note that you will have to click the + sign next to “All Observations” to view the values.
You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February. Again, you can use the dataset editor to filter the data or Rguroo’s subset function. We show both methods in the subsections below.
To filter the nycflights dataset and obtain a subset that includes only data values with the destination San Francisco (SFO) in February you can use the dataset editor and follow these steps:”
Open the dataset editor by going to the Data toolbox and double-clicking on the name of the dataset, nycflights. Alternatively, you can right-click on the dataset name and select the Edit
option.
In the dataset editor, locate the filter tab , which is vertically displayed on the right-hand side of the editor. You will see a list of the names of the variables.
Expand the dest dropdown menu to display the list of destinations and uncheck the (Select All)
checkbox to remove all current selections.
Check the checkbox next to SFO. You can also type “SFO” in the search box to quickly find and select it.
Similarly, expand the month dropdown menu to display the list of months, uncheck the (Select All)
checkbox to remove all current selections, then check the checkbox next to 2.
Save the newly created subset of data as sfo_feb_flights, by typing the name in the the Save As textbox.
If you have already used the dataset editor to subset your data and do not wish to learn about Rguroo’s subset function, youcan skip this subsection and proceed to the next subsection.
To use Rguroo’s Subset function to filter the nycflights dataset and obtain a subset that includes only data values with the destination San Francisco (SFO) in February, you need to find flights that are headed to SFO AND are in February. This consists of a combination of two logical expressions. To set this up, in the Data toolbox, click Functions, select Subset, select the nycflights dataset, then click Logical Expression
, just like we did earlier. First, we’ll put in a logical expression for flights headed to SFO, and then we’ll put in another logical expression for the flights in February (month 2). Note that we’ll have to type in both ‘SFO’ and the number 2 in the Value
column, as shown in the figure below.
Once you have each of the logical expressions in place, you need to combine them with an AND. To combine logical expressions, we use the Logical Expression Calculator
at the bottom of the dialog box. Click the down arrow in the Pick
column for the logical expression corresponding to the SFO flights to move it to the Logical Expression Calculator
. Then, click the AND
button, and finally move the logical expression for February down to the Logical Expression Calculator
. When you’re done, click Done
and click the Preview
button to view the output. Save the new output as sfo_feb_flights.
About the Logical Expression Calculator: The Logical Expression Calculator
can be used to combine logical expressions using logical operations AND
, OR
, or NOT
. For example, if you are interested in either flights headed to SFO or in February, you would combine the two expressions using the OR
button.
How many flights are headed to SFO in February? Hint: The answer is less than 100, so you can simply look at the number of rows in the Data Viewer.
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can use the Summary Statistic dialog on the sfo_feb_flights dataset but tell Rguroo to group the departure delays by the origin airport by selecting the variable origin in the Factor 1
dropdown menu, as shown below.
To view the summary statistics of departure delays based on the origin airport, click the + sign to the left of origin in the output:
Which month would you expect to have the highest average delay departing from an NYC airport?
Let’s think about how you could answer this question. We need to do three things:
group
by monthssummarize
the average departure delays for each monthsort
the months based on their average departure delaysWe now know how to do the first two of these things, but there’s a bit of a problem. When we open the Summary Statistic dialog, we can’t select month as Factor 1
. The reason for this is that Rguroo has classified month as a numerical variable, not a categorical variable. To overwrite this classification, we need to right-click the nycflights dataset and select Open to open the nycflights in the Dataset Editor (alternatively, you can double click on the name of the dataset.). Click the Variable Properties icon
and in the dialog that pops up, select Variable Properties
. Select the month variable and change its Type
to Nominal
, as shown below.
Because you have made changes to the dataset after creating and saving output based on the dataset, you cannot overwrite the original dataset. Use the Save As textbox at the top of the tab to save the new dataset as nycflights1. Now open a new Summary Statistic dialog, select the dataset nycflights1, and month will now appear as a Factor 1
option, as shown in the dialog below.
Use the Save As text box at the top of the output tab to save the summary as nycflights_month. We’ll use this dataset to perform the last step of sorting the dataset. In the Data toolbox, click Functions, then select Sort. Select nycflights_month as the Dataset
. Click the plus sign to add a variable to sort by. We want to sort by the Variable
Mean in Descending Order
:
We obtain the following output:
Note that there are actually 13 rows in this output: one for each month and one for All Observations. While this is not always useful, in this case we can also easily see which months were above or below the average departure delay for the entire dataset.
Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Also suppose that for you, a flight that is delayed for less than 5 minutes is basically “on time.”” You consider any flight delayed for 5 minutes of more to be “delayed”.
In order to determine which airport has the best on time departure rate, you can
Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable. In the Data toolbox, click Functions and select Transform. Then add a new variable called dep_type to the nycflights Dataset
using the same sequence we learned in the previous lab. First, click the plus sign to add the variable, then give it an informative name (dep_type), then type some code in the middle box to indicate how to create dep_type. In this case, the code ifelse(dep_delay < 5, "on time", "delayed")
indicates to classify the flight as "on time"
if the flight is delayed by less than 5 minutes and "delayed"
if not, i.e., if the flight is delayed by 5 or more minutes.
Click on the Preview button and Save the new dataset as nycflights_ontime.
With the new dataset saved, we can find the proportion of on-time flights out of each airport. In the Analytics toolbox, select Tabulation \(\rightarrow\) Categorical and then select the dataset nycflights_ontime. Click the to add a table. We want to find the distribution of dep_type conditional on origin, so select those variables as Factor 1
and Factor 2
, respectively. Check the conditional box labeled as Cond.
and check the Proportions
box to tell Rguroo to calculate conditional proportions.
You can also visualize the distribution of on on time departure rate across the three airports using a segmented bar plot.
You may want to play around with changing from Side by side
to Stacked
and/or changing from Counts
to Proportions
to find the best visualization.
Using the Transform function, add a new variable to the dataset that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance.
Replicate the following plot. Hint: The plot shows only flights from American Airlines (AA), Delta Airlines (DL), and United Airlines (UA), and the points are colored based on the Factor
variable carrier. Rather than using Subset, you can click and use the Factor Level Editor attached to the plot to drop the unused levels and change the colors/symbols. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Rguroo.com, the Rguroo.com logo, and all other trademarks, service marks, graphics and logos used in connection with Rguroo.com or the Website are trademarks or registered trademarks of Soflytics Corp. in the USA and other countries and are not included under the CC-BY-SA license.