Introduction to data

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. Since this is a large dataset, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The data

The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.

As usual, you can import this data from Rguroo’s OpenIntro Dataset Repository. The dataset we want to import is nycflights. Locate the dataset in the repository and click on the information icon to the right of the dataset name to read some information about the dataset. In particular, you can see the codebook (description of the variables) on that page. One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.

carrier: Two letter carrier abbreviation.
- 9E: Endeavor Air Inc.
- AA: American Airlines Inc.
- AS: Alaska Airlines Inc.
- B6: JetBlue Airways
- DL: Delta Air Lines Inc.
- EV: ExpressJet Airlines Inc.
- F9: Frontier Airlines Inc.
- FL: AirTran Airways Corporation
- HA: Hawaiian Airlines Inc.
- MQ: Envoy Air
- OO: SkyWest Airlines Inc.
- UA: United Air Lines Inc.
- US: US Airways Inc.
- VX: Virgin America
- WN: Southwest Airlines Co.
- YV: Mesa Airlines Inc.

When you are done reading the information about the data, import the dataset to your Data toolbox.

Rguroo Dataset Repository - Open Intro to import nycflights

Tip: Sometimes, when you get lists, like the ones shown in the figure above, the columns’ elements are not alphabetically ordered. To order them alphabetically, click on the column header. In the screenshot above, we have clicked on the headings Repository and Dataset in the first column of the top and bottom panels to make the column elements alphabetically ordered.

The dataset nycflights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. Rguroo calls this data format an Rguroo dataset or data frame, a term that will be used throughout the labs. For this dataset, each observation (or each row) is a single flight.

To view the names of the variables and some summary information, right-click the nycflights dataset and select Dataset Summary. You can also double-click the dataset name to view the first 100 rows and 15 columns of the data frame.

The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:

How delayed were flights that were headed to Los Angeles?
How do departure delays vary by month?
Which of the three major NYC airports has the best on time percentage for departing flights?

Analysis

Departure delays

Let’s start by examining the distribution of departure delays of all flights with a histogram. Since this is a plot, we need to open the Plots toolbox, click Create Plot, and select Histogram.

We start by selecting the nycflights Dataset and the Variable dep_delay to plot.

Histogram Basics dialog box

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. Each software package has a default method for determining the bin locations and widths. Rguroo’s default method is the “Freedman-Diaconis” method, named after two famous statisticians, David Freedman and Persi Diaconis (see the Method dropdown in the Bins & Bars section of the Histogram dialog box, shown in the figure above). Click the Preview button to see the default histogram.

A note about the border color: The default border color for bins in Rguroo is light gray. In this example, the bins are so small that with the default border color the bars blend into the background! So, we have changed the border color to black in the screenshot above. You should do the same for this example.

In Rguroo you can easily define the bin width you want to use. To define your own bins, click Details and open the Bins, Bars, Smoothing section. In the Bins & Bars tab, go to the Bin Breakpoints section. This section offers different methods for setting the bin locations and bin widths. Let’s select the Center and Width option. As shown in the figure below, type 100 in the Center box and type 15 in the Width box. Click the Preview button to see the histogram.

Selecting Bin Center 100 and Bin Width 15

Once you have seen the histogram, go back to the Bin Breakpoints section in the Details dialog box and change to a bin width of 150, as shown below.

Selecting Bin Center 100 and Bin Width 150

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

Tip: To view all three histograms at the same time, either copy and paste each graph to your report separately, or make each one in a separate tab by selecting the Histogram dialog box three times, with each dialog box consisting of one of the three options.
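To see why the choice of bins matters, it can help to reproduce the idea in plain R outside of Rguroo. The sketch below uses a small set of made-up delays (not the nycflights data), computes the Freedman-Diaconis bin width as 2 × IQR / n^(1/3), and then builds breakpoints from a center and a width. The helper make_breaks is hypothetical, and the assumption that a "Center and Width" rule places one bin centered at the given value is ours, not something confirmed by Rguroo.

```r
# Hypothetical departure delays in minutes (made up for illustration)
dep_delay <- c(-5, -2, 0, 3, 8, 12, 20, 35, 60, 95, 130, 240)

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(dep_delay) / length(dep_delay)^(1 / 3)

# Hypothetical helper: breakpoints such that one bin is centered at `center`.
# Assumption: this mirrors what a "Center and Width" option does.
make_breaks <- function(x, center, width) {
  lowest <- center - width / 2
  # Extend the edges in whole bin widths until they cover the data
  start <- lowest - ceiling((lowest - min(x)) / width) * width
  end <- lowest + ceiling((max(x) - lowest) / width) * width
  seq(start, end, by = width)
}

# Same data, same center, two very different bin widths
narrow <- hist(dep_delay, breaks = make_breaks(dep_delay, 100, 15), plot = FALSE)
wide <- hist(dep_delay, breaks = make_breaks(dep_delay, 100, 150), plot = FALSE)

print(fd_width)
print(narrow$counts)
print(wide$counts)
```

With the narrow bins, the long right tail is spread over many nearly empty bins; with the wide bins, almost everything collapses into one or two bars. Setting plot = TRUE (the default) would draw the histograms instead of only returning the counts.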

If you want to visualize only delays of flights headed to Los Angeles, you need to first filter the data for flights with the destination (LAX). There are two options: One is to filter the data within the Histogram function. Another option is to use the Subset function in the Data toolbox to obtain the subset of the data consisting of only flights with LAX destination, and then use this dataset to plot your histogram. We explain both methods below.

Subsetting in the Histogram function directly

To filter the data with the destination to LAX, we use the R code dest == "LAX". Specifically, in the Variable box in the Basics dialog box of Histogram, we write dep_delay[dest == "LAX"], as shown in the figure below. LAX is in quotation marks since it is a character string.

Dialog box for histogram of departure delays at LAX

Let’s understand this command. Square brackets that follow a variable name are used to extract a subset of elements of the variable. For example, if you write dep_delay[3:10], the result is a new variable that consists of the third through the tenth component of the variable dep_delay. In our example, the command dest == "LAX" is a new variable with logical values of TRUE corresponding to every destination that is LAX and FALSE corresponding to destinations that are not LAX. So, the final result of dep_delay[dest == "LAX"] in Rguroo is that only elements with destination LAX will be included from the variable dep_delay. That is, the values corresponding to the FALSE elements of dest == "LAX" will be dropped.
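The same bracket mechanics can be tried on a tiny example in plain R. The vectors below are made up for illustration and are not taken from nycflights; the names middle_delays, is_lax, and lax_delays are hypothetical.

```r
# Hypothetical toy data: five flights
dep_delay <- c(12, -3, 45, 0, 7)
dest <- c("LAX", "SFO", "LAX", "ORD", "LAX")

# Positional subsetting: the second through the fourth elements
middle_delays <- dep_delay[2:4]

# A logical vector: TRUE where the destination is LAX, FALSE otherwise
is_lax <- dest == "LAX"

# Logical subsetting: keep only the delays where is_lax is TRUE
lax_delays <- dep_delay[is_lax]

print(middle_delays)
print(is_lax)
print(lax_delays)
```

Printing is_lax before using it as an index is a quick way to check that the condition selects the rows you expect.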

Subsetting (filtering) the data

As noted earlier, another way to get a histogram of departure delays for LAX is to get a subset of the data that only includes data values with destination LAX, and then use that dataset in the Histogram function to plot the variable dep_delay. Below, we show two approaches for obtaining a subset of the data: one uses the filtering functionality in Rguroo’s dataset editor, while the other uses Rguroo’s Subset function. Note that subsetting a dataset (typically, to select specific observations) is also referred to as filtering. For straightforward subsets like the one in this lab, using the dataset editor is more convenient. However, Rguroo’s Subset function provides the capability to subset data based on multiple criteria.

Filtering using Rguroo’s dataset editor

To obtain a dataset containing only data values with the destination LAX using Rguroo’s dataset editor, follow these steps:

Open the dataset editor by going to the Data toolbox and double-clicking on the name of the dataset, nycflights. Alternatively, you can right-click on the dataset name and select the Edit option.
In the dataset editor, locate the filter tab, which is vertically displayed on the right-hand side of the editor. You will see a list of the names of the variables.
Expand the dest dropdown menu to display the list of available destinations and uncheck the (Select All) checkbox to remove all current selections.
Check the checkbox next to LAX. You can also type “LAX” in the search box to quickly find and select it.
Save the newly created subset of data as lax_flights by typing the name in the Save As textbox.

Filtering data using Rguroo’s dataset editor

Subsetting (filtering) the data by using the Subset function

If you have already used the dataset editor to subset your data and do not wish to learn about Rguroo’s Subset function, you can skip this subsection and proceed to the next section, “Graphical and numerical summaries of a subset.”

To subset the data using Rguroo’s subset function, go to the Data toolbox, select Functions, and then click on Subset. A dialog box titled Data Subset should appear.

As usual, we first need to select the nycflights Dataset. Next, we want to filter to get specific observations in the dataset (rows), so we look at the Row Selection section. There are two buttons in this box. The Sequence button is used when we know the specific row numbers we want to include. The Logical Expression button is used when we know the condition(s) that the rows that we want to include from the dataset should satisfy. In our case here, we want the Logical Expression option. Specifically, we want to subset based on the logical expression dest == "LAX".

In the Logical Expression Creator dialog, click the green plus sign at the bottom right to add a logical expression. The figure below shows the logical expression that we need to specify.

Dialog box for subsetting data using logical expressions

Let’s take a second to decipher the logical expression we’ve created.

dest is the name of the variable we want to filter on. You can select this directly from the left dropdown menu.
== means “if it’s equal to”. You can select this directly from the middle dropdown menu.
LAX is the value we want dest to be equal to. You can usually select this directly from the right dropdown menu, but there are so many flight destinations that Rguroo won’t list them. Instead, we can directly type the value in. We write it inside single quotation marks since it is a character string and not a number.

Logical operators: Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the Data Subset dialog with the Logical Expression Creator and a series of logical operators. The most commonly used logical operators for data analysis are as follows:

== means “equal to”
!= means “not equal to”
> or < means “greater than” or “less than”
>= or <= means “greater than or equal to” or “less than or equal to”
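As a rough illustration of these operators in plain R (not the Rguroo dialog itself), the sketch below filters a small, made-up data frame with the base R function subset(). The column names follow the nycflights codebook, but the rows are hypothetical.

```r
# Hypothetical toy data frame; the values are made up
flights <- data.frame(
  dest = c("LAX", "SFO", "LAX", "ORD"),
  month = c(1, 2, 7, 2),
  dep_delay = c(12, -3, 45, 0)
)

lax_flights <- subset(flights, dest == "LAX")  # equal to
not_lax <- subset(flights, dest != "LAX")      # not equal to
late <- subset(flights, dep_delay > 10)        # greater than
first_half <- subset(flights, month <= 6)      # less than or equal to

print(lax_flights)
```

Character values such as "LAX" need quotation marks, while numbers such as 10 or 6 do not.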

Save your new data subset as lax_flights.

Graphical and numerical summaries of a subset

Once you have saved a subset in Rguroo, you can use it as an Rguroo dataset. For example, you can make a histogram of the departure delays of only LAX flights, using the lax_flights dataset that you created.

Histogram dialog box to plot the departure delays for flights to LAX

You can also obtain a custom numerical summary of the departure delays for these flights by going to the Data toolbox, clicking Functions, and then selecting Summary Statistic:

Summary Statistic dialog box

Summary statistics: Some useful summary statistics for a single numerical variable are as follows:

Mean
Median
Standard Deviation
Variance
IQR (interquartile range)
Minimum
Maximum

See if you can find each of those summary statistics in the Rguroo dialog. Check the boxes for those statistics, then view the output. Note that you will have to click the + sign next to “All Observations” to view the values.
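For comparison, each of these statistics has a counterpart in base R. The sketch below applies them to a short vector of made-up delays. Note that R's IQR() uses a particular quantile rule by default, so a value computed by other software, possibly including Rguroo, may differ slightly on small samples.

```r
# Hypothetical departure delays in minutes (made up for illustration)
lax_delays <- c(-4, -1, 0, 2, 5, 9, 30)

summary_stats <- c(
  mean = mean(lax_delays),
  median = median(lax_delays),
  sd = sd(lax_delays),         # standard deviation
  variance = var(lax_delays),  # square of the standard deviation
  iqr = IQR(lax_delays),       # third quartile minus first quartile
  minimum = min(lax_delays),
  maximum = max(lax_delays)
)

print(round(summary_stats, 2))
```

In this toy example the single large delay of 30 minutes pulls the mean well above the median, which is the kind of pattern to look for when deciding which statistics describe a skewed distribution best.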

Filtering based on multiple criteria

You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February. Again, you can use the dataset editor to filter the data or Rguroo’s subset function. We show both methods in the subsections below.

Multiple criteria filtering using the dataset editor

To filter the nycflights dataset and obtain a subset that includes only data values with the destination San Francisco (SFO) in February, you can use the dataset editor and follow these steps:

Open the dataset editor by going to the Data toolbox and double-clicking on the name of the dataset, nycflights. Alternatively, you can right-click on the dataset name and select the Edit option.
In the dataset editor, locate the filter tab, which is vertically displayed on the right-hand side of the editor. You will see a list of the names of the variables.
Expand the dest dropdown menu to display the list of destinations and uncheck the (Select All) checkbox to remove all current selections.
Check the checkbox next to SFO. You can also type “SFO” in the search box to quickly find and select it.
Similarly, expand the month dropdown menu to display the list of months, uncheck the (Select All) checkbox to remove all current selections, then check the checkbox next to 2.
Save the newly created subset of data as sfo_feb_flights by typing the name in the Save As textbox.

Multiple criteria filtering

Multiple criteria filtering using the Subset function

If you have already used the dataset editor to subset your data and do not wish to learn about Rguroo’s Subset function, you can skip this subsection and proceed to the next subsection.

To use Rguroo’s Subset function to filter the nycflights dataset and obtain a subset that includes only data values with the destination San Francisco (SFO) in February, you need to find flights that are headed to SFO AND are in February. This consists of a combination of two logical expressions. To set this up, in the Data toolbox, click Functions, select Subset, select the nycflights dataset, then click Logical Expression, just like we did earlier. First, we’ll put in a logical expression for flights headed to SFO, and then we’ll put in another logical expression for the flights in February (month 2). Note that we’ll have to type in both 'SFO' and the number 2 in the Value column, as shown in the figure below.

Combining logical expressions

Once you have each of the logical expressions in place, you need to combine them with an AND. To combine logical expressions, we use the Logical Expression Calculator at the bottom of the dialog box. Click the down arrow in the Pick column for the logical expression corresponding to the SFO flights to move it to the Logical Expression Calculator. Then, click the AND button, and finally move the logical expression for February down to the Logical Expression Calculator. When you’re done, click Done and click the Preview button to view the output. Save the new output as sfo_feb_flights.

About the Logical Expression Calculator: The Logical Expression Calculator can be used to combine logical expressions using the logical operations AND, OR, or NOT. For example, if you are interested in flights that are either headed to SFO or in February, you would combine the two expressions using the OR button.
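In plain R, the AND, OR, and NOT operations correspond to the operators &, |, and !. The sketch below applies them to a small, made-up data frame, not the real nycflights data; the object names are hypothetical.

```r
# Hypothetical toy data frame; the values are made up
flights <- data.frame(
  dest = c("SFO", "SFO", "LAX", "ORD"),
  month = c(2, 5, 2, 9)
)

# Two separate logical expressions
is_sfo <- flights$dest == "SFO"
is_feb <- flights$month == 2

sfo_and_feb <- flights[is_sfo & is_feb, ]  # AND: both conditions hold
sfo_or_feb <- flights[is_sfo | is_feb, ]   # OR: at least one condition holds
not_sfo <- flights[!is_sfo, ]              # NOT: the condition does not hold

print(nrow(sfo_and_feb))
print(nrow(sfo_or_feb))
print(nrow(not_sfo))
```

An AND filter can never return more rows than either of its parts, while an OR filter can never return fewer, which is a handy sanity check on the row counts.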

How many flights are headed to SFO in February? Hint: The answer is less than 100, so you can simply look at the number of rows in the Data Viewer.
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can use the Summary Statistic dialog on the sfo_feb_flights dataset but tell Rguroo to group the departure delays by the origin airport by selecting the variable origin in the Factor 1 dropdown menu, as shown below.

Summary Statistic dialog to get summary by flight origin

To view the summary statistics of departure delays based on the origin airport, click the + sign to the left of origin in the output:

Rguroo output of Summary Statistic
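The idea of summarizing one variable within the groups of another can be sketched in base R with tapply(), which applies a function to the values of the first argument separately for each level of the second. The data frame below is made up for illustration and does not reflect the actual SFO flights.

```r
# Hypothetical toy data frame; the values are made up
sfo_feb <- data.frame(
  origin = c("JFK", "JFK", "EWR", "EWR", "EWR"),
  dep_delay = c(10, 20, -5, 0, 35)
)

# One summary value per origin airport
mean_by_origin <- tapply(sfo_feb$dep_delay, sfo_feb$origin, mean)
median_by_origin <- tapply(sfo_feb$dep_delay, sfo_feb$origin, median)

print(mean_by_origin)
print(median_by_origin)
```

Swapping mean for IQR or sd in the last argument gives a measure of spread for each group instead of a measure of center.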

Use the Summary Statistic dialog to calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

Departure delays by month

Which month would you expect to have the highest average delay departing from an NYC airport?

Let’s think about how you could answer this question. We need to do three things:

group by months
summarize the average departure delays for each month
sort the months based on their average departure delays

We now know how to do the first two of these things, but there’s a bit of a problem. When we open the Summary Statistic dialog, we can’t select month as Factor 1. The reason for this is that Rguroo has classified month as a numerical variable, not a categorical variable. To override this classification, we need to right-click the nycflights dataset and select Open to open nycflights in the Dataset Editor (alternatively, you can double-click the name of the dataset). Click the Variable Properties icon and, in the dialog that pops up, select Variable Properties. Select the month variable and change its Type to Nominal, as shown below.

Data Editor to modify variable types

Because you have made changes to the dataset after creating and saving output based on the dataset, you cannot overwrite the original dataset. Use the Save As textbox at the top of the tab to save the new dataset as nycflights1. Now open a new Summary Statistic dialog, select the dataset nycflights1, and month will now appear as a Factor 1 option, as shown in the dialog below.

Summary Statistic dialog to get summary by month

Use the Save As textbox at the top of the output tab to save the summary as nycflights_month. We’ll use this dataset to perform the last step of sorting the dataset. In the Data toolbox, click Functions, then select Sort. Select nycflights_month as the Dataset. Click the plus sign to add a variable to sort by. We want to sort by the Variable Mean in Descending Order:

Data Sort dialog box

We obtain the following output:

Data summary sorted in descending order by the Mean

Note that there are actually 13 rows in this output: one for each month and one for All Observations. While this is not always useful, in this case we can also easily see which months were above or below the average departure delay for the entire dataset.
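The three steps (group by month, summarize, sort) can also be sketched in a few lines of base R. The data below are made up and cover only three months, so the ordering says nothing about the real flights. Converting month with factor() plays the same role as changing the variable's Type to Nominal.

```r
# Hypothetical toy data frame; the values are made up
flights <- data.frame(
  month = c(1, 1, 7, 7, 12, 12),
  dep_delay = c(5, 15, 30, 50, 20, 22)
)

# Treat month as a categorical (nominal) variable rather than a number
flights$month <- factor(flights$month)

# Steps 1 and 2: group by month and compute the mean delay per group
mean_by_month <- tapply(flights$dep_delay, flights$month, mean)

# Step 3: sort the months from highest to lowest mean delay
sorted_means <- sort(mean_by_month, decreasing = TRUE)

# Overall mean, comparable to the "All Observations" row
overall_mean <- mean(flights$dep_delay)

print(sorted_means)
print(overall_mean)
```

Comparing each entry of sorted_means with overall_mean shows which months sit above or below the average for the whole dataset.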

Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

On time departure rate for NYC airports

Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Also suppose that for you, a flight that is delayed for less than 5 minutes is basically “on time.” You consider any flight delayed for 5 minutes or more to be “delayed”.

In order to determine which airport has the best on time departure rate, you can

first classify each flight as “on time” or “delayed”,
then group flights by origin airport,
then calculate on time departure rates for each origin airport,
and finally sort the airports in descending order for on time departure percentage.

Let’s start with classifying each flight as “on time” or “delayed” by creating a new variable. In the Data toolbox, click Functions and select Transform. Then add a new variable called dep_type to the nycflights Dataset using the same sequence we learned in the previous lab. First, click the plus sign to add the variable, then give it an informative name (dep_type), then type some code in the middle box to indicate how to create dep_type. In this case, the code ifelse(dep_delay < 5, "on time", "delayed") indicates to classify the flight as "on time" if the flight is delayed by less than 5 minutes and "delayed" if not, i.e., if the flight is delayed by 5 or more minutes.

Creating the variable dep_type using the Transform function

Click on the Preview button and Save the new dataset as nycflights_ontime.
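To check how the ifelse() rule treats values near the cutoff, you can try it on a few made-up delays in plain R. The values below are hypothetical and chosen to sit on both sides of 5 minutes.

```r
# Hypothetical departure delays in minutes, including the boundary value 5
dep_delay <- c(-3, 0, 4, 5, 42)

# "on time" if delayed by less than 5 minutes, "delayed" otherwise
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

print(data.frame(dep_delay, dep_type))
```

Because the condition uses a strict less-than, a flight delayed by exactly 5 minutes is classified as "delayed", and flights that leave early (negative delays) count as "on time".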

With the new dataset saved, we can find the proportion of on-time flights out of each airport. In the Analytics toolbox, select Tabulation → Categorical and then select the dataset nycflights_ontime. Click the plus sign to add a table. We want to find the distribution of dep_type conditional on origin, so select those variables as Factor 1 and Factor 2, respectively. Check the conditional box labeled as Cond. and check the Proportions box to tell Rguroo to calculate conditional proportions.

Tabulating New York flights based on dep_type and origin
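A conditional proportion table of this kind can be sketched in base R with table() and prop.table(). The rows below are made up, so the resulting rates are purely illustrative and say nothing about the real airports. Using margin = 1 makes each row (each origin) sum to 1, which is what "conditional on origin" means here.

```r
# Hypothetical toy data frame; the values are made up
flights <- data.frame(
  origin = c("EWR", "EWR", "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  dep_type = c("on time", "delayed",
               "on time", "on time", "on time", "delayed",
               "on time", "on time", "delayed", "delayed")
)

# Counts of each departure type within each origin airport
counts <- table(flights$origin, flights$dep_type)

# Proportions within each row, i.e. conditional on origin
props <- prop.table(counts, margin = 1)

# On time rate per airport, sorted from best to worst
on_time_rate <- sort(props[, "on time"], decreasing = TRUE)

print(round(props, 2))
print(on_time_rate)
```

Checking that every row of props sums to 1 is a quick way to confirm that the proportions were conditioned on the intended variable.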

If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?

You can also visualize the distribution of on time departure rates across the three airports using a segmented bar plot.

Bar graph of New York flights based on dep_type and origin

You may want to play around with changing from Side by side to Stacked and/or changing from Counts to Proportions to find the best visualization.

More Practice

Using the Transform function, add a new variable to the dataset that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance.
Replicate the following plot. Hint: The plot shows only flights from American Airlines (AA), Delta Air Lines (DL), and United Air Lines (UA), and the points are colored based on the Factor variable carrier. Rather than using Subset, you can use the Factor Level Editor attached to the plot to drop the unused levels and change the colors/symbols. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

Scatterplot of arrival delays as a function of departure delays

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Rguroo.com, the Rguroo.com logo, and all other trademarks, service marks, graphics and logos used in connection with Rguroo.com or the Website are trademarks or registered trademarks of Soflytics Corp. in the USA and other countries and are not included under the CC-BY-SA license.