The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through many different variables for countries around the globe. It serves as a rough objective measure for the relationships between different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The report is co-published annually by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
In this lab, you’ll be analyzing data from Human Freedom Index reports for 2016. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
The data we’re working with is in the OpenIntro Dataset Repository and it’s called hfi, short for Human Freedom Index.
This dataset has a lot of variables, but we are only interested in a few. There are also nine years of data in this dataset, but we are only interested in 2016. We can use Rguroo’s Subset function to obtain only the variables of interest for 2016, but it is simpler to use the built-in filters in the Data Editor.
In the Data toolbox, right-click the dataset hfi and select Edit.
On the very right, select the Filters tab and open the year dropdown.
Uncheck (Select All) to deselect everything, then check the 2016 box so that only the year 2016 is selected.
Again on the far right, select the option to choose columns in the dataset. Uncheck the checkbox next to the Search textbox to deselect everything, then check the boxes next to the variables we will use in this lab: pf_score, pf_expression_control, hf_score, and countries. It may be easier to search for these variables individually in the search box than to scroll through the entire list.
In the Save As textbox, enter hfi_2016 as the name of the dataset. Finally, click the Save As button to save the dataset.
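For readers who want to see the same filtering logic outside Rguroo, here is a minimal Python sketch. It assumes the data has been loaded as a list of dictionaries; the rows, their values, and the extra `region` column are made up for illustration, and `subset_year` is a hypothetical helper, not an Rguroo function.

```python
# Hypothetical rows standing in for the hfi dataset; every value below is made up.
rows = [
    {"year": 2016, "countries": "Country A", "region": "North", "pf_score": 7.5,
     "pf_expression_control": 5.0, "hf_score": 7.2},
    {"year": 2015, "countries": "Country A", "region": "North", "pf_score": 7.4,
     "pf_expression_control": 5.0, "hf_score": 7.1},
    {"year": 2016, "countries": "Country B", "region": "South", "pf_score": 8.9,
     "pf_expression_control": 8.5, "hf_score": 8.3},
]

# The four variables used in this lab.
KEEP = ["countries", "pf_score", "pf_expression_control", "hf_score"]


def subset_year(rows, year, columns):
    """Keep only the rows from `year`, and only the listed columns."""
    return [{col: row[col] for col in columns} for row in rows if row["year"] == year]


hfi_2016 = subset_year(rows, 2016, KEEP)
for row in hfi_2016:
    print(row)
```

The row filter plays the role of the year filter in the Data Editor, and the column list plays the role of the column checkboxes.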
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient. In the Rguroo tab showing your graph, click and check the Show Correlation and LS Equation box.
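To make the correlation coefficient less of a black box, the sketch below computes it directly from its definition. The `x` and `y` values are made-up illustration data, not values from hfi, and `correlation` is a hypothetical helper name.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data with a strong positive linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = correlation(x, y)
print(round(r, 3))
```

Values of the coefficient near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.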
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. However, it is quite easy to obtain the model in Rguroo. In your plot, click and check the LS Line button. Your plot should now show the least squares line and, if you also checked the box earlier to show the correlation, the equation of the line should be shown above the plot.
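The idea of "the line that minimizes the sum of squared residuals" can be illustrated with a short Python sketch. It uses the standard closed-form formulas for simple linear regression on made-up data; the function names are hypothetical, and this is not a description of how Rguroo computes the line internally.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Made-up illustration data, not taken from hfi.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = least_squares(x, y)
best = sum_squared_residuals(x, y, intercept, slope)

# Nudging the intercept or the slope away from the fit increases the sum.
shifted = sum_squared_residuals(x, y, intercept + 0.1, slope)
tilted = sum_squared_residuals(x, y, intercept, slope + 0.05)
print(best < shifted, best < tilted)
```

Any other candidate line tried by hand yields a larger sum of squared residuals than the fitted line, which is why trial and error is unnecessary.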
To get more information about the line, in the Analytics toolbox section, select Linear Regression then Simple and Multiple Regression. In the dialog, select hfi_2016 as the Dataset. You can specify the Response variable for our model by either typing pf_score in the text box or selecting pf_score from the dropdown menu. The Formula box specifies the explanatory variable(s) in the model. You can either type pf_expression_control in the box, or double-click pf_expression_control in the Variables list on the right to move it into the box automatically.
Once your dialog looks like the screenshot below, Preview the output.
The default Rguroo output contains a table reporting the number of cases (rows) used in the analysis, the Regression Coefficients Estimates table, a Model Summary giving additional information about the model, an Analysis of Variance table, and two diagnostic plots (Residuals Versus Fit and Normal Probability Plot: Residuals) useful for checking conditions for inference.
We will start by investigating the Regression Coefficients Estimates table:
The first column in this table identifies each Term in our model. A simple linear regression model has two terms: a y-intercept and a term containing the predictor variable. So according to the Term column, the first row contains information about the y-intercept term and the second row contains information about the pf_expression_control term. The second column displays the actual values of the y-intercept and the slope coefficient corresponding to pf_expression_control. With this table, we can write down the equation of the least squares regression line for the linear model:
\[ \widehat{pf\_score} = 4.2838 + 0.54185 \times pf\_expression\_control \]
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). This value can be found in the table helpfully labeled Model Summary: Coefficient of Determination (R-Squared). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 71.4% of the variability in pf_score is explained by pf_expression_control.
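The definition of \(R^2\) can be checked numerically. The sketch below, on made-up data rather than hfi, computes \(R^2\) as one minus the ratio of the residual sum of squares to the total sum of squares, and confirms that in simple linear regression it equals the squared correlation coefficient. All variable names are illustrative.

```python
from math import sqrt

# Made-up illustration data, not taken from hfi.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# Least squares fit.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variability left over after the fit versus total variability in y.
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
sst = syy
r_squared = 1 - sse / sst

# Squared correlation coefficient, for comparison.
r = sxy / sqrt(sxx * syy)
print(round(r_squared, 4), round(r ** 2, 4))
```

This identity holds only for a model with a single explanatory variable; with several predictors, \(R^2\) is still defined by the sums of squares but no longer matches any single pairwise correlation.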
In Rguroo, the model and report need to be saved separately. Click the Save button, then save both the Model and Report as hfi_model1.
Go back to the scatterplot with the least squares line laid on top.
This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. Predictions made within the range of the data are more reliable. Predictions at the observed values of \(x\) are also used to compute the residuals: each residual is the observed \(y\) minus the predicted \(y\).
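As a small illustration of prediction and residuals, the Python sketch below plugs a value into the fitted equation reported in the Regression Coefficients Estimates table. The coefficients come from that equation; the input rating of 5 and the observed score of 7.5 are made up, and `predict_pf_score` and `residual` are hypothetical helper names.

```python
# Coefficients from the least squares equation reported in this lab.
INTERCEPT = 4.2838
SLOPE = 0.54185


def predict_pf_score(pf_expression_control):
    """Predicted pf_score from the fitted line."""
    return INTERCEPT + SLOPE * pf_expression_control


def residual(observed_pf_score, pf_expression_control):
    """Observed value minus predicted value."""
    return observed_pf_score - predict_pf_score(pf_expression_control)


# Made-up example: a country rated 5 with an observed pf_score of 7.5.
predicted = predict_pf_score(5.0)
leftover = residual(7.5, 5.0)
print(round(predicted, 5), round(leftover, 5))
```

A positive residual means the line underestimates the observed value, and a negative residual means it overestimates.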
If someone saw the least squares regression line and not the actual data, how would they predict the personal freedom score for a country with a 7 rating for pf_expression_control?
The United States in 2016 had a 7 rating for pf_expression_control and a pf_score of 8.75. Use the equation of your least-squares regression line to predict the personal freedom score for the United States. Is this prediction an overestimate or an underestimate, and by how much? How is your answer related to the residual for this prediction?
Now we will check your answer to the previous exercise.
Go back to your Linear Regression output for the hfi_model1 model. Click , go to the Diagnostic Indices Table section, and check Include Diagnostics Table. First, we will select countries as our ID Variable, which will make it easy to identify which country each residual corresponds to. Then click Preview to see the updated output, which now contains a table labeled Predicted Values, Residuals, and Diagnostic Indices.
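The core of such a table can be sketched in a few lines of Python. The example below assumes the fitted equation reported earlier in this lab and uses made-up countries and scores; it reproduces only the predicted value and residual columns, not the other diagnostic indices Rguroo reports.

```python
# Coefficients from the least squares equation reported in this lab.
INTERCEPT = 4.2838
SLOPE = 0.54185

# Made-up observations: (countries, pf_expression_control, pf_score).
observations = [
    ("Country A", 5.0, 7.5),
    ("Country B", 8.0, 8.0),
    ("Country C", 2.5, 5.9),
]

table = []
for name, rating, observed in observations:
    predicted = INTERCEPT + SLOPE * rating
    table.append({
        "countries": name,          # ID variable, so each residual can be traced
        "observed": observed,
        "predicted": predicted,
        "residual": observed - predicted,
    })

for row in table:
    print(row["countries"], round(row["predicted"], 4), round(row["residual"], 4))
```

Keeping the ID variable in each row is what makes it possible to look up the residual for one particular country.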
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability of residuals. We also need to check for (4) independent observations, but that is more difficult to check using model diagnostics, so we will not perform that check in this lab.
You already checked if the relationship between pf_score and pf_expression_control is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.
This plot can be found in Rguroo’s Linear Regression output as the plot labeled Residuals Versus Fit.
The red dashed line is a horizontal line at \(y = 0\) (to help us check whether the residuals are distributed around 0), and the green line is a moving average of sorts (to help us look for underlying patterns).
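The two reference lines can be mimicked numerically. In the sketch below, on made-up data, the residuals of a least squares fit sum to zero (which is why a horizontal line at 0 is the natural reference), and a plain moving average of the residuals, ordered by fitted value, stands in for the smoother. The window of 3 and the simple averaging are assumptions for illustration; Rguroo's green line may be computed differently.

```python
# Made-up illustration data, not taken from hfi.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Order the residuals by fitted value, as on the horizontal axis of the plot.
ordered = [res for _, res in sorted(zip(fitted, residuals))]


def moving_average(values, window=3):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


smooth = moving_average(ordered)
print(round(sum(residuals), 10), [round(v, 3) for v in smooth])
```

If the smoothed values drift steadily away from zero or trace a curve, that suggests a pattern the straight line has missed.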
To check the nearly normal residuals condition, we can look at a histogram of the residuals. To do this, we need to save the residuals as an Rguroo dataset. Click , open the Diagnostics Indices Table section again and Save Table as hfi_residuals.
Once you have the dataset saved, you can make a Histogram of the hfi_residuals dataset, using the Variable Residuals.
In Rguroo, it is much less work to look at a normal probability plot of the residuals, as the plot is shown in the default Rguroo output and labeled Normal Probability Plot: Residuals.
In the Normal Distribution lab, you compared the normal probability plot to plots of simulated data from known normal distributions. That is a lot of work just to check an assumption visually, so the Rguroo output helps to streamline this process. If we simulated from a normal distribution with mean 0 and standard deviation equal to the standard deviation of the residuals, we would expect 95% of the points to lie within the dashed red lines.
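The simulation idea behind the dashed lines can be sketched in Python. The example below uses made-up residuals, computes the points of a normal probability plot with the plotting positions \((i - 0.5)/n\) (one common convention), and builds a pointwise 95% band by simulating sorted normal samples. The band construction, the number of simulations, and the seed are assumptions for illustration and may not match how Rguroo draws its lines.

```python
import random
from statistics import NormalDist, stdev

# Made-up residuals for illustration, not taken from the hfi model.
residuals = [-0.62, -0.35, -0.12, -0.05, 0.08, 0.21, 0.33, 0.52]
n = len(residuals)

# Horizontal axis: standard normal quantiles; vertical axis: sorted residuals.
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
observed = sorted(residuals)


def simulated_band(n, sd, n_sims=2000, seed=1):
    """Pointwise 2.5% and 97.5% limits for each sorted position under N(0, sd)."""
    rng = random.Random(seed)
    sims = [sorted(rng.gauss(0.0, sd) for _ in range(n)) for _ in range(n_sims)]
    lower, upper = [], []
    for k in range(n):
        kth = sorted(sample[k] for sample in sims)
        lower.append(kth[int(0.025 * n_sims)])
        upper.append(kth[int(0.975 * n_sims) - 1])
    return lower, upper


lower, upper = simulated_band(n, stdev(residuals))
inside = sum(lo <= obs <= hi for lo, obs, hi in zip(lower, observed, upper))
print(f"{inside} of {n} points fall inside the simulated band")
```

Points that fall well outside such a band, especially in the tails, are the ones that cast doubt on the nearly normal residuals condition.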
To check the constant variability condition, look again at the Residuals Versus Fit plot you examined earlier.
Choose another freedom variable and a variable you think would strongly correlate with it. Create a Subset of the hfi dataset containing only those two variables and the variable countries for the year 2016. Using this subset, produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
How does this relationship compare to the relationship between pf_expression_control and pf_score? Use the \(R^2\) values from the two model summaries to compare. Does your predictor variable in this model seem to predict your response better than in the model we fit earlier? Why or why not?
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Rguroo.com, the Rguroo.com logo, and all other trademarks, service marks, graphics and logos used in connection with Rguroo.com or the Website are trademarks or registered trademarks of Soflytics Corp. or Soflytics Corp. licensors and are not included under the CC-BY-SA license.