Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
The dataset is called yrbss and can be imported to your Rguroo workspace from Rguroo’s OpenIntro
Dataset Repository. Before importing the dataset, remind yourself of the variables in the dataset by clicking on the information icon , shown to the right of the dataset name.
Dataset
yrbss, and tabulate the values for the variable text_while_driving_30d. If you need help using the Data Tabulation dialog, click the video icon Note that text_while_driving_30d has a level NA. Because this NA value represents missing data, you should click and drop the NA level for text_while_driving_30d.
What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets? Hint: Select the two relevant factors in the Data Tabulation dialog, and select the option Proportions
. Make sure to remove the NA levels for each factor using the .
Of those who never wore a helmet, what proportion have texted while driving every day in the past 30 days? Hint: Here, you are asked to compute a conditional proportion. That is, you need to find proportion of those who have texted while driving everyday in the past 30 days, conditional on never wearing a helmet. See the Data Tabulation dialog below, where the option Cond
is checked. To get a cleaner output, you can click and drop all levels of helmet_12m except for the level of interest, never.
Using Data Tabulation to get a conditional proportion
Let’s calculate the above proportions in another way, namely by subsetting the data. When we calculated the conditional proportion in the Exercise above, we were only concerned with those who had stated that they never wore a helmet. Also, it will be easier if we create a new variable, text_ind, that specifies whether the individual has texted every day while driving over the past 30 days or not. We could use Rguroo’s Subset function to get the subset with only those that never wore a helmet, and then use the Transform function on the subsetted data to create the variable text_ind. However, by using the Transform function cleverly, we can accomplish both in the Transform function.
Go to the Data toolbox and select Functions \(\rightarrow\) Transform. Select the Dataset
yrbss, click the plus sign , and create the variable text_ind by typing in the following code in the center panel:
ifelse(text_while_driving_30d == "30", "yes", "no")
Click the plus icon again, and add a another variable, call it never_only, and in the middle panel type the following R code:
ifelse(helmet_12m == "never", "never", NA)
The never_only variable will have the value “never” whenever the value of the variable helmet_12m is never, and it will be NA otherwise. NA is the missing data code in R. So, by selecting Complete Cases Only
in the Data Transform dialog, all the cases that are NA will be removed when you preview the data. Thus, in our example, only cases with the value of never for the helmet_12m variable remain. This is a trick in getting a subset of a dataset using the Transform function.
Since we only need the text_ind variable to compute our proportion, we move all other variables to the Excluded Variable
list (See the Data Transform dialog below). We have also retained the never_only variable, mostly as a reminder. Click the Preview
button , then Save the new dataset as never_helmet.
Using Transform to subset the yrbss data and create the text_ind variable
Now, right-click on the never_helmet dataset in the Data toolbox and select Dataset Summary. Then, in the Categorical Variables table, you will see the counts of the variable text_ind for its levels yes and no. Use these values to see if you can come up with the same conditional proportion that you computed in Exercise 3.
When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population parameters. To do this, you can answer the question, “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” with a statistic; while the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.
Recall from the previous lab that we can use a confidence interval to obtain an interval estimate of a parameter. Here we construct a confidence interval for the population proportion of non-helmet wearers who have consistently texted while driving the past 30 days. To do so, we use the dataset never_helmet that we created, which consists of data only for non-helmet wearers, and the text_ind variable, which has the value “yes” for those who texted each day for the past 30 days and “no” otherwise.
To construct a confidence interval for a population proportion, open the Analytics toolbox, then click on Analysis \(\rightarrow\) Proportion Inference \(\rightarrow\) One Population. In the One Population Proportion Inference dialog, select the dataset never_helmet. Select text_ind in the Factor
dropdown and choose yes from the Success
dropdown. The three gray textboxes showing the sample size (Sample Size
), number of successes (# of Succ.
), and proportion of successes (Prop. of Succ.
) auto-fill with their corresponding value. The value for the proportion of successes should be familiar!
Rguroo has nine different options for constructing confidence intervals for a population proportion, although only three are shown in the main One Population Proportion Inference dialog. To see all options, click . In the Advanced Features dialog, select the section
Confidence Interval and Test of Hypothesis
, and in the tab Confidence Interval
you will find all nine options. Select the options Bootstrap (Percentile)
and Bootstrap (SE)
, then click the Preview
button to see the result.
Constructing confidence intervals using bootstrap methods
Rguroo’s Bootstrap Method Defaults: The default number of bootstrap samples in Rguroo is 10,000 and can be changed in the Advanced Features dialog () by changing the value in the
Replication
textbox. There, you can also change the seed from the default value of 100 in the Seed
textbox. The default confidence level is 95%, and you can change it in the One Population Proportion Inference dialog () in the
Confidence Level
textbox.
Rguroo’s Bootstrap Computations: As we saw in a previous lab, the bootstrap sample consists of simulated values from a population with a probability of success equal to the observed sample proportion \(\hat p\) (i.e., the proportion of the successes that we observe in our sample). The Bootstrap (Percentile)
option computes the lower and upper confidence levels of a \(100(1-\alpha)%\) confidence interval, respectively, from the \(\alpha/2\) and \((1-\alpha)/2\) quantiles of the simulated bootstrap values. For the option Bootstrap (SE)
, the lower and upper confidence levels are computed using the formula \(\hat p \pm z^\star \times SE_{\hat p}\), where \(z^\star\) is the \((1-\alpha)/2\) quantile of the standard normal distribution, and \(SE_{\hat p}\) is the standard deviation of the simulated bootstrap proportions (not the standard deviation given by the formula in the previous lab!). Recall that the quantity \(z^\star \times SE_{\hat p}\) is called the margin of error.
With 10,000 replications and 95% confidence level, the lower and upper confidence levels for the Percentile method are respectively the \(250^{th}\) and the \(9,750^{th}\) values in the ascending sorted bootstrap sample. For the SE method with 95% confidence level, the \(z^\star\) value is 1.96.
What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey? Hint: you should be able to figure this out using the output for the Bootstrap (SE) method.
Using Rguroo’s One Population Proportion Inference, calculate confidence intervals for two other categorical variables. You’ll need to decide which level to call “success”, and report the associated margins of error. Interpret the intervals in context of the data. It may be helpful to create new data sets for each of the two categorical variables first, and then use these data sets to construct the confidence intervals.
Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).
Import the dataset values_of_p from Rguroo’s OpenIntro
Dataset Repository. Double-click on the dataset name to view the data. This dataset consists of a single variable “p” with 101 values that start from 0 and go to 1 in increments of 0.01 (the values are 0, 0.01, 0.02, …, 0.99, 1). Now, open the Transform function from the Data toolbox, select the Dataset
values_of_p, click , and define a new variable called ME. Now, use the Transform function to compute the margin of error by typing the following R code in the center box:
= 1000
n = 1.96 * sqrt(p * (1 - p)/n) ME
Since the sample size is irrelevant to this discussion, we set it to \(n = 1000\). Click the Preview
button . You should see the new variable ME in the dataset. Save this dataset with the name margin_of_error.
Computing margin of error as a function of p
Lastly, you can plot the two variables against each other to reveal their relationship. To do so, go to the Plots toolbox and select Create Plots \(\rightarrow\) Scatterplot. Select the Dataset
margin_of_error, then set Predictor (x)
to p and Response (y)
to ME. Set the X-Axis
label to read Population Proportion and the Y-Axis
label to read Margin of Error. Finally, look in the Superimpose
section of the dialog and check the option Line Graph
. If you want to remove the dots from the graph, click , and in the top menu
Attributes of Scatterplot Points, LS Line, LOESS and Identified Points
find the Point-Line
tab and uncheck the box Show Points
; however, this is not necessary.
Graphing the relationship between p and margin of error
In this section of the lab, you will conduct a hypothesis test to compare two proportions. In such cases, you have a response variable that is categorical, and an explanatory variable that is also categorical, and you are comparing the proportion of success of the response variable across the levels of the explanatory variable.
As a first step to answering this question using Rguroo, we should use the Transform function to create two variables from the yrbss data that are indicators for sleeping 10+ hours and strength training every day. Call the variables sleep_10_plus and train_every_day. For sleep_10_plus, in the center box type the R code:
ifelse(school_night_hours_sleep == "10+", "sleep_10_plus_hours","not_sleep_10_plus_hours")
For train_every_day, in the center box type the R code:
ifelse(strength_training_7d == 7, "train_every_day", "not_train_every_day")
Using Transform to obtain a subset of the yrbss dataset
Save the transformed dataset as training_and_sleep.
In the Analytics toolbox, select Proportion Inference \(\rightarrow\) Two Populations. Fill out the Two Population Proportion Inference dialog as shown in the screenshot below. Once you select the appropriate levels for Population 1 and Population 2, you will need to click the buttons to get the correct numbers to appear.
Entering data to test a hypothesis about a difference of two population proportions.
Click on the Test of Hypothesis
tab, enter the appropriate alternative hypothesis, select Permutation Test
, and Preview
the result. In the output you will find the p-value.
Specifying the test method and the alternative hypothesis
Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? Write out your conclusion based on the results of your hypothesis test.
Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probablity that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.
For inference on proportions, the sample proportion can be assumed to have a nearly normal sampling distribution if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes you wonder: what’s so special about the number 10?
The short answer is: nothing. You could argue that you would be fine with 9 or that you really should be using 11. What is the “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) both reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on the normal approximation.
We should emphasize the importance of investigating your assumptions about the sampling distribution before making inference. If your assumptions are not met, then the results of your inference will be misleading. For example, if the sampling distribution of the sample proportion is not close to normal, the result of a confidence interval or hypothesis test procedure based on the normal approximation will not be very accurate.
You can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. Let’s see how we can visualize the sampling distribution of \(\hat p\) for a given sample size and a value of the population proportion \(p\). The easiest approach is to simulate values from the so-called Bernoulli random variable.
A Bernoulli random variable has an associated success probability \(p\); we denote this random variable by \(Bernoulli(p)\). Each simulated value from a \(Bernoulli(p)\) random variable, is either 1 (success) with probability \(p\) or 0 (failure) with probability \(1-p\). Say we would like to investigate the sampling distribution of the sample proportion \(\hat p\), for a sample of size \(n=300\) and population probability of success \(p = 0.1\). Let’s begin by simulating one sample of size 300.
In the Probability-Simulation toolbox, select Probability \(\rightarrow\) Multiple Distribution Generator. In the Multiple Distribution Random Generator dialog that opens click to define your simulation. Change the name Sample_1 to Bern (this name can be anything, as long as it has no spaces). The Multiple Distribution Generator function allows simulating from various types of random variables. For our example, select Bernoulli from the
Distribution
dropdown, set the Prob. of Success
to 0.1, and set the Sample Size
to 300. Here we have kept the default Seed
value of 100, but you should change it to another number of your choice. Click the Preview
button to see the result.
Generating a sample of size 300 from Bernoulli(p=0.1)
You should have 300 rows of data (by default 100 is shown), with most of the values (approximately 90%) being 0, and the others (approximately 10%) being 1. Thus, we have simulated one sample of size 300 from a population with a proportion of successes 0.1.
Now click the button to reopen the Multiple Distribution Random Generator dialog. Leave everything as is, except change the value of
Replications
from 1 to 5, and click the Preview
button . This time you should see five columns labeled Bern_1, Bern_2 … Bern_5 with each column being a sample of size 300 from \(Bernoulli(p=0.1)\).
Now let’s have Rguroo compute the sample proportions \(\hat p\) for each of the samples. Again, click the button to reopen the Multiple Distribution Random Generator dialog. Click the
Custom Statistic
button at the bottom of the dialog to open the Custom Statistic dialog. Click and change the name Statistic_1 to p_hat, and in the text area to the right type in the following R code:
sum(x) / length(x)
The command sum(x)
adds all the values in each sample (in our example the zeros and ones), thus its result is the total number of successes for each sample. The command length(x)
counts the number of values in each sample, which is the sample size. Dividing these two quantities give the sample proportion of successes. Click the Preview
button to see the result.
Computing the sample proportion by creating a Custom Statistic
The result is five values, under the variable name Bern_p_hat that are the proportions of successes (or 1’s) for each of the five samples that we simulated. What are these values close to?
Now that we know how to simulate sample proportions \(\hat p\), we can investigate its sampling distribution for various values of \(n\) and \(p\) by simulating a large number of \(\hat p\)’s.
Click to reopen the dialog, increase the number of
Replications
to 1000, Preview
, and Save the resulting dataset as p_hat_n300. It’s best to Save the process as well; call it simulating_p_hat. When you save the process, you can come back to the dialog box and change values of
Sample Size
(\(n\)) and Prob. of Success
(\(p\)) to see other sampling distributions.
By right-clicking on the dataset name p_hat_n300 and selecting Dataset Summary, you can see the mean and standard deviation of the variable Bern_p_hat. Compare these values to the values of \(p\) and \(SE = \sqrt{p(1-p)/n}\).
To see the shape of the distribution, in the Plots toolbox select Create Plot \(\rightarrow\) Dotplot. In the Dotplot dialog select the Dataset
p_hat_n300 and in the Numerical Variables
section of the dialog move the variable name Bern_p_hat to the Selected
column. Click the Preview
button to see the distribution of \(\hat p\) for \(n=300\) and \(p=10\).
Creating a dotplot of the simulated p-hat values
Describe the sampling distribution of sample proportions at \(n = 300\) and \(p = 0.1\). Be sure to note the center, spread, and shape.
Keep \(n=300\) constant and change \(p\). How does the shape, center, and spread of the sampling distribution vary as \(p\) changes? For example try the following values for \(p\): 0.005, 0.01, 0.2, 0.5, 0.7, 0.99, and 0.995. Hint: Go back to the simulating_p_hat simulation that you saved to the Probability-Simulation toolbox. For each of the above values of \(p\), click and create a new Bernoulli
Distribution
variable with Sample Size
equal to 300, 1000 Replications
, and Prob. of Success
equal to the value of \(p\). Make sure to click Custom Statistic
and add the p_hat variable (= sum(x)/length(x)
) before clicking the button to add the next simulation. If you have added all seven simulations correctly, when you
Preview
the result, you should see eight columns of data, each corresponding to one of the probability values.
How to see all plots simultaneously: If you want to see all of the distributions at the same time, use the option Stacked with Sample IDs
in the Multiple Distribution Random Generator dialog box, and plot the values by the Factor Variable
ID in the Dotplot function.
Prob. of Success
\(p = 0.1\) and try the following values of \(n\) (Sample Size
): 10, 50, 100, 300, and 1000. Again, if you open up the original simulating_p_hat simulation and add all five simulations correctly, when you Preview
Earlier in the lab, we performed a hypothesis test to determine if those who sleep 10+ hours per day are more likely to strength train every day of the week. Before we perform inference about the difference of population proportions, we should investigate the assumptions in each population separately. However, in this case, we do not know or hypothesize a value for \(p\), so instead we use an estimate of \(p\) to check our assumptions. A natural estimate of the value of \(p\) is the pooled proportion of success; that is, the overall proportion of success in the sample looking at both groups combined.
To obtain a confidence interval for the difference in the proportion of people who strength train every day between those who sleep 10+ hours and those who do not, we can select the Confidence Interval
tab and check the method Bootstrap (Percentile)
.
Obtaining a bootstrap confidence interval for difference of two proportions
Report, and write a sentence interpreting, a 95% confidence interval for the difference in population proportions of people who strength train every day of the week between those who sleep 10+ hours per day and those who do not.
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between \(p\) and margin of error. This question does not require using a dataset.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Rguroo.com, the Rguroo.com logo, and all other trademarks, service marks, graphics and logos used in connection with Rguroo.com or the Website are trademarks or registered trademarks of Soflytics Corp. in the USA and other countries and are not included under the CC-BY-SA license.