Inference for categorical data

Getting Started

The data

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

The dataset is called yrbss and can be imported to your Rguroo workspace from Rguroo’s OpenIntro Dataset Repository. Before importing the dataset, remind yourself of the variables in the dataset by clicking on the information icon info , shown to the right of the dataset name.

What are the counts within each category for the number of days these students have texted while driving within the past 30 days? Hint: Go to the Analytics toolbox, select Analysis \(\rightarrow\) Tabulation \(\rightarrow\) Categorical, select the Dataset yrbss, and tabulate the values for the variable text_while_driving_30d. If you need help using the Data Tabulation dialog, click the video icon here or on the dialog box to watch the video tutorial Tabulation and Distribution of Categorical Data.

Note that text_while_driving_30d has a level NA. Because this NA value represents missing data, you should click Level Editor and drop the NA level for text_while_driving_30d.

What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets? Hint: Select the two relevant factors in the Data Tabulation dialog, and select the option Proportions. Make sure to remove the NA levels for each factor using the .
Of those who never wore a helmet, what proportion have texted while driving every day in the past 30 days? Hint: Here, you are asked to compute a conditional proportion. That is, you need to find proportion of those who have texted while driving everyday in the past 30 days, conditional on never wearing a helmet. See the Data Tabulation dialog below, where the option Cond is checked. To get a cleaner output, you can click and drop all levels of helmet_12m except for the level of interest, never.

Using Data Tabulation to get a conditional proportion

Let’s calculate the above proportions in another way, namely by subsetting the data. When we calculated the conditional proportion in the Exercise above, we were only concerned with those who had stated that they never wore a helmet. Also, it will be easier if we create a new variable, text_ind, that specifies whether the individual has texted every day while driving over the past 30 days or not. We could use Rguroo’s Subset function to get the subset with only those that never wore a helmet, and then use the Transform function on the subsetted data to create the variable text_ind. However, by using the Transform function cleverly, we can accomplish both in the Transform function.

Go to the Data toolbox and select Functions \(\rightarrow\) Transform. Select the Dataset yrbss, click the plus sign plus , and create the variable text_ind by typing in the following code in the center panel:

ifelse(text_while_driving_30d == "30", "yes", "no")

Click the plus icon plus again, and add a another variable, call it never_only, and in the middle panel type the following R code:

ifelse(helmet_12m == "never", "never", NA)

The never_only variable will have the value “never” whenever the value of the variable helmet_12m is never, and it will be NA otherwise. NA is the missing data code in R. So, by selecting Complete Cases Only in the Data Transform dialog, all the cases that are NA will be removed when you preview the data. Thus, in our example, only cases with the value of never for the helmet_12m variable remain. This is a trick in getting a subset of a dataset using the Transform function.

Since we only need the text_ind variable to compute our proportion, we move all other variables to the Excluded Variable list (See the Data Transform dialog below). We have also retained the never_only variable, mostly as a reminder. Click the Preview button eye , then Save the new dataset as never_helmet.

Using Transform to subset the yrbss data and create the text_ind variable

Now, right-click on the never_helmet dataset in the Data toolbox and select Dataset Summary. Then, in the Categorical Variables table, you will see the counts of the variable text_ind for its levels yes and no. Use these values to see if you can come up with the same conditional proportion that you computed in Exercise 3.

Confidence Intervals for Proportions

When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population parameters. To do this, you can answer the question, “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” with a statistic; while the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.

Recall from the previous lab that we can use a confidence interval to obtain an interval estimate of a parameter. Here we construct a confidence interval for the population proportion of non-helmet wearers who have consistently texted while driving the past 30 days. To do so, we use the dataset never_helmet that we created, which consists of data only for non-helmet wearers, and the text_ind variable, which has the value “yes” for those who texted each day for the past 30 days and “no” otherwise.

To construct a confidence interval for a population proportion, open the Analytics toolbox, then click on Analysis \(\rightarrow\) Proportion Inference \(\rightarrow\) One Population. In the One Population Proportion Inference dialog, select the dataset never_helmet. Select text_ind in the Factor dropdown and choose yes from the Success dropdown. The three gray textboxes showing the sample size (Sample Size), number of successes (# of Succ.), and proportion of successes (Prop. of Succ.) auto-fill with their corresponding value. The value for the proportion of successes should be familiar!

Rguroo has nine different options for constructing confidence intervals for a population proportion, although only three are shown in the main One Population Proportion Inference dialog. To see all options, click Details . In the Advanced Features dialog, select the section Confidence Interval and Test of Hypothesis, and in the tab Confidence Interval you will find all nine options. Select the options Bootstrap (Percentile) and Bootstrap (SE), then click the Preview button eye to see the result.

Constructing confidence intervals using bootstrap methods

Rguroo’s Bootstrap Method Defaults: The default number of bootstrap samples in Rguroo is 10,000 and can be changed in the Advanced Features dialog ( Details ) by changing the value in the Replication textbox. There, you can also change the seed from the default value of 100 in the Seed textbox. The default confidence level is 95%, and you can change it in the One Population Proportion Inference dialog ( Basics ) in the Confidence Level textbox.

Rguroo’s Bootstrap Computations: As we saw in a previous lab, the bootstrap sample consists of simulated values from a population with a probability of success equal to the observed sample proportion \(\hat p\) (i.e., the proportion of the successes that we observe in our sample). The Bootstrap (Percentile) option computes the lower and upper confidence levels of a \(100(1-\alpha)%\) confidence interval, respectively, from the \(\alpha/2\) and \((1-\alpha)/2\) quantiles of the simulated bootstrap values. For the option Bootstrap (SE), the lower and upper confidence levels are computed using the formula \(\hat p \pm z^\star \times SE_{\hat p}\), where \(z^\star\) is the \((1-\alpha)/2\) quantile of the standard normal distribution, and \(SE_{\hat p}\) is the standard deviation of the simulated bootstrap proportions (not the standard deviation given by the formula in the previous lab!). Recall that the quantity \(z^\star \times SE_{\hat p}\) is called the margin of error.

With 10,000 replications and 95% confidence level, the lower and upper confidence levels for the Percentile method are respectively the \(250^{th}\) and the \(9,750^{th}\) values in the ascending sorted bootstrap sample. For the SE method with 95% confidence level, the \(z^\star\) value is 1.96.

What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey? Hint: you should be able to figure this out using the output for the Bootstrap (SE) method.
Using Rguroo’s One Population Proportion Inference, calculate confidence intervals for two other categorical variables. You’ll need to decide which level to call “success”, and report the associated margins of error. Interpret the intervals in context of the data. It may be helpful to create new data sets for each of the two categorical variables first, and then use these data sets to construct the confidence intervals.

How does the proportion affect the margin of error?

Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).

Import the dataset values_of_p from Rguroo’s OpenIntro Dataset Repository. Double-click on the dataset name to view the data. This dataset consists of a single variable “p” with 101 values that start from 0 and go to 1 in increments of 0.01 (the values are 0, 0.01, 0.02, …, 0.99, 1). Now, open the Transform function from the Data toolbox, select the Dataset values_of_p, click plus , and define a new variable called ME. Now, use the Transform function to compute the margin of error by typing the following R code in the center box:

n = 1000
ME = 1.96 * sqrt(p * (1 - p)/n)

Since the sample size is irrelevant to this discussion, we set it to \(n = 1000\). Click the Preview button eye . You should see the new variable ME in the dataset. Save this dataset with the name margin_of_error.

Computing margin of error as a function of p

Lastly, you can plot the two variables against each other to reveal their relationship. To do so, go to the Plots toolbox and select Create Plots \(\rightarrow\) Scatterplot. Select the Dataset margin_of_error, then set Predictor (x) to p and Response (y) to ME. Set the X-Axis label to read Population Proportion and the Y-Axis label to read Margin of Error. Finally, look in the Superimpose section of the dialog and check the option Line Graph. If you want to remove the dots from the graph, click Details , and in the top menu Attributes of Scatterplot Points, LS Line, LOESS and Identified Points find the Point-Line tab and uncheck the box Show Points; however, this is not necessary.

Graphing the relationship between p and margin of error

Describe the relationship between p and ME. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size (such as \(n = 1000\)), for which value of p is the margin of error maximized?

Hypothesis Testing for Proportions

In this section of the lab, you will conduct a hypothesis test to compare two proportions. In such cases, you have a response variable that is categorical, and an explanatory variable that is also categorical, and you are comparing the proportion of success of the response variable across the levels of the explanatory variable.

Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? Write out the null and alternative hypothesis for this test.

As a first step to answering this question using Rguroo, we should use the Transform function to create two variables from the yrbss data that are indicators for sleeping 10+ hours and strength training every day. Call the variables sleep_10_plus and train_every_day. For sleep_10_plus, in the center box type the R code:

ifelse(school_night_hours_sleep == "10+", "sleep_10_plus_hours","not_sleep_10_plus_hours")

For train_every_day, in the center box type the R code:

ifelse(strength_training_7d == 7, "train_every_day", "not_train_every_day")

Using Transform to obtain a subset of the yrbss dataset

Save the transformed dataset as training_and_sleep.

In the Analytics toolbox, select Proportion Inference \(\rightarrow\) Two Populations. Fill out the Two Population Proportion Inference dialog as shown in the screenshot below. Once you select the appropriate levels for Population 1 and Population 2, you will need to click the refresh buttons to get the correct numbers to appear.

Entering data to test a hypothesis about a difference of two population proportions.

Click on the Test of Hypothesis tab, enter the appropriate alternative hypothesis, select Permutation Test, and Preview eye the result. In the output you will find the p-value.

Specifying the test method and the alternative hypothesis

Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? Write out your conclusion based on the results of your hypothesis test.
Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probablity that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.

Investigating Assumptions About the Sampling Distribution

For inference on proportions, the sample proportion can be assumed to have a nearly normal sampling distribution if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes you wonder: what’s so special about the number 10?

The short answer is: nothing. You could argue that you would be fine with 9 or that you really should be using 11. What is the “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) both reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on the normal approximation.

We should emphasize the importance of investigating your assumptions about the sampling distribution before making inference. If your assumptions are not met, then the results of your inference will be misleading. For example, if the sampling distribution of the sample proportion is not close to normal, the result of a confidence interval or hypothesis test procedure based on the normal approximation will not be very accurate.

You can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. Let’s see how we can visualize the sampling distribution of \(\hat p\) for a given sample size and a value of the population proportion \(p\). The easiest approach is to simulate values from the so-called Bernoulli random variable.

A Bernoulli random variable has an associated success probability \(p\); we denote this random variable by \(Bernoulli(p)\). Each simulated value from a \(Bernoulli(p)\) random variable, is either 1 (success) with probability \(p\) or 0 (failure) with probability \(1-p\). Say we would like to investigate the sampling distribution of the sample proportion \(\hat p\), for a sample of size \(n=300\) and population probability of success \(p = 0.1\). Let’s begin by simulating one sample of size 300.

In the Probability-Simulation toolbox, select Probability \(\rightarrow\) Multiple Distribution Generator. In the Multiple Distribution Random Generator dialog that opens click plus to define your simulation. Change the name Sample_1 to Bern (this name can be anything, as long as it has no spaces). The Multiple Distribution Generator function allows simulating from various types of random variables. For our example, select Bernoulli from the Distribution dropdown, set the Prob. of Success to 0.1, and set the Sample Size to 300. Here we have kept the default Seed value of 100, but you should change it to another number of your choice. Click the Preview button eye to see the result.

Generating a sample of size 300 from Bernoulli(p=0.1)

You should have 300 rows of data (by default 100 is shown), with most of the values (approximately 90%) being 0, and the others (approximately 10%) being 1. Thus, we have simulated one sample of size 300 from a population with a proportion of successes 0.1.

Now click the Simulate button to reopen the Multiple Distribution Random Generator dialog. Leave everything as is, except change the value of Replications from 1 to 5, and click the Preview button eye . This time you should see five columns labeled Bern_1, Bern_2 … Bern_5 with each column being a sample of size 300 from \(Bernoulli(p=0.1)\).

Now let’s have Rguroo compute the sample proportions \(\hat p\) for each of the samples. Again, click the Simulate button to reopen the Multiple Distribution Random Generator dialog. Click the Custom Statistic button at the bottom of the dialog to open the Custom Statistic dialog. Click plus and change the name Statistic_1 to p_hat, and in the text area to the right type in the following R code:

sum(x) / length(x)

The command sum(x) adds all the values in each sample (in our example the zeros and ones), thus its result is the total number of successes for each sample. The command length(x) counts the number of values in each sample, which is the sample size. Dividing these two quantities give the sample proportion of successes. Click the Preview button eye to see the result.

Computing the sample proportion by creating a Custom Statistic

The result is five values, under the variable name Bern_p_hat that are the proportions of successes (or 1’s) for each of the five samples that we simulated. What are these values close to?

Now that we know how to simulate sample proportions \(\hat p\), we can investigate its sampling distribution for various values of \(n\) and \(p\) by simulating a large number of \(\hat p\)’s.

Click Simulate to reopen the dialog, increase the number of Replications to 1000, Preview eye , and Save the resulting dataset as p_hat_n300. It’s best to Save the process as well; call it simulating_p_hat. When you save the process, you can come back to the dialog box and change values of Sample Size (\(n\)) and Prob. of Success (\(p\)) to see other sampling distributions.

By right-clicking on the dataset name p_hat_n300 and selecting Dataset Summary, you can see the mean and standard deviation of the variable Bern_p_hat. Compare these values to the values of \(p\) and \(SE = \sqrt{p(1-p)/n}\).

To see the shape of the distribution, in the Plots toolbox select Create Plot \(\rightarrow\) Dotplot. In the Dotplot dialog select the Dataset p_hat_n300 and in the Numerical Variables section of the dialog move the variable name Bern_p_hat to the Selected column. Click the Preview button eye to see the distribution of \(\hat p\) for \(n=300\) and \(p=10\).

Creating a dotplot of the simulated p-hat values

Describe the sampling distribution of sample proportions at \(n = 300\) and \(p = 0.1\). Be sure to note the center, spread, and shape.
Keep \(n=300\) constant and change \(p\). How does the shape, center, and spread of the sampling distribution vary as \(p\) changes? For example try the following values for \(p\): 0.005, 0.01, 0.2, 0.5, 0.7, 0.99, and 0.995. Hint: Go back to the simulating_p_hat simulation that you saved to the Probability-Simulation toolbox. For each of the above values of \(p\), click and create a new Bernoulli Distribution variable with Sample Size equal to 300, 1000 Replications, and Prob. of Success equal to the value of \(p\). Make sure to click Custom Statistic and add the p_hat variable (= sum(x)/length(x)) before clicking the button to add the next simulation. If you have added all seven simulations correctly, when you Preview the result, you should see eight columns of data, each corresponding to one of the probability values.

How to see all plots simultaneously: If you want to see all of the distributions at the same time, use the option Stacked with Sample IDs in the Multiple Distribution Random Generator dialog box, and plot the values by the Factor Variable ID in the Dotplot function.

Now also change \(n\). How does \(n\) appear to affect the distribution of \(\hat{p}\)? For example, keep Prob. of Success \(p = 0.1\) and try the following values of \(n\) (Sample Size): 10, 50, 100, 300, and 1000. Again, if you open up the original simulating_p_hat simulation and add all five simulations correctly, when you Preview the result, you should see six columns of data, each corresponding to one of the sample size values.

Earlier in the lab, we performed a hypothesis test to determine if those who sleep 10+ hours per day are more likely to strength train every day of the week. Before we perform inference about the difference of population proportions, we should investigate the assumptions in each population separately. However, in this case, we do not know or hypothesize a value for \(p\), so instead we use an estimate of \(p\) to check our assumptions. A natural estimate of the value of \(p\) is the pooled proportion of success; that is, the overall proportion of success in the sample looking at both groups combined.

What is the pooled proportion of people in the sample who strength train every day (looking at both those who sleep 10+ hours and those who do not)? Use this estimate of \(p\) to investigate our assumptions in the sample of those who sleep 10+ hours and the sample of those who do not. Is the sampling distribution of the sample proportion \(\hat{p}\) approximately normal in each group? Hint: The sample sizes for the two groups are not equal.

More Practice

To obtain a confidence interval for the difference in the proportion of people who strength train every day between those who sleep 10+ hours and those who do not, we can select the Confidence Interval tab and check the method Bootstrap (Percentile).

Obtaining a bootstrap confidence interval for difference of two proportions

Report, and write a sentence interpreting, a 95% confidence interval for the difference in population proportions of people who strength train every day of the week between those who sleep 10+ hours per day and those who do not.
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between \(p\) and margin of error. This question does not require using a dataset.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Rguroo.com, the Rguroo.com logo, and all other trademarks, service marks, graphics and logos used in connection with Rguroo.com or the Website are trademarks or registered trademarks of Soflytics Corp. in the USA and other countries and are not included under the CC-BY-SA license.