Getting Started

The data

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the yrbss data set, which can be found here into your workspace.

There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by visiting this page.

  1. What are the cases in this data set? How many cases are there in our sample?

Exploratory data analysis

You will first start with analyzing the weight of the participants in kilograms: weight.

Using visualization and summary statistics, describe the distribution of weights.

  1. How many observations are we missing weights from?

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

First, let’s create a new variable physical_3plus, which will be coded as either “yes” if the student is physically active for at least 3 days a week (physically_active_7d>2), and “no” if not. (Make sure the variable is set to be nominal.)

  1. Make a side-by-side violin plots of physical_3plus and weight. (In the plots pull down in Descriptives.) Is there a relationship between these two variables? What did you expect and why?

Box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions. In a descriptive statistics table, determine the mean value for weight for each group in the physical_3plus variable.

There is an observed difference, but is this difference large enough to deem it “statistically significant”? In order to answer this question we will conduct a hypothesis test.

Inference

  1. Are all conditions necessary for inference satisfied? Comment on each. You can determine the group sizes in the descriptive statistics table.

  2. Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.

Open the distrACTION menu (click the plus sign and check the box if you don’t see the menu at the top), and select T Distribution. For degrees of freedom, we need to enter 2 less than the total number of valid observations. Since some individuals are missing values, we can’t just use the total number of observations. Use a descriptives table with both weight and physical_3plus and determine the total number of valid observations, then subtract 2 from that value. Enter this in the df box in the T-Distribution menu.

The plot you see shows how likely a difference of means would be under the null hypothesis. How does the value we found for the actual difference compare to the values in the plot? (You might need to change the range of x values to better suit your comparison.)

In the options, click to highlight Compute probability, then replace \(x_1\) with your observed difference. (The choice of \(P(X \leq x_1)\) is what we want to keep.) You will see a (rounded) value show up, and the area will be highlighted (if enough is visible to be highlighted). (This is actually half of the p-value, since we did not specify a direction, and so are doing a t-tailed test.)

While we were able to determine an approximate p-value using this package, jamovi will let us calculate a more precise value directly. Open the T-Tests menu, then select Independent Samples T-Test. Select weight as your dependent variable and physical_3plus as your Grouping Variable. This will create a table on the right hand side.

  1. What is the value in the t column?

  2. What is the p-value? How does this compare to the value from the t-distribution graph?

Next we can add a confidence interval to our table. Check the Mean difference box, and then the Confidence interval box, and leave the value at 95.0. You will now see the values for the confidence interval in your table.

  1. Record the confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.

More Practice

  1. Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

  2. Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

  3. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

  4. Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

  5. Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.