A dataset on Delta variant Covid-19 cases in the UK. This dataset is a good example of Simpson's Paradox. When results are aggregated without regard to age group, the death rate is higher for vaccinated individuals, but the vaccinated group also contains a much larger share of higher-risk people. Once we compare populations with more similar risk by breaking the cases out by age group, the vaccinated group has the lower death rate within each age group, and it becomes clear that many of the higher-risk patients had been vaccinated. The dataset was brought to OpenIntro's attention by Matthew T. Brenneman of Embry-Riddle Aeronautical University. Note: some totals in the original source differ because some cases had no age associated with them.
Format
A data frame with 286,166 rows and 3 variables:
- age_group
Age of the person. Levels: under 50, 50 +.
- vaccine_status
Vaccination status of the person. Note: the vaccinated group includes those who were only partially vaccinated. Levels: vaccinated, unvaccinated.
- outcome
Did the person die from the Delta variant? Levels: death and survived.
Examples
library(dplyr)
library(scales)
# Calculate the mortality rate for all cases by vaccination status
simpsons_paradox_covid |>
  group_by(vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))
#> `summarise()` has grouped output by 'vaccine_status'. You can override using
#> the `.groups` argument.
#> # A tibble: 2 × 2
#> # Groups:   vaccine_status [2]
#>   vaccine_status mortality_rate
#>   <chr>          <chr>
#> 1 unvaccinated   0.17%
#> 2 vaccinated     0.41%
# Calculate mortality rate by age group and vaccination status
simpsons_paradox_covid |>
  group_by(age_group, vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(age_group, vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(age_group, vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))
#> `summarise()` has grouped output by 'age_group', 'vaccine_status'. You can
#> override using the `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   age_group, vaccine_status [4]
#>   age_group vaccine_status mortality_rate
#>   <chr>     <chr>          <chr>
#> 1 50 +      unvaccinated   5.96%
#> 2 50 +      vaccinated     1.68%
#> 3 under 50  unvaccinated   0.03%
#> 4 under 50  vaccinated     0.02%
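The reversal between the two tables above comes from the age mix of each vaccination group: the aggregate mortality rate for a group is the average of its age-specific rates, weighted by the share of that group's cases in each age group. The following is a minimal sketch of that decomposition, not part of the original examples. It assumes the data frame is available from the openintro package; the names by_age, recombined, age_share, weighted_rate and direct_rate are illustrative only.

```r
library(dplyr)
library(openintro) # assumption: simpsons_paradox_covid is loaded from this package

# Cases, deaths, mortality rate and age mix within each vaccination status
by_age <- simpsons_paradox_covid |>
  group_by(vaccine_status, age_group) |>
  summarise(
    cases = n(),
    deaths = sum(outcome == "death"),
    .groups = "drop"
  ) |>
  group_by(vaccine_status) |>
  mutate(
    rate = deaths / cases,          # age-specific mortality rate
    age_share = cases / sum(cases)  # share of this status's cases in the age group
  ) |>
  ungroup()

# Rebuild the aggregate rate as a weighted average of the age-specific rates
# and compare it with the rate computed directly from the totals
recombined <- by_age |>
  group_by(vaccine_status) |>
  summarise(
    weighted_rate = sum(age_share * rate),
    direct_rate = sum(deaths) / sum(cases)
  )

by_age
recombined
```

Comparing the age_share column across the two vaccination statuses shows how differently the cases are distributed over the age groups, which is what lets the aggregate comparison point the opposite way from the within-age comparisons.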