Tutorial illustration

Tutorial description

When your dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. In this tutorial, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual? By the end of the tutorial, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.

Learning objectives

Lessons

1 - Visualizing categorical data

Creating graphical and numerical summaries of two categorical variables, primarily using two R packages: ggplot2 and dplyr.

  • Graphical representation of two categorical variables
    • Side-by-side bar charts
    • Stacked bar charts
    • To normalize or not to normalize
  • Tabular representation of two categorical variables
    • Computation of margins
    • Counts vs proportions
    • Law of total probability
  • Graphical representation of one categorical variable
    • Marginal vs conditional
    • Bar chart
      • ordering
      • data integrity check for levels
    • Pie chart

2 - Visualizing numerical data

Learn useful statistics for describing distributions of data.

  • Graphical representation of one categorical and one numerical variable
  • Side-by-side boxplots
  • Faceted histograms
  • Colored density curves
  • Graphical representation of one numerical variable
  • Marginal vs conditioning
  • Histogram
  • binwidth
  • Density plot
  • bandwidth
  • Boxplot
  • outlier detection

3 - Summarizing with statistics

Statistics for describing distributions of data.

  • Center: mean, median, mode
  • Shape: skewness, modality
  • Spread: range, IQR, SD, variance
  • Unusual observations
  • Transformations: Logarithm and sqrt to reduce skew in graphics and ease comparisons.

4 - Case study

Apply what you’ve learned to explore and summarize a real world dataset in this case study of email spam.

Additional references

Instructor

Andrew Bray

Andrew Bray

Reed College

Andrew Bray is an Assistant Professor of Statistics at Reed College and lover of all things statistics and R.